Yulong Li
2026
TAGS: A Test-Time Generalist–Specialist Framework with Retrieval-Augmented Reasoning and Verification
Jianghao Wu | Feilong Tang | Yulong Li | Ming Hu | Haochen Xue | Shoaib Jameel | Zongyuan Ge | Yutong Xie | Imran Razzak
Findings of the Association for Computational Linguistics: ACL 2026
Jianghao Wu | Feilong Tang | Yulong Li | Ming Hu | Haochen Xue | Shoaib Jameel | Zongyuan Ge | Yutong Xie | Imran Razzak
Findings of the Association for Computational Linguistics: ACL 2026
Recent advances such as Chain-of-Thought prompting have significantly improved large language models (LLMs) in zero-shot medical reasoning. However, prompting-based methods often remain shallow and unstable, while fine-tuned medical LLMs suffer from poor generalization under distribution shifts and limited adaptability to unseen clinical scenarios. To address these limitations, we present TAGS, a test-time framework that combines a broadly capable generalist with a domain-specific specialist to offer complementary perspectives without any model fine-tuning or parameter updates. To support this generalist–specialist reasoning process, we introduce two auxiliary modules: a hierarchical retrieval mechanism that provides multi-scale exemplars by selecting examples based on both semantic and rationale-level similarity, and a reliability scorer that evaluates reasoning consistency to guide final answer aggregation. TAGS achieves strong performance across nine MedQA benchmarks, boosting GPT-4o accuracy by 13.8%, DeepSeek-R1 by 16.8%, and improving a vanilla 7B model from 14.1% to 23.9%. These results surpass several fine-tuned medical LLMs, without any parameter updates.
2025
MMRC: A Large-Scale Benchmark for Understanding Multimodal Large Language Model in Real-World Conversation
Haochen Xue | Feilong Tang | Ming Hu | Yexin Liu | Qidong Huang | Yulong Li | Chengzhi Liu | Zhongxing Xu | Chong Zhang | Chun-Mei Feng | Yutong Xie | Imran Razzak | Zongyuan Ge | Jionglong Su | Junjun He | Yu Qiao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Haochen Xue | Feilong Tang | Ming Hu | Yexin Liu | Qidong Huang | Yulong Li | Chengzhi Liu | Zhongxing Xu | Chong Zhang | Chun-Mei Feng | Yutong Xie | Imran Razzak | Zongyuan Ge | Jionglong Su | Junjun He | Yu Qiao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent multimodal large language models (MLLMs) have demonstrated significant potential in open-ended conversation, generating more accurate and personalized responses. However, their abilities to memorize, recall, and reason in sustained interactions within real-world scenarios remain underexplored. This paper introduces MMRC, a Multi-Modal Real-world Conversation benchmark for evaluating six core open-ended abilities of MLLMs: information extraction, multi-turn reasoning, information update, image management, memory recall, and answer refusal. With data collected from real-world scenarios, MMRC comprises 5,120 conversations and 28,720 corresponding manually labeled questions, posing a significant challenge to existing MLLMs. Evaluations on 20 MLLMs in MMRC indicate an accuracy drop during open-ended interactions. We identify four common failure patterns: long-term memory degradation, inadequacies in updating factual knowledge, accumulated assumption of error propagation, and reluctance to “say no.” To mitigate these issues, we propose a simple yet effective NOTE-TAKING strategy, which can record key information from the conversation and remind the model during its responses, enhancing conversational capabilities. Experiments across six MLLMs demonstrate significant performance improvements.
2023
PrimeQA: The Prime Repository for State-of-the-Art Multilingual Question Answering Research and Development
Avi Sil | Jaydeep Sen | Bhavani Iyer | Martin Franz | Kshitij Fadnis | Mihaela Bornea | Sara Rosenthal | Scott McCarley | Rong Zhang | Vishwajeet Kumar | Yulong Li | Md Arafat Sultan | Riyaz Bhat | Juergen Bross | Radu Florian | Salim Roukos
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Avi Sil | Jaydeep Sen | Bhavani Iyer | Martin Franz | Kshitij Fadnis | Mihaela Bornea | Sara Rosenthal | Scott McCarley | Rong Zhang | Vishwajeet Kumar | Yulong Li | Md Arafat Sultan | Riyaz Bhat | Juergen Bross | Radu Florian | Salim Roukos
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
The field of Question Answering (QA) has made remarkable progress in recent years, thanks to the advent of large pre-trained language models, newer realistic benchmark datasets with leaderboards, and novel algorithms for key components such as retrievers and readers. In this paper, we introduce PrimeQA: a one-stop and open-source QA repository with an aim to democratize QA research and facilitate easy replication of state-of-the-art (SOTA) QA methods. PrimeQA supports core QA functionalities like retrieval and reading comprehension as well as auxiliary capabilities such as question generation. It has been designed as an end-to-end toolkit for various use cases: building front-end applications, replicating SOTA methods on public benchmarks, and expanding pre-existing methods. PrimeQA is available at: https://github.com/primeqa.
2022
Learning Cross-Lingual IR from an English Retriever
Yulong Li | Martin Franz | Md Arafat Sultan | Bhavani Iyer | Young-Suk Lee | Avirup Sil
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Yulong Li | Martin Franz | Md Arafat Sultan | Bhavani Iyer | Young-Suk Lee | Avirup Sil
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
We present DR.DECR (Dense Retrieval with Distillation-Enhanced Cross-Lingual Representation), a new cross-lingual information retrieval (CLIR) system trained using multi-stage knowledge distillation (KD). The teacher of DR.DECR relies on a highly effective but computationally expensive two-stage inference process consisting of query translation and monolingual IR, while the student, DR.DECR, executes a single CLIR step. We teach DR.DECR powerful multilingual representations as well as CLIR by optimizing two corresponding KD objectives. Learning useful representations of non-English text from an English-only retriever is accomplished through a cross-lingual token alignment algorithm that relies on the representation capabilities of the underlying multilingual encoders. In both in-domain and zero-shot out-of-domain evaluation, DR.DECR demonstrates far superior accuracy over direct fine-tuning with labeled CLIR data. It is also the best single-model retriever on the XOR-TyDi benchmark at the time of this writing.
Search
Fix author
Co-authors
- Martin Franz 2
- Zongyuan Ge 2
- Ming Hu 2
- Bhavani Iyer 2
- Imran Razzak 2
- Avirup Sil 2
- Md Arafat Sultan 2
- Feilong Tang 2
- Yutong Xie 2
- Haochen Xue 2
- Riyaz Bhat 1
- Mihaela Bornea 1
- Juergen Bross 1
- Kshitij Fadnis 1
- Chun-Mei Feng 1
- Radu Florian 1
- Junjun He 1
- Qidong Huang 1
- Shoaib Jameel 1
- Vishwajeet Kumar 1
- Young-Suk Lee 1
- Chengzhi Liu 1
- Yexin Liu 1
- J. Scott McCarley 1
- Yu Qiao 1
- Sara Rosenthal 1
- Salim Roukos 1
- Jaydeep Sen 1
- Jionglong Su 1
- Jianghao Wu 1
- Zhongxing Xu 1
- Chong Zhang 1
- Rong Zhang 1