Mohammed Ali
2026
RECOR: Reasoning-focused Multi-turn Conversational Retrieval Benchmark
Mohammed Ali | Abdelrahman Abdallah | Amit Agarwal | Hitesh Laxmichand Patel | Adam Jatowt
Findings of the Association for Computational Linguistics: ACL 2026
Existing benchmarks treat multi-turn conversation and reasoning-intensive retrieval separately, yet real-world information seeking requires both. To bridge this gap, we present a benchmark for reasoning-based conversational information retrieval comprising 707 conversations (2,971 turns) across eleven domains. To ensure quality, our Decomposition-and-Verification framework transforms complex queries into fact-grounded multi-turn dialogues through multi-level validation, where atomic facts are verified against sources and explicit retrieval reasoning is generated for each turn. Comprehensive evaluation reveals that combining conversation history with reasoning roughly doubles retrieval performance (nDCG@10: Baseline 0.236 → History+Reasoning 0.479), while reasoning-specialized models substantially outperform dense encoders. Despite these gains, further analysis highlights that implicit reasoning remains challenging, particularly when logical connections are not explicitly stated in the text. https://github.com/RECOR-Benchmark/RECOR
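The abstracts on this page report results in nDCG@k. As a reading aid, here is a minimal Python sketch of the standard metric, not the authors' evaluation code. It assumes linear gain (some toolkits use 2^rel - 1 instead) and assumes that the input list holds the graded relevance of every judged document in ranked order, so the ideal ranking can be derived by sorting. The function names are illustrative.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    # Rank 1 is discounted by log2(2) = 1, rank 2 by log2(3), and so on.
    return sum(rel / math.log2(rank + 2)
               for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """nDCG@k: DCG of the given ranking divided by DCG of the ideal ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        # No relevant documents at all: define the score as 0.
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


if __name__ == "__main__":
    # Hypothetical ranking: the only relevant document sits at rank 2.
    print(ndcg_at_k([0, 1, 0], k=10))
```

A ranking that already places documents in decreasing order of relevance scores 1.0, and pushing a relevant document further down lowers the score logarithmically.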
BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination
Abdelrahman Abdallah | Mohammed Ali | Bhawna Piryani | Adam Jatowt
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Although Large Language Models (LLMs) show strong potential for zero-shot document ranking, current listwise approaches face three critical limitations. First, context length constraints prevent processing many documents simultaneously. Second, sequential generation creates bottlenecks that cannot run in parallel. Third, ranking results depend heavily on initial document order, leading to inconsistent performance. We introduce BracketRank, a reasoning-driven competitive elimination framework that addresses these challenges through systematic group competition. Our method uses adaptive grouping to automatically optimise group sizes based on LLM context limits, reasoning-enhanced prompts that require explicit relevance explanations, and bracket-style elimination where documents compete through winner and loser brackets. This structure ensures every document has fair advancement opportunities regardless of initial positioning while allowing for parallel processing across competition stages. We evaluate BracketRank on TREC DL 19, TREC DL 20, and eight BEIR benchmark datasets. Results show that BracketRank achieves 77.90 NDCG@5 on TREC DL 19 and 75.85 NDCG@5 on TREC DL 20, outperforming RankGPT and other state-of-the-art methods. On BEIR datasets, BracketRank reaches 54.66 average NDCG@10, demonstrating robust performance across diverse domains.
2025
How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
Abdelrahman Abdallah | Bhawna Piryani | Jamshid Mozafari | Mohammed Ali | Adam Jatowt
Findings of the Association for Computational Linguistics: EMNLP 2025
In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate 22 methods in total, spanning 40 variants (depending on the LLM used), across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyse the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalisation ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches.