Bhawna Piryani

2026

Question Difficulty Estimation for Large Language Models via Answer Plausibility Scoring
Jamshid Mozafari | Bhawna Piryani | Adam Jatowt
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Estimating question difficulty is a critical component in evaluating and improving large language models (LLMs) for question answering (QA). Existing approaches often rely on readability formulas, retrieval-based signals, or popularity statistics, which may not fully capture the reasoning challenges posed to modern LLMs. In this paper, we introduce the Q-DAPS (Question Difficulty based on Answer Plausibility Scores) method, a novel approach that estimates question difficulty by computing the entropy of plausibility scores over candidate answers. We systematically evaluate Q-DAPS across four prominent QA datasets (TriviaQA, NQ, MuSiQue, and QASC), demonstrating that it consistently outperforms baselines. Moreover, Q-DAPS shows strong robustness across hyperparameter variations and question types. Extensive ablation studies further show that Q-DAPS remains robust across different plausibility estimation paradigms, model sizes, and realistic settings. Human evaluations confirm strong alignment between Q-DAPS’s difficulty estimates and human judgments of question difficulty. Overall, Q-DAPS provides an interpretable, scalable, and bias-resilient approach to question difficulty estimation in modern QA systems.
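
The abstract above describes difficulty as the entropy of plausibility scores over candidate answers. A minimal sketch of that idea follows; it is not the authors' implementation. The function name, the sum-to-one normalisation, the base-2 logarithm, and the example scores are all assumptions made for illustration, and the reading that higher entropy indicates a harder question is likewise an assumption not stated in the abstract.

```python
import math


def plausibility_entropy(scores):
    """Shannon entropy (in bits) of non-negative scores normalised to sum to 1.

    Hypothetical helper: the normalisation and log base are assumptions,
    not details taken from the Q-DAPS paper.
    """
    if any(s < 0 for s in scores):
        raise ValueError("scores must be non-negative")
    total = sum(scores)
    if total <= 0:
        raise ValueError("at least one score must be positive")
    # Zero-probability candidates contribute nothing to the entropy.
    probs = [s / total for s in scores if s > 0]
    return -sum(p * math.log2(p) for p in probs)


# Made-up plausibility scores for four candidate answers.
peaked = plausibility_entropy([0.97, 0.01, 0.01, 0.01])   # one clearly plausible answer
uniform = plausibility_entropy([0.25, 0.25, 0.25, 0.25])  # all candidates equally plausible
```

Under these assumptions, the uniform case reaches the maximum entropy for four candidates (2 bits), while the peaked case scores much lower.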

BracketRank: Large Language Model Document Ranking via Reasoning-based Competitive Elimination
Abdelrahman Abdallah | Mohammed Ali | Bhawna Piryani | Adam Jatowt
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Although Large Language Models (LLMs) show strong potential for zero-shot document ranking, current listwise approaches face three critical limitations. First, context length constraints prevent processing many documents simultaneously. Second, sequential generation creates bottlenecks that cannot run in parallel. Third, ranking results depend heavily on initial document order, leading to inconsistent performance. We introduce BracketRank, a reasoning-driven competitive elimination framework that addresses these challenges through systematic group competition. Our method uses adaptive grouping to automatically optimise group sizes based on LLM context limits, reasoning-enhanced prompts that require explicit relevance explanations, and bracket-style elimination where documents compete through winner and loser brackets. This structure ensures every document has fair advancement opportunities regardless of initial positioning while allowing for parallel processing across competition stages. We evaluate BracketRank on TREC DL 19, TREC DL 20, and eight BEIR benchmark datasets. Results show that BracketRank achieves 77.90 NDCG@5 on TREC DL 19 and 75.85 NDCG@5 on TREC DL 20, outperforming RankGPT and other state-of-the-art methods. On BEIR datasets, BracketRank reaches 54.66 average NDCG@10, demonstrating robust performance across diverse domains.

It’s High Time: A Survey of Temporal Question Answering
Bhawna Piryani | Abdelrahman Abdallah | Jamshid Mozafari | Avishek Anand | Adam Jatowt
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Time plays a critical role in how information is generated, retrieved, and interpreted. In this survey, we provide a comprehensive overview of Temporal Question Answering (TQA), a research area that focuses on answering questions involving temporal constraints or context. As time-stamped content from sources like news articles, web archives, and knowledge bases continues to grow, TQA systems must address challenges such as detecting temporal intent, normalizing time expressions, ordering events, and reasoning over evolving or ambiguous facts. We organize existing work through a unified perspective that captures the interaction between corpus temporality, question temporality, and model capabilities, enabling a systematic comparison of datasets, tasks, and approaches. We review recent advances in TQA enabled by neural architectures, especially transformer-based models and Large Language Models (LLMs), highlighting progress in temporal language modeling, retrieval-augmented generation (RAG), and temporal reasoning. We also discuss benchmark datasets and evaluation strategies designed to test temporal robustness, recency awareness, and generalization.

Rankify: A Comprehensive Python Toolkit for Retrieval, Re-Ranking, and Retrieval-Augmented Generation
Abdelrahman Abdallah | Bhawna Piryani | Jamshid Mozafari | Andreas Herzinger | Jamie Holdcroft | Adam Jatowt
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)

Building retrieval-augmented generation (RAG) systems often requires combining separate tools for retrieval, re-ranking, and generation, with incompatible data formats, evaluation pipelines, and deployment workflows. We present Rankify, an open-source Python toolkit that unifies these stages in a single modular framework (PyPI: <https://pypi.org/project/rankify/>; GitHub: <https://github.com/DataScienceUIBK/Rankify>; Docs: <https://rankify.readthedocs.io>; Video: <https://youtu.be/kkLzomrM2ec>). Rankify provides 42 benchmark datasets with pre-retrieved documents and pre-built indices, 15 retrievers (sparse, dense, and reasoning-augmented), and 24 re-ranking models spanning 41 pointwise, pairwise, and listwise variants. It also supports 6 RAG strategies across four inference backends (Hugging Face, vLLM, LiteLLM, and OpenAI), enabling consistent experimentation from local models to hosted APIs. A unified pipeline interface allows users to compose retrieve–rerank–generate workflows in a few lines of code, while an agentic assistant (RankifyAgent), a REST server (RankifyServer), and an interactive web playground support deployment and non-programmatic exploration. Across 200+ configurations on QA and BEIR/TREC benchmarks with six generator LLMs, re-ranking consistently improves downstream performance, yielding gains of 5–15 points in Exact Match and up to 8.5 points in RAGAS context precision across diverse retriever–generator combinations.

2025

ASRank: Zero-Shot Re-Ranking with Answer Scent for Document Retrieval
Abdelrahman Abdallah | Jamshid Mozafari | Bhawna Piryani | Adam Jatowt
Findings of the Association for Computational Linguistics: NAACL 2025

Retrieval-Augmented Generation (RAG) models have drawn considerable attention in modern open-domain question answering. The effectiveness of RAG depends on the quality of the top retrieved documents. However, conventional retrieval methods sometimes fail to rank the most relevant documents at the top. In this paper, we introduce ASRANK, a new re-ranking method that scores retrieved documents using a zero-shot answer scent, relying on a pre-trained large language model to compute the likelihood of the document-derived answers aligning with the answer scent. Our approach demonstrates marked improvements across several datasets, including NQ, TriviaQA, WebQA, ArchivalQA, HotpotQA, and Entity Questions. Notably, ASRANK increases Top-1 retrieval accuracy on NQ from 19.2% to 46.5% for MSS and from 22.1% to 47.3% for BM25. It also shows strong retrieval performance on several datasets compared to state-of-the-art methods (47.3 Top-1 by ASRANK vs 35.4 by UPR, both with BM25).

DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation
Abdelrahman Abdallah | Jamshid Mozafari | Bhawna Piryani | Adam Jatowt
Findings of the Association for Computational Linguistics: EMNLP 2025

Large Language Models (LLMs) have transformed listwise document reranking by enabling global reasoning over candidate sets, yet single models often struggle to balance fine-grained relevance scoring with holistic cross-document analysis. We propose DeepAgentRank (DeAR), an open-source framework that decouples these tasks through a dual-stage approach, achieving superior accuracy and interpretability. In Stage 1, we distill token-level relevance signals from a frozen 13B LLaMA teacher into a compact 3B or 8B student model using a hybrid of cross-entropy, RankNet, and KL divergence losses, ensuring robust pointwise scoring. In Stage 2, we attach a second LoRA adapter and fine-tune on 20K GPT-4o-generated chain-of-thought permutations, enabling listwise reasoning with natural-language justifications. Evaluated on TREC-DL19/20, eight BEIR datasets, and NovelEval-2306, DeAR surpasses open-source baselines by +5.1 nDCG@5 on DL20 and achieves 90.97 nDCG@10 on NovelEval, outperforming GPT-4 by +3.09. Without fine-tuning on Wikipedia, DeAR also excels in open-domain QA, achieving 54.29 Top-1 accuracy on Natural Questions, surpassing baselines like MonoT5, UPR, and RankGPT. Ablations confirm that dual-loss distillation ensures stable calibration, making DeAR a highly effective and interpretable solution for modern reranking systems.

How Good are LLM-based Rerankers? An Empirical Analysis of State-of-the-Art Reranking Models
Abdelrahman Abdallah | Bhawna Piryani | Jamshid Mozafari | Mohammed Ali | Adam Jatowt
Findings of the Association for Computational Linguistics: EMNLP 2025

In this work, we present a systematic and comprehensive empirical evaluation of state-of-the-art reranking methods, encompassing large language model (LLM)-based, lightweight contextual, and zero-shot approaches, with respect to their performance in information retrieval tasks. We evaluate 22 methods in total, comprising 40 variants (depending on the LLM used), across several established benchmarks, including TREC DL19, DL20, and BEIR, as well as a novel dataset designed to test queries unseen by pretrained models. Our primary goal is to determine, through controlled and fair comparisons, whether a performance disparity exists between LLM-based rerankers and their lightweight counterparts, particularly on novel queries, and to elucidate the underlying causes of any observed differences. To disentangle confounding factors, we analyse the effects of training data overlap, model architecture, and computational efficiency on reranking performance. Our findings indicate that while LLM-based rerankers demonstrate superior performance on familiar queries, their generalisation ability to novel queries varies, with lightweight models offering comparable efficiency. We further identify that the novelty of queries significantly impacts reranking effectiveness, highlighting limitations in existing approaches.

DynRank: Improve Passage Retrieval with Dynamic Zero-Shot Prompting Based on Question Classification
Abdelrahman Abdallah | Jamshid Mozafari | Bhawna Piryani | Mohammed M. Abdelgwad | Adam Jatowt
Proceedings of the 31st International Conference on Computational Linguistics

This paper presents DynRank, a novel framework for enhancing passage retrieval in open-domain question-answering systems through dynamic zero-shot question classification. Traditional approaches rely on static prompts and pre-defined templates, which may limit model adaptability across different questions and contexts. In contrast, DynRank introduces a dynamic prompting mechanism, leveraging a pre-trained question classification model that categorizes questions into fine-grained types. Based on these classifications, contextually relevant prompts are generated, enabling more effective passage retrieval. We integrate DynRank into existing retrieval frameworks and conduct extensive experiments on multiple QA benchmark datasets.

2024

Detecting Temporal Ambiguity in Questions
Bhawna Piryani | Abdelrahman Abdallah | Jamshid Mozafari | Adam Jatowt
Findings of the Association for Computational Linguistics: EMNLP 2024

Detecting and answering ambiguous questions has been a challenging task in open-domain question answering. Ambiguous questions have different answers depending on their interpretation and can take diverse forms. Temporally ambiguous questions are one of the most common types of such questions. In this paper, we introduce TEMPAMBIQA, a manually annotated temporally ambiguous QA dataset consisting of 8,162 open-domain questions derived from existing datasets. Our annotations focus on capturing temporal ambiguity to study the task of detecting temporally ambiguous questions. We propose a novel approach that uses diverse search strategies based on disambiguated versions of the questions. We also introduce and test competitive non-search baselines for detecting temporal ambiguity using zero-shot and few-shot approaches.

Exploring Hint Generation Approaches for Open-Domain Question Answering
Jamshid Mozafari | Abdelrahman Abdallah | Bhawna Piryani | Adam Jatowt
Findings of the Association for Computational Linguistics: EMNLP 2024

Automatic Question Answering (QA) systems rely on contextual information to provide accurate answers. Commonly, contexts are prepared through either retrieval-based or generation-based methods. The former involves retrieving relevant documents from a corpus like Wikipedia, whereas the latter uses generative models such as Large Language Models (LLMs) to generate the context. In this paper, we introduce a novel context preparation approach called HINTQA, which employs Automatic Hint Generation (HG) techniques. Unlike traditional methods, HINTQA prompts LLMs to produce hints about potential answers for the question rather than generating relevant context. We evaluate our approach across three QA datasets (TriviaQA, Natural Questions, and Web Questions), examining how the number and order of hints impact performance. Our findings show that HINTQA surpasses both retrieval-based and generation-based approaches. We demonstrate that hints enhance the accuracy of answers more than retrieved and generated contexts.
