2025
Positional Bias in Long-Document Ranking: Impact, Assessment, and Mitigation
Leonid Boytsov | David Akinpelu | Nipun Katyal | Tianyi Lin | Fangwei Gao | Yutian Zhao | Jeffrey Huang | Eric Nyberg
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We tested over 20 Transformer models for ranking long documents (including recent LongP models trained with FlashAttention and RankGPT models “powered” by OpenAI and Anthropic cloud APIs). We compared them with the simple FirstP baseline, which applied the same model to truncated input (up to 512 tokens). On MS MARCO, TREC DL, and Robust04, no long-document model outperformed FirstP by more than 5% (on average). We hypothesized that this lack of improvement is not due to inherent model limitations, but due to benchmark positional bias (most relevant passages tend to occur early in documents), which is known to exist in MS MARCO. To confirm this, we analyzed positional relevance distributions across four long-document corpora (with six query sets) and observed the same early-position bias. Surprisingly, we also found bias in six BEIR collections, which are typically categorized as short-document datasets. We then introduced a new diagnostic dataset, MS MARCO FarRelevant, where relevant spans were deliberately placed beyond the first 512 tokens. On this dataset, many long-context models, including RankGPT, performed at random-baseline level, suggesting overfitting to positional bias. We also experimented with debiasing training data, but with limited success. Our findings (1) highlight the need for careful benchmark design in evaluating long-context models for document ranking, (2) identify model types that are more robust to positional bias, and (3) motivate further work on approaches to debias training data. We release our code and data to support further research.
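The FirstP baseline and the positional-bias effect described in the abstract can be illustrated with a minimal sketch. This is a toy example, not the paper's code: the token lists, span positions, and the 512-token limit are illustrative, and the fraction computed is simply how often a relevant span falls entirely outside the FirstP window.

```python
def first_p(tokens, limit=512):
    """FirstP baseline: keep only the first `limit` tokens of a document."""
    return tokens[:limit]

def relevant_span_beyond_limit(span_start, limit=512):
    """True if the relevant span starts past the FirstP window,
    i.e. a truncating model cannot see it at all."""
    return span_start >= limit

# A toy corpus with early-position bias: most relevant spans start well
# before token 512, so FirstP loses little by truncating.
span_starts = [10, 45, 120, 30, 600, 75]
hidden = sum(relevant_span_beyond_limit(s) for s in span_starts)
print(hidden / len(span_starts))  # fraction of documents FirstP cannot rank correctly
```

On a dataset like MS MARCO FarRelevant, where spans are deliberately placed beyond token 512, this fraction would be close to 1.0, which is why FirstP (and models overfit to early positions) degrade to random-baseline level there.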
Constrained Decoding with Speculative Lookaheads
Nishanth Sridhar Nakshatri | Shamik Roy | Rajarshi Das | Suthee Chaidaroon | Leonid Boytsov | Rashmi Gangadharaiah
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Constrained decoding with lookahead heuristics (CDLH) is a highly effective method for aligning LLM generations with human preferences. However, the extensive lookahead roll-out operations for each generated token make CDLH prohibitively expensive, resulting in low adoption in practice. In contrast, common decoding strategies such as greedy decoding are extremely efficient but achieve very low constraint satisfaction. We propose constrained decoding with speculative lookaheads (CDSL), a technique that significantly improves upon the inference efficiency of CDLH without the drastic performance reduction seen with greedy decoding. CDSL is motivated by the recently proposed idea of speculative decoding, which uses a much smaller draft LLM for generation and a larger target LLM for verification. In CDSL, the draft model is used to generate lookaheads, which are verified by a combination of the target LLM and task-specific reward functions. This process accelerates decoding by reducing the computational burden while maintaining strong performance. We evaluate CDSL on two constrained decoding tasks with three LLM families and achieve 2.2x to 12.15x speedups over CDLH without significant performance reduction.
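The draft-proposes / target-verifies loop sketched in the abstract can be illustrated with a toy step function. This is a hypothetical sketch, not the paper's implementation: `draft_next` and `target_accepts` stand in for the draft LLM and the target-LLM-plus-reward verifier, and the accept-longest-agreeing-prefix rule is a simplification.

```python
def cdsl_step(prefix, draft_next, target_accepts, k=4):
    """One toy CDSL step: the cheap draft model proposes a k-token
    lookahead; the expensive verifier (target LLM + task reward)
    then accepts the longest prefix of it that passes."""
    lookahead = []
    state = list(prefix)
    for _ in range(k):
        tok = draft_next(state)
        lookahead.append(tok)
        state.append(tok)
    accepted = []
    for tok in lookahead:
        if target_accepts(prefix + accepted, tok):
            accepted.append(tok)
        else:
            break  # first rejection ends the accepted prefix
    return accepted

# Toy models: the draft proposes letters a, b, c, ...; the verifier rejects 'c'.
draft = lambda state: chr(ord('a') + len(state))
target = lambda prefix, tok: tok != 'c'
print(cdsl_step([], draft, target))  # ['a', 'b']
```

The speedup comes from the verifier checking k draft tokens per expensive call instead of rolling out lookaheads from the large model for every single generated token.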
2024
KazQAD: Kazakh Open-Domain Question Answering Dataset
Rustem Yeshpanov | Pavel Efimov | Leonid Boytsov | Ardak Shalkarbayuli | Pavel Braslavski
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
We introduce KazQAD, a Kazakh open-domain question answering (ODQA) dataset that can be used in both reading comprehension and full ODQA settings, as well as for information retrieval experiments. KazQAD contains just under 6,000 unique questions with extracted short answers and nearly 12,000 passage-level relevance judgements. We use a combination of machine translation, Wikipedia search, and in-house manual annotation to ensure annotation efficiency and data quality. The questions come from two sources: translated items from the Natural Questions (NQ) dataset (only for training) and the original Kazakh Unified National Testing (UNT) exam (for development and testing). The accompanying text corpus contains more than 800,000 passages from the Kazakh Wikipedia. As a supplementary dataset, we release around 61,000 question-passage-answer triples from the NQ dataset that have been machine-translated into Kazakh. We develop baseline retrievers and readers that achieve reasonable scores in retrieval (NDCG@10 = 0.389, MRR = 0.382), reading comprehension (EM = 38.5, F1 = 54.2), and full ODQA (EM = 17.8, F1 = 28.7) settings. Nevertheless, these results are substantially lower than state-of-the-art results for English QA collections, and we believe there is still ample room for improvement. We also show that OpenAI’s ChatGPT (v3.5) is not able to answer KazQAD test questions in the closed-book setting with acceptable quality. The dataset is freely available under the Creative Commons licence (CC BY-SA) at https://github.com/IS2AI/KazQAD
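The EM and F1 numbers reported above are standard SQuAD-style reading-comprehension metrics. A simplified sketch (lowercasing and whitespace splitting only; real evaluation scripts also normalize punctuation and articles) is:

```python
from collections import Counter

def exact_match(pred, gold):
    """Exact-match score, case- and whitespace-insensitive (simplified)."""
    return int(pred.strip().lower() == gold.strip().lower())

def token_f1(pred, gold):
    """Token-level F1 between predicted and gold answers (simplified)."""
    p, g = pred.lower().split(), gold.lower().split()
    common = Counter(p) & Counter(g)  # multiset intersection of tokens
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Astana", "astana"))                      # 1
print(round(token_f1("the capital Astana", "Astana"), 2))   # 0.5
```

EM rewards only verbatim answers, while token F1 gives partial credit for overlapping tokens, which is why the F1 figures above (e.g., 54.2) exceed the EM figures (38.5).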
2020
Flexible retrieval with NMSLIB and FlexNeuART
Leonid Boytsov | Eric Nyberg
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)
Our objective is to introduce the NLP community to NMSLIB, to describe a new retrieval toolkit, FlexNeuART, and to present their integration capabilities. NMSLIB, while being one of the fastest k-NN search libraries, is quite generic and supports a variety of distance/similarity functions. Because the library relies on distance-based, structure-agnostic algorithms, it can be further extended by adding new distances. FlexNeuART is a modular, extensible, and flexible toolkit for candidate generation in IR and QA applications, which supports mixing classic and neural ranking signals. FlexNeuART can efficiently retrieve mixed dense and sparse representations (with weights learned from training data), which is achieved by extending NMSLIB. In contrast, other retrieval systems work with purely sparse representations (e.g., Lucene), purely dense representations (e.g., FAISS and Annoy), or only perform mixing at the re-ranking stage.
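The mixed dense-and-sparse scoring described above can be sketched in a few lines. This is an illustrative toy, not the FlexNeuART or NMSLIB API: sparse vectors are term-to-weight dicts, dense vectors are plain lists, and the mixing weights, which the toolkit would learn from training data, are fixed constants here.

```python
def sparse_dot(q, d):
    """Inner product of two sparse vectors stored as term -> weight dicts."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())

def dense_dot(q, d):
    """Inner product of two dense vectors of equal length."""
    return sum(a * b for a, b in zip(q, d))

def mixed_score(query, doc, w_sparse=0.7, w_dense=0.3):
    """Combined similarity over a mixed sparse+dense representation.
    In the toolkit the weights are learned; they are fixed here for illustration."""
    return (w_sparse * sparse_dot(query["sparse"], doc["sparse"])
            + w_dense * dense_dot(query["dense"], doc["dense"]))

query = {"sparse": {"neural": 1.0, "ranking": 0.5}, "dense": [0.1, 0.9]}
doc = {"sparse": {"ranking": 2.0}, "dense": [0.2, 0.4]}
print(round(mixed_score(query, doc), 3))  # 0.814
```

Computing this combined similarity inside the index, rather than fusing scores from separate sparse and dense systems at re-ranking time, is the integration point between FlexNeuART and NMSLIB that the abstract highlights.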
2014
Metaphor Detection with Cross-Lingual Model Transfer
Yulia Tsvetkov | Leonid Boytsov | Anatole Gershman | Eric Nyberg | Chris Dyer
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)