João Coelho

2024

Dwell in the Beginning: How Language Models Embed Long Documents for Dense Retrieval
João Coelho | Bruno Martins | Joao Magalhaes | Jamie Callan | Chenyan Xiong
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

This study investigates the existence of positional biases in Transformer-based language models for text representation learning, particularly in the context of web document retrieval. We build on previous research that demonstrated loss of information in the middle of input sequences for causal language models, extending it to the domain of embedding learning. We examine positional biases at multiple stages of the training pipeline for an encoder-decoder neural retrieval model, namely language model pre-training, contrastive pre-training, and contrastive fine-tuning. Experiments with the MS-MARCO document collection reveal that after contrastive pre-training the model already generates embeddings that better capture the beginning of the input content, with fine-tuning further aggravating this effect.
