Manuel Faysse
2026
ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
António Loison | Quentin Macé | Antoine Edy | Victor Xing | Tom Balough | Gabriel de Souza P. Moreira | Bo Liu | Manuel Faysse | Celine Hudelot | Gautier Viaud
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe V3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license.
2025
Context is Gold to find the Gold Passage: Evaluating and Training Contextual Document Embeddings
Max Conti | Manuel Faysse | Gautier Viaud | Antoine Bosselut | Celine Hudelot | Pierre Colombo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
A limitation of modern document retrieval embedding methods is that they typically encode passages (chunks) from the same document independently, often overlooking crucial contextual information from the rest of the document that could greatly improve individual chunk representations. In this work, we introduce ConTEB (Context-aware Text Embedding Benchmark), a benchmark designed to evaluate retrieval models on their ability to leverage document-wide context. Our results show that state-of-the-art embedding models struggle in retrieval scenarios where context is required. To address this limitation, we propose InSeNT (In-sequence Negative Training), a novel contrastive post-training approach that, combined with late chunking pooling, enhances contextual representation learning while preserving computational efficiency. Our method significantly improves retrieval quality on ConTEB without sacrificing base model performance. We further find that chunks embedded with our method are more robust to suboptimal chunking strategies and larger retrieval corpus sizes. We open-source all artifacts at https://github.com/illuin-tech/contextual-embeddings.
2023
Revisiting Instruction Fine-tuned Model Evaluation to Guide Industrial Applications
Manuel Faysse | Gautier Viaud | Céline Hudelot | Pierre Colombo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Instruction Fine-Tuning (IFT) is a powerful paradigm that strengthens the zero-shot capabilities of Large Language Models (LLMs), but in doing so induces new evaluation metric requirements. We show LLM-based metrics to be well adapted to these requirements, and leverage them to conduct an investigation of task-specialization strategies, quantifying the trade-offs that emerge in practical industrial settings. Our findings offer practitioners actionable insights for real-world IFT model deployment.