Hervé Déjean
Also published as: Herve Dejean, H. Dejean
2025
PISCO: Pretty Simple Compression for Retrieval-Augmented Generation
Maxime Louis | Hervé Déjean | Stéphane Clinchant
Findings of the Association for Computational Linguistics: ACL 2025
Retrieval-Augmented Generation (RAG) pipelines enhance Large Language Models (LLMs) by retrieving relevant documents, but they face scalability issues due to high inference costs and limited context size. Document compression is a practical solution, but current soft compression methods often suffer from accuracy losses and require extensive pretraining. In this paper, we introduce PISCO, a novel method that achieves a 16x compression rate with minimal accuracy loss (0-3%) across diverse RAG-based question-answering (QA) tasks. Unlike existing approaches, PISCO requires no pretraining or annotated data, relying solely on sequence-level knowledge distillation from document-based questions. With the ability to fine-tune a 7-10B LLM in 24 hours on a single A100 GPU, PISCO offers a highly efficient and scalable solution. We present comprehensive experiments showing that PISCO outperforms existing compression models by 8% in accuracy.
2024
BERGEN: A Benchmarking Library for Retrieval-Augmented Generation
David Rau | Hervé Déjean | Nadezhda Chirkova | Thibault Formal | Shuai Wang | Stéphane Clinchant | Vassilina Nikoulina
Findings of the Association for Computational Linguistics: EMNLP 2024
Retrieval-Augmented Generation makes it possible to enhance Large Language Models with external knowledge. In response to the recent popularity of generative LLMs, many RAG approaches have been proposed, involving an intricate combination of different configurations such as evaluation datasets, collections, metrics, retrievers, and LLMs. Inconsistent benchmarking poses a major challenge in comparing approaches and understanding the impact of each component in the pipeline. In this work, we study best practices that lay the groundwork for a systematic evaluation of RAG and present BERGEN, an end-to-end library for reproducible research standardizing RAG experiments. In an extensive study focusing on QA, we benchmark different state-of-the-art retrievers, rerankers, and LLMs. Additionally, we analyze existing RAG metrics and datasets.
Retrieval-augmented generation in multilingual settings
Nadezhda Chirkova | David Rau | Hervé Déjean | Thibault Formal | Stéphane Clinchant | Vassilina Nikoulina
Proceedings of the 1st Workshop on Towards Knowledgeable Language Models (KnowLLM 2024)
Retrieval-augmented generation (RAG) has recently emerged as a promising solution for incorporating up-to-date or domain-specific knowledge into large language models (LLMs) and improving LLM factuality, but it is predominantly studied in English-only settings. In this work, we consider RAG in the multilingual setting (mRAG), i.e. with user queries and the datastore in 13 languages, and investigate which components, and with which adjustments, are needed to build a well-performing mRAG pipeline that can serve as a strong baseline in future work. Our findings highlight that despite the availability of high-quality off-the-shelf multilingual retrievers and generators, task-specific prompt engineering is needed to enable generation in user languages. Moreover, current evaluation metrics need adjustments for the multilingual setting, to account for variations in the spelling of named entities. The main limitations to be addressed in future work include frequent code-switching in non-Latin-alphabet languages, occasional fluency errors, misreading of the provided documents, and irrelevant retrieval. We release the code for the resulting mRAG baseline pipeline at https://github.com/naver/bergen, Documentation: https://github.com/naver/bergen/blob/main/documentations/multilingual.md.
2020
Vital Records: Uncover the past from historical handwritten records
Herve Dejean | Jean-Luc Meunier
Proceedings of the 4th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
We present Vital Records, a demonstrator based on deep-learning approaches to handwritten-text recognition, table processing and information extraction, which enables data from century-old documents to be parsed and analysed, making it possible to explore death records in space and time. This demonstrator provides a user interface for browsing and visualising data extracted from 80,000 handwritten pages of tabular data.
2004
A Geometric View on Bilingual Lexicon Extraction from Comparable Corpora
Eric Gaussier | J.M. Renders | I. Matveeva | C. Goutte | H. Dejean
Proceedings of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04)
2003
Reducing Parameter Space for Word Alignment
Herve Dejean | Eric Gaussier | Cyril Goutte | Kenji Yamada
Proceedings of the HLT-NAACL 2003 Workshop on Building and Using Parallel Texts: Data Driven Machine Translation and Beyond
2002
An Approach Based on Multilingual Thesauri and Model Combination for Bilingual Lexicon Extraction
Hervé Déjean | Éric Gaussier | Fatiha Sadat
COLING 2002: The 19th International Conference on Computational Linguistics
Combining Labelled and Unlabelled Data: A Case Study on Fisher Kernels and Transductive Inference for Biological Entity Recognition
Cyril Goutte | Hervé Déjean | Eric Gaussier | Nicola Cancedda | Jean-Michel Renders
COLING-02: The 6th Conference on Natural Language Learning 2002 (CoNLL-2002)
2001
Introduction to the CoNLL-2001 shared task: clause identification
Erik F. Tjong Kim Sang | Hervé Déjean
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)
Learning Computational Grammars
John Nerbonne | Anja Belz | Nicola Cancedda | Hervé Déjean | James Hammerton | Rob Koeling | Stasinos Konstantopoulos | Miles Osborne | Franck Thollard | Erik F. Tjong Kim Sang
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)
Using ALLiS for clausing
Hervé Déjean
Proceedings of the ACL 2001 Workshop on Computational Natural Language Learning (ConLL)
2000
Theory Refinement and Natural Language Learning
Herve Dejean
COLING 2000 Volume 1: The 18th International Conference on Computational Linguistics
Applying System Combination to Base Noun Phrase Identification
Erik F. Tjong Kim Sang | Walter Daelemans | Herve Dejean | Rob Koeling | Yuval Krymolowski | Vasin Punyakanok | Dan Roth
COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics
How To Evaluate and Compare Tagsets? A Proposal
Hervé Déjean
Proceedings of the Second International Conference on Language Resources and Evaluation (LREC’00)
ALLiS: a Symbolic Learning System for Natural Language Learning
Hervé Déjean
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop
Learning Syntactic Structures with XML
Hervé Déjean
Fourth Conference on Computational Natural Language Learning and the Second Learning Language in Logic Workshop
Co-authors
- Eric Gaussier 4
- Stéphane Clinchant 3
- Cyril Goutte 3
- Erik Tjong Kim Sang 3
- Nicola Cancedda 2
- Nadezhda Chirkova 2
- Thibault Formal 2
- Rob Koeling 2
- Vassilina Nikoulina 2
- David Rau 2
- Anja Belz 1
- Walter Daelemans 1
- James Hammerton 1
- Stasinos Konstantopoulos 1
- Yuval Krymolowski 1
- Maxime Louis 1
- Irina Matveeva 1
- Jean-Luc Meunier 1
- John Nerbonne 1
- Miles Osborne 1
- Vasin Punyakanok 1
- J.M. Renders 1
- Jean-Michel Renders 1
- Dan Roth 1
- Fatiha Sadat 1
- Franck Thollard 1
- Shuai Wang 1
- Kenji Yamada 1