Jean Maillard


2023

Text normalization for low-resource languages: the case of Ligurian
Stefano Lusito | Edoardo Ferrante | Jean Maillard
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

2022

OCR Improves Machine Translation for Low-Resource Languages
Oana Ignat | Jean Maillard | Vishrav Chaudhary | Francisco Guzmán
Findings of the Association for Computational Linguistics: ACL 2022

We aim to investigate the performance of current OCR systems on low resource languages and low resource scripts. We introduce and make publicly available a novel benchmark, OCR4MT, consisting of real and synthetic data, enriched with noise, for 60 low-resource languages in low resource scripts. We evaluate state-of-the-art OCR systems on our benchmark and analyse most common errors. We show that OCR monolingual data is a valuable resource that can increase performance of Machine Translation models, when used in backtranslation. We then perform an ablation study to investigate how OCR errors impact Machine Translation performance and determine what is the minimum level of OCR quality needed for the monolingual data to be useful for Machine Translation.

Findings of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages
David Adelani | Md Mahfuz Ibn Alam | Antonios Anastasopoulos | Akshita Bhagia | Marta R. Costa-jussà | Jesse Dodge | Fahim Faisal | Christian Federmann | Natalia Fedorova | Francisco Guzmán | Sergey Koshelev | Jean Maillard | Vukosi Marivate | Jonathan Mbuya | Alexandre Mourachko | Safiyyah Saleem | Holger Schwenk | Guillaume Wenzek
Proceedings of the Seventh Conference on Machine Translation (WMT)

We present the results of the WMT’22 Shared Task on Large-Scale Machine Translation Evaluation for African Languages. The shared task included both a data and a systems track, along with additional innovations, such as a focus on African languages and extensive human evaluation of submitted systems. We received 14 system submissions from 8 teams, as well as 6 data track contributions. We report a large progress in the quality of translation for African languages since the last iteration of this shared task: there is an increase of about 7.5 BLEU points across 72 language pairs, and the average BLEU scores went from 15.09 to 22.60.

2021

A Universal Dependencies corpus for Ligurian
Stefano Lusito | Jean Maillard
Proceedings of the Fifth Workshop on Universal Dependencies (UDW, SyntaxFest 2021)

Multi-Task Retrieval for Knowledge-Intensive Tasks
Jean Maillard | Vladimir Karpukhin | Fabio Petroni | Wen-tau Yih | Barlas Oguz | Veselin Stoyanov | Gargi Ghosh
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Retrieving relevant contexts from a large corpus is a crucial step for tasks such as open-domain question answering and fact checking. Although neural retrieval outperforms traditional methods like tf-idf and BM25, its performance degrades considerably when applied to out-of-domain data. Driven by the question of whether a neural retrieval model can be _universal_ and perform robustly on a wide variety of problems, we propose a multi-task trained model. Our approach not only outperforms previous methods in the few-shot setting, but also rivals specialised neural retrievers, even when in-domain training data is abundant. With the help of our retriever, we improve existing models for downstream tasks and closely match or improve the state of the art on multiple benchmarks.

KILT: a Benchmark for Knowledge Intensive Language Tasks
Fabio Petroni | Aleksandra Piktus | Angela Fan | Patrick Lewis | Majid Yazdani | Nicola De Cao | James Thorne | Yacine Jernite | Vladimir Karpukhin | Jean Maillard | Vassilis Plachouras | Tim Rocktäschel | Sebastian Riedel
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Challenging problems such as open-domain question answering, fact checking, slot filling and entity linking require access to large, external knowledge sources. While some models do well on individual tasks, developing general models is difficult as each task might require computationally expensive indexing of custom knowledge sources, in addition to dedicated infrastructure. To catalyze research on models that condition on specific information in large textual resources, we present a benchmark for knowledge-intensive language tasks (KILT). All tasks in KILT are grounded in the same snapshot of Wikipedia, reducing engineering turnaround through the re-use of components, as well as accelerating research into task-agnostic memory architectures. We test both task-specific and general baselines, evaluating downstream performance in addition to the ability of the models to provide provenance. We find that a shared dense vector index coupled with a seq2seq model is a strong baseline, outperforming more tailor-made approaches for fact checking, open-domain question answering and dialogue, and yielding competitive results on entity linking and slot filling, by generating disambiguated text. KILT data and code are available at https://github.com/facebookresearch/KILT.

2020

Decoding Brain Activity Associated with Literal and Metaphoric Sentence Comprehension Using Distributional Semantic Models
Vesna G. Djokic | Jean Maillard | Luana Bulat | Ekaterina Shutova
Transactions of the Association for Computational Linguistics, Volume 8

Recent years have seen a growing interest within the natural language processing (NLP) community in evaluating the ability of semantic models to capture human meaning representation in the brain. Existing research has mainly focused on applying semantic models to decode brain activity patterns associated with the meaning of individual words, and, more recently, this approach has been extended to sentences and larger text fragments. Our work is the first to investigate metaphor processing in the brain in this context. We evaluate a range of semantic models (word embeddings, compositional, and visual models) in their ability to decode brain activity associated with reading of both literal and metaphoric sentences. Our results suggest that compositional models and word embeddings are able to capture differences in the processing of literal and metaphoric sentences, providing support for the idea that the literal meaning is not fully accessible during familiar metaphor comprehension.

Conversational Semantic Parsing
Armen Aghajanyan | Jean Maillard | Akshat Shrivastava | Keith Diedrick | Michael Haeger | Haoran Li | Yashar Mehdad | Veselin Stoyanov | Anuj Kumar | Mike Lewis | Sonal Gupta
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)

The structured representation for semantic parsing in task-oriented assistant systems is geared towards simple understanding of one-turn queries. Due to the limitations of the representation, the session-based properties such as co-reference resolution and context carryover are processed downstream in a pipelined system. In this paper, we propose a semantic representation for such task-oriented conversational systems that can represent concepts such as co-reference and context carryover, enabling comprehensive understanding of queries in a session. We release a new session-based, compositional task-oriented parsing dataset of 20k sessions consisting of 60k utterances. Unlike Dialog State Tracking Challenges, the queries in the dataset have compositional forms. We propose a new family of Seq2Seq models for the session-based parsing above, which also set state-of-the-art in ATIS, SNIPS, TOP and DSTC2. Notably, we improve the best known results on DSTC2 by up to 5 points for slot-carryover.

2019

Modeling Affirmative and Negated Action Processing in the Brain with Lexical and Compositional Semantic Models
Vesna Djokic | Jean Maillard | Luana Bulat | Ekaterina Shutova
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Recent work shows that distributional semantic models can be used to decode patterns of brain activity associated with individual words and sentence meanings. However, it is yet unclear to what extent such models can be used to study and decode fMRI patterns associated with specific aspects of semantic composition such as the negation function. In this paper, we apply lexical and compositional semantic models to decode fMRI patterns associated with negated and affirmative sentences containing hand-action verbs. Our results show reduced decoding (correlation) of sentences where the verb is in the negated context, as compared to the affirmative one, within brain regions implicated in action-semantic processing. This supports behavioral and brain imaging studies, suggesting that negation involves reduced access to aspects of the affirmative mental representation. The results pave the way for testing alternate semantic models of negation against human semantic processing in the brain.

2018

Latent Tree Learning with Differentiable Parsers: Shift-Reduce Parsing and Chart Parsing
Jean Maillard | Stephen Clark
Proceedings of the Workshop on the Relevance of Linguistic Structure in Neural Architectures for NLP

Latent tree learning models represent sentences by composing their words according to an induced parse tree, all based on a downstream task. These models often outperform baselines which use (externally provided) syntax trees to drive the composition order. This work contributes (a) a new latent tree learning model based on shift-reduce parsing, with competitive downstream performance and non-trivial induced trees, and (b) an analysis of the trees learned by our shift-reduce model and by a chart-based model.

2016

Black Holes and White Rabbits: Metaphor Identification with Visual Features
Ekaterina Shutova | Douwe Kiela | Jean Maillard
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

RELPRON: A Relative Clause Evaluation Data Set for Compositional Distributional Semantics
Laura Rimell | Jean Maillard | Tamara Polajnar | Stephen Clark
Computational Linguistics, Volume 42, Issue 4 - December 2016

2015

Learning Adjective Meanings with a Tensor-Based Skip-Gram Model
Jean Maillard | Stephen Clark
Proceedings of the Nineteenth Conference on Computational Natural Language Learning

2014

A Type-Driven Tensor-Based Semantics for CCG
Jean Maillard | Stephen Clark | Edward Grefenstette
Proceedings of the EACL 2014 Workshop on Type Theory and Natural Language Semantics (TTNLS)