Marie-Anne Lachaux
2020
CCNet: Extracting High Quality Monolingual Datasets from Web Crawl Data
Guillaume Wenzek | Marie-Anne Lachaux | Alexis Conneau | Vishrav Chaudhary | Francisco Guzmán | Armand Joulin | Edouard Grave
Proceedings of the Twelfth Language Resources and Evaluation Conference
Pre-trained text representations have led to significant improvements in many areas of natural language processing. The quality of these models benefits greatly from the size of the pretraining corpora, as long as their quality is preserved. In this paper, we describe an automatic pipeline to extract massive high-quality monolingual datasets from Common Crawl for a variety of languages. Our pipeline follows the data processing introduced in fastText (Mikolov et al., 2017; Grave et al., 2018), which deduplicates documents and identifies their language. We augment this pipeline with a filtering step to select documents that are close to high-quality corpora like Wikipedia.
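As a concrete illustration of the three stages the abstract names (paragraph-level deduplication, language identification, and Wikipedia-based quality filtering), here is a minimal Python sketch. It assumes fastText's public lid.176.bin language-ID model and a KenLM n-gram model trained on Wikipedia; the model paths, thresholds, and helper names are hypothetical, not taken from the paper.

```python
import hashlib

import fasttext  # pip install fasttext; used for language identification
import kenlm     # pip install kenlm; n-gram LM scoring

# Hypothetical local paths: lid.176.bin is fastText's public language-ID
# model; the KenLM binary is assumed to be trained on English Wikipedia.
LID_MODEL = fasttext.load_model("lid.176.bin")
WIKI_LM = kenlm.Model("en.wikipedia.5gram.bin")

def dedup_paragraphs(paragraphs, seen_hashes):
    """Drop paragraphs whose normalized hash was already observed."""
    kept = []
    for p in paragraphs:
        h = hashlib.sha1(p.strip().lower().encode("utf-8")).digest()
        if h not in seen_hashes:
            seen_hashes.add(h)
            kept.append(p)
    return kept

def identify_language(text, threshold=0.5):
    """Return (lang, score) from fastText, or (None, score) below threshold."""
    labels, scores = LID_MODEL.predict(text.replace("\n", " "), k=1)
    lang, score = labels[0].replace("__label__", ""), float(scores[0])
    return (lang, score) if score >= threshold else (None, score)

def wiki_perplexity(text):
    """Approximate per-word perplexity under the Wikipedia LM.

    Lower perplexity means the document looks more like Wikipedia,
    so a threshold on this value implements the quality filter.
    """
    words = text.split()
    log10_prob = WIKI_LM.score(" ".join(words))  # total log10 probability
    return 10.0 ** (-log10_prob / (len(words) + 1))  # +1 for the </s> token
```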
Target Conditioning for One-to-Many Generation
Marie-Anne Lachaux | Armand Joulin | Guillaume Lample
Findings of the Association for Computational Linguistics: EMNLP 2020
Neural Machine Translation (NMT) models often lack diversity in their generated translations, even when paired with a search algorithm such as beam search. A challenge is that the diversity in translations is caused by variability in the target language and cannot be inferred from the source sentence alone. In this paper, we propose to explicitly model this one-to-many mapping by conditioning the decoder of an NMT model on a latent variable that represents the domain of target sentences. The domain is a discrete variable generated by a target encoder that is jointly trained with the NMT model. The predicted domain of each target sentence is given as input to the decoder during training. At inference, we can generate diverse translations by decoding with different domains. Unlike our strongest baseline (Shen et al., 2019), our method can scale to any number of domains without affecting the performance or the training time. We assess the quality and diversity of translations generated by our model with several metrics, on three different datasets.
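One simple way to realize the conditioning described above is to embed the discrete domain id produced by the target encoder and add it to the decoder's token embeddings. The abstract does not specify the exact mechanism, so the PyTorch sketch below is an assumption of mine; all class, variable, and shape names are hypothetical.

```python
import torch
import torch.nn as nn

class DomainConditionedDecoderInput(nn.Module):
    """Sketch: condition decoder inputs on a discrete domain variable.

    The target encoder that predicts the domain from the reference
    translation is abstracted away; we only consume its output ids.
    """
    def __init__(self, num_domains: int, d_model: int):
        super().__init__()
        self.domain_embedding = nn.Embedding(num_domains, d_model)

    def forward(self, decoder_token_embeddings: torch.Tensor,
                domain_ids: torch.Tensor) -> torch.Tensor:
        # decoder_token_embeddings: (batch, tgt_len, d_model)
        # domain_ids: (batch,) one discrete domain per target sentence
        dom = self.domain_embedding(domain_ids).unsqueeze(1)  # (batch, 1, d_model)
        # Broadcast the domain embedding over every decoder time step.
        return decoder_token_embeddings + dom
```

At inference time, decoding the same source once per domain id (0 through num_domains - 1) yields the diverse translations the abstract describes, without retraining or extra per-domain parameters beyond the embedding table.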
Co-authors
- Armand Joulin 2
- Guillaume Wenzek 1
- Alexis Conneau 1
- Vishrav Chaudhary 1
- Francisco Guzmán 1
- Edouard Grave 1
- Guillaume Lample 1