Syrielle Montariol


2021

Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
Nina Tahmasebi | Adam Jatowt | Yang Xu | Simon Hengchen | Syrielle Montariol | Haim Dubossarsky

Transport Optimal pour le Changement Sémantique à partir de Plongements Contextualisés (Optimal Transport for Semantic Change Detection using Contextualised Embeddings)
Syrielle Montariol | Alexandre Allauzen
Actes de la 28e Conférence sur le Traitement Automatique des Langues Naturelles. Volume 1 : conférence principale

Several methods for semantic change detection using contextualised word embeddings have emerged recently. They enable a fine-grained analysis of change in word usage by aggregating the contextualised embeddings into clusters that reflect the different usages of a word. We propose a new method based on optimal transport. We evaluate it on several annotated corpora, showing a gain in accuracy compared to other methods based on contextualised embeddings, and illustrate it on a corpus of newspaper articles.
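As a rough illustration of this family of methods (not the exact procedure of the paper), the sketch below clusters a target word's contextualised embeddings from two periods and measures change as the optimal transport cost between the two cluster-frequency distributions. It assumes scikit-learn for clustering and the POT library (ot.emd2) for the transport cost; the function name usage_change_score is made up for the example.

# Hypothetical sketch, not the paper's exact procedure: compare a word's usage
# distributions across two periods with optimal transport over usage clusters.
# Assumes scikit-learn and the POT library (pip install pot).
import numpy as np
import ot  # Python Optimal Transport
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist

def usage_change_score(emb_t1, emb_t2, n_clusters=5, seed=0):
    """emb_t1, emb_t2: arrays of contextualised embeddings of one target word,
    one row per occurrence, collected in period 1 and period 2."""
    all_emb = np.vstack([emb_t1, emb_t2])
    km = KMeans(n_clusters=n_clusters, random_state=seed).fit(all_emb)

    # Normalised cluster-frequency distribution of the word in each period.
    p1 = np.bincount(km.labels_[: len(emb_t1)], minlength=n_clusters).astype(float)
    p2 = np.bincount(km.labels_[len(emb_t1):], minlength=n_clusters).astype(float)
    p1 /= p1.sum()
    p2 /= p2.sum()

    # Ground cost: Euclidean distance between cluster centroids; the optimal
    # transport cost measures how much "usage mass" has to move between periods.
    cost = cdist(km.cluster_centers_, km.cluster_centers_)
    return ot.emd2(p1, p2, cost)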

Scalable and Interpretable Semantic Change Detection
Syrielle Montariol | Matej Martinc | Lidia Pivovarova
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Several cluster-based methods for semantic change detection with contextual embeddings have emerged recently. They allow a fine-grained analysis of word use change by aggregating embeddings into clusters that reflect the different usages of the word. However, these methods do not scale in terms of memory consumption and computation time, and therefore require a limited set of target words to be picked in advance. This drastically limits their usability in open exploratory tasks, where every word in the vocabulary is a potential target. We propose a novel scalable method for word usage-change detection that brings large gains in processing time and significant memory savings, while offering the same interpretability as, and better performance than, unscalable methods. We demonstrate the applicability of the proposed method by analysing a large corpus of news articles about COVID-19.
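A loose sketch of one generic way to keep memory bounded in this setting (not the method proposed in the paper): fit clusters with scikit-learn's MiniBatchKMeans over streamed batches of embeddings and retain only per-period cluster counts, never the full set of occurrence vectors. The function and argument names are illustrative.

# Illustrative sketch only: bound memory by clustering a word's contextual
# embeddings in mini-batches and keeping only cluster counts per period
# instead of all embedding vectors.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def streaming_usage_distribution(batch_factory, n_clusters=5, seed=0):
    """batch_factory: callable returning an iterable of (period_id, embeddings)
    batches, so the stream can be consumed twice without holding it in memory."""
    km = MiniBatchKMeans(n_clusters=n_clusters, random_state=seed)
    for _, emb in batch_factory():        # pass 1: fit centroids incrementally
        km.partial_fit(emb)

    counts = {}                           # period_id -> cluster frequency vector
    for period, emb in batch_factory():   # pass 2: count cluster assignments
        labels = km.predict(emb)
        vec = counts.setdefault(period, np.zeros(n_clusters))
        vec += np.bincount(labels, minlength=n_clusters)
    return {p: v / v.sum() for p, v in counts.items()}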

Measure and Evaluation of Semantic Divergence across Two Languages
Syrielle Montariol | Alexandre Allauzen
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Languages are dynamic systems: word usage may change over time, reflecting various societal factors. However, not all languages evolve identically: the impact of an event, or the influence of a trend or school of thought, can differ between communities. In this paper, we propose to track these divergences by comparing the evolution of a word and of its translation across two languages. We investigate several methods for building time-varying and bilingual word embeddings, using both contextualised and non-contextualised embeddings. We propose a set of scenarios to characterise semantic divergence across two languages, along with a setup to differentiate them in a bilingual corpus. We evaluate the different methods by generating a corpus of synthetic semantic change across two languages, English and French, before applying them to newspaper corpora to detect bilingual semantic divergence and provide qualitative insight into the task. We conclude that BERT embeddings coupled with a clustering step lead to the best performance on the synthetic corpora; however, the performance of CBOW embeddings is very competitive and better suited to exploratory analysis on a large corpus.
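As a toy illustration of the underlying idea, assuming non-contextualised embeddings that are already aligned across time within each language, one could compare the drift of a word in English with the drift of its translation in French and treat the gap as a divergence signal. Everything below is a hypothetical sketch, not the evaluation setup of the paper.

# Hypothetical illustration: a simple divergence score between a word's drift
# in English and its translation's drift in French, given per-period vectors
# that are assumed to be aligned across time within each language.
import numpy as np

def cosine_dist(u, v):
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def bilingual_divergence(en_t1, en_t2, fr_t1, fr_t2):
    """Each argument is the vector of the word (or of its translation) in one
    language and one time period. A large gap between the two per-language
    drifts suggests the usages evolved differently across communities."""
    drift_en = cosine_dist(en_t1, en_t2)
    drift_fr = cosine_dist(fr_t1, fr_t2)
    return abs(drift_en - drift_fr)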

2020

Étude des variations sémantiques à travers plusieurs dimensions (Studying semantic variations through several dimensions)
Syrielle Montariol | Alexandre Allauzen
Actes de la 6e conférence conjointe Journées d'Études sur la Parole (JEP, 33e édition), Traitement Automatique des Langues Naturelles (TALN, 27e édition), Rencontre des Étudiants Chercheurs en Informatique pour le Traitement Automatique des Langues (RÉCITAL, 22e édition). Volume 2 : Traitement Automatique des Langues Naturelles

Within a language, word usage varies along two axes: diachronic (the temporal dimension) and synchronic (variation across authors, communities, geographical areas, etc.). In this work, we propose a method for detecting and interpreting variations in word usage across these different dimensions. To this end, we exploit the capabilities of a new line of contextualised word embeddings, in particular the BERT model. We experiment on a corpus of financial reports from French companies, in order to capture the issues and concerns specific to certain periods, actors and business sectors.
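A minimal sketch of the kind of extraction step such an analysis relies on, assuming the Hugging Face transformers library and a multilingual BERT checkpoint (the model name and the single-match heuristic are illustrative, not the paper's exact pipeline): collect one contextual vector per occurrence of a target word so that usages can later be grouped by period, actor or sector.

# Sketch under assumed setup: extract one BERT contextual embedding per
# occurrence of a target word; the caller keeps the metadata (year, sector,
# company) needed to group usages by dimension.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

def occurrence_embedding(sentence, target):
    """Mean of the last-layer vectors of the sub-tokens of `target` in `sentence`."""
    enc = tokenizer(sentence, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]   # (seq_len, dim)
    target_ids = tokenizer(target, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    # Locate the target's sub-token span (first match only, for simplicity).
    for i in range(len(ids) - len(target_ids) + 1):
        if ids[i:i + len(target_ids)] == target_ids:
            return hidden[i:i + len(target_ids)].mean(dim=0)
    return None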

Discovery Team at SemEval-2020 Task 1: Context-sensitive Embeddings Not Always Better than Static for Semantic Change Detection
Matej Martinc | Syrielle Montariol | Elaine Zosa | Lidia Pivovarova
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes the approaches used by the Discovery Team to solve SemEval-2020 Task 1 - Unsupervised Lexical Semantic Change Detection. The proposed method is based on clustering of BERT contextual embeddings, followed by a comparison of cluster distributions across time. The best results were obtained by an ensemble of this method and static Word2Vec embeddings. According to the official results, our approach proved the best for Latin in Subtask 2.
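A hedged sketch of the general recipe described above (the exact divergence measure and ensembling scheme are assumptions): score each word by the Jensen-Shannon distance between its per-period cluster distributions, score it again by the cosine distance between aligned static vectors, and ensemble the two by averaging ranks.

# Illustrative sketch; details are assumptions, not the team's exact system.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import rankdata

def ensemble_ranking(cluster_dists, static_scores):
    """cluster_dists: {word: (p_t1, p_t2)} with per-period cluster distributions.
    static_scores:   {word: cosine distance between aligned Word2Vec vectors}."""
    words = sorted(cluster_dists)
    jsd = np.array([jensenshannon(*cluster_dists[w]) for w in words])  # JS distance
    sta = np.array([static_scores[w] for w in words])
    combined = (rankdata(jsd) + rankdata(sta)) / 2.0   # average rank = ensemble
    return dict(zip(words, combined))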

Variations in Word Usage for the Financial Domain
Syrielle Montariol | Alexandre Allauzen | Asanobu Kitamoto
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing

Detecting Omissions of Risk Factors in Company Annual Reports
Corentin Masson | Syrielle Montariol
Proceedings of the Second Workshop on Financial Technology and Natural Language Processing

2019

Apprentissage de plongements de mots dynamiques avec régularisation de la dérive (Learning dynamic word embeddings with drift regularisation)
Syrielle Montariol | Alexandre Allauzen
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume I : Articles longs

The usage, meaning and connotation of words can change over time. Diachronic word embeddings make it possible to model these changes in an unsupervised way. In this article, we study the impact of several loss functions on the learning of dynamic embeddings, comparing the behaviour of variants of the Dynamic Bernoulli Embeddings model. The dynamic embeddings are estimated on two corpora covering the same two decades, the New York Times Annotated Corpus in English and a selection of articles from the newspaper Le Monde in French, which allows us to set up a bilingual analysis of the evolution of word usage.
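A minimal sketch of one common form of drift regularisation in this spirit, assuming a PyTorch training loop (the names and the weight lam are placeholders): an L2 penalty ties each word vector at time t to its value at time t-1 so that embeddings evolve smoothly across periods.

# Minimal sketch under an assumed PyTorch setup, not the paper's exact loss.
import torch

def drift_penalty(emb_t, emb_prev, lam=0.1):
    """emb_t, emb_prev: (vocab_size, dim) embedding matrices of two consecutive
    time slices; lam controls how strongly drift between slices is penalised."""
    return lam * torch.sum((emb_t - emb_prev) ** 2)

# During training of slice t, the total loss would look something like:
#   loss = reconstruction_loss_t + drift_penalty(embeddings_t, embeddings_prev.detach())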

Exploring sentence informativeness
Syrielle Montariol | Aina Garí Soler | Alexandre Allauzen
Actes de la Conférence sur le Traitement Automatique des Langues Naturelles (TALN) PFIA 2019. Volume II : Articles courts

This study is a preliminary exploration of the concept of informativeness, that is, how much information a sentence gives about a word it contains, and of its potential benefits for building quality word representations from scarce data. We propose several sentence-level classifiers to predict informativeness, and we perform a manual annotation of a set of sentences. We conclude that the classifier-based and manual measures correspond to different notions of informativeness. However, our experiments show that using the classifiers' predictions to train word embeddings has an impact on embedding quality.

Empirical Study of Diachronic Word Embeddings for Scarce Data
Syrielle Montariol | Alexandre Allauzen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Word meaning change can be inferred from drifts of time-varying word embeddings. However, temporal data may be too sparse to build robust word embeddings and to discriminate significant drifts from noise. In this paper, we compare three models to learn diachronic word embeddings on scarce data: incremental updating of a Skip-Gram from Kim et al. (2014), dynamic filtering from Bamler & Mandt (2017), and dynamic Bernoulli embeddings from Rudolph & Blei (2018). In particular, we study the performance of different initialisation schemes and highlight which characteristics of each model are better suited to data scarcity, relying on the distribution of detected drifts. Finally, we regularise the loss of these models to better adapt them to scarce data.
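A short sketch of the incremental-updating scheme attributed to Kim et al. (2014), assuming gensim 4.x (function names and hyperparameters are illustrative): train Skip-Gram on the first slice, then keep training the same model on each later slice so that every period starts from the previous period's vectors.

# Sketch of incremental Skip-Gram updating across time slices, assuming gensim 4.x.
from gensim.models import Word2Vec

def train_incremental(slices, dim=100):
    """slices: list of corpora, one per time period; each corpus is a list of
    tokenised sentences. Returns the final model and one vector snapshot per period."""
    snapshots = []
    model = Word2Vec(sentences=slices[0], vector_size=dim, sg=1, min_count=5)
    snapshots.append({w: model.wv[w].copy() for w in model.wv.index_to_key})
    for corpus in slices[1:]:
        model.build_vocab(corpus, update=True)   # add new words, keep old vectors
        model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)
        snapshots.append({w: model.wv[w].copy() for w in model.wv.index_to_key})
    return model, snapshots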