Simon Hengchen


2023

pdf bib
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Haim Dubossarsky | Andrey Kutuzov | Simon Hengchen | David Alfter | Francesco Periti | Pierluigi Cassotti
Proceedings of the 4th Workshop on Computational Approaches to Historical Language Change

pdf
Superlim: A Swedish Language Understanding Evaluation Benchmark
Aleksandrs Berdicevskis | Gerlof Bouma | Robin Kurtz | Felix Morger | Joey Öhman | Yvonne Adesam | Lars Borin | Dana Dannélls | Markus Forsberg | Tim Isbister | Anna Lindahl | Martin Malmsten | Faton Rekathati | Magnus Sahlgren | Elena Volodina | Love Börjeson | Simon Hengchen | Nina Tahmasebi
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

We present Superlim, a multi-task NLP benchmark and analysis platform for evaluating Swedish language models, a counterpart to the English-language (Super)GLUE suite. We describe the dataset, the tasks, the leaderboard and report the baseline results yielded by a reference implementation. The tested models do not approach ceiling performance on any of the tasks, which suggests that Superlim is truly difficult, a desirable quality for a benchmark. We address methodological challenges, such as mitigating the Anglocentric bias when creating datasets for a less-resourced language; choosing the most appropriate measures; documenting the datasets and making the leaderboard convenient and transparent. We also highlight other potential usages of the dataset, such as, for instance, the evaluation of cross-lingual transfer learning.

2022

pdf bib
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change
Nina Tahmasebi | Syrielle Montariol | Andrey Kutuzov | Simon Hengchen | Haim Dubossarsky | Lars Borin
Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

2021

pdf
An Unsupervised method for OCR Post-Correction and Spelling Normalisation for Finnish
Quan Duong | Mika Hämäläinen | Simon Hengchen
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Historical corpora are known to contain errors introduced by OCR (optical character recognition) methods used in the digitization process, often said to be degrading the performance of NLP systems. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We build on previous work on fully automatic unsupervised extraction of parallel data to train a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction designed for English, and adapt it to Finnish by proposing solutions that take the rich morphology of the language into account. Our new method shows increased performance while remaining fully unsupervised, with the added benefit of spelling normalisation. The source code and models are available on GitHub and Zenodo.

pdf
SuperSim: a test set for word similarity and relatedness in Swedish
Simon Hengchen | Nina Tahmasebi
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Language models are notoriously difficult to evaluate. We release SuperSim, a large-scale similarity and relatedness test set for Swedish built with expert human judgements. The test set is composed of 1,360 word-pairs independently judged for both relatedness and similarity by five annotators. We evaluate three different models (Word2Vec, fastText, and GloVe) trained on two separate Swedish datasets, namely the Swedish Gigaword corpus and a Swedish Wikipedia dump, to provide a baseline for future comparison. We will release the fully annotated test set, code, models, and data.

pdf
DWUG: A large Resource of Diachronic Word Usage Graphs in Four Languages
Dominik Schlechtweg | Nina Tahmasebi | Simon Hengchen | Haim Dubossarsky | Barbara McGillivray
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Word meaning is notoriously difficult to capture, both synchronically and diachronically. In this paper, we describe the creation of the largest resource of graded contextualized, diachronic word meaning annotation in four different languages, based on 100,000 human semantic proximity judgments. We describe in detail the multi-round incremental annotation process, the choice for a clustering algorithm to group usages into senses, and possible – diachronic and synchronic – uses for this dataset.

pdf bib
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021
Nina Tahmasebi | Adam Jatowt | Yang Xu | Simon Hengchen | Syrielle Montariol | Haim Dubossarsky
Proceedings of the 2nd International Workshop on Computational Approaches to Historical Language Change 2021

2020

pdf bib
SemEval-2020 Task 1: Unsupervised Lexical Semantic Change Detection
Dominik Schlechtweg | Barbara McGillivray | Simon Hengchen | Haim Dubossarsky | Nina Tahmasebi
Proceedings of the Fourteenth Workshop on Semantic Evaluation

Lexical Semantic Change detection, i.e., the task of identifying words that change meaning over time, is a very active research area, with applications in NLP, lexicography, and linguistics. Evaluation is currently the most pressing problem in Lexical Semantic Change detection, as no gold standards are available to the community, which hinders progress. We present the results of the first shared task that addresses this gap by providing researchers with an evaluation framework and manually annotated, high-quality datasets for English, German, Latin, and Swedish. 33 teams submitted 186 systems, which were evaluated on two subtasks.

pdf
Dataset for Temporal Analysis of English-French Cognates
Esteban Frossard | Mickael Coustaty | Antoine Doucet | Adam Jatowt | Simon Hengchen
Proceedings of the Twelfth Language Resources and Evaluation Conference

Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support investigating the similarity in evolution between different languages. We look in particular into the similarities and differences between the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches of synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.

2019

pdf
Time-Out: Temporal Referencing for Robust Modeling of Lexical Semantic Change
Haim Dubossarsky | Simon Hengchen | Nina Tahmasebi | Dominik Schlechtweg
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

State-of-the-art models of lexical semantic change detection suffer from noise stemming from vector space alignment. We have empirically tested the Temporal Referencing method for lexical semantic change and show that, by avoiding alignment, it is less affected by this noise. We show that, trained on a diachronic corpus, the skip-gram with negative sampling architecture with temporal referencing outperforms alignment models on a synthetic task as well as a manual testset. We introduce a principled way to simulate lexical semantic change and systematically control for possible biases.

pdf
GASC: Genre-Aware Semantic Change for Ancient Greek
Valerio Perrone | Marco Palma | Simon Hengchen | Alessandro Vatri | Jim Q. Smith | Barbara McGillivray
Proceedings of the 1st International Workshop on Computational Approaches to Historical Language Change

Word meaning changes over time, depending on linguistic and extra-linguistic factors. Associating a word’s correct meaning in its historical context is a central challenge in diachronic research, and is relevant to a range of NLP tasks, including information retrieval and semantic search in historical texts. Bayesian models for semantic change have emerged as a powerful tool to address this challenge, providing explicit and interpretable representations of semantic change phenomena. However, while corpora typically come with rich metadata, existing models are limited by their inability to exploit contextual information (such as text genre) beyond the document time-stamp. This is particularly critical in the case of ancient languages, where lack of data and long diachronic span make it harder to draw a clear distinction between polysemy (the fact that a word has several senses) and semantic change (the process of acquiring, losing, or changing senses), and current systems perform poorly on these languages. We develop GASC, a dynamic semantic change model that leverages categorical metadata about the texts’ genre to boost inference and uncover the evolution of meanings in Ancient Greek corpora. In a new evaluation framework, our model achieves improved predictive performance compared to the state of the art.

pdf
From the Paft to the Fiiture: a Fully Automatic NMT and Word Embeddings Method for OCR Post-Correction
Mika Hämäläinen | Simon Hengchen
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

A great deal of historical corpora suffer from errors introduced by the OCR (optical character recognition) methods used in the digitization process. Correcting these errors manually is a time-consuming process and a great part of the automatic approaches have been relying on rules or supervised machine learning. We present a fully automatic unsupervised way of extracting parallel data for training a character-based sequence-to-sequence NMT (neural machine translation) model to conduct OCR error correction.