Samuel Rönnqvist


2022

Explaining Classes through Stable Word Attributions
Samuel Rönnqvist | Aki-Juhani Kyröläinen | Amanda Myntti | Filip Ginter | Veronika Laippala
Findings of the Association for Computational Linguistics: ACL 2022

Input saliency methods have recently become a popular tool for explaining predictions of deep learning models in NLP. Nevertheless, there has been little work investigating methods for aggregating prediction-level explanations to the class level, nor has a framework for evaluating such class explanations been established. We explore explanations based on XLM-R and the Integrated Gradients input attribution method, and propose 1) the Stable Attribution Class Explanation method (SACX) to extract keyword lists of classes in text classification tasks, and 2) a framework for the systematic evaluation of the keyword lists. We find that explanations of individual predictions are prone to noise, but that stable explanations can be effectively identified through repeated training and explanation. We evaluate on web register data and show that the class explanations are linguistically meaningful and distinctive of the classes.
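The aggregation idea behind SACX can be illustrated with a short Python sketch (a minimal illustration, not the authors' released code): given per-run keyword attribution scores, e.g. obtained from Integrated Gradients over an XLM-R classifier, a word is kept as a class keyword only if it ranks highly in sufficiently many independent training-and-explanation runs. The function names, toy scores and thresholds below are illustrative assumptions.

from collections import Counter
from typing import Dict, List

def top_keywords(attributions: Dict[str, float], k: int = 100) -> List[str]:
    # Words with the highest aggregated attribution for a class in one run.
    ranked = sorted(attributions.items(), key=lambda x: x[1], reverse=True)
    return [word for word, _ in ranked[:k]]

def stable_keywords(runs: List[Dict[str, float]], k: int = 100,
                    min_runs: int = 2) -> List[str]:
    # Keep only words that reach the top-k list in at least min_runs
    # independent runs, filtering out attributions that are artefacts of a
    # single noisy trained model.
    counts = Counter(word for run in runs for word in top_keywords(run, k))
    return [word for word, count in counts.most_common() if count >= min_runs]

# Toy usage: three runs of class-level attribution scores for one class.
runs = [
    {"goal": 0.9, "match": 0.7, "random": 0.6, "team": 0.4},
    {"goal": 0.8, "team": 0.7, "match": 0.5, "noise": 0.6},
    {"match": 0.9, "goal": 0.6, "team": 0.5, "other": 0.3},
]
print(stable_keywords(runs, k=3, min_runs=2))  # ['goal', 'match', 'team']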

Towards better structured and less noisy Web data: Oscar with Register annotations
Veronika Laippala | Anna Salmela | Samuel Rönnqvist | Alham Fikri Aji | Li-Hsin Chang | Asma Dhifallah | Larissa Goulart | Henna Kortelainen | Marc Pàmies | Deise Prina Dutra | Valtteri Skantsi | Lintang Sutawika | Sampo Pyysalo
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, that is, determining whether the texts are, e.g., forum discussions, lyrical pages or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning texts originating from Web crawls, removing remains of boilerplate and other elements not belonging to the main text of the Web page. The register-labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.
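The released dataset can be loaded with the Hugging Face datasets library; a minimal usage sketch, assuming the dataset exposes per-language configurations and text/label fields (the exact configuration and field names should be verified on the dataset page):

from datasets import load_dataset

# "en" and the field names below are assumptions; check them on
# https://huggingface.co/datasets/TurkuNLP/register_oscar
register_oscar = load_dataset("TurkuNLP/register_oscar", "en", split="train")

for doc in register_oscar.select(range(3)):
    print(doc["labels"], doc["text"][:80])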

2021

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo | Valtteri Skantsi | Samuel Rönnqvist | Saara Hellström | Miika Oinonen | Anna Salmela | Douglas Biber | Jesse Egbert | Sampo Pyysalo | Veronika Laippala
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse the classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.
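The zero-shot setup follows the by-now standard recipe of fine-tuning a multilingual encoder on register-annotated English data and applying it unchanged to French or Swedish documents. A minimal sketch with XLM-R, where the label set, the example document and the omitted fine-tuning loop are illustrative rather than the CORE annotation scheme or the paper's training code:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Placeholder register labels; the CORE scheme is larger and hierarchical.
labels = ["narrative", "opinion", "how-to", "interactive discussion"]

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels)
)

# After fine-tuning the classifier on English register-annotated documents
# (omitted here), the same model is applied zero-shot to a French text:
doc = "Recette facile : faites fondre le beurre, ajoutez la farine ..."
inputs = tokenizer(doc, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(labels[logits.argmax(dim=-1).item()])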

Multilingual and Zero-Shot is Closing in on Monolingual Web Register Classification
Samuel Rönnqvist | Valtteri Skantsi | Miika Oinonen | Veronika Laippala
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

This article studies register classification of documents from the unrestricted web, such as news articles or opinion blogs, in a multilingual setting, exploring both the benefit of training on multiple languages and the capabilities for zero-shot cross-lingual transfer. While the wide range of linguistic variation found on the web poses challenges for register classification, recent studies have shown that good levels of cross-lingual transfer from the extensive English CORE corpus to other languages can be achieved. In this study, we show that training on multiple languages 1) benefits languages with limited amounts of register-annotated data, 2) on average achieves performance on par with monolingual models, and 3) greatly improves upon previous zero-shot results in Finnish, French and Swedish. The best results are achieved with the multilingual XLM-R model. As data, we use the CORE corpus series featuring register-annotated data from the unrestricted web.
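Multilingual training in this setting amounts to pooling the register-annotated corpora of the individual languages before fine-tuning; a short sketch with the datasets library, where the file names are hypothetical placeholders rather than the actual CORE corpus distributions:

from datasets import load_dataset, concatenate_datasets

# Hypothetical local files with register-annotated documents per language.
files = {"en": "core_en.jsonl", "fi": "fincore.jsonl", "sv": "swecore.jsonl"}
per_language = [load_dataset("json", data_files=path, split="train")
                for path in files.values()]

# A multilingual training set is simply the shuffled concatenation, which
# lets languages with little annotated data benefit from the larger corpora.
multilingual_train = concatenate_datasets(per_language).shuffle(seed=42)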

2020

From Web Crawl to Clean Register-Annotated Corpora
Veronika Laippala | Samuel Rönnqvist | Saara Hellström | Juhani Luotolahti | Liina Repo | Anna Salmela | Valtteri Skantsi | Sampo Pyysalo
Proceedings of the 12th Web as Corpus Workshop

The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare boilerplate removal against the CoNLL texts, and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are also promising in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a relatively small set of registers, which are largely similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.
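The first step, fetching pages and stripping boilerplate, can be reproduced with generic tooling; one possible illustration follows (trafilatura is an assumption here, not necessarily the tool used in the paper), with the later steps then operating on the extracted text:

import trafilatura  # a generic extractor; not necessarily the paper's tool

url = "https://example.org/some-page"      # in the paper: CoNLL 2017 URLs
html = trafilatura.fetch_url(url)          # step 1: fetch the HTML page
main_text = trafilatura.extract(html)      # step 1: boilerplate removal
print(main_text)
# Steps 2 and 3 (fragment-level cleaning with Multilingual BERT and
# register annotation) would then be applied to main_text.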

2019

Template-free Data-to-Text Generation of Finnish Sports News
Jenna Kanerva | Samuel Rönnqvist | Riina Kekki | Tapio Salakoski | Filip Ginter
Proceedings of the 22nd Nordic Conference on Computational Linguistics

News articles such as sports game reports are often thought to closely follow the underlying game statistics, but in practice they contain a notable amount of background knowledge, interpretation, insight into the game, and quotes that are not present in the official statistics. This poses a challenge for automated data-to-text news generation with real-world news corpora as training data. We report on the development of a corpus of Finnish ice hockey news, edited to be suitable for training end-to-end news generation methods, and demonstrate the generation of text that was judged by journalists to be relatively close to a viable product. The new dataset and system source code are available for research purposes.

Is Multilingual BERT Fluent in Language Generation?
Samuel Rönnqvist | Jenna Kanerva | Tapio Salakoski | Filip Ginter
Proceedings of the First NLPL Workshop on Deep Learning for Natural Language Processing

The multilingual BERT model is trained on 104 languages and meant to serve as a universal language model and tool for encoding sentences. We explore how well the model performs on several languages across several tasks: a diagnostic classification probing the embeddings for a particular syntactic property, a cloze task testing the language modelling ability to fill in gaps in a sentence, and a natural language generation task testing for the ability to produce coherent text fitting a given context. We find that the currently available multilingual BERT model is clearly inferior to its monolingual counterparts, and cannot in many cases serve as a substitute for a well-trained monolingual model. We find that the English and German models perform well at generation, whereas the multilingual model is lacking, in particular for Nordic languages. The code of the experiments in the paper is available at: https://github.com/TurkuNLP/bert-eval
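The cloze task can be approximated with the Transformers fill-mask pipeline; a minimal sketch, where the example sentences are illustrative and the paper's actual evaluation code is the repository linked above:

from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-multilingual-cased")

for sentence in [
    "The capital of Finland is [MASK].",     # English
    "Huvudstaden i Finland är [MASK].",      # Swedish
]:
    best = fill(sentence, top_k=1)[0]
    print(sentence, "->", best["token_str"], round(best["score"], 3))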

2017

A Recurrent Neural Model with Attention for the Recognition of Chinese Implicit Discourse Relations
Samuel Rönnqvist | Niko Schenk | Christian Chiarcos
Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

We introduce an attention-based Bi-LSTM for Chinese implicit discourse relations and demonstrate that modeling argument pairs as a joint sequence can outperform word order-agnostic approaches. Our model benefits from a partial sampling scheme and is conceptually simple, yet achieves state-of-the-art performance on the Chinese Discourse Treebank. We also visualize its attention activity to illustrate the model’s ability to selectively focus on the relevant parts of an input sequence.
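The model family can be sketched as a compact PyTorch module: a Bi-LSTM over the concatenated argument pair, with a softmax attention layer producing both the pooled representation and the per-token weights that can be visualized. Layer sizes and other details below are illustrative, not the authors' exact configuration:

import torch
import torch.nn as nn

class AttentiveBiLSTM(nn.Module):
    # A generic attention-based Bi-LSTM classifier over a joint token
    # sequence (both discourse arguments concatenated).
    def __init__(self, vocab_size, emb_dim=100, hidden=128, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden, 1)            # one score per token
        self.out = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                        # (batch, seq_len)
        states, _ = self.lstm(self.embed(token_ids))     # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(states).squeeze(-1), dim=-1)
        pooled = torch.bmm(weights.unsqueeze(1), states).squeeze(1)
        return self.out(pooled), weights                 # logits + attention weights

# Toy forward pass over a batch of two padded argument-pair sequences.
model = AttentiveBiLSTM(vocab_size=5000)
logits, attention = model(torch.randint(1, 5000, (2, 40)))
print(logits.shape, attention.shape)  # torch.Size([2, 10]) torch.Size([2, 40])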

2016

Do We Really Need All Those Rich Linguistic Features? A Neural Network-Based Approach to Implicit Sense Labeling
Niko Schenk | Christian Chiarcos | Kathrin Donandt | Samuel Rönnqvist | Evgeny Stepanov | Giuseppe Riccardi
Proceedings of the CoNLL-16 shared task