Annika Simonsen


2024

pdf
Good or Bad News? Exploring GPT-4 for Sentiment Analysis for Faroese on a Public News Corpora
Iben Nyholm Debess | Annika Simonsen | Hafsteinn Einarsson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Sentiment analysis in low-resource languages presents unique challenges that Large Language Models may help address. This study explores the efficacy of GPT-4 for sentiment analysis on Faroese news texts, an uncharted task for this language. On the basis of guidelines presented, the sentiment analysis was performed with a multi-class approach at the sentence and document level with 225 sentences analysed in 170 articles. When comparing GPT-4 to human annotators, we observe that GPT-4 performs remarkably well. We explored two prompt configurations and observed a benefit from having clear instructions for the sentiment analysis task, but no benefit from translating the articles to English before the sentiment analysis task. Our results indicate that GPT-4 can be considered as a valuable tool for generating Faroese test data. Furthermore, our investigation reveals the intricacy of news sentiment. This motivates a more nuanced approach going forward, and we suggest a multi-label approach for future research in this domain. We further explored the efficacy of GPT-4 in topic classification on news texts and observed more negative sentiments expressed in international than national news. Overall, this work demonstrates GPT-4’s proficiency on a novel task and its utility for augmenting resources in low-data languages.

pdf
Ice and Fire: Dataset on Sentiment, Emotions, Toxicity, Sarcasm, Hate speech, Sympathy and More in Icelandic Blog Comments
Steinunn Rut Friðriksdóttir | Annika Simonsen | Atli Snær Ásmundsson | Guðrún Lilja Friðjónsdóttir | Anton Karl Ingason | Vésteinn Snæbjarnarson | Hafsteinn Einarsson
Proceedings of the Fourth Workshop on Threat, Aggression & Cyberbullying @ LREC-COLING-2024

This study introduces “Ice and Fire,” a Multi-Task Learning (MTL) dataset tailored for sentiment analysis in the Icelandic language, encompassing a wide range of linguistic tasks, including sentiment and emotion detection, as well as identification of toxicity, hate speech, encouragement, sympathy, sarcasm/irony, and trolling. With 261 fully annotated blog comments and 1045 comments annotated in at least one task, this contribution marks a significant step forward in the field of Icelandic natural language processing. It provides a comprehensive dataset for understanding the nuances of online communication in Icelandic and an interface to expand the annotation effort. Despite the challenges inherent in subjective interpretation of text, our findings highlight the positive potential of this dataset to improve text analysis techniques and encourage more inclusive online discourse in Icelandic communities. With promising baseline performances, “Ice and Fire” sets the stage for future research to enhance automated text analysis and develop sophisticated language technologies, contributing to healthier online environments and advancing Icelandic language resources.

2023

pdf
ASR Language Resources for Faroese
Carlos Hernández Mena | Annika Simonsen | Jon Gudnason
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

The aim of this work is to present a set of novel language resources in Faroese suitable for the field of Automatic Speech Recognition including: an ASR corpus comprised of 109 hours of transcribed speech data, acoustic models in systems such as WAV2VEC2, NVIDIA-NeMo, Kaldi and PocketSphinx; a set of n-gram language models and a set of pronunciation dictionaries with two different variants of Faroese. We also show comparison results between the distinct acoustic models presented here. All the resources exposed in this document are publicly available under creative commons licences.

pdf
Standardising Pronunciation for a Grapheme-to-Phoneme Converter for Faroese
Sandra Lamhauge | Iben Debess | Carlos Hernández Mena | Annika Simonsen | Jon Gudnason
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Pronunciation dictionaries allow computational modelling of the pronunciation of words in a certain language and are widely used in speech technologies, especially in the fields of speech recognition and synthesis. On the other hand, a grapheme-to-phoneme tool is a generalization of a pronunciation dictionary that is not limited to a given and finite vocabulary. In this paper, we present a set of standardized phonological rules for the Faroese language; we introduce FARSAMPA, a machine-readable character set suitable for phonetic transcription of Faroese, and we present a set of grapheme-to-phoneme models for Faroese, which are publicly available and shared under a creative commons license. We present the G2P converter and evaluate the performance. The evaluation shows reliable results that demonstrate the quality of the data.

pdf
Transfer to a Low-Resource Language via Close Relatives: The Case Study on Faroese
Vésteinn Snæbjarnarson | Annika Simonsen | Goran Glavaš | Ivan Vulić
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Multilingual language models have pushed state-of-the-art in cross-lingual NLP transfer. The majority of zero-shot cross-lingual transfer, however, use one and the same massively multilingual transformer (e.g., mBERT or XLM-R) to transfer to all target languages, irrespective of their typological, etymological, and phylogenetic relations to other languages. In particular, readily available data and models of resource-rich sibling languages are often ignored. In this work, we empirically show, in a case study for Faroese – a low-resource language from a high-resource language family – that by leveraging the phylogenetic information and departing from the ‘one-size-fits-all’ paradigm, one can improve cross-lingual transfer to low-resource languages. In particular, we leverage abundant resources of other Scandinavian languages (i.e., Danish, Norwegian, Swedish, and Icelandic) for the benefit of Faroese. Our evaluation results show that we can substantially improve the transfer performance to Faroese by exploiting data and models of closely-related high-resource languages. Further, we release a new web corpus of Faroese and Faroese datasets for named entity recognition (NER), semantic text similarity (STS), and new language models trained on all Scandinavian languages.

pdf
Using C-LARA to evaluate GPT-4’s multilingual processing
ChatGPT C-LARA-Instance | Belinda Chiera | Cathy Chua | Chadi Raheb | Manny Rayner | Annika Simonsen | Zhengkang Xiang | Rina Zviel-Girshin
Proceedings of the 21st Annual Workshop of the Australasian Language Technology Association

We present a cross-linguistic study in which the open source C-LARA platform was used to evaluate GPT-4’s ability to perform several key tasks relevant to Computer Assisted Language Learning. For each of the languages English, Farsi, Faroese, Mandarin and Russian, we instructed GPT-4, through C-LARA, to write six different texts, using prompts chosen to obtain texts of widely differing character. We then further instructed GPT-4 to annotate each text with segmentation markup, glosses and lemma/part-of-speech information; native speakers hand-corrected the texts and annotations to obtain error rates on the different component tasks. The C-LARA platform makes it easy to combine the results into a single multimodal document, further facilitating checking of their correctness. GPT-4’s performance varied widely across languages and processing tasks, but performance on different text genres was roughly comparable. In some cases, most notably glossing of English text, we found that GPT-4 was consistently able to revise its annotations to improve them.

2022

pdf
Creating a Basic Language Resource Kit for Faroese
Annika Simonsen | Sandra Saxov Lamhauge | Iben Nyholm Debess | Peter Juel Henrichsen
Proceedings of the Thirteenth Language Resources and Evaluation Conference

The biggest challenges we face in developing LR and LT for Faroese is the lack of existing resources. A few resources already exist for Faroese, but many of them are either of insufficient size and quality or are not easily accessible. Therefore, the Faroese ASR project, Ravnur, set out to make a BLARK for Faroese. The BLARK is still in the making, but many of its resources have already been produced or collected. The LR status is framed by mentioning existing LR of relevant size and quality. The specific components of the BLARK are presented as well as the working principles behind the BLARK. The BLARK will be a pillar in Faroese LR, being relatively substantial in both size, quality, and diversity. It will be open-source, inviting other small languages to use it as an inspiration to create their own BLARK. We comment on the faulty yet sprouting LT situation in the Faroe Islands. The LR and LT challenges are not solved with just a BLARK. Some initiatives are therefore proposed to better the prospects of Faroese LT. The open-source principle of the project should facilitate further development.

pdf
Error Corpora for Different Informant Groups:Annotating and Analyzing Texts from L2 Speakers, People with Dyslexia and Children
Þórunn Arnardóttir | Isidora Glisic | Annika Simonsen | Lilja Stefánsdóttir | Anton Ingason
Proceedings of the 19th International Conference on Natural Language Processing (ICON)

Error corpora are useful for many tasks, in particular for developing spell and grammar checking software and teaching material and tools. We present and compare three specialized Icelandic error corpora; the Icelandic L2 Error Corpus, the Icelandic Dyslexia Error Corpus, and the Icelandic Child Language Error Corpus. Each corpus contains texts written by speakers of a particular group; L2 speakers of Icelandic, people with dyslexia, and children aged 10 to 15. The corpora shed light on errors made by these groups and their frequencies, and all errors are manually labeled according to an annotation scheme. The corpora vary in size, consisting of errors ranging from 7,817 to 24,948, and are published under a CC BY 4.0 license. In this paper, we describe the corpora and their annotation scheme, and draw comparisons between their errors and their frequencies.