Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at
https://turkunlp.org/gpt3-finnish.
We introduce a corpus with fine-grained named entity annotation for Finnish, following the OntoNotes guidelines to create a resource that is cross-lingually compatible with existing annotations for other languages. We combine and extend two NER corpora recently introduced for Finnish and revise their custom annotation scheme through a combination of automatic and manual processing steps. The resulting corpus consists of nearly 500,000 tokens annotated for over 50,000 mentions categorized into the 18 OntoNotes name and numeric entity types. We evaluate this resource and demonstrate its compatibility with the English OntoNotes annotations by training state-of-the-art mono-, bi- and multilingual deep learning models, finding both that the corpus allows highly accurate recognition of OntoNotes types at 93% F-score and that a comparable level of tagging accuracy can be achieved by a bilingual Finnish-English NER model.
We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies in total over 10,000 mentions. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges such as the identification of names in blog posts and transcribed speech are also identified. The newly introduced Turku NER corpus and related resources introduced in this work are released under open licenses via
https://turkunlp.org/turku-ner-corpus .
Named entity recognition (NER) is frequently addressed as a sequence classification task with each input consisting of one sentence of text. It is nevertheless clear that useful information for NER is often found also elsewhere in text. Recent self-attention models like BERT can both capture long-distance relationships in input and represent inputs consisting of several sentences. This creates opportunities for adding cross-sentence information in natural language processing tasks. This paper presents a systematic study exploring the use of cross-sentence information for NER using BERT models in five languages. We find that adding context as additional sentences to BERT input systematically increases NER performance. Multiple sentences in input samples allows us to study the predictions of the sentences in different contexts. We propose a straightforward method, Contextual Majority Voting (CMV), to combine these different predictions and demonstrate this to further increase NER performance. Evaluation on established datasets, including the CoNLL’02 and CoNLL’03 NER benchmarks, demonstrates that our proposed approach can improve on the state-of-the-art NER results on English, Dutch, and Finnish, achieves the best reported BERT-based results on German, and is on par with other BERT-based approaches in Spanish. We release all methods implemented in this work under open licenses.