Sampo Pyysalo


2024

A New Massive Multilingual Dataset for High-Performance Language Technologies
Ona de Gibert | Graeme Nail | Nikolay Arefyev | Marta Bañón | Jelmer van der Linde | Shaoxiong Ji | Jaume Zaragoza-Bernabeu | Mikko Aulamo | Gema Ramírez-Sánchez | Andrey Kutuzov | Sampo Pyysalo | Stephan Oepen | Jörg Tiedemann
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

We present the HPLT (High Performance Language Technologies) language resources, a new massive multilingual dataset including both monolingual and bilingual corpora extracted from CommonCrawl and previously unused web crawls from the Internet Archive. We describe our methods for data acquisition, management and processing of large corpora, which rely on open-source software tools and high-performance computing. Our monolingual collection focuses on low- to medium-resourced languages and covers 75 languages and a total of ≈ 5.6 trillion word tokens de-duplicated on the document level. Our English-centric parallel corpus is derived from its monolingual counterpart and covers 18 language pairs and more than 96 million aligned sentence pairs with roughly 1.4 billion English tokens. The HPLT language resources are one of the largest open text corpora ever released, providing a great resource for language modeling and machine translation training. We publicly release the corpora, the software, and the tools used in this work.

Building Question-Answer Data Using Web Register Identification
Anni Eskelinen | Amanda Myntti | Erik Henriksson | Sampo Pyysalo | Veronika Laippala
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

This article introduces a resource-efficient method for developing question-answer (QA) datasets by extracting QA pairs from web-scale data using machine learning (ML). Our method benefits from recent advances in web register (genre) identification and consists of two ML steps with an additional post-processing step. First, using XLM-R and the multilingual CORE web register corpus series with categories such as QA Forum, we train a multilingual classifier to retrieve documents that are likely to contain QA pairs from web-scale data. Second, we develop a NER-style token classifier to identify the QA text spans within these documents. To this end, we experiment with training on a semi-synthetic dataset built on top of the English LFQA, a small set of manually cleaned web QA pairs in English and Finnish, and a Finnish web QA pair dataset cleaned using ChatGPT. The evaluation of our pipeline demonstrates its capability to efficiently retrieve a substantial volume of QA pairs. While the approach is adaptable to any language given the availability of language models and extensive web data, we showcase its efficiency in English and Finnish, developing the first open, non-synthetic and non-machine translated QA dataset for Finnish – Turku WebQA – comprising over 200,000 QA pairs.

2023

Silver Syntax Pre-training for Cross-Domain Relation Extraction
Elisa Bassignana | Filip Ginter | Sampo Pyysalo | Rob van der Goot | Barbara Plank
Findings of the Association for Computational Linguistics: ACL 2023

Relation Extraction (RE) remains a challenging task, especially when considering realistic out-of-domain evaluations. One of the main reasons for this is the limited training size of current RE datasets: obtaining high-quality (manually annotated) data is extremely expensive and cannot realistically be repeated for each new domain. An intermediate training step on data from related tasks has been shown to be beneficial across many NLP tasks. However, this setup still requires supplementary annotated data, which is often not available. In this paper, we investigate intermediate pre-training specifically for RE. We exploit the affinity between syntactic structure and semantic RE, and identify the syntactic relations which are closely related to RE by being on the shortest dependency path between two entities. We then take advantage of the high accuracy of current syntactic parsers in order to automatically obtain large amounts of low-cost pre-training data. By pre-training our RE model on the relevant syntactic relations, we are able to outperform the baseline in five out of six cross-domain setups, without any additional annotated data.
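The idea of collecting the syntactic relations that lie on the shortest dependency path between two entities can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `shortest_dependency_path`, the edge format and the toy parse of "Marie Curie discovered polonium" are assumptions made for illustration, and the dependency tree is treated as an undirected graph searched breadth-first.

```python
from collections import deque
from typing import Dict, List, Optional, Tuple

# Hypothetical edge format: (head index, dependent index, relation label).
Edge = Tuple[int, int, str]


def shortest_dependency_path(
    edges: List[Edge], source: int, target: int
) -> Optional[List[str]]:
    """Return the relation labels on the shortest path from source to target.

    The dependency tree is treated as undirected, so the path may go up
    towards the root and down again. Returns None if no path exists.
    """
    neighbours: Dict[int, List[Tuple[int, str]]] = {}
    for head, dep, label in edges:
        neighbours.setdefault(head, []).append((dep, label))
        neighbours.setdefault(dep, []).append((head, label))

    # Breadth-first search, carrying the labels seen so far.
    queue = deque([(source, [])])
    seen = {source}
    while queue:
        node, labels = queue.popleft()
        if node == target:
            return labels
        for nxt, label in neighbours.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, labels + [label]))
    return None


# Toy parse (hand-written, not parser output):
# 0 Marie, 1 Curie, 2 discovered, 3 polonium
toy_edges: List[Edge] = [
    (2, 0, "nsubj"),  # discovered -> Marie
    (0, 1, "flat"),   # Marie -> Curie
    (2, 3, "obj"),    # discovered -> polonium
]

sdp_labels = shortest_dependency_path(toy_edges, 0, 3)
print(sdp_labels)
```

In this toy case the relations between "Marie" and "polonium" are `nsubj` and `obj`; relations of this kind, harvested from automatically parsed text, are what the abstract describes as low-cost pre-training data.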

Multi-CrossRE A Multi-Lingual Multi-Domain Dataset for Relation Extraction
Elisa Bassignana | Filip Ginter | Sampo Pyysalo | Rob van der Goot | Barbara Plank
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Most research in Relation Extraction (RE) involves the English language, mainly due to the lack of multi-lingual resources. We propose Multi-CrossRE, the broadest multi-lingual dataset for RE, including 26 languages in addition to English, and covering six text domains. Multi-CrossRE is a machine translated version of CrossRE (Bassignana and Plank, 2022), with a sub-portion including more than 200 sentences in seven diverse languages checked by native speakers. We run a baseline model over the 26 new datasets and, as a sanity check, over the 26 back-translations to English. Results on the back-translated data are consistent with the ones on the original English CrossRE, indicating high quality of the translation and the resulting dataset.

Toxicity Detection in Finnish Using Machine Translation
Anni Eskelinen | Laura Silvala | Filip Ginter | Sampo Pyysalo | Veronika Laippala
Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)

Due to the popularity of social media platforms and the sheer amount of user-generated content online, the automatic detection of toxic language has become crucial in the creation of a friendly and safe digital space. Previous work has mostly focused on English, leaving many lower-resource languages behind. In this paper, we present novel resources for toxicity detection in Finnish by introducing two new datasets: a machine-translated toxicity dataset for Finnish based on the widely used English Jigsaw dataset, and a smaller test set of Suomi24 discussion forum comments originally written in Finnish and manually annotated following the definitions of the labels that were used to annotate the Jigsaw dataset. We show that machine translating the training data to Finnish provides better toxicity detection results than using the original English training data and zero-shot cross-lingual transfer with XLM-R, even with our newly annotated dataset from Suomi24.

FinGPT: Large Generative Models for a Small Language
Risto Luukkonen | Ville Komulainen | Jouni Luoma | Anni Eskelinen | Jenna Kanerva | Hanna-Mari Kupari | Filip Ginter | Veronika Laippala | Niklas Muennighoff | Aleksandra Piktus | Thomas Wang | Nouamane Tazi | Teven Scao | Thomas Wolf | Osma Suominen | Samuli Sairanen | Mikko Merioksa | Jyrki Heinonen | Aija Vahtola | Samuel Antao | Sampo Pyysalo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs) excel in many tasks in NLP and beyond, but most open models have very limited coverage of smaller languages and LLM work tends to focus on languages where nearly unlimited data is available for pretraining. In this work, we study the challenges of creating LLMs for Finnish, a language spoken by less than 0.1% of the world population. We compile an extensive dataset of Finnish combining web crawls, news, social media and eBooks. We pursue two approaches to pretrain models: 1) we train seven monolingual models from scratch (186M to 13B parameters) dubbed FinGPT, 2) we continue the pretraining of the multilingual BLOOM model on a mix of its original training data and Finnish, resulting in a 176 billion parameter model we call BLUUMI. For model evaluation, we introduce FIN-bench, a version of BIG-bench with Finnish tasks. We also assess other model qualities such as toxicity and bias. Our models and tools are openly available at https://turkunlp.org/gpt3-finnish.

2022

Towards better structured and less noisy Web data: Oscar with Register annotations
Veronika Laippala | Anna Salmela | Samuel Rönnqvist | Alham Fikri Aji | Li-Hsin Chang | Asma Dhifallah | Larissa Goulart | Henna Kortelainen | Marc Pàmies | Deise Prina Dutra | Valtteri Skantsi | Lintang Sutawika | Sampo Pyysalo
Proceedings of the Eighth Workshop on Noisy User-generated Text (W-NUT 2022)

Web-crawled datasets are known to be noisy, as they feature a wide range of language use covering both user-generated and professionally edited content as well as noise originating from the crawling process. This article presents one solution to reduce this noise by using automatic register (genre) identification, that is, identifying whether the texts are, e.g., forum discussions, lyrical or how-to pages. We apply the multilingual register identification model by Rönnqvist et al. (2021) and label the widely used Oscar dataset. Additionally, we evaluate the model against eight new languages, showing that the performance is comparable to previous findings on a restricted set of languages. Finally, we present and apply a machine learning method for further cleaning text files originating from Web crawls from remains of boilerplate and other elements not belonging to the main text of the Web page. The register labeled and cleaned dataset covers 351 million documents in 14 languages and is available at https://huggingface.co/datasets/TurkuNLP/register_oscar.

2021

Beyond the English Web: Zero-Shot Cross-Lingual and Lightweight Monolingual Classification of Registers
Liina Repo | Valtteri Skantsi | Samuel Rönnqvist | Saara Hellström | Miika Oinonen | Anna Salmela | Douglas Biber | Jesse Egbert | Sampo Pyysalo | Veronika Laippala
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop

We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news, are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results, finding that certain registers continue to pose challenges, in particular for cross-lingual transfer.

WikiBERT Models: Deep Transfer Learning for Many Languages
Sampo Pyysalo | Jenna Kanerva | Antti Virtanen | Filip Ginter
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

Deep neural language models such as BERT have enabled substantial recent advances in many natural language processing tasks. However, due to the effort and computational cost involved in their pre-training, such models are typically introduced only for a small number of high-resource languages such as English. While multilingual models covering large numbers of languages are available, recent work suggests monolingual training can produce better models, and our understanding of the tradeoffs between mono- and multilingual training is incomplete. In this paper, we introduce a simple, fully automated pipeline for creating language-specific BERT models from Wikipedia data and introduce 42 new such models, most for languages up to now lacking dedicated deep neural language models. We assess the merits of these models using cloze tests and the state-of-the-art UDify parser on Universal Dependencies data, contrasting performance with results using the multilingual BERT (mBERT) model. We find that the newly introduced WikiBERT models outperform mBERT in cloze tests for nearly all languages, and that UDify using WikiBERT models outperforms the parser using mBERT on average, with the language-specific models showing substantially improved performance for some languages, yet limited improvement or a decrease in performance for others. All of the methods and models introduced in this work are available under open licenses from https://github.com/turkunlp/wikibert.

Fine-grained Named Entity Annotation for Finnish
Jouni Luoma | Li-Hsin Chang | Filip Ginter | Sampo Pyysalo
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

We introduce a corpus with fine-grained named entity annotation for Finnish, following the OntoNotes guidelines to create a resource that is cross-lingually compatible with existing annotations for other languages. We combine and extend two NER corpora recently introduced for Finnish and revise their custom annotation scheme through a combination of automatic and manual processing steps. The resulting corpus consists of nearly 500,000 tokens annotated for over 50,000 mentions categorized into the 18 OntoNotes name and numeric entity types. We evaluate this resource and demonstrate its compatibility with the English OntoNotes annotations by training state-of-the-art mono-, bi- and multilingual deep learning models, finding both that the corpus allows highly accurate recognition of OntoNotes types at 93% F-score and that a comparable level of tagging accuracy can be achieved by a bilingual Finnish-English NER model.

Quantitative Evaluation of Alternative Translations in a Corpus of Highly Dissimilar Finnish Paraphrases
Li-Hsin Chang | Sampo Pyysalo | Jenna Kanerva | Filip Ginter
Proceedings for the First Workshop on Modelling Translation: Translatology in the Digital Age

2020

The birth of Romanian BERT
Stefan Dumitrescu | Andrei-Marius Avram | Sampo Pyysalo
Findings of the Association for Computational Linguistics: EMNLP 2020

Large-scale pretrained language models have become ubiquitous in Natural Language Processing. However, most of these models are available either in high-resource languages, in particular English, or as multilingual models that compromise performance on individual languages for coverage. This paper introduces Romanian BERT, the first purely Romanian transformer-based language model, pretrained on a large text corpus. We discuss corpus composition and cleaning, the model training process, as well as an extensive evaluation of the model on various Romanian datasets. We open-source not only the model itself, but also a repository that contains information on how to obtain the corpus, fine-tune and use this model in production (with practical examples), and how to fully replicate the evaluation process.

Turku Enhanced Parser Pipeline: From Raw Text to Enhanced Graphs in the IWPT 2020 Shared Task
Jenna Kanerva | Filip Ginter | Sampo Pyysalo
Proceedings of the 16th International Conference on Parsing Technologies and the IWPT 2020 Shared Task on Parsing into Enhanced Universal Dependencies

We present the approach of the TurkuNLP group to the IWPT 2020 shared task on Multilingual Parsing into Enhanced Universal Dependencies. The task involves 28 treebanks in 17 different languages and requires parsers to generate graph structures extending on the basic dependency trees. Our approach combines language-specific BERT models, the UDify parser, neural sequence-to-sequence lemmatization and a graph transformation approach encoding the enhanced structure into a dependency tree. Our submission averaged 84.5% ELAS, ranking first in the shared task. We make all methods and resources developed for this study freely available under open licenses from https://turkunlp.org.

From Web Crawl to Clean Register-Annotated Corpora
Veronika Laippala | Samuel Rönnqvist | Saara Hellström | Juhani Luotolahti | Liina Repo | Anna Salmela | Valtteri Skantsi | Sampo Pyysalo
Proceedings of the 12th Web as Corpus Workshop

The web presents unprecedented opportunities for large-scale collection of text in many languages. However, two critical steps in the development of web corpora remain challenging: the identification of clean text from source HTML and the assignment of genre or register information to the documents. In this paper, we evaluate a multilingual approach to this end. Our starting points are the Swedish and French Common Crawl datasets gathered for the 2017 CoNLL shared task, particularly the URLs. We 1) fetch HTML pages based on the URLs and run boilerplate removal, 2) train a classifier to further clean out undesired text fragments, and 3) annotate text registers. We compare boilerplate removal against the CoNLL texts, and find an improvement. For the further cleaning of undesired material, the best results are achieved using Multilingual BERT with monolingual fine-tuning. However, our results are promising also in a cross-lingual setting, without fine-tuning on the target language. Finally, the register annotations show that most of the documents belong to a relatively small set of registers, which are relatively similar in the two languages. A number of additional flags in the annotation are, however, necessary to reflect the wide range of linguistic variation associated with the documents.

Universal Dependencies v2: An Evergrowing Multilingual Treebank Collection
Joakim Nivre | Marie-Catherine de Marneffe | Filip Ginter | Jan Hajič | Christopher D. Manning | Sampo Pyysalo | Sebastian Schuster | Francis Tyers | Daniel Zeman
Proceedings of the Twelfth Language Resources and Evaluation Conference

Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. The annotation consists in a linguistically motivated word segmentation; a morphological layer comprising lemmas, universal part-of-speech tags, and standardized morphological features; and a syntactic layer focusing on syntactic relations between predicates, arguments and modifiers. In this paper, we describe version 2 of the universal guidelines (UD v2), discuss the major changes from UD v1 to UD v2, and give an overview of the currently available treebanks for 90 languages.

A Broad-coverage Corpus for Finnish Named Entity Recognition
Jouni Luoma | Miika Oinonen | Maria Pyykönen | Veronika Laippala | Sampo Pyysalo
Proceedings of the Twelfth Language Resources and Evaluation Conference

We present a new manually annotated corpus for broad-coverage named entity recognition for Finnish. Building on the original Universal Dependencies Finnish corpus of 754 documents (200,000 tokens) representing ten different genres of text, we introduce annotation marking person, organization, location, product and event names as well as dates. The new annotation identifies in total over 10,000 mentions. An evaluation of inter-annotator agreement indicates that the quality and consistency of annotation are high, at 94.5% F-score for exact match. A comprehensive evaluation using state-of-the-art machine learning methods demonstrates that the new resource maintains compatibility with a previously released single-domain corpus for Finnish NER and makes it possible to recognize named entity mentions in texts drawn from most domains at precision and recall approaching or exceeding 90%. Remaining challenges such as the identification of names in blog posts and transcribed speech are also identified. The newly introduced Turku NER corpus and related resources introduced in this work are released under open licenses via https://turkunlp.org/turku-ner-corpus .

Exploring Cross-sentence Contexts for Named Entity Recognition with BERT
Jouni Luoma | Sampo Pyysalo
Proceedings of the 28th International Conference on Computational Linguistics

Named entity recognition (NER) is frequently addressed as a sequence classification task with each input consisting of one sentence of text. It is nevertheless clear that useful information for NER is often found also elsewhere in text. Recent self-attention models like BERT can both capture long-distance relationships in input and represent inputs consisting of several sentences. This creates opportunities for adding cross-sentence information in natural language processing tasks. This paper presents a systematic study exploring the use of cross-sentence information for NER using BERT models in five languages. We find that adding context as additional sentences to BERT input systematically increases NER performance. Including multiple sentences in input samples allows us to study the predictions of the sentences in different contexts. We propose a straightforward method, Contextual Majority Voting (CMV), to combine these different predictions and demonstrate this to further increase NER performance. Evaluation on established datasets, including the CoNLL’02 and CoNLL’03 NER benchmarks, demonstrates that our proposed approach can improve on the state-of-the-art NER results on English, Dutch, and Finnish, achieves the best reported BERT-based results on German, and is on par with other BERT-based approaches in Spanish. We release all methods implemented in this work under open licenses.
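Combining the predictions made for one sentence in several different contexts can be sketched as a per-token majority vote. This is only an illustration of the general idea, not the released CMV code: the function name `majority_vote`, the input layout (one tag sequence per context) and the tie-breaking rule (the tag seen first wins) are all assumptions.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine several tag sequences predicted for the same sentence.

    Each inner sequence holds the tags predicted for the sentence when it
    appeared in one particular context window. For every token position the
    most frequent tag is kept; ties go to the tag encountered first
    (an assumption of this sketch).
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all predictions must cover the same tokens")

    combined = []
    for position in range(length):
        votes = Counter(p[position] for p in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined


# Hypothetical tags for one three-token sentence seen in three contexts.
context_predictions = [
    ["B-PER", "O", "O"],
    ["B-PER", "O", "B-LOC"],
    ["B-ORG", "O", "B-LOC"],
]

voted = majority_vote(context_predictions)
print(voted)
```

Voting token by token is the simplest choice; it does not by itself guarantee a well-formed tag sequence, so a real system might add a repair step afterwards.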

2019

Biomedical Named Entity Recognition with Multilingual BERT
Kai Hakala | Sampo Pyysalo
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

We present the approach of the Turku NLP group to the PharmaCoNER task on Spanish biomedical named entity recognition. We apply a CRF-based baseline approach and multilingual BERT to the task, achieving an F-score of 88% on the development data and 87% on the test set with BERT. Our approach reflects a straightforward application of a state-of-the-art multilingual model that is not specifically tailored to either the language or the application domain. The source code is available at: https://github.com/chaanim/pharmaconer

CRAFT Shared Tasks 2019 Overview — Integrated Structure, Semantics, and Coreference
William Baumgartner | Michael Bada | Sampo Pyysalo | Manuel R. Ciosici | Negacy Hailu | Harrison Pielke-Lombardo | Michael Regan | Lawrence Hunter
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

As part of the BioNLP Open Shared Tasks 2019, the CRAFT Shared Tasks 2019 provides a platform to gauge the state of the art for three fundamental language processing tasks — dependency parse construction, coreference resolution, and ontology concept identification — over full-text biomedical articles. The structural annotation task requires the automatic generation of dependency parses for each sentence of an article given only the article text. The coreference resolution task focuses on linking coreferring base noun phrase mentions into chains using the symmetrical and transitive identity relation. The ontology concept annotation task involves the identification of concept mentions within text using the classes of ten distinct ontologies in the biomedical domain, both unmodified and augmented with extension classes. This paper provides an overview of each task, including descriptions of the data provided to participants and the evaluation metrics used, and discusses participant results relative to baseline performances for each of the three tasks.

Neural Dependency Parsing of Biomedical Text: TurkuNLP entry in the CRAFT Structural Annotation Task
Thang Minh Ngo | Jenna Kanerva | Filip Ginter | Sampo Pyysalo
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

We present the approach taken by the TurkuNLP group in the CRAFT Structural Annotation task, a shared task on dependency parsing. Our approach builds primarily on the Turku neural parser, a native dependency parser that ranked among the best in the recent CoNLL tasks on parsing Universal Dependencies. To adapt the parser to the biomedical domain, we considered and evaluated a number of approaches, including the generation of custom word embeddings, combination with other in-domain resources, and the incorporation of information from named entity recognition. We achieved a labeled attachment score of 89.7%, the best result among task participants.

Toward Multilingual Identification of Online Registers
Veronika Laippala | Roosa Kyllönen | Jesse Egbert | Douglas Biber | Sampo Pyysalo
Proceedings of the 22nd Nordic Conference on Computational Linguistics

We consider cross- and multilingual text classification approaches to the identification of online registers (genres), i.e. text varieties with specific situational characteristics. Register is the most important predictor of linguistic variation, and register information could improve the potential of online data for many applications. We introduce the first manually annotated non-English corpus of online registers featuring the full range of linguistic variation found online. The data set consists of 2,237 Finnish documents and follows the register taxonomy developed for the Corpus of Online Registers of English (CORE). Using CORE and the newly introduced corpus, we demonstrate the feasibility of cross-lingual register identification using a simple approach based on convolutional neural networks and multilingual word embeddings. We further find that register identification results can be improved through multilingual training even when a substantial number of annotations is available in the target language.

2017

Fully Delexicalized Contexts for Syntax-Based Word Embeddings
Jenna Kanerva | Sampo Pyysalo | Filip Ginter
Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017)

CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies
Daniel Zeman | Martin Popel | Milan Straka | Jan Hajič | Joakim Nivre | Filip Ginter | Juhani Luotolahti | Sampo Pyysalo | Slav Petrov | Martin Potthast | Francis Tyers | Elena Badmaeva | Memduh Gokirmak | Anna Nedoluzhko | Silvie Cinková | Jan Hajič jr. | Jaroslava Hlaváčová | Václava Kettnerová | Zdeňka Urešová | Jenna Kanerva | Stina Ojala | Anna Missilä | Christopher D. Manning | Sebastian Schuster | Siva Reddy | Dima Taji | Nizar Habash | Herman Leung | Marie-Catherine de Marneffe | Manuela Sanguinetti | Maria Simi | Hiroshi Kanayama | Valeria de Paiva | Kira Droganova | Héctor Martínez Alonso | Çağrı Çöltekin | Umut Sulubacak | Hans Uszkoreit | Vivien Macketanz | Aljoscha Burchardt | Kim Harris | Katrin Marheinecke | Georg Rehm | Tolga Kayadelen | Mohammed Attia | Ali Elkahky | Zhuoran Yu | Emily Pitler | Saran Lertpradit | Michael Mandl | Jesse Kirchner | Hector Fernandez Alcalde | Jana Strnadová | Esha Banerjee | Ruli Manurung | Antonio Stella | Atsuko Shimada | Sookyoung Kwak | Gustavo Mendonça | Tatiana Lando | Rattima Nitisaroj | Josie Li
Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

The Conference on Computational Natural Language Learning (CoNLL) features a shared task, in which participants train and test their learning systems on the same data sets. In 2017, the task was devoted to learning dependency parsers for a large number of languages, in a real-world setting without any gold-standard annotation on input. All test sets followed a unified annotation scheme, namely that of Universal Dependencies. In this paper, we define the task and evaluation methodology, describe how the data sets were prepared, report and analyze the main results, and provide a brief categorization of the different approaches of the participating systems.

2016

Intrinsic Evaluation of Word Vectors Fails to Predict Extrinsic Performance
Billy Chiu | Anna Korhonen | Sampo Pyysalo
Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP

How to Train good Word Embeddings for Biomedical NLP
Billy Chiu | Gamal Crichton | Anna Korhonen | Sampo Pyysalo
Proceedings of the 15th Workshop on Biomedical Natural Language Processing

Deep Learning with Minimal Training Data: TurkuNLP Entry in the BioNLP Shared Task 2016
Farrokh Mehryary | Jari Björne | Sampo Pyysalo | Tapio Salakoski | Filip Ginter
Proceedings of the 4th BioNLP Shared Task Workshop

Cancer Hallmark Text Classification Using Convolutional Neural Networks
Simon Baker | Anna Korhonen | Sampo Pyysalo
Proceedings of the Fifth Workshop on Building and Evaluating Resources for Biomedical Text Mining (BioTxtM2016)

Methods based on deep learning approaches have recently achieved state-of-the-art performance in a range of machine learning tasks and are increasingly applied to natural language processing (NLP). Despite strong results in various established NLP tasks involving general domain texts, there is only limited work applying these models to biomedical NLP. In this paper, we consider a Convolutional Neural Network (CNN) approach to biomedical text classification. Evaluation using a recently introduced cancer domain dataset involving the categorization of documents according to the well-established hallmarks of cancer shows that a basic CNN model can achieve a level of performance competitive with a Support Vector Machine (SVM) trained using complex manually engineered features optimized to the task. We further show that simple modifications to the CNN hyperparameters, initialization, and training process allow the model to notably outperform the SVM, establishing a new state of the art result at this task. We make all of the resources and tools introduced in this study available under open licenses from https://cambridgeltl.github.io/cancer-hallmark-cnn/.

Universal Dependencies v1: A Multilingual Treebank Collection
Joakim Nivre | Marie-Catherine de Marneffe | Filip Ginter | Yoav Goldberg | Jan Hajič | Christopher D. Manning | Ryan McDonald | Slav Petrov | Sampo Pyysalo | Natalia Silveira | Reut Tsarfaty | Daniel Zeman
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

Cross-linguistically consistent annotation is necessary for sound comparative evaluation and cross-lingual learning experiments. It is also useful for multilingual system development and comparative linguistic studies. Universal Dependencies is an open community effort to create cross-linguistically consistent treebank annotation for many languages within a dependency-based lexicalist framework. In this paper, we describe v1 of the universal guidelines, the underlying design principles, and the currently available treebanks for 33 languages.

Typed Entity and Relation Annotation on Computer Science Papers
Yuka Tateisi | Tomoko Ohta | Sampo Pyysalo | Yusuke Miyao | Akiko Aizawa
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe our ongoing effort to establish an annotation scheme for describing the semantic structures of research articles in the computer science domain, with the intended use of developing search systems that can refine their results by the roles of the entities denoted by the query keys. In our scheme, mentions of entities are annotated with ontology-based types, and the roles of the entities are annotated as relations with other entities described in the text. So far, we have annotated 400 abstracts from the ACL anthology and the ACM digital library. In this paper, the scheme and the annotated dataset are described, along with the problems found in the course of annotation. We also show the results of automatic annotation and evaluate the corpus in a practical setting in application to topic extraction.

Attending to Characters in Neural Sequence Labeling Models
Marek Rei | Gamal Crichton | Sampo Pyysalo
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Sequence labeling architectures use word embeddings for capturing similarity, but suffer when handling previously unseen or rare words. We investigate character-level extensions to such models and propose a novel architecture for combining alternative word representations. By using an attention mechanism, the model is able to dynamically decide how much information to use from a word- or character-level component. We evaluated different architectures on a range of sequence labeling datasets, and character-level extensions were found to improve performance on every benchmark. In addition, the proposed attention-based architecture delivered the best results even with a smaller number of trainable parameters.

2015

pdf
Towards the Classification of the Finnish Internet Parsebank: Detecting Translations and Informality
Veronika Laippala | Jenna Kanerva | Anna Missilä | Sampo Pyysalo | Tapio Salakoski | Filip Ginter
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf
Universal Dependencies for Finnish
Sampo Pyysalo | Jenna Kanerva | Anna Missilä | Veronika Laippala | Filip Ginter
Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015)

pdf
Towards Universal Web Parsebanks
Juhani Luotolahti | Jenna Kanerva | Veronika Laippala | Sampo Pyysalo | Filip Ginter
Proceedings of the Third International Conference on Dependency Linguistics (Depling 2015)

pdf
SETS: Scalable and Efficient Tree Search in Dependency Graphs
Juhani Luotolahti | Jenna Kanerva | Sampo Pyysalo | Filip Ginter
Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

pdf
Sharing annotations better: RESTful Open Annotation
Sampo Pyysalo | Jorge Campos | Juan Miguel Cejuela | Filip Ginter | Kai Hakala | Chen Li | Pontus Stenetorp | Lars Juhl Jensen
Proceedings of ACL-IJCNLP 2015 System Demonstrations

2013

pdf bib
Proceedings of the BioNLP Shared Task 2013 Workshop
Claire Nédellec | Robert Bossy | Jin-Dong Kim | Jung-jae Kim | Tomoko Ohta | Sampo Pyysalo | Pierre Zweigenbaum
Proceedings of the BioNLP Shared Task 2013 Workshop

pdf bib
Overview of BioNLP Shared Task 2013
Claire Nédellec | Robert Bossy | Jin-Dong Kim | Jung-jae Kim | Tomoko Ohta | Sampo Pyysalo | Pierre Zweigenbaum
Proceedings of the BioNLP Shared Task 2013 Workshop

pdf
Overview of the Cancer Genetics (CG) task of BioNLP Shared Task 2013
Sampo Pyysalo | Tomoko Ohta | Sophia Ananiadou
Proceedings of the BioNLP Shared Task 2013 Workshop

pdf
Overview of the Pathway Curation (PC) task of BioNLP Shared Task 2013
Tomoko Ohta | Sampo Pyysalo | Rafal Rak | Andrew Rowley | Hong-Woo Chun | Sung-Jae Jung | Sung-Pil Choi | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the BioNLP Shared Task 2013 Workshop

2012

pdf
PubMed-Scale Event Extraction for Post-Translational Modifications, Epigenetics and Protein Structural Relations
Jari Björne | Sofie Van Landeghem | Sampo Pyysalo | Tomoko Ohta | Filip Ginter | Yves Van de Peer | Sophia Ananiadou | Tapio Salakoski
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

pdf
New Resources and Perspectives for Biomedical Event Extraction
Sampo Pyysalo | Pontus Stenetorp | Tomoko Ohta | Jin-Dong Kim | Sophia Ananiadou
BioNLP: Proceedings of the 2012 Workshop on Biomedical Natural Language Processing

pdf
Bridging the Gap Between Scope-based and Event-based Negation/Speculation Annotations: A Bridge Not Too Far
Pontus Stenetorp | Sampo Pyysalo | Tomoko Ohta | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the Workshop on Extra-Propositional Aspects of Meaning in Computational Linguistics

pdf
Open-domain Anatomical Entity Mention Detection
Tomoko Ohta | Sampo Pyysalo | Jun’ichi Tsujii | Sophia Ananiadou
Proceedings of the Workshop on Detecting Structure in Scholarly Discourse

pdf
brat: a Web-based Tool for NLP-Assisted Text Annotation
Pontus Stenetorp | Sampo Pyysalo | Goran Topić | Tomoko Ohta | Sophia Ananiadou | Jun’ichi Tsujii
Proceedings of the Demonstrations at the 13th Conference of the European Chapter of the Association for Computational Linguistics

2011

pdf
From Pathways to Biomolecular Events: Opportunities and Challenges
Tomoko Ohta | Sampo Pyysalo | Jun’ichi Tsujii
Proceedings of BioNLP 2011 Workshop

pdf
Towards Exhaustive Event Extraction for Protein Modifications
Sampo Pyysalo | Tomoko Ohta | Makoto Miwa | Jun’ichi Tsujii
Proceedings of BioNLP 2011 Workshop

pdf
SimSem: Fast Approximate String Matching in Relation to Semantic Category Disambiguation
Pontus Stenetorp | Sampo Pyysalo | Jun’ichi Tsujii
Proceedings of BioNLP 2011 Workshop

pdf bib
Proceedings of BioNLP Shared Task 2011 Workshop
Jun’ichi Tsujii | Jin-Dong Kim | Sampo Pyysalo
Proceedings of BioNLP Shared Task 2011 Workshop

pdf bib
Overview of BioNLP Shared Task 2011
Jin-Dong Kim | Sampo Pyysalo | Tomoko Ohta | Robert Bossy | Ngan Nguyen | Jun’ichi Tsujii
Proceedings of BioNLP Shared Task 2011 Workshop

pdf
Overview of the Epigenetics and Post-translational Modifications (EPI) task of BioNLP Shared Task 2011
Tomoko Ohta | Sampo Pyysalo | Jun’ichi Tsujii
Proceedings of BioNLP Shared Task 2011 Workshop

pdf
Overview of the Infectious Diseases (ID) task of BioNLP Shared Task 2011
Sampo Pyysalo | Tomoko Ohta | Rafal Rak | Dan Sullivan | Chunhong Mao | Chunxia Wang | Bruno Sobral | Jun’ichi Tsujii | Sophia Ananiadou
Proceedings of BioNLP Shared Task 2011 Workshop

pdf
Overview of the Entity Relations (REL) supporting task of BioNLP Shared Task 2011
Sampo Pyysalo | Tomoko Ohta | Jun’ichi Tsujii
Proceedings of BioNLP Shared Task 2011 Workshop

pdf
BioNLP Shared Task 2011: Supporting Resources
Pontus Stenetorp | Goran Topić | Sampo Pyysalo | Tomoko Ohta | Jin-Dong Kim | Jun’ichi Tsujii
Proceedings of BioNLP Shared Task 2011 Workshop

2010

pdf
Event Extraction for Post-Translational Modifications
Tomoko Ohta | Sampo Pyysalo | Makoto Miwa | Jin-Dong Kim | Jun’ichi Tsujii
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

pdf
Scaling up Biomedical Event Extraction to the Entire PubMed
Jari Björne | Filip Ginter | Sampo Pyysalo | Jun’ichi Tsujii | Tapio Salakoski
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

pdf
A Comparative Study of Syntactic Parsers for Event Extraction
Makoto Miwa | Sampo Pyysalo | Tadayoshi Hara | Jun’ichi Tsujii
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

pdf
Towards Event Extraction from Full Texts on Infectious Diseases
Sampo Pyysalo | Tomoko Ohta | Han-Cheol Cho | Dan Sullivan | Chunhong Mao | Bruno Sobral | Jun’ichi Tsujii | Sophia Ananiadou
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

pdf
Integration of Static Relations to Enhance Event Extraction from Text
Sofie Van Landeghem | Sampo Pyysalo | Tomoko Ohta | Yves Van de Peer
Proceedings of the 2010 Workshop on Biomedical Natural Language Processing

pdf
Evaluating Dependency Representations for Event Extraction
Makoto Miwa | Sampo Pyysalo | Tadayoshi Hara | Jun’ichi Tsujii
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

2009

pdf bib
Static Relations: a Piece in the Biomedical Information Extraction Puzzle
Sampo Pyysalo | Tomoko Ohta | Jin-Dong Kim | Jun’ichi Tsujii
Proceedings of the BioNLP 2009 Workshop

pdf
Incorporating GENETAG-style annotation to GENIA corpus
Tomoko Ohta | Jin-Dong Kim | Sampo Pyysalo | Yue Wang | Jun’ichi Tsujii
Proceedings of the BioNLP 2009 Workshop

pdf bib
Overview of BioNLP’09 Shared Task on Event Extraction
Jin-Dong Kim | Tomoko Ohta | Sampo Pyysalo | Yoshinobu Kano | Jun’ichi Tsujii
Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task

pdf
Learning to Extract Biological Event and Relation Graphs
Jari Björne | Filip Ginter | Juho Heimonen | Sampo Pyysalo | Tapio Salakoski
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

pdf bib
A Graph Kernel for Protein-Protein Interaction Extraction
Antti Airola | Sampo Pyysalo | Jari Björne | Tapio Pahikkala | Filip Ginter | Tapio Salakoski
Proceedings of the Workshop on Current Trends in Biomedical Natural Language Processing

2007

pdf
On the unification of syntactic annotations under the Stanford dependency scheme: A case study on BioInfer and GENIA
Sampo Pyysalo | Filip Ginter | Veronika Laippala | Katri Haverinen | Juho Heimonen | Tapio Salakoski
Biological, translational, and clinical language processing

2004

pdf
Analysis of Link Grammar on Biomedical Dependency Corpus Targeted at Protein-Protein Interactions
Sampo Pyysalo | Filip Ginter | Tapio Pahikkala | Jorma Boberg | Jouni Järvinen | Tapio Salakoski | Jeppe Koivula
Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and its Applications (NLPBA/BioNLP)
