Richard Johansson

2021

Knowledge Distillation for Swedish NER models: A Search for Performance and Efficiency
Lovisa Hagström | Richard Johansson
Proceedings of the 23rd Nordic Conference on Computational Linguistics (NoDaLiDa)

The current recipe for better model performance within NLP is to increase model size and training data. While it gives us models with increasingly impressive results, it also makes it more difficult to train and deploy state-of-the-art models for NLP due to increasing computational costs. Model compression is a field of research that aims to alleviate this problem. The field encompasses different methods that aim to preserve the performance of a model while decreasing its size. One such method is knowledge distillation. In this article, we investigate the effect of knowledge distillation for named entity recognition models in Swedish. We show that while some sequence tagging models benefit from knowledge distillation, not all models do. This prompts us to ask in which situations and for which models knowledge distillation is beneficial. We also reason about the effect of knowledge distillation on computational costs.

Transferring Knowledge from Vision to Language: How to Achieve it and how to Measure it?
Tobias Norlund | Lovisa Hagström | Richard Johansson
Proceedings of the Fourth BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP

Large language models are known to suffer from the hallucination problem in that they are prone to output statements that are false or inconsistent, indicating a lack of knowledge. A proposed solution to this is to provide the model with additional data modalities that complement the knowledge obtained through text. We investigate the use of visual data to complement the knowledge of large language models by proposing a method for evaluating visual knowledge transfer to text for uni- or multimodal language models. The method is based on two steps: 1) a novel task querying for knowledge of memory colors, i.e. typical colors of well-known objects, and 2) filtering of model training data to clearly separate knowledge contributions. Additionally, we introduce a model architecture that involves a visual imagination step and evaluate it with our proposed method. We find that our method can successfully be used to measure visual knowledge transfer capabilities in models and that our novel model architecture shows promising results for leveraging multimodal knowledge in a unimodal setting.

2020

An Arabic Tweets Sentiment Analysis Dataset (ATSAD) using Distant Supervision and Self Training
Kathrein Abu Kwaik | Stergios Chatzikyriakidis | Simon Dobnik | Motaz Saad | Richard Johansson
Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection

As the number of social media users increases, more people use these platforms to express their thoughts and needs, socialise, and publish their opinions and reviews. For good social media sentiment analysis, good quality resources are needed, and the lack of these resources is particularly evident for languages other than English, in particular Arabic. The available Arabic resources suffer from limitations in either the size of the corpus or the quality of the annotation. In this paper, we present an Arabic Sentiment Analysis Corpus collected from Twitter, which contains 36K tweets labelled as positive or negative. We applied distant supervision and self-training approaches to the corpus to annotate it. In addition, we release 8K manually annotated tweets as a gold standard. We evaluated the corpus intrinsically by comparing it to human classification and pre-trained sentiment analysis models. Moreover, we apply extrinsic evaluation methods exploiting the sentiment analysis task and achieve an accuracy of 86%.

Training a Swedish Constituency Parser on Six Incompatible Treebanks
Richard Johansson | Yvonne Adesam
Proceedings of the 12th Language Resources and Evaluation Conference

We investigate a transition-based parser that uses Eukalyptus, a function-tagged constituent treebank for Swedish which includes discontinuous constituents. In addition, we show that the accuracy of this parser can be improved by using a multitask learning architecture that makes it possible to train the parser on additional treebanks that use other annotation models.

2019

Natural Language Processing in Policy Evaluation: Extracting Policy Conditions from IMF Loan Agreements
Joakim Åkerström | Adel Daoud | Richard Johansson
Proceedings of the 22nd Nordic Conference on Computational Linguistics

Social science researchers often use text as the raw data in investigations: for instance, when investigating the effects of IMF policies on the development of countries under IMF programs, researchers typically encode structured descriptions of the programs using a time-consuming manual effort. Making this process automatic may open up new opportunities in scaling up such investigations. As a first step towards automatizing this coding process, we describe an experiment where we apply a sentence classifier that automatically detects mentions of policy conditions in IMF loan agreements and divides them into different types. The results show that the classifier is generally able to detect the policy conditions, although some types are hard to distinguish.

2018

Automatically Linking Lexical Resources with Word Sense Embedding Models
Luis Nieto-Piña | Richard Johansson
Proceedings of the Third Workshop on Semantic Deep Learning

Automatically learnt word sense embeddings are developed as an attempt to refine the capabilities of coarse word embeddings. The word sense representations obtained this way are, however, sensitive to underlying corpora and parameterizations, and they might be difficult to relate to formally defined word senses. We propose to tackle this problem by devising a mechanism to establish links between word sense embeddings and lexical resources created by experts. We evaluate the applicability of these links in a task to retrieve instances of word senses unlisted in the lexicon.

The 2018 Shared Task on Extrinsic Parser Evaluation: On the Downstream Utility of English Universal Dependency Parsers
Murhaf Fares | Stephan Oepen | Lilja Øvrelid | Jari Björne | Richard Johansson
Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies

We summarize empirical results and tentative conclusions from the Second Extrinsic Parser Evaluation Initiative (EPE 2018). We review the basic task setup, downstream applications involved, and end-to-end results for seventeen participating teams. Based on in-depth quantitative and qualitative analysis, we correlate intrinsic evaluation results at different layers of morph-syntactic analysis with observed downstream behavior.

2017

Training Word Sense Embeddings With Lexicon-based Regularization
Luis Nieto-Piña | Richard Johansson
Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

We propose to improve word sense embeddings by enriching an automatic corpus-based method with lexicographic data. Information from a lexicon is introduced into the learning algorithm’s objective function through a regularizer. The incorporation of lexicographic data yields embeddings that are able to reflect expert-defined word senses, while retaining the robustness, high quality, and coverage of automatic corpus-based methods. These properties are observed in a manual inspection of the semantic clusters that different degrees of regularizer strength create in the vector space. Moreover, we evaluate the sense embeddings in two downstream applications: word sense disambiguation and semantic frame prediction, where they outperform simpler approaches. Our results show that a corpus-based model balanced with lexicographic data learns better representations and improves their performance in downstream tasks.

Character-based recurrent neural networks for morphological relational reasoning
Olof Mogren | Richard Johansson
Proceedings of the First Workshop on Subword and Character Level Models in NLP

We present a model for predicting word forms based on morphological relational reasoning with analogies. While previous work has explored tasks such as morphological inflection and reinflection, these models rely on an explicit enumeration of morphological features, which may not be available in all cases. To address the task of predicting a word form given a demo relation (a pair of word forms) and a query word, we devise a character-based recurrent neural network architecture using three separate encoders and a decoder. We also investigate a multiclass learning setup, where the prediction of the relation type label is used as an auxiliary task. Our results show that the exact form can be predicted for English with an accuracy of 94.7%. For Swedish, which has a more complex morphology with more inflectional patterns for nouns and verbs, the accuracy is 89.3%. We also show that using the auxiliary task of learning the relation type speeds up convergence and improves the prediction accuracy for the word generation task.

2016

Embedding Senses for Efficient Graph-based Word Sense Disambiguation
Luis Nieto Piña | Richard Johansson
Proceedings of TextGraphs-10: the Workshop on Graph-based Methods for Natural Language Processing

Romanized Berber and Romanized Arabic Automatic Language Identification Using Machine Learning
Wafia Adouane | Nasredine Semmar | Richard Johansson
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

The identification of the language of text/speech input is the first step to be able to properly do any language-dependent natural language processing. The task is called Automatic Language Identification (ALI). The field has been well studied since the early 1960s, and various methods have been applied to many standard languages. The standard ALI methods require datasets for training and use character/word-based n-gram models. However, social media and new technologies have contributed to the rise of informal and minority languages on the Web. The state-of-the-art automatic language identifiers fail to properly identify many of them. Romanized Arabic (RA) and Romanized Berber (RB) are examples of these informal, under-resourced languages. The goal of this paper is twofold: detect RA and RB, at a document level, as separate languages and distinguish between them as they coexist in North Africa. We consider the task as a classification problem and use supervised machine learning to solve it. For both languages, character-based 5-grams combined with additional lexicons score the best, with F-scores of 99.75% and 97.77% for RB and RA respectively.

Automatic Detection of Arabicized Berber and Arabic Varieties
Wafia Adouane | Nasredine Semmar | Richard Johansson | Victoria Bobicev
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

Automatic Language Identification (ALI) is the detection of the natural language of an input text by a machine. It is the first necessary step to do any language-dependent natural language processing task. Various methods have been successfully applied to a wide range of languages, and the state-of-the-art automatic language identifiers are mainly based on character n-gram models trained on huge corpora. However, there are many languages which are not yet automatically processed, for instance minority and informal languages. Many of these languages are only spoken and do not exist in a written format. Social media platforms and new technologies have facilitated the emergence of a written format for these spoken languages based on pronunciation. These languages are not well represented on the Web and are commonly referred to as under-resourced languages, and the currently available ALI tools fail to properly recognize them. In this paper, we revisit the problem of ALI with the focus on Arabicized Berber and dialectal Arabic short texts. We introduce new resources and evaluate the existing methods. The results show that machine learning models combined with lexicons are well suited for detecting Arabicized Berber and different Arabic varieties and distinguishing between them, giving a macro-average F-score of 92.94%.

ASIREM Participation at the Discriminating Similar Languages Shared Task 2016
Wafia Adouane | Nasredine Semmar | Richard Johansson
Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3)

This paper presents the system built by the ASIREM team for the Discriminating between Similar Languages (DSL) Shared Task 2016. It describes the system, which uses character-based and word-based n-grams separately. ASIREM participated in both sub-tasks (sub-task 1 and sub-task 2) and in both open and closed tracks. For sub-task 1, which deals with discriminating between similar languages and national language varieties, the system achieved an accuracy of 87.79% on the closed track, ending up ninth (the best result being 89.38%). In sub-task 2, which deals with Arabic dialect identification, the system achieved its best performance using character-based n-grams (49.67% accuracy), ranking fourth in the closed track (the best result being 51.16%), and an accuracy of 53.18%, ranking first in the open track.

Retrieving Occurrences of Grammatical Constructions
Anna Ehrlemark | Richard Johansson | Benjamin Lyngfelt
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Finding authentic examples of grammatical constructions is central in constructionist approaches to linguistics, language processing, and second language learning. In this paper, we address this problem as an information retrieval (IR) task. To facilitate research in this area, we built a benchmark collection by annotating the occurrences of six constructions in a Swedish corpus. Furthermore, we implemented a simple and flexible retrieval system for finding construction occurrences, in which the user specifies a ranking function using lexical-semantic similarities (lexicon-based or distributional). The system was evaluated using standard IR metrics on the new benchmark, and we saw that lexical-semantic rerankers improve significantly over a purely surface-oriented system, but must be carefully tailored for each individual construction.

Gulf Arabic Linguistic Resource Building for Sentiment Analysis
Wafia Adouane | Richard Johansson
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

This paper deals with building linguistic resources for Gulf Arabic, one of the Arabic variations, for the sentiment analysis task using machine learning. To our knowledge, no previous work has been done on Gulf Arabic sentiment analysis, despite the fact that it is present on different online platforms. Hence, the first challenge is the absence of annotated data and sentiment lexicons. To fill this gap, we created these two main linguistic resources. Then we conducted different experiments: using a Naive Bayes classifier without any lexicon; adding a sentiment lexicon designed primarily for MSA; using only the compiled Gulf Arabic sentiment lexicon; and finally using both MSA and Gulf Arabic sentiment lexicons. The Gulf Arabic lexicon gives a good improvement in classifier accuracy (90.54%) over a baseline that does not use the lexicon (82.81%), while the MSA lexicon causes the accuracy to drop to 76.83%. Moreover, mixing MSA and Gulf Arabic lexicons causes the accuracy to drop to 84.94% compared to using only the Gulf Arabic lexicon. This indicates that it is useless to use MSA resources to deal with Gulf Arabic due to the considerable differences and conflicting structures between these two languages.

A Multi-domain Corpus of Swedish Word Sense Annotation
Richard Johansson | Yvonne Adesam | Gerlof Bouma | Karin Hedberg
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We describe the word sense annotation layer in Eukalyptus, a freely available five-domain corpus of contemporary Swedish with several annotation layers. The annotation uses the SALDO lexicon to define the sense inventory, and allows word sense annotation of compound segments and multiword units. We give an overview of the new annotation tool developed for this project, and finally present an analysis of the inter-annotator agreement between two annotators.

We describe two constraint-based methods that can be used to improve the recall of a shallow discourse parser based on conditional random field chunking. These methods use a set of natural structural constraints as well as others that follow from the annotation guidelines of the Penn Discourse Treebank. We evaluated the resulting systems on the standard test set of the PDTB and achieved a rebalancing of precision and recall with improved F-measures across the board. This was especially notable when we used evaluation metrics taking partial matches into account; for these measures, we achieved F-measure improvements of several points.

Semantic Role Labeling with the Swedish FrameNet
Richard Johansson | Karin Friberg Heppin | Dimitrios Kokkinakis
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

We present the first results on semantic role labeling using the Swedish FrameNet, which is a lexical resource currently in development. Several aspects of the task are investigated, including the design and selection of machine learning features, the effect of choice of syntactic parser, and the ability of the system to generalize to new frames and new genres. In addition, we evaluate two methods to make the role label classifier more robust: cross-frame generalization and cluster-based features. Although the small amount of training data limits the performance achievable at the moment, we reach promising results. In particular, the classifier that extracts the boundaries of arguments works well for new frames, which suggests that it already at this stage can be useful in a semi-automatic setting.

2011

Extracting Opinion Expressions and Their Polarities – Exploration of Pipelines and Joint Models
Richard Johansson | Alessandro Moschitti
Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies

Shallow Discourse Parsing with Conditional Random Fields
Sucheta Ghosh | Richard Johansson | Giuseppe Riccardi | Sara Tonelli
Proceedings of 5th International Joint Conference on Natural Language Processing

2010

A Flexible Representation of Heterogeneous Annotation Data
Richard Johansson | Alessandro Moschitti
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

This paper describes a new flexible representation for the annotation of complex structures of metadata over heterogeneous data collections containing text and other types of media such as images or audio files. We argue that existing frameworks are not suitable for this purpose, most importantly because they do not easily generalize to multi-document and multimodal corpora, and because they often require the use of particular software frameworks. In the paper, we define a data model to represent such structured data over multimodal collections. Furthermore, we define a surface realization of the data structure as a simple and readable XML format. We present two examples of annotation tasks to illustrate how the representation and format work for complex structures involving multimodal annotation and cross-document links. The representation described here has been used in a large-scale project focusing on the annotation of a wide range of information ― from low-level features to high-level semantics ― in a multimodal data collection containing both text and images.

Reranking Models in Fine-grained Opinion Analysis
Richard Johansson | Alessandro Moschitti
Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010)

Syntactic and Semantic Structure for Opinion Expression Detection
Richard Johansson | Alessandro Moschitti
Proceedings of the Fourteenth Conference on Computational Natural Language Learning

2009

Statistical Bistratal Dependency Parsing
Richard Johansson
Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing

Text Categorization Using Predicate-Argument Structures
Jacob Persson | Richard Johansson | Pierre Nugues
Proceedings of the 17th Nordic Conference of Computational Linguistics (NODALIDA 2009)

2008

Comparing Dependency and Constituent Syntax for Frame-semantic Analysis
Richard Johansson | Pierre Nugues
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

We address the question of which syntactic representation is best suited for role-semantic analysis of English in the FrameNet paradigm. We compare systems based on dependencies and constituents, and a dependency syntax with a rich set of grammatical functions with one with a smaller set. Our experiments show that dependency-based and constituent-based analyzers give roughly equivalent performance, and that a richer set of functions has a positive influence on argument classification for verbs.

The CoNLL 2008 Shared Task on Joint Parsing of Syntactic and Semantic Dependencies
Mihai Surdeanu | Richard Johansson | Adam Meyers | Lluís Màrquez | Joakim Nivre
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

Dependency-based Syntactic–Semantic Analysis with PropBank and NomBank
Richard Johansson | Pierre Nugues
CoNLL 2008: Proceedings of the Twelfth Conference on Computational Natural Language Learning

The Effect of Syntactic Representation on Semantic Role Labeling
Richard Johansson | Pierre Nugues
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)

Dependency-based Semantic Role Labeling of PropBank
Richard Johansson | Pierre Nugues
Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing

2007

Logistic Online Learning Methods and Their Application to Incremental Dependency Parsing
Richard Johansson
Proceedings of the ACL 2007 Student Research Workshop

Incremental Dependency Parsing Using Online Learning
Richard Johansson | Pierre Nugues
Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL)

LTH: Semantic Structure Extraction using Nonprojective Dependency Trees
Richard Johansson | Pierre Nugues
Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007)

Extended Constituent-to-Dependency Conversion for English
Richard Johansson | Pierre Nugues
Proceedings of the 16th Nordic Conference of Computational Linguistics (NODALIDA 2007)

2006

Extraction of Temporal Information from Texts in Swedish
Anders Berglund | Richard Johansson | Pierre Nugues
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

This paper describes the implementation and evaluation of a generic component to extract temporal information from texts in Swedish. It proceeds in two steps. The first step extracts time expressions and events, and generates a feature vector for each element it identifies. Using the vectors, the second step determines the temporal relations, possibly none, between the extracted events and orders them in time. We used a machine learning approach to find the relations between events. To run the learning algorithm, we collected a corpus of road accident reports from newspaper websites that we manually annotated. It enabled us to train decision trees and to evaluate the performance of the algorithm.

Construction of a FrameNet Labeler for Swedish Text
Richard Johansson | Pierre Nugues
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

We describe the implementation of a FrameNet-based semantic role labeling system for Swedish text. To train the system, we used a semantically annotated corpus that was produced by projection across parallel corpora. As part of the system, we developed two frame element bracketing algorithms that are suitable when no robust constituent parsers are available. Apart from being the first such system for Swedish, this is, as far as we are aware, the first semantic role labeling system for a language for which no role-semantic annotated corpora are available. The estimated accuracy of classification of pre-segmented frame elements is 0.75, and the precision and recall measures for the complete task are 0.67 and 0.47, respectively.

Investigating Multilingual Dependency Parsing
Richard Johansson | Pierre Nugues
Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL-X)

A Machine Learning Approach to Extract Temporal Information from Texts in Swedish and Generate Animated 3D Scenes
Anders Berglund | Richard Johansson | Pierre Nugues
11th Conference of the European Chapter of the Association for Computational Linguistics

Automatic Annotation for All Semantic Layers in FrameNet
Richard Johansson | Pierre Nugues
Demonstrations

A FrameNet-Based Semantic Role Labeler for Swedish
Richard Johansson | Pierre Nugues
Proceedings of the COLING/ACL 2006 Main Conference Poster Sessions