Mark Granroth-Wilding


Silo NLP’s Participation at WAT2022
Shantipriya Parida | Subhadarshi Panda | Stig-Arne Grönroos | Mark Granroth-Wilding | Mika Koistinen
Proceedings of the 9th Workshop on Asian Translation

This paper provides the system description of “Silo NLP’s” submission to the Workshop on Asian Translation (WAT2022). We participated in the Indic Multimodal tasks (English->Hindi, English->Malayalam, and English->Bengali multimodal translation). For text-only translation, we used the Transformer and fine-tuned mBART. For multimodal translation, we used the same architecture and extracted object tags from the images to use as visual features, concatenated with the text sequence as input. Our submission tops many tasks, including English->Hindi multimodal translation (evaluation test), English->Malayalam text-only and multimodal translation (evaluation test), English->Bengali multimodal translation (challenge test), and English->Bengali text-only translation (evaluation test).
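
As an illustration of the input construction mentioned in the abstract, the following is a minimal sketch of appending image object tags to the source text before translation. The function name `build_multimodal_input` and the `##` separator are assumptions made for this example; the abstract does not specify how the tags are delimited or ordered.

```python
def build_multimodal_input(source_text, object_tags, separator="##"):
    """Append object tags to the source sentence as extra input tokens.

    The separator token is a hypothetical choice for this sketch; it only
    marks where the text ends and the visual tags begin.
    """
    # Drop empty or whitespace-only tags that a detector might emit.
    tags = [tag.strip() for tag in object_tags if tag.strip()]
    if not tags:
        # No visual features available: fall back to text-only input.
        return source_text
    return f"{source_text} {separator} {' '.join(tags)}"


if __name__ == "__main__":
    print(build_multimodal_input("a man riding a horse", ["man", "horse"]))
```

The combined string would then be tokenized and fed to the translation model in the same way as a text-only input.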


Pimlico: A toolkit for corpus-processing pipelines and reproducible experiments
Mark Granroth-Wilding
Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS)

We present Pimlico, an open source toolkit for building pipelines for processing large corpora. It is especially focused on processing linguistic corpora and provides wrappers around existing, widely used NLP tools. A particular goal is to ease distribution of reproducible and extensible experiments by making it easy to document and re-run all steps involved, including data loading, pre-processing, model training and evaluation. Once a pipeline is released, it is easy to adapt, for example, to run on a new dataset, or to re-run an experiment with different parameters. The toolkit takes care of many common challenges in writing and distributing corpus-processing code, such as managing data between the steps of a pipeline, installing required software and combining existing toolkits with new, task-specific code.

CoSimLex: A Resource for Evaluating Graded Word Similarity in Context
Carlos Santos Armendariz | Matthew Purver | Matej Ulčar | Senja Pollak | Nikola Ljubešić | Mark Granroth-Wilding
Proceedings of the Twelfth Language Resources and Evaluation Conference

State of the art natural language processing tools are built on context-dependent word embeddings, but no direct method for evaluating these representations currently exists. Standard tasks and datasets for intrinsic evaluation of embeddings are based on judgements of similarity, but ignore context; standard tasks for word sense disambiguation take account of context but do not provide continuous measures of meaning similarity. This paper describes an effort to build a new dataset, CoSimLex, intended to fill this gap. Building on the standard pairwise similarity task of SimLex-999, it provides context-dependent similarity measures; covers not only discrete differences in word sense but more subtle, graded changes in meaning; and covers not only a well-resourced language (English) but a number of less-resourced languages. We define the task and evaluation metrics, outline the dataset collection methodology, and describe the status of the dataset so far.

A Comparison of Unsupervised Methods for Ad hoc Cross-Lingual Document Retrieval
Elaine Zosa | Mark Granroth-Wilding | Lidia Pivovarova
Proceedings of the workshop on Cross-Language Search and Summarization of Text and Speech (CLSSTS2020)

We address the problem of linking related documents across languages in a multilingual collection. We evaluate three diverse unsupervised methods to represent and compare documents: (1) multilingual topic model; (2) cross-lingual document embeddings; and (3) Wasserstein distance. We test the performance of these methods in retrieving news articles in Swedish that are known to be related to a given Finnish article. The results show that ensembles of the methods outperform the stand-alone methods, suggesting that they capture complementary characteristics of the documents.


Unsupervised Learning of Cross-Lingual Symbol Embeddings Without Parallel Data
Mark Granroth-Wilding | Hannu Toivonen
Proceedings of the Society for Computation in Linguistics (SCiL) 2019

Cross-Family Similarity Learning for Cognate Identification in Low-Resource Languages
Eliel Soisalon-Soininen | Mark Granroth-Wilding
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

We address the problem of cognate identification across vocabulary pairs of any set of languages. In particular, we focus on the case where the examined pair of languages are low-resource to the extent that no training data whatsoever in these languages, or even closely related ones, are available for the task. We investigate the extent to which training data from another, unrelated language family can be used instead. Our approach consists of learning a similarity metric from example cognates in Indo-European languages and applying it to low-resource Sami languages of the Uralic family. We apply two models following previous work: a Siamese convolutional neural network (S-CNN) and a support vector machine (SVM), and compare them with a Levenshtein-distance baseline. We test performance on three Sami languages and find that the S-CNN outperforms the other approaches, suggesting that it is better able to learn such general characteristics of cognateness that carry over across language families. We also experiment with fine-tuning the S-CNN model with data from within the language family in order to quantify how well this model can make use of a small amount of target-domain data to adapt.
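
For readers unfamiliar with the Levenshtein-distance baseline named in the abstract, here is a minimal sketch of how edit distance can be turned into a word-pair similarity score. Normalizing by the longer word's length is an assumption of this example, and the function names are hypothetical; the abstract does not state how the baseline was normalized or thresholded.

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """Map edit distance to a similarity in [0, 1] (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
    print(edit_similarity("night", "nacht"))
```

A word pair would be predicted as cognate when its similarity exceeds some threshold chosen on held-out data.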

Multilingual Dynamic Topic Model
Elaine Zosa | Mark Granroth-Wilding
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019)

Dynamic topic models (DTMs) capture the evolution of topics and trends in time series data. Current DTMs are applicable only to monolingual datasets. In this paper we present the multilingual dynamic topic model (ML-DTM), a novel topic model that combines DTM with an existing multilingual topic modeling method to capture cross-lingual topics that evolve across time. We present results of this model on a parallel German-English corpus of news articles and a comparable corpus of Finnish and Swedish news articles. We demonstrate the capability of ML-DTM to track significant events related to a topic and show that it finds distinct topics and performs as well as existing multilingual topic models in aligning cross-lingual topics.


Data-Driven News Generation for Automated Journalism
Leo Leppänen | Myriam Munezero | Mark Granroth-Wilding | Hannu Toivonen
Proceedings of the 10th International Conference on Natural Language Generation

Despite increasing amounts of data and ever-improving natural language generation techniques, work on automated journalism is still relatively scarce. In this paper, we explore the field and the challenges associated with building a journalistic natural language generation system. We present a set of requirements that should guide system design, including transparency, accuracy, modifiability and transferability. Guided by the requirements, we present a data-driven architecture for automated journalism that is largely domain and language independent. We illustrate its practical application in the production of news articles about the 2017 Finnish municipal elections in three languages, demonstrating the success of the data-driven, modular approach of the design. We then draw some lessons for future automated journalism.