Ljiljana Dolamic


NLP DI at NADI Shared Task Subtask-1: Sub-word Level Convolutional Neural Models and Pre-trained Binary Classifiers for Dialect Identification
Vani Kanjirangat | Tanja Samardzic | Ljiljana Dolamic | Fabio Rinaldi
Proceedings of the The Seventh Arabic Natural Language Processing Workshop (WANLP)

In this paper, we describe our systems submitted to the NADI Subtask 1: country-wise dialect classifications. We designed two types of solutions. The first type is convolutional neural network CNN) classifiers trained on subword segments of optimized lengths. The second type is fine-tuned classifiers with BERT-based language specific pre-trained models. To deal with the missing dialects in one of the test sets, we experimented with binary classifiers, analyzing the predicted probability distribution patterns and comparing them with the development set patterns. The better performing approach on the development set was fine-tuning language specific pre-trained model (best F-score 26.59%). On the test set, on the other hand, we obtained the best performance with the CNN model trained on subword tokens obtained with a Unigram model (the best F-score 26.12%). Re-training models on samples of training data simulating missing dialects gave the maximum performance on the test set version with a number of dialects lesser than the training set (F-score 16.44%)

Early Guessing for Dialect Identification
Vani Kanjirangat | Tanja Samardzic | Fabio Rinaldi | Ljiljana Dolamic
Findings of the Association for Computational Linguistics: EMNLP 2022

This paper deals with the problem of incre-mental dialect identification. Our goal is toreliably determine the dialect before the fullutterance is given as input. The major partof the previous research on dialect identification has been model-centric, focusing on performance. We address a new question: How much input is needed to identify a dialect? Ourapproach is a data-centric analysis that resultsin general criteria for finding the shortest inputneeded to make a plausible guess. Workingwith three sets of language dialects (Swiss German, Indo-Aryan and Arabic languages), weshow that it is possible to generalize across dialects and datasets with two input shorteningcriteria: model confidence and minimal inputlength (adjusted for the input type). The sourcecode for experimental analysis can be found atGithub.

mattica@SMM4H’22: Leveraging sentiment for stance & premise joint learning
Oscar Lithgow-Serrano | Joseph Cornelius | Fabio Rinaldi | Ljiljana Dolamic
Proceedings of The Seventh Workshop on Social Media Mining for Health Applications, Workshop & Shared Task

This paper describes our submissions to the Social Media Mining for Health Applications (SMM4H) shared task 2022. Our team (mattica) participated in detecting stances and premises in tweets about health mandates related to COVID-19 (Task 2). Our approach was based on using an in-domain Pretrained Language Model, which we fine-tuned by combining different strategies such as leveraging an additional stance detection dataset through two-stage fine-tuning, joint-learning Stance and Premise detection objectives; and ensembling the sentiment-polarity given by an off-the-shelf fine-tuned model.


The IICT-Yverdon System for the WMT 2021 Unsupervised MT and Very Low Resource Supervised MT Task
Àlex R. Atrio | Gabriel Luthier | Axel Fahy | Giorgos Vernikos | Andrei Popescu-Belis | Ljiljana Dolamic
Proceedings of the Sixth Conference on Machine Translation

In this paper, we present the systems submitted by our team from the Institute of ICT (HEIG-VD / HES-SO) to the Unsupervised MT and Very Low Resource Supervised MT task. We first study the improvements brought to a baseline system by techniques such as back-translation and initialization from a parent model. We find that both techniques are beneficial and suffice to reach performance that compares with more sophisticated systems from the 2020 task. We then present the application of this system to the 2021 task for low-resource supervised Upper Sorbian (HSB) to German translation, in both directions. Finally, we present a contrastive system for HSB-DE in both directions, and for unsupervised German to Lower Sorbian (DSB) translation, which uses multi-task training with various training schedules to improve over the baseline.