2022
A Transformer Architecture for the Prediction of Cognate Reflexes
Giuseppe G. A. Celano
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP
This paper presents the transformer model built to participate in the SIGTYP 2022 Shared Task on the Prediction of Cognate Reflexes. It consists of an encoder-decoder architecture with a multi-head attention mechanism. Its output is concatenated with the one-hot encoding of the language label of an input character sequence to predict a target character sequence. The results show that the transformer outperforms the baseline rule-based system only partially.
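The concatenation step described in this abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function name `append_language_label`, the array shapes, and the choice to repeat the label at every time step are assumptions made only for illustration.

```python
import numpy as np

def append_language_label(decoder_states, lang_index, n_languages):
    """Concatenate a one-hot language label to every time step.

    decoder_states: array of shape (seq_len, d_model), standing in for
    the transformer output (hypothetical shape).
    Returns an array of shape (seq_len, d_model + n_languages).
    """
    one_hot = np.zeros(n_languages, dtype=decoder_states.dtype)
    one_hot[lang_index] = 1.0
    # Repeat the label once per time step so it can be joined column-wise.
    tiled = np.tile(one_hot, (decoder_states.shape[0], 1))
    return np.concatenate([decoder_states, tiled], axis=-1)

# Toy input: 5 time steps, 8 hidden units, 4 languages (all hypothetical).
states = np.random.default_rng(0).normal(size=(5, 8))
combined = append_language_label(states, lang_index=2, n_languages=4)
```

In a full model, a final projection layer over `combined` would produce the per-character predictions; that layer is omitted here.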
2021
A ResNet-50-Based Convolutional Neural Network Model for Language ID Identification from Speech Recordings
Giuseppe G. A. Celano
Proceedings of the Third Workshop on Computational Typology and Multilingual NLP
This paper describes the model built for the SIGTYP 2021 Shared Task aimed at identifying 18 typologically different languages from speech recordings. Mel-frequency cepstral coefficients derived from audio files are transformed into spectrograms, which are then fed into a ResNet-50-based CNN architecture. The final model achieved validation and test accuracies of 0.73 and 0.53, respectively.
2020
SIGTYP 2020 Shared Task: Prediction of Typological Features
Johannes Bjerva | Elizabeth Salesky | Sabrina J. Mielke | Aditi Chaudhary | Giuseppe G. A. Celano | Edoardo Maria Ponti | Ekaterina Vylomova | Ryan Cotterell | Isabelle Augenstein
Proceedings of the Second Workshop on Computational Research in Linguistic Typology
Typological knowledge bases (KBs) such as WALS (Dryer and Haspelmath, 2013) contain information about linguistic properties of the world’s languages. They have been shown to be useful for downstream applications, including cross-lingual transfer learning and linguistic probing. A major drawback hampering broader adoption of typological KBs is that they are sparsely populated, in the sense that most languages only have annotations for some features, and skewed, in that few features have wide coverage. As typological features often correlate with one another, it is possible to predict them and thus automatically populate typological KBs, which is also the focus of this shared task. Overall, the task attracted 8 submissions from 5 teams, out of which the most successful methods make use of such feature correlations. However, our error analysis reveals that even the strongest submitted systems struggle with predicting feature values for languages where few features are known.
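The idea of exploiting feature correlations to fill in missing values can be sketched as a simple co-occurrence vote. This is only an illustration of the general principle, not any submitted system: the function names, the feature names `F1` to `F3`, and the values are hypothetical placeholders, not WALS data.

```python
from collections import Counter, defaultdict

def fit_cooccurrence(languages, target):
    """Count target values conditioned on each known (feature, value) pair."""
    table = defaultdict(Counter)
    for feats in languages:
        if target not in feats:
            continue  # languages without the target cannot inform the counts
        for name, value in feats.items():
            if name != target:
                table[(name, value)][feats[target]] += 1
    return table

def predict(table, known):
    """Sum the votes of all known features; return None if nothing matches."""
    votes = Counter()
    for pair in known.items():
        votes.update(table.get(pair, {}))
    return votes.most_common(1)[0][0] if votes else None

# Toy, made-up knowledge base with sparse annotations.
languages = [
    {"F1": "a", "F2": "x", "F3": "p"},
    {"F1": "a", "F2": "y", "F3": "p"},
    {"F1": "b", "F2": "y", "F3": "q"},
    {"F1": "b", "F2": "x", "F3": "q"},
    {"F1": "a"},
]
table = fit_cooccurrence(languages, target="F3")
guess_a = predict(table, {"F1": "a"})
guess_b = predict(table, {"F1": "b", "F2": "x"})
guess_unknown = predict(table, {"F9": "z"})
```

The sketch also shows the weakness noted in the abstract: a language with few known features contributes few votes, so its prediction rests on little evidence.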
A Gradient Boosting-Seq2Seq System for Latin POS Tagging and Lemmatization
Giuseppe G. A. Celano
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages
The paper presents the system used in the EvaLatin shared task to POS tag and lemmatize Latin. It consists of two components. A gradient boosting machine (LightGBM) is used for POS tagging, fed mainly with pre-computed word embeddings of a window of seven contiguous tokens (the token at hand plus the three preceding and the three following ones) per target feature value. Word embeddings are trained on the texts of the Perseus Digital Library, Patrologia Latina, and Biblioteca Digitale di Testi Tardo Antichi, which together comprise a large number of texts of different genres from the Classical Age to Late Antiquity. Word forms plus the predicted POS labels are then fed to a seq2seq model implemented in Keras to predict lemmas. The final shared-task accuracies measured for Classical Latin texts are in line with state-of-the-art POS taggers (∼0.96) and lemmatizers (∼0.95).
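The seven-token window described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function name `window_features`, the zero padding at sentence boundaries, and the toy embedding size are all hypothetical, and the classifier itself is left out.

```python
import numpy as np

def window_features(embeddings, left=3, right=3):
    """Build one feature row per token from a window of neighbouring embeddings.

    embeddings: array of shape (n_tokens, dim), one pre-computed vector per token.
    Returns an array of shape (n_tokens, (left + 1 + right) * dim).
    Positions outside the sentence are zero-padded (an assumption).
    """
    n_tokens, dim = embeddings.shape
    padded = np.vstack([
        np.zeros((left, dim)),
        embeddings,
        np.zeros((right, dim)),
    ])
    width = left + 1 + right
    # Row i of the output is the flattened window centred on token i.
    return np.stack([padded[i:i + width].reshape(-1) for i in range(n_tokens)])

# Toy sentence of 5 tokens with 2-dimensional embeddings (hypothetical values).
sentence = np.arange(10, dtype=float).reshape(5, 2)
features = window_features(sentence)
```

Each row of `features` could then serve as the input for a gradient boosting classifier that predicts the POS label of the central token.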