Lorena Martín Rodríguez

Also published as: Lorena Martín Rodríguez

2023

Speech-to-text recognition for multilingual spoken data in language documentation
Lorena Martín Rodríguez | Christopher Cox
Proceedings of the Sixth Workshop on the Use of Computational Methods in the Study of Endangered Languages

2022

OversampledML at SemEval-2022 Task 8: When multilingual news similarity met Zero-shot approaches
Mayank Jobanputra | Lorena Martín Rodríguez
Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022)

We investigate the capabilities of pre-trained models, without any fine-tuning, for the document-level multilingual news similarity task of SemEval-2022. We use the title and news content with appropriate pre-processing techniques. Our system derives 14 different similarity features using a combination of state-of-the-art methods (MPNet) with well-known statistical methods (i.e. TF-IDF, Word Mover’s distance). We formulate the multilingual news similarity task as a regression task and approximate the overall similarity between two news articles using these features. Our best-performing system achieved a correlation score of 70.1% and was ranked 20th among the 34 participating teams. In this paper, in addition to a system description, we also provide further analysis of our results and an ablation study highlighting the strengths and limitations of our features. We make our code publicly available at https://github.com/cicl-iscl/multinewssimilarity

Tupían Language Ressources: Data, Tools, Analyses
Lorena Martín Rodríguez | Tatiana Merzhevich | Wellington Silva | Tiago Tresoldi | Carolina Aragon | Fabrício F. Gerardi
Proceedings of the 1st Annual Meeting of the ELRA/ISCA Special Interest Group on Under-Resourced Languages

TuLaR (Tupian Language Resources) is a project for collecting, documenting, analyzing, and developing computational and pedagogical material for low-resource Brazilian indigenous languages. It provides valuable data for language research regarding typological, syntactic, morphological, and phonological aspects. Here we present TuLaR’s databases, with special consideration to TuDeT (Tupian Dependency Treebanks), an annotated corpus under development for nine languages of the Tupian family, built upon the Universal Dependencies framework. Annotation within this framework serves a twofold goal: enriching the linguistic documentation of the Tupian languages through rapid and consistent annotation, and providing computational resources for those languages, thanks to the suitability of the framework for developing NLP tools. We likewise present a related lexical database and some tools developed by the project, and examine future goals for our initiative.

Co-authors

Fabrício F. Gerardi 1

Christopher Cox 1