Fernando Benites


2020

pdf
TRANSLIT: A Large-scale Name Transliteration Resource
Fernando Benites | Gilbert François Duivesteijn | Pius von Däniken | Mark Cieliebak
Proceedings of the Twelfth Language Resources and Evaluation Conference

Transliteration is the process of expressing a proper name from a source language in the characters of a target language (e.g. from Cyrillic to Latin characters). We present TRANSLIT, a large-scale corpus with approx. 1.6 million entries in more than 180 languages with about 3 million variations of person and geolocation names. The corpus is based on various public data sources, which have been transformed into a unified format to simplify their usage, plus a newly compiled dataset from Wikipedia. In addition, we apply several machine learning methods to establish baselines for automatically detecting transliterated names in various languages. Our best systems achieve an accuracy of 92% on identification of transliterated pairs.

pdf
CEASR: A Corpus for Evaluating Automatic Speech Recognition
Malgorzata Anna Ulasik | Manuela Hürlimann | Fabian Germann | Esin Gedik | Fernando Benites | Mark Cieliebak
Proceedings of the Twelfth Language Resources and Evaluation Conference

In this paper, we present CEASR, a Corpus for Evaluating the quality of Automatic Speech Recognition (ASR). It is a data set based on public speech corpora, containing metadata along with transcripts generated by several modern state-of-the-art ASR systems. CEASR provides this data in a unified structure, consistent across all corpora and systems, with normalised transcript texts and metadata. We use CEASR to evaluate the quality of ASR systems by calculating an average Word Error Rate (WER) per corpus, per system and per corpus-system pair. Our experiments show a substantial difference in accuracy between commercial versus open-source ASR tools as well as differences up to a factor ten for single systems on different corpora. Using CEASR allowed us to very efficiently and easily obtain these results. Our corpus enables researchers to perform ASR-related evaluations and various in-depth analyses with noticeably reduced effort, i.e. without the need to collect, process and transcribe the speech data themselves.

pdf
ZHAW-InIT - Social Media Geolocation at VarDial 2020
Fernando Benites | Manuela Hürlimann | Pius von Däniken | Mark Cieliebak
Proceedings of the 7th Workshop on NLP for Similar Languages, Varieties and Dialects

We describe our approaches for the Social Media Geolocation (SMG) task at the VarDial Evaluation Campaign 2020. The goal was to predict geographical location (latitudes and longitudes) given an input text. There were three subtasks corresponding to German-speaking Switzerland (CH), Germany and Austria (DE-AT), and Croatia, Bosnia and Herzegovina, Montenegro and Serbia (BCMS). We submitted solutions to all subtasks but focused our development efforts on the CH subtask, where we achieved third place out of 16 submissions with a median distance of 15.93 km and had the best result of 14 unconstrained systems. In the DE-AT subtask, we ranked sixth out of ten submissions (fourth of 8 unconstrained systems) and for BCMS we achieved fourth place out of 13 submissions (second of 11 unconstrained systems).

2019

pdf
TwistBytes - Identification of Cuneiform Languages and German Dialects at VarDial 2019
Fernando Benites | Pius von Däniken | Mark Cieliebak
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

We describe our approaches for the German Dialect Identification (GDI) and the Cuneiform Language Identification (CLI) tasks at the VarDial Evaluation Campaign 2019. The goal was to identify dialects of Swiss German in GDI and Sumerian and Akkadian in CLI. In GDI, the system should distinguish four dialects from the German-speaking part of Switzerland. Our system for GDI achieved third place out of 6 teams, with a macro averaged F-1 of 74.6%. In CLI, the system should distinguish seven languages written in cuneiform script. Our system achieved third place out of 8 teams, with a macro averaged F-1 of 74.7%.

2018

pdf
Classifying Patent Applications with Ensemble Methods
Fernando Benites | Shervin Malmasi | Marcos Zampieri
Proceedings of the Australasian Language Technology Association Workshop 2018

We present methods for the automatic classification of patent applications using an annotated dataset provided by the organizers of the ALTA 2018 shared task - Classifying Patent Applications. The goal of the task is to use computational methods to categorize patent applications according to a coarse-grained taxonomy of eight classes based on the International Patent Classification (IPC). We tested a variety of approaches for this task and the best results, 0.778 micro-averaged F1-Score, were achieved by SVM ensembles using a combination of words and characters as features. Our team, BMZ, was ranked first among 14 teams in the competition.

pdf
Twist Bytes - German Dialect Identification with Data Mining Optimization
Fernando Benites | Ralf Grubenmann | Pius von Däniken | Dirk von Grünigen | Jan Deriu | Mark Cieliebak
Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018)

We describe our approaches used in the German Dialect Identification (GDI) task at the VarDial Evaluation Campaign 2018. The goal was to identify to which out of four dialects spoken in German speaking part of Switzerland a sentence belonged to. We adopted two different meta classifier approaches and used some data mining insights to improve the preprocessing and the meta classifier parameters. Especially, we focused on using different feature extraction methods and how to combine them, since they influenced very differently the performance of the system. Our system achieved second place out of 8 teams, with a macro averaged F-1 of 64.6%.