Gerhard Jäger

Also published as: Gerhard J\"ager

2026

Hard to Be Heard: Phoneme-Level ASR Analysis of Phonologically Complex, Low-Resource Endangered Languages
V.S.D.S.Mahesh Akavarapu | Michael Daniel | Gerhard J\"ager
Findings of the Association for Computational Linguistics: ACL 2026

We present a phoneme-level analysis of automatic speech recognition (ASR) for two low-resourced and phonologically complex East Caucasian languages, Archi and Rutul, based on curated and standardized speech–transcript resources totaling approximately 50 minutes and 1 hour 20 minutes of audio, respectively. Existing recordings and transcriptions are consolidated and processed into a form suitable for ASR training and evaluation. We evaluate several state-of-the-art audio and audio–language models, including wav2vec2, Whisper, and Qwen2-Audio. For wav2vec2, we introduce a language-specific phoneme vocabulary with heuristic output-layer initialization, which yields consistent improvements and achieves performance comparable to or exceeding Whisper in these extremely low-resource settings. Beyond standard word and character error rates, we conduct a detailed phoneme-level error analysis. We find that phoneme recognition accuracy strongly correlates with training frequency, exhibiting a characteristic sigmoid-shaped learning curve. For Archi, this relationship partially breaks for Whisper, pointing to model-specific generalization effects beyond what is predicted by training frequency. Overall, our results indicate that many errors attributed to phonological complexity are better explained by data scarcity. These findings demonstrate the value of phoneme-level evaluation for understanding ASR behavior in low-resource, typologically complex languages.

2025

pdf bib abs

Beyond cognacy
Gerhard Jäger
Proceedings of the 7th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

Computational phylogenetics has become an established tool in historical linguistics, with many language families now analyzed using likelihood-based inference. However, standard approaches rely on expert-annotated cognate sets, which are sparse, labor-intensive to produce, and limited to individual language families. This paper explores alternatives by comparing the established method to two fully automated methods that extract phylogenetic signal directly from lexical data. One uses automatic cognate clustering with unigram/concept features; the other applies multiple sequence alignment (MSA) derived from a pair-hidden Markov model. Both are evaluated against expert classifications from Glottolog and typological data from Grambank. Also, the intrinsic strengths of the phylogenetic signal in the characters are compared. Results show that MSA-based inference yields trees more consistent with linguistic classifications, better predicts typological variation, and provides a clearer phylogenetic signal, suggesting it as a promising, scalable alternative to traditional cognate-based methods. This opens new avenues for global-scale language phylogenies beyond expert annotation bottlenecks.

2024

pdf bib abs

Are Sounds Sound for Phylogenetic Reconstruction?
Luise Häuser | Gerhard Jäger | Johann-Mattis List | Taraka Rama | Alexandros Stamatakis
Proceedings of the 6th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

In traditional studies on language evolution, scholars often emphasize the importance of sound laws and sound correspondences for phylogenetic inference of language family trees. However, to date, computational approaches have typically not taken this potential into account. Most computational studies still rely on lexical cognates as major data source for phylogenetic reconstruction in linguistics, although there do exist a few studies in which authors praise the benefits of comparing words at the level of sound sequences. Building on (a) ten diverse datasets from different language families, and (b) state-of-the-art methods for automated cognate and sound correspondence detection, we test, for the first time, the performance of sound-based versus cognate-based approaches to phylogenetic reconstruction. Our results show that phylogenies reconstructed from lexical cognates are topologically closer, by approximately one third with respect to the generalized quartet distance on average, to the gold standard phylogenies than phylogenies reconstructed from sound correspondences.

pdf bib

Computational Approaches for Integrating out Subjectivity in Cognate Synonym Selection
Luise Häuser | Gerhard Jäger | Alexandros Stamatakis
Proceedings of the Society for Computation in Linguistics 2024

2022

pdf bib abs

Typological Word Order Correlations with Logistic Brownian Motion
Kai Hartung | Gerhard Jäger | Sören Gröttrup | Munir Georges
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

In this study we address the question to what extent syntactic word-order traits of different languages have evolved under correlation and whether such dependencies can be found universally across all languages or restricted to specific language families. To do so, we use logistic Brownian Motion under a Bayesian framework to model the trait evolution for 768 languages from 34 language families. We test for trait correlations both in single families and universally over all families. Separate models reveal no universal correlation patterns and Bayes Factor analysis of models over all covered families also strongly indicate lineage specific correlation patters instead of universal dependencies.

pdf bib abs

Bayesian Phylogenetic Cognate Prediction
Gerhard Jäger
Proceedings of the 4th Workshop on Research in Computational Linguistic Typology and Multilingual NLP

In Jäger (2019) a computational framework was defined to start from parallel word lists of related languages and infer the corresponding vocabulary of the shared proto-language. The SIGTYP 2022 Shared Task is closely related. The main difference is that what is to be reconstructed is not the proto-form but an unknown word from an extant language. The system described here is a re-implementation of the tools used in the mentioned paper, adapted to the current task.

2020

pdf bib abs

Imputing typological values via phylogenetic inference
Gerhard Jäger
Proceedings of the Second Workshop on Computational Research in Linguistic Typology

This paper describes a workflow to impute missing values in a typological database, a sub- set of the World Atlas of Language Structures (WALS). Using a world-wide phylogeny de- rived from lexical data, the model assumes a phylogenetic continuous time Markov chain governing the evolution of typological val- ues. Data imputation is performed via a Max- imum Likelihood estimation on the basis of this model. As back-off model for languages whose phylogenetic position is unknown, a k- nearest neighbor classification based on geo- graphic distance is performed.

2018

pdf bib abs

Are Automatic Methods for Cognate Detection Good Enough for Phylogenetic Reconstruction in Historical Linguistics?
Taraka Rama | Johann-Mattis List | Johannes Wahle | Gerhard Jäger
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We evaluate the performance of state-of-the-art algorithms for automatic cognate detection by comparing how useful automatically inferred cognates are for the task of phylogenetic inference compared to classical manually annotated cognate sets. Our findings suggest that phylogenies inferred from automated cognate sets come close to phylogenies inferred from expert-annotated ones, although on average, the latter are still superior. We conclude that future work on phylogenetic reconstruction can profit much from automatic cognate detection. Especially where scholars are merely interested in exploring the bigger picture of a language family’s phylogeny, algorithms for automatic cognate detection are a useful complement for current research on language phylogenies.

2017

pdf bib abs

Using support vector machines and state-of-the-art algorithms for phonetic alignment to identify cognates in multi-lingual wordlists
Gerhard Jäger | Johann-Mattis List | Pavel Sofroniev
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers

Most current approaches in phylogenetic linguistics require as input multilingual word lists partitioned into sets of etymologically related words (cognates). Cognate identification is so far done manually by experts, which is time consuming and as of yet only available for a small number of well-studied language families. Automatizing this step will greatly expand the empirical scope of phylogenetic methods in linguistics, as raw wordlists (in phonetic transcription) are much easier to obtain than wordlists in which cognate words have been fully identified and annotated, even for under-studied languages. A couple of different methods have been proposed in the past, but they are either disappointing regarding their performance or not applicable to larger datasets. Here we present a new approach that uses support vector machines to unify different state-of-the-art methods for phonetic alignment and cognate detection within a single framework. Training and evaluating these method on a typologically broad collection of gold-standard data shows it to be superior to the existing state of the art.

2012

pdf bib

Estimating and visualizing language similarities using weighted alignment and force-directed graph layout
Gerhard Jäger
Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH

Gerhard Jäger

2026

2025

2024

2022

2020

2018

2017

2012

2002

Co-authors

Venues