Matt Jones


2025

Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati
Sanjay Booshanam | Kelly Chen | Ondrej Klejch | Thomas Reitmaier | Dani Kalarikalayil Raju | Electra Wallington | Nina Markl | Jennifer Pearson | Matt Jones | Simon Robinson | Peter Bell
Findings of the Association for Computational Linguistics: EMNLP 2025

Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system and that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by unwritten language speakers worldwide.