Matt Jones
2025
Spoken Document Retrieval for an Unwritten Language: A Case Study on Gormati
Sanjay Booshanam | Kelly Chen | Ondrej Klejch | Thomas Reitmaier | Dani Kalarikalayil Raju | Electra Wallington | Nina Markl | Jennifer Pearson | Matt Jones | Simon Robinson | Peter Bell
Findings of the Association for Computational Linguistics: EMNLP 2025
Speakers of unwritten languages have the potential to benefit from speech-based automatic information retrieval systems. This paper proposes a speech embedding technique that facilitates such a system and that can be used in a zero-shot manner on the target language. After conducting development experiments on several written Indic languages, we evaluate our method on a corpus of Gormati – an unwritten language – that was previously collected in partnership with an agrarian Banjara community in Maharashtra State, India, specifically for the purposes of information retrieval. Our system achieves a Top 5 retrieval rate of 87.9% on this data, giving hope that it may be usable by speakers of unwritten languages worldwide.