Jacob Lee Suchardt
2026
Fill-in-the-Blanks: Automatic Generation and Evaluation of Language Models’ Pseudonyms for English and Swedish Texts
Maria Irena Szawerna | Jacob Lee Suchardt
Proceedings of the Fifteenth Language Resources and Evaluation Conference
While considerable effort has gone into developing solutions for detecting Personally Identifiable Information (PII) in linguistic data, less research has gone into automating the generation of appropriate pseudonyms and developing evaluation methods, both relevant for the creation of privacy-friendly language resources. We conduct pilot experiments using Masked and Generative Large Language Models to generate predictions for redacted PII-spans in a cloze-like fashion for English legal texts and parallel news articles in Swedish and English. Furthermore, we explore metrics for automatic evaluation of the generated pseudonyms in the legal data, and investigate the effect of part-of-speech constraints on performance. For the parallel, multilingual data, we contribute our manual PII-annotation and conduct a fine-grained error analysis across two of our pseudonym generation methods and a baseline. Our results illustrate the complexity of pseudonym evaluation and the particular challenge of automatic, at-scale evaluation as well as the models’ tendency to predict prototypical and even stereotypical answers.
WikIPA: Integrating WikiPron and Lingua Libre for Multilingual IPA Transcription
Pierluigi Cassotti | Jacob Lee Suchardt | Domenico De Cristofaro
Proceedings of the Fifteenth Language Resources and Evaluation Conference
We present WikIPA, a new multilingual benchmark designed for automatic speech-to-IPA (STIPA) transcription. By integrating human-curated IPA transcriptions from WikiPron with spoken recordings and metadata from Lingua Libre, WikIPA connects textual phonetic representations with real speech across 78 languages. This open resource supports both broad (phonemic) and narrow (phonetic) transcription tasks, enabling fine-grained evaluation of multilingual phonetic transcription systems. WikIPA provides over 289,000 paired entries and serves as a large-scale foundation for STIPA. We benchmark several state-of-the-art STIPA systems, including MultIPA, (Lo)WhIPA, and ZIPA. Results show that ZIPA achieves the lowest mean error rates across most languages, outperforming Whisper- and Wav2Vec-based baselines. Error analyses reveal that remaining discrepancies largely stem from minor phonetic confusions rather than complete transcription failures, emphasizing the challenge of modeling fine-grained articulatory variation. WikIPA thus establishes the first systematic, multilingual evaluation framework for speech-to-IPA transcription and highlights the potential of combining open, community-driven resources to advance STIPA evaluation.
2025
Towards Language-Agnostic STIPA: Universal Phonetic Transcription to Support Language Documentation at Scale
Jacob Lee Suchardt | Hana El-Shazli | Pierluigi Cassotti
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
This paper explores the use of existing state-of-the-art automatic speech recognition (ASR) models for the task of generating narrow phonetic transcriptions in the International Phonetic Alphabet (speech-to-IPA, STIPA). Unlike conventional ASR systems focused on orthographic output for high-resource languages, STIPA can serve as a language-agnostic interface valuable for documenting under-resourced and unwritten languages. We introduce a new dataset for South Levantine Arabic and present the first large-scale evaluation of STIPA models across 51 language families. Additionally, we provide a use case on Sanna, a severely endangered language. Our findings show that fine-tuned ASR models can produce accurate IPA transcriptions with limited supervision, significantly reducing phonetic error rates even in extremely low-resource settings. The results highlight the potential of STIPA for scalable language documentation.