Optical Character Recognition for the International Phonetic Alphabet

Shu Okabe, Dejvi Zelo, Alexander Fraser


Abstract
Grammar books are increasingly used as additional reference resources, particularly for very low-resource languages. A significant portion of them comes from scans, so the usefulness of the extracted text depends on the quality of the Optical Character Recognition (OCR) tool. We focus here on a particular script used in linguistics to transcribe sounds: the International Phonetic Alphabet (IPA). We consider two data sources: actual grammar book PDFs for two languages under documentation, Japhug and Kagayanen, and a synthetically generated dataset based on Wiktionary. We compare two neural OCR frameworks, Tesseract and Calamari, and a recent large vision-language model, Qwen2.5-VL-7B, each evaluated both off-the-shelf and after fine-tuning. While their zero-shot performance on IPA characters is generally poor because of a character set mismatch, fine-tuning on the synthetic dataset leads to notable improvements.
Anthology ID:
2026.eacl-short.19
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Vera Demberg, Kentaro Inui, Lluís Marquez
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
265–273
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-short.19/
Cite (ACL):
Shu Okabe, Dejvi Zelo, and Alexander Fraser. 2026. Optical Character Recognition for the International Phonetic Alphabet. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 2: Short Papers), pages 265–273, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Optical Character Recognition for the International Phonetic Alphabet (Okabe et al., EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-short.19.pdf
Checklist:
2026.eacl-short.19.checklist.pdf