Recovering Text from Endangered Languages Corrupted PDF documents

Nicolas Stefanovitch


Abstract
In this paper we present an approach to efficiently recover text from corrupted documents in endangered languages. Textual resources for such languages are scarce, and sometimes the few available resources are corrupted PDF documents. Endangered languages are not supported by standard tools, and they pose the additional difficulty of having no corpus on which to train language models to assist with the recovery. The approach presented is able to fully recover born-digital PDF documents with minimal effort, thereby helping the preservation of endangered languages by extending the range of documents usable for corpus building.
Anthology ID:
2022.computel-1.10
Volume:
Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
Month:
May
Year:
2022
Address:
Dublin, Ireland
Editors:
Sarah Moeller, Antonios Anastasopoulos, Antti Arppe, Aditi Chaudhary, Atticus Harrigan, Josh Holden, Jordan Lachler, Alexis Palmer, Shruti Rijhwani, Lane Schwartz
Venue:
ComputEL
Publisher:
Association for Computational Linguistics
Pages:
78–82
URL:
https://aclanthology.org/2022.computel-1.10
DOI:
10.18653/v1/2022.computel-1.10
Cite (ACL):
Nicolas Stefanovitch. 2022. Recovering Text from Endangered Languages Corrupted PDF documents. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 78–82, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal):
Recovering Text from Endangered Languages Corrupted PDF documents (Stefanovitch, ComputEL 2022)
PDF:
https://preview.aclanthology.org/nschneid-patch-3/2022.computel-1.10.pdf
Video:
https://preview.aclanthology.org/nschneid-patch-3/2022.computel-1.10.mp4