Abstract
In this paper we present an approach to efficiently recover text from corrupted documents in endangered languages. Textual resources for such languages are scarce, and sometimes the few available resources are corrupted PDF documents. Endangered languages are not supported by standard tools, and they pose the additional difficulty of lacking any corpus on which to train language models to assist with the recovery. The approach presented is able to fully recover born-digital PDF documents with minimal effort, thereby helping the preservation of endangered languages by extending the range of documents usable for corpus building.
- Anthology ID:
- 2022.computel-1.10
- Volume:
- Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages
- Month:
- May
- Year:
- 2022
- Address:
- Dublin, Ireland
- Editors:
- Sarah Moeller, Antonios Anastasopoulos, Antti Arppe, Aditi Chaudhary, Atticus Harrigan, Josh Holden, Jordan Lachler, Alexis Palmer, Shruti Rijhwani, Lane Schwartz
- Venue:
- ComputEL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 78–82
- URL:
- https://aclanthology.org/2022.computel-1.10
- DOI:
- 10.18653/v1/2022.computel-1.10
- Cite (ACL):
- Nicolas Stefanovitch. 2022. Recovering Text from Endangered Languages Corrupted PDF documents. In Proceedings of the Fifth Workshop on the Use of Computational Methods in the Study of Endangered Languages, pages 78–82, Dublin, Ireland. Association for Computational Linguistics.
- Cite (Informal):
- Recovering Text from Endangered Languages Corrupted PDF documents (Stefanovitch, ComputEL 2022)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-3/2022.computel-1.10.pdf