Cuaċ: Fast and Small Universal Representations of Corpora
John Philip McCrae, Bernardo Stearns, Alamgir Munir Qazi, Shubhanker Banerjee, Atul Kr. Ojha
Abstract
The increasing size and diversity of corpora in natural language processing requires highly efficient processing frameworks. Building on the universal corpus format, Teanga, we present Cuaċ, a format for the compact representation of corpora. We describe this methodology based on short-string compression and indexing techniques and show that the files created with this methodology are similar to compressed human-readable serializations and can be further compressed using lossless compression. We also show that this introduces no computational penalty on the time to process files. This methodology aims to speed up natural language processing pipelines and is the basis for a fast database system for corpora.
- Anthology ID:
- 2025.ldk-1.17
- Volume:
- Proceedings of the 5th Conference on Language, Data and Knowledge
- Month:
- September
- Year:
- 2025
- Address:
- Naples, Italy
- Editors:
- Mehwish Alam, Andon Tchechmedjiev, Jorge Gracia, Dagmar Gromann, Maria Pia di Buono, Johanna Monti, Maxim Ionov
- Venues:
- LDK | WS
- Publisher:
- Unior Press
- Pages:
- 153–161
- URL:
- https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.17/
- Cite (ACL):
- John Philip McCrae, Bernardo Stearns, Alamgir Munir Qazi, Shubhanker Banerjee, and Atul Kr. Ojha. 2025. Cuaċ: Fast and Small Universal Representations of Corpora. In Proceedings of the 5th Conference on Language, Data and Knowledge, pages 153–161, Naples, Italy. Unior Press.
- Cite (Informal):
- Cuaċ: Fast and Small Universal Representations of Corpora (McCrae et al., LDK 2025)
- PDF:
- https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.17.pdf