Cuaċ: Fast and Small Universal Representations of Corpora

John Philip McCrae, Bernardo Stearns, Alamgir Munir Qazi, Shubhanker Banerjee, Atul Kr. Ojha


Abstract
The increasing size and diversity of corpora in natural language processing require highly efficient processing frameworks. Building on the universal corpus format Teanga, we present Cuaċ, a format for the compact representation of corpora. We describe this methodology, which is based on short-string compression and indexing techniques, and show that the files it produces are comparable in size to compressed human-readable serializations and can be compressed further with lossless compression. We also show that the format introduces no computational penalty in the time needed to process files. This methodology aims to speed up natural language processing pipelines and is the basis for a fast database system for corpora.
Anthology ID:
2025.ldk-1.17
Volume:
Proceedings of the 5th Conference on Language, Data and Knowledge
Month:
September
Year:
2025
Address:
Naples, Italy
Editors:
Mehwish Alam, Andon Tchechmedjiev, Jorge Gracia, Dagmar Gromann, Maria Pia di Buono, Johanna Monti, Maxim Ionov
Venues:
LDK | WS
Publisher:
Unior Press
Pages:
153–161
URL:
https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.17/
Cite (ACL):
John Philip McCrae, Bernardo Stearns, Alamgir Munir Qazi, Shubhanker Banerjee, and Atul Kr. Ojha. 2025. Cuaċ: Fast and Small Universal Representations of Corpora. In Proceedings of the 5th Conference on Language, Data and Knowledge, pages 153–161, Naples, Italy. Unior Press.
Cite (Informal):
Cuaċ: Fast and Small Universal Representations of Corpora (McCrae et al., LDK 2025)
PDF:
https://preview.aclanthology.org/ldl-25-ingestion/2025.ldk-1.17.pdf