SuperCAT: The (New and Improved) Corpus Analysis Toolkit
K. Bretonnel Cohen, William A. Baumgartner Jr., Irina Temnikova
Abstract
This paper reports SuperCAT, a corpus analysis toolkit. It is a radical extension of SubCAT, the Sublanguage Corpus Analysis Toolkit, from sublanguage analysis to corpus analysis in general. The idea behind SuperCAT is that representative corpora have no tendency towards closure; that is, they tend towards infinity. In contrast, non-representative corpora have a tendency towards closure (roughly, finiteness). SuperCAT focuses on general techniques for the quantitative description of the characteristics of any corpus (or other language sample), particularly concerning the characteristics of lexical distributions. Additionally, SuperCAT features a complete re-engineering of the previous SubCAT architecture.
- Anthology ID:
- L16-1442
- Volume:
- Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
- Month:
- May
- Year:
- 2016
- Address:
- Portorož, Slovenia
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 2784–2788
- URL:
- https://aclanthology.org/L16-1442
- Cite (ACL):
- K. Bretonnel Cohen, William A. Baumgartner Jr., and Irina Temnikova. 2016. SuperCAT: The (New and Improved) Corpus Analysis Toolkit. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 2784–2788, Portorož, Slovenia. European Language Resources Association (ELRA).
- Cite (Informal):
- SuperCAT: The (New and Improved) Corpus Analysis Toolkit (Cohen et al., LREC 2016)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/L16-1442.pdf
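
The abstract's notion of closure can be illustrated with a vocabulary growth curve: count how many distinct word types have been seen as more tokens are read. This is a minimal sketch of one common way to look for closure in a lexical distribution, not SuperCAT's own implementation. The function names `vocabulary_growth` and `new_types_per_window`, the `step` parameter, lowercasing as the only normalization, and the toy token lists are all assumptions made for illustration.

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens_seen, distinct_types_seen) pairs, sampled every `step` tokens.

    Hypothetical helper: tokens are assumed to be pre-tokenized strings,
    and lowercasing is the only normalization applied.
    """
    seen = set()
    curve = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token.lower())
        if count % step == 0:
            curve.append((count, len(seen)))
    # Record the final partial window, if there is one.
    if count and count % step != 0:
        curve.append((count, len(seen)))
    return curve


def new_types_per_window(curve):
    """Return how many previously unseen types each window contributed."""
    previous = 0
    growth = []
    for _, types in curve:
        growth.append(types - previous)
        previous = types
    return growth


if __name__ == "__main__":
    # Toy data (assumed): a sample that keeps reusing three words,
    # and a sample in which every token is a new word.
    closed_sample = ["a", "b", "c"] * 4
    open_sample = [f"w{i}" for i in range(12)]

    print(new_types_per_window(vocabulary_growth(closed_sample, step=3)))
    # prints [3, 0, 0, 0]
    print(new_types_per_window(vocabulary_growth(open_sample, step=3)))
    # prints [3, 3, 3, 3]
```

In this sketch, a sample whose per-window count of new types falls to zero shows a tendency towards closure, while a sample that keeps contributing new types does not. Real corpora fall between these two toy extremes, so the shape of the curve matters more than any single value.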