Abstract
Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been previously evaluated for this task. To evaluate these methods we use known-similarity corpora that have been previously used for this purpose, as well as a number of newly-constructed known-similarity corpora targeting differences in genre, topic, time, and region. Our findings indicate that, overall, the topic modelling approach did not improve on a chi-square method that had previously been found to work well for measuring corpus similarity.
- Anthology ID:
- L16-1042
- Volume:
- Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
- Month:
- May
- Year:
- 2016
- Address:
- Portorož, Slovenia
- Editors:
- Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
- Venue:
- LREC
- Publisher:
- European Language Resources Association (ELRA)
- Pages:
- 273–279
- URL:
- https://preview.aclanthology.org/icon-24-ingestion/L16-1042/
- Cite (ACL):
- Richard Fothergill, Paul Cook, and Timothy Baldwin. 2016. Evaluating a Topic Modelling Approach to Measuring Corpus Similarity. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 273–279, Portorož, Slovenia. European Language Resources Association (ELRA).
- Cite (Informal):
- Evaluating a Topic Modelling Approach to Measuring Corpus Similarity (Fothergill et al., LREC 2016)
- PDF:
- https://preview.aclanthology.org/icon-24-ingestion/L16-1042.pdf