Abstract
Web corpora are often constructed automatically, and their contents are therefore often not well understood. One technique for assessing the composition of such a web corpus is to empirically measure its similarity to a reference corpus whose composition is known. In this paper we evaluate a number of measures of corpus similarity, including a method based on topic modelling which has not been previously evaluated for this task. To evaluate these methods we use known-similarity corpora that have been previously used for this purpose, as well as a number of newly-constructed known-similarity corpora targeting differences in genre, topic, time, and region. Our findings indicate that, overall, the topic modelling approach did not improve on a chi-square method that had previously been found to work well for measuring corpus similarity.

 - Anthology ID: L16-1042
 - Volume: Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)
 - Month: May
 - Year: 2016
 - Address: Portorož, Slovenia
 - Editors: Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, Stelios Piperidis
 - Venue: LREC
 - Publisher: European Language Resources Association (ELRA)
 - Pages: 273–279
 - URL: https://aclanthology.org/L16-1042
 - Cite (ACL): Richard Fothergill, Paul Cook, and Timothy Baldwin. 2016. Evaluating a Topic Modelling Approach to Measuring Corpus Similarity. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16), pages 273–279, Portorož, Slovenia. European Language Resources Association (ELRA).
 - Cite (Informal): Evaluating a Topic Modelling Approach to Measuring Corpus Similarity (Fothergill et al., LREC 2016)
 - PDF: https://preview.aclanthology.org/ingest-acl-2023-videos/L16-1042.pdf