Antti Kanner
Coding themes, frames, opinions and other attributes is widely used in the social sciences, and such coding also serves as a basis for building supervised text classifiers. Coding content requires considerable resources, and lately the process has been used especially for annotating training sets for machine learning models. Although objectivity is not always the goal of coding, uniform codings make it easier to build a machine learning model. Machine learning models are usually built by first defining an annotation scheme, which contains the category definitions and coding instructions. Multiple aspects are known to affect the annotation results, such as the annotation domain, the number of annotators, and the number of categories. In this article, we present further problems that we show to be related to the annotation results in our case study: the negated presence of a category, a low proportional presence of relevant content, and the implicit presence of a category. These problems should be resolved at the level of the scheme definition in all schemes. To extract these problem categories, we focus on a media research case with extensive data on both the annotation process and its results.
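The abstract does not name a specific agreement metric, but a minimal way to quantify how uniformly two coders apply a scheme, which is the concern raised above, is Cohen's kappa. The sketch below uses scikit-learn and invented category labels; it is an illustration, not the authors' procedure.

```python
# Hedged sketch: measuring coding uniformity between two annotators with
# Cohen's kappa. Categories and labels are invented for illustration.
from sklearn.metrics import cohen_kappa_score

# Codings of the same ten documents by two annotators (hypothetical data).
annotator_a = ["politics", "economy", "none", "politics", "none",
               "economy", "politics", "none", "economy", "politics"]
annotator_b = ["politics", "economy", "none", "economy", "none",
               "economy", "politics", "none", "none", "politics"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0 = chance level
```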
Web-scale corpora present valuable research opportunities but often lack detailed metadata, making them challenging to use in linguistics and social sciences. This study tackles this problem by exploring automatic methods to classify web corpora into specific categories, focusing on text registers such as Interactive Discussion and literary genres such as Politics and Social Sciences. We train two machine learning models to classify documents from the large web-crawled OSCAR dataset: a register classifier using the multilingual, manually annotated CORE corpus, and a genre classifier using a dataset based on Kindle US&UK. Fine-tuned from XLM-R Large, the register and genre classifiers achieved F1-scores of 0.74 and 0.70, respectively. Our analysis includes evaluating the distribution of the predicted text classes and examining the intersection of genre-register pairs using topic modelling. The results show expected combinations between certain registers and genres, such as the Lyrical register often aligning with the Literature & Fiction genre. However, most registers, such as Interactive Discussion, are divided across multiple genres, like Engineering & Transportation and Politics & Social Sciences, depending on the discussion topic. This enriched metadata provides valuable insights and supports new ways of studying digital cultural heritage.
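The abstract does not spell out the training setup; a minimal sketch of fine-tuning XLM-R Large as a single-label register classifier with the Hugging Face Transformers library might look as follows. The CSV file names, label count, and hyperparameters are illustrative assumptions, not the authors' configuration.

```python
# Hedged sketch: fine-tuning XLM-R Large for single-label register classification.
# File names, number of labels, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_NAME = "xlm-roberta-large"
NUM_REGISTERS = 8  # placeholder; CORE defines its own register taxonomy

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_REGISTERS)

# Hypothetical CSV files with "text" and integer "label" columns.
dataset = load_dataset("csv", data_files={"train": "core_train.csv",
                                          "validation": "core_dev.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)

args = TrainingArguments(
    output_dir="register-classifier",
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    num_train_epochs=3,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    data_collator=DataCollatorWithPadding(tokenizer=tokenizer),
)
trainer.train()
```

A genre classifier would follow the same pattern with the Kindle-based labels; predictions over OSCAR documents can then be cross-tabulated to study register-genre pairs.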
We approach the problem of quote recognition and attribution in Finnish news media. Solving this task would enable large-scale analysis of the media with respect to the presence and styles of presentation of different voices and opinions. We describe the annotation of a corpus of around 1500 media texts with quote attribution and coreference information. Further, we compare two methods for automatic quote recognition: a rule-based one operating on dependency trees and a machine learning one built on top of the BERT language model. We conclude that BERT provides more promising results even with little training data, achieving a 95% F-score on direct quote recognition and 84% on indirect quotes. Finally, we discuss open problems and further associated tasks, especially the need to resolve speaker mentions to entity references.
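The abstract does not detail how BERT is applied to quote recognition; one plausible formulation is BIO span tagging with a token-classification head, sketched below. The label scheme and the choice of FinBERT (TurkuNLP/bert-base-finnish-cased-v1) are assumptions for illustration, and the model would still need fine-tuning on the annotated corpus before its predictions are meaningful.

```python
# Hedged sketch: quote recognition framed as BIO span tagging with a Finnish
# BERT encoder. Label scheme and model choice are illustrative assumptions.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

LABELS = ["O", "B-DIRECT", "I-DIRECT", "B-INDIRECT", "I-INDIRECT"]
MODEL_NAME = "TurkuNLP/bert-base-finnish-cased-v1"  # a widely used Finnish BERT

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(LABELS))

sentence = "Pääministeri sanoi, että päätös tehdään ensi viikolla."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, seq_len, num_labels)

# The classification head is untrained here, so predictions are random until
# the model is fine-tuned on the quote-annotated corpus.
predicted = [LABELS[i] for i in logits.argmax(dim=-1)[0].tolist()]
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
print(list(zip(tokens, predicted)))
```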