The NewSoMe Corpus: A Unifying Opinion Annotation Framework across Genres and in Multiple Languages

Roser Saurí, Judith Domingo, Toni Badia


Abstract
We present the NewSoMe (News and Social Media) Corpus, a set of subcorpora with annotations on opinion expressions across genres (news reports, blogs, product reviews and tweets) and covering multiple languages (English, Spanish, Catalan and Portuguese). NewSoMe is the result of an effort to increase the opinion corpus resources available in languages other than English, and to build a unifying annotation framework for analyzing opinion in different genres, including controlled text, such as news reports, as well as different types of user-generated content (UGC). Given the broad design of the resource, most of the annotation effort was carried out using crowdsourcing platforms: Amazon Mechanical Turk and CrowdFlower. This created an excellent opportunity to investigate the feasibility of crowdsourcing methods for annotating large amounts of text in different languages.
Anthology ID:
L14-1306
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2229–2236
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/350_Paper.pdf
Cite (ACL):
Roser Saurí, Judith Domingo, and Toni Badia. 2014. The NewSoMe Corpus: A Unifying Opinion Annotation Framework across Genres and in Multiple Languages. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2229–2236, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
The NewSoMe Corpus: A Unifying Opinion Annotation Framework across Genres and in Multiple Languages (Saurí et al., LREC 2014)
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/350_Paper.pdf