KOJAK: A New Corpus for Studying German Discourse Particle ja

Adil Soubki, Owen Rambow, Chong Kang


Abstract
In German, ja can be used as a discourse particle to indicate that a proposition, according to the speaker, is believed by both the speaker and the audience. We use this observation to create KoJaK, a distantly labeled English dataset derived from Europarl for studying when a speaker believes a statement to be common ground. We then analyze this corpus to identify lexical choices in English that correspond to German ja. Finally, we perform experiments on the dataset to predict whether an English clause corresponds to a German clause containing ja, and achieve an F-measure of 75.3% on a balanced test corpus.
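The distant-labeling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names (`distant_label`, `f_measure`), the token-level regular expression, and the toy sentence pairs are all assumptions made for illustration. In particular, a plain token match also catches ja used as an answer particle ("yes"), which a real pipeline would presumably need to filter out.

```python
import re

# Assumption: "ja" is detected as a standalone token, case-insensitively.
# This does not distinguish the discourse particle from the answer particle.
JA_TOKEN = re.compile(r"\bja\b", re.IGNORECASE)


def distant_label(pairs):
    """Label each English clause 1 if its aligned German clause contains
    the token "ja", else 0. `pairs` is a list of (german, english) tuples."""
    return [(en, 1 if JA_TOKEN.search(de) else 0) for de, en in pairs]


def f_measure(gold, pred):
    """Balanced F-measure (F1) for the positive class, given 0/1 labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical aligned clause pairs, invented for this example
# (not taken from Europarl or from KoJaK).
toy_pairs = [
    ("Das ist ja bekannt.", "That is, after all, well known."),
    ("Das Jahr war schwierig.", "The year was difficult."),
]

labeled = distant_label(toy_pairs)
```

Here the labels come only from the German side, so the English clauses need no manual annotation; a classifier trained on the English text alone could then be scored with `f_measure` against these distant labels.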
Anthology ID:
2022.codi-1.1
Volume:
Proceedings of the 3rd Workshop on Computational Approaches to Discourse
Month:
October
Year:
2022
Address:
Gyeongju, Republic of Korea and Online
Editors:
Chloe Braud, Christian Hardmeier, Junyi Jessy Li, Sharid Loaiciga, Michael Strube, Amir Zeldes
Venue:
CODI
Publisher:
International Conference on Computational Linguistics
Pages:
1–6
URL:
https://aclanthology.org/2022.codi-1.1
Cite (ACL):
Adil Soubki, Owen Rambow, and Chong Kang. 2022. KOJAK: A New Corpus for Studying German Discourse Particle ja. In Proceedings of the 3rd Workshop on Computational Approaches to Discourse, pages 1–6, Gyeongju, Republic of Korea and Online. International Conference on Computational Linguistics.
Cite (Informal):
KOJAK: A New Corpus for Studying German Discourse Particle ja (Soubki et al., CODI 2022)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2022.codi-1.1.pdf