Multilingual and Cross-Lingual Complex Word Identification
Seid Muhie Yimam, Sanja Štajner, Martin Riedl, Chris Biemann
Abstract
Complex Word Identification (CWI) is an important task in lexical simplification and text accessibility. Due to the lack of CWI datasets, previous work largely depends on Simple English Wikipedia and edit histories for obtaining ‘gold standard’ annotations, which are of doubtful quality and limited to English. We collect complex words/phrases (CP) for English, German and Spanish, annotated by both native and non-native speakers, and propose language-independent features that can be used to train multilingual and cross-lingual CWI models. We show that the performance of cross-lingual CWI systems (using a model trained on one language and applying it to the other languages) is comparable to the performance of monolingual CWI systems.
- Anthology ID:
- R17-1104
- Volume:
- Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017
- Month:
- September
- Year:
- 2017
- Address:
- Varna, Bulgaria
- Editors:
- Ruslan Mitkov, Galia Angelova
- Venue:
- RANLP
- Publisher:
- INCOMA Ltd.
- Pages:
- 813–822
- URL:
- https://doi.org/10.26615/978-954-452-049-6_104
- DOI:
- 10.26615/978-954-452-049-6_104
- Cite (ACL):
- Seid Muhie Yimam, Sanja Štajner, Martin Riedl, and Chris Biemann. 2017. Multilingual and Cross-Lingual Complex Word Identification. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, pages 813–822, Varna, Bulgaria. INCOMA Ltd.
- Cite (Informal):
- Multilingual and Cross-Lingual Complex Word Identification (Yimam et al., RANLP 2017)
- PDF:
- https://doi.org/10.26615/978-954-452-049-6_104