Utilizing constituent structure for compound analysis

Kristín Bjarnadóttir, Jón Daðason


Abstract
Compounding is extremely productive in Icelandic and multi-word compounds are common. The likelihood of finding previously unseen compounds in texts is thus very high, which makes out-of-vocabulary words a problem in the use of NLP tools. The tool de-scribed in this paper splits Icelandic compounds and shows their binary constituent structure. The probability of a constituent in an unknown (or unanalysed) compound forming a combined constituent with either of its neighbours is estimated, with the use of data on the constituent structure of over 240 thousand compounds from the Database of Modern Icelandic Inflection, and word frequencies from Íslenskur orðasjóður, a corpus of approx. 550 million words. Thus, the structure of an unknown compound is derived by com-parison with compounds with partially the same constituents and similar structure in the training data. The granularity of the split re-turned by the decompounder is important in tasks such as semantic analysis or machine translation, where a flat (non-structured) se-quence of constituents is insufficient.
Anthology ID:
L14-1720
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
SIG:
Publisher:
European Language Resources Association (ELRA)
Note:
Pages:
Language:
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/954_Paper.pdf
DOI:
Bibkey:
Cite (ACL):
Kristín Bjarnadóttir and Jón Daðason. 2014. Utilizing constituent structure for compound analysis. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Utilizing constituent structure for compound analysis (Bjarnadóttir & Daðason, LREC 2014)
Copy Citation:
PDF:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/954_Paper.pdf