Media monitoring and information extraction for the highly inflected agglutinative language Hungarian

Júlia Pajzs, Ralf Steinberger, Maud Ehrmann, Mohamed Ebrahim, Leonida Della Rocca, Stefano Bucci, Eszter Simon, Tamás Váradi


Abstract
The Europe Media Monitor (EMM) is a fully automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for 21 languages at present, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding Hungarian text mining tools to EMM for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge in dealing with Hungarian is its high degree of inflection and agglutination. We present several experiments in which we apply linguistically light-weight methods to deal with inflection, and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages, such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page http://emm.newsbrief.eu/overview.html.
Anthology ID:
L14-1381
Volume:
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)
Month:
May
Year:
2014
Address:
Reykjavik, Iceland
Venue:
LREC
Publisher:
European Language Resources Association (ELRA)
Pages:
2049–2056
URL:
http://www.lrec-conf.org/proceedings/lrec2014/pdf/449_Paper.pdf
Cite (ACL):
Júlia Pajzs, Ralf Steinberger, Maud Ehrmann, Mohamed Ebrahim, Leonida Della Rocca, Stefano Bucci, Eszter Simon, and Tamás Váradi. 2014. Media monitoring and information extraction for the highly inflected agglutinative language Hungarian. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14), pages 2049–2056, Reykjavik, Iceland. European Language Resources Association (ELRA).
Cite (Informal):
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian (Pajzs et al., LREC 2014)