Mohamed Ebrahim


2014

pdf
Media monitoring and information extraction for the highly inflected agglutinative language Hungarian
Júlia Pajzs | Ralf Steinberger | Maud Ehrmann | Mohamed Ebrahim | Leonida Della Rocca | Stefano Bucci | Eszter Simon | Tamás Váradi
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The Europe Media Monitor (EMM) is a fully-automatic system that analyses written online news by gathering articles in over 70 languages and by applying text analysis software for currently 21 languages, without using linguistic tools such as parsers, part-of-speech taggers or morphological analysers. In this paper, we describe the effort of adding to EMM Hungarian text mining tools for news gathering; document categorisation; named entity recognition and classification for persons, organisations and locations; name lemmatisation; quotation recognition; and cross-lingual linking of related news clusters. The major challenge of dealing with the Hungarian language is its high degree of inflection and agglutination. We present several experiments where we apply linguistically light-weight methods to deal with inflection and we propose a method to overcome the challenges. We also present detailed frequency lists of Hungarian person and location name suffixes, as found in real-life news texts. This empirical data can be used to draw further conclusions and to improve existing Named Entity Recognition software. Within EMM, the solutions described here will also be applied to other morphologically complex languages such as those of the Slavic language family. The media monitoring and analysis system EMM is freely accessible online via the web page http://emm.newsbrief.eu/overview.html.

2012

pdf
JRC Eurovoc Indexer JEX - A freely available multi-label categorisation tool
Ralf Steinberger | Mohamed Ebrahim | Marco Turchi
Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12)

EuroVoc (2012) is a highly multilingual thesaurus consisting of over 6,700 hierarchically organised subject domains used by European Institutions and many authorities in Member States of the European Union (EU) for the classification and retrieval of official documents. JEX is JRC-developed multi-label classification software that learns from manually labelled data to automatically assign EuroVoc descriptors to new documents in a profile-based category-ranking task. The JEX release consists of trained classifiers for 22 official EU languages, of parallel training data in the same languages, of an interface that allows viewing and amending the assignment results, and of a module that allows users to re-train the tool on their own document collections. JEX allows advanced users to change the document representation so as to possibly improve the categorisation result through linguistic pre-processing. JEX can be used as a tool for interactive EuroVoc descriptor assignment to increase speed and consistency of the human categorisation process, or it can be used fully automatically. The output of JEX is a language-independent EuroVoc feature vector lending itself also as input to various other Language Technology tasks, including cross-lingual clustering and classification, cross-lingual plagiarism detection, sentence selection and ranking, and more.

2011

pdf
Creating Sentiment Dictionaries via Triangulation
Josef Steinberger | Polina Lenkova | Mohamed Ebrahim | Maud Ehrmann | Ali Hurriyetoglu | Mijail Kabadjov | Ralf Steinberger | Hristo Tanev | Vanni Zavarella | Silvia Vázquez
Proceedings of the 2nd Workshop on Computational Approaches to Subjectivity and Sentiment Analysis (WASSA 2.011)

pdf
Highly Multilingual Coreference Resolution Exploiting a Mature Entity Repository
Josef Steinberger | Jenya Belyaeva | Jonathan Crawley | Leonida Della-Rocca | Mohamed Ebrahim | Maud Ehrmann | Mijail Kabadjov | Ralf Steinberger | Erik van der Goot
Proceedings of the International Conference Recent Advances in Natural Language Processing 2011

2010

pdf
Adapting a resource-light highly multilingual Named Entity Recognition system to Arabic
Wajdi Zaghouani | Bruno Pouliquen | Mohamed Ebrahim | Ralf Steinberger
Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC'10)

We present a fully functional Arabic information extraction (IE) system that is used to analyze large volumes of news texts every day to extract the named entity (NE) types person, organization, location, date and number, as well as quotations (direct reported speech) by and about people. The Named Entity Recognition (NER) system was not developed for Arabic, but - instead - a highly multilingual, almost language-independent NER system was adapted to also cover Arabic. The Semitic language Arabic substantially differs from the Indo-European and Finno-Ugric languages currently covered. This paper thus describes what Arabic language-specific resources had to be developed and what changes needed to be made to the otherwise language-independent rule set in order to be applicable to the Arabic language. The achieved evaluation results are generally satisfactory, but could be improved for certain entity types. The results of the IE tools can be seen on the Arabic pages of the freely accessible Europe Media Monitor (EMM) application NewsExplorer, which can be found at http://press.jrc.it/overview.html.