Alon Itai


How to construct a multi-lingual domain ontology
Nitsan Chrizman | Alon Itai
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

The research focuses on the automatic construction of multi-lingual domain ontologies, i.e., creating a DAG (directed acyclic graph) consisting of concepts relating to a specific domain and the relations between them. The example domain on which the research was performed is “Organized Crime”. The contribution of the work is the investigation of and comparison between several data sources and methods for creating multi-lingual ontologies. The first subtask was to extract the domain’s concepts; the best source turned out to be the Wikipedia articles under the corresponding category. The second task was to create an English ontology, i.e., the relationships between the concepts; again, the relationships between concepts and the hierarchy were derived from Wikipedia. The final task was to create an ontology for a language with far fewer resources (Hebrew). This was accomplished by deriving the concepts from the Hebrew Wikipedia and assessing their relevance and the relationships between them from the English ontology.
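The category-to-DAG step can be sketched as follows. This is a minimal illustration, not the paper's implementation: the category links are invented stand-ins (in practice they would come from Wikipedia's category structure), and the cycle check simply drops any link that would make the graph cyclic.

```python
# Hedged sketch: build a concept DAG from (parent, child) category links,
# skipping any edge that would introduce a cycle. The links below are
# hypothetical examples, not data from the paper.
from collections import defaultdict

def build_dag(edges):
    """Return parent -> set-of-children, keeping the graph acyclic."""
    children = defaultdict(set)

    def reaches(src, dst):
        # Depth-first search: is dst already reachable from src?
        stack, seen = [src], set()
        while stack:
            node = stack.pop()
            if node == dst:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(children[node])
        return False

    for parent, child in edges:
        if not reaches(child, parent):  # adding this edge keeps it a DAG
            children[parent].add(child)
    return dict(children)

# Illustrative category links (hypothetical):
links = [
    ("Organized crime", "Money laundering"),
    ("Organized crime", "Racketeering"),
    ("Money laundering", "Shell company"),
    ("Shell company", "Organized crime"),  # would close a cycle: dropped
]
dag = build_dag(links)
```

The last link is discarded because the earlier links already make "Organized crime" an ancestor of "Shell company".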


Using Movie Subtitles for Creating a Large-Scale Bilingual Corpora
Einav Itamar | Alon Itai
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

This paper presents a method for compiling a large-scale bilingual corpus from a database of movie subtitles. To create the corpus, we propose an algorithm based on Gale and Church’s sentence alignment algorithm (1993). However, our algorithm relies not only on character-length information but also on subtitle-timing information, which is encoded in the subtitle files. Timing is highly correlated between subtitles in different versions of the same movie, since subtitles that match should be displayed at the same time. However, the absolute time values cannot be used for alignment, since the timing is usually specified by frame numbers rather than by real time, and converting it to real time values is not always possible; hence we use normalized subtitle duration instead. This results in a significant reduction in the alignment error rate.
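The normalized-duration idea can be illustrated with a small sketch (not the paper's code): because different subtitle files for the same movie may use different frame rates, raw frame counts are incomparable, but each subtitle's duration as a fraction of the file's total duration is comparable across versions. The frame numbers below are invented for illustration.

```python
# Hedged sketch of duration normalization for subtitle alignment.
def normalized_durations(cues):
    """cues: list of (start_frame, end_frame) pairs for one subtitle file.
    Returns each cue's duration as a fraction of the total duration."""
    durations = [end - start for start, end in cues]
    total = sum(durations)
    return [d / total for d in durations]

# Two hypothetical versions of the same movie at different frame rates:
version_a = [(0, 50), (60, 110), (120, 220)]   # e.g., 25 fps
version_b = [(0, 60), (72, 132), (144, 264)]   # e.g., 30 fps
# Absolute frame values differ, but the normalized durations coincide,
# so they can serve as an alignment feature alongside character length.
```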


A Computational Lexicon of Contemporary Hebrew
Alon Itai | Shuly Wintner | Shlomo Yona
Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC’06)

Computational lexicons are among the most important resources for natural language processing (NLP). Their importance is even greater in languages with rich morphology, where the lexicon is expected to provide morphological analyzers with enough information to enable them to correctly process intricately inflected forms. We describe the Haifa Lexicon of Contemporary Hebrew, the broadest-coverage publicly available lexicon of Modern Hebrew, currently consisting of over 20,000 entries. While other lexical resources of Modern Hebrew have been developed in the past, this is the first publicly available large-scale lexicon of the language. In addition to supporting morphological processors (analyzers and generators), which was our primary objective, the lexicon is used as a research tool in Hebrew lexicography and lexical semantics. It is open for browsing on the web and several search tools and interfaces were developed which facilitate on-line access to its information. The lexicon is currently used for a variety of NLP applications.


A corpus based morphological analyzer for unvocalized modern Hebrew
Alon Itai | Erel Segal
Workshop on Machine Translation for Semitic languages: issues and approaches

Most words in Modern Hebrew texts are morphologically ambiguous. We describe a method for finding the correct morphological analysis of each word in a Modern Hebrew text. The program first uses a small tagged corpus to estimate the probability of each possible analysis of each word regardless of its context and chooses the most probable analysis. It then applies automatically learned rules to correct the analysis of each word according to its neighbors. Finally, it uses a simple syntactical analyzer to further correct the analysis, thus combining statistical methods with rule-based syntactic analysis. It is shown that this combination greatly improves the accuracy of the morphological analysis—achieving up to 96.2% accuracy.
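The first, context-free stage described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's program: it estimates the probability of each analysis of a word from a small tagged corpus and picks the most probable one; the "corpus" and the word forms are invented, and the later rule-based and syntactic correction stages are not shown.

```python
# Hedged sketch: per-word most-probable morphological analysis,
# estimated from a small tagged corpus (toy data, for illustration only).
from collections import Counter, defaultdict

def train(tagged_corpus):
    """tagged_corpus: iterable of (word, analysis) pairs."""
    counts = defaultdict(Counter)
    for word, analysis in tagged_corpus:
        counts[word][analysis] += 1
    return counts

def best_analysis(counts, word):
    """Return the most frequent analysis of word, or None if unseen."""
    if word not in counts:
        return None  # unseen words need separate handling
    return counts[word].most_common(1)[0][0]

# Hypothetical ambiguous word seen twice as a noun, once as a verb:
toy_corpus = [("spr", "noun"), ("spr", "noun"), ("spr", "verb")]
counts = train(toy_corpus)
```

In the paper, this context-free choice is then corrected by automatically learned context rules and a simple syntactic analyzer.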


Learning Morpho-Lexical Probabilities from an Untagged Corpus with an Application to Hebrew
Moshe Levinger | Uzzi Ornan | Alon Itai
Computational Linguistics, Volume 21, Number 3, September 1995


Word Sense Disambiguation Using a Second Language Monolingual Corpus
Ido Dagan | Alon Itai
Computational Linguistics, Volume 20, Number 4, December 1994


Two Languages Are More Informative Than One
Ido Dagan | Alon Itai | Ulrike Schwall
29th Annual Meeting of the Association for Computational Linguistics


Automatic Processing of Large Corpora for the Resolution of Anaphora References
Ido Dagan | Alon Itai
COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics