Jack Halpern

2026

A Comprehensive Full-Form Lexicon for Arabic NLP and Speech Technology
Yannis Haralambous | Jack Halpern
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Natural Language Processing (NLP) applications require morphological data with precise grammatical attributes, while speech technology requires abundant phonemic and phonetic data. This presents a challenge for Arabic due to its extensive morphological, orthographic, and phonemic ambiguity in both Modern Standard Arabic (MSA) and its various dialects. Existing systems struggle with incomplete and unstructured web data, leading to suboptimal performance in both morphological analysis and speech applications. This paper presents ArabLEX, a full-form lexicon (one that includes all wordforms, i.e., the fully inflected/cliticized members of a lexeme class) that addresses these issues by providing a large-scale database designed to enhance NLP accuracy. It comprises approximately 570 million entries with fully inflected forms and detailed morphological, phonetic, and orthographic attributes. ArabLEX serves as a foundational framework for developing comprehensive Arabic lexical resources for NLP, particularly for speech technology, as well as dialect databases.

2018

Very Large-Scale Lexical Resources to Enhance Chinese and Japanese Machine Translation
Jack Halpern
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2016

Linguistic Issues in the Machine Transliteration of Chinese, Japanese and Arabic Names
Jack Halpern
Proceedings of the Sixth Named Entity Workshop

2008

Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants
Jack Halpern
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The orthographical complexities of Chinese, Japanese, Korean (CJK) and Arabic pose a special challenge to developers of NLP applications. These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography and the ambiguities of the Arabic script. This paper focuses on CJK and Arabic orthographic variation and provides a brief analysis of the linguistic issues. The basic premise is that statistical methods by themselves are inadequate, and that linguistic knowledge supported by large-scale lexical databases should play a central role in achieving high accuracy in disambiguating and normalizing orthographic variants.

1999

The pitfalls and complexities of Chinese to Chinese conversion
Jack Halpern | Jouni Kerman
Proceedings of Machine Translation Summit VII
