Jack Halpern
2026
A Comprehensive Full-Form Lexicon for Arabic NLP and Speech Technology
Yannis Haralambous | Jack Halpern
Proceedings of the Fifteenth Language Resources and Evaluation Conference
Natural Language Processing (NLP) applications require morphological data with precise grammatical attributes, while speech technology requires abundant phonemic and phonetic data. This is a challenge for Arabic because of the extensive morphological, orthographic, and phonemic ambiguity in both Modern Standard Arabic (MSA) and the various Arabic dialects. Existing systems struggle with incomplete and unstructured web data, leading to suboptimal performance in both morphological analysis and speech applications. This paper presents ArabLEX, a full-form lexicon (one that includes all wordforms, i.e., the fully inflected and cliticized members of a lexeme class) that addresses these issues by providing a large-scale database designed to enhance NLP accuracy. It comprises approximately 570 million entries with fully inflected forms and detailed morphological, phonetic, and orthographic attributes. ArabLEX serves as a foundational framework for developing comprehensive Arabic lexical resources for NLP, particularly for speech technology, as well as dialect databases.
2018
Very Large-Scale Lexical Resources to Enhance Chinese and Japanese Machine Translation
Jack Halpern
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)
2016
Linguistic Issues in the Machine Transliteration of Chinese, Japanese and Arabic Names
Jack Halpern
Proceedings of the Sixth Named Entity Workshop
2008
Exploiting Lexical Resources for Disambiguating CJK and Arabic Orthographic Variants
Jack Halpern
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)
The orthographic complexities of Chinese, Japanese, and Korean (CJK) and of Arabic pose a special challenge to developers of NLP applications. These difficulties are exacerbated by the lack of a standardized orthography in these languages, especially the highly irregular Japanese orthography and the ambiguities of the Arabic script. This paper focuses on CJK and Arabic orthographic variation and provides a brief analysis of the linguistic issues. The basic premise is that statistical methods by themselves are inadequate, and that linguistic knowledge supported by large-scale lexical databases should play a central role in achieving high accuracy in disambiguating and normalizing orthographic variants.
2006
The Role of Lexical Resources in CJK Natural Language Processing
Jack Halpern
Proceedings of the Fifth SIGHAN Workshop on Chinese Language Processing
The Role of Lexical Resources in CJK Natural Language Processing
Jack Halpern
Proceedings of the Workshop on Multilingual Language Resources and Interoperability
2002
Lexicon-based Orthographic Disambiguation in CJK Intelligent Information Retrieval
Jack Halpern
COLING-02: The 3rd Workshop on Asian Language Resources and International Standardization