Haeji Jung
2026
Happiness is Sharing a Vocabulary: A Study of Transliteration Methods
Haeji Jung | Jinju Kim | Kyungjin Kim | Youjeong Roh | David R. Mortensen
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
Transliteration has emerged as a promising means to bridge the gap between languages in multilingual NLP, with strong results especially for languages using non-Latin scripts. We investigate the degree to which shared script, overlapping token vocabularies, and shared phonology contribute to the performance of multilingual models. To this end, we conduct controlled experiments using three kinds of transliteration (romanization, phonemic transcription, and substitution ciphers) as well as orthography. We evaluate each model on three downstream tasks: named entity recognition (NER), part-of-speech tagging (POS), and natural language inference (NLI). We find that romanization significantly outperforms other input types in 7 out of 8 evaluation settings, largely consistent with our hypothesis that it is the most effective approach. We further analyze how each factor contributed to this success, and suggest that having longer (subword) tokens shared with pre-trained languages leads to better utilization of the model.
2024
Zero-Shot Cross-Lingual NER Using Phonemic Representations for Low-Resource Languages
Jimin Sohn | Haeji Jung | Alex Cheng | Jooeon Kang | Yilin Du | David R Mortensen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Existing zero-shot cross-lingual NER approaches require substantial prior knowledge of the target language, which is impractical for low-resource languages. In this paper, we propose a novel approach to NER using phonemic representation based on the International Phonetic Alphabet (IPA) to bridge the gap between representations of different languages. Our experiments show that our method significantly outperforms baseline models in extremely low-resource languages, with the highest average F1 score (46.38%) and lowest standard deviation (12.67), particularly demonstrating its robustness with non-Latin scripts. Our code is available at https://github.com/Gabriel819/zeroshot_ner.git
Mitigating the Linguistic Gap with Phonemic Representations for Robust Cross-lingual Transfer
Haeji Jung | Changdae Oh | Jooeon Kang | Jimin Sohn | Kyungwoo Song | Jinkyu Kim | David R Mortensen
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Approaches to improving multilingual language understanding often struggle with significant performance gaps between high-resource and low-resource languages. While there are efforts to align the languages in a single latent space to mitigate such gaps, how different input-level representations influence such gaps has not been investigated, particularly with phonemic inputs. We hypothesize that the performance gaps are affected by representation discrepancies between those languages, and revisit the use of phonemic representations as a means to mitigate these discrepancies. To demonstrate the effectiveness of phonemic representations, we present experiments on three representative cross-lingual tasks on 12 languages in total. The results show that phonemic representations exhibit higher similarities between languages compared to orthographic representations, and that they consistently outperform the grapheme-based baseline model on languages that are relatively low-resourced. We present quantitative evidence from three cross-lingual tasks that demonstrates the effectiveness of phonemic representations, which is further justified by a theoretical analysis of the cross-lingual performance gap.
2023
Visually-Situated Natural Language Understanding with Contrastive Reading Model and Frozen Large Language Models
Geewook Kim | Hodong Lee | Daehee Kim | Haeji Jung | Sanghee Park | Yoonsik Kim | Sangdoo Yun | Taeho Kil | Bado Lee | Seunghyun Park
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Recent advances in Large Language Models (LLMs) have stimulated a surge of research aimed at extending their applications to the visual domain. While these models exhibit promise in generating abstract image captions and facilitating natural conversations, their performance on text-rich images still requires improvement. In this paper, we introduce Contrastive Reading Model (Cream), a novel neural architecture designed to enhance the language-image understanding capability of LLMs by capturing intricate details that are often overlooked in existing methods. Cream combines vision and auxiliary encoders, fortified by a contrastive feature alignment technique, to achieve a more effective comprehension of language information in visually situated contexts within the images. Our approach bridges the gap between vision and language understanding, paving the way for the development of more sophisticated Document Intelligence Assistants. Through rigorous evaluations across diverse visually-situated language understanding tasks that demand reasoning capabilities, we demonstrate the compelling performance of Cream, positioning it as a prominent model in the field of visual document understanding. We provide our codebase and newly-generated datasets at https://github.com/naver-ai/cream.