Jimin Sun


2023

A Multi-dimensional Evaluation of Tokenizer-free Multilingual Pretrained Models
Jimin Sun | Patrick Fernandes | Xinyi Wang | Graham Neubig
Findings of the Association for Computational Linguistics: EACL 2023

Recent work on tokenizer-free multilingual pretrained models shows promising results in improving cross-lingual transfer and reducing engineering overhead compared to subword-based alternatives. However, previous work mainly focuses on reporting accuracy on a limited set of tasks and data settings, placing less emphasis on other factors that matter when tuning and deploying the models in practice, such as memory usage, inference speed, and finetuning data efficiency. We attempt to fill this gap by performing a comprehensive empirical comparison of multilingual tokenizer-free and subword-based models across these dimensions. Surprisingly, we find that subword-based models might still be the most practical choice in many settings, achieving better performance with lower inference latency and memory usage. Based on these results, we encourage future work in tokenizer-free methods to consider these factors when designing and evaluating new models.

2021

Cross-Cultural Similarity Features for Cross-Lingual Transfer Learning of Pragmatically Motivated Tasks
Jimin Sun | Hwijeen Ahn | Chan Young Park | Yulia Tsvetkov | David R. Mortensen
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

Much work in cross-lingual transfer learning has explored how to select better transfer languages for multilingual tasks, primarily focusing on typological and genealogical similarities between languages. We hypothesize that these measures of linguistic proximity are not enough when working with pragmatically motivated tasks, such as sentiment analysis. As an alternative, we introduce three linguistic features that capture cross-cultural similarities that manifest in linguistic patterns and quantify distinct aspects of language pragmatics: language context-level, figurative language, and the lexification of emotion concepts. Our analyses show that the proposed pragmatic features do capture cross-cultural similarities and align well with existing work in sociolinguistics and linguistic anthropology. We further corroborate the effectiveness of pragmatically driven transfer in the downstream task of choosing transfer languages for cross-lingual sentiment analysis.

Kakao Enterprise’s WMT21 Machine Translation Using Terminologies Task Submission
Yunju Bak | Jimin Sun | Jay Kim | Sungwon Lyu | Changmin Lee
Proceedings of the Sixth Conference on Machine Translation

This paper describes Kakao Enterprise’s submission to the WMT21 shared Machine Translation using Terminologies task. We integrate terminology constraints by pre-training with target lemma annotations and fine-tuning with exact target annotations utilizing the given terminology dataset. This approach yields a model that achieves outstanding results in terms of both translation quality and term consistency, ranking first based on COMET in the En→Fr language direction. Furthermore, we explore various methods such as back-translation, explicitly training terminologies as additional parallel data, and in-domain data selection.

2020

NLPDove at SemEval-2020 Task 12: Improving Offensive Language Detection with Cross-lingual Transfer
Hwijeen Ahn | Jimin Sun | Chan Young Park | Jungyun Seo
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This paper describes our approach to the task of identifying offensive language in a multilingual setting. We investigate two data augmentation strategies: using additional semi-supervised labels with different thresholds and cross-lingual transfer with data selection. Leveraging the semi-supervised dataset resulted in performance improvements compared to the baseline trained solely with the manually annotated dataset. We propose a new metric, Translation Embedding Distance, to measure the transferability of instances for cross-lingual data selection. We also introduce various preprocessing steps tailored for social media text along with methods to fine-tune the pre-trained multilingual BERT (mBERT) for offensive language identification. Our multilingual systems achieved competitive results in Greek, Danish, and Turkish at OffensEval 2020.