Junghwan Kim
Papers on this page may belong to the following people: Junghwan Kim
2025
Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains
Junghwan Kim | Haotian Zhang | David Jurgens
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Junghwan Kim | Haotian Zhang | David Jurgens
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Authorship representation (AR) learning, which models an author’s unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings—mostly in English—leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model’s improved performance.
2024
ABLE: Agency-BeLiefs Embedding to Address Stereotypical Bias through Awareness Instead of Obliviousness
Michelle YoungJin Kim | Junghwan Kim | Kristen Johnson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Michelle YoungJin Kim | Junghwan Kim | Kristen Johnson
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Natural Language Processing (NLP) models tend to inherit and amplify stereotypical biases present in their training data, leading to harmful societal consequences. Current efforts to rectify these biases typically revolve around making models oblivious to bias, which is at odds with the idea that humans require increased awareness to tackle these biases better. This prompts a fundamental research question: are bias-oblivious models the only viable solution to combat stereotypical biases? This paper answers this question by proposing the Agency-BeLiefs Embedding (ABLE) model, a novel approach that actively encodes stereotypical biases into the embedding space. ABLE draws upon social psychological theory to acquire and represent stereotypical biases in the form of agency and belief scores rather than directly representing stereotyped groups. Our experimental results showcase ABLE’s effectiveness in learning agency and belief stereotypes while preserving the language model’s proficiency. Furthermore, we underscore the practical significance of incorporating stereotypes within the ABLE model by demonstrating its utility in various downstream tasks. Our approach exemplifies the potential benefits of addressing bias through awareness, as opposed to the prevailing approach of mitigating bias through obliviousness.
2023
Race, Gender, and Age Biases in Biomedical Masked Language Models
Michelle YoungJin Kim | Junghwan Kim | Kristen Marie Johnson
Findings of the Association for Computational Linguistics: ACL 2023
Michelle YoungJin Kim | Junghwan Kim | Kristen Marie Johnson
Findings of the Association for Computational Linguistics: ACL 2023
Biases cause discrepancies in healthcare services. Race, gender, and age of a patient affect interactions with physicians and the medical treatments one receives. These biases in clinical practices can be amplified following the release of pre-trained language models trained on biomedical corpora. To bring awareness to such repercussions, we examine social biases present in the biomedical masked language models. We curate prompts based on evidence-based practice and compare generated diagnoses based on biases. For a case study, we measure bias in diagnosing coronary artery disease and using cardiovascular procedures based on bias. Our study demonstrates that biomedical models are less biased than BERT in gender, while the opposite is true for race and age.