Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains

Junghwan Kim, Haotian Zhang, David Jurgens


Abstract
Authorship representation (AR) learning, which models an author’s unique writing style, has demonstrated strong performance in authorship attribution tasks. However, prior research has primarily focused on monolingual settings—mostly in English—leaving the potential benefits of multilingual AR models underexplored. We introduce a novel method for multilingual AR learning that incorporates two key innovations: probabilistic content masking, which encourages the model to focus on stylistically indicative words rather than content-specific words, and language-aware batching, which improves contrastive learning by reducing cross-lingual interference. Our model is trained on over 4.5 million authors across 36 languages and 13 domains. It consistently outperforms monolingual baselines in 21 out of 22 non-English languages, achieving an average Recall@8 improvement of 4.85%, with a maximum gain of 15.91% in a single language. Furthermore, it exhibits stronger cross-lingual and cross-domain generalization compared to a monolingual model trained solely on English. Our analysis confirms the effectiveness of both proposed techniques, highlighting their critical roles in the model’s improved performance.
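The abstract names probabilistic content masking only at a high level; this page carries no implementation details. Below is a minimal Python sketch of one plausible realization, assuming the masking probability scales with a token's inverse document frequency so that rare, topical words are hidden more often than common function words. The function name `content_mask`, the IDF-based formula, and the `max_p` cap are illustrative assumptions, not the paper's specification.

```python
# Minimal sketch of probabilistic content masking (assumed IDF-based
# scheme; the paper's exact formulation is not given on this page).
# Rare, content-bearing tokens are replaced with a mask symbol with
# high probability, pushing the encoder toward stylistic cues such as
# function words and punctuation.
import math
import random

MASK = "<mask>"

def content_mask(tokens, doc_freq, n_docs, max_p=0.8, seed=None):
    """Mask each token with probability proportional to its IDF.

    tokens   : list[str], the tokenized document
    doc_freq : dict token -> number of training documents containing it
    n_docs   : total number of training documents (assumed > 1)
    max_p    : masking probability assigned to the rarest tokens
    """
    rng = random.Random(seed)
    max_idf = math.log(n_docs)  # IDF of a token seen in one document
    masked = []
    for tok in tokens:
        idf = math.log(n_docs / (1 + doc_freq.get(tok, 0)))
        p = max_p * max(idf, 0.0) / max_idf  # frequent words -> p near 0
        masked.append(MASK if rng.random() < p else tok)
    return masked

# Frequent function words ("the", "of") are almost never masked, while
# topical words ("entanglement", "qubits") are masked most of the time.
toks = "the entanglement of the qubits was measured".split()
df = {"the": 900, "of": 800, "was": 700, "measured": 40,
      "entanglement": 3, "qubits": 5}
print(content_mask(toks, df, n_docs=1000, seed=0))
```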
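Language-aware batching is likewise only named in the abstract. The sketch below assumes it means constructing monolingual contrastive batches, so that in-batch negatives differ in authorship rather than trivially in language; the grouping strategy and the field names (`lang`, `author_id`) are hypothetical.

```python
# Minimal sketch of language-aware batching (assumed grouping strategy;
# illustrative, not the paper's implementation). Each contrastive batch
# is monolingual, preventing the model from separating negatives by
# language identity alone and reducing cross-lingual interference.
import random
from collections import defaultdict

def language_aware_batches(examples, batch_size, seed=0):
    """Group examples (dicts with 'text', 'author_id', 'lang' keys)
    into shuffled monolingual batches."""
    rng = random.Random(seed)
    by_lang = defaultdict(list)
    for ex in examples:
        by_lang[ex["lang"]].append(ex)

    batches = []
    for exs in by_lang.values():
        rng.shuffle(exs)
        for i in range(0, len(exs), batch_size):
            batch = exs[i : i + batch_size]
            if len(batch) == batch_size:  # drop ragged final batch
                batches.append(batch)
    rng.shuffle(batches)  # interleave languages across training steps
    return batches
```

Dropping ragged tail batches keeps the in-batch negative count constant, which matters for contrastive objectives such as InfoNCE.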
Anthology ID:
2025.emnlp-main.1766
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
34855–34880
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1766/
Cite (ACL):
Junghwan Kim, Haotian Zhang, and David Jurgens. 2025. Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34855–34880, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Leveraging Multilingual Training for Authorship Representation: Enhancing Generalization across Languages and Domains (Kim et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1766.pdf
Checklist:
2025.emnlp-main.1766.checklist.pdf