Xinchen Yang
2026
Can You Make It Sound Like You? Post-Editing LLM-Generated Text for Personal Style
Connor Baumler | Calvin Bao | Huy Nghiem | Xinchen Yang | Marine Carpuat | Hal Daum\'e Iii
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Connor Baumler | Calvin Bao | Huy Nghiem | Xinchen Yang | Marine Carpuat | Hal Daum\'e Iii
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite the growing use of large language models (LLMs) for writing tasks, users may hesitate to rely on LLMs when personal style is important. Post-editing LLM-generated drafts or translations is a common collaborative writing strategy, but it remains unclear whether users can effectively reshape LLM-generated text to reflect their personal style. We conduct a pre-registered online study (n=81) in which participants post-edit LLM-generated drafts for writing tasks where personal style matters to them. Using embedding-based style similarity metrics, we find that post-editing increases stylistic similarity to participants’ unassisted writing and reduces similarity to fully LLM-generated output. However, post-edited text still remains stylistically closer in style to LLM text than to participants’ unassisted control text, and it exhibits reduced stylistic diversity compared to unassisted human text. We find a gap between perceived stylistic authenticity and model-measured stylistic similarity, with post-edited text often perceived as representative of participants’ personal style despite remaining detectable LLM stylistic traces.
2025
Evaluating Evaluation Metrics for Ancient Chinese to English Machine Translation
Eric R. Bennett | HyoJung Han | Xinchen Yang | Andrew Schonebaum | Marine Carpuat
Proceedings of the Second Workshop on Ancient Language Processing
Eric R. Bennett | HyoJung Han | Xinchen Yang | Andrew Schonebaum | Marine Carpuat
Proceedings of the Second Workshop on Ancient Language Processing
Evaluation metrics are an important driver of progress in Machine Translation (MT), but they have been primarily validated on high-resource modern languages. In this paper, we conduct an empirical evaluation of metrics commonly used to evaluate MT from Ancient Chinese into English. Using LLMs, we construct a contrastive test set, pairing high-quality MT and purposefully flawed MT of the same Pre-Qin texts. We then evaluate the ability of each metric to discriminate between accurate and flawed translations.