StRuCom: A Novel Dataset of Structured Code Comments in Russian

Maria Dziuba; Valentin Malykh

Abstract

Structured code comments in docstring format are essential for code comprehension and maintenance, but existing machine learning models for generating them perform poorly for Russian compared to English. To bridge this gap, we present StRuCom, the first large-scale dataset (153K examples) specifically designed for Russian code documentation. Unlike machine-translated English datasets, which distort terminology (e.g., technical loanwords vs. literal translations) and docstring structures, StRuCom combines human-written comments from Russian GitHub repositories with synthetically generated ones, ensuring compliance with Python, Java, JavaScript, C#, and Go standards through automated validation.

Anthology ID: 2025.acl-srw.34
Volume: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month: July
Year: 2025
Address: Vienna, Austria
Editors: Jin Zhao, Mingyang Wang, Zhu Liu
Venues: ACL | WS
Publisher: Association for Computational Linguistics
Pages: 517–527
URL: https://preview.aclanthology.org/landing_page/2025.acl-srw.34/
Cite (ACL): Maria Dziuba and Valentin Malykh. 2025. StRuCom: A Novel Dataset of Structured Code Comments in Russian. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 517–527, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal): StRuCom: A Novel Dataset of Structured Code Comments in Russian (Dziuba & Malykh, ACL 2025)
PDF: https://preview.aclanthology.org/landing_page/2025.acl-srw.34.pdf
