Rethinking Cross-Subject Data Splitting for Brain-to-Text Decoding

Congchi Yin, Qian Yu, Zhiwei Fang, Changping Peng, Piji Li


Abstract
Recent studies have reached major milestones in reconstructing natural language from non-invasive brain signals (e.g. functional Magnetic Resonance Imaging (fMRI) and Electroencephalogram (EEG)) across subjects. However, we find that current dataset splitting strategies for cross-subject brain-to-text decoding are incorrect. Specifically, we demonstrate that all current splitting methods suffer from a data leakage problem, in which validation and test data leak into the training set, resulting in significant overfitting and overestimation of decoding models. In this study, we develop a correct cross-subject data splitting criterion, free of data leakage, for decoding fMRI and EEG signals to text. Several state-of-the-art (SOTA) brain-to-text decoding models are re-evaluated under the proposed criterion to support further research.
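To make the notion of leakage concrete, the following is a minimal sketch, not the authors' criterion. It assumes one way leakage can arise in cross-subject data, which the abstract does not spell out: several subjects are exposed to the same text stimuli, so a split made purely by subject can place the same sentence in both the training and held-out sets. The `Sample` record, the `subject` and `text_id` fields, and all function names are hypothetical and exist only for illustration.

```python
from typing import NamedTuple


class Sample(NamedTuple):
    subject: str  # hypothetical subject identifier
    text_id: int  # hypothetical identifier of the stimulus sentence


def shared_texts(train, held_out):
    """Return the text ids that occur in both the training and held-out samples."""
    return {s.text_id for s in train} & {s.text_id for s in held_out}


def split_by_subject(samples, held_out_subjects):
    """Hold out every sample recorded from the given subjects."""
    train = [s for s in samples if s.subject not in held_out_subjects]
    held_out = [s for s in samples if s.subject in held_out_subjects]
    return train, held_out


def split_by_text(samples, held_out_texts):
    """Hold out every sample whose stimulus sentence is in the given set."""
    train = [s for s in samples if s.text_id not in held_out_texts]
    held_out = [s for s in samples if s.text_id in held_out_texts]
    return train, held_out


# Toy data (assumption): 3 subjects, each exposed to the same 4 sentences.
samples = [Sample(f"S{i}", t) for i in range(3) for t in range(4)]

subj_train, subj_test = split_by_subject(samples, {"S2"})
text_train, text_test = split_by_text(samples, {3})

# Splitting by subject: every held-out sentence was also seen in training.
print(sorted(shared_texts(subj_train, subj_test)))
# Splitting by text: no held-out sentence appears in training.
print(sorted(shared_texts(text_train, text_test)))
```

The overlap check covers only the text side of a split. Whether the proposed criterion imposes further constraints, for example on the brain signals themselves, is described in the paper rather than in this sketch.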
Anthology ID:
2025.emnlp-main.289
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
5686–5700
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.289/
Cite (ACL):
Congchi Yin, Qian Yu, Zhiwei Fang, Changping Peng, and Piji Li. 2025. Rethinking Cross-Subject Data Splitting for Brain-to-Text Decoding. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5686–5700, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Rethinking Cross-Subject Data Splitting for Brain-to-Text Decoding (Yin et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.289.pdf
Checklist:
 2025.emnlp-main.289.checklist.pdf