Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess

Yikuan Xia; Jiazun Chen; Sujian Li (李素建); Jun Gao

Abstract

The wide use of abbreviated column names (derived from English words or Chinese Pinyin) in database tables poses significant challenges for table-centric tasks in natural language processing and database management. The task of expanding such column names, referred to as NameGuess, has previously been addressed by fine-tuning Large Language Models (LLMs) on synthetic, rule-generated data. However, current approaches yield suboptimal performance due to two fundamental limitations: 1) the rule-generated abbreviation data fails to reflect the real-world distribution, and 2) LLMs persistently fail to follow the rule-sensitive patterns in NameGuess. To address the data realism issue, we propose a novel approach that integrates a subsequence abbreviation generator trained on human-annotated data and collects non-subsequence abbreviations to improve the training set. To address the rule violation issue, we propose a decoding system constrained by an automaton that represents the rules of abbreviation expansion. We extend the original English NameGuess test set to include non-subsequence and Pinyin scenarios. Experimental results show that properly tuned moderate-size LLMs (7B/8B parameters) with a refined decoding system can surpass the few-shot performance of state-of-the-art LLMs, such as the GPT-4 series. The code and data are presented in the supplementary material.
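To make the "subsequence" rule mentioned above concrete, the following is a minimal sketch of one way to check whether a candidate expansion is consistent with an abbreviation, assuming the rule is simply that the abbreviation's letters appear in order within the expansion. The function names `is_subsequence` and `filter_candidates` are hypothetical and chosen for illustration; this is not the authors' automaton-constrained decoder, only a post-hoc filter expressing the same kind of constraint.

```python
def is_subsequence(abbr: str, expansion: str) -> bool:
    """Return True if the letters of `abbr` appear, in order, in `expansion`.

    Assumption for this sketch: case and spaces in the expansion are ignored.
    """
    letters = iter(expansion.lower().replace(" ", ""))
    # `ch in letters` advances the iterator, so matches must occur in order.
    return all(ch in letters for ch in abbr.lower())


def filter_candidates(abbr: str, candidates: list[str]) -> list[str]:
    """Keep only the candidate expansions that satisfy the subsequence rule."""
    return [c for c in candidates if is_subsequence(abbr, c)]


if __name__ == "__main__":
    print(filter_candidates("cust", ["customer", "count", "custom status"]))
```

A filter like this only applies to subsequence abbreviations; the non-subsequence cases described in the abstract would need a separate lookup or a richer rule set.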

Anthology ID: 2025.emnlp-main.357
Volume: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: EMNLP
Publisher: Association for Computational Linguistics
Pages: 7001–7018
URL: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.357/
Cite (ACL): Yikuan Xia, Jiazun Chen, Sujian Li, and Jun Gao. 2025. Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 7001–7018, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): Realistic Training Data Generation and Rule Enhanced Decoding in LLM for NameGuess (Xia et al., EMNLP 2025)
PDF: https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.357.pdf
Checklist: 2025.emnlp-main.357.checklist.pdf
