Junjie Xing

2025

Language models such as GPT and Llama have shown remarkable ability on diverse natural language tasks, yet their performance on complex table tasks (e.g., NL-to-Code and data cleaning) remains suboptimal. Improving it often requires task-specific fine-tuning, which in turn requires expensive human labeling and is prone to over-fitting. In this work, we propose Table-Specialist, a new self-trained fine-tuning paradigm specifically designed for table tasks. Our insight is that for each table task, there often exist two dual versions of the same task, one generative and one classification in nature. Leveraging their duality, we propose a Generator-Validator paradigm that iteratively generates and then validates training data from language models, in order to fine-tune stronger Table-Specialist models that specialize in a given task without using manually labeled data. Extensive evaluations of Table-Specialist on Llama, GPT-3.5 and GPT-4 suggest that Table-Specialist has (1) **strong performance** on diverse table tasks over vanilla language models (for example, Table-Specialist fine-tuned on GPT-3.5 not only outperforms vanilla GPT-3.5, but can often match or surpass GPT-4 level quality); (2) **lower cost** to deploy, because when Table-Specialist fine-tuned on GPT-3.5 achieves GPT-4 level quality, it becomes possible to deploy smaller models with lower latency and cost at comparable quality; and (3) **better generalizability** when evaluated across multiple benchmarks, since Table-Specialist is fine-tuned on a broad range of training data systematically generated from diverse real tables. Our code is available at [microsoft/Table-Specialist](https://github.com/microsoft/Table-Specialist). Specialist models fine-tuned using Table-Specialist have been integrated into Microsoft Excel for use cases such as automated table data cleaning.

2018

Adaptive Multi-Task Transfer Learning for Chinese Word Segmentation in Medical Text
Junjie Xing | Kenny Zhu | Shaodian Zhang
Proceedings of the 27th International Conference on Computational Linguistics

Chinese word segmentation (CWS) models trained on open-source corpora face a dramatic performance drop when dealing with domain-specific text, especially in domains with many special terms and diverse writing styles, such as the biomedical domain. However, building a domain-specific CWS model incurs extremely high annotation cost. In this paper, we propose an approach that exploits domain-invariant knowledge transferred from high-resource to low-resource domains. Extensive experiments show that our model achieves consistently higher accuracy than single-task CWS and other transfer learning baselines, especially when there is a large disparity between source and target domains.

Co-authors

Shaodian Zhang 1

Mengyu Zhou 1

Kenny Zhu 1

Venues

COLING 1

EMNLP 1
