Le Tian


2025

POINTS-Reader: Distillation-Free Adaptation of Vision-Language Models for Document Conversion
Yuan Liu | Zhongyin Zhao | Le Tian | Haicheng Wang | Xubing Ye | Yangxiu You | Zilin Yu | Chuhan Wu | Zhou Xiao | Yang Yu | Jie Zhou
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

High-quality labeled data is essential for training accurate document conversion models, particularly in domains with complex formats such as tables, formulas, and multi-column text. However, manual annotation is both costly and time-consuming, while automatic labeling using existing models often lacks accuracy in handling such challenging scenarios. Consequently, training student models by distilling outputs from teacher models can significantly limit their performance in real-world applications. In this paper, we propose a fully automated, distillation-free framework comprising two stages for constructing high-quality document extraction datasets and models capable of handling diverse document formats and layouts. In the first stage, we introduce a method for generating large-scale, diverse synthetic data, which enables a model to extract key elements in a unified format with strong initial performance. In the second stage, we present a self-improvement approach that further adapts the model, initially trained on synthetic data, to real-world documents. Specifically, we first use the fine-tuned model to annotate real documents, then apply a suite of filtering strategies to verify annotation quality, and finally retrain the model on the verified dataset. By iteratively repeating this process, we progressively enhance both the model’s conversion capabilities and the quality of the generated data. We train a public POINTS-1.5 model to obtain POINTS-Reader, which surpasses many existing public and proprietary models of comparable or larger size. Our model will be made publicly available.

2022

Dual Context-Guided Continuous Prompt Tuning for Few-Shot Learning
Jie Zhou | Le Tian | Houjin Yu | Zhou Xiao | Hui Su | Jie Zhou
Findings of the Association for Computational Linguistics: ACL 2022

The prompt-based paradigm has shown competitive performance in many NLP tasks. However, its success depends heavily on prompt design, and its effectiveness varies with the model and training data. In this paper, we propose a novel dual context-guided continuous prompt (DCCP) tuning method. To explore the rich contextual information in language structure and close the gap between discrete prompt tuning and continuous prompt tuning, DCCP introduces two auxiliary training objectives and constructs input in a pair-wise fashion. Experimental results demonstrate that our method is applicable to many NLP tasks and can often outperform existing prompt tuning methods by a large margin in the few-shot setting.

2016

Deep LSTM based Feature Mapping for Query Classification
Yangyang Shi | Kaisheng Yao | Le Tian | Daxin Jiang
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

2013

Latent Semantic Tensor Indexing for Community-based Question Answering
Xipeng Qiu | Le Tian | Xuanjing Huang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)