Xinyu Xing


2025

UTF: Under-trained Tokens as Fingerprints —— a Novel Approach to LLM Identification
Jiacheng Cai | Jiahao Yu | Yangguang Shao | Yuhang Wu | Xinyu Xing
Proceedings of the First Workshop on LLM Security (LLMSEC)

Fingerprinting large language models (LLMs) is essential for verifying model ownership, ensuring authenticity, and preventing misuse. Traditional fingerprinting methods often require significant computational overhead or white-box verification access. In this paper, we introduce UTF, a novel and efficient approach to fingerprinting LLMs by leveraging under-trained tokens. Under-trained tokens are tokens that the model has not fully learned during its training phase. By utilizing these tokens, we perform supervised fine-tuning to embed specific input-output pairs into the model. This process allows the LLM to produce predetermined outputs when presented with certain inputs, effectively embedding a unique fingerprint. Our method adds minimal overhead, has little impact on the model’s performance, and does not require white-box access to the target model for ownership identification. Compared to existing fingerprinting methods, UTF is also more effective and more robust to fine-tuning and random guessing.

2021

Structure-Aware Pre-Training for Table-to-Text Generation
Xinyu Xing | Xiaojun Wan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2020

Automatic Generation of Citation Texts in Scholarly Papers: A Pilot Study
Xinyu Xing | Xiaosheng Fan | Xiaojun Wan
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we study the challenging problem of automatic generation of citation texts in scholarly papers. Given the context of a citing paper A and a cited paper B, the task aims to generate a short text to describe B in the given context of A. One big challenge in addressing this task is the lack of training data. Usually, explicit citation texts are easy to extract, but it is not easy to extract implicit citation texts from scholarly papers. We thus first train an implicit citation extraction model based on BERT and leverage the model to construct a large training dataset for the citation text generation task. Then we propose and train a multi-source pointer-generator network with a cross-attention mechanism for citation text generation. Empirical evaluation results on a manually labeled test dataset verify the efficacy of our model. This pilot study confirms the feasibility of automatically generating citation texts in scholarly papers, and the technique has great potential to help researchers prepare their scientific papers.

2019

Asking the Crowd: Question Analysis, Evaluation and Generation for Open Discussion on Online Forums
Zi Chai | Xinyu Xing | Xiaojun Wan | Bo Huang
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Teaching machines to ask questions is an important yet challenging task. Most prior work focused on generating questions with fixed answers. Because their content is tightly constrained by the given answers, these questions are often not worth discussing. In this paper, we take the first step toward teaching machines to ask open-answered questions from real-world news for open discussion (openQG). To generate high-quality questions, effective ways for question evaluation are required. We take the perspective that the more answers a question receives, the better it is for open discussion, and analyze how language use affects the number of answers. Compared with other factors, e.g., topic and post time, linguistic factors keep our evaluation from being domain-specific. We carefully perform variable control on 11.5M questions from online forums to get a dataset, OQRanD, and further perform question analysis. Based on these conclusions, several models are built for question evaluation. For the openQG task, we construct OQGenD, to our knowledge the first dataset for this task, and propose a model based on conditional generative adversarial networks and our question evaluation model. Experiments show that our model can generate questions with higher quality compared with commonly used text generation methods.