Zitong Zhao
2025
WASA: WAtermark-based Source Attribution for Large Language Model-Generated Data
Xinyang Lu | Jingtan Wang | Zitong Zhao | Zhongxiang Dai | Chuan-Sheng Foo | See-Kiong Ng | Bryan Kian Hsiang Low
Findings of the Association for Computational Linguistics: ACL 2025
The impressive performance of Large Language Models (LLMs) and their immense potential for commercialization have given rise to serious concerns over the Intellectual Property (IP) of their training data. In particular, the synthetic texts generated by LLMs may infringe the IP of the data used to train them. To this end, it is imperative to be able to perform source attribution by identifying the data provider who contributed to the generation of a synthetic text by an LLM. In this paper, we show that this problem can be tackled by watermarking, i.e., by enabling an LLM to generate synthetic texts with embedded watermarks that contain information about their source(s). We identify the key properties of such watermarking frameworks (e.g., source attribution accuracy, robustness against adversaries) and propose a source attribution framework whose algorithmic designs satisfy these key properties. Our framework enables an LLM to learn an accurate mapping from generated texts to data providers, which sets the foundation for effective source attribution. Extensive empirical evaluations show that our framework achieves effective source attribution.
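The abstract's core idea, embedding a source-identifying watermark in each provider's texts so that generated outputs can be attributed back, can be illustrated with a toy sketch. This is not the paper's actual WASA algorithm; the encoding scheme, the `provider_watermark`, `embed`, and `attribute` helpers, and the use of zero-width Unicode characters are all hypothetical simplifications for illustration only:

```python
# Toy illustration: assign each data provider a unique watermark made of
# invisible (zero-width) Unicode characters, embed it in that provider's
# texts, and attribute a text by detecting which watermark it contains.

INVISIBLE_CHARS = ["\u200b", "\u200c", "\u200d", "\u2060"]  # zero-width chars

def provider_watermark(provider_id: int, length: int = 8) -> str:
    """Encode a provider id as a fixed-length string of invisible characters
    (base-4 digits mapped to zero-width code points)."""
    chars = []
    for _ in range(length):
        chars.append(INVISIBLE_CHARS[provider_id % len(INVISIBLE_CHARS)])
        provider_id //= len(INVISIBLE_CHARS)
    return "".join(chars)

def embed(text: str, provider_id: int) -> str:
    """Append the provider's watermark to a text (invisible when rendered)."""
    return text + provider_watermark(provider_id)

def attribute(text: str, num_providers: int):
    """Return the id of the provider whose watermark appears in the text,
    or None if no known watermark is found."""
    for pid in range(num_providers):
        if provider_watermark(pid) in text:
            return pid
    return None
```

A real framework must additionally survive paraphrasing and adversarial edits and scale to many providers, which is what the paper's learned mapping from generated texts to data providers addresses; this sketch only shows the attribution-by-watermark principle.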
Uncovering Scaling Laws for Large Language Models via Inverse Problems
Arun Verma | Zhaoxuan Wu | Zijian Zhou | Xiaoqiang Lin | Zhiliang Chen | Rachael Hwee Ling Sim | Rui Qiao | Jingtan Wang | Nhung Bui | Xinyuan Niu | Wenyang Hu | Gregory Kang Ruey Lau | Zi-Yu Khoo | Zitong Zhao | Xinyi Xu | Apivich Hemachandra | See-Kiong Ng | Bryan Kian Hsiang Low
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) are large-scale pretrained models that have achieved remarkable success across diverse domains. These successes have been driven by unprecedented complexity and scale in both data and computation. However, due to the high costs of training such models, brute-force trial-and-error approaches to improving LLMs are not feasible. Inspired by the success of inverse problems in uncovering fundamental scientific laws, this position paper advocates that inverse problems can also efficiently uncover scaling laws that guide the building of LLMs to achieve desirable performance with significantly better cost-effectiveness.