Weijie Zhao
Fixing paper assignments
Please select all papers that belong to the same person.
Indicate below which author they should be assigned to.
Large Language Models (LLMs) heavily rely on high-quality training data, making data valuation crucial for optimizing model performance, especially when working within a limited budget. In this work, we aim to offer a third-party data valuation approach that benefits both data providers and model developers. We introduce a linearized future influence kernel (LinFiK), which assesses the value of individual data samples in improving LLM performance during training. We further propose ALinFiK, a learning strategy to approximate LinFiK, enabling scalable data valuation. Our comprehensive evaluations demonstrate that this approach surpasses existing baselines in effectiveness and efficiency, with significant scalability advantages as LLM parameter counts increase.
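As a rough illustration of the kind of gradient-based data valuation this abstract describes, the sketch below scores toy training samples by how well their per-sample gradients align with a validation-loss gradient. This is an assumed first-order proxy for illustration only, not the paper's LinFiK/ALinFiK formulation; the toy linear model and all names are hypothetical.

```python
# Illustrative stand-in for gradient-based data valuation (NOT the LinFiK/ALinFiK
# method itself): score each training sample by the dot product between its
# gradient and the validation-loss gradient, a common first-order influence proxy.
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data: 100 training samples, 20 validation samples, 5 features.
X_train, y_train = rng.normal(size=(100, 5)), rng.normal(size=100)
X_val, y_val = rng.normal(size=(20, 5)), rng.normal(size=20)
w = rng.normal(size=5)  # current model parameters

def per_sample_grads(X, y, w):
    # Gradient of the squared error 0.5*(x.w - y)^2 w.r.t. w, one row per sample.
    residuals = X @ w - y                 # shape (n,)
    return residuals[:, None] * X         # shape (n, d)

# Validation gradient: the direction in which we want the loss to move.
g_val = per_sample_grads(X_val, y_val, w).mean(axis=0)

# Value of each training sample: alignment of its gradient with the validation
# gradient. Samples whose update also reduces validation loss score highest.
scores = per_sample_grads(X_train, y_train, w) @ g_val

top_k = np.argsort(scores)[::-1][:10]
print("10 most valuable training samples under this proxy:", top_k)
```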
Large Language Models (LLMs) have revolutionized natural language processing, but their widespread use has raised significant copyright concerns. This tutorial addresses the complex intersection of LLMs and copyright law, providing researchers and practitioners with essential knowledge and tools to navigate this challenging landscape. The tutorial begins with an overview of relevant copyright principles and their application to AI, followed by an examination of specific copyright issues in LLM development and deployment. A key focus will be on technical approaches to copyright risk assessment and mitigation in LLMs. We will introduce benchmarks for evaluating copyright-related risks, including memorization detection and probing techniques. The tutorial will then cover practical mitigation strategies, such as machine unlearning, efficient fine-tuning methods, and alignment approaches to reduce copyright infringement risks. Ethical considerations and future directions in copyright-aware AI development will also be discussed.
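The probe below is a hedged illustration of one technique the tutorial mentions, verbatim memorization detection: prompt a model with a document prefix and measure how much of the held-out suffix it reproduces word for word. The split point, the overlap metric, and the stub generator are assumptions made for this sketch, not a specific benchmark from the tutorial.

```python
# Minimal memorization probe: feed a prefix, compare the continuation to the
# held-out suffix. The stub generate function below stands in for a real LLM call.
from difflib import SequenceMatcher

SAMPLE_DOC = ("It is a truth universally acknowledged, that a single man in "
              "possession of a good fortune, must be in want of a wife. " * 10)

def memorization_score(document, generate_fn, prefix_chars=200):
    """Prompt with a prefix and return the fraction of the held-out suffix
    that comes back verbatim (0.0 = nothing, 1.0 = fully reproduced)."""
    prefix, suffix = document[:prefix_chars], document[prefix_chars:]
    continuation = generate_fn(prefix, max_chars=len(suffix))
    match = SequenceMatcher(None, continuation, suffix).find_longest_match(
        0, len(continuation), 0, len(suffix))
    return match.size / max(len(suffix), 1)

# Stub standing in for an actual model; it regurgitates the source, so it
# scores 1.0. A non-memorizing model would score close to 0.
def copying_model(prompt, max_chars):
    return SAMPLE_DOC[len(prompt):len(prompt) + max_chars]

print(memorization_score(SAMPLE_DOC, copying_model))  # -> 1.0
```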
Given a Large Language Model (LLM) generation, how can we identify which training data led to this generation? In this paper, we propose RapidIn, a scalable framework adapted to LLMs for estimating the influence of each training data sample. The proposed framework consists of two stages: caching and retrieval. First, we compress the gradient vectors by over 200,000x, allowing them to be cached on disk or in GPU/CPU memory. Then, given a generation, RapidIn efficiently traverses the cached gradients to estimate the influence within minutes, achieving over a 6,326x speedup. Moreover, RapidIn supports multi-GPU parallelization to substantially accelerate caching and retrieval. Our empirical results confirm the efficiency and effectiveness of RapidIn.
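A minimal sketch of the cache-then-retrieve workflow is shown below, assuming random projection as a stand-in for the gradient-compression step; RapidIn's actual compression scheme and influence estimator are not specified in the abstract, so treat this only as an outline of the two-stage idea with toy dimensions.

```python
# Two-stage sketch: (1) compress and cache per-sample training gradients once,
# (2) at query time, compress the generation's gradient and score cached samples
# by inner product as a first-order influence proxy. Random projection is an
# assumed compression method here, not necessarily RapidIn's.
import numpy as np

rng = np.random.default_rng(0)
FULL_DIM, COMPRESSED_DIM = 10_000, 64   # toy sizes; real LLM gradients are far larger

# Shared projection matrix used for both caching and retrieval.
projection = rng.normal(size=(FULL_DIM, COMPRESSED_DIM)) / np.sqrt(COMPRESSED_DIM)

# Stage 1: caching. Compress each training sample's gradient and store it.
train_grads = rng.normal(size=(500, FULL_DIM))   # stand-in per-sample gradients
cached = train_grads @ projection                # shape (500, 64), kept on disk/GPU/CPU

# Stage 2: retrieval. Compress the query gradient and score every cached sample.
query_grad = rng.normal(size=FULL_DIM)
scores = cached @ (query_grad @ projection)

print("Most influential cached training samples:", np.argsort(scores)[::-1][:5])
```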