Shilin Zhang


2024

pdf
LLM-as-a-Coauthor: Can Mixed Human-Written and Machine-Generated Text Be Detected?
Qihui Zhang | Chujie Gao | Dongping Chen | Yue Huang | Yixin Huang | Zhenyang Sun | Shilin Zhang | Weiye Li | Zhengyan Fu | Yao Wan | Lichao Sun
Findings of the Association for Computational Linguistics: NAACL 2024

With the rapid development and widespread application of Large Language Models (LLMs), the use of Machine-Generated Text (MGT) has become increasingly common, bringing with it potential risks, especially in terms of quality and integrity in fields like news, education, and science. Current research mainly focuses on purely MGT detection, without adequately addressing mixed scenarios including AI-revised Human-Written Text (HWT) or human-revised MGT. To tackle this challenge, we define mixtext, a form of mixed text involving both AI and human-generated content. Then we introduce MixSet, the first dataset dedicated to studying these mixtext scenarios. Leveraging MixSet, we executed comprehensive experiments to assess the efficacy of prevalent MGT detectors in handling mixtext situations, evaluating their performance in terms of effectiveness, robustness, and generalization. Our findings reveal that existing detectors struggle to identify mixtext, particularly in dealing with subtle modifications and style adaptability. This research underscores the urgent need for more fine-grain detectors tailored for mixtext, offering valuable insights for future research. Code and Models are available at https://github.com/Dongping-Chen/MixSet.

pdf
PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference
Dongjie Yang | Xiaodong Han | Yan Gao | Yao Hu | Shilin Zhang | Hai Zhao
Findings of the Association for Computational Linguistics ACL 2024

Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference, hindering their scalability for real-time applications like chatbots. To accelerate inference, we store computed keys and values (KV cache) in the GPU memory. Existing methods study the KV cache compression to reduce memory by pruning the pre-computed KV cache. However, they neglect the inter-layer dependency between layers and huge memory consumption in pre-computation. To explore these deficiencies, we find that the number of crucial keys and values that influence future generations decreases layer by layer and we can extract them by the consistency in attention weights. Based on the findings, we propose PyramidInfer, a method that compresses the KV cache by layer-wise retaining crucial context. PyramidInfer saves significant memory by computing fewer keys and values without sacrificing performance. Experimental results show PyramidInfer improves 2.2x throughput compared to Accelerate with over 54% GPU memory reduction in KV cache.