Cross-Domain Detection of GPT-2-Generated Technical Text
Juan Diego Rodriguez, Todd Hay, David Gros, Zain Shamsi, Ravi Srinivasan
Abstract
Machine-generated text presents a potential threat not only to the public sphere, but also to the scientific enterprise, whereby genuine research is undermined by convincing, synthetic text. In this paper we examine the problem of detecting GPT-2-generated technical research text. We first consider the realistic scenario where the defender does not have full information about the adversary’s text generation pipeline, but is able to label small amounts of in-domain genuine and synthetic text in order to adapt to the target distribution. Even in the extreme scenario of adapting a physics-domain detector to a biomedical detector, we find that only a few hundred labels are sufficient for good performance. Finally, we show that paragraph-level detectors can be used to detect the tampering of full-length documents under a variety of threat models.
- Anthology ID:
- 2022.naacl-main.88
- Volume:
- Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
- Month:
- July
- Year:
- 2022
- Address:
- Seattle, United States
- Editors:
- Marine Carpuat, Marie-Catherine de Marneffe, Ivan Vladimir Meza Ruiz
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1213–1233
- URL:
- https://aclanthology.org/2022.naacl-main.88
- DOI:
- 10.18653/v1/2022.naacl-main.88
- Cite (ACL):
- Juan Diego Rodriguez, Todd Hay, David Gros, Zain Shamsi, and Ravi Srinivasan. 2022. Cross-Domain Detection of GPT-2-Generated Technical Text. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1213–1233, Seattle, United States. Association for Computational Linguistics.
- Cite (Informal):
- Cross-Domain Detection of GPT-2-Generated Technical Text (Rodriguez et al., NAACL 2022)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/2022.naacl-main.88.pdf
- Code:
- ciads-ut/cross-domain-detection-gpt-2
- Data:
- S2ORC, Semantic Scholar, WebText