@inproceedings{lucas-havens-2023-gpts,
    title = "{GPT}s Don{'}t Keep Secrets: Searching for Backdoor Watermark Triggers in Autoregressive Language Models",
    author = "Lucas, Evan  and
      Havens, Timothy",
    editor = "Ovalle, Anaelia  and
      Chang, Kai-Wei  and
      Mehrabi, Ninareh  and
      Pruksachatkun, Yada  and
      Galstyan, Aram  and
      Dhamala, Jwala  and
      Verma, Apurv  and
      Cao, Trista  and
      Kumar, Anoop  and
      Gupta, Rahul",
    booktitle = "Proceedings of the 3rd Workshop on Trustworthy Natural Language Processing (TrustNLP 2023)",
    month = jul,
    year = "2023",
    address = "Toronto, Canada",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2023.trustnlp-1.21/",
    doi = "10.18653/v1/2023.trustnlp-1.21",
    pages = "242--248",
    abstract = "This work analyzes backdoor watermarks in an autoregressive transformer fine-tuned to perform a generative sequence-to-sequence task, specifically summarization. We propose and demonstrate an attack to identify trigger words or phrases by analyzing open ended generations from autoregressive models that have backdoor watermarks inserted. It is shown in our work that triggers based on random common words are easier to identify than those based on single, rare tokens. The attack proposed is easy to implement and only requires access to the model weights. Code used to create the backdoor watermarked models and analyze their outputs is shared at [github link to be inserted for camera ready version]."
}