Laura Weidinger




2024

STAR: SocioTechnical Approach to Red Teaming Language Models
Laura Weidinger | John F J Mellor | Bernat Guillén Pegueroles | Nahema Marchal | Ravin Kumar | Kristian Lum | Canfer Akbulut | Mark Diaz | A. Stevie Bergman | Mikel D. Rodriguez | Verena Rieser | William Isaac
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

This research introduces STAR, a sociotechnical framework that improves on current best practices for red teaming the safety of large language models. STAR makes two key contributions. First, it enhances steerability by generating parameterised instructions for human red teamers, leading to improved coverage of the risk surface; parameterised instructions also provide more detailed insights into model failures at no additional cost. Second, STAR improves signal quality by matching demographics to assess harms for specific groups, resulting in more sensitive annotations. STAR further employs a novel arbitration step to leverage diverse viewpoints and improve label reliability, treating disagreement not as noise but as a valuable contribution to signal quality.

2022

Accounting for Offensive Speech as a Practice of Resistance
Mark Diaz | Razvan Amironesei | Laura Weidinger | Iason Gabriel
Proceedings of the Sixth Workshop on Online Abuse and Harms (WOAH)

Tasks such as toxicity detection, hate speech detection, and online harassment detection have been developed to identify interactions involving offensive speech. In this work we articulate the need for a relational understanding of offensiveness to help distinguish denotatively offensive speech from offensive speech that serves as a mechanism through which marginalized communities resist oppressive social norms. Using examples from the queer community, we argue that evaluations of offensive speech must focus on the impacts of language use. We call this the cynic perspective: a characteristic of language, with roots in Cynic philosophy, that pertains to employing offensive speech as a practice of resistance. We also explore the degree to which NLP systems may encounter limits to modeling relational context.