Anna Sotnikova


2025

pdf bib
Challenges for AI in Multimodal STEM Assessments: a Human-AI Comparison
Aymeric de Chillaz | Anna Sotnikova | Patrick Jermann | Antoine Bosselut
Proceedings of the 20th Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2025)

Generative AI systems have rapidly advanced, with multimodal input capabilities enabling reasoning beyond text-based tasks. In education, these advancements could influence assessment design and question answering, presenting both opportunities and challenges. To investigate these effects, we introduce a high-quality dataset of 201 university-level STEM questions, manually annotated with features such as image type, role, problem complexity, and question format. Our study analyzes how these features affect generative AI performance compared to students. We evaluate four model families with five prompting strategies, comparing results to the average of 546 student responses per question. Although the best model correctly answers on average 58.5% of the questions using majority vote aggregation, human participants consistently outperform AI on questions involving visual components. Interestingly, human performance remains stable across question features but varies by subject, whereas AI performance is susceptible to both subject matter and question features. Finally, we provide actionable insights for educators, demonstrating how question design can enhance academic integrity by leveraging features that challenge current AI systems without increasing the cognitive burden for students

pdf bib
Multilingual Large Language Models Leak Human Stereotypes across Language Boundaries
Yang Trista Cao | Anna Sotnikova | Jieyu Zhao | Linda X. Zou | Rachel Rudinger | Hal Daumé III
Proceedings of the Fourth Workshop on NLP for Positive Impact (NLP4PI)

Multilingual large language models have gained prominence for their proficiency in processing and generating text across languages. Like their monolingual counterparts, multilingual models are likely to pick up on stereotypes and other social biases during training. In this paper, we study a phenomenon we term “stereotype leakage”, which refers to how training a model multilingually may lead to stereotypes expressed in one language showing up in the models’ behavior in another. We propose a measurement framework for stereotype leakage and investigate its effect in English, Russian, Chinese, and Hindi and with GPT-3.5, mT5, and mBERT. Our findings show a noticeable leakage of positive, negative, and nonpolar associations across all languages. We find that GPT-3.5 exhibits the most stereotype leakage of these models, and Hindi is the most susceptible to leakage effects.

2024

pdf bib
“Flex Tape Can’t Fix That”: Bias and Misinformation in Edited Language Models
Karina H Halevy | Anna Sotnikova | Badr AlKhamissi | Syrielle Montariol | Antoine Bosselut
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Weight-based model editing methods update the parametric knowledge of language models post-training. However, these methods can unintentionally alter unrelated parametric knowledge representations, potentially increasing the risk of harm. In this work, we investigate how weight editing methods unexpectedly amplify model biases after edits. We introduce a novel benchmark dataset, Seesaw-CF, for measuring bias amplification of model editing methods for demographic traits such as race, geographic origin, and gender. We use Seesaw-CF to examine the impact of model editing on bias in five large language models. Our results demonstrate that edited models exhibit, to various degrees, more biased behavior for certain demographic groups than before they were edited, specifically becoming less confident in properties for Asian and African subjects. Additionally, editing facts about place of birth, country of citizenship, or gender has particularly negative effects on the model’s knowledge about unrelated properties, such as field of work, a pattern observed across multiple models.

2023

pdf bib
Which Examples Should be Multiply Annotated? Active Learning When Annotators May Disagree
Connor Baumler | Anna Sotnikova | Hal Daumé III
Findings of the Association for Computational Linguistics: ACL 2023

Linguistic annotations, especially for controversial topics like hate speech detection, are frequently contested due to annotator backgrounds and positionalities. In such situations, preserving this disagreement through the machine learning pipeline can be important for downstream use cases. However, capturing disagreement can increase annotation time and expense. Fortunately, for many tasks, not all examples are equally controversial; we develop an active learning approach, Disagreement Aware Active Learning (DAAL) that concentrates annotations on examples where model entropy and annotator entropy are the most different. Because we cannot know the true entropy of annotations on unlabeled examples, we estimate a model that predicts annotator entropy trained using very few multiply-labeled examples. We find that traditional uncertainty-based active learning underperforms simple passive learning on tasks with high levels of disagreement, but that our active learning approach is able to successfully improve on passive and active baselines, reducing the number of annotations required by at least 24% on average across several datasets.

2022

pdf bib
Theory-Grounded Measurement of U.S. Social Stereotypes in English Language Models
Yang Trista Cao | Anna Sotnikova | Hal Daumé III | Rachel Rudinger | Linda Zou
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

NLP models trained on text have been shown to reproduce human stereotypes, which can magnify harms to marginalized groups when systems are deployed at scale. We adapt the Agency-Belief-Communion (ABC) stereotype model of Koch et al. (2016) from social psychology as a framework for the systematic study and discovery of stereotypic group-trait associations in language models (LMs). We introduce the sensitivity test (SeT) for measuring stereotypical associations from language models. To evaluate SeT and other measures using the ABC model, we collect group-trait judgments from U.S.-based subjects to compare with English LM stereotypes. Finally, we extend this framework to measure LM stereotyping of intersectional identities.

2021

pdf bib
Analyzing Stereotypes in Generative Text Inference Tasks
Anna Sotnikova | Yang Trista Cao | Hal Daumé III | Rachel Rudinger
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021