Francesco Ortu
2025
Revealing Hidden Mechanisms of Cross-Country Content Moderation with Natural Language Processing
Neemesh Yadav
|
Jiarui Liu
|
Francesco Ortu
|
Roya Ensafi
|
Zhijing Jin
|
Rada Mihalcea
Findings of the Association for Computational Linguistics: ACL 2025
The ability of Natural Language Processing (NLP) methods to categorize text into multiple classes has motivated their use in online content moderation tasks, such as hate speech and fake news detection. However, there is limited understanding of how or why these methods make such decisions, or why certain content is moderated in the first place. To investigate the hidden mechanisms behind content moderation, we explore multiple directions: 1) training classifiers to reverse-engineer content moderation decisions across countries; 2) explaining content moderation decisions by analyzing Shapley values and LLM-guided explanations. Our primary focus is on content moderation decisions made across countries, using pre-existing corpora sampled from the Twitter Stream Grab. Our experiments reveal interesting patterns in censored posts, both across countries and over time. Through human evaluations of LLM-generated explanations across three LLMs, we assess the effectiveness of using LLMs in content moderation. Finally, we discuss potential future directions, as well as the limitations and ethical considerations of this work.
2024
Competition of Mechanisms: Tracing How Language Models Handle Facts and Counterfactuals
Francesco Ortu
|
Zhijing Jin
|
Diego Doimo
|
Mrinmaya Sachan
|
Alberto Cazzaniga
|
Bernhard Schölkopf
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Interpretability research aims to bridge the gap between the empirical success and our scientific understanding of the inner workings of large language models (LLMs). However, most existing research in this area focused on analyzing a single mechanism, such as how models copy or recall factual knowledge. In this work, we propose the formulation of competition of mechanisms, which instead of individual mechanisms focuses on the interplay of multiple mechanisms, and traces how one of them becomes dominant in the final prediction. We uncover how and where the competition of mechanisms happens within LLMs using two interpretability methods, logit inspection and attention modification. Our findings show traces of the mechanisms and their competition across various model components, and reveal attention positions that effectively control the strength of certain mechanisms.
Search
Fix author
Co-authors
- Zhijing Jin 2
- Alberto Cazzaniga 1
- Diego Doimo 1
- Roya Ensafi 1
- Jiarui Liu 1
- show all...