Andrew Halterman

2026

What is a protest anyway? Codebook conceptualization is still a first-order concern in LLM-era classification
Andrew Halterman | Katherine A. Keith
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Generative large language models (LLMs) are now used extensively for text classification in computational social science (CSS). In this work, we focus on the steps before and after LLM prompting: conceptualizing the categories to classify and using LLM predictions in downstream statistical inference. We argue that these steps have been overlooked in much of LLM-era CSS and that LLMs can tempt analysts to skip conceptualization altogether. For example, a political scientist classifying "protest" with LLMs may never be forced to craft a definition: unlike human annotators, who would ask clarifying questions, an LLM can silently accept an underspecified concept and return plausible-looking labels. Using simulations, we show that conceptualization failures induce downstream inferential bias that cannot be corrected solely by a more accurate LLM or by post-hoc bias correction methods. We conclude by reminding CSS analysts that conceptualization is still a first-order concern in the LLM era and provide concrete advice for pursuing low-cost, unbiased, low-variance downstream estimates.

2023

Detecting and Geocoding Battle Events from Social Media Messages on the Russo-Ukrainian War: Shared Task 2, CASE 2023
Hristo Tanev | Nicolas Stefanovitch | Andrew Halterman | Onur Uca | Vanni Zavarella | Ali Hurriyetoglu | Bertrand De Longueville | Leonida Della Rocca
Proceedings of the 6th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text

The purpose of Shared Task 2 at the Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE) 2023 workshop was to test the ability of the participating models and systems to detect and geocode armed conflict events in social media messages from Telegram channels reporting on the Russo-Ukrainian war. The evaluation followed an approach introduced in CASE 2021 (Giorgi et al., 2021): for each system, we consider the correlation between the spatio-temporal distribution of its detected events and that of the events identified for the same period in the ACLED (Armed Conflict Location and Event Data Project) database (Raleigh et al., 2010). We use ACLED as the ground truth, since it is a well-established standard in the field of event extraction and political trend analysis, and relies on human annotators to encode security events using a fine-grained taxonomy. Two systems participated in this shared task; this paper reports on both the shared task and the participating systems.

2022

Political Event Coding as Text-to-Text Sequence Generation
Yaoyao Dai | Benjamin Radford | Andrew Halterman
Proceedings of the 5th Workshop on Challenges and Applications of Automated Extraction of Socio-political Events from Text (CASE)

We report on the current status of an effort to produce political event data from unstructured text via a Transformer language model. Compelled by the current lack of publicly available and up-to-date event coding software, we seek to train a model that can produce structured political event records at the sentence level. Our approach differs from previous efforts in that we conceptualize this task as one of text-to-text sequence generation. We motivate this choice by outlining desirable properties of text generation models for the needs of event coding. To overcome the lack of sufficient training data, we also describe a method for generating synthetic text and event record pairs that we use to fit our model.

2021

Corpus-Level Evaluation for Event QA: The IndiaPoliceEvents Corpus Covering the 2002 Gujarat Violence
Andrew Halterman | Katherine A. Keith | Sheikh Sarwar | Brendan O’Connor
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Few-Shot Upsampling for Protest Size Detection
Andrew Halterman | Benjamin J. Radford
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

2019

Geolocating Political Events in Text
Andrew Halterman
Proceedings of the Third Workshop on Natural Language Processing and Computational Social Science

This work introduces a general method for automatically finding the locations where political events in text occurred. Using a novel set of 8,000 labeled sentences, I create a method to link automatically extracted events and locations in text. The model achieves human-level performance on the annotation task and outperforms previous event geolocation systems. It can be applied to most event extraction systems across geographic contexts. I formalize the event–location linking task, present the neural network model, describe the potential uses of such a system in political science, and demonstrate a workflow to answer an open question on the role of conventional military offensives in causing civilian casualties in the Syrian civil war.
