John P. Lalor

Also published as: John Lalor


2025

No Simple Answer to Data Complexity: An Examination of Instance-Level Complexity Metrics for Classification Tasks
Ryan A. Cook | John P. Lalor | Ahmed Abbasi
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Natural Language Processing research has become increasingly concerned with understanding data quality and complexity at the instance level. Instance-level complexity scores can be used for tasks such as filtering out noisy observations and subsampling informative examples. However, there exists a diverse taxonomy of complexity metrics that can be used for a classification task, making metric selection itself difficult. We empirically examine the relationship between these metrics and find that simply storing training loss provides complexity rankings similar to those of other, more computationally intensive techniques. Metric similarity allows us to subsample data with higher aggregate complexity along several metrics using a single a priori available meta-feature. Further, this choice of complexity metric does not impact demographic fairness, even in downstream predictions. Researchers should consider metric availability and similarity, as using the wrong metric or sampling strategy may hurt performance.
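The headline result requires only a small amount of bookkeeping during training. As an illustration, here is a minimal sketch (not the authors' code) of recording per-instance training loss as a complexity score for a generic PyTorch classifier; it assumes the data loader yields example indices alongside inputs and labels.

    # Minimal sketch: per-instance training loss as a cheap complexity score.
    import torch
    import torch.nn.functional as F

    def record_training_loss(model, loader, optimizer, epochs=3):
        """Accumulate each example's loss across epochs; higher mean loss ~ higher complexity."""
        loss_history = {}  # example index -> list of per-epoch losses
        for _ in range(epochs):
            for xb, yb, idx in loader:  # loader is assumed to yield example indices
                optimizer.zero_grad()
                logits = model(xb)
                per_example = F.cross_entropy(logits, yb, reduction="none")
                per_example.mean().backward()
                optimizer.step()
                for i, loss in zip(idx.tolist(), per_example.detach().tolist()):
                    loss_history.setdefault(i, []).append(loss)
        # A single, a priori available meta-feature: mean training loss per instance.
        return {i: sum(v) / len(v) for i, v in loss_history.items()}

Examples with the highest scores can then be filtered out or preferentially subsampled in place of more computationally intensive complexity metrics.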

Bias A-head? Analyzing Bias in Transformer-Based Language Model Attention Heads
Yi Yang | Hanyu Duan | Ahmed Abbasi | John P. Lalor | Kar Yan Tam
Proceedings of the 5th Workshop on Trustworthy NLP (TrustNLP 2025)

Transformer-based pretrained large language models (PLMs) such as BERT and GPT have achieved remarkable success in NLP tasks. However, PLMs are prone to encoding stereotypical biases. Although a burgeoning literature has emerged on stereotypical bias mitigation in PLMs, such as work on debiasing gender and racial stereotyping, how such biases manifest and behave internally within PLMs remains largely unknown. Understanding the internal stereotyping mechanisms may allow better assessment of model fairness and guide the development of effective mitigation strategies. In this work, we focus on attention heads, a major component of the Transformer architecture, and propose a bias analysis framework to explore and identify a small set of biased heads that are found to contribute to a PLM’s stereotypical bias. We conduct extensive experiments to validate the existence of these biased heads and to better understand how they behave. We investigate gender and racial bias in the English language in two types of Transformer-based PLMs: the encoder-based BERT model and the decoder-based autoregressive models GPT, LLaMA-2 (7B), and LLaMA-2-Chat (7B). Overall, the results shed light on the bias behavior of pretrained language models.
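As background for head-level analysis, the sketch below shows one generic way to extract per-head attention weights from a pretrained BERT model with the Hugging Face transformers library; it is illustrative only and is not the bias analysis framework proposed in the paper.

    # Generic sketch: extract per-head attention weights for head-level analysis.
    import torch
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

    inputs = tokenizer("The nurse said that she was tired.", return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

    # outputs.attentions holds one tensor per layer, shaped (batch, num_heads, seq_len, seq_len).
    # Head-level statistics computed over contrasting templates can then be used to
    # score and compare individual heads.
    for layer_idx, layer_attention in enumerate(outputs.attentions):
        per_head_summary = layer_attention[0].mean(dim=(-1, -2))  # crude per-head summary, for illustration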

2024

Item Response Theory for Natural Language Processing
John P. Lalor | Pedro Rodriguez | João Sedoc | Jose Hernandez-Orallo
Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Tutorial Abstracts

This tutorial will introduce the NLP community to Item Response Theory (IRT; Baker 2001). IRT is a method from the field of psychometrics for model and dataset assessment. IRT has been used for decades to build test sets for human subjects and estimate latent characteristics of dataset examples. Recently, there has been an uptick in work applying IRT to tasks in NLP. It is our goal to introduce the wider NLP community to IRT and show its benefits for a number of NLP tasks. From this tutorial, we hope to encourage wider adoption of IRT among NLP researchers.
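For readers new to the area, the two-parameter logistic (2PL) model is a standard IRT formulation from the psychometrics literature (Baker 2001): the probability that subject j answers item i correctly is

    P(y_{ij} = 1 \mid \theta_j) = \frac{1}{1 + \exp\big(-a_i(\theta_j - b_i)\big)}

where \theta_j is the subject's latent ability, b_i is the item's difficulty, and a_i is its discrimination. In NLP applications, the subjects are typically models and the items are dataset examples.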

Stars Are All You Need: A Distantly Supervised Pyramid Network for Unified Sentiment Analysis
Wenchang Li | Yixing Chen | Shuang Zheng | Lei Wang | John Lalor
Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)

Data for the Rating Prediction (RP) sentiment analysis task, such as star reviews, are readily available. However, data for aspect-category sentiment analysis (ACSA) are often desired because of their fine-grained nature but are expensive to collect. In this work we present a method for learning ACSA using only RP labels. We propose Unified Sentiment Analysis (Uni-SA) to understand aspect and review sentiment in a unified manner, and a Distantly Supervised Pyramid Network (DSPN) to efficiently perform Aspect-Category Detection (ACD), ACSA, and overall sentiment analysis (OSA) using only RP labels for training. We evaluate DSPN on multi-aspect review datasets in English and Chinese and find that with only star rating labels for supervision, DSPN performs comparably well to a variety of benchmark models. We also demonstrate the interpretability of DSPN’s outputs on reviews to show the pyramid structure inherent in document-level end-to-end sentiment analysis.
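To make the distant-supervision idea concrete, the following is an illustrative sketch only (hypothetical module names, not the DSPN architecture): aspect detection and aspect sentiment heads are pooled into a document-level score, so star-rating labels alone provide the training signal.

    # Illustrative sketch of rating-only ("pyramid") supervision; not the DSPN architecture.
    import torch
    import torch.nn as nn

    class RatingSupervisedPyramid(nn.Module):
        def __init__(self, hidden_dim, num_aspects, num_ratings):
            super().__init__()
            self.aspect_detector = nn.Linear(hidden_dim, num_aspects)   # ACD head: which aspects are present
            self.aspect_sentiment = nn.Linear(hidden_dim, num_aspects)  # ACSA head: polarity per aspect
            self.rating_head = nn.Linear(1, num_ratings)                # pooled sentiment -> star rating

        def forward(self, doc_repr):
            aspect_probs = torch.sigmoid(self.aspect_detector(doc_repr))
            aspect_polarity = torch.tanh(self.aspect_sentiment(doc_repr))
            pooled = (aspect_probs * aspect_polarity).sum(-1, keepdim=True)  # presence-weighted sentiment
            return self.rating_head(pooled)  # only rating labels supervise the full stack

    # Training minimizes cross-entropy against star ratings; the aspect heads receive
    # gradient only through the pooled score (distant supervision).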

2022

Clustering Examples in Multi-Dataset Benchmarks with Item Response Theory
Pedro Rodriguez | Phu Mon Htut | John Lalor | João Sedoc
Proceedings of the Third Workshop on Insights from Negative Results in NLP

In natural language processing, multi-dataset benchmarks for common tasks (e.g., SuperGLUE for natural language inference and MRQA for question answering) have risen in importance. Invariably, tasks and individual examples vary in difficulty. Recent analysis methods infer properties of examples such as difficulty. In particular, Item Response Theory (IRT) jointly infers example and model properties from the output of benchmark tasks (i.e., scores for each model-example pair). Therefore, it seems sensible that methods like IRT should be able to detect differences between datasets in a task. This work shows that current IRT models are not as good at identifying differences as we would expect, explains why this is difficult, and outlines future directions that incorporate more (textual) signal from examples.

Benchmarking Intersectional Biases in NLP
John Lalor | Yi Yang | Kendall Smith | Nicole Forsgren | Ahmed Abbasi
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

There has been a recent wave of work assessing the fairness of machine learning models in general, and more specifically, of natural language processing (NLP) models built using machine learning techniques. While much work has highlighted biases embedded in state-of-the-art language models, and more recent efforts have focused on how to debias, research assessing the fairness and performance of biased/debiased models on downstream prediction tasks has been limited. Moreover, most prior work has emphasized bias along a single dimension such as gender or race. In this work, we benchmark multiple NLP models with regard to their fairness and predictive performance across a variety of NLP tasks. In particular, we assess intersectional bias: fairness across multiple demographic dimensions. The results show that while current debiasing strategies fare well in terms of the fairness-accuracy trade-off (generally preserving predictive power in debiased models), they are unable to effectively alleviate bias in downstream tasks. Furthermore, this bias is often amplified across dimensions (i.e., intersections). We conclude by highlighting possible causes and making recommendations for future NLP debiasing research.

2021

Evaluation Examples are not Equally Informative: How should that change NLP Leaderboards?
Pedro Rodriguez | Joe Barrow | Alexander Miserlis Hoyle | John P. Lalor | Robin Jia | Jordan Boyd-Graber
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Leaderboards are widely used in NLP and push the field forward. While leaderboards are a straightforward ranking of NLP models, this simplicity can mask nuances in evaluation items (examples) and subjects (NLP models). Rather than replace leaderboards, we advocate a re-imagining so that they better highlight if and where progress is made. Building on educational testing, we create a Bayesian leaderboard model where latent subject skill and latent item difficulty predict correct responses. Using this model, we analyze the ranking reliability of leaderboards. Afterwards, we show the model can guide what to annotate, identify annotation errors, detect overfitting, and identify informative examples. We conclude with recommendations for future benchmark tasks.
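The basic form of such a model, written here in a Rasch-style parameterization with standard normal priors (the paper's exact specification may differ), is

    \theta_j \sim \mathcal{N}(0, 1), \qquad b_i \sim \mathcal{N}(0, 1), \qquad P(y_{ij} = 1) = \sigma(\theta_j - b_i)

where \theta_j is the latent skill of subject (model) j, b_i is the latent difficulty of item (example) i, and \sigma is the logistic function.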

Constructing a Psychometric Testbed for Fair Natural Language Processing
Ahmed Abbasi | David Dobolyi | John P. Lalor | Richard G. Netemeyer | Kendall Smith | Yi Yang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing

Psychometric measures of ability, attitudes, perceptions, and beliefs are crucial for understanding user behavior in various contexts including health, security, e-commerce, and finance. Traditionally, psychometric dimensions have been measured and collected using survey-based methods. Inferring such constructs from user-generated text could allow timely, unobtrusive collection and analysis. In this paper we describe our efforts to construct a corpus for psychometric natural language processing (NLP) related to important dimensions such as trust, anxiety, numeracy, and literacy in the health domain. We discuss our multi-step process to align user text with their survey-based response items and provide an overview of the resulting testbed, which encompasses survey-based psychometric measures and accompanying user-generated text from 8,502 respondents. Our testbed also encompasses self-reported demographic information, including race, sex, age, income, and education, thereby affording opportunities for measuring bias and benchmarking fairness of text classification methods. We report preliminary results on the use of the text to predict/categorize users’ survey response labels, and on the fairness of these models. We also discuss the important implications of our work and resulting testbed for future NLP research on psychometrics and fairness.

2020

Dynamic Data Selection for Curriculum Learning via Ability Estimation
John P. Lalor | Hong Yu
Findings of the Association for Computational Linguistics: EMNLP 2020

Curriculum learning methods typically rely on heuristics to estimate the difficulty of training examples or the ability of the model. In this work, we propose replacing difficulty heuristics with learned difficulty parameters. We also propose Dynamic Data selection for Curriculum Learning via Ability Estimation (DDaCLAE), a strategy that probes model ability at each training epoch to select the best training examples at that point. We show that models using learned difficulty and/or ability outperform heuristic-based curriculum learning models on the GLUE classification tasks.
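A minimal sketch of the ability-based selection loop (not the released DDaCLAE code) is shown below; it assumes example difficulties and model ability are estimated on the same IRT scale.

    # Sketch of ability-based data selection: train each epoch only on examples
    # whose estimated difficulty does not exceed the model's current ability.
    def ability_based_curriculum(difficulties, estimate_ability, train_epoch, num_epochs):
        """difficulties: dict mapping example id -> IRT difficulty (same scale as ability)."""
        for epoch in range(num_epochs):
            ability = estimate_ability()  # e.g., probe the model on held-out items and estimate theta
            selected = [i for i, b in difficulties.items() if b <= ability]
            train_epoch(selected)  # run one training epoch on the currently "answerable" examples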

An Empirical Analysis of Human-Bot Interaction on Reddit
Ming-Cheng Ma | John P. Lalor
Proceedings of the Sixth Workshop on Noisy User-generated Text (W-NUT 2020)

Automated agents (“bots”) have emerged as a ubiquitous and influential presence on social media. Bots engage on social media platforms by posting content and replying to other users. In this work we conduct an empirical analysis of the activity of a single bot on Reddit. Our goal is to determine whether bot activity (in the form of posted comments on the website) has an effect on how humans engage on Reddit. We find that (1) the sentiment of a bot comment has a significant, positive effect on the subsequent human reply, and (2) human Reddit users modify their comment behaviors to overlap with the text of the bot, similar to how humans modify their text to mimic other humans in conversation. Understanding human-bot interactions on social media with relatively simple bots is important for preparing for more advanced bots in the future.

2019

Learning Latent Parameters without Human Response Patterns: Item Response Theory with Artificial Crowds
John P. Lalor | Hao Wu | Hong Yu
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Incorporating Item Response Theory (IRT) into NLP tasks can provide valuable information about model performance and behavior. Traditionally, IRT models are learned using human response pattern (RP) data, presenting a significant bottleneck for large data sets like those required for training deep neural networks (DNNs). In this work we propose learning IRT models using RPs generated from artificial crowds of DNN models. We demonstrate the effectiveness of learning IRT models using DNN-generated data through quantitative and qualitative analyses for two NLP tasks. Parameters learned from human and machine RPs for natural language inference and sentiment analysis exhibit medium to large positive correlations. We demonstrate a use-case for latent difficulty item parameters, namely training set filtering, and show that using difficulty to sample training data outperforms baseline methods. Finally, we highlight cases where human expectation about item difficulty does not match difficulty as estimated from the machine RPs.
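As a concrete illustration of the pipeline (hypothetical helper names, not the paper's code), a response-pattern matrix can be built from an ensemble of trained models and then handed to any IRT estimation routine:

    # Build a binary response-pattern matrix from an "artificial crowd" of models.
    import numpy as np

    def build_response_patterns(models, examples, labels, predict):
        """Rows are models ("subjects"), columns are examples ("items"); 1 means correct."""
        rp = np.zeros((len(models), len(examples)), dtype=int)
        for j, model in enumerate(models):
            preds = predict(model, examples)  # assumed prediction function for one model
            rp[j] = (np.asarray(preds) == np.asarray(labels)).astype(int)
        return rp  # fit with an IRT estimator to recover item difficulty and model ability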

2018

Understanding Deep Learning Performance through an Examination of Test Set Difficulty: A Psychometric Case Study
John P. Lalor | Hao Wu | Tsendsuren Munkhdalai | Hong Yu
Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing

Interpreting the performance of deep learning models beyond test set accuracy is challenging. Characteristics of individual data points are often not considered during evaluation, and each data point is treated equally. In this work we examine the impact of a test set question’s difficulty to determine if there is a relationship between difficulty and performance. We model difficulty using well-studied psychometric methods on human response patterns. Experiments on Natural Language Inference (NLI) and Sentiment Analysis (SA) show that the likelihood of answering a question correctly is impacted by the question’s difficulty. In addition, as DNNs are trained on larger datasets, easy questions start to have a higher probability of being answered correctly than harder questions.

2016

Building an Evaluation Scale using Item Response Theory
John P. Lalor | Hao Wu | Hong Yu
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing

Citation Analysis with Neural Attention Models
Tsendsuren Munkhdalai | John P. Lalor | Hong Yu
Proceedings of the Seventh International Workshop on Health Text Mining and Information Analysis