Justin Debenedetto

Also published as: Justin DeBenedetto

2026

Mendel292 at SemEval-2026 Task 4: Disentangled Narrative Embeddings for Story Similarity
Mauricio Gruppi | Sankalpa Rijal | Justin Debenedetto
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper describes Mendel292, our system for SemEval-2026 Task 4 on Narrative Story Similarity. We introduce a narrative encoder that decomposes story representations into explicit subspaces for abstract theme, course of action, and outcome, built on a pre-trained sentence embedding model and a trainable BiLSTM projection layer with a triplet margin loss objective. We augment the training set via backtranslation and incorporate weakly supervised multi-task objectives derived from unsupervised narrative clustering. The proposed architecture was designed to learn a latent representation of narratives in a few-shot setting due to a limited amount of training data. Despite using a rich pre-trained transformer, the model was outperformed by an unsupervised pooling approach on the classification task. While our systems do not match the top leaderboard scores, they allow us to systematically study the effects of subspace factorization, weak labels, and data augmentation on narrative similarity modeling.

A Systematic Comparison of Parameter-Efficient Fine-Tuning Techniques for Low-Resource Neural Machine Translation: Evidence from Indigenous Languages of the Americas
Drew Stackhouse | Justin Debenedetto
Proceedings of the Sixth Workshop on NLP for Indigenous Languages of the Americas (AmericasNLP)

We present the first systematic benchmark of parameter-efficient fine-tuning (PEFT) for low-resource neural machine translation (NMT) of indigenous languages of the Americas. We evaluate eight PEFT methods alongside full fine-tuning on NLLB-200-distilled-600M across 13 indigenous-to-Spanish language pairs spanning four resource tiers (357-125,008 training sentences). OFT (Orthogonal Finetuning) achieves the highest development-set chrF++ among PEFT methods (26.63) while training only 0.28% of parameters. LoRA (Low-Rank Adaptation) offers a strong efficiency-quality tradeoff (25.27 chrF++, 0.19%). On held-out test data, full fine-tuning ranks first (25.12) with OFT a close second (25.06; p = 0.43). VeRA (Vector-based Random Matrix Adaptation) and Prefix Tuning consistently underperform. These results demonstrate that PEFT is a viable alternative to full fine-tuning for indigenous-language NMT.

2024

Linearization Order Matters for AMR-to-Text Generation Input
Justin DeBenedetto
Proceedings of the 2024 UMR Parsing Workshop

Abstract Meaning Representation (AMR) is a semantic graph formalism designed to capture sentence meaning using a directed graph. Many systems treat AMR-to-text generation as a sequence-to-sequence problem, drawing upon existing models. The largest AMR dataset (AMR 3.0) provides a sequence format which is considered equivalent to the graph format. However, due to the position-sensitive nature of sequence-to-sequence models, graph traversal order affects system performance. In this work we explore the effect that different, valid orderings have on the performance of sequence-to-sequence AMR-to-text systems and find that changing the traversal order can result in a BLEU score drop of up to 17.5 on a state-of-the-art system.

Automatic Quality Estimation for Data Selection and Curriculum Learning
Hiep Nguyen | Lynn Yip | Justin DeBenedetto
The 2nd BabyLM Challenge at the 28th Conference on Computational Natural Language Learning

The size of neural models within natural language processing has increased at a rapid pace in recent years. With this increase in model size comes an increase in the amount of data required for training. While these larger models have shown strong performance, their use comes with added training and data costs, can be resource-prohibitive for many researchers, and requires an amount of language data that is not available for all languages. This work explores quality estimation as a method of data selection or filtering. The aim is to provide models with higher-quality data rather than larger amounts of data. This approach was applied to machine translation models with varying data sizes as well as to the BabyLM Challenge. Given the 100M word dataset provided in the BabyLM Challenge, we test various strategies for selecting 10M words for pretraining and use a curriculum learning approach based on the quality estimation scoring. We find small improvements in certain data settings.

2023

Introducing Rhetorical Parallelism Detection: A New Task with Datasets, Metrics, and Baselines
Stephen Bothwell | Justin DeBenedetto | Theresa Crnkovich | Hildegund Müller | David Chiang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Rhetoric, both spoken and written, involves not only content but also style. One common stylistic tool is parallelism: the juxtaposition of phrases which have the same sequence of linguistic (e.g., phonological, syntactic, semantic) features. Despite the ubiquity of parallelism, the field of natural language processing has seldom investigated it, missing a chance to better understand the nature of the structure, meaning, and intent that humans convey. To address this, we introduce the task of rhetorical parallelism detection. We construct a formal definition of it; we provide one new Latin dataset and one adapted Chinese dataset for it; we establish a family of metrics to evaluate performance on it; and, lastly, we create baseline systems and novel sequence labeling schemes to capture it. On our strictest metric, we attain F₁ scores of 0.40 and 0.43 on our Latin and Chinese datasets, respectively.

Byte-ranked Curriculum Learning for BabyLM Strict-small Shared Task 2023
Justin DeBenedetto
Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning

2018

Part-of-Speech Tagging on an Endangered Language: a Parallel Griko-Italian Resource
Antonios Anastasopoulos | Marika Lekakou | Josep Quer | Eleni Zimianiti | Justin DeBenedetto | David Chiang
Proceedings of the 27th International Conference on Computational Linguistics

Most work on part-of-speech (POS) tagging is focused on high-resource languages, or examines low-resource and active learning settings through simulated studies. We evaluate POS tagging techniques on an actual endangered language, Griko. We present a resource that contains 114 narratives in Griko, along with sentence-level translations in Italian, and provides gold annotations for the test set. Based on a previously collected small corpus, we investigate several traditional methods, as well as methods that take advantage of monolingual data or project cross-lingual POS tags. We show that the combination of a semi-supervised method with cross-lingual transfer is more appropriate for this extremely challenging setting, with the best tagger achieving an accuracy of 72.9%. With an applied active learning scheme, which we use to collect sentence-level annotations over the test set, we achieve improvements of more than 21 percentage points.
