Russell Scheinberg
2026
PortNLP at CRAC 2026: QLoRA Fine-Tuning with Bounded Entity Registry for Multilingual Coreference Resolution
Amber Shore | Russell Scheinberg | Malini Nagasundaram | Ameeta Agrawal
Proceedings of the 2nd Joint Workshop on Computational Approaches to Discourse, Context and Document-Level Inferences and Computational Models of Reference, Anaphora and Coreference (CODI-CRAC 2026)
We describe PortNLP’s submission to the CRAC 2026 Shared Task on Multilingual Coreference Resolution (LLM track). Our system fine-tunes Qwen 3 14B with QLoRA on CorefUD 1.4 gold annotations across 27 corpora spanning 19 languages. Documents are processed in 500-700 character chunks with a bounded rolling context consisting of 500 characters of recent annotated text and a scored entity registry that tracks up to 30 active entities via a frequency-times-recency decay formula. We employ data augmentation and language-aware sampling strategies to handle typological and data-size diversity. Our system achieves 68.69 CoNLL F1 averaged across all 27 test corpora. We additionally present probing experiments on the LoRA adapter’s internal representations, finding that coreference signal is concentrated in attention value projections rather than MLP modules, with the strongest readout at the earliest transformer layer.
Do Language Models Show Structural Priming Across Different Domains?
So Young Lee | Russell Scheinberg | Ameeta Agrawal
Proceedings of the 1st Workshop on Computational Developmental Linguistics (CDL)
We test whether large language models show cross-domain structural priming by asking whether arithmetic expressions influence relative-clause attachment preferences. Experiment 1 examines English and French using materials based on prior psycholinguistic studies, and Experiment 2 extends the test to a larger multilingual dataset. Across both experiments, we find no robust priming effect. Instead, responses largely reflect baseline attachment preferences, which vary across languages and only partially align with human patterns. These findings suggest that, although language models show some structural sensitivity, they provide limited evidence of abstract structural generalization across domains.
2025
Who Relies More on World Knowledge and Bias for Syntactic Ambiguity Resolution: Humans or LLMs?
So Young Lee | Russell Scheinberg | Amber Shore | Ameeta Agrawal
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
This study explores how recent large language models (LLMs) navigate relative clause attachment ambiguity and use world knowledge biases for disambiguation in six typologically diverse languages: English, Chinese, Japanese, Korean, Russian, and Spanish. We describe the process of creating a novel dataset – MultiWho – for fine-grained evaluation of relative clause attachment preferences in ambiguous and unambiguous contexts. Our experiments with three LLMs indicate that, contrary to humans, LLMs consistently exhibit a preference for local attachment, displaying limited responsiveness to syntactic variations or language-specific attachment patterns. Although LLMs performed well in unambiguous cases, they rigidly prioritized world knowledge biases, lacking the flexibility of human language processing. These findings highlight the need for more diverse, pragmatically nuanced multilingual training to improve LLMs’ handling of complex structures and human-like comprehension.
Explain-then-Process: Using Grammar Prompting to Enhance Grammatical Acceptability Judgments
Russell Scheinberg | Ameeta Agrawal | Amber Shore | So Young Lee
Findings of the Association for Computational Linguistics: ACL 2025
Large language models (LLMs) can explain grammatical rules, yet they often fail to apply those rules when judging sentence acceptability. We present grammar prompting, an explain-then-process paradigm: a large LLM first produces a concise explanation of the relevant syntactic phenomenon, then that explanation is fed back as additional context to the target model – either an LLM or a smaller language model (SLM) – before deciding which sentence of a minimal pair is grammatical. On the English BLiMP, Chinese SLING, and Russian RuBLiMP benchmarks, this simple prompt design yields substantial improvements over strong baselines across a wide range of syntactic phenomena. Feeding an LLM’s metalinguistic explanation back to the target model bridges the gap between knowing a rule and using it. On SLMs, grammar prompting alone trims the average LLM-SLM accuracy gap by 20%, and when paired with chain-of-thought, by 56% (13.0 pp → 5.8 pp), all at negligible cost. The lightweight, language-agnostic cue lets low-cost SLMs approach frontier-LLM performance in multilingual settings.
Correct-Detect: Balancing Performance and Ambiguity Through the Lens of Coreference Resolution in LLMs
Amber Shore | Russell Scheinberg | Ameeta Agrawal | So Young Lee
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are intended to reflect human linguistic competencies. But humans have access to a broad and embodied context, which is key in detecting and resolving linguistic ambiguities, even in isolated text spans. A foundational case of semantic ambiguity is found in the task of coreference resolution: how is a pronoun related to an earlier person mention? This capability is implicit in nearly every downstream task, and the presence of ambiguity at this level can alter performance significantly. We show that LLMs can achieve good performance with minimal prompting in both coreference disambiguation and the detection of ambiguity in coreference; however, they cannot do both at the same time. We present the CORRECT-DETECT trade-off: though models have both capabilities and deploy them implicitly, successful performance balancing these two abilities remains elusive.
2024
Multilingual Relative Clause Attachment Ambiguity Resolution in Large Language Models
So Young Lee | Russell Scheinberg | Amber Shore | Ameeta Agrawal
Proceedings of the 38th Pacific Asia Conference on Language, Information and Computation
Evaluating Multilingual Long-Context Models for Retrieval and Reasoning
Ameeta Agrawal | Andy Dang | Sina Bagheri Nezhad | Rhitabrat Pokharel | Russell Scheinberg
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Recent large language models (LLMs) demonstrate impressive capabilities in handling long contexts, some exhibiting near-perfect recall on synthetic retrieval tasks. However, these evaluations have mainly focused on English text and involved a single target sentence within lengthy contexts. Our work investigates how LLM performance generalizes to multilingual settings with multiple hidden target sentences. We create a new dataset – mLongRR – to comprehensively evaluate several multilingual long-context LLMs on retrieval and reasoning tasks across five languages: English, Vietnamese, Indonesian, Swahili, and Somali. These languages share the Latin script but belong to distinct language families and resource levels. Our analysis reveals a significant performance gap between languages. With a single target sentence, the best-performing models, such as Gemini-1.5 and GPT-4o, range from around 96% accuracy in English to around 36% in Somali. However, this accuracy drops to 40% in English and 0% in Somali when dealing with three target sentences. Our findings highlight the challenges long-context LLMs face when processing longer contexts, an increase in the number of target sentences, or languages of lower resource levels.