Jesse Atuhurra

2026

VLURes: Benchmarking Long-Text Grounding and Cross-Lingual Robustness in Vision Language Models
Jesse Atuhurra | Iqra Ali | Tomoya Iwakura | Hidetaka Kamigaito | Tatsuya Hiraoka
Findings of the Association for Computational Linguistics: ACL 2026

We introduce ***VLURes***, a multilingual benchmark for evaluating Vision-Language Models (VLMs) under *long-text grounding*: selecting and reasoning over the image-relevant subset of article-length text that contains distractors and ungrounded claims. *VLURes* contains **4,000** web-curated *image + long-text* pairs across **English (En), Japanese (Ja), Swahili (Sw), and Urdu (Ur)** and **10** topical categories, and defines **eight** tasks spanning image-only perception (OR, SU, RU, SS, IC) and image+text grounding (ITM, *Unrelatedness*, VQA). To construct web-realistic pairs, we apply language-adapted CLIP alignment to select representative images and filter weakly grounded pages. We evaluate **10** proprietary and open VLMs under zero-shot and one-shot prompting, with and without rationales. The best model (GPT-4o) reaches **90.8%** overall accuracy but remains **6.7** points below human performance (**97.5%**) on Object Recognition, and cross-lingual sensitivity persists. Open models are substantially weaker and often lack reliable multilingual VL support. *VLURes* provides a practical testbed for long-text grounding and multilingual robustness in web-realistic agent settings.

2025

HLU: Human Vs LLM Generated Text Detection Dataset for Urdu at Multiple Granularities
Iqra Ali | Jesse Atuhurra | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 31st International Conference on Computational Linguistics

The rise of large language models (LLMs) generating human-like text has raised concerns about misuse, especially in low-resource languages like Urdu. To address this gap, we introduce the HLU dataset, which consists of three datasets at the document, paragraph, and sentence levels. The document-level dataset contains 1,014 instances of human-written and LLM-generated articles across 13 domains, while the paragraph-level and sentence-level datasets each contain 667 instances. We conducted both human and automatic evaluations. In the human evaluation, the average accuracy at the document level was 35%, while at the paragraph and sentence levels, accuracies were 75.68% and 88.45%, respectively. For automatic evaluation, we fine-tuned the XLMRoBERTa model in both monolingual and multilingual settings, achieving consistent results in both. Additionally, we assessed the performance of GPT4 and Claude3Opus using zero-shot prompting. Our experiments and evaluations indicate that distinguishing between human and machine-generated text is challenging for both humans and LLMs; the dataset marks a significant step toward addressing this issue in Urdu.

Venues

COLING (1)
Findings (1)
