@inproceedings{yam-yam-2026-pinetree,
title = "Pinetree at {S}em{E}val-2026 Task 7: A Large-Scale Failure Analysis of Cultural Grounding in Language Models",
author = "Yam, Yen Yee and
Yam, Hong Meng",
editor = "Kochmar, Ekaterina and
Ghosh, Debanjan and
North, Kai and
Komachi, Mamoru",
booktitle = "Proceedings of the 20th {I}nternational {W}orkshop on {S}emantic {E}valuation (2026)",
month = jul,
year = "2026",
address = "San Diego, California, USA",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-acl-workshops/2026.semeval-1.422/",
pages = "3399--3407",
ISBN = "979-8-89176-414-9",
abstract = "Using a simple prompting strategy without fine-tuning or retrieval augmentation, our system achieved 88.85{\%} micro-average and 90.55{\%} macro-average accuracy, ranking {\#}4 overall on SemEval-2026 Task 7. Our primary contribution is a failure analysis of 5,241 incorrect predictions (11.15{\%} of the dataset), categorized using the six-topic BLEnD taxonomy. Errors concentrate in Food (39.42{\%}) and Holidays/Celebration/Leisure (15.76{\%}), but within-topic error rates are highest on Family (21.04{\%}) and Work life (20.45{\%}), which are topics with limited representational density. Global-brand attractor errors account for only 2.50{\%} of failures and are tightly localized: 98.5{\%} fall on a single template (most popular sport team) in four low-resource cultures. Outside these templates, brand-default effects are statistically negligible. These findings support representational sparsity and knowledge-density asymmetry, not ideological skew, as the dominant cause of cultural misalignment in everyday behavioral tasks."
}