Yen Yee Yam
2026
Pinetree at SemEval-2026 Task 7: A Large-Scale Failure Analysis of Cultural Grounding in Language Models
Yen Yee Yam | Hong Meng Yam
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
Using a simple prompting strategy without fine-tuning or retrieval augmentation, our system achieved 88.85% micro-average and 90.55% macro-average accuracy, ranking #4 overall on SemEval-2026 Task 7. Our primary contribution is a failure analysis of 5,241 incorrect predictions (11.15% of the dataset), categorized using the six-topic BLEnD taxonomy. Errors concentrate in Food (39.42%) and Holidays/Celebration/Leisure (15.76%), but within-topic error rates are highest on Family (21.04%) and Work life (20.45%), topics with limited representational density. Global-brand attractor errors account for only 2.50% of failures and are tightly localized: 98.5% fall on a single template (most popular sport team) in four low-resource cultures. Outside these templates, brand-default effects are statistically negligible. These findings support representational sparsity and knowledge-density asymmetry, not ideological skew, as the dominant cause of cultural misalignment in everyday behavioral tasks.
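The abstract above separates two pairs of quantities: micro- versus macro-averaged accuracy, and a topic's share of all errors versus its within-topic error rate. The sketch below illustrates how these can diverge. It is not the authors' evaluation code; the function names, the `(group, is_correct)` record shape, and the toy counts are assumptions made for illustration, and the abstract does not state which grouping the macro average is taken over.

```python
from collections import defaultdict


def micro_macro_accuracy(records):
    """Return (micro, macro) accuracy for an iterable of (group, is_correct).

    Micro pools all items; macro averages the per-group accuracies.
    The grouping key (topic, culture, ...) is an assumption of this sketch.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        hits[group] += int(ok)
    micro = sum(hits.values()) / sum(totals.values())
    macro = sum(hits[g] / totals[g] for g in totals) / len(totals)
    return micro, macro


def error_profile(records):
    """Map each group to (share of all errors, within-group error rate)."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, ok in records:
        totals[group] += 1
        errors[group] += int(not ok)
    all_errors = sum(errors.values())
    return {
        g: (errors[g] / all_errors if all_errors else 0.0, errors[g] / totals[g])
        for g in totals
    }


# Toy data (hypothetical counts, not taken from the paper):
# a large topic with few relative errors and a small topic with many.
toy = (
    [("Food", True)] * 8 + [("Food", False)] * 2
    + [("Family", True)] * 1 + [("Family", False)] * 1
)
micro, macro = micro_macro_accuracy(toy)
profile = error_profile(toy)
```

In this toy setting the larger topic contributes most of the errors even though the smaller topic has the higher error rate, which is the same pattern the abstract reports for Food versus Family.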
Yam at SemEval-2026 Task 4: Failure-Driven Prompt Evolution for Narrative Comparison
Yen Yee Yam | Hong Meng Yam
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We present a structured, parameter-free system for SemEval-2026 Task 4 on Narrative Story Similarity. Instead of treating similarity as scalar embedding proximity, we align model reasoning with the task ontology by decomposing each story into abstract theme, course of action, and outcome, and performing contrastive comparison over these dimensions. Our primary contribution is a closed-loop, failure-driven prompt optimization procedure that iteratively refines concise guideline documents while keeping model parameters fixed and reverting updates that degrade performance, thereby isolating improvements attributable to structured reasoning rather than representation learning. Ontology-aligned decomposition alone achieves 70% accuracy on both train and test sets; with controlled guideline evolution, performance improves to 76% on train and 73% on test without additional supervision or fine-tuning. These results demonstrate that structured prompt optimization can meaningfully enhance contrastive narrative reasoning in a fully parameter-free setting.
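The closed-loop procedure described in this abstract (collect failures, revise the guideline document, keep the revision only if performance does not degrade) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict` and `revise` are hypothetical callables standing in for the frozen model and the guideline-rewriting step, and the accept-if-not-worse rule and fixed round count are choices made for the sketch.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (input text, gold label); hypothetical shape


def accuracy(guidelines: str, data: List[Example],
             predict: Callable[[str, str], int]) -> float:
    """Fraction of examples the fixed model gets right under these guidelines."""
    correct = sum(1 for text, gold in data if predict(guidelines, text) == gold)
    return correct / len(data)


def evolve_guidelines(guidelines: str, data: List[Example],
                      predict: Callable[[str, str], int],
                      revise: Callable[[str, List[Example]], str],
                      rounds: int = 5) -> Tuple[str, float]:
    """Iteratively revise guidelines from failures; revert degrading updates.

    Model parameters never change: only the guideline text is updated.
    """
    best_score = accuracy(guidelines, data, predict)
    for _ in range(rounds):
        # Collect the examples the current guidelines still get wrong.
        failures = [(t, g) for t, g in data if predict(guidelines, t) != g]
        if not failures:
            break
        candidate = revise(guidelines, failures)
        score = accuracy(candidate, data, predict)
        if score < best_score:
            continue  # revert: discard the candidate, keep current guidelines
        guidelines, best_score = candidate, score
    return guidelines, best_score
```

In practice `predict` would wrap a call to a language model with the guidelines placed in the prompt, and `revise` would ask a model to rewrite the guidelines given the failed cases; both are left abstract here because the abstract does not specify them.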