Taleef Tamsal


2026

We describe PFW Task 8’s system for SemEval-2026 Task 8 (MTRAGEval), a benchmark for multi-turn retrieval-augmented generation across four English-language corpora. Our submission combines BM25, SPLADE-v3, and Jina Embeddings v4 with weighted reciprocal rank fusion for retrieval, plus zero-shot GPT-4o/GPT-4o-mini prompting for generation. Officially, our system ranks 6th of 26 on Task B (H = 0.756), 14th of 29 on Task C (H = 0.533), and 20th of 38 on Task A (nDCG@5 = 0.433). For the camera-ready analysis, we re-run retrieval at the official nDCG@5 cutoff, strengthen the prompt ablation with per-domain statistics and exact tests, and analyze official outputs by answerability and domain. On a balanced 100-example development sample, explicit citation-format instructions, rather than chain-of-thought alone, raise citation use from 4% to 93%, and a fixed-context Task C control improves from H = 0.463 with GPT-4o-mini to H = 0.523 with GPT-4o. Official analytics also show near-perfect UNANSWERABLE handling (H = 0.990) but weak behavior on UNDERSPECIFIED turns, where the system answers or refuses instead of clarifying. Our code is publicly available.
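As a rough illustration of the weighted reciprocal rank fusion step mentioned above, the following is a minimal sketch and not the submitted system's code. The function name `weighted_rrf`, the retriever keys, the equal weights, and the smoothing constant `k = 60` (a commonly used default) are all assumptions made for the example; the abstract does not state the actual weights or constant.

```python
from collections import defaultdict


def weighted_rrf(rankings, weights, k=60):
    """Fuse ranked lists with weighted reciprocal rank fusion.

    rankings: dict mapping a retriever name to a list of doc ids, best first.
    weights:  dict mapping the same retriever names to a float weight.
    k:        smoothing constant (assumed value, not taken from the paper).

    Each document scores sum_i w_i / (k + rank_i), with ranks starting at 1;
    documents missing from a list contribute nothing for that retriever.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = weights[name]
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Sort by descending fused score; break ties by doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical toy rankings from three retrievers.
rankings = {
    "bm25": ["d1", "d2", "d3"],
    "splade": ["d2", "d1", "d4"],
    "dense": ["d2", "d3", "d1"],
}
weights = {"bm25": 1.0, "splade": 1.0, "dense": 1.0}
fused = weighted_rrf(rankings, weights)
```

Because fusion uses only ranks, the three retrievers' raw scores never need to be put on a common scale; the weights simply let one retriever's ordering count for more than another's.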
This paper describes the PFW system for SemEval-2026 Task 6 (CLARITY), which addresses the classification of response clarity and evasion techniques in political interview question-answer pairs. Rather than relying on large language model prompting, we pursue a competitive non-LLM approach based on fine-tuning DeBERTa-xlarge and DeBERTa-v3-large with a multi-seed ensemble strategy: 5-fold cross-validation with 10 random seeds yields 50 models per architecture, combined through simple logit averaging. Our system achieves a macro F1 of 0.76 on Subtask 1 (clarity-level classification) and 0.50 on Subtask 2 (evasion-type classification). We also find that three post-hoc optimization techniques (learned ensemble weights, threshold calibration, and hierarchical masking) each improve out-of-fold performance yet degrade evaluation scores by 0.02–0.10 F1. This pattern should be interpreted cautiously: the 237-sample evaluation set likely contributes substantial variance, and two of the three degradations fall within the ±0.06 95% CI expected from sampling noise. Still, the consistent directional pattern across all three prediction-level interventions provides suggestive evidence for an optimization paradox, highlighting the risk of overfitting to cross-validation predictions when evaluation data is limited. Our code is publicly available at https://github.com/Taleef7/semeval-2026-task6.
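The simple logit averaging used to combine ensemble members can be sketched as follows. This is an illustrative example under stated assumptions, not the authors' implementation: the function name `average_logits` and the toy numbers are hypothetical, and only two models are shown where the described system averages 50 per architecture.

```python
def average_logits(model_logits):
    """Average per-class logits across models and pick the argmax class.

    model_logits: list over models; each entry is a list over examples;
                  each example is a list of per-class logits.
    Returns (mean_logits, predictions).
    """
    n_models = len(model_logits)
    n_examples = len(model_logits[0])
    n_classes = len(model_logits[0][0])
    mean_logits = [
        [sum(m[i][c] for m in model_logits) / n_models for c in range(n_classes)]
        for i in range(n_examples)
    ]
    # Index of the largest averaged logit for each example.
    predictions = [
        max(range(n_classes), key=row.__getitem__) for row in mean_logits
    ]
    return mean_logits, predictions


# Hypothetical logits from two models on two examples with three classes.
model_a = [[2.0, 0.0, 1.0], [0.0, 1.0, 0.5]]
model_b = [[0.0, 3.0, 1.0], [0.0, 0.0, 2.0]]
mean_logits, predictions = average_logits([model_a, model_b])
```

Uniform averaging has no parameters fitted to out-of-fold predictions, which is one reason it may be less prone to the overfitting that the abstract reports for learned ensemble weights.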