Jairo Serrano
2026
VerbaNexAI at SemEval-2026 Task 4: Two-Stage Narrative Similarity via Fine-Tuned Bi-Encoder with MLP Ensemble
Pablo Pertuz-Duran | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 4: Narrative Similarity, a shared task on assessing semantic relatedness between short narrative texts. The task comprises two tracks: Track A requires selecting which of two candidate stories is more similar to an anchor, and Track B requires producing fixed-size story embeddings whose cosine similarity reflects narrative relatedness. We propose a unified two-stage system built on Qwen3-Embedding-0.6B. The first stage fine-tunes the encoder as a bi-encoder with a 512-dimensional projection head using a composite loss combining margin ranking, pairwise softmax, and multiple negatives ranking objectives. The second stage trains a lightweight MLP head over frozen bi-encoder embeddings using pairwise interaction features, with k-fold cross-validation and logit-averaging ensemble inference. The system was trained exclusively on the official supervised data, without leveraging the additional 1,900 LLM-generated synthetic triples released by the organizers. Although the system ranked first on both tracks in the development phase, its performance did not transfer to the official test set, where it ranked 47th on Track A and 22nd on Track B.
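The two inference ideas named in this abstract, comparing stories by cosine similarity of their embeddings and averaging logits across fold models, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy vectors, and the assumption that each fold model emits a single logit for "candidate A is more similar" are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_more_similar(anchor, cand_a, cand_b):
    """Track A style decision derived from Track B style embeddings (assumed rule)."""
    return "A" if cosine(anchor, cand_a) >= cosine(anchor, cand_b) else "B"


def ensemble_probability(fold_logits):
    """Average the logits of the k fold models, then apply a sigmoid."""
    mean_logit = sum(fold_logits) / len(fold_logits)
    return 1.0 / (1.0 + math.exp(-mean_logit))


# Toy usage with made-up 2-dimensional embeddings and made-up fold logits.
choice = pick_more_similar([1.0, 0.0], [0.9, 0.1], [0.0, 1.0])
prob_a = ensemble_probability([2.0, -1.0, 2.0])
```

Averaging logits rather than probabilities lets a confident fold outweigh several weakly opposed ones, which is one common reason to prefer it in small ensembles.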
VerbaNexAI at SemEval-2026 Task 5: Few-Shot Chain-of-Thought with Selective Self-Consistency and Isotonic Calibration for Word Sense Plausibility Rating
Daniel Peña Gnecco | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
We present a system for rating word sense plausibility in ambiguous narrative contexts for SemEval-2026 Task 5. Our approach ensembles three large language models (Llama-3.1 70B, Qwen-2.5 32B, and Gemma-2 27B) using a computationally efficient, uncertainty-aware pipeline. We combine few-shot chain-of-thought prompting with selective self-consistency, which applies stochastic multiple sampling exclusively to items identified as inherently ambiguous. This targeted strategy reduces inference costs by approximately 45% while maintaining robustness in predictions. To correct the systematic bias of LLMs toward extreme ratings, we apply isotonic regression to shift the output distribution toward patterns of human judgment. Our system achieves a Spearman correlation of 0.67 and an accuracy within standard deviation of 0.76, ranking 34th out of 79 participating teams (top 43% without task-specific fine-tuning). Detailed error analysis reveals that while our system performs strongly on clear contexts (ρ = 0.78), current prompting paradigms struggle significantly to model multimodal human disagreement in genuinely ambiguous cases (ρ = 0.58), highlighting an important challenge for future work on subjective semantic tasks.
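Selective self-consistency, as described above, spends extra samples only on items flagged as ambiguous. A minimal sketch of that control flow follows. The abstract does not state how ambiguity is detected, how many samples are drawn, or how samples are aggregated, so `is_ambiguous`, `n_samples`, the temperatures, and the use of the mean are assumptions, and `rate` stands in for a hypothetical LLM call.

```python
from statistics import mean


def selective_self_consistency(items, rate, is_ambiguous, n_samples=5):
    """Rate each item once, or n_samples times if it is flagged as ambiguous.

    rate(item, temperature) is a hypothetical callable returning a numeric rating.
    Returns the list of ratings and the total number of model calls made.
    """
    ratings = []
    calls = 0
    for item in items:
        if is_ambiguous(item):
            # Stochastic sampling only where it is expected to help.
            samples = [rate(item, temperature=0.7) for _ in range(n_samples)]
            calls += n_samples
            ratings.append(mean(samples))
        else:
            # A single deterministic pass for clear items.
            ratings.append(rate(item, temperature=0.0))
            calls += 1
    return ratings, calls


def stub_rate(item, temperature):
    """Stand-in for an LLM call so the sketch runs without a model."""
    return 3.0 if temperature == 0.0 else 4.0


demo_ratings, demo_calls = selective_self_consistency(
    ["clear context", "ambiguous context"],
    stub_rate,
    is_ambiguous=lambda text: text.startswith("ambiguous"),
)
```

The call counter makes the cost trade-off visible: sampling every item would need `n_samples` calls per item, whereas here clear items cost one call each.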
VerbaNexAI at SemEval-2026 Task 6: Automatic Detection of Political Evasion through Hierarchical Classification with RoBERTa Large
Jeison Jimenez Alvear | Deyson Gómez Sánchez | Juan Carlos Martinez Santos | Edwin Puertas | Jairo Serrano
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
This paper describes VerbaNex AI’s participation in SemEval-2026 Task 6: CLARITY, a shared task on automatic detection of question evasion in political interview transcripts. The task requires classifying question-answer pairs into three clarity levels (Task 1) and identifying nine evasion techniques (Task 2). We propose and evaluate two independent systems based on RoBERTa-Large. The first is a standard sequence classifier that treats each question-answer pair as a single input sequence, leveraging RoBERTa’s native two-segment encoding to model the relationship between the two texts jointly. The second is a dual-encoder architecture that processes the question and answer independently and computes geometric interaction features to model the semantic misalignment between them explicitly. Both systems are trained on Task 2 labels and derive Task 1 predictions via the hierarchical mapping proposed by the task organizers. Our best result was achieved by the standard sequence classifier, reaching Rank 10 on Task 2 and Rank 25 on Task 1.
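The dual-encoder variant computes "geometric interaction features" between the question and answer representations. The abstract does not list the exact features, so the sketch below assumes one widely used combination: the two vectors, their element-wise absolute difference, and their element-wise product. The function name and toy vectors are hypothetical.

```python
def interaction_features(question_vec, answer_vec):
    """Build a feature vector [q, a, |q - a|, q * a] from two equal-length embeddings."""
    if len(question_vec) != len(answer_vec):
        raise ValueError("embeddings must have the same dimensionality")
    abs_diff = [abs(q - a) for q, a in zip(question_vec, answer_vec)]
    product = [q * a for q, a in zip(question_vec, answer_vec)]
    return list(question_vec) + list(answer_vec) + abs_diff + product


# Toy usage with made-up 2-dimensional embeddings; the result has 4 * 2 = 8 entries.
features = interaction_features([1.0, 2.0], [3.0, 0.5])
```

Under this assumption, a classifier head would receive the concatenated vector, where the difference and product terms expose mismatch between the two texts that independent encodings alone leave implicit.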
VerbaNexAI at SemEval-2026 Task 7: Integrating Web Snippets and RAG for the Evaluation of Multilingual Cultural Knowledge in LLMs
Danileth Almanza | Jairo Serrano | Edwin Puertas | Juan Carlos Martinez Santos
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)
In multilingual and multicultural contexts, LLMs require contextualization mechanisms to generate culturally coherent responses. This study presents a LLaMA-based approach to answering short cultural questions in different languages within Task 7 of SemEval-2026 (Track 1: SAQ), without access to official training data. The system integrates controlled synthetic data generation, evidence retrieval through web snippets, and a Retrieval-Augmented Generation (RAG) framework with few-shot learning. BLEnD is used solely as a thematic guide, ensuring semantic independence. During development, the LLaMA-3.1-8B model achieved 38.51% global accuracy, while LLaMA-3.2-1B obtained 15.54%. In large-scale evaluation (30,500 instances), the 1B model achieved 16.69%, maintaining stability after prompt optimization. The results demonstrate that contextual retrieval improves multilingual cultural knowledge evaluation and highlight the importance of pipeline design and model capacity.