Edwin Puertas - ACL Anthology

Edwin Puertas

2026

VerbaNex AI at SemEval-2026 Task 2: DeBERTa for Longitudinal Valence and Arousal Prediction
Melissa Moreno | Juan Carlos Martinez Santos | Edwin Puertas
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper describes our submission to SemEval 2026 Subtask 1: Longitudinal Affect Assessment, which aims to predict continuous valence and arousal scores from chronologically ordered texts. We implement two regression-based configurations built on DeBERTa fine-tuning: a contextual model and a hybrid model that incorporates normalized lexical features derived from the NRC VAD lexicon. Both systems preserve temporal ordering and apply user-level data splits to ensure generalization to unseen individuals. Results show competitive performance, with stronger outcomes in valence than in arousal. The integration of lexical features does not yield consistent improvements for arousal, highlighting the difficulty of modeling emotional intensity dynamics. Error analysis indicates challenges in handling implicit emotions, pragmatic ambiguity, and subtle affective shifts over time. Overall, findings underscore the importance of combining contextual representations with structured lexical knowledge while addressing longitudinal variability in emotional activation.
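
As a rough illustration of the user-level split mentioned in this abstract, the sketch below assigns whole users to either the training or the development set and keeps each user's texts in chronological order. The record keys (`user`, `timestamp`, `text`), the function name, and the 20% development fraction are assumptions made for this example and are not taken from the paper.

```python
import random
from collections import defaultdict


def user_level_split(records, dev_fraction=0.2, seed=0):
    """Split records so that no user appears in both train and dev.

    Each record is a dict with the hypothetical keys
    'user', 'timestamp', and 'text'.
    """
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec["user"]].append(rec)

    # Shuffle the user ids (not the texts) and hold out a fraction of users.
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_dev = max(1, round(len(users) * dev_fraction))
    dev_users = set(users[:n_dev])

    train, dev = [], []
    for user in sorted(by_user):
        # Keep each user's texts in chronological order.
        ordered = sorted(by_user[user], key=lambda r: r["timestamp"])
        (dev if user in dev_users else train).extend(ordered)
    return train, dev
```

Splitting on user ids rather than on individual texts means that development scores reflect performance on people the model has never seen.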

VerbaNexAI at SemEval-2026 Task 4: Two-Stage Narrative Similarity via Fine-Tuned Bi-Encoder with MLP Ensemble
Pablo Pertuz-Duran | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper describes VerbaNex AI’s participation in SemEval-2026 Task 4: Narrative Similarity, a shared task on assessing semantic relatedness between short narrative texts. The task comprises two tracks: Track A requires selecting which of two candidate stories is more similar to an anchor, and Track B requires producing fixed-size story embeddings whose cosine similarity reflects narrative relatedness. We propose a unified two-stage system built on Qwen3-Embedding-0.6B. The first stage fine-tunes the encoder as a bi-encoder with a 512-dimensional projection head using a composite loss combining margin ranking, pairwise softmax, and multiple negatives ranking objectives. The second stage trains a lightweight MLP head over frozen bi-encoder embeddings using pairwise interaction features, with k-fold cross-validation and logit-averaging ensemble inference. The system was trained exclusively on the official supervised data, without leveraging the additional 1,900 LLM-generated synthetic triples released by the organizers. Although the system ranked first on both tracks in the development phase, its performance did not transfer to the official test set, where it ranked 47th on Track A and 22nd on Track B.
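
The two decision rules named in this abstract can be sketched in a few lines: comparing cosine similarities between an anchor embedding and two candidate embeddings, and averaging per-fold logits before thresholding. This is a minimal sketch only. The function names, the toy vectors, and the convention that a positive logit means "story A is closer" are assumptions for illustration; the encoder and the MLP head themselves are not reproduced here.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def pick_closer_story(anchor, story_a, story_b):
    """Return 'A' or 'B' for the candidate whose embedding is closer to the anchor."""
    sim_a = cosine_similarity(anchor, story_a)
    sim_b = cosine_similarity(anchor, story_b)
    return "A" if sim_a >= sim_b else "B"


def ensemble_choice(fold_logits):
    """Average per-fold logits (assumed: positive means 'A is closer') and decide."""
    mean_logit = sum(fold_logits) / len(fold_logits)
    return "A" if mean_logit >= 0.0 else "B"
```

Averaging logits rather than hard votes lets a confident fold outweigh several weakly opposed ones.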

VerbaNexAI at SemEval-2026 Task 5: Few-Shot Chain-of-Thought with Selective Self-Consistency and Isotonic Calibration for Word Sense Plausibility Rating
Daniel Peña Gnecco | Edwin Puertas | Juan Carlos Martinez Santos | Jairo Serrano
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present a system for rating word sense plausibility in ambiguous narrative contexts for SemEval-2026 Task 5. Our approach ensembles three large language models (Llama-3.1 70B, Qwen-2.5 32B, and Gemma-2 27B) using a computationally efficient, uncertainty-aware pipeline. We combine few-shot chain-of-thought prompting with selective self-consistency, which applies stochastic multiple sampling exclusively to items identified as inherently ambiguous. This targeted strategy reduces inference costs by approximately 45% while maintaining robustness in predictions. To correct the systematic bias of LLMs toward extreme ratings, we apply isotonic regression to shift the output distribution toward patterns of human judgment. Our system achieves a Spearman correlation of 0.67 and an accuracy within 0.76 standard deviations, ranking 34th out of 79 participating teams (top 43% without task-specific fine-tuning). Detailed error analysis reveals that while our system performs strongly on clear contexts (ρ = 0.78), current prompting paradigms struggle significantly to model multimodal human disagreement in genuinely ambiguous cases (ρ = 0.58), highlighting an important challenge for future work on subjective semantic tasks.
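
To make the calibration step concrete, the sketch below fits a non-decreasing mapping from model ratings to human ratings with the pool-adjacent-violators algorithm, the standard way to compute an isotonic regression. It is a pure-Python illustration under stated assumptions: the function names and the toy ratings are hypothetical, tied input values are not pooled, and the authors' actual implementation is not described in the abstract.

```python
from bisect import bisect_left


def fit_isotonic(xs, ys):
    """Pool Adjacent Violators: return sorted xs and non-decreasing fitted ys.

    Simplification for this sketch: tied x values are not pooled together.
    """
    pairs = sorted(zip(xs, ys))
    blocks = []  # each block is [sum of targets, number of points]
    for _, y in pairs:
        blocks.append([y, 1])
        # Merge backwards while the previous block's mean exceeds the current one.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return [x for x, _ in pairs], fitted


def calibrate(x, fit_x, fit_y):
    """Map a new raw rating through the fitted curve (linear interpolation, clipped)."""
    if x <= fit_x[0]:
        return fit_y[0]
    if x >= fit_x[-1]:
        return fit_y[-1]
    i = bisect_left(fit_x, x)
    x0, x1 = fit_x[i - 1], fit_x[i]
    y0, y1 = fit_y[i - 1], fit_y[i]
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
```

Because the fitted curve is monotone, calibration changes the spread of the ratings without reordering them, so rank-based metrics such as Spearman correlation are affected only through newly created ties.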

VerbaNexAI at SemEval-2026 Task 6: Automatic Detection of Political Evasion through Hierarchical Classification with RoBERTa Large
Jeison Jimenez Alvear | Deyson Gómez Sánchez | Juan Carlos Martinez Santos | Edwin Puertas | Jairo Serrano
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

This paper describes VerbaNex AI’s participation in SemEval-2026 Task 6: CLARITY, a shared task on automatic detection of question evasion in political interview transcripts. The task requires classifying question-answer pairs into three clarity levels (Task 1) and identifying nine evasion techniques (Task 2). We propose and evaluate two independent systems based on RoBERTa-Large. The first is a standard sequence classifier that treats each question-answer pair as a single input sequence, leveraging RoBERTa’s native two-segment encoding to model the relationship between the two texts jointly. The second is a dual-encoder architecture that processes the question and answer independently and computes geometric interaction features to model the semantic misalignment between them explicitly. Both systems are trained on Task 2 labels and derive Task 1 predictions via the hierarchical mapping proposed by the task organizers. Our best result was achieved by the standard sequence classifier, reaching Rank 10 on Task 2 and Rank 25 on Task 1.

VerbaNexAI at SemEval-2026 Task 7: Integrating Web Snippets and RAG for the Evaluation of Multilingual Cultural Knowledge in LLMs
Danileth Almanza | Jairo Serrano | Edwin Puertas | Juan Carlos Martinez Santos
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

In multilingual and multicultural contexts, LLMs require contextualization mechanisms to generate culturally coherent responses. In this sense, this study presents a LLaMA-based approach to answer short cultural questions in different languages within Task 7 of SemEval-2026 (Track 1: SAQ), without access to official training data. The system integrates controlled synthetic data generation, evidence retrieval through web snippets, and a Retrieval-Augmented Generation (RAG) framework with few-shot learning. BLEnD is used solely as a thematic guide, ensuring semantic independence. During development, the LLaMA-3.1-8B model achieved 38.51% global accuracy, while LLaMA-3.2-1B obtained 15.54%. In large-scale evaluation (30,500 instances), the 1B model achieved 16.69%, maintaining stability after prompt optimization. The results demonstrate that contextual retrieval improves multilingual cultural knowledge evaluation and highlight the importance of pipeline design and model capacity.

VerbaNexAI at ClinicalSkillQA: From Visual Evidence to Procedural Order A Two-Stage Generative Vision-Language Framework for ClinSkillQA
Andrea Menco Tovar | Jairo E. Serrano | Edwin Puertas | Juan Carlos Martinez-Santos
Proceedings of the BioNLP 2026 (Shared Tasks)

This work addresses the temporal ordering task of clinical frames in the Basic Life Support (BLS) subset of ClinSkillQA. A two-stage hybrid pipeline based on Qwen2-VL-2B-Instruct in a zero-shot configuration is proposed. In Stage 1, each image is processed independently to extract factual visual evidence, which is then transformed, using deterministic rules, into a structured representation. In Stage 2, ordering is formulated as an ordinal scoring task over procedural stages, with ties broken using PCA applied to multimodal embeddings. Evaluation followed the official benchmark protocol, considering Task Accuracy, Pairwise Accuracy, and BERTScore. In the test phase, the system achieved Task Accuracy = 0.17, Pairwise Micro Accuracy = 0.60, and BERT F1 = 0.71, with complete coverage in both predictions and rationales. The results demonstrate an interpretable and reproducible foundation, although challenges in fine-grained temporal discrimination remain.

2025

VerbaNexAI at SemEval-2025 Task 3: Fact Retrieval with Google Snippets for LLM Context Filtering to identify Hallucinations
Anderson Morillo | Edwin Puertas | Juan Carlos Martinez Santos
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

The first approach leverages advanced LLMs, employing a chain-of-thought prompting strategy with one-shot learning and Google snippets for context retrieval, demonstrating superior performance. The second approach utilizes traditional NLP analysis techniques, including semantic ranking, token-level extraction, and rigorous data cleaning, to identify hallucinations.

VerbaNexAI at SemEval-2025 Task 2: Enhancing Entity-Aware Translation with Wikidata-Enriched MarianMT
Daniel Peña Gnecco | Juan Carlos Martinez Santos | Edwin Puertas
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents the VerbaNexAI Lab system for SemEval-2025 Task 2: Entity-Aware Machine Translation (EA-MT), focusing on translating named entities from English to Spanish across categories such as musical works, foods, and landmarks. Our approach integrates detailed data preprocessing, enrichment with 240,432 Wikidata entity pairs, and fine-tuning of the MarianMT model to enhance entity translation accuracy. Official results reveal a COMET score of 87.09, indicating high fluency, an M-ETA score of 24.62, highlighting challenges in entity precision, and an Overall Score of 38.38, ranking last among 34 systems. While Wikidata improved translations for common entities like “Águila de San Juan,” our static methodology underperformed compared to dynamic LLM-based approaches.

UTBNLP at SemEval-2025 Task 11: Predicting Emotion Intensity with BERT and VAD-Informed Attention
Melissa Moreno Novoa | Edwin Puertas | Juan Carlos Martinez-Santos
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Emotion intensity prediction plays a crucial role in affective computing, allowing for a more precise understanding of how emotions are conveyed in text. This study proposes a system that estimates emotion intensity levels by integrating contextual language representations with numerical emotion-based features derived from Valence, Arousal, and Dominance (VAD). The methodology combines BERT embeddings, predefined VAD values per emotion, and machine learning techniques to enhance emotion detection, without relying on external lexicons. The system was evaluated on the SemEval-2025 Task 11 Track B dataset, predicting five emotions (anger, fear, joy, sadness, and surprise) on an ordinal scale. The results highlight the effectiveness of integrating contextual representations with predefined VAD values, enabling a more nuanced representation of emotional intensity. However, challenges arose in distinguishing intermediate intensity levels, affecting classification accuracy for certain emotions. Despite these limitations, the study provides insights into the strengths and weaknesses of combining deep learning with numerical emotion modeling, contributing to the development of more robust emotion prediction systems. Future research will explore advanced architectures and additional linguistic features to enhance model generalization across diverse textual domains.

VerbaNexAI at SemEval-2025 Task 11 Track A: A RoBERTa-Based Approach for the Classification of Emotions in Text
Danileth Almanza Gonzalez | Edwin Puertas | Juan Carlos Martinez-Santos
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Emotion detection in text has become a highly relevant research area due to the growing interest in understanding emotional states from human interaction in the digital world. This study presents an approach for emotion detection in text using a RoBERTa-based model, optimized for multi-label classification of the emotions joy, sadness, fear, anger, and surprise in the context of the SemEval 2025 - Task 11: Bridging the Gap in Text-Based Emotion Detection competition. Advanced preprocessing strategies were incorporated, including the augmentation of the training dataset through automatic translation to improve the representativeness of less frequent emotions. Additionally, a loss function adjustment mechanism was implemented to mitigate class imbalance, enabling the model to enhance its detection capability for underrepresented categories. The experimental results reflect competitive performance, with a macro F1 of 0.6577 on the development set and 0.6266 on the test set. In the competition, the model ranked 47th, demonstrating solid performance against the challenge posed.

VerbaNexAI at SemEval-2025 Task 9: Advances and Challenges in the Automatic Detection of Food Hazards
Andrea Menco Tovar | Edwin Puertas | Juan Carlos Martinez-Santos
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Ensuring food safety requires effective detection of potential hazards in food products. This paper presents the participation of VerbaNexAI in the SemEval-2025 Task 9 challenge, which focuses on the automatic identification and classification of food hazards from descriptive texts. Our approach employs a machine learning-based strategy, leveraging a Random Forest classifier combined with TF-IDF vectorization and character n-grams (n=2-5) to enhance linguistic pattern recognition. The system achieved competitive performance in hazard and product classification tasks, obtaining notable macro and micro F1 scores. However, we identified challenges such as handling underrepresented categories and improving generalization in multilingual contexts. Our findings highlight the need to refine preprocessing techniques and model architectures to enhance food hazard detection. We made the source code publicly available to encourage reproducibility and collaboration in future research.
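
The feature side of this approach, character n-grams with n from 2 to 5 weighted by TF-IDF, can be illustrated without any external library. The sketch below is an assumption-laden illustration: the function names are hypothetical, the smoothed IDF formula `ln((1 + N) / (1 + df)) + 1` is one common variant chosen for the example, vectors are not length-normalized, and the Random Forest classifier is left out.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=2, n_max=5):
    """Count all character n-grams of length n_min..n_max in lowercased text."""
    text = text.lower()
    grams = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            grams[text[i:i + n]] += 1
    return grams


def tfidf_vectors(texts, n_min=2, n_max=5):
    """Return one sparse {ngram: weight} dict per text (raw tf times smoothed idf)."""
    counts = [char_ngrams(t, n_min, n_max) for t in texts]
    doc_freq = Counter()
    for c in counts:
        doc_freq.update(c.keys())  # each document counts once per n-gram
    n_docs = len(texts)
    idf = {g: math.log((1 + n_docs) / (1 + df)) + 1.0 for g, df in doc_freq.items()}
    return [{g: tf * idf[g] for g, tf in c.items()} for c in counts]
```

Character n-grams are tolerant of spelling variants and inflected forms, which may be one reason to prefer them over word tokens for short, noisy hazard descriptions.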

2024

VerbaNexAI Lab at SemEval-2024 Task 1: A Multilayer Artificial Intelligence Model for Semantic Relationship Detection
Anderson Morillo | Daniel Peña | Juan Carlos Martinez Santos | Edwin Puertas
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This paper presents an artificial intelligence model designed to detect semantic relationships in natural language, addressing the challenges of SemEval 2024 Task 1. Our goal is to advance machine understanding of the subtleties of human language through semantic analysis. Using a novel combination of convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and an attention mechanism, our model is trained on the STR-2022 dataset. This approach enhances its ability to detect semantic nuances in different texts. The model achieved an 81.92% effectiveness rate and ranked 24th in SemEval 2024 Task 1. These results demonstrate its robustness and adaptability in detecting semantic relationships and validate its performance in diverse linguistic contexts. Our work contributes to natural language processing by providing insights into semantic textual relatedness. It sets a benchmark for future research and promises to inspire innovations that could transform digital language processing and interaction.

VerbaNexAI Lab at SemEval-2024 Task 3: Deciphering emotional causality in conversations using multimodal analysis approach
Victor Pacheco | Elizabeth Martinez | Juan Cuadrado | Juan Carlos Martinez Santos | Edwin Puertas
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This study delineates our participation in the SemEval-2024 Task 3: Multimodal Emotion Cause Analysis in Conversations, focusing on developing and applying an innovative methodology for emotion detection and cause analysis in conversational contexts. Leveraging logistic regression, we analyzed conversational utterances to identify emotions per utterance. Subsequently, we employed a dependency analysis pipeline, utilizing SpaCy to extract significant chunk features, including object, subject, adjectival modifiers, and adverbial clause modifiers. These features were analyzed within a graph-like framework, conceptualizing the dependency relationships as edges connecting emotional causes (tails) to their corresponding emotions (heads). Despite the novelty of our approach, the preliminary results were unexpectedly humbling, with a consistent score of 0.0 across all evaluated metrics. This paper presents our methodology, the challenges encountered, and an analysis of the potential factors contributing to these outcomes, offering insights into the complexities of emotion-cause analysis in multimodal conversational data.

VerbaNexAI Lab at SemEval-2024 Task 10: Emotion recognition and reasoning in mixed-coded conversations based on an NRC VAD approach
Santiago Garcia | Elizabeth Martinez | Juan Cuadrado | Juan Carlos Martinez-Santos | Edwin Puertas
Proceedings of the 18th International Workshop on Semantic Evaluation (SemEval-2024)

This study introduces an innovative approach to emotion recognition and reasoning about emotional shifts in code-mixed conversations, leveraging the NRC VAD Lexicon and computational models such as Transformer and GRU. Our methodology systematically identifies and categorizes emotional triggers, employing Emotion Flip Reasoning (EFR) and Emotion Recognition in Conversation (ERC). Through experiments with the MELD and MaSaC datasets, we demonstrate the model’s precision in accurately identifying emotional shift triggers and classifying emotions, evidenced by a significant improvement in accuracy as shown by an increase in the F1 score when including VAD analysis. These results underscore the importance of incorporating complex emotional dimensions into conversation analysis, paving new pathways for understanding emotional dynamics in code-mixed texts.

VerbaNexAI at MEDIQA-CORR: Efficacy of GRU with BioWordVec and ClinicalBERT in Error Correction in Clinical Notes
Juan Pajaro | Edwin Puertas | David Villate | Laura Estrada | Laura Tinjaca
Proceedings of the 6th Clinical Natural Language Processing Workshop

The automatic identification of medical errors in clinical notes is crucial for improving the quality of healthcare services. LLMs emerge as a powerful artificial intelligence tool for automating this task. However, LLMs present vulnerabilities, high costs, and sometimes a lack of transparency. This article addresses the detection of medical errors through the fine-tuning approach, conducting a comprehensive comparison between various models and exploring in depth the components of the machine learning pipeline. The results obtained with the fine-tuned ClinicalBERT and gated recurrent unit (GRU) models show an accuracy of 0.56 and 0.55, respectively. This approach not only mitigates the problems associated with the use of LLMs but also demonstrates how exhaustive iteration in critical phases of the pipeline, especially in feature selection, can facilitate the automation of clinical record analysis.

2023

UTB-NLP at SemEval-2023 Task 3: Weirdness, Lexical Features for Detecting Categorical Framings, and Persuasion in Online News
Juan Cuadrado | Elizabeth Martinez | Anderson Morillo | Daniel Peña | Kevin Sossa | Juan Carlos Martinez-Santos | Edwin Puertas
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

Nowadays, persuasive messages are increasingly frequent in social networks, which generates great concern in several communities, given that persuasion seeks to guide others towards the adoption of ideas, attitudes, or actions that they consider to be beneficial to themselves. The efficient detection of news genre categories, framing, and persuasion techniques draws on several scientific disciplines, such as computational linguistics and sociology. Here we illustrate how we use lexical features to determine whether a given news article is an opinion piece, aims to report factual news, or is satire. This paper presents a novel strategy for news based on Lexical Weirdness. The results are part of our participation in subtasks 1 and 2 of SemEval 2023 Task 3.