Saurav K. Aryal - ACL Anthology

Saurav K. Aryal

Also published as: Saurav Aryal

2026

AI4PC-Howard University at SemEval-2026 Task 2: Fine-Tuning DistilBERT, DeBERTa and ModernBERT for Valence–Arousal Prediction and Change Estimation
Araj Shah | Utsav Shah | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present lightweight, reproducible models for longitudinal valence–arousal (VA) prediction in the SemEval-2026 Task 2 essay corpus. Using only the official data, we enforce user-disjoint splits to prevent leakage and evaluate three settings: essay-level VA state estimation, short-horizon VA change forecasting, and long-horizon disposition change prediction. Our submitted systems use DistilBERT for essay-level regression, ModernBERT-based history modeling with a GRU and a blended previous-delta baseline for short-horizon change, and pooled DeBERTa history embeddings with a compact MLP for disposition change. On the official evaluation, across our best performing approaches, we achieve r_comp = 0.665/0.468 (valence/arousal) for Subtask 1, r = 0.597/0.413 for Subtask 2A, and r = 0.046/0.348 for Subtask 2B.
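
The user-disjoint splitting mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the `user_id` field name, the 20% validation fraction, and the function name `user_disjoint_split` are all assumptions made for illustration. The idea is simply that whole users, not individual essays, are assigned to a partition, so no writer's essays appear on both sides.

```python
import random
from collections import defaultdict


def user_disjoint_split(records, user_key="user_id", val_fraction=0.2, seed=0):
    """Split records into (train, val) so that no user appears in both.

    `records` is a list of dicts; `user_key` is a hypothetical field name.
    """
    # Group every record under the user who wrote it.
    by_user = defaultdict(list)
    for rec in records:
        by_user[rec[user_key]].append(rec)

    # Shuffle users (not records) deterministically, then carve off a share.
    users = sorted(by_user)
    random.Random(seed).shuffle(users)
    n_val = max(1, round(len(users) * val_fraction))
    val_users = set(users[:n_val])

    train = [r for u in users if u not in val_users for r in by_user[u]]
    val = [r for u in users if u in val_users for r in by_user[u]]
    return train, val


# Toy data: 10 hypothetical users with 3 essays each.
essays = [{"user_id": f"u{i}", "text": f"essay {j}"} for i in range(10) for j in range(3)]
train_set, val_set = user_disjoint_split(essays)
```

Splitting at the essay level instead would let a model see other essays by the same writer during training, which tends to inflate validation scores on longitudinal data.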

AI4PC-Howard University at SemEval-2026 Task 12: Evidence-Guided Abductive Scoring with Option-Conditioned Retrieval and Constrained LLM Evaluation
Ifeoluwakiitan Ayandosu | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

Abductive event reasoning in the wild requires selecting plausible explanations for an event from noisy, partially relevant multi-document context. We present an evidence-guided abductive scoring pipeline for SemEval-2026 Task 12 that separates evidence selection from explanation scoring. For each topic, we chunk documents and retrieve option-conditioned evidence using dense embeddings, then apply a cross-encoder reranker to form compact evidence packs per option. A constrained large language model scorer evaluates each option using only its evidence pack and outputs structured signals capturing evidence support, explanatory directness, and contradiction. We then apply deterministic decision rules to produce single or multi-label predictions, including robust handling of “none of the above” style options through lexical-cue detection rather than reliance on option position. This modular design reduces distraction from irrelevant documents, improves comparability across options, and enables controlled calibration for multi-answer outputs. Our approach demonstrates that retrieval-focused evidence compression combined with disciplined, signal-based scoring can effectively support abductive reasoning without explicit knowledge graphs or end-to-end prompting over full document context.

Howard University-AI4PC at SemEval-2026 Task 7: Culturally Aware Multilingual Model Routing Through a Mixture-of-Specialists Framework
Isaac Adjei | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

SemEval-2026 Task 7 (BLEnD) evaluates culturally contextual multiple-choice reasoning across 26 languages and 30 geographic regions, emphasizing everyday knowledge, cultural norms, and region-specific variations in language use. This paper presents the Howard University–AI4PC system, a Phase 1 implementation of a culturally aware Mixture-of-Specialists (MoS) framework designed to improve multilingual cultural reasoning without requiring large-scale fine-tuning. Our approach integrates four key components: (1) linguistic and regional metadata extraction for identifying language, dialect, and cultural context; (2) a hierarchical routing strategy that selects the most culturally aligned model path; (3) Model Control Prompting (MCP), which injects region-aware constraints, dialectal hints, and output-format controls; and (4) a lightweight retrieval-augmented layer that supplies culturally specific factual cues. Although specialist LoRA/QLoRA adapters are planned for future phases, the routing and prompting layers alone achieve 80.01% accuracy on 47,014 test MCQs, demonstrating that cultural grounding and linguistically informed routing substantially enhance performance even in the absence of trained experts. We summarize the task, describe the system in detail, present quantitative and qualitative analyses, and outline next-stage extensions involving specialist model training and expanded cultural knowledge integration.

AI4PC-Howard University at SemEval-2026 Task 5: Calibrated Hybrid Ensembling and Retrieval-Augmented LLM Reasoning for Narrative Word-Sense Plausibility
Kwaku Asare | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present two complementary approaches for rating word-sense plausibility in SemEval-2026 Task 5 (literary homonyms in five-sentence stories). Approach 1 is a retrieve-then-generate pipeline using an open-weight Llama 3.1 70B Instruct model with structured reasoning and a self-correction pass. Approach 2 is a hybrid ensemble that combines API-based LLM prompting with transformer representations and a learned calibration layer trained on the development set. On the development set, Approach 2 achieves Spearman ρ = 0.7393 (p on the order of 10^-102) with accuracy 0.8010 (471/588). Approach 1 achieves ρ = 0.5187 (p on the order of 10^-65) with accuracy 0.6032 (561/930). We emphasize that Approach 1 does not exceed RoBERTa-base in accuracy (0.6032 vs. 0.6410), but provides stronger rank correlation.

Howard University-AI4PC at SemEval-2026 Task 8: Query Reformulation and Dense-Lexical Retrieval Fusion for Multi-Turn Retrieval-Augmented Generation
Sijan Shrestha | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present a training-free hybrid retrieve-then-rerank system for multi-turn retrieval-augmented generation, submitted to all three subtasks of SemEval-2026 Task 8 (MTRAGEval): passage retrieval (Task A), generation with reference passages (Task B), and end-to-end RAG (Task C). Our system addresses the core multi-turn challenges (non-standalone questions, unanswerable queries, and shifting passage relevance) across four domain-specific corpora: ClapNQ, Cloud, FiQA, and Govt. Queries are reformulated through LLM-driven rewriting, decomposition into sub-queries, and Hypothetical Document Embeddings (HyDE). Retrieved candidates from dense vector search (BGE-base-en-v1.5) and BM25 lexical matching are fused via Reciprocal Rank Fusion and reranked by a cross-encoder (BGE-reranker-large). Llama-3.3-70B-Instruct generates extractive, context-grounded responses with built-in abstention for unanswerable queries. Using only open-source models without fine-tuning, the system achieves nDCG@5 of 0.4098 on Task A (22nd/38), a harmonic mean of 0.7462 on Task B (9th/26), and 0.5796 on Task C (2nd/29), coming within 1.1% of the top submission. We attribute the strong Task C result to the synergy between multi-signal query reformulation and faithful extractive generation.
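
As a rough illustration of the Reciprocal Rank Fusion step named in this abstract, the sketch below fuses two ranked lists by summing 1 / (k + rank) for each passage across lists. The constant k = 60 is a commonly used default and an assumption here, as are the function name and the toy passage IDs; the abstract does not state the settings the system actually used.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of IDs into a single ranking.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. k=60 is an assumed, conventional default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical candidate lists from a dense retriever and from BM25.
dense_hits = ["p3", "p1", "p2"]
bm25_hits = ["p1", "p4", "p3"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)  # ['p1', 'p3', 'p4', 'p2']
```

Because only ranks are used, the dense and lexical scores never need to be put on a common scale, which is one reason this kind of fusion is convenient in a training-free pipeline.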

Howard University-AI4PC at SemEval-2026 Task 1: Exploring Prompt Strategies for Automatic Humor Generation
Lawal Abdulmujeeb | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We present our system for SemEval-2026 Task 1, Subtask A, a humor generation task that requires systems to generate jokes from either a news headline or a word pair. Our approach uses the Llama-3.1-8B-Instruct model, which we selected after comparing several candidate models and humor strategies. For headline inputs, we used a two-shot prompt that frames the output as a tweet; specifying the tone proved to be a particularly important factor in output quality. For word-pair inputs, we instructed the model to commit to an everyday situation and generate a funny thought based on it. While experimenting, we noticed that models would start a joke one way with the first word and then abruptly shift context mid-joke just to include the second word; committing to a single situation helped mitigate this. We also made use of personas here, specifically Dave Chappelle. Our final system shared 2nd place with 3 other systems out of 32 and achieved an Elo score of 1020. Achieving these results without fine-tuning suggests that careful prompt design alone can be competitive.

AI4PC-Howard University at SemEval-2026 Task 9: Evaluating Teacher-Student Weak Supervision and Direct LLM Prompting for Multilingual Political Polarization Detection
Surangana Aryal | Saurav Aryal
Proceedings of the 20th International Workshop on Semantic Evaluation (2026)

We describe the AI4PC–Howard University submission to SemEval-2026 Task 9, Subtask 1 on multilingual political polarization detection across 22 languages. We investigated two approaches: (1) a weakly supervised teacher–student framework in which a large language model (LLM) generated pseudo-labels to train an XLM-RoBERTa-base classifier, and (2) a context-engineered prompt-based approach using Meta-Llama-3.1-8B-Instruct. The teacher–student approach exhibited instability under distribution shift and collapsed toward majority predictions at test time. Consequently, our final submission used direct inference with Meta-Llama-3.1-8B-Instruct. While this approach produced competitive macro-F1 across evaluated languages, results reveal strong positive-class bias and substantial precision–recall imbalance. Our findings highlight limitations of weak supervision for subjective political tasks and underscore trade-offs between scalability, bias, and computational cost in LLM-only multilingual systems.

2025

Howard University-AI4PC at SemEval-2025 Task 9: Using Open-weight BART-MNLI for Zero Shot Classification of Food Recall Documents
Saurav K. Aryal | Kritika Pant
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

We present our system for SemEval-2025 Task 9: Food Hazard Detection, a shared task focused on the explainable classification of food-incident reports. The task involves predicting hazard and product categories (ST1) and their exact vectors (ST2) from short texts. Our approach leverages zero-shot classification using the BART-large-MNLI model, which allows classification without task-specific fine-tuning. Our model achieves competitive performance, emphasizing hazard prediction accuracy, as evaluated by the macro-F1 score.

Howard University-AI4PC at SemEval-2025 Task 2: Improving Machine Translation With Context-Aware Entity-Only Pre-translations with GPT4o
Saurav K. Aryal | Jabez Agyemang-Prempeh
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents our work on a 3-Step GPT translation system developed for SemEval-2025 Task 2 to enhance the translation of named entities within machine translation. Our approach integrates (1) entity extraction via Wikidata, (2) GPT-based refinement of entity translations, and (3) final context-aware GPT translation. Results from the original dataset of six languages show significant improvements in the handling of named entities compared to direct GPT-based translation baselines. We further discuss replicability and observed challenges, and outline future research directions.

Howard University-AI4PC at SemEval-2025 Task 1: Using GPT-4o and CLIP-ViLT to Decode Figurative Language Across Text and Images
Saurav K. Aryal | Lawal Abdulmujeeb
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

Correctly identifying idiomatic expressions remains a major challenge in Natural Language Processing (NLP), as these expressions often have meanings that cannot be directly inferred from their individual words. The SemEval-2025 Task 1 introduces two subtasks, A and B, designed to test models’ ability to interpret idioms using multimodal data, including both text and images. This paper focuses on Subtask A, where the goal is to determine which among several images best represents the intended meaning of an idiomatic expression in a given sentence. To address this, we employed a two-stage approach. First, we used GPT-4o to analyze sentences, extracting relevant keywords and sentiments to better understand the idiomatic usage. This processed information was then passed to a CLIP-VIT model, which ranked the available images based on their relevance to the idiomatic expression. Our results showed that this approach performed significantly better than directly feeding sentences and idiomatic compounds into the models without preprocessing. Specifically, our method achieved a Top-1 accuracy of 0.67 in English, whereas performance in Portuguese was notably lower at 0.23. These findings highlight both the promise of multimodal approaches for idiom interpretation and the challenges posed by language-specific differences in model performance.

Howard University - AI4PC at SemEval-2025 Task 3: Logit-based Supervised Token Classification for Multilingual Hallucination Span Identification Using XGBOD
Saurav K. Aryal | Mildness Akomoize
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper describes our system for SemEval-2025 Task 3, Mu-SHROOM, which focuses on detecting hallucination spans in multilingual LLM outputs. We reframe hallucination detection as a point-wise anomaly detection problem by treating logits as time-series data. Our approach extracts features from token-level logits, addresses class imbalance with SMOTE, and trains an XGBOD model for probabilistic character-level predictions. Our system, which relies solely on information derived from the logits and token offsets (using pretrained tokenizers), achieves competitive intersection-over-union (IoU) and correlation scores on the validation and test set.

Howard University-AI4PC at SemEval-2025 Task 7: Crosslingual Fact-Checked Claim Retrieval-Combining Zero-Shot Claim Extraction and KNN-Based Classification for Multilingual Claim Matching
Suprabhat Rijal | Saurav K. Aryal
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

SemEval Task 7 introduced a dataset for multilingual and cross-lingual fact checking. We propose a system that leverages similarity matching, KNN, zero-shot classification and summarization to retrieve fact-checks for social media posts across multiple languages. Our approach achieves performance within the expected range, aligning with baseline results. Although competitive, the findings highlight the potential and challenges of zero-shot methods, providing a foundation for future research in cross-lingual information verification.

Howard University-AI4PC at SemEval-2025 Task 4: Unlearning Sensitive Content From Large Language Models Using Finetuning and Distillation for Selective Knowledge Removal
Aayush Acharya | Saurav K. Aryal
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents our approach and submission to the SemEval 2025 task on “Unlearning Sensitive Content from Large Language Models.” The task focuses on making LLMs forget specific knowledge, such as copyrighted material and personally identifiable information (PII), without needing expensive retraining from scratch on the OLMo model. We propose a method to unlearn using fine-tuning and knowledge distillation. Our approach involves fine-tuning separate models on “retain” and “forget” datasets to preserve or suppress knowledge selectively. We then distill the model by suppressing logarithmic data from the fine-tuned model without learning using a combined loss of L2, KL divergence and cosine similarity while retaining knowledge from the fine-tuned model with retention using KL divergence loss.

Howard University-AI4PC at SemEval-2025 Task 8: DeepTabCoder - Code-based Retrieval and In-context Learning for Question-Answering over Tabular Data
Saharsha Tiwari | Saurav K. Aryal
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper presents our approach, named DeepTabCoder, to SemEval 2025 - Task 8: DataBench, which focuses on question-answering over tabular data. We utilize a code-based retrieval system combined with in-context learning, which generates and executes code to answer questions, leveraging DeepSeek-V3 for code generation. DeepTabCoder outperforms the baseline, achieving accuracies of 81.42% on the DataBench dataset and 80.46% on the DataBench Lite dataset.

Howard University-AI4PC at SemEval-2025 Task 11: Combining Expert Personas via Prompting for Enhanced Multilingual Emotion Analysis
Amir Ince | Saurav K. Aryal
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

For our approach to SemEval-2025 Task 11, we employ a multi-tier evaluation framework for perceived emotion analysis. Our system consists of a smaller-parameter-size large language model that independently predicts a given text’s perceived emotion while explaining the reasoning behind its decision. The initial model’s persona is varied through careful prompting, allowing it to represent multiple perspectives. These outputs, including both predictions and reasoning, are aggregated and fed into a final decision-making model that determines the ultimate emotion classification. We evaluated our approach in official SemEval Task 11 on subtasks A and C in all the languages provided.

Howard University-AI4PC at SemEval-2025 Task 10: Ensembling LLMs for Multi-lingual Multi-Label and Multi-Class Meta-Classification
Saurav K. Aryal | Prasun Dhungana
Proceedings of the 19th International Workshop on Semantic Evaluation (SemEval-2025)

This paper describes our approach and submission to the SemEval 2025 shared task on “Multilingual Characterization and Extraction of Narratives from Online News”. The purpose of this task was to assign primary and fine-grained roles to named entities in news articles from five different languages, on the topics of Climate Change and Ukraine-Russia War. In this paper, we explain how we approached the task by utilizing multiple LLMs via Prompt Engineering and combining their results into a final task result through an ensemble meta-classification technique. Our experimental results demonstrate that this integrated approach outperforms the provided baseline in detecting bias, deception, and manipulation in news media across multiple languages.

2023

Howard University Computer Science at SemEval-2023 Task 12: A 2-Step System Design for Multilingual Sentiment Classification with Language Identification
Saurav Aryal | Howard Prioleau
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The recent release of the AfriSenti-SemEval shared Task 12 has made available 14 new datasets annotated for sentiment analysis on African Languages. We proposed and evaluated two approaches to this task: Delta TF-IDF and a Language-Specific Model Fusion Algorithm using Language Identification. Both produced classification performance comparable to or better than the current state-of-the-art models on this task: AfriBERTa, AfroXLMR, and AfroLM.

Co-authors

Surangana Aryal 1

Ifeoluwakiitan Ayandosu 1

Prasun Dhungana 1

Howard Prioleau 1

Suprabhat Rijal 1

Sijan Shrestha 1

Saharsha Tiwari 1

Venues

SemEval 17
WS 16