Meizhi Jin


2026

Large language models (LLMs) often exhibit significant cultural representation biases in multilingual everyday knowledge understanding and struggle to capture region-specific customs and values accurately. This paper presents our system submission for SemEval 2026 Task 7: BLEnD Challenge Track 2 (MCQ) (SemEval-2026 Task 7 Organizers, 2026). To address these challenges, we propose a training-free retrieval-augmented generation (RAG) framework. Without introducing any external data, we manually constructed a localized multicultural knowledge base for each language-region pair and used text-embedding-v4 to retrieve region-specific cultural background. In the generation stage, we adopted a strict zero-shot setting: prompts contain no question-answer examples from the task and only inject locale-relevant cultural background descriptions via RAG to compensate for missing context. We combined this with a dual-model ensemble strategy using Gemini 3 Flash (preview) (Google DeepMind, 2025) and GPT-5.2 Chat (OpenAI, 2025). Our system achieved an overall score of 96.35 on the final evaluation dataset. Additionally, we conducted an in-depth analysis of model performance on specific languages, highlighting the severe cultural alignment challenges that LLMs face in dialectal variants such as Moroccan Arabic (ar-MA) and in highly localized, subjective everyday scenarios in Japanese (ja-JP).
This paper describes our system for SemEval-2026 Task 7: Cross-Language Cultural Everyday Knowledge QA (Track 1). Cultural knowledge typically exhibits significant regional specificity and is deeply rooted in particular linguistic conventions, which poses severe challenges to general-purpose large language models (LLMs). We propose a retrieval-augmented generation (RAG) framework that uses text-embedding-v4 as the retrieval core to extract social knowledge and expression patterns from region-specific, large-scale multilingual cultural knowledge bases, and that drives the gpt-5.2-chat model to generate concise answers that are both factually sound and closely aligned with the target region’s cultural context. In the official evaluation, our system ranked first among all participating teams with a total score of 78.7672, demonstrating the method’s strong performance in cross-cultural accuracy and linguistic authenticity.
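The retrieval step of a RAG pipeline of this kind can be illustrated with a minimal sketch. It is not the authors' implementation: the toy three-dimensional vectors stand in for real embeddings (the abstracts name text-embedding-v4, whose API is not shown here), and the function names `cosine`, `retrieve`, and `build_prompt`, the passage texts, and the prompt layout are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_vec, knowledge_base, k=2):
    """Return the k passages whose vectors are most similar to the query.

    knowledge_base is a list of (passage, vector) pairs; in a real system
    the vectors would come from an embedding model (assumption).
    """
    ranked = sorted(
        knowledge_base,
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [passage for passage, _ in ranked[:k]]


def build_prompt(question, passages):
    """Assemble a zero-shot prompt: retrieved background plus the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Background:\n{context}\n\nQuestion: {question}\nAnswer:"


# Toy knowledge base with made-up vectors (illustrative only).
kb = [
    ("passage A", [1.0, 0.0, 0.0]),
    ("passage B", [0.0, 1.0, 0.0]),
    ("passage C", [0.7, 0.7, 0.0]),
]
top = retrieve([0.9, 0.1, 0.0], kb, k=2)
prompt = build_prompt("example question", top)
```

With these toy vectors the query is closest to passage A, then passage C, so only those two are placed in the prompt. The prompt contains no question-answer examples, which matches the zero-shot setting described above.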

2025

This paper describes our system used in the SemEval-2025 Task 11: Bridging the Gap in Text-Based Emotion Detection. To address the highly subjective nature of emotion detection tasks, we propose a model ensemble strategy designed to capture the varying subjective perceptions of different users towards textual content. The base models of this ensemble strategy consist of several large language models, which are then combined using methods such as neural networks, decision trees, linear regression, and weighted voting. In Track A, out of 28 languages, our system achieved first place in 19 languages. In Track B, out of 11 languages, our system ranked first in 10 languages. Furthermore, our system attained the highest average performance across all languages in both Track A and Track B.

2023

This paper describes our system for SemEval-2023 Task 12: Sentiment Analysis for Low-resource African Languages using Twitter Dataset (Muhammad et al., 2023c). The AfriSenti-SemEval Shared Task 12 is based on a collection of Twitter datasets in 14 African languages for sentiment classification. It consists of three sub-tasks. Task A is monolingual sentiment classification covering 12 African languages. Task B is multilingual sentiment classification that combines the training data from Task A (12 African languages). Task C is zero-shot sentiment classification. We utilized various strategies, including monolingual training, multilingual mixed training, and translation technology, and proposed a weighted voting method that combines the results of the different strategies. In the monolingual subtask, our system achieved Top-1 in two languages (Yoruba and Twi) and Top-2 in four languages (Nigerian Pidgin, Algerian Arabic, Swahili, and Multilingual). In the multilingual subtask, our system achieved Top-2 on the published leaderboard.
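As a rough illustration of weighted voting over the outputs of several strategies, the sketch below sums a weight per predicted label and picks the label with the highest total. The strategy names, the weights, and the tie-breaking rule are assumptions made for this example; the abstract does not specify how the weights were chosen.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Pick the label with the largest total weight.

    predictions: dict mapping strategy name -> predicted label
    weights:     dict mapping strategy name -> non-negative weight
    Ties are broken alphabetically by label (an arbitrary choice here).
    """
    scores = defaultdict(float)
    for strategy, label in predictions.items():
        scores[label] += weights[strategy]
    # Iterating over sorted labels makes tie-breaking deterministic.
    return max(sorted(scores), key=scores.get)


# Hypothetical per-strategy predictions for one tweet.
predictions = {
    "monolingual": "positive",
    "multilingual": "negative",
    "translation": "negative",
}
# Hypothetical weights, e.g. derived from validation scores.
weights = {"monolingual": 0.5, "multilingual": 0.3, "translation": 0.1}

label = weighted_vote(predictions, weights)
```

Here the single strongest strategy (weight 0.5) outvotes the two weaker ones (combined weight 0.4), which a plain majority vote would not allow.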

2021

Question answering from semi-structured tables can be seen as a semantic parsing task and is significant and practical for pushing the boundary of natural language understanding. Existing research mainly focuses on understanding content from unstructured evidence, e.g., news, natural language sentences, and documents. The task of verification from structured evidence, such as tables, charts, and databases, is still less explored. This paper describes the sattiy team’s system for SemEval-2021 Task 9: Statement Verification and Evidence Finding with Tables (SEM-TAB-FACT). This competition aims to verify statements and find evidence from tables in scientific articles, and to promote proper interpretation of the surrounding article. In this paper, we exploit ensembles of pre-trained language models over tables, TaPas and TaBERT, for Task A, and adjust the results based on extracted rules for Task B. On the final leaderboard, we attain F1 scores of 0.8496 and 0.7732 in Task A for the 2-way and 3-way evaluation, respectively, and an F1 score of 0.4856 in Task B.

2020

This paper describes the xsysigma team’s system for SemEval 2020 Task 7: Assessing the Funniness of Edited News Headlines. The goal of this task is to assess how the funniness of news headlines changes after minor editing. It is divided into two subtasks: Subtask 1 is a regression task to estimate the humor intensity of the sentence after editing, and Subtask 2 is a classification task to predict the funnier of two edited versions of an original headline. In this paper, we only report our implementation of Subtask 2. We first construct sentence pairs with different features as input to Enhancement Inference BERT (EI-BERT). We then apply a data augmentation strategy and a pseudo-label method. After that, we apply feature enhancement interaction on the encoding of each sentence for classification with EI-BERT. Finally, we apply a weighted fusion algorithm to the logits obtained from different pre-trained models. We achieve 64.5% accuracy in Subtask 2 and rank first and fifth on the dev and test datasets, respectively.
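One simple form of weighted logit fusion is a normalized weighted average of each model's logits, followed by an argmax. The sketch below shows that form only; the abstract does not give the exact fusion rule, and the logits, weights, and function names are invented for illustration.

```python
def fuse_logits(model_logits, weights):
    """Weighted average of per-model logits for a single example.

    model_logits: list of logit vectors, one per model, all the same length
    weights:      list of non-negative weights, one per model
    Weights are normalized to sum to 1 (an assumption of this sketch).
    """
    total = sum(weights)
    fused = [0.0] * len(model_logits[0])
    for logits, weight in zip(model_logits, weights):
        for i, value in enumerate(logits):
            fused[i] += (weight / total) * value
    return fused


def predict(fused):
    """Index of the class with the highest fused logit."""
    return max(range(len(fused)), key=fused.__getitem__)


# Hypothetical logits from three models for a two-class decision
# (class 0: first edited headline is funnier, class 1: second is funnier).
model_logits = [[2.0, 1.0], [0.5, 1.5], [0.2, 0.9]]
weights = [0.5, 0.3, 0.2]

fused = fuse_logits(model_logits, weights)
choice = predict(fused)
```

In this toy case two of the three models prefer class 1, but the most heavily weighted model prefers class 0 by a wide margin, so the fused logits select class 0.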