Xin Tao

2025

Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities by integrating visual and textual inputs, yet modality alignment remains one of their most challenging aspects. Current MLLMs typically rely on simple adapter architectures and pretraining approaches to bridge vision encoders with large language models (LLMs), guided by image-level supervision. We find that this paradigm often leads to suboptimal alignment between modalities, significantly constraining the LLM’s ability to properly interpret and reason with visual features, particularly for smaller language models. To address this fundamental limitation, we propose Supervised Embedding Alignment (SEA), a token-level supervision alignment method that enables more precise visual-text alignment during pretraining. SEA introduces minimal computational overhead while preserving language capabilities and substantially improving cross-modal understanding. Our comprehensive analyses reveal critical insights into the adapter’s role in multimodal integration, and extensive experiments demonstrate that SEA consistently improves performance across various model sizes, with smaller models benefiting the most (average performance gain of 7.61% for Gemma-2B). This work establishes a foundation for developing more effective alignment strategies for future multimodal systems.

2021

ZYJ123@DravidianLangTech-EACL2021: Offensive Language Identification based on XLM-RoBERTa with DPCNN
Yingjia Zhao | Xin Tao
Proceedings of the First Workshop on Speech and Language Technologies for Dravidian Languages

The development of online media platforms has given users more opportunities to post and comment freely, but the negative impact of offensive language has become increasingly apparent. An automatic system for identifying offensive language is therefore needed. This paper describes our work on the task of Offensive Language Identification in Dravidian Languages at EACL 2021. To complete this task, we propose a system based on the multilingual model XLM-RoBERTa and DPCNN. The results on the official test set confirm the effectiveness of our system. The weighted average F1-scores for Kannada, Malayalam, and Tamil are 0.69, 0.92, and 0.76, respectively, ranking 6th, 6th, and 3rd.

ZYJ@LT-EDI-EACL2021:XLM-RoBERTa-Based Model with Attention for Hope Speech Detection
Yingjia Zhao | Xin Tao
Proceedings of the First Workshop on Language Technology for Equality, Diversity and Inclusion

Due to the development of modern computer technology and the growing number of online media users, posts and comments of all kinds appear everywhere on the internet. Hope speech can not only inspire creators but also bring pleasure to other viewers. It is therefore necessary to detect hope speech effectively and automatically. This paper describes our team's approach to the task of hope speech detection. We use an attention mechanism to adjust the weights of all the output layers of XLM-RoBERTa, so as to make full use of the information extracted from each layer, and use the weighted sum of all the output layers to complete the classification task. We also use the Stratified-K-Fold method to enhance the training data set. We achieve weighted average F1-scores of 0.59, 0.84, and 0.92 for Tamil, Malayalam, and English, respectively, ranking 3rd, 2nd, and 2nd.
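
The weighted sum over output layers described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the attention scores are computed, so the sketch assumes one learnable scalar score per layer, normalised with a softmax. The function names `softmax` and `weighted_layer_sum` are hypothetical, and plain Python lists stand in for the per-layer feature vectors.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_sum(layer_vectors, layer_scores):
    """Combine one feature vector per layer into a single vector.

    layer_vectors: list of equal-length lists, one per encoder layer.
    layer_scores:  one scalar score per layer (assumed learnable;
                   this is an assumption, not stated in the abstract).
    """
    if len(layer_vectors) != len(layer_scores):
        raise ValueError("need exactly one score per layer")
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [
        sum(w * vec[i] for w, vec in zip(weights, layer_vectors))
        for i in range(dim)
    ]


# Three toy "layers" with two features each.
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = weighted_layer_sum(layers, [0.0, 0.0, 0.0])
```

With equal scores the result is the plain average of the layers; raising one layer's score shifts the combined vector towards that layer, which is what lets a classifier favour the most informative layers.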

ZYJ at SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense with ALBERT-Based Model
Yingjia Zhao | Xin Tao
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)

This article introduces our submissions to subtask 1 and subtask 2 of SemEval-2021 Task 7: HaHackathon: Detecting and Rating Humor and Offense. We use a model that employs ALBERT as the module for extracting text features. We modify the upper-layer structure by adding specific networks to better summarize the semantic information. Finally, our system achieves an F-score of 0.9348 in subtask 1a, an RMSE of 0.7214 in subtask 1b, an F-score of 0.4603 in subtask 1c, and an RMSE of 0.5204 in subtask 2.

2020

YNUtaoxin at SemEval-2020 Task 11: Identification Fragments of Propaganda Technique by Neural Sequence Labeling Models with Different Tagging Schemes and Pre-trained Language Model
Xin Tao | Xiaobing Zhou
Proceedings of the Fourteenth Workshop on Semantic Evaluation

We participated only in the first subtask, using a neural sequence model to perform the sequence tagging task. We investigated the effects of different tagging schemes on model performance. BERT, which performs very well on NLP tasks, was used as a feature extractor.
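
To make the idea of different tagging schemes concrete, here is a small sketch that converts token-level spans into tag sequences. The abstract does not name the schemes that were compared, so BIO and BIOES are assumed here as two common examples. The function `spans_to_tags` and its arguments are hypothetical.

```python
def spans_to_tags(num_tokens, spans, scheme="BIO"):
    """Label tokens given spans of (start, end) token indices, end exclusive.

    BIO:   B = first token of a span, I = any later token, O = outside.
    BIOES: adds E = last token of a span and S = single-token span.
    """
    if scheme not in ("BIO", "BIOES"):
        raise ValueError(f"unknown scheme: {scheme}")
    tags = ["O"] * num_tokens
    for start, end in spans:
        if not 0 <= start < end <= num_tokens:
            raise ValueError(f"invalid span: ({start}, {end})")
        for i in range(start, end):
            if scheme == "BIOES" and end - start == 1:
                tags[i] = "S"
            elif i == start:
                tags[i] = "B"
            elif scheme == "BIOES" and i == end - 1:
                tags[i] = "E"
            else:
                tags[i] = "I"
    return tags


# Six tokens with one three-token fragment and one single-token fragment.
bio = spans_to_tags(6, [(1, 4), (5, 6)], scheme="BIO")
bioes = spans_to_tags(6, [(1, 4), (5, 6)], scheme="BIOES")
```

Both sequences encode the same fragments. BIOES marks span ends and single-token spans explicitly, which gives the tagger more boundary information at the cost of a larger label set.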
