Shikun Feng
2026
MoE Adapter for Large Audio Language Models: Sparsity, Disentanglement, and Gradient-Conflict-Free
Yishu Lei | Shuwei He | Hu Jing | Dan Zhang | Xianlong Luo | Danxiang Zhu | Shikun Feng | Rui Liu | Jingzhou HE | Yu Sun | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2026
Extending the input modality of Large Language Models (LLMs) to the audio domain is essential for achieving comprehensive multimodal perception. However, it is well-known that acoustic information is intrinsically heterogeneous, entangling attributes such as speech, music, and environmental context. Existing research is limited to a dense, parameter-shared adapter to model these diverse patterns, which induces gradient conflict during optimization, as parameter updates required for distinct attributes contradict each other. To address this limitation, we introduce the MoE-Adapter, a sparse Mixture-of-Experts (MoE) architecture designed to decouple acoustic information. Specifically, it employs a dynamic gating mechanism that routes audio tokens to specialized experts capturing complementary feature subspaces while retaining shared experts for global context, thereby mitigating gradient conflicts and enabling fine-grained feature learning. Comprehensive experiments show that the MoE-Adapter achieves superior performance on both audio semantic and paralinguistic tasks, consistently outperforming dense linear baselines with comparable computational costs. To facilitate future research, our code is publicly available at https://github.com/Alittleegg/Eureka-Audio.
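The routing idea in this abstract (a gate sends each audio token to a few specialized experts, while a shared expert is always applied) can be illustrated with a minimal sketch. This is not the authors' implementation: the class name `MoEAdapterSketch`, the use of plain linear maps as experts, the softmax gate, the top-k renormalization, and all sizes are assumptions chosen only for illustration.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def softmax(z):
    """Numerically stable softmax over a list of scores."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def make_linear(rng, d_out, d_in):
    """Random d_out x d_in weight matrix (stand-in for learned weights)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(d_in)] for _ in range(d_out)]


class MoEAdapterSketch:
    """Hypothetical sparse adapter: top-k routed experts plus one shared expert."""

    def __init__(self, d_in, d_out, n_routed=4, top_k=2, seed=0):
        rng = random.Random(seed)
        self.top_k = top_k
        self.gate = make_linear(rng, n_routed, d_in)
        self.routed = [make_linear(rng, d_out, d_in) for _ in range(n_routed)]
        self.shared = make_linear(rng, d_out, d_in)

    def route(self, x):
        """Return {expert_index: weight} for the top-k experts of one token."""
        probs = softmax(matvec(self.gate, x))
        top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
        top = top[: self.top_k]
        norm = sum(probs[i] for i in top)
        return {i: probs[i] / norm for i in top}  # renormalize over selected experts

    def forward(self, x):
        """Shared expert output plus the weighted outputs of the routed experts."""
        out = matvec(self.shared, x)
        for i, w in self.route(x).items():
            y = matvec(self.routed[i], x)
            out = [o + w * v for o, v in zip(out, y)]
        return out


adapter = MoEAdapterSketch(d_in=4, d_out=3)
token = [1.0, 0.0, -1.0, 0.5]
weights = adapter.route(token)
projected = adapter.forward(token)
```

Because only `top_k` routed experts run for each token, the per-token compute stays close to that of a single dense projection even as the number of experts grows, which is the usual motivation for sparse routing.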
CORD: Bridging the Audio–Text Reasoning Gap via Weighted On-policy Cross-modal Distillation
Hu Jing | Danxiang Zhu | Xianlong Luo | Dan Zhang | Shuwei He | Yishu Lei | Shikun Feng | Hai-Tao Zheng | Jingzhou HE | Yu Sun | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2026
Large Audio Language Models (LALMs) have garnered significant research interest. Despite being built upon text-based large language models (LLMs), LALMs frequently exhibit a degradation in knowledge and reasoning capabilities. We hypothesize that this limitation stems from the failure of current training paradigms to effectively bridge the acoustic-semantic gap within the feature representation space. To address this challenge, we propose CORD, a unified alignment framework that performs online cross-modal self-distillation. Specifically, it aligns audio-conditioned reasoning with its text-conditioned counterpart within a unified model. Leveraging the text modality as an internal teacher, CORD performs multi-granularity alignment throughout the audio rollout process. At the token level, it employs on-policy reverse KL divergence with importance-aware weighting to prioritize early and semantically critical tokens. At the sequence level, CORD introduces a judge-based global reward to optimize complete reasoning trajectories via Group Relative Policy Optimization (GRPO). Empirical results across multiple benchmarks demonstrate that CORD consistently enhances audio-conditioned reasoning and substantially bridges the audio–text performance gap with only 80k synthetic training samples, validating the efficacy and data efficiency of our on-policy, multi-level cross-modal alignment approach.
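The two alignment levels named in this abstract can be sketched in a few lines. The abstract does not give the exact weighting scheme or reward, so everything specific here is an assumption: the exponential position decay standing in for "importance-aware weighting", the function names, and the use of the standard group-normalized advantage commonly associated with GRPO. The audio-conditioned distribution plays the student and the text-conditioned distribution plays the teacher.

```python
import math


def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for two discrete distributions over the same vocabulary."""
    return sum(
        p * math.log((p + eps) / (q + eps))
        for p, q in zip(student, teacher)
        if p > 0
    )


def weighted_token_loss(student_dists, teacher_dists, decay=0.9):
    """Position-weighted reverse KL; earlier tokens get larger weights (assumed scheme)."""
    weights = [decay ** t for t in range(len(student_dists))]
    total = sum(
        w * reverse_kl(s, q)
        for w, s, q in zip(weights, student_dists, teacher_dists)
    )
    return total / sum(weights)


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize sequence-level rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]


uniform = [0.5, 0.5]
peaked = [0.9, 0.1]
early_mismatch = weighted_token_loss([peaked, uniform], [uniform, uniform])
late_mismatch = weighted_token_loss([uniform, peaked], [uniform, uniform])
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])
```

With the assumed decay, the same disagreement costs more when it occurs at the first token than at the second, which mirrors the stated goal of prioritizing early tokens.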
2022
ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding
Qiming Peng | Yinxu Pan | Wenjin Wang | Bin Luo | Zhenyu Zhang | Zhengjie Huang | Yuhui Cao | Weichong Yin | Yongfeng Chen | Yin Zhang | Shikun Feng | Yu Sun | Hao Tian | Hua Wu | Haifeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2022
Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at PaddleNLP.
2021
Alpha at SemEval-2021 Task 6: Transformer Based Propaganda Classification
Zhida Feng | Jiji Tang | Jiaxiang Liu | Weichong Yin | Shikun Feng | Yu Sun | Li Chen
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
This paper describes our system that participated in Task 6 of SemEval-2021. The task focuses on multimodal propaganda technique classification and aims to classify a given image and text into 22 classes. We propose a transformer-based architecture to fuse the clues from both image and text. We explore two branches of techniques: fine-tuning a text-pretrained transformer with extended visual features, and fine-tuning multimodal pretrained transformers. For the visual features, we tested both grid features based on ResNet and salient region features from a pretrained object detector. Among the pretrained multimodal transformers, we choose ERNIE-ViL, a two-stream cross-attended transformer pretrained on large-scale image-caption aligned data. Fine-tuning ERNIE-ViL for our task produces better performance owing to the general joint multimodal representation for text and image learned by ERNIE-ViL. In addition, because the distribution of the classification labels is highly unbalanced, we also experiment with the loss function, and the results show that focal loss performs better than cross-entropy loss. Finally, we won first place for subtask C in the final competition.
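The comparison between focal loss and cross-entropy mentioned above can be made concrete with a small sketch. It uses the standard single-label focal loss form, FL = -(1 - p_t)^gamma * log(p_t); the abstract does not state which variant or which gamma the system used, so the value gamma = 2.0 and the function names are assumptions.

```python
import math


def cross_entropy(probs, target):
    """Negative log-likelihood of the target class."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0):
    """Cross-entropy scaled by (1 - p_t)^gamma, which down-weights easy examples."""
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


easy = [0.9, 0.05, 0.05]   # confident, correct prediction for class 0
hard = [0.2, 0.5, 0.3]     # class 0 is the target but receives low probability

easy_ratio = focal_loss(easy, 0) / cross_entropy(easy, 0)
hard_ratio = focal_loss(hard, 0) / cross_entropy(hard, 0)
```

The ratio of focal loss to cross-entropy is much smaller for the easy example than for the hard one, so well-classified samples from frequent classes contribute less to the gradient. That is the usual argument for focal loss under label imbalance.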
abcbpc at SemEval-2021 Task 7: ERNIE-based Multi-task Model for Detecting and Rating Humor and Offense
Chao Pang | Xiaoran Fan | Weiyue Su | Xuyi Chen | Shuohuan Wang | Jiaxiang Liu | Xuan Ouyang | Shikun Feng | Yu Sun
Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021)
This paper describes our system that participated in Task 7 of SemEval-2021: Detecting and Rating Humor and Offense. The task is designed to detect and score humor and offense, which are influenced by subjective factors. To obtain semantic information from a large amount of unlabeled data, we applied unsupervised pre-trained language models. Through research and experiments, we found that the ERNIE 2.0 and DeBERTa pre-trained models achieved impressive performance in various subtasks. Therefore, we fine-tuned these pre-trained models on the downstream tasks. During fine-tuning, we adopted a multi-task training strategy and an ensemble learning method. With this strategy and method, we achieved an RMSE of 0.4959 for subtask 1b, and finally won first place.
2020
PGL at TextGraphs 2020 Shared Task: Explanation Regeneration using Language and Graph Learning Methods
Weibin Li | Yuxiang Lu | Zhengjie Huang | Weiyue Su | Jiaxiang Liu | Shikun Feng | Yu Sun
Proceedings of the Graph-based Methods for Natural Language Processing (TextGraphs)
This paper describes the system designed by the Baidu PGL Team, which achieved first place in the TextGraphs 2020 Shared Task. The task focuses on generating explanations for elementary science questions. Given a question and its corresponding correct answer, we are asked to select, from a large knowledge base, the facts that explain why the answer is correct. To address this problem, we use a pre-trained language model to recall the top-K relevant explanations for each question. Then, we adopt a re-ranking approach based on a pre-trained language model to rank the candidate explanations. To further improve the rankings, we also develop an architecture consisting of both powerful pre-trained transformers and GNNs to tackle the multi-hop inference problem. The official evaluation shows that our system outperforms the second-best system by 1.91 points.
ERNIE at SemEval-2020 Task 10: Learning Word Emphasis Selection by Pre-trained Language Model
Zhengjie Huang | Shikun Feng | Weiyue Su | Xuyi Chen | Shuohuan Wang | Jiaxiang Liu | Xuan Ouyang | Yu Sun
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper describes the system designed by the ERNIE Team, which achieved first place in SemEval-2020 Task 10: Emphasis Selection For Written Text in Visual Media. Given a sentence, we are asked to find the most important words as suggestions for automated design. We leverage unsupervised pre-trained models and fine-tune them on our task. In our investigation, we found that the following models achieved excellent performance on this task: ERNIE 2.0, XLM-ROBERTA, ROBERTA and ALBERT. To fine-tune our models, we combine a pointwise regression loss with a pairwise ranking loss, which is closer to the final Match m metric. We also find that additional feature engineering and data augmentation help improve performance. Our best model achieves the highest score of 0.823 and ranks first on all metrics.
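One way to combine a pointwise regression loss with a pairwise ranking loss, as the abstract describes, is sketched below. The abstract does not specify the exact losses or their weighting, so the choices here are assumptions: mean squared error for the pointwise term, a logistic loss over ordered word pairs for the ranking term, and a hypothetical mixing weight `alpha`.

```python
import math


def pointwise_mse(scores, labels):
    """Mean squared error between predicted and gold emphasis scores."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_ranking(scores, labels):
    """Logistic loss over every pair (i, j) where word i should outrank word j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                losses.append(math.log1p(math.exp(-(scores[i] - scores[j]))))
    return sum(losses) / len(losses) if losses else 0.0


def combined_loss(scores, labels, alpha=0.5):
    """Weighted sum of the pointwise and pairwise terms (alpha is an assumption)."""
    return alpha * pointwise_mse(scores, labels) + (1 - alpha) * pairwise_ranking(scores, labels)


gold = [0.9, 0.1, 0.5]            # gold emphasis probability per word
well_ordered = [0.8, 0.2, 0.5]    # predictions that preserve the gold ranking
mis_ordered = [0.2, 0.8, 0.5]     # predictions that swap the top and bottom words

loss_good = combined_loss(well_ordered, gold)
loss_bad = combined_loss(mis_ordered, gold)
```

The pairwise term depends only on the relative order of the predicted scores, which is why it tracks a top-m matching metric more closely than a purely pointwise objective does.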
Kk2018 at SemEval-2020 Task 9: Adversarial Training for Code-Mixing Sentiment Classification
Jiaxiang Liu | Xuyi Chen | Shikun Feng | Shuohuan Wang | Xuan Ouyang | Yu Sun | Zhengjie Huang | Weiyue Su
Proceedings of the Fourteenth Workshop on Semantic Evaluation
Code switching is a linguistic phenomenon that may occur in a multilingual setting where speakers share more than one language. With increasing communication between groups that speak different languages, this phenomenon is becoming more and more common. However, there is little research and data in this area, especially for code-mixing sentiment classification. In this work, domain transfer learning from the state-of-the-art uni-language model ERNIE is tested on the code-mixing dataset, and, surprisingly, a strong baseline is achieved. Furthermore, adversarial training with a multi-lingual model is used to achieve 1st place in the SemEval-2020 Task 9 Hindi-English sentiment classification competition.
Co-authors
- Yu Sun 8
- Jiaxiang Liu 5
- Zhengjie Huang 4
- Weiyue Su 4
- Xuyi Chen 3
- Xuan Ouyang 3
- Haifeng Wang 3
- Shuohuan Wang 3
- Hua Wu (吴华) 3
- Jingzhou He 2
- Shuwei He 2
- Hu Jing 2
- Yishu Lei 2
- Xianlong Luo 2
- Weichong Yin 2
- Dan Zhang 2
- Danxiang Zhu 2
- Yuhui Cao 1
- Li Chen 1
- Yongfeng Chen 1
- Xiaoran Fan 1
- Zhida Feng 1
- Weibin Li 1
- Rui Liu 1
- Yuxiang Lu 1
- Bin Luo 1
- Yinxu Pan 1
- Chao Pang 1
- Qiming Peng 1
- Jiji Tang 1
- Hao Tian 1
- Wenjin Wang 1
- Yin Zhang 1
- Zhenyu Zhang 1
- Hai-Tao Zheng 1