Jianyu Wang
2026
Detecting AI-Generated Content on Social Media with Multi-modal Language Models
Chenyang Yang | Shen Yan | Yibo Yang | Litao Hu | Yuchen Liu | Yuan Zeng | Hanchao Yu | Yinan Zhu | Sumedha Singla | Brian Vanover | Huijun Qian | Zihao Wang | Fujun Liu | Aashu Singh | Jianyu Wang | Xuewen Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)
Generative AI has enabled the creation of photorealistic images and videos that are increasingly disseminated on social media, where they are often used for spam, misinformation, manipulation, and fraud. Existing AI-generated content (AIGC) detection methods face challenges including poor generalization to new generation models, reliance on single modalities, and lack of interpretable explanations. We present a pipeline that mitigates these issues by continuously curating diverse multi-modal social media data and training a compact vision-language model for detection and explanation. Our model achieves state-of-the-art detection performance on public benchmarks and demonstrates robust detection and explanation capabilities on internal social media datasets across multiple platforms. We deployed our model for post recommendation on social media platforms and observed positive downstream impacts on user engagement, demonstrating that it is feasible to perform effective AIGC detection in dynamic, real-world social media environments.
2025
SeaLLMs 3: Open Foundation and Chat Multilingual Large Language Models for Southeast Asian Languages
Wenxuan Zhang | Hou Pong Chan | Yiran Zhao | Mahani Aljunied | Jianyu Wang | Chaoqun Liu | Yue Deng | Zhiqiang Hu | Weiwen Xu | Yew Ken Chia | Xin Li | Lidong Bing
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (System Demonstrations)
Large Language Models (LLMs) have shown remarkable abilities across various tasks, yet their development has predominantly centered on high-resource languages like English and Chinese, leaving low-resource languages underserved. To address this disparity, we present SeaLLMs 3, the latest iteration of the SeaLLMs model family, tailored for Southeast Asian languages. This region, characterized by its rich linguistic diversity, has lacked adequate language technology support. SeaLLMs 3 aims to bridge this gap by covering a comprehensive range of languages spoken in this region, including English, Chinese, Indonesian, Vietnamese, Thai, Tagalog, Malay, Burmese, Khmer, Lao, Tamil, and Javanese. Leveraging efficient language enhancement techniques and a specially constructed instruction tuning dataset, SeaLLMs 3 significantly reduces training costs while maintaining high performance and versatility. Our model excels in tasks such as world knowledge, mathematical reasoning, translation, and instruction following, achieving state-of-the-art performance among similarly sized models. Additionally, we prioritized safety and reliability by addressing both general and culture-specific considerations and incorporated mechanisms to reduce hallucinations. This work underscores the importance of inclusive AI, showing that advanced LLM capabilities can benefit underserved linguistic and cultural communities.
2024
CoGenesis: A Framework Collaborating Large and Small Language Models for Secure Context-Aware Instruction Following
Kaiyan Zhang | Jianyu Wang | Ermo Hua | Biqing Qi | Ning Ding | Bowen Zhou
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the advancement of language models (LMs), their exposure to private data is increasingly inevitable, and their deployment (especially for smaller ones) on personal devices, such as PCs and smartphones, has become a prevailing trend. In contexts laden with user information, enabling models to both safeguard user privacy and execute commands efficiently emerges as an essential research imperative. In this paper, we propose CoGenesis, a collaborative generation framework integrating large (hosted on cloud infrastructure) and small models (deployed on local devices) to address privacy concerns logically. Initially, we design a pipeline to create personalized writing instruction datasets enriched with extensive context details as the testbed of this research issue. Subsequently, we introduce two variants of CoGenesis based on sketch and logits respectively. Our experimental findings, based on our synthesized dataset and two additional open-source datasets, indicate that: 1) Large-scale models perform well when provided with user context but struggle in the absence of such context. 2) While specialized smaller models fine-tuned on the synthetic dataset show promise, they still lag behind their larger counterparts. 3) Our CoGenesis framework, utilizing mixed-scale models, showcases competitive performance, providing a feasible solution to privacy issues.
SeaLLMs - Large Language Models for Southeast Asia
Xuan-Phi Nguyen | Wenxuan Zhang | Xin Li | Mahani Aljunied | Zhiqiang Hu | Chenhui Shen | Yew Ken Chia | Xingxuan Li | Jianyu Wang | Qingyu Tan | Liying Cheng | Guanzheng Chen | Yue Deng | Sen Yang | Chaoqun Liu | Hang Zhang | Lidong Bing
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations)
Despite the remarkable achievements of large language models (LLMs) in various tasks, there remains a linguistic bias that favors high-resource languages, such as English, often at the expense of low-resource and regional languages. To address this imbalance, we introduce SeaLLMs, an innovative series of language models that specifically focuses on Southeast Asian (SEA) languages. SeaLLMs are built upon popular English-centric models through continued pre-training with an extended vocabulary, specialized instruction and alignment tuning to better capture the intricacies of regional languages. This allows them to respect and reflect local cultural norms, customs, stylistic preferences, and legal considerations. Our comprehensive evaluation demonstrates that SeaLLM models exhibit superior performance across a wide spectrum of linguistic tasks and assistant-style instruction-following capabilities relative to comparable open-source models. Moreover, they outperform ChatGPT-3.5 in non-Latin languages, such as Thai, Khmer, Lao, and Burmese, by large margins while remaining lightweight and cost-effective to operate.
2022
RLPrompt: Optimizing Discrete Text Prompts with Reinforcement Learning
Mingkai Deng | Jianyu Wang | Cheng-Ping Hsieh | Yihan Wang | Han Guo | Tianmin Shu | Meng Song | Eric Xing | Zhiting Hu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Prompting has shown impressive success in enabling large pre-trained language models (LMs) to perform diverse NLP tasks, especially when only a small amount of downstream data is available. Automatically finding the optimal prompt for each task, however, is challenging. Most existing work resorts to tuning *soft* prompts (e.g., embeddings), which fall short in interpretability, reusability across LMs, and applicability when gradients are not accessible. *Discrete* prompts, on the other hand, are difficult to optimize, and are often created by “enumeration (e.g., paraphrasing)-then-selection” heuristics that do not explore the prompt space systematically. This paper proposes RLPrompt, an efficient discrete prompt optimization approach with reinforcement learning (RL). RLPrompt formulates a parameter-efficient policy network that generates the optimized discrete prompt after training with reward. To harness the complex and stochastic reward signals from the large LM environment, we incorporate effective reward stabilization that substantially enhances training efficiency. RLPrompt is flexibly applicable to different types of LMs, such as masked (e.g., BERT) and left-to-right models (e.g., GPTs), for both classification and generation tasks. Experiments on few-shot classification and unsupervised text style transfer show superior performance over a wide range of existing fine-tuning or prompting methods. Interestingly, the resulting optimized prompts are often ungrammatical gibberish text, and, surprisingly, those gibberish prompts are transferable between different LMs while retaining significant performance, indicating that LM prompting may not follow human language patterns.
2019
Compositional Generalization for Primitive Substitutions
Yuanpeng Li | Liang Zhao | Jianyu Wang | Joel Hestness
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Compositional generalization is a basic mechanism in human language learning, but current neural networks lack this ability. In this paper, we conduct fundamental research for encoding compositionality in neural networks. Conventional methods use a single representation for the input sentence, making it hard to apply prior knowledge of compositionality. In contrast, our approach leverages such knowledge with two representations, one generating attention maps, and the other mapping attended input words to output symbols. We reduce the entropy in each representation to improve generalization. Our experiments demonstrate significant improvements over the conventional methods in five NLP tasks including instruction learning and machine translation. In the SCAN domain, our approach boosts accuracy from 14.0% to 98.8% on the Jump task, and from 92.0% to 99.7% on the TurnLeft task. It also beats human performance on a few-shot learning task. We hope the proposed approach can help ease future research towards human-level compositional language learning.
Co-authors
- Mahani Aljunied 2
- Lidong Bing 2
- Yew Ken Chia 2
- Yue Deng 2
- Zhiqiang Hu 2
- Xin Li 2
- Chaoqun Liu 2
- Wenxuan Zhang 2
- Hou Pong Chan 1
- Guanzheng Chen 1
- Liying Cheng 1
- Mingkai Deng 1
- Ning Ding 1
- Han Guo 1
- Joel Hestness 1
- Cheng-Ping Hsieh 1
- Litao Hu 1
- Zhiting Hu 1
- Ermo Hua 1
- Xingxuan Li 1
- Yuanpeng Li 1
- Fujun Liu 1
- Yuchen Liu (刘雨辰) 1
- Xuan-Phi Nguyen 1
- Biqing Qi 1
- Huijun Qian 1
- Chenhui Shen 1
- Tianmin Shu 1
- Aashu Singh 1
- Sumedha Singla 1
- Meng Song 1
- Qingyu Tan 1
- Brian Vanover 1
- Yihan Wang 1
- Zihao Wang 1
- Eric Xing 1
- Weiwen Xu 1
- Shen Yan 1
- Chenyang Yang 1
- Sen Yang 1
- Yibo Yang 1
- Hanchao Yu 1
- Yuan Zeng 1
- Hang Zhang 1
- Kaiyan Zhang 1
- Xuewen Zhang 1
- Liang Zhao (赵亮) 1
- Yiran Zhao 1
- Bowen Zhou 1
- Yinan Zhu 1