Congfeng Cao

2026

Fusion Training for Mathematical Generalization in Large Language Models
Congfeng Cao | Pengyu Zhang | Jelke Bloem
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)

Thinking Mode Fusion (TMF) enables large language models to support both concise responses and long-form reasoning by unifying a non-thinking mode and a thinking mode within a single model. However, its training dynamics, including the data ratio and training schedule between the two modes, remain underexplored. In this work, we present a systematic study of TMF by analyzing the effects of the training schedule and data ratio between thinking and non-thinking modes. Focusing on mathematical problem solving, we construct a benchmark with multiple thinking-to-non-thinking data ratios and three training schedules. Our results reveal an asymmetric interaction between the two modes: increasing the ratio of non-thinking supervision reduces the accuracy of the thinking mode. We further show that different training schedules modulate this trade-off and that the optimal schedule depends on the data ratio. Finally, we quantify a negative correlation between non-thinking and thinking mode supervision, highlighting an inherent tension between these two modes. These findings provide practical guidance for designing effective TMF training settings. All code and data are released to support further research at: Fusion Bench.

2025

How Aligned Are Unimodal Language and Graph Encodings of Chemical Molecules?
Congfeng Cao | Zhi Zhang | Jelke Bloem | Khalil Sima’an
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics

Chemical molecules can be represented as graphs or as language descriptions. Training unimodal models on graphs results in different encodings than training them on language. Therefore, the existing literature force-aligns the unimodal models during training to use them in downstream applications such as drug discovery. But to what extent are graph and language unimodal model representations inherently aligned, i.e., aligned prior to any force-alignment training? Knowing this is useful for more expedient and effective force-alignment. For the first time, we explore methods to gauge the alignment of graph and language unimodal models. We find compelling differences between models and their ability to represent slight structural differences without force-alignment. We also present a unified unimodal alignment (U2A) benchmark for gauging the inherent alignment between graph and language encoders, which we make available with this paper.

NeuroAda: Activating Each Neuron’s Potential for Parameter-Efficient Fine-Tuning
Zhi Zhang | Yixian Shen | Congfeng Cao | Ekaterina Shutova
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Existing parameter-efficient fine-tuning (PEFT) methods primarily fall into two categories: addition-based and selective in-situ adaptation. The former, such as LoRA, introduce additional modules to adapt the model to downstream tasks, offering strong memory efficiency. However, their representational capacity is often limited, making them less suitable for fine-grained adaptation. In contrast, the latter directly fine-tunes a carefully chosen subset of the original model parameters, allowing for more precise and effective adaptation, but at the cost of significantly increased memory consumption. To reconcile this trade-off, we propose NeuroAda, a novel PEFT method that enables fine-grained model fine-tuning while maintaining high memory efficiency. Our approach first identifies important parameters (i.e., connections within the network) as in selective adaptation, and then introduces bypass connections for these selected parameters. During fine-tuning, only the bypass connections are updated, leaving the original model parameters frozen. Empirical results on 23+ tasks spanning both natural language generation and understanding demonstrate that NeuroAda achieves state-of-the-art performance with ≤ 0.02% trainable parameters, while reducing CUDA memory usage by up to 60%. We release our code here: https://github.com/FightingFighting/NeuroAda.git.
