2025
Finite State Automata Inside Transformers with Chain-of-Thought: A Mechanistic Study on State Tracking
Yifan Zhang | Wenyu Du | Dongming Jin | Jie Fu | Zhi Jin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chain-of-thought (CoT) significantly enhances the performance of large language models (LLMs) across a wide range of tasks, and prior research shows that CoT can theoretically increase expressiveness. However, there is limited mechanistic understanding of the algorithms that Transformer+CoT can learn. Our key contributions are: (1) We evaluate the state tracking capabilities of Transformer+CoT and its variants, confirming the effectiveness of CoT. (2) We identify the circuit (a subset of model components responsible for tracking the world state), indicating that late-layer MLP neurons play a key role. We propose two metrics, compression and distinction, and show that the neuron sets for each state achieve nearly 100% accuracy, providing evidence of an implicit finite state automaton (FSA) embedded within the model. (3) Additionally, we explore three challenging settings: skipping intermediate steps, introducing data noise, and testing length generalization. Our results demonstrate that Transformer+CoT learns robust algorithms (FSAs), highlighting its resilience in challenging scenarios. Our code is available at https://github.com/IvanChangPKU/FSA.
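To make the state-tracking setting concrete, here is a hypothetical sketch (an assumed toy FSA, alphabet, and field names; not the released code at the repository above) of how such data with chain-of-thought state traces might be generated:

```python
# Hypothetical sketch (not the authors' code): generate a synthetic
# state-tracking dataset where each input is a sequence of FSA transitions
# and the chain-of-thought spells out the intermediate states.
import random

# A toy deterministic FSA: states and a transition table (assumed for illustration).
STATES = ["s0", "s1", "s2"]
TRANSITIONS = {
    ("s0", "a"): "s1", ("s0", "b"): "s0",
    ("s1", "a"): "s2", ("s1", "b"): "s0",
    ("s2", "a"): "s2", ("s2", "b"): "s1",
}

def make_example(length: int, seed: int = 0) -> dict:
    """Build one example: the input actions, a CoT trace of intermediate
    states, and the final state as the answer."""
    rng = random.Random(seed)
    actions = [rng.choice("ab") for _ in range(length)]
    state, trace = "s0", []
    for act in actions:
        state = TRANSITIONS[(state, act)]
        trace.append(state)          # one CoT step per transition
    return {
        "prompt": " ".join(actions),
        "cot": " ".join(trace),      # intermediate world states
        "answer": state,             # final state to predict
    }

if __name__ == "__main__":
    print(make_example(length=6, seed=42))
```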
Fine-Grained Manipulation of Arithmetic Neurons
Wenyu Du | Rui Zheng | Tongxu Luo | Stephen Chung | Jie Fu
Proceedings of the 8th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
It is a longstanding challenge to understand how neural models perform mathematical reasoning. Recent mechanistic interpretability work indicates that large language models (LLMs) use a “bag of heuristics” in middle to late-layer MLP neurons for arithmetic, where each heuristic promotes logits for specific numerical patterns. Building on this, we aim for fine-grained manipulation of these heuristic neurons to causally steer model predictions towards specific arithmetic outcomes, moving beyond simply disrupting accuracy. This paper presents a methodology that enables the systematic identification and causal manipulation of heuristic neurons, which is applied to the addition task in this study. We train a linear classifier to predict heuristics based on activation values, achieving over 90% classification accuracy. The trained classifier also allows us to rank neurons by their importance to a given heuristic. By targeting a small set of top-ranked neurons (K=50), we demonstrate high success rates—over 80% for the ones place and nearly 70% for the tens place—in controlling addition outcomes. This manipulation is achieved by transforming the activation of identified neurons into specific target heuristics by zeroing out source-heuristic neurons and adjusting target-heuristic neurons towards their class activation centroids. We explain these results by hypothesizing that high-ranking neurons possess ‘cleaner channels’ for their heuristics, supported by Signal-to-Noise Ratio (SNR) analysis where these neurons show higher SNR scores. Our work offers a robust approach to dissect, causally test, and precisely influence LLM arithmetic, advancing understanding of their internal mechanisms.
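As a rough illustration of the classify-then-edit recipe described above, the following hypothetical sketch (random stand-in activations and labels, assumed sizes; not the paper's code) trains a linear classifier on neuron activations, ranks neurons by weight magnitude, and edits an activation vector by zeroing source-heuristic neurons and moving target-heuristic neurons toward the target class centroid:

```python
# Hypothetical sketch with assumed shapes and names, not the paper's code.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_samples, n_neurons, n_classes = 2000, 512, 10   # assumed sizes
X = rng.normal(size=(n_samples, n_neurons))       # stand-in for MLP activations
y = rng.integers(0, n_classes, size=n_samples)    # stand-in heuristic labels

clf = LogisticRegression(max_iter=1000).fit(X, y)

def top_neurons(heuristic: int, k: int = 50) -> np.ndarray:
    """Rank neurons for one heuristic by |classifier weight|."""
    return np.argsort(-np.abs(clf.coef_[heuristic]))[:k]

def edit_activation(h: np.ndarray, source: int, target: int, k: int = 50) -> np.ndarray:
    """Zero the top source-heuristic neurons and pull the top target-heuristic
    neurons toward the target class's activation centroid."""
    centroid = X[y == target].mean(axis=0)
    edited = h.copy()
    edited[top_neurons(source, k)] = 0.0
    idx = top_neurons(target, k)
    edited[idx] = centroid[idx]
    return edited

example = edit_activation(X[0], source=int(y[0]), target=3)
print(example.shape)
```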
2024
Unlocking Continual Learning Abilities in Language Models
Wenyu Du | Shuang Cheng | Tongxu Luo | Zihan Qiu | Zeyu Huang | Ka Chun Cheung | Reynold Cheng | Jie Fu
Findings of the Association for Computational Linguistics: EMNLP 2024
2023
f-Divergence Minimization for Sequence-Level Knowledge Distillation
Yuqiao Wen | Zichao Li | Wenyu Du | Lili Mou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Knowledge distillation (KD) is the process of transferring knowledge from a large model to a small one. It has gained increasing attention in the natural language processing community, driven by the demands of compressing ever-growing language models. In this work, we propose the FDISTILL framework, which formulates sequence-level knowledge distillation as minimizing a generalized f-divergence function. We propose four distilling variants under our framework and show that existing SeqKD and ENGINE approaches are approximations of our FDISTILL methods. We further derive a step-wise decomposition for FDISTILL, reducing the intractable sequence-level divergence to word-level losses that can be computed tractably. Experiments across four datasets show that our methods outperform existing KD approaches, and that our symmetric distilling losses can better force the student to learn from the teacher distribution.
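To illustrate the word-level decomposition, here is a minimal sketch (toy distributions, not the FDISTILL implementation) of several f-divergence choices evaluated between teacher and student distributions at a single decoding step:

```python
# Minimal sketch: step-wise word-level divergences between a teacher
# distribution p and a student distribution q at one decoding step.
# Different choices of f-divergence give different distilling variants.
import numpy as np

def kl(p, q, eps=1e-12):
    """Forward KL(p || q): teacher-driven, mode-covering."""
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

def reverse_kl(p, q, eps=1e-12):
    """Reverse KL(q || p): student-driven, mode-seeking."""
    return kl(q, p, eps)

def js(p, q):
    """Jensen-Shannon divergence: a symmetric compromise."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def tvd(p, q):
    """Total variation distance: another symmetric choice."""
    return 0.5 * float(np.sum(np.abs(p - q)))

# Toy per-token distributions over a 5-word vocabulary (assumed numbers).
teacher = np.array([0.70, 0.15, 0.10, 0.04, 0.01])
student = np.array([0.40, 0.30, 0.15, 0.10, 0.05])
for name, fn in [("KL", kl), ("RKL", reverse_kl), ("JS", js), ("TVD", tvd)]:
    print(name, round(fn(teacher, student), 4))
```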
2021
End-to-End AMR Coreference Resolution
Qiankun Fu | Linfeng Song | Wenyu Du | Yue Zhang
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Although parsing to Abstract Meaning Representation (AMR) has become very popular and AMR has been shown effective on many sentence-level downstream tasks, little work has studied how to generate AMRs that represent multi-sentence information. We introduce the first end-to-end AMR coreference resolution model for building multi-sentence AMRs. Compared with previous pipeline and rule-based approaches, our model alleviates error propagation and is more robust in both in-domain and out-of-domain settings. In addition, the document-level AMRs produced by our model significantly improve over those generated by a rule-based method (Liu et al., 2015) on text summarization.
Linguistic Dependencies and Statistical Dependence
Jacob Louis Hoover | Wenyu Du | Alessandro Sordoni | Timothy J. O’Donnell
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Are pairs of words that tend to occur together also likely to stand in a linguistic dependency? This empirical question is motivated by a long history of literature in cognitive science, psycholinguistics, and NLP. In this work we contribute an extensive analysis of the relationship between linguistic dependencies and statistical dependence between words. Improving on previous work, we introduce the use of large pretrained language models to compute contextualized estimates of the pointwise mutual information between words (CPMI). For multiple models and languages, we extract dependency trees which maximize CPMI, and compare to gold standard linguistic dependencies. Overall, we find that CPMI dependencies achieve an unlabelled undirected attachment score of at most ≈ 0.5. While far above chance, and consistently above a non-contextualized PMI baseline, this score is generally comparable to a simple baseline formed by connecting adjacent words. We analyze which kinds of linguistic dependencies are best captured in CPMI dependencies, and also find marked differences between the estimates of the large pretrained language models, illustrating how their different training schemes affect the type of dependencies they capture.
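As a toy illustration of the extraction step, the sketch below (random stand-in CPMI scores and an assumed gold tree; not the paper's code) builds a maximum spanning tree from a pairwise score matrix and computes the unlabelled undirected attachment score (UUAS):

```python
# Illustrative sketch: given a symmetric matrix of pairwise CPMI scores for
# one sentence (here random stand-ins for scores that would come from a
# pretrained language model), extract the maximum spanning tree with Prim's
# algorithm and compare it to gold dependency edges via UUAS.
import numpy as np

def max_spanning_tree(scores: np.ndarray) -> set:
    """Prim's algorithm on an undirected complete graph; returns an edge set."""
    n = scores.shape[0]
    in_tree, edges = {0}, set()
    while len(in_tree) < n:
        best = max(
            ((i, j) for i in in_tree for j in range(n) if j not in in_tree),
            key=lambda e: scores[e],
        )
        edges.add(tuple(sorted(best)))
        in_tree.add(best[1])
    return edges

def uuas(pred: set, gold: set) -> float:
    return len(pred & gold) / len(gold)

rng = np.random.default_rng(0)
n = 6                                            # toy sentence length
m = rng.normal(size=(n, n))
m = (m + m.T) / 2                                # symmetric stand-in CPMI matrix
gold = {(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)}  # assumed gold tree edges
print(round(uuas(max_spanning_tree(m), gold), 3))
```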
2020
Exploiting Syntactic Structure for Better Language Modeling: A Syntactic Distance Approach
Wenyu Du | Zhouhan Lin | Yikang Shen | Timothy J. O’Donnell | Yoshua Bengio | Yue Zhang
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
It is commonly believed that knowledge of syntactic structure should improve language modeling. However, incorporating syntactic structure into neural language models both effectively and efficiently has remained challenging. In this paper, we use a multi-task objective: the model simultaneously predicts words and ground-truth parse trees encoded as “syntactic distances”, with the two objectives sharing the same intermediate representation. Experimental results on the Penn Treebank and Chinese Treebank datasets show that when ground-truth parse trees are provided as additional training signals, the model achieves lower perplexity and induces trees of better quality.
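A schematic sketch of the shared-representation multi-task setup (an assumed LSTM encoder, and a simple MSE regression on distances in place of the paper's exact distance objective):

```python
# Schematic sketch (assumed architecture, not the authors' model): a language
# model whose shared hidden states feed both a next-word prediction head and a
# head that regresses "syntactic distances", trained with a joint loss.
import torch
import torch.nn as nn

class SyntacticDistanceLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.rnn = nn.LSTM(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)   # predicts the next word
        self.dist_head = nn.Linear(d_model, 1)          # predicts a syntactic distance

    def forward(self, tokens):
        h, _ = self.rnn(self.embed(tokens))             # shared representation
        return self.lm_head(h), self.dist_head(h).squeeze(-1)

model = SyntacticDistanceLM()
tokens = torch.randint(0, 1000, (4, 12))                # toy batch
gold_next = torch.randint(0, 1000, (4, 12))             # stand-in next-word targets
gold_dist = torch.rand(4, 12)                           # stand-in gold distances

logits, dist = model(tokens)
lm_loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), gold_next.reshape(-1))
dist_loss = nn.functional.mse_loss(dist, gold_dist)
loss = lm_loss + 1.0 * dist_loss                        # weighted multi-task objective
loss.backward()
print(float(loss))
```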