Bo Fu

2025

ICLER: Intent CLassification with Enhanced Reasoning
Dezheng Gao | Dong Xiaozheng | Shuangtao Yang | Bo Fu
Findings of the Association for Computational Linguistics: EMNLP 2025

In recent years, intent classification based on In-Context Learning (ICL) has made significant progress. However, when applied to enterprise vertical domains, existing methods fall short in identifying fine-grained intents. Through data analysis, this study identifies two primary causes of errors: (1) retrieval of incorrect instances, often due to the limitations of embedding models in capturing subtle sentence-level information in business scenarios (such as entity-related or phenomenon-specific details); and (2) insufficient reasoning ability of Large Language Models (LLMs), which tend to rely on surface-level semantics while overlooking deeper semantic associations and business logic, leading to misclassification. To address these issues, we propose ICLER, an intent classification method with enhanced reasoning. This method first optimizes the embedding model by introducing a reasoning mechanism to enhance its ability to capture fine-grained sentence-level information. Then, this mechanism is incorporated into the ICL framework, maintaining computational efficiency while significantly enhancing intent recognition accuracy. Experimental results demonstrate that ICLER significantly outperforms the original ICL method in intent identification within vertical domains. Moreover, it yields accuracy improvements of 0.04% to 1.14% on general datasets, and its fine-tuned embedding model achieves an average performance gain of 5.56% on selected classification tasks in the MTEB benchmark.

MARIO-0.5B: A Multi-Agent Lightweight Model for Real-Time Open Information Extraction in Low-Resource Settings
Donghai Zhang | Shuangtao Yang | Dong Xiaozheng | Wei Song | Bo Fu
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) have shown remarkable capabilities in open information extraction. However, their substantial resource requirements often restrict their deployment in resource-constrained industrial settings, particularly on edge devices. The high computational demands also lead to increased latency, making them difficult to apply in real-time applications. In this paper, we introduce MARIO-0.5B, an ultra-lightweight model trained on instruction-based samples in Chinese, English, Korean, and Russian. We also present a novel multi-agent framework, SMOIE, which integrates schema mining, information extraction, reasoning, and decision-making to effectively support MARIO-0.5B. The experimental results show that our framework outperforms large-scale models with up to 70B parameters, reducing computational resources by 140x and delivering 11x faster response times. Moreover, it operates efficiently in CPU-only environments, which makes it well-suited for widespread industrial deployment.

2024

Mini-DA: Improving Your Model Performance through Minimal Data Augmentation using LLM
Shuangtao Yang | Xiaoyi Liu | Xiaozheng Dong | Bo Fu
Proceedings of the Fifth Workshop on Data Science with Human-in-the-Loop (DaSH 2024)

When performing data augmentation with large language models (LLMs), the common approach is to directly generate a large number of new samples from the original dataset and then train the model on the combination of the augmented and original datasets. However, data generation demands extensive computational resources. In this study, we propose Mini-DA, a minimized data augmentation method that leverages feedback from the target model during training to select only the most challenging samples from the validation set for augmentation. Our experimental results on text classification show that, using as little as 13 percent of the original augmentation volume, Mini-DA achieves performance comparable to full data augmentation on the intent detection task, significantly improving the efficiency of data and computational resource utilization.
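
The selection step described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes that "most challenging" means highest cross-entropy loss under the current model, and the function name `select_hardest` and its parameters are hypothetical. The abstract does not state the actual selection criterion.

```python
import math


def select_hardest(samples, true_label_probs, fraction):
    """Return the `fraction` of samples the model finds hardest.

    Assumption (not from the paper): difficulty is measured by the
    cross-entropy loss, i.e. -log of the probability the current model
    assigns to each sample's true label.
    """
    if len(samples) != len(true_label_probs):
        raise ValueError("samples and true_label_probs must align")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")

    # Clamp probabilities away from zero so the log is always defined.
    losses = [-math.log(max(p, 1e-12)) for p in true_label_probs]

    # Keep at least one sample, rounding the budget up.
    k = max(1, math.ceil(fraction * len(samples)))

    # Rank indices from highest loss (hardest) to lowest.
    ranked = sorted(range(len(samples)), key=lambda i: losses[i], reverse=True)
    return [samples[i] for i in ranked[:k]]
```

Only the returned subset would then be passed to the LLM for augmentation, which is where the reduction in generation cost would come from under this reading.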

2023

Knowdee at BLP-2023 Task 2: Improving Bangla Sentiment Analysis Using Ensembled Models with Pseudo-Labeling
Xiaoyi Liu | Mao Teng | Shuangtao Yang | Bo Fu
Proceedings of the First Workshop on Bangla Language Processing (BLP-2023)

This paper outlines our submission to the Sentiment Analysis Shared Task at the Bangla Language Processing (BLP) Workshop at EMNLP 2023 (Hasan et al., 2023a). The objective of this task is to detect the sentiment of each text by classifying it as Positive, Negative, or Neutral. This shared task is based on the MUltiplatform BAngla SEntiment (MUBASE) (Hasan et al., 2023b) and SentNob (Islam et al., 2021) datasets, which consist of public comments from various social media platforms. Our proposed method for this task is based on the pre-trained Bangla language model BanglaBERT (Bhattacharjee et al., 2022). We trained an ensemble of BanglaBERT models on the original dataset and used it to generate pseudo-labels for data augmentation. This expanded dataset was then used to train our final models. During the evaluation phase, 30 teams submitted their systems, and our system achieved the second-highest performance with an F1 score of 0.7267. The source code of the proposed approach is available at https://github.com/KnowdeeAI/blp_task2_knowdee.git.
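
The pseudo-labeling step mentioned above can be illustrated with a small sketch. It is not taken from the linked repository: it assumes the ensemble is combined by averaging class probabilities and that only confident predictions are kept, and the function `pseudo_label` and its `threshold` parameter are hypothetical. The abstract does not say whether a confidence filter was used.

```python
LABELS = ("Positive", "Negative", "Neutral")


def pseudo_label(prob_sets, labels=LABELS, threshold=0.9):
    """Assign pseudo-labels from an ensemble of classifiers.

    prob_sets[m][i][c] is the probability that model m assigns to class c
    for unlabeled example i. Probabilities are averaged across models
    (an assumed combination rule), and an example is kept only if the
    averaged top-class probability reaches `threshold` (also an assumption).
    Returns a list of (example_index, label) pairs.
    """
    n_models = len(prob_sets)
    n_examples = len(prob_sets[0])
    selected = []
    for i in range(n_examples):
        # Average each class probability over all ensemble members.
        avg = [
            sum(model[i][c] for model in prob_sets) / n_models
            for c in range(len(labels))
        ]
        best = max(range(len(labels)), key=lambda c: avg[c])
        if avg[best] >= threshold:
            selected.append((i, labels[best]))
    return selected
```

The selected pairs would be appended to the original training data before the final models are trained.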

2016

Integrating Topic Modeling with Word Embeddings by Mixtures of vMFs
Ximing Li | Jinjin Chi | Changchun Li | Jihong Ouyang | Bo Fu
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Gaussian LDA integrates topic modeling with word embeddings by replacing the discrete topic distribution over word types with a multivariate Gaussian distribution on the embedding space. This takes the semantic information of words into account. However, the Euclidean similarity used in Gaussian topics is not an optimal semantic measure for word embeddings. It is widely acknowledged that cosine similarity better describes the semantic relatedness between word embeddings. To employ the cosine measure and capture complex topic structure, we use von Mises-Fisher (vMF) mixture models to represent topics, and then develop a novel mix-vMF topic model (MvTM). Using public pre-trained word embeddings, we evaluate MvTM on three real-world data sets. Experimental results show that our model can discover more coherent topics than the state-of-the-art baseline models, and achieves competitive classification performance.
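
For reference, the standard vMF density shows why this family matches the cosine measure. The notation below is generic and not taken from the paper: $x$ and $\mu$ are unit vectors in $\mathbb{R}^d$, $\kappa \ge 0$ is the concentration, and $I_\nu$ is the modified Bessel function of the first kind.

```latex
f(x \mid \mu, \kappa) = C_d(\kappa)\, \exp\!\left(\kappa\, \mu^{\top} x\right),
\qquad
C_d(\kappa) = \frac{\kappa^{d/2 - 1}}{(2\pi)^{d/2}\, I_{d/2 - 1}(\kappa)},
\qquad
\lVert x \rVert = \lVert \mu \rVert = 1 .
```

Because both vectors have unit length, $\mu^{\top} x$ is exactly the cosine similarity between the word embedding and the mean direction, so the density depends on $x$ only through that cosine. A mixture of such components, $\sum_{m} \pi_m\, f(x \mid \mu_m, \kappa_m)$ with weights $\pi_m$ summing to one, is the generic form of a vMF mixture; the exact parameterization used in MvTM is not given in the abstract.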
