Shuangyong Song
Also published as:
双永 宋
Extensive research has been conducted to explore the capabilities of large language models (LLMs) in table reasoning. However, the essential task of transforming table information into reports remains a significant challenge for industrial applications. This task is plagued by two critical issues: 1) the complexity and diversity of tables lead to suboptimal reasoning outcomes; and 2) existing table benchmarks lack the capacity to adequately assess the practical application of this task. To fill this gap, we propose the table-to-report task and construct a bilingual benchmark named T2R-bench, where the key information flows from the tables to the reports. The benchmark comprises 457 industrial tables, all derived from real-world scenarios and encompassing 19 industry domains as well as four types of industrial tables. Furthermore, we propose novel evaluation criteria to fairly measure the quality of report generation. Experimental results show that even the best-performing model, Deepseek-R1, achieves an overall score of only 62.71%, indicating that LLMs still have room for improvement on T2R-bench.
With the rapid advancement of large language models (LLMs), recent researchers have increasingly focused on leveraging the superior capabilities of LLMs in text/code understanding and generation to tackle text-to-SQL tasks. Traditional approaches adopt schema linking to first eliminate redundant tables and columns and then prompt LLMs for SQL generation. However, they often struggle to accurately identify the corresponding tables and columns, due to discrepancies in naming conventions between natural language questions (NL) and database schemas. Besides, existing methods overlook the challenge of effectively transforming structural information from NL into SQL. To address these limitations, we introduce UCS-SQL, a novel text-to-SQL framework uniting both content and structure pipes to bridge the gap between NL and SQL. Specifically, the content pipe focuses on identifying the key content within the original question, while the structure pipe is dedicated to transforming the linguistic structure from NL to SQL. Additionally, we strategically select few-shot examples by considering both the SQL Skeleton and the Question Expression (the SS-QE selection method), thus providing targeted examples for SQL generation. Experimental results on BIRD and Spider demonstrate the effectiveness of our UCS-SQL framework.
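The SS-QE idea of pairing each query with targeted demonstrations can be illustrated with a minimal sketch. It assumes a first-pass draft SQL is available for skeleton extraction; the masking rules, the surface-similarity measure, and the equal 0.5/0.5 weighting are illustrative assumptions, not the paper's actual settings.

```python
import re
from difflib import SequenceMatcher

def sql_skeleton(sql: str) -> str:
    """Mask literals and numbers so only the structural skeleton remains."""
    sql = re.sub(r"'[^']*'", "<val>", sql)
    sql = re.sub(r"\b\d+(\.\d+)?\b", "<num>", sql)
    return sql.upper()

def text_sim(a: str, b: str) -> float:
    """Cheap surface similarity; a real system might use embeddings instead."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def select_examples(question: str, draft_sql: str, pool: list[dict], k: int = 3) -> list[dict]:
    """Rank candidate (question, sql) demonstrations by a combined
    question-expression and SQL-skeleton similarity, then keep the top k."""
    skel = sql_skeleton(draft_sql)
    ranked = sorted(
        pool,
        key=lambda ex: 0.5 * text_sim(question, ex["question"])
                     + 0.5 * text_sim(skel, sql_skeleton(ex["sql"])),
        reverse=True,
    )
    return ranked[:k]
```

Selecting on both axes is what makes the examples "targeted": question similarity alone can retrieve demonstrations whose SQL shape is unrelated to the query at hand.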
Multilingual spoken language understanding (SLU) involves intent detection (ID) and slot filling (SF) across multiple languages. The inherent linguistic diversity presents significant challenges in achieving performance comparable to traditional SLU. Recent studies have attempted to improve multilingual SLU performance by sharing multilingual encoders. However, these approaches have not directly established information flow between languages. To address this, we first demonstrate the feasibility of such information transfer and pinpoint the key challenges: prediction error mitigation and multilingual slot alignment. We then propose the INformation Transfer network (INT) to tackle these challenges. The gate unit in INT controls the information flow between languages, reducing the adverse impact of prediction errors on both ID and SF. Additionally, we reformulate SF as a span prediction problem and introduce a slot-matching attention mechanism to achieve slot alignment across languages. Experimental results on the MASSIVE and MASSIVE-UG datasets show that our model outperforms all baselines in overall accuracy across all languages, and demonstrates robust performance when different languages are used as the source.
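A gate controlling cross-lingual information flow is commonly realized as a learned sigmoid gate; the following PyTorch fragment is a minimal sketch of that idea. The concatenation-based gating and the fusion form are assumptions for illustration, not INT's exact architecture.

```python
import torch
import torch.nn as nn

class GateUnit(nn.Module):
    """Sigmoid gate deciding how much source-language information flows
    into the target-language representation at each position."""
    def __init__(self, hidden: int):
        super().__init__()
        self.proj = nn.Linear(2 * hidden, hidden)

    def forward(self, target_h: torch.Tensor, source_h: torch.Tensor) -> torch.Tensor:
        # target_h, source_h: (seq_len, hidden)
        gate = torch.sigmoid(self.proj(torch.cat([target_h, source_h], dim=-1)))
        # a gate value near 0 suppresses source predictions that look erroneous
        return gate * source_h + (1 - gate) * target_h

fused = GateUnit(768)(torch.randn(16, 768), torch.randn(16, 768))
```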
This paper presents the approach we employed in SemEval-2025 Task 11: “Bridging the Gap in Text-Based Emotion Detection.” The core objective of this shared task is emotion perception, focusing on determining the emotion the speaker is likely expressing when uttering a sentence or short text fragment, as perceived by the majority. In this task, we applied a prompt optimization strategy based on in-context learning, combined with data augmentation and ensemble voting techniques, to significantly enhance the model’s performance. Through these optimizations, the model demonstrated improved accuracy and stability in emotion detection. Ultimately, in both Track A (Multi-label Emotion Detection) and Track B (Emotion Intensity Prediction), our approach achieved top-3 rankings across multiple languages, showcasing the effectiveness and cross-lingual adaptability of our method.
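As a worked illustration of the ensemble-voting step, the sketch below keeps a label when a majority of prompted runs predict it; the 0.5 threshold is an assumption.

```python
from collections import Counter

def ensemble_vote(runs: list[set[str]], threshold: float = 0.5) -> set[str]:
    """Keep an emotion label when at least `threshold` of the ensemble
    members (different prompts or sampled generations) predicted it."""
    counts = Counter(label for run in runs for label in run)
    return {label for label, n in counts.items() if n / len(runs) >= threshold}

print(ensemble_vote([{"joy", "surprise"}, {"joy"}, {"joy", "fear"}]))  # {'joy'}
```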
This paper presents the system we developed for SemEval-2025 Task 8, which focuses on table question answering (TQA). The TQA task faces challenges stemming from the characteristics of real-world tabular data, such as large size, incomplete column semantics, and entity ambiguity. To address these issues, we propose a large language model (LLM)-powered, programming-based framework named Flow-of-Table-Reasoning. We introduce a table schema that integrates verbalized structure and semantics for query decomposition and programming, enabling a holistic understanding of tables and the ability to process large tables. We design a multi-step schema linking plan to derive a focused table schema that retains only information relevant to the query, aiming to eliminate ambiguity and reduce hallucinations. Furthermore, we incorporate the reasoning workflow into an iterative thinking architecture, allowing incremental cycles of thinking, reasoning, and reflection. Our system achieves first place on both the TQA and Lite TQA subtasks.
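The iterative thinking architecture can be sketched as a small controller looping over think, reason, and reflect phases. Here `llm` and `run_program` are hypothetical callables, and the prompts and stopping rule are illustrative assumptions rather than the system's actual ones.

```python
def flow_of_table_reasoning(query: str, schema: str, llm, run_program,
                            max_iters: int = 3):
    """One think-reason-reflect cycle per iteration, with the reflection
    fed back as feedback for the next cycle."""
    feedback, result = "", None
    for _ in range(max_iters):
        plan = llm(f"Schema:\n{schema}\nQuery: {query}\n"
                   f"Feedback: {feedback}\nThink: plan the next step.")
        code = llm(f"Reason: write a program for this plan:\n{plan}")
        result = run_program(code)
        verdict = llm(f"Reflect: does {result!r} answer {query!r}? "
                      f"Start with Yes or No.")
        if verdict.strip().startswith("Yes"):
            break
        feedback = verdict
    return result
```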
In this paper, we present a novel pipeline for the XLLM Shared Task-III: Large Language Model for Structural Reasoning (LLM-SR). Our pipeline addresses key challenges in automatic process-reward training data construction, such as high manual annotation costs, limited accuracy of large models in structured data processing, and dependency on auxiliary information for validation. To overcome these limitations, we first decompose the construction process into extraction and validation phases. Leveraging model-generated annotations, we produce pseudo-labeled data and iteratively refine model performance. Second, by analyzing structured data patterns, we encode structural constraints into a rule-based module and fine-tune the model with Group Relative Policy Optimization (GRPO), significantly improving structured data extraction success rates. Finally, we train the model to generate critical responses that assess evidence-conclusion relationships, thus enhancing validation reliability. Experimental results demonstrate that our pipeline outperforms models with an order of magnitude more parameters and achieves first place on the task.
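A rule-based structural-constraint module of this kind typically reduces to a reward function scoring how well an output satisfies the expected structure. The sketch below assumes JSON-formatted extractions with hypothetical `evidence`/`conclusion` fields; the pipeline's actual rules are not specified in the abstract.

```python
import json

def structure_reward(output: str,
                     required_keys: tuple = ("evidence", "conclusion")) -> float:
    """Rule-based reward for GRPO-style training: full credit only when
    the output parses and satisfies every structural constraint."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return 0.0                      # malformed structure earns no reward
    if not isinstance(data, dict):
        return 0.0
    satisfied = sum(bool(data.get(key)) for key in required_keys)
    return satisfied / len(required_keys)
```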
Hierarchical text classification aims at categorizing texts into a multi-tiered, tree-structured hierarchy of labels. Existing methods pay more attention to capturing hierarchy-aware text features by exploiting explicit parent-child relationships, while interactions between peer labels are rarely taken into account, resulting in severe label confusion within each layer. In this work, we propose a novel Dual Prompt Tuning (DPT) method, which emphasizes identifying discrimination among peer labels by performing contrastive learning on each hierarchical layer. We design an innovative hand-crafted prompt containing slots for both positive and negative label predictions to cooperate with contrastive learning. In addition, we introduce a label hierarchy self-sensing auxiliary task to ensure cross-layer label consistency. Extensive experiments demonstrate that DPT achieves significant improvements and outperforms current state-of-the-art methods on the BGC and RCV1-V2 benchmark datasets.
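Layer-wise discrimination among peer labels can be expressed as a contrastive objective in which the gold label is the positive and its siblings on the same layer are the negatives. This PyTorch sketch assumes precomputed text and label embeddings and a 0.1 temperature; DPT's actual loss may differ.

```python
import torch
import torch.nn.functional as F

def peer_label_contrastive_loss(text_emb: torch.Tensor,
                                peer_label_embs: torch.Tensor,
                                gold_idx: int,
                                temperature: float = 0.1) -> torch.Tensor:
    """Contrast the text against all peer labels on one hierarchy layer:
    the gold label is the positive, its siblings are the negatives."""
    # text_emb: (hidden,); peer_label_embs: (num_peers, hidden)
    logits = F.cosine_similarity(text_emb.unsqueeze(0), peer_label_embs, dim=-1)
    return F.cross_entropy(logits.unsqueeze(0) / temperature,
                           torch.tensor([gold_idx]))
```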
This paper describes the participation of team “TeleAI” in the third International Ancient Chinese Language Information Processing Evaluation (EvalHan24). The competition comprises a joint task of sentence segmentation and punctuation, categorized into open and closed tracks based on the models and data used. In the final evaluation, our system achieved significantly better results than the baseline. Specifically, in the closed-track sentence segmentation task, we obtained an F1 score of 0.8885, while in the sentence punctuation task, we achieved an F1 score of 0.7129.
In this paper, we present TeleChat, a collection of large language models (LLMs) with 7 billion and 12 billion parameters. TeleChat is initially pretrained on an extensive corpus containing a diverse collection of texts in both English and Chinese, encompassing trillions of tokens. Subsequently, the model undergoes fine-tuning to align with human preferences, following a detailed methodology that we describe. We evaluate the performance of TeleChat on various tasks, including general dialogue generation, language understanding, mathematics, reasoning, code generation, and knowledge-based question answering. Our findings indicate that TeleChat achieves state-of-the-art performance among open-source models of similar size across a wide range of public benchmarks. To support future research and applications utilizing LLMs, we release the fine-tuned model checkpoints of TeleChat-7B and TeleChat-12B, along with code and a portion of our filtered high-quality pretraining data, to the public community.
Dialogue state error correction has recently been proposed to correct wrong slot values in predicted dialogue states, thereby mitigating the error propagation problem in dialogue state tracking (DST). These approaches, though effective, are heavily intertwined with specific DST models, limiting their applicability to other DST models. To solve this problem, we propose Scalable Dialogue State Correction (Scalable-DSC), which can correct wrong slot values in the dialogue state predicted by any DST model. Specifically, we propose a Structural Template Prompt (STP) that converts the predicted dialogue state from any DST model into a standardized natural language sequence, includes it as part of the historical context, associates it with the dialogue history, and generates a corrected dialogue state sequence based on predefined template options. We further enhance Scalable-DSC by introducing two training strategies. The first employs a predictive state simulator to simulate predicted dialogue states as training data, enhancing the generalization ability of the model. The second uses the dialogue states predicted by a DST model as training data, aiming to mitigate the mismatch in error type distributions between training and inference. Experiments confirm that our model achieves state-of-the-art results on MultiWOZ 2.0-2.4.
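The serialization idea behind the Structural Template Prompt, turning any model's predicted state into a standardized natural-language span, can be sketched as follows; the exact template wording here is an assumption.

```python
def state_to_prompt(state: dict[str, str]) -> str:
    """Serialize any DST model's predicted state into one standardized
    natural-language span for the correction model's context."""
    if not state:
        return "The predicted dialogue state is empty."
    slots = "; ".join(f"{slot} is {value}" for slot, value in sorted(state.items()))
    return f"The predicted dialogue state is: {slots}."

print(state_to_prompt({"hotel-area": "north", "hotel-stars": "4"}))
# The predicted dialogue state is: hotel-area is north; hotel-stars is 4.
```

Because the serialization depends only on slot-value pairs, not on how the upstream model produced them, the corrector stays decoupled from any particular DST architecture.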
Due to the increasing use of service chatbots on E-commerce platforms in recent years, customer satisfaction prediction (CSP) is gaining more and more attention. CSP is dedicated to evaluating subjective customer satisfaction in conversational service, and thus helps improve the customer service experience. However, previous methods focus on modeling customer-chatbot interaction across different turns, which makes it hard to represent the important dynamic satisfaction states throughout the customer journey. In this work, we investigate the problem of satisfaction state tracking and its effects on CSP in E-commerce service chatbots. To this end, we propose a dialogue-level classification model named DialogueCSP to track satisfaction states for CSP. In particular, we explore a novel two-step interaction module to represent the dynamic satisfaction states at each turn. To capture dialogue-level satisfaction states for CSP, we further introduce dialogue-aware attentions to integrate historical informative cues into the interaction module. To evaluate the proposed approach, we also build a Chinese E-commerce dataset for CSP. Experimental results demonstrate that our model significantly outperforms multiple baselines, illustrating the benefits of satisfaction state tracking for CSP.
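Dialogue-aware attention over per-turn states can be sketched as simple attention pooling, as below; the scoring function and shapes are illustrative assumptions rather than DialogueCSP's exact design.

```python
import torch
import torch.nn as nn

class DialogueAwareAttention(nn.Module):
    """Attention pooling over per-turn satisfaction states, so informative
    historical turns contribute more to the dialogue-level prediction."""
    def __init__(self, hidden: int):
        super().__init__()
        self.score = nn.Linear(hidden, 1)

    def forward(self, turn_states: torch.Tensor) -> torch.Tensor:
        # turn_states: (num_turns, hidden), one satisfaction state per turn
        weights = torch.softmax(self.score(turn_states), dim=0)  # (num_turns, 1)
        return (weights * turn_states).sum(dim=0)                # (hidden,)
```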
Recently proposed dialogue state tracking (DST) approaches predict the dialogue state of a target turn sequentially based on the previous dialogue state. During training, the ground-truth previous dialogue state is utilized as the historical context; however, only the previously predicted dialogue state is available during inference. This discrepancy can lead to error propagation, i.e., mistakes made by the model in the current turn are likely to be carried over to the following turns. To solve this problem, we propose Correctable Dialogue State Tracking (Correctable-DST). Specifically, it consists of three stages: (1) a Predictive State Simulator is exploited to generate a previously “predicted” dialogue state based on the ground-truth previous dialogue state during training; (2) a Slot Detector is proposed to determine the slots with an incorrect value in the previously “predicted” state and the slots whose values are to be updated in the current turn; (3) a State Generator takes the names of the above-selected slots as a prompt to generate the current state. Empirical results show that our approach achieves 67.51%, 68.24%, 70.30%, 71.38%, and 81.27% joint goal accuracy on the MultiWOZ 2.0-2.4 datasets, respectively, establishing a new state-of-the-art performance with significant improvements.
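At inference time the three-stage design reduces to a short pipeline; in this sketch `slot_detector` and `state_generator` are hypothetical callables and the prompt format is an assumption (the Predictive State Simulator is used only during training and is omitted here).

```python
def correctable_dst_inference(history: str, prev_state: dict,
                              slot_detector, state_generator) -> dict:
    """Stages 2-3 at inference: detect slots that are wrong or need
    updating, then prompt the generator with their names."""
    slots = slot_detector(history, prev_state)   # e.g. ["hotel-area", "hotel-stars"]
    prompt = "slots to correct or update: " + ", ".join(slots)
    return state_generator(history, prev_state, prompt)
```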
E-commerce has grown substantially over the last several years, and chatbots for intelligent customer service are concurrently drawing attention. We present AliMe Assist, a Chinese intelligent assistant designed to create an innovative online shopping experience in E-commerce. Based on question answering (QA), AliMe Assist offers assistance service, customer service, and chatting service. According to user studies and real online testing, comforting customers’ negative emotions, which account for more than 5% of all customer visits to AliMe, is a key point in providing considerate service. In this paper, we propose a framework for obtaining proper answers to customers’ emotional questions. The framework takes an emotion classification model as its core, and final answer selection is based on topic classification and text matching. Our experiments on real online systems show that the framework is very promising.
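The framework's routing logic, with emotion classification at the core and topic classification plus text matching selecting the final answer, can be sketched as follows; the label names and the `match_answer` helper are hypothetical.

```python
def answer_emotional_question(text: str, emotion_clf, topic_clf, match_answer):
    """Core routing: detect the customer's emotion first, then pick a
    comforting answer via topic classification and text matching."""
    emotion = emotion_clf(text)                # e.g. "anger", "sadness", "neutral"
    if emotion == "neutral":
        return None                            # hand off to the regular QA flow
    topic = topic_clf(text)                    # e.g. "delivery delay"
    return match_answer(text, emotion, topic)  # best-matching considerate reply
```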