Tabular data, which accounts for over 80% of enterprise data assets, is vital in various fields. With growing concerns about privacy protection and data-sharing restrictions, generating high-quality synthetic tabular data has become essential. Recent advances show that large language models (LLMs) can effectively generate realistic tabular data by leveraging semantic information and avoiding the high dimensionality that one-hot encoding introduces. However, current methods do not fully exploit the rich information available in tables. To address this, we introduce AIGT (AI Generative Table), a prompt-enhancement approach that uses metadata, such as table descriptions and schemas, as prompts to generate ultra-high-quality synthetic data. To overcome the token-limit constraints of LLMs, we propose long-token partitioning algorithms that enable AIGT to model tables of any scale. AIGT achieves state-of-the-art performance on 14 of 20 public datasets and on two real industry datasets within the Alipay risk control system.
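As a rough illustration of why partitioning is needed, the sketch below greedily splits a wide table's columns into groups whose serialized prompts fit a token budget; it is only a minimal stand-in, not AIGT's actual long-token partitioning algorithm, and the helper names (count_tokens, serialize_row, partition_columns) are assumptions for the example.

```python
# Minimal sketch: split a wide table into prompt-sized column groups.
# Hypothetical helpers -- not AIGT's API.

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated piece.
    return len(text.split())

def serialize_row(row: dict) -> str:
    # "col is value" serialization commonly used for LLM tabular prompts.
    return ", ".join(f"{k} is {v}" for k, v in row.items())

def partition_columns(sample_row: dict, key_cols: list, max_tokens: int) -> list:
    """Greedily group non-key columns so that each group, serialized together
    with the shared key columns, stays within the token budget."""
    groups, current = [], []
    for col in (c for c in sample_row if c not in key_cols):
        candidate = {k: sample_row[k] for k in key_cols + current + [col]}
        if current and count_tokens(serialize_row(candidate)) > max_tokens:
            groups.append(current)
            current = [col]
        else:
            current.append(col)
    if current:
        groups.append(current)
    return groups

row = {"id": 1, "age": 34, "income": 52000, "job": "engineer", "city": "Hangzhou"}
print(partition_columns(row, key_cols=["id"], max_tokens=12))
# [['age', 'income', 'job'], ['city']] -- each group becomes its own generation prompt.
```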
The rapid growth of video platforms has transformed information dissemination and led to an explosion of multimedia content. However, this widespread reach also introduces risks, as some users exploit these platforms to spread hate speech, which is often concealed through complex rhetoric, making hateful video detection a critical challenge. Existing detection methods rely heavily on unimodal analysis or simple feature fusion, and struggle to capture cross-modal interactions or to reason through implicit hate conveyed by sarcasm and metaphor. To address these limitations, we propose HVGuard, the first reasoning-based hateful video detection framework built on multimodal large language models (MLLMs). Our approach integrates Chain-of-Thought (CoT) reasoning to enhance multimodal interaction modeling and the interpretation of implicit hate. Additionally, we design a Mixture-of-Experts (MoE) network for efficient multimodal fusion and final decision-making. The framework is modular and extensible, allowing flexible integration of different MLLMs and encoders. Experimental results demonstrate that HVGuard outperforms existing state-of-the-art detection methods, improving accuracy by 6.88% to 13.13% and M-F1 by 9.21% to 34.37% on two public datasets covering English and Chinese.
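To make the fusion step concrete, here is a generic mixture-of-experts sketch in which a gating network weights several expert MLPs over concatenated multimodal features; it is not HVGuard's released code, and the module names, dimensions, and the use of simple concatenation are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class MoEFusion(nn.Module):
    """Toy mixture-of-experts fusion: a gating network softly weights several
    expert MLPs applied to the concatenated multimodal feature vector."""
    def __init__(self, in_dim: int, hidden: int = 256, n_experts: int = 4, n_classes: int = 2):
        super().__init__()
        self.gate = nn.Linear(in_dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        weights = torch.softmax(self.gate(x), dim=-1)            # (B, E) expert weights
        outputs = torch.stack([e(x) for e in self.experts], 1)   # (B, E, C) expert logits
        return (weights.unsqueeze(-1) * outputs).sum(dim=1)      # (B, C) fused logits

# Assume the fused input concatenates text/audio/vision embeddings plus a
# CoT-reasoning embedding produced by an MLLM (dimensions are placeholders).
fused = torch.randn(8, 768 * 4)
logits = MoEFusion(in_dim=768 * 4)(fused)
```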
The sharing of naming (话头) and telling (话身) components is an important grammatical means by which clauses are combined into clause complexes, and an important foundation for discourse-level syntactic and semantic analysis of Chinese. By introducing a sliding-window mechanism, this paper converts a discourse text and its component-sharing relations into the problem of predicting component-sharing relations within text fragments; to merge and select among the per-fragment predictions, it proposes several candidate-elimination strategies based on the grammatical constraints on naming-telling sharing relations. Experimental results show that, even without clause-complex boundary information, the proposed method achieves results comparable to the traditional NTC-based approach, and in particular improves recall by about 0.4 percentage points at positions where a shared component is genuinely missing.
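The sliding-window idea can be sketched as follows; window size, stride, the per-window predictor, and the merging rule here are placeholders rather than the paper's exact settings or candidate-elimination strategies.

```python
import re

def sliding_windows(text: str, size: int = 3, stride: int = 2):
    """Split a discourse into punctuation-delimited clauses and yield
    overlapping windows of `size` clauses, advancing `stride` clauses each step."""
    clauses = [c for c in re.split(r"(?<=[。!?;,])", text) if c.strip()]
    for start in range(0, max(len(clauses) - size + 1, 1), stride):
        yield start, clauses[start:start + size]

def merge_predictions(window_preds):
    """Keep, for each clause position, the highest-confidence prediction among
    overlapping windows -- a simple stand-in for candidate elimination."""
    best = {}
    for start, preds in window_preds:          # preds: [(offset, label, score), ...]
        for offset, label, score in preds:
            pos = start + offset
            if pos not in best or score > best[pos][1]:
                best[pos] = (label, score)
    return best

windows = list(sliding_windows("天色渐晚,他还没回来,大家都有些着急,只好先吃饭,饭后再等。"))
```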
Machine Reading Comprehension (MRC) tests a machine's ability to understand natural language by having it answer questions about a given context. Neural MRC models built on large-scale pretrained language models have made substantial progress, but answer-extraction accuracy still needs improvement when the answer, clue, and question elements are related across punctuation-delimited sentences or over long distances. This paper analyzes the naming-telling (话头-话体) structure within a discourse to establish long-distance relations between punctuation-delimited sentences and to complete missing shared components, thereby assisting answer extraction; it designs and implements an MRC model that incorporates naming-telling structure information. Experiments on the public CMRC2018 dataset show that the model improves F1 by 2.4% and EM by 6% over the baseline.
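One plausible way to use such structure information, shown purely as a hypothetical illustration rather than the paper's model, is to splice each recovered shared naming back into its punctuation-delimited sentence so that long-distance dependencies become local before a standard reader extracts the answer span.

```python
def complete_shared_components(sentences, naming_of):
    """sentences: punctuation-delimited sentences of one context;
    naming_of[i]: recovered shared naming for sentence i, or absent."""
    completed = []
    for i, sent in enumerate(sentences):
        naming = naming_of.get(i)
        completed.append(f"{naming}{sent}" if naming else sent)
    return "".join(completed)

sents = ["小王买了一台新电脑,", "性能很好,", "价格也不贵。"]
namings = {1: "这台电脑", 2: "这台电脑"}   # hypothetical recovered namings
context = complete_shared_components(sents, namings)
# The completed context is then fed to an ordinary pretrained span extractor
# (e.g. a BERT-style reader) for answer extraction.
```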
Word segmentation is one of the fundamental tasks of Chinese information processing. Fully supervised Chinese word segmentation is now relatively mature and performs well in the general domain, but it depends on large-scale annotated corpora and transfers poorly across domains, with especially weak recognition of out-of-vocabulary words in new domains. To alleviate these problems, this paper proposes a semi-supervised Chinese word segmentation framework that achieves cross-domain transfer by making full use of relatively easy-to-obtain unannotated text from the target domain, and designs and implements a semi-supervised CRF segmentation model based on a word memory network and sequence conditional entropy. Experimental results show that the model achieves improvements of up to 2.35% in F-score and 12.12% in OOV recall (R_OOV) on multiple domain datasets, and obtains the best results to date on several of them.
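A minimal sketch of the semi-supervised objective is given below: a supervised tagging loss on labeled source-domain data plus a conditional-entropy penalty on unlabeled target-domain data. For brevity it uses token-level cross-entropy and token-level entropy as simplifications of the CRF likelihood and sequence-level conditional entropy described above; all names and hyperparameters are assumptions.

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits_sup, gold_tags, logits_unsup, lam: float = 0.1):
    """Supervised loss on labeled data + entropy minimization on unlabeled data.
    logits_*: (batch, seq_len, n_tags); gold_tags: (batch, seq_len)."""
    sup = F.cross_entropy(logits_sup.transpose(1, 2), gold_tags)
    probs = F.softmax(logits_unsup, dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(-1).mean()
    return sup + lam * entropy

logits_l = torch.randn(4, 20, 4)            # 4 segmentation tags: B, M, E, S
tags_l = torch.randint(0, 4, (4, 20))
logits_u = torch.randn(4, 20, 4)            # unlabeled target-domain batch
loss = semi_supervised_loss(logits_l, tags_l, logits_u)
```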