Li Yang (李旸) - ACL Anthology

Li Yang

Also published as: 旸李

2025

pdf bib abs
StoryLLaVA: Enhancing Visual Storytelling with Multi-Modal Large Language Models
Li Yang | Zhiding Xiao | Wenxin Huang | Xian Zhong
Proceedings of the 31st International Conference on Computational Linguistics

The rapid development of multimodal large language models (MLLMs) has positioned visual storytelling as a crucial area in content creation. However, existing models often struggle to maintain temporal, spatial, and narrative coherence across image sequences, and they frequently lack the depth and engagement of human-authored stories. To address these challenges, we propose Story with Large Language-and-Vision Alignment (StoryLLaVA), a novel framework for enhancing visual storytelling. Our approach introduces a topic-driven narrative optimizer that improves both the training data and MLLM models by integrating image descriptions, topic generation, and GPT-4-based refinements. Furthermore, we employ a preference-based ranked story sampling method that aligns model outputs with human storytelling preferences through positive-negative pairing. These two phases of the framework differ in their training methods: the former uses supervised fine-tuning, while the latter incorporates reinforcement learning with positive and negative sample pairs. Experimental results demonstrate that StoryLLaVA outperforms current models in visual relevance, coherence, and fluency, with LLM-based evaluations confirming the generation of richer and more engaging narratives. The enhanced dataset and model will be made publicly available soon.

2024

pdf bib abs
融合扩展语义和标签层次信息的文档级事件抽取(Document-Level Event Extraction with Integrating Extended Semantics and Label Hierarchy Information)
Fu Yujiao (符玉娇) | Liao Jian (廖健) | Li Yang (李旸) | Guo Zhangfeng (郭张峰) | Wang Suge (王素格)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“文档级事件抽取是自然语言处理中的一项重要任务,面临论元分散和多事件提及的挑战,现有研究通常从文档的所有句子中抽取论元,通过论元角色建模捕获实体间关系,忽略了文档中事件-句子间的关联差异性。本文提出了一种融合扩展语义和标签层次信息的文档级事件抽取方法。首先,利用大语言模型对文本和事件类型标签与论元角色标签进行语义扩展,以引入更丰富的背景语义信息;其次,基于关联差异性的事件类型检测模块,获取文档中与事件类型高度相关的句子,通过约束候选实体的抽取范围,来缓解论元分散问题;进一步,针对文档提及的多个事件类型,利用有向无环图从候选实体中抽取论元,获取所有事件要素。在ChFinAnn和DuEE-Fin两个数据集上的实验结果表明,本文提出的方法相比基线模型可以有针对性地缓解多个事件所属论元分散的问题,有效地提升事件抽取的性能。”

pdf bib abs
基于双图注意力网络的篇章级散文情绪变化分析方法(A Document-Level Emotion Change Analysis Method Based on DualGATs for Prose)
Li Ailin (李爱琳) | Li Yang (李旸) | Wang Suge (王素格) | Li Shuqi (李书琪)
Proceedings of the 23rd Chinese National Conference on Computational Linguistics (Volume 1: Main Conference)

“在散文中,作者的情绪会伴随着文章的段落或者句子发生变化,比如从悲伤到快乐、从喜悦到愤怒。为此,本文构建散文情绪变化数据集,提出一种基于双图注意力网络的多种知识融合的情绪变化分析方法。首先,引入意象知识库,建立融合意象知识的句子表示;其次,构建上下文带权依赖图和语篇带权依赖图,通过融合上下文知识和语篇结构,建立了融合上下文知识、语篇结构的句子表示;同时设计愉悦效价识别层,获得融合愉悦效价信息的句子表示;在此基础上,将以上三者表示进行拼接,通过全连接网络得到最终的情绪变化结果。实验结果表明,本文提出的方法可以有效识别情绪变化,为散文阅读理解中的思想情绪变化类问题的解答提供帮助。”

Product attribute value extraction involves identifying the specific values associated with various attributes from a product profile. While existing methods often prioritize the development of effective models to improve extraction performance, there has been limited emphasis on extraction efficiency. However, in real-world scenarios, products are typically associated with multiple attributes, necessitating multiple extractions to obtain all corresponding values. In this work, we propose an Efficient product Attribute Value Extraction (EAVE) approach via lightweight sparse-layer interaction. Specifically, we employ a heavy encoder to separately encode the product context and attribute. The resulting non-interacting heavy representations of the context can be cached and reused for all attributes. Additionally, we introduce a light encoder to jointly encode the context and the attribute, facilitating lightweight interactions between them. To enrich the interaction within the lightweight encoder, we design a sparse-layer interaction module to fuse the non-interacting heavy representation into the lightweight encoder. Comprehensive evaluation on two benchmarks demonstrate that our method achieves significant efficiency gains with neutral or marginal loss in performance when the context is long and number of attributes is large. Our code is available at: https://anonymous.4open.science/r/EAVE-EA18.

2023

The task of product attribute value extraction is to identify values of an attribute from product information. Product attributes are important features, which help improve online shopping experience of customers, such as product search, recommendation and comparison. Most existing works only focus on extracting values for a set of known attributes with sufficient training data. However, with the emerging nature of e-commerce, new products with their unique set of new attributes are constantly generated from different retailers and merchants. Collecting a large number of annotations for every new attribute is costly and time consuming. Therefore, it is an important research problem for product attribute value extraction with limited data. In this work, we propose a novel prompt tuning approach with Mixed Prompts for few-shot Attribute Value Extraction, namely MixPAVE. Specifically, MixPAVE introduces only a small amount (< 1%) of trainable parameters, i.e., a mixture of two learnable prompts, while keeping the existing extraction model frozen. In this way, MixPAVE not only benefits from parameter-efficient training, but also avoids model overfitting on limited training examples. Experimental results on two product benchmarks demonstrate the superior performance of the proposed approach over several state-of-the-art baselines. A comprehensive set of ablation studies validate the effectiveness of the prompt design, as well as the efficiency of our approach.

2022

Automatic question generation (AQG) is the task of generating a question from a given passage and an answer. Most existing AQG methods aim at encoding the passage and the answer to generate the question. However, limited work has focused on modeling the correlation between the target answer and the generated question. Moreover, unseen or rare word generation has not been studied in previous works. In this paper, we propose a novel approach which incorporates question generation with its dual problem, question answering, into a unified primal-dual framework. Specifically, the question generation component consists of an encoder that jointly encodes the answer with the passage, and a decoder that produces the question. The question answering component then re-asks the generated question on the passage to ensure that the target answer is obtained. We further introduce a knowledge distillation module to improve the model generalization ability. We conduct an extensive set of experiments on SQuAD and HotpotQA benchmarks. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods.

Automatic product attribute value extraction refers to the task of identifying values of an attribute from the product information. Product attributes are essential in improving online shopping experience for customers. Most existing methods focus on extracting attribute values from product title and description.However, in many real-world applications, a product is usually represented by multiple modalities beyond title and description, such as product specifications, text and visual information from the product image, etc. In this paper, we propose SMARTAVE, a Structure Mltimodal trAnsformeR for producT Attribute Value Extraction, which jointly encodes the structured product information from multiple modalities. Specifically, in SMARTAVE encoder, we introduce hyper-tokens to represent the modality-level information, and local-tokens to represent the original text and visual inputs. Structured attention patterns are designed among the hyper-tokens and local-tokens for learning effective product representation. The attribute values are then extracted based on the learned embeddings. We conduct extensive experiments on two multimodal product datasets. Experimental results demonstrate the superior performance of the proposed approach over several state-of-the-art methods. Ablation studies validate the effectiveness of the structured attentions in modeling the multimodal product information.

2020

pdf bib abs
Improving Multimodal Named Entity Recognition via Entity Span Detection with Unified Multimodal Transformer
Jianfei Yu | Jing Jiang | Li Yang | Rui Xia
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

In this paper, we study Multimodal Named Entity Recognition (MNER) for social media posts. Existing approaches for MNER mainly suffer from two drawbacks: (1) despite generating word-aware visual representations, their word representations are insensitive to the visual context; (2) most of them ignore the bias brought by the visual context. To tackle the first issue, we propose a multimodal interaction module to obtain both image-aware word representations and word-aware visual representations. To alleviate the visual bias, we further propose to leverage purely text-based entity span detection as an auxiliary module, and design a Unified Multimodal Transformer to guide the final predictions with the entity span predictions. Experiments show that our unified approach achieves the new state-of-the-art performance on two benchmark datasets.

Transformer models have advanced the state of the art in many Natural Language Processing (NLP) tasks. In this paper, we present a new Transformer architecture, “Extended Transformer Construction” (ETC), that addresses two key challenges of standard Transformer architectures, namely scaling input length and encoding structured inputs. To scale attention to longer inputs, we introduce a novel global-local attention mechanism between global tokens and regular input tokens. We also show that combining global-local attention with relative position encodings and a “Contrastive Predictive Coding” (CPC) pre-training objective allows ETC to encode structured inputs. We achieve state-of-the-art results on four natural language datasets requiring long and/or structured inputs.

2019

pdf bib abs
Naive Bayes and BiLSTM Ensemble for Discriminating between Mainland and Taiwan Variation of Mandarin Chinese
Li Yang | Yang Xiang
Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects

Automatic dialect identification is a more challengingctask than language identification, as it requires the ability to discriminate between varieties of one language. In this paper, we propose an ensemble based system, which combines traditional machine learning models trained on bag of n-gram fetures, with deep learning models trained on word embeddings, to solve the Discriminating between Mainland and Taiwan Variation of Mandarin Chinese (DMT) shared task at VarDial 2019. Our experiments show that a character bigram-trigram combination based Naive Bayes is a very strong model for identifying varieties of Mandarin Chinense. Through further ensemble of Navie Bayes and BiLSTM, our system (team: itsalexyang) achived an macro-averaged F1 score of 0.8530 and 0.8687 in two tracks.