2025
GL-GAN: Perceiving and Integrating Global and Local Styles for Handwritten Text Generation with Mamba
Yiming Wang | Hongxi Wei | Heng Wang | Shiwen Sun | Chao He
Proceedings of the 31st International Conference on Computational Linguistics
Handwritten text generation (HTG) aims to synthesize handwritten samples by imitating a specific writer, which has a wide range of applications and thus has significant research value. However, current studies on HTG are confronted with a main bottleneck: dominant models lack the ability to perceive and integrate handwriting styles, which affects the realism of the synthesized samples. In this paper, we propose GL-GAN, which effectively captures and integrates global and local styles. Specifically, we propose a Hybrid Style Encoder (HSE) that combines a state space model (SSM) and convolution to capture multilevel style features through various receptive fields. The captured style features are then fed to the proposed Dynamic Feature Enhancement Module (DFEM), which integrates these features by adaptively modeling the entangled relationships between multilevel styles and removing redundant details. Extensive experiments on two widely used handwriting datasets demonstrate that our GL-GAN is an effective HTG model and substantially outperforms state-of-the-art models. Our code is publicly available at: https://github.com/Fyzjym/GL-GAN.
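To make the two receptive fields concrete, here is a minimal PyTorch sketch of a hybrid encoder in the spirit of HSE: a toy linear state-space scan supplies the global branch (standing in for the Mamba block) and a 1-D convolution supplies the local branch, with a linear layer fusing the two. All module names and dimensions are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ToySSMBranch(nn.Module):
    """Simplified linear state-space scan: h_t = A h_{t-1} + B x_t, y_t = C h_t.
    A stand-in for the Mamba block; captures global, sequence-wide style cues."""
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A = nn.Parameter(torch.randn(state_dim, state_dim) * 0.01)
        self.B = nn.Linear(dim, state_dim, bias=False)
        self.C = nn.Linear(state_dim, dim, bias=False)

    def forward(self, x):            # x: (batch, seq_len, dim)
        h = x.new_zeros(x.size(0), self.A.size(0))
        outs = []
        for t in range(x.size(1)):   # sequential scan over the sequence
            h = h @ self.A.t() + self.B(x[:, t])
            outs.append(self.C(h))
        return torch.stack(outs, dim=1)

class HybridStyleEncoder(nn.Module):
    """Toy HSE: fuse a global SSM branch with a local convolutional branch."""
    def __init__(self, dim=64):
        super().__init__()
        self.global_branch = ToySSMBranch(dim)
        self.local_branch = nn.Conv1d(dim, dim, kernel_size=3, padding=1)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x):            # x: (batch, seq_len, dim) style tokens
        g = self.global_branch(x)
        l = self.local_branch(x.transpose(1, 2)).transpose(1, 2)
        return self.fuse(torch.cat([g, l], dim=-1))

style = HybridStyleEncoder()(torch.randn(2, 32, 64))  # -> (2, 32, 64)
```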
BannerAgency: Advertising Banner Design with Multimodal LLM Agents
Heng Wang | Yotaro Shimose | Shingo Takamatsu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Advertising banners are critical for capturing user attention and enhancing advertising campaign effectiveness. Creating aesthetically pleasing banner designs while conveying the campaign messages is challenging due to the large search space involving multiple design elements. Additionally, advertisers need multiple sizes for different displays and various versions to target different sectors of audiences. Since design is intrinsically an iterative and subjective process, flexible editability is also in high demand for practical usage. While current models have served as assistants to human designers in various design tasks, they typically handle only segments of the creative design process or produce pixel-based outputs that limit editability. This paper introduces a training-free framework for fully automated banner ad design creation, enabling frontier multimodal large language models (MLLMs) to streamline the production of effective banners with minimal manual effort across diverse marketing contexts. We present BannerAgency, an MLLM agent system that collaborates with advertisers to understand their brand identity and banner objectives, generates matching background images, creates blueprints for foreground design elements, and renders the final creatives as editable components in Figma or SVG formats rather than static pixels. To facilitate evaluation and future research, we introduce BannerRequest400, a benchmark featuring 100 unique logos paired with 400 diverse banner requests. Through quantitative and qualitative evaluations, we demonstrate the framework’s effectiveness, emphasizing the quality of the generated banner designs, their adaptability to various banner requests, and their strong editability enabled by this component-based approach.
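As a concrete illustration of component-based, editable output, the sketch below renders a hypothetical blueprint dictionary into an SVG whose text and logo elements remain individually addressable nodes. The schema (`background`, `elements`, and the field names) is our invention for illustration, not BannerAgency's actual format.

```python
# Hypothetical blueprint -> SVG renderer; element fields are illustrative.
from xml.sax.saxutils import escape

def render_banner_svg(blueprint, width=728, height=90):
    """Render foreground elements as individually editable SVG nodes."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    parts.append(f'  <image href="{escape(blueprint["background"])}" width="{width}" height="{height}"/>')
    for el in blueprint["elements"]:
        if el["type"] == "text":
            parts.append(
                f'  <text id="{el["id"]}" x="{el["x"]}" y="{el["y"]}" '
                f'font-size="{el["size"]}" fill="{el["color"]}">{escape(el["text"])}</text>'
            )
        elif el["type"] == "logo":
            parts.append(f'  <image id="{el["id"]}" href="{escape(el["href"])}" x="{el["x"]}" y="{el["y"]}"/>')
    parts.append('</svg>')
    return "\n".join(parts)

svg = render_banner_svg({
    "background": "bg.png",
    "elements": [
        {"type": "text", "id": "headline", "x": 40, "y": 50,
         "size": 24, "color": "#fff", "text": "Summer Sale"},
        {"type": "logo", "id": "brand", "x": 620, "y": 20, "href": "logo.svg"},
    ],
})
```

Because each element carries its own `id`, a downstream editor (or a Figma import) can reposition or restyle it without touching the rest of the design.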
Continuously Steering LLMs Sensitivity to Contextual Knowledge with Proxy Models
Yilin Wang | Heng Wang | Yuyang Bai | Minnan Luo
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
In Large Language Model (LLM) generation, knowledge conflicts arise: scenarios where parametric knowledge contradicts the knowledge provided in the context. Previous works studied tuning, decoding algorithms, or locating and editing context-aware neurons to adapt LLMs to be faithful to new contextual knowledge. However, these approaches are usually inefficient or ineffective for large models, do not work for black-box models, or cannot continuously adjust LLMs’ sensitivity to the knowledge provided in the context. To mitigate these problems, we propose CSKS (Continuously Steering Knowledge Sensitivity), a simple framework that can steer LLMs’ sensitivity to contextual knowledge continuously at a lightweight cost. Specifically, we tune two small LMs (i.e., proxy models) and use the difference in their output distributions to shift the original distribution of an LLM without modifying the LLM’s weights. In the evaluation process, we not only design synthetic data and fine-grained metrics to measure models’ sensitivity to contextual knowledge but also use a real conflict dataset to validate the practical efficacy of CSKS. Extensive experiments demonstrate that our framework achieves continuous and precise control over LLMs’ sensitivity to contextual knowledge, enabling both increased and reduced sensitivity, thereby allowing LLMs to flexibly prioritize either contextual or parametric knowledge as needed. Our data and code are available at https://github.com/OliveJuiceLin/CSKS.
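The core arithmetic of proxy-based steering can be sketched in a few lines: the large model’s next-token logits are shifted by the difference between a context-sensitive proxy and its untuned counterpart, scaled by a continuous coefficient. A minimal sketch under our own naming, not the authors’ released code:

```python
import torch

def proxy_steer(base_logits, tuned_proxy_logits, orig_proxy_logits, alpha=1.0):
    """Shift a large model's next-token distribution by the difference between
    a context-sensitive proxy LM and its untuned counterpart.
    alpha > 0 raises sensitivity to contextual knowledge; alpha < 0 lowers it.
    All three logit tensors share the same vocabulary."""
    steered = base_logits + alpha * (tuned_proxy_logits - orig_proxy_logits)
    return torch.softmax(steered, dim=-1)

vocab = 8  # toy vocabulary size for demonstration
probs = proxy_steer(torch.randn(vocab), torch.randn(vocab), torch.randn(vocab), alpha=0.5)
```

Varying `alpha` is what makes the control continuous: the base LLM’s weights are never touched, so the same pair of small proxies serves any larger model with a shared vocabulary.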
Mirror in the Model: Ad Banner Image Generation via Reflective Multi-LLM and Multi-modal Agents
Zhao Wang | Bowen Chen | Yotaro Shimose | Sota Moriyama | Heng Wang | Shingo Takamatsu
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track
Recent generative models such as GPT-4o have shown strong capabilities in producing high-quality images with accurate text rendering. However, commercial design tasks like advertising banners demand more than visual fidelity: they require structured layouts, precise typography, consistent branding, and more. In this paper, we introduce **MIMO (Mirror In-the-Model)**, an agentic refinement framework for automatic ad banner generation. MIMO combines a hierarchical multimodal agent system (MIMO-Core) with a coordination loop (MIMO-Loop) that explores multiple stylistic directions and iteratively improves design quality. Requiring only a simple natural-language prompt and a logo image as input, MIMO automatically detects and corrects multiple types of errors during generation. Experiments show that MIMO significantly outperforms existing diffusion- and LLM-based baselines in real-world banner design scenarios.
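The coordination loop can be pictured as a plain generate-critique-revise cycle. The skeleton below is our hypothetical rendering of such a loop, with `generate`, `critique`, and `revise` standing in for MLLM agent calls; it is not MIMO’s actual control flow.

```python
# Skeleton of a reflective refinement loop in the spirit of MIMO-Loop.
# generate/critique/revise are placeholder callables wrapping MLLM agents.
def reflective_refine(prompt, logo, generate, critique, revise, max_rounds=5):
    design = generate(prompt, logo)
    for _ in range(max_rounds):
        issues = critique(design, prompt)   # e.g. layout, typography, branding errors
        if not issues:                      # no detected errors: accept the design
            break
        design = revise(design, issues)     # targeted correction of each issue
    return design
```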
Multimodal Causal Reasoning Benchmark: Challenging Multimodal Large Language Models to Discern Causal Links Across Modalities
Zhiyuan Li | Heng Wang | Dongnan Liu | Chaoyi Zhang | Ao Ma | Jieting Long | Weidong Cai
Findings of the Association for Computational Linguistics: ACL 2025
Multimodal Large Language Models (MLLMs) have showcased exceptional Chain-of-Thought (CoT) reasoning ability in complex textual inference tasks including causal reasoning. However, do these causalities remain straightforward when crucial hints hide in visual details? If not, what factors might influence cross-modal generalization? And can we effectively enhance MLLMs’ capacity for robust causal inference across both text and vision? Motivated by these questions, we introduce **MuCR**, a novel **Mu**ltimodal **C**ausal **R**easoning benchmark that leverages synthetic siamese images and text pairs to challenge MLLMs. Additionally, we develop tailored metrics from multiple perspectives, including image-level matching, phrase-level understanding, and sentence-level explanation, to comprehensively assess MLLMs’ comprehension abilities. Our experiments reveal that current MLLMs fall short in multimodal causal reasoning compared to their performance in purely textual settings. We also find that identifying visual cues across images is key to effective cross-modal generalization. Finally, we propose the **VcCoT** strategy, which better highlights visual cues, and our results confirm its efficacy in enhancing multimodal causal reasoning.
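As an illustration of the idea behind VcCoT, a two-stage prompt might first ask the model to enumerate visual cues and only then reason over them. The wording below is our assumption, not the paper’s prompt:

```python
# Illustrative two-stage query in the spirit of VcCoT: surface visual cues
# first, then reason over them. The prompt text is ours, not the paper's.
VCCOT_PROMPT = """Step 1: List the salient visual cues in each image
(objects, actions, states), one per line.
Step 2: Using only the cues listed above, explain which image shows the
cause and which shows the effect, and why."""

def vccot_query(mllm, images, prompt=VCCOT_PROMPT):
    # mllm is any multimodal chat callable taking images plus text
    return mllm(images=images, text=prompt)
```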
2024
DELL: Generating Reactions and Explanations for LLM-Based Misinformation Detection
Herun Wan | Shangbin Feng | Zhaoxuan Tan | Heng Wang | Yulia Tsvetkov | Minnan Luo
Findings of the Association for Computational Linguistics: ACL 2024
Large language models are limited by challenges in factuality and hallucination, and thus cannot be directly employed off-the-shelf to judge the veracity of news articles, where factual accuracy is paramount. In this work, we propose DELL, which identifies three key stages in misinformation detection where LLMs could be incorporated into the pipeline: 1) LLMs could generate news reactions to represent diverse perspectives and simulate user-news interaction networks; 2) LLMs could generate explanations for proxy tasks (e.g., sentiment, stance) to enrich the contexts of news articles and produce experts specializing in various aspects of news understanding; 3) LLMs could merge task-specific experts and provide an overall prediction by incorporating the predictions and confidence scores of the various experts. Extensive experiments on seven datasets with three LLMs demonstrate that DELL outperforms state-of-the-art baselines by up to 16.8% in macro F1-score. Further analysis reveals that the generated reactions and explanations greatly help in misinformation detection, while our proposed LLM-guided expert merging helps produce better-calibrated predictions.
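Stage 3 can be approximated with a simple confidence-weighted vote; the paper’s merging is LLM-guided, so the snippet below is only a toy stand-in to make the inputs and output concrete.

```python
# Toy stand-in for confidence-aware expert merging (stage 3 of DELL):
# each expert votes for a label with a confidence score. The weighting
# scheme here is ours; the paper merges experts with an LLM.
from collections import defaultdict

def merge_experts(expert_outputs):
    """expert_outputs: list of (label, confidence) pairs, one per expert."""
    scores = defaultdict(float)
    for label, conf in expert_outputs:
        scores[label] += conf              # confidence-weighted voting
    return max(scores, key=scores.get)

pred = merge_experts([("misinfo", 0.9), ("real", 0.6), ("misinfo", 0.7)])  # -> "misinfo"
```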
Can LLM Graph Reasoning Generalize beyond Pattern Memorization?
Yizhuo Zhang | Heng Wang | Shangbin Feng | Zhaoxuan Tan | Xiaochuang Han | Tianxing He | Yulia Tsvetkov
Findings of the Association for Computational Linguistics: EMNLP 2024
Large language models (LLMs) demonstrate great potential for problems with implicit graphical structures, and recent works seek to enhance the graph reasoning capabilities of LLMs through specialized instruction tuning. However, the resulting “graph LLMs” are evaluated in in-distribution settings only, so it remains underexplored whether LLMs are learning generalizable graph reasoning skills or merely memorizing patterns in the synthetic training data. To this end, we propose the NLGift benchmark, an evaluation suite for LLM graph reasoning generalization: whether LLMs can go beyond semantic, numeric, structural, and reasoning patterns in the synthetic training data and improve utility on real-world graph-based tasks. Extensive experiments with two LLMs across four graph reasoning tasks demonstrate that while generalization on simple patterns (semantic, numeric) is somewhat satisfactory, LLMs struggle to generalize across reasoning and real-world patterns, casting doubt on the benefit of synthetic graph tuning for real-world tasks with underlying network structures. We explore three strategies to improve LLM graph reasoning generalization and find that while post-training alignment is most promising for real-world tasks, empowering LLM graph reasoning to go beyond pattern memorization remains an open research question.
2023
Detecting Spoilers in Movie Reviews with External Movie Knowledge and User Networks
Heng Wang | Wenqian Zhang | Yuyang Bai | Zhaoxuan Tan | Shangbin Feng | Qinghua Zheng | Minnan Luo
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Online movie review platforms provide crowdsourced feedback for the film industry and the general public, while spoiler reviews greatly compromise user experience. Although preliminary research efforts have been made to automatically identify spoilers, they focus merely on the review content itself, whereas robust spoiler detection requires putting the review into the context of facts and knowledge regarding movies, user behavior on film review platforms, and more. In light of these challenges, we first curate a large-scale network-based spoiler detection dataset, LCS, and a comprehensive and up-to-date movie knowledge base, UKM. We then propose MVSD, a novel spoiler detection model that takes into account external knowledge about movies and user activities on movie review platforms. Specifically, MVSD constructs three interconnecting heterogeneous information networks to model diverse data sources and their multi-view attributes, and we design a novel heterogeneous graph neural network architecture that performs spoiler detection as node-level classification. Extensive experiments demonstrate that MVSD advances the state-of-the-art on two spoiler detection datasets, while the introduction of external knowledge and user interactions helps ground robust spoiler detection.
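For intuition about node-level classification over typed edges, here is a minimal relation-typed message-passing layer in the R-GCN style. It stands in for, and is far simpler than, MVSD’s actual heterogeneous GNN; node counts, dimensions, and edge triples are illustrative.

```python
import torch
import torch.nn as nn

class ToyHeteroLayer(nn.Module):
    """Minimal relation-typed message passing: each edge type gets its own
    linear transform, plus a self-loop transform. A toy stand-in for MVSD."""
    def __init__(self, dim, num_relations):
        super().__init__()
        self.rel_weights = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_relations))
        self.self_loop = nn.Linear(dim, dim)

    def forward(self, x, edges):        # edges: list of (src, dst, rel) index triples
        out = self.self_loop(x)
        for src, dst, rel in edges:     # aggregate typed messages from neighbors
            out[dst] = out[dst] + self.rel_weights[rel](x[src])
        return torch.relu(out)

x = torch.randn(5, 16)                  # 5 nodes: e.g. reviews, users, movies
layer = ToyHeteroLayer(16, num_relations=3)
h = layer(x, [(0, 1, 0), (2, 1, 1), (3, 4, 2)])  # features for node-level classification
```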
2019
Incorporating Graph Attention Mechanism into Knowledge Graph Reasoning Based on Deep Reinforcement Learning
Heng Wang | Shuangyin Li | Rong Pan | Mingzhi Mao
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
Knowledge Graph (KG) reasoning aims at finding reasoning paths for relations in order to address the incompleteness of KGs. Many previous path-based methods, such as PRA and DeepPath, either lack memory components or get stuck during training, so their performance relies heavily on careful pretraining. In this paper, we present a deep reinforcement learning based model named AttnPath, which incorporates an LSTM and a Graph Attention Mechanism as memory components. We define two metrics, Mean Selection Rate (MSR) and Mean Replacement Rate (MRR), to quantitatively measure how difficult it is to learn the query relations, and take advantage of them to fine-tune the model under the reinforcement learning framework. Meanwhile, we propose a novel reinforcement learning mechanism that forces the agent to walk forward at every step, preventing it from stalling at the same entity node. With this design, the proposed model not only dispenses with the pretraining process but also achieves state-of-the-art performance compared with other models. We test our model on the FB15K-237 and NELL-995 datasets with different tasks. Extensive experiments show that our model is effective, competitive with many current state-of-the-art methods, and performs well in practice.
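The forced forward-walk can be illustrated as simple action masking: zero out any action that would leave the agent at its current entity, then renormalize. A sketch under our own naming, not the released implementation:

```python
import numpy as np

def mask_stay_actions(action_probs, neighbor_entities, current_entity):
    """Force the agent to move forward: zero out actions that would keep it
    at the same entity, then renormalize the policy over remaining actions."""
    probs = np.array(action_probs, dtype=float)
    for i, ent in enumerate(neighbor_entities):
        if ent == current_entity:      # a self-loop would stall the walk here
            probs[i] = 0.0
    total = probs.sum()
    return probs / total if total > 0 else probs

p = mask_stay_actions([0.5, 0.3, 0.2], ["e1", "e2", "e1"], "e1")  # -> [0., 1., 0.]
```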