Peinan Zhang


2024

Generating Diverse and High-Quality Texts by Minimum Bayes Risk Decoding
Yuu Jinnai | Ukyo Honda | Tetsuro Morimura | Peinan Zhang
Findings of the Association for Computational Linguistics: ACL 2024

One of the most important challenges in text generation systems is to produce outputs that are not only correct but also diverse. Recently, Minimum Bayes-Risk (MBR) decoding has gained prominence for generating sentences of the highest quality among decoding algorithms. However, existing algorithms for generating diverse outputs are predominantly based on beam search or random sampling, so their output quality is capped by the underlying decoding algorithm. In this paper, we investigate an alternative approach: we develop diversity-promoting decoding algorithms by imposing diversity objectives on MBR decoding. We propose two variants of MBR: (i) Diverse MBR (DMBR), which adds a diversity penalty to the decoding objective, and (ii) k-medoids MBR (KMBR), which reformulates the decoding task as a clustering problem. We evaluate DMBR and KMBR on a variety of directed text generation tasks using encoder-decoder models and a language model with prompting. The experimental results show that the proposed methods achieve a better quality-diversity trade-off than diverse beam search and sampling algorithms overall.
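
The abstract describes DMBR and KMBR only at a high level. As a rough illustration, MBR selection with a pairwise-similarity diversity penalty in the spirit of DMBR might look like the sketch below; the utility function, penalty weight, and greedy subset selection are assumptions for illustration, not the authors' implementation.

```python
# A minimal sketch of diversity-promoting MBR selection (illustrative, not the paper's code).
# Assumptions: `candidates` are sampled hypotheses, `utility(h, r)` scores hypothesis h
# against pseudo-reference r (e.g., a sentence-similarity metric), `k` outputs are
# returned, and `lam` weighs the diversity penalty.

def diverse_mbr(candidates, utility, k=4, lam=0.5):
    n = len(candidates)
    # Standard MBR term: expected utility of each candidate against all other candidates.
    mbr_score = [
        sum(utility(candidates[i], candidates[j]) for j in range(n) if j != i) / (n - 1)
        for i in range(n)
    ]
    selected = []
    while len(selected) < k:
        best, best_score = None, float("-inf")
        for i in range(n):
            if i in selected:
                continue
            # Diversity penalty: similarity to the outputs already selected.
            penalty = sum(utility(candidates[i], candidates[j]) for j in selected)
            score = mbr_score[i] - lam * penalty
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return [candidates[i] for i in selected]


if __name__ == "__main__":
    # Toy usage with a character-overlap utility (illustrative only).
    def overlap(a, b):
        return len(set(a) & set(b)) / max(len(set(a) | set(b)), 1)

    samples = ["buy cheap shoes online", "cheap shoes online store",
               "discount footwear sale", "buy shoes online today"]
    print(diverse_mbr(samples, overlap, k=2, lam=0.7))
```

Replacing the greedy loop with k-medoids clustering over the candidate set and returning the medoids would correspond, roughly, to the KMBR variant described above.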

CAMERA³: An Evaluation Dataset for Controllable Ad Text Generation in Japanese
Go Inoue | Akihiko Kato | Masato Mita | Ukyo Honda | Peinan Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Ad text generation is the task of creating compelling text from an advertising asset that describes products or services, such as a landing page. In advertising, diversity plays an important role in enhancing the effectiveness of an ad text, mitigating a phenomenon called “ad fatigue,” where users become disengaged due to repetitive exposure to the same advertisement. Despite numerous efforts in ad text generation, the aspect of diversifying ad texts has received limited attention, particularly in non-English languages like Japanese. To address this, we present CAMERA³, an evaluation dataset for controllable text generation in the advertising domain in Japanese. Our dataset includes 3,980 ad texts written by expert annotators, taking into account various aspects of ad appeals. We make CAMERA³ publicly available, allowing researchers to examine the capabilities of recent NLG models in controllable text generation in a real-world scenario.

Cross-lingual Transfer or Machine Translation? On Data Augmentation for Monolingual Semantic Textual Similarity
Sho Hoshino | Akihiko Kato | Soichiro Murakami | Peinan Zhang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Learning better sentence embeddings leads to improved performance on natural language understanding tasks, including semantic textual similarity (STS) and natural language inference (NLI). Because prior studies leverage large-scale labeled NLI datasets to fine-tune masked language models into sentence encoders, task performance for languages other than English is often left behind. In this study, we directly compare two data augmentation techniques as potential solutions for monolingual STS: (a) cross-lingual transfer, which exploits English resources alone as training data and yields non-English sentence embeddings as zero-shot inference, and (b) machine translation, which converts English data into pseudo non-English training data in advance. In our experiments on monolingual STS in Japanese and Korean, we find that the two techniques perform on par. In addition, we find that the Wikipedia domain is superior to the NLI domain as unlabeled training data for these languages. Combining these findings, we further demonstrate that cross-lingual transfer with Wikipedia data exhibits improved performance.
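
For illustration only, the two augmentation strategies contrasted above can be sketched as follows; translate_to and fine_tune_encoder are hypothetical placeholders rather than components from the paper.

```python
# Schematic contrast of the two data augmentation strategies (illustrative only).
# All functions below are hypothetical stubs, not the authors' implementation.

def translate_to(text: str, target_lang: str) -> str:
    """Stand-in for a machine translation system."""
    return text  # identity stub

def fine_tune_encoder(encoder, labeled_pairs):
    """Stand-in for fine-tuning a (multilingual) masked LM into a sentence encoder."""
    return encoder  # stub: returns the encoder unchanged

def cross_lingual_transfer(encoder, english_nli):
    # (a) Train on English NLI pairs only; apply to Japanese/Korean STS as
    #     zero-shot inference, relying on the encoder's multilingual pretraining.
    return fine_tune_encoder(encoder, english_nli)

def mt_augmentation(encoder, english_nli, target_lang="ja"):
    # (b) Convert the English pairs into pseudo target-language data first,
    #     then fine-tune on the translated data.
    pseudo = [(translate_to(premise, target_lang), translate_to(hypothesis, target_lang), label)
              for premise, hypothesis, label in english_nli]
    return fine_tune_encoder(encoder, pseudo)
```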

Striking Gold in Advertising: Standardization and Exploration of Ad Text Generation
Masato Mita | Soichiro Murakami | Akihiko Kato | Peinan Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

In response to the limitations of manual ad creation, significant research has been conducted in the field of automatic ad text generation (ATG). However, the lack of comprehensive benchmarks and well-defined problem sets has made it difficult to compare different methods. To tackle these challenges, we standardize the task of ATG and propose a first benchmark dataset, CAMERA, carefully designed to enable the use of multi-modal information and to facilitate industry-wise evaluations. Our extensive experiments with nine baselines, ranging from classical methods to state-of-the-art models including large language models (LLMs), show the current state of the task and the remaining challenges. We also explore how existing ATG metrics and an LLM-based evaluator align with human evaluations.

2022

Aspect-based Analysis of Advertising Appeals for Search Engine Advertising
Soichiro Murakami | Peinan Zhang | Sho Hoshino | Hidetaka Kamigaito | Hiroya Takamura | Manabu Okumura
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Track

Writing an ad text that attracts people and persuades them to click or act is essential for the success of search engine advertising. Therefore, ad creators must consider various aspects of advertising appeals (A3), such as price, product features, and quality. However, the effective A3 for products and services differ across industries. In this work, we focus on exploring the effective A3 for different industries with the aim of assisting the ad creation process. To this end, we created a dataset of advertising appeals and used an existing model that detects various aspects in ad texts. Our experiments demonstrated that different industries have their own effective A3 and that identifying these A3 contributes to estimating advertising performance.

2021

FAST: Fast Annotation tool for SmarT devices
Shunyo Kawamoto | Yu Sawai | Kohei Wakimoto | Peinan Zhang
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations

Working with a wide range of annotators who have the same attributes as real-world users is crucial for practical applications. Although such applications often use crowd-sourcing to gather a variety of annotators, most real-world users work on mobile devices. In this paper, we propose “FAST,” an annotation tool for application tasks that focuses on the user experience of mobile devices, an aspect that has received little attention so far. We designed FAST as a web application that runs on any device, with a flexible interface that can be customized to fit various tasks. In our experiments, we conducted crowd-sourced annotation for a sentiment analysis task with several annotators and evaluated annotation speed, quality, and ease of use based on the tool’s logs and user surveys. Based on the results, we conclude that our system enables faster annotation than existing methods while maintaining annotation quality.

An Empirical Study of Generating Texts for Search Engine Advertising
Hidetaka Kamigaito | Peinan Zhang | Hiroya Takamura | Manabu Okumura
Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies: Industry Papers

Although there are many studies on neural language generation (NLG), few have been put into practice in the real world, especially in the advertising domain. Generating ads with NLG models can help copywriters in their creation process. However, few studies have adequately evaluated the effect of generated ads when they are actually served, because such an evaluation requires a large amount of training data and a particular serving environment. In this paper, we demonstrate a practical use case of generating ad text with an NLG model. Specifically, we show how to improve the ads’ impact, deploy models to a product, and evaluate the generated ads.

2015

Japanese Sentiment Classification with Stacked Denoising Auto-Encoder using Distributed Word Representation
Peinan Zhang | Mamoru Komachi
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation