Junyi Li

2023

pdf abs
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models
Junyi Li | Xiaoxue Cheng | Xin Zhao | Jian-Yun Nie | Ji-Rong Wen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Large language models (LLMs), such as ChatGPT, are prone to generate hallucinations, i.e., content that conflicts with the source or cannot be verified by the factual knowledge. To understand what types of content and to which extent LLMs are apt to hallucinate, we introduce the Hallucination Evaluation for Large Language Models (HaluEval) benchmark, a large collection of generated and human-annotated hallucinated samples for evaluating the performance of LLMs in recognizing hallucination. To generate these samples, we propose a ChatGPT-based two-step framework, i.e., sampling-then-filtering. Besides, we also hire some human labelers to annotate the hallucinations in ChatGPT responses. The empirical results suggest that ChatGPT is likely to generate hallucinated content in specific topics by fabricating unverifiable information (i.e., about 19.5% user queries). Moreover, existing LLMs face great challenges in recognizing the hallucinations in texts. While, our experiments also prove that the hallucination recognition can be improved by providing external knowledge or adding reasoning steps.

pdf abs
Learning to Imagine: Visually-Augmented Natural Language Generation
Tianyi Tang | Yushuo Chen | Yifan Du | Junyi Li | Wayne Xin Zhao | Ji-Rong Wen
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

People often imagine relevant scenes to aid in the writing process. In this work, we aim to utilize visual information for composition in the same manner as humans. We propose a method, LIVE, that makes pre-trained language models (PLMs) Learn to Imagine for Visually-augmented natural language gEneration. First, we imagine the scene based on the text: we use a diffusion model to synthesize high-quality images conditioned on the input texts. Second, we use CLIP to determine whether the text can evoke the imagination in a posterior way. Finally, our imagination is dynamic, and we conduct synthesis for each sentence rather than generate only one image for an entire paragraph. Technically, we propose a novel plug-and-play fusion layer to obtain visually-augmented representations for each text. Our vision-text fusion layer is compatible with Transformer-based architecture. We have conducted extensive experiments on four generation tasks using BART and T5, and the automatic results and human evaluation demonstrate the effectiveness of our proposed method. We will release the code, model, and data at the link: https://github.com/RUCAIBox/LIVE.

Pretrained language models (PLMs) encode a large amount of world knowledge. However, as such knowledge is frozen at the time of model training, the models become static and limited by the training data at that time. In order to further improve the capacity of PLMs for knowledge-intensive tasks, we consider augmenting PLMs with the large-scale web using search engine. Unlike previous augmentation sources (e.g., Wikipedia data dump), the web provides broader, more comprehensive and constantly updated information. In this paper, we present a web-augmented PLM – UniWeb, which is trained over 16 knowledge-intensive tasks in a unified text-to-text format. Instead of simply using the retrieved contents from web, our approach has made two major improvements. Firstly, we propose an adaptive search engine assisted learning method that can self-evaluate the confidence level of PLM’s predictions, and adaptively determine when to refer to the web for more data, which can avoid useless or noisy augmentation from web. Secondly, we design a pretraining task, i.e., continual knowledge learning, based on salient spans prediction, to reduce the discrepancy between the encoded and retrieved knowledge. Experiments on a wide range of knowledge-intensive tasks show that our model significantly outperforms previous retrieval-augmented methods.

pdf abs
MVP: Multi-task Supervised Pre-training for Natural Language Generation
Tianyi Tang | Junyi Li | Wayne Xin Zhao | Ji-Rong Wen
Findings of the Association for Computational Linguistics: ACL 2023

Pre-trained language models (PLMs) have achieved remarkable success in natural language generation (NLG) tasks. Up to now, most NLG-oriented PLMs are pre-trained in an unsupervised manner using the large-scale general corpus. In the meanwhile, an increasing number of models pre-trained with labeled data (i.e. “supervised pre-training”) showcase superior performance compared to unsupervised pre-trained models. Motivated by the success of supervised pre-training, we propose Multi-task superVised Pre-training (MVP) for natural language generation. We collect a large-scale natural language generation corpus, MVPCorpus, from 77 datasets over 11 diverse NLG tasks. Then we unify these examples into a general text-to-text format to pre-train the text generation model MVP in a supervised manner. For each task, we further pre-train specific soft prompts to stimulate the model’s capacity to perform a specific task. Our MVP model can be seen as a practice that utilizes recent instruction tuning on relatively small PLMs. Extensive experiments have demonstrated the effectiveness and generality of our MVP model in a number of NLG tasks, which achieves state-of-the-art performance on 13 out of 17 datasets, outperforming BART by 9.3% and Flan-T5 by 5.8%.

pdf abs
Zero-shot Visual Question Answering with Language Model Feedback
Yifan Du | Junyi Li | Tianyi Tang | Wayne Xin Zhao | Ji-Rong Wen
Findings of the Association for Computational Linguistics: ACL 2023

In this paper, we propose a novel language model guided captioning approach, LAMOC, for knowledge-based visual question answering (VQA). Our approach employs the generated captions by a captioning model as the context of an answer prediction model, which is a Pre-Trained Language model (PLM). As the major contribution, we leverage the guidance and feedback of the prediction model to improve the capability of the captioning model. In this way, the captioning model can become aware of the task goal and information need from the PLM. To develop our approach, we design two specific training stages, where the first stage adapts the captioning model to the prediction model (selecting more suitable caption propositions for training) and the second stage tunes the captioning model according to the task goal (learning from feedback of the PLM). Extensive experiments demonstrate the effectiveness of the proposed approach on the knowledge-based VQA task. Specifically, on the challenging A-OKVQA dataset, LAMOC outperforms several competitive zero-shot methods and even achieves comparable results to a fine-tuned VLP model. Our code is publicly available at https://github.com/RUCAIBox/LAMOC.

2022

pdf abs
ELMER: A Non-Autoregressive Pre-trained Language Model for Efficient and Effective Text Generation
Junyi Li | Tianyi Tang | Wayne Xin Zhao | Jian-Yun Nie | Ji-Rong Wen
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

We study the text generation task under the approach of pre-trained language models (PLMs). Typically, an auto-regressive (AR) method is adopted for generating texts in a token-by-token manner. Despite many advantages of AR generation, it usually suffers from inefficient inference. Therefore, non-autoregressive (NAR) models are proposed to generate all target tokens simultaneously. However, NAR models usually generate texts of lower quality due to the absence of token dependency in the output text. In this paper, we propose ELMER: an efficient and effective PLM for NAR text generation to explicitly model the token dependency during NAR generation. By leveraging the early exit technique, ELMER enables the token generations at different layers, according to their prediction confidence (a more confident token will exit at a lower layer). Besides, we propose a novel pre-training objective, Layer Permutation Language Modeling, to pre-train ELMER by permuting the exit layer for each token in sequences. Experiments on three text generation tasks show that ELMER significantly outperforms NAR models and further narrows the performance gap with AR PLMs (ELMER (29.92) vs BART (30.61) ROUGE-L in XSUM) while achieving over 10 times inference speedup.

To facilitate research on text generation, this paper presents a comprehensive and unified library, TextBox 2.0, focusing on the use of pre-trained language models (PLMs). To be comprehensive, our library covers 13 common text generation tasks and their corresponding 83 datasets and further incorporates 45 PLMs covering general, translation, Chinese, dialogue, controllable, distilled, prompting, and lightweight PLMs. We also implement 4 efficient training strategies and provide 4 generation objectives for pre-training new PLMs from scratch. To be unified, we design the interfaces to support the entire research pipeline (from data loading to training and evaluation), ensuring that each step can be fulfilled in a unified way. Despite the rich functionality, it is easy to use our library, either through the friendly Python API or command line. To validate the effectiveness of our library, we conduct extensive experiments and exemplify four types of research scenarios. The project is released at the link: https://github.com/RUCAIBox/TextBox#2.0.

pdf abs
Context-Tuning: Learning Contextualized Prompts for Natural Language Generation
Tianyi Tang | Junyi Li | Wayne Xin Zhao | Ji-Rong Wen
Proceedings of the 29th International Conference on Computational Linguistics

Recently, pretrained language models (PLMs) have had exceptional success in language generation. To leverage the rich knowledge encoded by PLMs, a simple yet powerful paradigm is to use prompts in the form of either discrete tokens or continuous embeddings. In existing studies, these prompting methods are typically independent of the inputs, lacking sufficient consideration of input semantics. To address this issue, we propose a novel continuous prompting approach, called context-tuning, to fine-tuning PLMs for natural language generation. Firstly, the prompts are derived based on the input text to elicit useful knowledge from PLMs for generation. We refer to such prompts as contextualized prompts. Secondly, we use continuous inverse prompting to improve the process of natural language generation by modeling an inverse generation process from output to input, making the generated text more relevant to the inputs. Furthermore, we utilize a lightweight context-tuning method that fine-tunes only 0.12% of the parameters while maintaining good performance. Our code is publicly available at https://github.com/RUCAIBox/Context-Tuning.

pdf abs
Learning to Transfer Prompts for Text Generation
Junyi Li | Tianyi Tang | Jian-Yun Nie | Ji-Rong Wen | Xin Zhao
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies

Pretrained language models (PLMs) have made remarkable progress in text generation tasks via fine-tuning. While, it is challenging to fine-tune PLMs in a data-scarce situation. Therefore, it is non-trivial to develop a general and lightweight model that can adapt to various text generation tasks based on PLMs. To fulfill this purpose, the recent prompt-based learning offers a potential solution. In this paper, we improve this technique and propose a novel prompt-based method (PTG) for text generation in a transferable setting. First, PTG learns a set of source prompts for various source generation tasks and then transfers these prompts as target prompts to perform target generation tasks. To consider both task- and instance-level information, we design an adaptive attention mechanism to derive the target prompts. For each data instance, PTG learns a specific target prompt by attending to highly relevant source prompts. In extensive experiments, PTG yields competitive or better results than fine-tuning methods. We release our source prompts as an open resource, where users can add or reuse them to improve new text generation tasks for future research. Code and data can be available at https://github.com/RUCAIBox/Transfer-Prompts-for-Text-Generation.

Nowadays, pretrained language models (PLMs) have dominated the majority of NLP tasks. While, little research has been conducted on systematically evaluating the language abilities of PLMs. In this paper, we present a large-scale empirical study on general language ability evaluation of PLMs (ElitePLM). In our study, we design four evaluation dimensions, memory, comprehension, reasoning, and composition, to measure ten widely-used PLMs within five categories. Our empirical results demonstrate that: (1) PLMs with varying training objectives and strategies are good at different ability tests; (2) fine-tuning PLMs in downstream tasks is usually sensitive to the data size and distribution; (3) PLMs have excellent transferability between similar tasks. Moreover, the prediction results of PLMs in our experiments are released as an open resource for more deep and detailed analysis on the language abilities of PLMs. This paper can guide the future work to select, apply, and design PLMs for specific tasks. We have made all the details of experiments publicly available at https://github.com/RUCAIBox/ElitePLM.

2021

In this paper, we release an open-source library, called TextBox, to provide a unified, modularized, and extensible text generation framework. TextBox aims to support a broad set of text generation tasks and models. In our library, we implement 21 text generation models on 9 benchmark datasets, covering the categories of VAE, GAN, and pretrained language models. Meanwhile, our library maintains sufficient modularity and extensibility by properly decomposing the model architecture, inference, and learning process into highly reusable modules, which allows users to easily incorporate new models into our framework. The above features make TextBox especially suitable for researchers and practitioners to quickly reproduce baseline models and develop new models. TextBox is implemented based on PyTorch, and released under Apache License 2.0 at the link https://github.com/RUCAIBox/TextBox.

2020

pdf abs
Lijunyi at SemEval-2020 Task 4: An ALBERT Model Based Maximum Ensemble with Different Training Sizes and Depths for Commonsense Validation and Explanation
Junyi Li | Bin Wang | Haiyan Ding
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This article describes the system submitted to SemEval 2020 Task 4: Commonsense Validation and Explanation. We only participated in the subtask A, which is mainly to distinguish whether the sentence has meaning. To solve this task, we mainly used ALBERT model-based maximum ensemble with different training sizes and depths. To prove the validity of the model to the task, we also used some other neural network models for comparison. Our model achieved the accuracy score of 0.938(ranked 10/41) in subtask A.

pdf abs
Lee at SemEval-2020 Task 5: ALBERT Model Based on the Maximum Ensemble Strategy and Different Data Sampling Methods for Detecting Counterfactual Statements
Junyi Li | Yuhang Wu | Bin Wang | Haiyan Ding
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This article describes the system submitted to SemEval 2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. In this task, we only participate in the subtask A which is detecting counterfactual statements. In order to solve this sub-task, first of all, because of the problem of data balance, we use the undersampling and oversampling methods to process the data set. Second, we used the ALBERT model and the maximum ensemble method based on the ALBERT model. Our methods achieved a F1 score of 0.85 in subtask A.

pdf abs
Lee at SemEval-2020 Task 12: A BERT Model Based on the Maximum Self-ensemble Strategy for Identifying Offensive Language
Junyi Li | Xiaobing Zhou | Zichen Zhang
Proceedings of the Fourteenth Workshop on Semantic Evaluation

This article describes the system submitted to SemEval 2020 Task 12: OffensEval 2020. This task aims to identify and classify offensive languages in different languages on social media. We only participate in the English part of subtask A, which aims to identify offensive languages in English. To solve this task, we propose a BERT model system based on the transform mechanism, and use the maximum self-ensemble to improve model performance. Our model achieved a macro F1 score of 0.913(ranked 13/82) in subtask A.

The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue, we introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text. To establish results on these tasks, we report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models (9 in total). We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on Chinese NLU. Our benchmark is released at https://www.cluebenchmarks.com

2019

pdf abs
Lijunyi at SemEval-2019 Task 9: An attention-based LSTM and ensemble of different models for suggestion mining from online reviews and forums
Junyi Li
Proceedings of the 13th International Workshop on Semantic Evaluation

In this paper, we describe a suggestion mining system that participated in SemEval 2019 Task 9, SubTask A - Suggestion Mining from Online Reviews and Forums. Given some suggestions from online reviews and forums that can be classified into suggestion and non-suggestion classes. In this task, we combine the attention mechanism with the LSTM model, which is the final system we submitted. The final submission achieves 14th place in Task 9, SubTask A with the accuracy of 0.6776. After the challenge, we train a series of neural network models such as convolutional neural network(CNN), TextCNN, long short-term memory(LSTM) and C-LSTM. Finally, we make an ensemble on the predictions of these models and get a better result.

pdf abs
YNU-junyi in BioNLP-OST 2019: Using CNN-LSTM Model with Embeddings for SeeDev Binary Event Extraction
Junyi Li | Xiaobing Zhou | Yuhang Wu | Bin Wang
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks

We participated in the BioNLP 2019 Open Shared Tasks: binary relation extraction of SeeDev task. The model was constructed us- ing convolutional neural networks (CNN) and long short term memory networks (LSTM). The full text information and context information were collected using the advantages of CNN and LSTM. The model consisted of two main modules: distributed semantic representation construction, such as word embedding, distance embedding and entity type embed- ding; and CNN-LSTM model. The F1 value of our participated task on the test data set of all types was 0.342. We achieved the second highest in the task. The results showed that our proposed method performed effectively in the binary relation extraction.

pdf abs
Generating Long and Informative Reviews with Aspect-Aware Coarse-to-Fine Decoding
Junyi Li | Wayne Xin Zhao | Ji-Rong Wen | Yang Song
Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics

Generating long and informative review text is a challenging natural language generation task. Previous work focuses on word-level generation, neglecting the importance of topical and syntactic characteristics from natural languages. In this paper, we propose a novel review generation model by characterizing an elaborately designed aspect-aware coarse-to-fine generation process. First, we model the aspect transitions to capture the overall content flow. Then, to generate a sentence, an aspect-aware sketch will be predicted using an aspect-aware decoder. Finally, another decoder fills in the semantic slots by generating corresponding words. Our approach is able to jointly utilize aspect semantics, syntactic sketch, and context information. Extensive experiments results have demonstrated the effectiveness of the proposed model.