Peng Shi


Better Language Model with Hypernym Class Prediction
He Bai | Tong Wang | Alessandro Sordoni | Peng Shi
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

Class-based language models (LMs) have been long devised to address context sparsity in n-gram LMs. In this study, we revisit this approach in the context of neural LMs. We hypothesize that class-based prediction leads to an implicit context aggregation for similar words and thus can improve generalization for rare words. We map words that have a common WordNet hypernym to the same class and train large neural LMs by gradually annealing from predicting the class to token prediction during training. Empirically, this curriculum learning strategy consistently improves perplexity over various large, highly-performant state-of-the-art Transformer-based models on two datasets, WikiText-103 and ARXIV. Our analysis shows that the performance improvement is achieved without sacrificing performance on rare words. Finally, we document other attempts that failed to yield empirical gains, and discuss future directions for the adoption of class-based LMs on a larger scale.

XRICL: Cross-lingual Retrieval-Augmented In-Context Learning for Cross-lingual Text-to-SQL Semantic Parsing
Peng Shi | Rui Zhang | He Bai | Jimmy Lin
Findings of the Association for Computational Linguistics: EMNLP 2022

In-context learning using large language models has recently shown surprising results for semantic parsing tasks such as Text-to-SQL translation.Prompting GPT-3 or Codex using several examples of question-SQL pairs can produce excellent results, comparable to state-of-the-art finetuning-based models.However, existing work primarily focuses on English datasets, and it is unknown whether large language models can serve as competitive semantic parsers for other languages.To bridge this gap, our work focuses on cross-lingual Text-to-SQL semantic parsing for translating non-English utterances into SQL queries based on an English schema.We consider a zero-shot transfer learning setting with the assumption that we do not have any labeled examples in the target language (but have annotated examples in English).This work introduces the XRICL framework, which learns to retrieve relevant English exemplars for a given query to construct prompts.We also include global translation exemplars for a target language to facilitate the translation process for large language models.To systematically evaluate our model, we construct two new benchmark datasets, XSpider and XKaggle-dbqa, which include questions in Chinese, Vietnamese, Farsi, and Hindi.Our experiments show that XRICL effectively leverages large pre-trained language models to outperform existing baselines.Data and code are publicly available at

Cross-lingual Text-to-SQL Semantic Parsing with Representation Mixup
Peng Shi | Linfeng Song | Lifeng Jin | Haitao Mi | He Bai | Jimmy Lin | Dong Yu
Findings of the Association for Computational Linguistics: EMNLP 2022

We focus on the cross-lingual Text-to-SQL semantic parsing task,where the parsers are expected to generate SQL for non-English utterances based on English database schemas.Intuitively, English translation as side information is an effective way to bridge the language gap,but noise introduced by the translation system may affect parser effectiveness.In this work, we propose a Representation Mixup Framework (Rex) for effectively exploiting translations in the cross-lingual Text-to-SQL task.Particularly, it uses a general encoding layer, a transition layer, and a target-centric layer to properly guide the information flow of the English translation.Experimental results on CSpider and VSpider show that our framework can benefit from cross-lingual training and improve the effectiveness of semantic parsers, achieving state-of-the-art performance.

UnifiedSKG: Unifying and Multi-Tasking Structured Knowledge Grounding with Text-to-Text Language Models
Tianbao Xie | Chen Henry Wu | Peng Shi | Ruiqi Zhong | Torsten Scholak | Michihiro Yasunaga | Chien-Sheng Wu | Ming Zhong | Pengcheng Yin | Sida I. Wang | Victor Zhong | Bailin Wang | Chengzu Li | Connor Boyle | Ansong Ni | Ziyu Yao | Dragomir Radev | Caiming Xiong | Lingpeng Kong | Rui Zhang | Noah A. Smith | Luke Zettlemoyer | Tao Yu
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Structured knowledge grounding (SKG) leverages structured knowledge to complete user requests, such as semantic parsing over databases and question answering over knowledge bases. Since the inputs and outputs of SKG tasks are heterogeneous, they have been studied separately by different communities, which limits systematic and compatible research on SKG. In this paper, we overcome this limitation by proposing the UnifiedSKG framework, which unifies 21 SKG tasks into a text-to-text format, aiming to promote systematic SKG research, instead of being exclusive to a single task, domain, or dataset. We use UnifiedSKG to benchmark T5 with different sizes and show that T5, with simple modifications when necessary, achieves state-of-the-art performance on almost all of the 21 tasks. We further demonstrate that multi-task prefix-tuning improves the performance on most tasks, largely improving the overall performance. UnifiedSKG also facilitates the investigation of zero-shot and few-shot learning, and we show that T0, GPT-3, and Codex struggle in zero-shot and few-shot learning for SKG. We also use UnifiedSKG to conduct a series of controlled experiments on structured knowledge encoding variants across SKG tasks. UnifiedSKG is easily extensible to more tasks, and it is open-sourced at


Hierarchical Character Tagger for Short Text Spelling Error Correction
Mengyi Gao | Canran Xu | Peng Shi
Proceedings of the Seventh Workshop on Noisy User-generated Text (W-NUT 2021)

State-of-the-art approaches to spelling error correction problem include Transformer-based Seq2Seq models, which require large training sets and suffer from slow inference time; and sequence labeling models based on Transformer encoders like BERT, which involve token-level label space and therefore a large pre-defined vocabulary dictionary. In this paper we present a Hierarchical Character Tagger model, or HCTagger, for short text spelling error correction. We use a pre-trained language model at the character level as a text encoder, and then predict character-level edits to transform the original text into its error-free form with a much smaller label space. For decoding, we propose a hierarchical multi-task approach to alleviate the issue of long-tail label distribution without introducing extra model parameters. Experiments on two public misspelling correction datasets demonstrate that HCTagger is an accurate and much faster approach than many existing models.

Logic-Consistency Text Generation from Semantic Parses
Chang Shu | Yusen Zhang | Xiangyu Dong | Peng Shi | Tao Yu | Rui Zhang
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021

Mr. TyDi: A Multi-lingual Benchmark for Dense Retrieval
Xinyu Zhang | Xueguang Ma | Peng Shi | Jimmy Lin
Proceedings of the 1st Workshop on Multilingual Representation Learning

We present Mr. TyDi, a multi-lingual benchmark dataset for mono-lingual retrieval in eleven typologically diverse languages, designed to evaluate ranking with learned dense representations. The goal of this resource is to spur research in dense retrieval techniques in non-English languages, motivated by recent observations that existing techniques for representation learning perform poorly when applied to out-of-distribution data. As a starting point, we provide zero-shot baselines for this new dataset based on a multi-lingual adaptation of DPR that we call “mDPR”. Experiments show that although the effectiveness of mDPR is much lower than BM25, dense representations nevertheless appear to provide valuable relevance signals, improving BM25 results in sparse–dense hybrids. In addition to analyses of our results, we also discuss future challenges and present a research agenda in multi-lingual dense retrieval. Mr. TyDi can be downloaded at

Cross-Lingual Training of Dense Retrievers for Document Retrieval
Peng Shi | Rui Zhang | He Bai | Jimmy Lin
Proceedings of the 1st Workshop on Multilingual Representation Learning

Dense retrieval has shown great success for passage ranking in English. However, its effectiveness for non-English languages remains unexplored due to limitation in training resources. In this work, we explore different transfer techniques for document ranking from English annotations to non-English languages. Our experiments reveal that zero-shot model-based transfer using mBERT improves search quality. We find that weakly-supervised target language transfer is competitive compared to generation-based target language transfer, which requires translation models.

Semantics of the Unwritten: The Effect of End of Paragraph and Sequence Tokens on Text Generation with GPT2
He Bai | Peng Shi | Jimmy Lin | Luchen Tan | Kun Xiong | Wen Gao | Jie Liu | Ming Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: Student Research Workshop

The semantics of a text is manifested not only by what is read but also by what is not read. In this article, we will study how those implicit “not read” information such as end-of-paragraph () and end-of-sequence () affect the quality of text generation. Specifically, we find that the pre-trained language model GPT2 can generate better continuations by learning to generate the in the fine-tuning stage. Experimental results on English story generation show that can lead to higher BLEU scores and lower perplexity. We also conduct experiments on a self-collected Chinese essay dataset with Chinese-GPT2, a character level LM without and during pre-training. Experimental results show that the Chinese GPT2 can generate better essay endings with .


Cross-Lingual Training of Neural Models for Document Ranking
Peng Shi | He Bai | Jimmy Lin
Findings of the Association for Computational Linguistics: EMNLP 2020

We tackle the challenge of cross-lingual training of neural document ranking models for mono-lingual retrieval, specifically leveraging relevance judgments in English to improve search in non-English languages. Our work successfully applies multi-lingual BERT (mBERT) to document ranking and additionally compares against a number of alternatives: translating the training data, translating documents, multi-stage hybrids, and ensembles. Experiments on test collections in six different languages from diverse language families reveal many interesting findings: model-based relevance transfer using mBERT can significantly improve search quality in (non-English) mono-lingual retrieval, but other “low resource” approaches are competitive as well.


Simple Attention-Based Representation Learning for Ranking Short Social Media Posts
Peng Shi | Jinfeng Rao | Jimmy Lin
Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers)

This paper explores the problem of ranking short social media posts with respect to user queries using neural networks. Instead of starting with a complex architecture, we proceed from the bottom up and examine the effectiveness of a simple, word-level Siamese architecture augmented with attention-based mechanisms for capturing semantic “soft” matches between query and post tokens. Extensive experiments on datasets from the TREC Microblog Tracks show that our simple models not only achieve better effectiveness than existing approaches that are far more complex or exploit a more diverse set of relevance signals, but are also much faster.

Aligning Cross-Lingual Entities with Multi-Aspect Information
Hsiu-Wei Yang | Yanyan Zou | Peng Shi | Wei Lu | Jimmy Lin | Xu Sun
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Multilingual knowledge graphs (KGs), such as YAGO and DBpedia, represent entities in different languages. The task of cross-lingual entity alignment is to match entities in a source language with their counterparts in target languages. In this work, we investigate embedding-based approaches to encode entities from multilingual KGs into the same vector space, where equivalent entities are close to each other. Specifically, we apply graph convolutional networks (GCNs) to combine multi-aspect information of entities, including topological connections, relations, and attributes of entities, to learn entity embeddings. To exploit the literal descriptions of entities expressed in different languages, we propose two uses of a pretrained multilingual BERT model to bridge cross-lingual gaps. We further propose two strategies to integrate GCN-based and BERT-based modules to boost performance. Extensive experiments on two benchmark datasets demonstrate that our method significantly outperforms existing systems.

Bridging the Gap between Relevance Matching and Semantic Matching for Short Text Similarity Modeling
Jinfeng Rao | Linqing Liu | Yi Tay | Wei Yang | Peng Shi | Jimmy Lin
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

A core problem of information retrieval (IR) is relevance matching, which is to rank documents by relevance to a user’s query. On the other hand, many NLP problems, such as question answering and paraphrase identification, can be considered variants of semantic matching, which is to measure the semantic distance between two pieces of short texts. While at a high level both relevance and semantic matching require modeling textual similarity, many existing techniques for one cannot be easily adapted to the other. To bridge this gap, we propose a novel model, HCAN (Hybrid Co-Attention Network), that comprises (1) a hybrid encoder module that includes ConvNet-based and LSTM-based encoders, (2) a relevance matching module that measures soft term matches with importance weighting at multiple granularities, and (3) a semantic matching module with co-attention mechanisms that capture context-aware semantic relatedness. Evaluations on multiple IR and NLP benchmarks demonstrate state-of-the-art effectiveness compared to approaches that do not exploit pretraining on external data. Extensive ablation studies suggest that relevance and semantic matching signals are complementary across many problem settings, regardless of the choice of underlying encoders.


Farewell Freebase: Migrating the SimpleQuestions Dataset to DBpedia
Michael Azmy | Peng Shi | Jimmy Lin | Ihab Ilyas
Proceedings of the 27th International Conference on Computational Linguistics

Question answering over knowledge graphs is an important problem of interest both commercially and academically. There is substantial interest in the class of natural language questions that can be answered via the lookup of a single fact, driven by the availability of the popular SimpleQuestions dataset. The problem with this dataset, however, is that answer triples are provided from Freebase, which has been defunct for several years. As a result, it is difficult to build “real-world” question answering systems that are operationally deployable. Furthermore, a defunct knowledge graph means that much of the infrastructure for querying, browsing, and manipulating triples no longer exists. To address this problem, we present SimpleDBpediaQA, a new benchmark dataset for simple question answering over knowledge graphs that was created by mapping SimpleQuestions entities and predicates from Freebase to DBpedia. Although this mapping is conceptually straightforward, there are a number of nuances that make the task non-trivial, owing to the different conceptual organizations of the two knowledge graphs. To lay the foundation for future research using this dataset, we leverage recent work to provide simple yet strong baselines with and without neural networks.

Strong Baselines for Simple Question Answering over Knowledge Graphs with and without Neural Networks
Salman Mohammed | Peng Shi | Jimmy Lin
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

We examine the problem of question answering over knowledge graphs, focusing on simple questions that can be answered by the lookup of a single fact. Adopting a straightforward decomposition of the problem into entity detection, entity linking, relation prediction, and evidence combination, we explore simple yet strong baselines. On the popular SimpleQuestions dataset, we find that basic LSTMs and GRUs plus a few heuristics yield accuracies that approach the state of the art, and techniques that do not use neural networks also perform reasonably well. These results show that gains from sophisticated deep learning techniques proposed in the literature are quite modest and that some previous models exhibit unnecessary complexity.


Exploiting Mutual Benefits between Syntax and Semantic Roles using Neural Network
Peng Shi | Zhiyang Teng | Yue Zhang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing