2021
BSTC: A Large-Scale Chinese-English Speech Translation Dataset
Ruiqing Zhang | Xiyang Wang | Chuanqiang Zhang | Zhongjun He | Hua Wu | Zhi Li | Haifeng Wang | Ying Chen | Qinfei Li
Proceedings of the Second Workshop on Automatic Simultaneous Translation
This paper presents BSTC (Baidu Speech Translation Corpus), a large-scale Chinese-English speech translation dataset. The dataset is constructed from a collection of licensed videos of talks and lectures, and comprises about 68 hours of Mandarin speech, its manual transcripts, English translations, and automated transcripts produced by an automatic speech recognition (ASR) model. We further asked three experienced interpreters to simultaneously interpret the test talks in a mock conference setting. This corpus is expected to promote research on automatic simultaneous translation as well as the development of practical systems. We have organized simultaneous translation tasks and used this corpus to evaluate automatic simultaneous translation systems.
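As a rough illustration of what one example in such a corpus contains, the following sketch defines a hypothetical record type tying together the components named in the abstract (audio, manual transcript, ASR transcript, translation, interpreter outputs). The field names, types, and file layout are assumptions for illustration only, not the released corpus format.

```python
# Hypothetical representation of a single BSTC-style example.
# Field names and paths are illustrative assumptions, not the actual corpus schema.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class BstcUtterance:
    audio_path: str                      # segment of the Mandarin talk audio
    manual_transcript: str               # human-produced Mandarin transcript
    asr_transcript: str                  # automatic (ASR) Mandarin transcript
    translation_en: str                  # human English translation
    interpretations_en: Optional[List[str]] = None  # interpreter outputs (test talks only)

example = BstcUtterance(
    audio_path="talk_001/utt_0001.wav",
    manual_transcript="大家好。",
    asr_transcript="大家 好",
    translation_en="Hello, everyone.",
    interpretations_en=["Hello everybody."],
)
```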
UnClE: Explicitly Leveraging Semantic Similarity to Reduce the Parameters of Word Embeddings
Zhi Li | Yuchen Zhai | Chengyu Wang | Minghui Qiu | Kailiang Li | Yin Zhang
Findings of the Association for Computational Linguistics: EMNLP 2021
Natural language processing (NLP) models often require a massive number of parameters for word embeddings, which limits their application on mobile devices. Researchers have employed many approaches, e.g. adaptive inputs, to reduce the number of parameters in word embeddings. However, existing methods rarely pay attention to semantic information. In this paper, we propose a novel method called Unique and Class Embeddings (UnClE), which explicitly leverages semantic similarity with weight sharing to reduce the dimensionality of word embeddings. Inspired by the fact that words with similar semantics can share part of their weights, we divide the embedding of each word into two parts: a unique embedding and a class embedding. The former is a one-to-one mapping, like a traditional embedding, while the latter is a many-to-one mapping that learns a representation of class information. Our method is suitable for both word-level and sub-word-level models and can be used to reduce both input and output embeddings. Experimental results on the standard WMT 2014 English-German dataset show that our method reduces the parameters of word embeddings by more than 11x while retaining about 93% of the BLEU score. For language modeling, our model reduces word embeddings by 6x or 11x on the PTB/WT2 datasets at the cost of a certain degree of performance degradation.
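A minimal sketch of the unique-plus-class embedding idea described above, assuming that the full word vector is the concatenation of a small per-word part and a shared per-class part, and that the word-to-class assignment is precomputed (e.g. by clustering pretrained vectors). All names and dimensions are illustrative assumptions, not the paper's actual implementation.

```python
# Sketch: per-word "unique" embedding (one-to-one) plus shared "class"
# embedding (many-to-one), concatenated into the final word vector.
import torch
import torch.nn as nn

class UniqueClassEmbedding(nn.Module):
    def __init__(self, vocab_size, num_classes, unique_dim, class_dim, word_to_class):
        super().__init__()
        # One-to-one: every word keeps its own (small) unique vector.
        self.unique = nn.Embedding(vocab_size, unique_dim)
        # Many-to-one: semantically similar words share one class vector.
        self.cls = nn.Embedding(num_classes, class_dim)
        # Assumed-given mapping from word id to class id.
        self.register_buffer("word_to_class", word_to_class)

    def forward(self, token_ids):
        unique_part = self.unique(token_ids)
        class_part = self.cls(self.word_to_class[token_ids])
        # Full embedding = concatenation of the two parts.
        return torch.cat([unique_part, class_part], dim=-1)

# Usage: 30k words mapped onto 1k classes; the parameter count shrinks
# because the large class vectors are shared across many words.
word_to_class = torch.randint(0, 1000, (30000,))
emb = UniqueClassEmbedding(30000, 1000, unique_dim=64, class_dim=448,
                           word_to_class=word_to_class)
vectors = emb(torch.tensor([[1, 5, 42]]))  # shape: (1, 3, 512)
```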
KuiLeiXi: a Chinese Open-Ended Text Adventure Game
Yadong Xi | Xiaoxi Mao | Le Li | Lei Lin | Yanjiang Chen | Shuhan Yang | Xuhan Chen | Kailun Tao | Zhi Li | Gongzheng Li | Lin Jiang | Siyan Liu | Zeng Zhao | Minlie Huang | Changjie Fan | Zhipeng Hu
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations
There is a long history of research on automated story generation, dating back as far as the 1970s. Recently, the rapid development of pre-trained language models has spurred great progress in this field. Equipped with GPT-2 and the latest GPT-3, AI Dungeon has been seen as a famous example of the powerful text generation capabilities of large-scale pre-trained language models, and a possible direction for future games. However, as a game, AI Dungeon lacks incentives for players and relies entirely on players to explore on their own, which causes their enthusiasm to decline rapidly. In this paper, we present an open-ended text adventure game in Chinese, named KuiLeiXi. In KuiLeiXi, players interact with the AI until pre-determined plot goals are reached. By introducing plot goals, players have a stronger incentive to explore ways to reach them, while the AI's abilities are not abused to generate harmful content. This limited freedom allows the game to be integrated as part of a romance simulation mobile game, Yu Jian Love. Since KuiLeiXi was launched, it has received a great deal of positive feedback from more than 100,000 players. A demo video is available at https://youtu.be/DyYZhxMRrkk.
AutoChart: A Dataset for Chart-to-Text Generation Task
Jiawen Zhu | Jinye Ran | Roy Ka-Wei Lee | Zhi Li | Kenny Choo
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)
The analytical description of charts is an exciting and important research area with many applications in academia and industry. Yet, this challenging task has received limited attention from the computational linguistics research community. This paper proposes AutoChart, a large dataset for the analytical description of charts, which aims to encourage more research into this important area. Specifically, we offer a novel framework that generates charts and their analytical descriptions automatically. We conduct extensive human and machine evaluations of the generated charts and descriptions and demonstrate that the generated texts are informative, coherent, and relevant to the corresponding charts.
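To make the "generate a chart together with an aligned description" idea concrete, here is a toy sketch: a bar chart is rendered from synthetic values, and a short templated description is produced from the same values, so text and chart agree by construction. The chart type, templates, and function names are assumptions for illustration; this is not AutoChart's actual pipeline.

```python
# Toy chart-plus-description generator (illustrative only).
import random
import matplotlib.pyplot as plt

def generate_chart_and_description(categories, out_path="chart.png"):
    values = [random.randint(10, 100) for _ in categories]

    # Render a simple bar chart from the synthetic data.
    plt.figure()
    plt.bar(categories, values)
    plt.ylabel("Value")
    plt.title("Synthetic bar chart")
    plt.savefig(out_path)
    plt.close()

    # Build a short templated description grounded in the same data.
    top_value, top_category = max(zip(values, categories))
    low_value, low_category = min(zip(values, categories))
    description = (
        f"The chart compares {len(categories)} categories. "
        f"{top_category} has the highest value ({top_value}), while "
        f"{low_category} has the lowest ({low_value})."
    )
    return out_path, description

path, text = generate_chart_and_description(["A", "B", "C", "D"])
print(text)
```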