Jiarui Zhang


2024

Teaching Large Language Models to Translate on Low-resource Languages with Textbook Prompting
Ping Guo | Yubing Ren | Yue Hu | Yunpeng Li | Jiarui Zhang | Xingsheng Zhang | Heyan Huang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Large Language Models (LLMs) have achieved impressive results in Machine Translation simply by following instructions, even without training on parallel data. However, LLMs still struggle with low-resource languages due to the lack of pre-training data. In real-world situations, humans become proficient in their native languages through abundant and meaningful social interactions, and can also learn foreign languages effectively from well-organized textbooks. Drawing inspiration from these human learning patterns, we introduce the Translate After LEarNing Textbook (TALENT) approach, which aims to enhance LLMs’ ability to translate low-resource languages by learning from a textbook. TALENT follows a step-by-step process: (1) creating a Textbook for the low-resource language; (2) guiding the LLM to absorb the Textbook’s content and extract Syntax Patterns; (3) enhancing translation by utilizing the Textbook and the Syntax Patterns. We thoroughly assess TALENT’s performance on 112 low-resource languages from FLORES-200 with two LLMs: ChatGPT and BLOOMZ. Evaluation across three different metrics reveals that TALENT consistently improves translation performance, by 14.8% over zero-shot baselines. Further analysis demonstrates that TALENT not only improves LLMs’ comprehension of low-resource languages but also equips them with the knowledge needed to generate accurate and fluent sentences in these languages.
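To make the three steps above concrete, here is a minimal sketch of a TALENT-style prompting pipeline in Python. The helper llm_chat, the prompt wording, and the textbook format are illustrative assumptions for this sketch, not the paper's actual prompts or code.

```python
# A minimal sketch of a TALENT-style pipeline (the three steps from the abstract:
# build a "textbook", extract syntax patterns, then translate with both in context).
# `llm_chat` is a hypothetical stand-in for any chat-style LLM API (e.g. a ChatGPT
# or BLOOMZ serving endpoint); plug in a real client before use.

def llm_chat(prompt: str) -> str:
    """Placeholder for a chat-completion call."""
    raise NotImplementedError("plug in your LLM client here")

def build_textbook(lexicon: dict[str, str], example_pairs: list[tuple[str, str]]) -> str:
    """Step 1: assemble a small 'textbook' (vocabulary plus parallel examples)."""
    vocab = "\n".join(f"{src} = {tgt}" for src, tgt in lexicon.items())
    examples = "\n".join(f"{src} -> {tgt}" for src, tgt in example_pairs)
    return f"Vocabulary:\n{vocab}\n\nExample sentences:\n{examples}"

def extract_syntax_patterns(textbook: str, language: str) -> str:
    """Step 2: ask the LLM to summarize syntax patterns observed in the textbook."""
    return llm_chat(
        f"Here is a short textbook for {language}:\n{textbook}\n\n"
        "Summarize the main syntax patterns (word order, morphology) you observe."
    )

def translate(sentence: str, textbook: str, patterns: str, language: str) -> str:
    """Step 3: translate with both the textbook and the extracted patterns in context."""
    return llm_chat(
        f"Textbook for {language}:\n{textbook}\n\n"
        f"Syntax patterns:\n{patterns}\n\n"
        f"Using the material above, translate the following sentence into English:\n{sentence}"
    )
```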

2021

基于BPE分词的中国古诗主题模型及主题可控的诗歌生成(Topic model and topic-controlled poetry generation of Chinese ancient poem based on BPE)
Jiarui Zhang (张家瑞) | Wenhao Li (李文浩) | Maosong Sun (孙茂松)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

Classical Chinese poetry is a treasure of human culture: its short, compact language conveys remarkably rich meanings and themes, and it has attracted countless admirers from ancient times to the present. Taking a corpus of hundreds of thousands of classical poems as our object of study, we segment the collection with the BPE algorithm according to co-occurrence frequency, so that downstream tasks can understand the semantics of the poems more accurately. We then apply Latent Dirichlet Allocation (LDA) to the segmented corpus for topic analysis, and by comparing and tuning the number of topics we obtain a topic model with high accuracy. Going further, we apply the topic model line by line to the quatrains (jueju) and regulated verse (lüshi) in the corpus, derive a topic-transition matrix within a single poem, and carry out related analyses. Finally, we embed the topic model into a poetry generation model through a simple control-code method, achieving topic-controlled poetry generation and at the same time verifying the effectiveness of the trained topic model.
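To illustrate the pipeline described in the abstract, below is a minimal sketch that learns a few frequency-based pair merges (a toy stand-in for full BPE training), fits a gensim LDA topic model on the segmented lines, and tallies a per-line topic-transition matrix for each poem. The toy corpus, the number of merges, and the number of topics are illustrative assumptions, not the paper's settings.

```python
# Toy sketch: pair-merge segmentation -> LDA topics -> per-poem topic transitions.
from collections import Counter

import numpy as np
from gensim import corpora, models

# Two well-known quatrains as a toy corpus; the real study uses a far larger collection.
poems = [
    ["床前明月光", "疑是地上霜", "举头望明月", "低头思故乡"],
    ["春眠不觉晓", "处处闻啼鸟", "夜来风雨声", "花落知多少"],
]

def merge_pair(toks, a, b):
    """Replace every adjacent occurrence of (a, b) with the merged token a+b."""
    out, i = [], 0
    while i < len(toks):
        if i + 1 < len(toks) and toks[i] == a and toks[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(toks[i])
            i += 1
    return out

def learn_merges(lines, num_merges=20):
    """Learn the most frequent adjacent character pairs as merge rules (toy BPE)."""
    tokenized = [list(line) for line in lines]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for toks in tokenized:
            pairs.update(zip(toks, toks[1:]))
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append((a, b))
        tokenized = [merge_pair(toks, a, b) for toks in tokenized]
    return merges

def segment(line, merges):
    toks = list(line)
    for a, b in merges:
        toks = merge_pair(toks, a, b)
    return toks

all_lines = [line for poem in poems for line in poem]
merges = learn_merges(all_lines)
docs = [segment(line, merges) for line in all_lines]

# LDA over the segmented lines.
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, random_state=0)

# Assign each line of a poem to its most probable topic, then count transitions
# between consecutive lines to obtain a within-poem topic-transition matrix.
K = 2
transitions = np.zeros((K, K), dtype=int)
for poem in poems:
    topics = [
        max(lda.get_document_topics(dictionary.doc2bow(segment(line, merges))),
            key=lambda t: t[1])[0]
        for line in poem
    ]
    for prev, nxt in zip(topics, topics[1:]):
        transitions[prev, nxt] += 1
print(transitions)
```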

2020

Modeling Discourse Structure for Document-level Neural Machine Translation
Junxuan Chen | Xiang Li | Jiarui Zhang | Chulun Zhou | Jianwei Cui | Bin Wang | Jinsong Su
Proceedings of the First Workshop on Automatic Simultaneous Translation

Recently, document-level neural machine translation (NMT) has become a hot topic in the machine translation community. Despite this progress, most existing studies ignore the discourse structure of the input document, which has been shown to be effective in other tasks. In this paper, we propose to improve document-level NMT with the aid of discourse structure information. Our encoder is based on a hierarchical attention network (HAN) (Miculicich et al., 2018). Specifically, we first parse the input document to obtain its discourse structure. Then, we introduce a Transformer-based path encoder to embed the discourse structure information of each word. Finally, we combine the discourse structure information with the word embedding before it is fed into the encoder. Experimental results on an English-to-German dataset show that our model significantly outperforms both the Transformer and Transformer+HAN.
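As an illustration of the combination step described above, below is a minimal PyTorch sketch: each word carries the sequence of discourse-relation labels on the path from the discourse-tree root down to its sentence, a small Transformer encodes that path, and the pooled path representation is added to the word embedding before the NMT encoder. The dimensions, relation vocabulary, and mean-pooling choice are assumptions for illustration, not the paper's exact architecture.

```python
# Sketch: Transformer-based path encoder + word embeddings, combined by addition.
import torch
import torch.nn as nn

class DiscoursePathEncoder(nn.Module):
    def __init__(self, num_relations: int, d_model: int = 256, num_layers: int = 2):
        super().__init__()
        self.rel_embed = nn.Embedding(num_relations, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, paths: torch.Tensor) -> torch.Tensor:
        # paths: (batch, seq_len, path_len) relation-label ids on each word's root-to-sentence path
        b, s, p = paths.shape
        x = self.rel_embed(paths.view(b * s, p))   # (b*s, path_len, d_model)
        x = self.encoder(x)                        # contextualize the path labels
        return x.mean(dim=1).view(b, s, -1)        # pool over the path -> (batch, seq_len, d_model)

class StructureAwareEmbedding(nn.Module):
    def __init__(self, vocab_size: int, num_relations: int, d_model: int = 256):
        super().__init__()
        self.word_embed = nn.Embedding(vocab_size, d_model)
        self.path_encoder = DiscoursePathEncoder(num_relations, d_model)

    def forward(self, tokens: torch.Tensor, paths: torch.Tensor) -> torch.Tensor:
        # Combine word and discourse-path information before the NMT encoder.
        return self.word_embed(tokens) + self.path_encoder(paths)

# Toy usage: 2 documents, 5 words each, discourse paths of depth 3 over 8 relation labels.
emb = StructureAwareEmbedding(vocab_size=1000, num_relations=8)
tokens = torch.randint(0, 1000, (2, 5))
paths = torch.randint(0, 8, (2, 5, 3))
print(emb(tokens, paths).shape)  # torch.Size([2, 5, 256])
```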