Cross-lingual pre-training methods mask and predict tokens in multilingual text to generalize over diverse multilingual information. However, because sufficient aligned multilingual resources are lacking during pre-training, these methods may not fully exploit the multilingual correlations of masked tokens, which limits the interaction of multilingual information. In this paper, we propose a lifelong multilingual multi-granularity semantic alignment approach, which continuously extracts massive aligned linguistic units from noisy data via a maximum co-occurrence probability algorithm. Using this approach, we release a version of the multilingual multi-granularity semantic alignment resource that supports seven languages: English, Czech, German, Russian, Romanian, Hindi, and Turkish. Finally, we show how to use this resource to improve translation performance on the WMT14-18 benchmarks in twelve directions. Experimental results show average improvements of 0.3 to 1.1 BLEU across all translation benchmarks. The analysis and discussion also demonstrate the superiority and potential of the proposed approach. The resource used in this work will be made publicly available.
Ensuring robustness is especially important when AI is deployed in responsible or safety-critical environments. ChatGPT performs remarkably well in both adversarial and out-of-distribution (OOD) robustness, while other popular large language models (LLMs), such as LLaMA-2, ERNIE, and ChatGLM, do not perform satisfactorily in this regard. It is therefore valuable to study which efforts play essential roles in ChatGPT's robustness and how to transfer them to other LLMs. This paper finds experimentally that linguistic rule induction is the foundation for identifying cause-effect relationships in LLMs, and that accurately processing these cause-effect relationships improves an LLM's adversarial and OOD robustness. Furthermore, we explore a low-cost way of aligning LLMs with linguistic rules. Specifically, we construct a linguistic rule (LingR) instruction dataset to fine-tune LLMs. To further encourage LLMs to reason step by step with linguistic rules, we construct task-relevant LingR-based chains of thought. Experiments show that LingR-induced LLaMA-13B achieves results comparable to or better than those of GPT-3.5 and GPT-4 on various adversarial and OOD robustness evaluations.
Exploiting domain knowledge to guarantee the correctness of generated text has been a hot topic in recent years, especially in highly specialized domains such as medicine. However, most recent work considers only unstructured text and ignores the structured information in knowledge graphs. In this paper, we focus on the medical topic-to-text generation task and adapt a knowledge-aware text generation model to the medical domain. The resulting model, MedWriter, not only introduces specific knowledge from an external medical knowledge graph (MKG) but is also capable of learning graph-level representations. We conduct experiments on a dataset collected from medical journals, in which each sample consists of a set of topic words, the abstract of a medical article, and a corresponding knowledge graph from CMeKG. Experimental results demonstrate that incorporating a knowledge graph into the generation model improves the quality of the generated text and that the model is robustly superior to competing methods.