Zijun Sun

2024

pdf abs
Curriculum Masking in Vision-Language Pretraining to Maximize Cross Modal Interaction
Kraig Tou | Zijun Sun
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Many leading methods in Vision and language (V+L) pretraining utilize masked language modeling (MLM) as a standard pretraining component, with the expectation that reconstruction of masked text tokens would necessitate reference to corresponding image context via cross/self attention and thus promote representation fusion. However, we observe that the minimization of MLM loss in earlier training stages can depend disproportionately on local text signals, leading to poor training efficiency and inconsistency with the goal of representation fusion. The extent of this lack of cross modal interaction depends strongly which token(s) are masked. To address this issue, we propose a curriculum masking scheme as a replacement for random masking. Tokens are selected to be masked at a frequency proportional to the expected level of cross modal interaction necessary to reconstruct them. This is achieved using a parallel mask selection agent that measures the cross modal flow of information and treats it as a reward to be maximized. By additionally masking contiguous spans that include key objects and their relations, we also achieve better relational understanding, which has been shown to be lacking in many SOTA models. Our experiments on a wide range of V+L tasks show that we trail closely behind state-of-the-art methods despite pretraining on 300x to 1000x less data and we also achieve either top or runner-up performance on tasks from the ARO benchmark which tests compositional relationships. Finally, we demonstrate the potential of our method to scale to larger pretraining data.

2022

pdf abs
Summarize, Outline, and Elaborate: Long-Text Generation via Hierarchical Supervision from Extractive Summaries
Xiaofei Sun | Zijun Sun | Yuxian Meng | Jiwei Li | Chun Fan
Proceedings of the 29th International Conference on Computational Linguistics

The difficulty of generating coherent long texts lies in the fact that existing models overwhelmingly focus on the tasks of local word prediction, and cannot make high level plans on what to generate or capture the high-level discourse dependencies between chunks of texts. Inspired by how humans write, where a list of bullet points or a catalog is first outlined, and then each bullet point is expanded to form the whole article, we propose SOE, a pipelined system that involves of summarizing, outlining and elaborating for long text generation: the model first outlines the summaries for different segments of long texts, and then elaborates on each bullet point to generate the corresponding segment. To avoid the labor-intensive process of summary soliciting, we propose the reconstruction strategy, which extracts segment summaries in an unsupervised manner by selecting its most informative part to reconstruct the segment. The proposed generation system comes with the following merits: (1) the summary provides high-level guidance for text generation and avoids the local minimum of individual word predictions; (2) the high-level discourse dependencies are captured in the conditional dependencies between summaries and are preserved during the summary expansion process and (3) additionally, we are able to consider significantly more contexts by representing contexts as concise summaries. Extensive experiments demonstrate that SOE produces long texts with significantly better quality, along with faster convergence speed.

2021

pdf abs
ChineseBERT: Chinese Pretraining Enhanced by Glyph and Pinyin Information
Zijun Sun | Xiaoya Li | Xiaofei Sun | Yuxian Meng | Xiang Ao | Qing He | Fei Wu | Jiwei Li
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)

Recent pretraining models in Chinese neglect two important aspects specific to the Chinese language: glyph and pinyin, which carry significant syntax and semantic information for language understanding. In this work, we propose ChineseBERT, which incorporates both the glyph and pinyin information of Chinese characters into language model pretraining. The glyph embedding is obtained based on different fonts of a Chinese character, being able to capture character semantics from the visual features, and the pinyin embedding characterizes the pronunciation of Chinese characters, which handles the highly prevalent heteronym phenomenon in Chinese (the same character has different pronunciations with different meanings). Pretrained on large-scale unlabeled Chinese corpus, the proposed ChineseBERT model yields significant performance boost over baseline models with fewer training steps. The proposed model achieves new SOTA performances on a wide range of Chinese NLP tasks, including machine reading comprehension, natural language inference, text classification, sentence pair matching, and competitive performances in named entity recognition and word segmentation.

2019

In this paper, we propose a new paradigm for the task of entity-relation extraction. We cast the task as a multi-turn question answering problem, i.e., the extraction of entities and elations is transformed to the task of identifying answer spans from the context. This multi-turn QA formalization comes with several key advantages: firstly, the question query encodes important information for the entity/relation class we want to identify; secondly, QA provides a natural way of jointly modeling entity and relation; and thirdly, it allows us to exploit the well developed machine reading comprehension (MRC) models. Experiments on the ACE and the CoNLL04 corpora demonstrate that the proposed paradigm significantly outperforms previous best models. We are able to obtain the state-of-the-art results on all of the ACE04, ACE05 and CoNLL04 datasets, increasing the SOTA results on the three datasets to 49.6 (+1.2), 60.3 (+0.7) and 69.2 (+1.4), respectively. Additionally, we construct and will release a newly developed dataset RESUME, which requires multi-step reasoning to construct entity dependencies, as opposed to the single-step dependency extraction in the triplet exaction in previous datasets. The proposed multi-turn QA model also achieves the best performance on the RESUME dataset.

Co-authors

Fan Yin 1

Qing He 1

Fei Wu 1