Sun Le

2023

“Document Information Extraction (DIE) is a crucial task for extracting key information from visually-rich documents. The typical pipeline approach for this task involves Optical Character Recognition (OCR), serializer, Semantic Entity Recognition (SER), and Relation Extraction (RE) modules. However, this pipeline presents significant challenges in real-world scenarios due to issues such as unnatural text order and error propagation between different modules. To address these challenges, we propose a novel tagging-based method, Global TaggeR (GTR), which converts the original sequence labeling task into a token relation classification task. This approach globally links discontinuous semantic entities in complex layouts, and jointly extracts entities and relations from documents. In addition, we design a joint training loss and a joint decoding strategy for SER and RE tasks based on GTR. Our experiments on multiple datasets demonstrate that GTR not only mitigates the issue of text in the wrong order but also improves RE performance.”

SentBench: Comprehensive Evaluation of Self-Supervised Sentence Representation with Benchmark Construction
Liu Xiaoming | Lin Hongyu | Han Xianpei | Sun Le
Proceedings of the 22nd Chinese National Conference on Computational Linguistics

“Self-supervised learning has been widely used to learn effective sentence representations. Previous evaluation of sentence representations mainly focuses on the limited combination of tasks and paradigms while failing to evaluate their effectiveness in a wider range of application scenarios. Such divergences prevent us from understanding the limitations of current sentence representations, as well as the connections between learning approaches and downstream applications. In this paper, we propose SentBench, a new comprehensive benchmark to evaluate sentence representations. SentBench covers 12 kinds of tasks and evaluates sentence representations with three types of different downstream application paradigms. Based on SentBench, we re-evaluate several frequently used self-supervised sentence representation learning approaches. Experiments show that SentBench can effectively evaluate sentence representations from multiple perspectives, and the performance on SentBench leads to some novel findings which enlighten future researches.”

2022

Data Synthesis and Iterative Refinement for Neural Semantic Parsing without Annotated Logical Forms
Wu Shan | Chen Bo | Han Xianpei | Sun Le
Proceedings of the 21st Chinese National Conference on Computational Linguistics

“Semantic parsing aims to convert natural language utterances to logical forms. A critical challenge for constructing semantic parsers is the lack of labeled data. In this paper, we propose a data synthesis and iterative refinement framework for neural semantic parsing, which can build semantic parsers without annotated logical forms. We first generate a naive corpus by sampling logic forms from knowledge bases and synthesizing their canonical utterances. Then, we further propose a bootstrapping algorithm to iteratively refine data and model, via a denoising language model and knowledge-constrained decoding. Experimental results show that our approach achieves competitive performance on GEO, ATIS and OVERNIGHT datasets in both unsupervised and semi-supervised data settings.”

2021

Few-shot relation classification has attracted great attention recently and is regarded as an effective way to tackle the long-tail problem in relation classification. Most previous works on few-shot relation classification are based on learning-to-match paradigms which focus on learning an effective universal matcher between the query and one target class prototype based on inner-class support sets. However the learning-to-match paradigm focuses on capturing the similarity knowledge between query and class prototype while fails to consider discriminative information between different candidate classes. Such information is critical especially when target classes are highly confusing and domain shifting exists between training and testing phases. In this paper we propose the Global Transformed Prototypical Networks (GTPN) which learns to build a few-shot model to directly discriminate between the query and all target classes with both inner-class local information and inter-class global information. Such learning-to-discriminate paradigm can make the model concentrate more on the discriminative knowledge between all candidate classes and therefore leads to better classification performance. We conducted experiments on standard FewRel benchmarks. Experimental results show that GTPN achieves very competitive performance on few-shot relation classification and reached the best performance on the official leaderboard of FewRel 2.0.

Co-authors

Wu Hua 1

Wu Shan 1

Venues

ccl 4
