Bin Li


2021

中文连动句语义关系识别研究(Research on Semantic Relation Recognition of Chinese Serial-verb Sentences)
Chao Sun (孙超) | Weiguang Qu (曲维光) | Tingxin Wei (魏庭新) | Yanhui Gu (顾彦慧) | Bin Li (李斌) | Junsheng Zhou (周俊生)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

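A serial-verb sentence has the form "NP+VP1+VP2": it contains two or more verbs (or verb structures) that share the same agent. Serial-verb sentences with the same structure can express many different semantic relations. Building on earlier classifications of the semantic relations between VP1 and VP2 in serial-verb sentences, this paper annotates a dataset of serial-verb semantic relations and recognizes those relations with a neural network. The method decomposes the recognition task: using BERT for encoding, it first applies BiLSTM-CRF to identify the serial verbs (VP) and their subject (NP) in the sentence, and then, based on an encoding fused with the serial-verb information, applies BiLSTM-Attention to classify the relation between the serial verbs. Experimental results verify the effectiveness of the proposed method.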

中文词语离合现象识别研究(Research on Recognition of the Separation and Reunion Phenomena of Words in Chinese)
Lou Zhou (周露) | Weiguang Qu (曲维光) | Tingxin Wei (魏庭新) | Junsheng Zhou (周俊生) | Bin Li (李斌) | Yanhui Gu (顾彦慧)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

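The separation and reunion of words is a special phenomenon in Chinese in which a word can be used either joined or split. This paper uses character-level sequence labeling to automatically recognize the separated usage of two-character verbs, which avoids error propagation from Chinese word segmentation and part-of-speech tagging and saves the manual cost of writing matching rules and feature templates. During training, the Chinese pre-trained BERT model is fine-tuned to obtain task-oriented character vector representations, and a masking mechanism hides from the model the words that are split in separated usages. This reduces the influence of the words themselves on the recognition result and strengthens learning of the inserted components. Different masks are applied to the front and back morphemes to emphasize their order, so that the model can recognize complex and occasional separated usages. To obtain sentence representations with contextual information, the original sentence representation and the masked sentence representation are fed into two BiLSTM layers with different parameters, and a CRF then captures the dependencies in the sentence's label sequence. The proposed BERT MASK + 2BiLSTMs + CRF model improves F1 by 2.85% over the best existing model for recognizing separable words.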

先秦词网构建及梵汉对比研究(The Construction of Pre-Qin Ancient Chinese WordNet and Cross Language Comparative Study between Ancient Sanskrit WordNet and Pre-Qin Ancient Chinese WordNet)
Xuehui Lu (卢雪晖) | Huidan Xu (徐会丹) | Siyu Chen (陈思瑜) | Bin Li (李斌)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

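Pre-Qin Chinese holds an important position in the study of the history of Chinese, yet previous research never produced a structured Pre-Qin lexical resource, which makes it hard to meet the needs of ancient Chinese information processing and cross-language comparison. Internationally, wordnets for dozens of languages have been built on the semantic category framework of the English WordNet, and they have become a basic resource for multilingual natural language processing and cross-language comparison. This paper surveys the construction of wordnets in China and abroad, especially wordnets of ancient languages and Chinese wordnets, and then describes in detail the construction and correction of the Pre-Qin WordNet, which covers 43,591 words, 61,227 senses, and 17,975 semantic categories. Through a cross-language comparison with the ancient Sanskrit WordNet, the paper also attempts to analyze the lexical commonalities and differences between these two ancient languages and provides a preliminary validation of the Pre-Qin WordNet.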

基于大规模语料库的《古籍汉字分级字表》研究(The Formulation of The graded Chinese character list of ancient books Based on Large-scale Corpus)
Changwei Xu (许长伟) | Minxuan Feng (冯敏萱) | Bin Li (李斌) | Yiguo Yuan (袁义国)
Proceedings of the 20th Chinese National Conference on Computational Linguistics

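The Graded Chinese Character List of Ancient Books (《古籍汉字分级字表》) is a graded character list built from a large-scale corpus of ancient texts to help learners read ancient documents. It fills a gap in research on character lists for ancient books and grades the characters by learning priority; it currently contains 105 first-level, 340 second-level, and 555 third-level characters. This paper describes the main basis and basic steps of the list's development and compares it with the traditional literacy primers known as "San Bai Qian" (三百千) and with the List of Frequently Used Characters in Modern Chinese (《现代汉语常用字表》) to verify that its selection of characters is reasonable. The list helps learners master the characters commonly used in ancient texts first and improves their ability to read ancient books, thereby promoting the inheritance and development of fine traditional Chinese culture.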

2020

多轮对话的篇章级抽象语义表示标注体系研究(Research on Discourse-level Abstract Meaning Representation Annotation framework in Multi-round Dialogue)
Tong Huang (黄彤) | Bin Li (李斌) | Peiyi Yan (闫培艺) | Tingting Ji (计婷婷) | Weiguang Qu (曲维光)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

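Dialogue analysis is a basic topic for natural-language dialogue applications such as intelligent customer service and chatbots. Dialogue corpora differ considerably from ordinary written corpora: they contain many forms of address, emotional phrases, ellipses, inverted word order, redundancy, and other complex phenomena, which strongly affect syntactic and semantic parsers, so the accuracy of automatic dialogue analysis has remained lower than on written text. The main reason is the lack of a rigorous formal description of multi-round dialogue, which hinders subsequent analysis and computation. After reviewing dialogue annotation schemes and corpora in China and abroad, this paper therefore proposes a discourse-level annotation framework for multi-round dialogue based on Abstract Meaning Representation. It discusses how to annotate semantic structure at the discourse level, gives a scheme for aligning words with concepts and relations, adds semantic relations and concepts for forms of address and emotional phrases, adjusts the argument structure of words expressing subjective emotion, specifies the treatment of some special phenomena in dialogue, and designs a manual annotation platform. Together these lay a foundation for large-scale annotation of multi-round dialogue corpora and for computational research on them.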

基于抽象语义表示的汉语疑问句的标注与分析(Chinese Interrogative Sentences Annotation and Analysis Based on the Abstract Meaning Representation)
Peiyi Yan (闫培艺) | Bin Li (李斌) | Tong Huang (黄彤) | Kairui Huo (霍凯蕊) | Jin Chen (陈瑾) | Weiguang Qu (曲维光)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

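Syntactic and semantic analysis of interrogative sentences is widely used in search engines, information extraction, and question answering systems. Computational linguistics mostly handles interrogative sentences by combining question classification with syntactic parsing, and the accuracy and efficiency are still unsatisfactory. Linguistic research on interrogatives is rich, covering for example their structural types, interrogative focus, and non-interrogative uses of interrogative pronouns, but it lacks a systematic formal representation. To address this problem, this paper uses Chinese Abstract Meaning Representation (CAMR), a graph-based representation of the overall semantics of Chinese sentences, to annotate the semantic structure of interrogative sentences, representing the interrogative focus and the meaning of the whole sentence in a single structure. It then selects 2,071 interrogative sentences from a 20,000-sentence corpus comprising the web media portion of the Penn Chinese Treebank CTB8.0, primary school Chinese textbooks, and the Chinese translation of The Little Prince, and gathers statistics on their main characteristics. The statistics show that every kind of interrogative pronoun can be represented by combining the interrogative concept amr-unknown with a semantic relation, which fully captures the key information, interrogative focus, and semantic structure of an interrogative sentence. Finally, based on the semantic relations associated with interrogative pronouns, the paper computes the probability distribution of the interrogative focus: cause, modifier, and patient are the most frequent, accounting for 26.53%, 16.73%, and 16.44% respectively. Annotation and analysis of interrogative sentences based on Abstract Meaning Representation can provide basic theory and resources for research on Chinese interrogatives.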

基于神经网络的连动句识别(Recognition of serial-verb sentences based on Neural Network)
Chao Sun (孙超) | Weiguang Qu (曲维光) | Tingxin Wei (魏庭新) | Yanhui Gu (顾彦慧) | Bin Li (李斌) | Junsheng Zhou (周俊生)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

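A serial-verb sentence is a sentence with a serial-verb construction, a special syntactic structure that is very common and frequently used in modern Chinese. Both the grammatical structure and the semantic relations of serial-verb sentences are complex, and many problems arise in recognizing them. This paper therefore studies the recognition of serial-verb sentences and proposes a neural-network-based recognition method. The method has two steps: first, simple rules are used to preprocess the corpus; second, following a text classification approach, sentences are encoded with BERT, and a multi-layer CNN and a BiLSTM jointly extract features for classification, which completes the recognition task. In experiments on a manually annotated corpus, the method reaches an accuracy of 92.71% and an F1 score of 87.41%.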

基于深度学习的实体关系抽取研究综述(Review of Entity Relation Extraction based on deep learning)
Zhentao Xia (夏振涛) | Weiguang Qu (曲维光) | Yanhui Gu (顾彦慧) | Junsheng Zhou (周俊生) | Bin Li (李斌)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

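As a core subtask of information extraction, entity relation extraction is important for natural language processing applications such as knowledge graphs, intelligent question answering, and semantic search. Relation extraction aims to automatically identify the semantic relations that hold between entities in unstructured text. This paper focuses on sentence-level relation extraction, introduces the main datasets used for the task, and describes existing techniques in three groups: supervised relation extraction, distantly supervised relation extraction, and joint extraction of entities and relations. We compare the models used for this task and analyze their contributions and shortcomings. Finally, we introduce the current state of research and the methods for Chinese entity relation extraction.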

面向中文AMR标注体系的兼语语料库构建及识别研究(Research on the Construction and Recognition of Concurrent corpus for Chinese AMR Annotation System)
Wenhui Hou (侯文惠) | Weiguang Qu (曲维光) | Tingxin Wei (魏庭新) | Bin Li (李斌) | Yanhui Gu (顾彦慧) | Junsheng Zhou (周俊生)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

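The pivotal (concurrent) construction (兼语结构) is a common verb construction in Chinese in which a verb-object phrase and a subject-predicate phrase share the pivot. Its complex structure makes syntactic parsing difficult, so building a corpus of pivotal constructions and recognizing them is important for semantic parsing and downstream tasks. However, few such corpora exist, and none has yet been built for the Chinese AMR annotation system. To address this, this paper summarizes a set of annotation guidelines for pivotal constructions and builds a corpus of a certain size for the Chinese AMR annotation system. Using this corpus, a character-based neural network model is applied to recognize pivotal constructions, and the recognition results and directions for future improvement are analyzed and summarized.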

Integration of Automatic Sentence Segmentation and Lexical Analysis of Ancient Chinese based on BiLSTM-CRF Model
Ning Cheng | Bin Li | Liming Xiao | Changwei Xu | Sijia Ge | Xingyue Hao | Minxuan Feng
Proceedings of LT4HALA 2020 - 1st Workshop on Language Technologies for Historical and Ancient Languages

The basic tasks of ancient Chinese information processing include automatic sentence segmentation, word segmentation, part-of-speech tagging, and named entity recognition. Tasks such as lexical analysis depend on sentence segmentation because many ancient books are not punctuated. However, step-by-step processing tends to propagate errors from one level to the next. This paper designs and implements an integrated annotation system for sentence segmentation and lexical analysis. A BiLSTM-CRF neural network model is used to verify the generalization ability and the effect of sentence segmentation and lexical analysis at different label levels on four cross-age test sets. The results show that the integrated method improves the F1-score of sentence segmentation, word segmentation, and part-of-speech tagging for ancient Chinese. Across the test sets, the F1-score of sentence segmentation reached 78.95, with an average increase of 3.5%; the F1-score of word segmentation reached 85.73%, with an average increase of 0.18%; and the F1-score of part-of-speech tagging reached 72.65, with an average increase of 0.35%.

Construct a Sense-Frame Aligned Predicate Lexicon for Chinese AMR Corpus
Li Song | Yuling Dai | Yihuan Liu | Bin Li | Weiguang Qu
Proceedings of the 12th Language Resources and Evaluation Conference

The study of predicate frames is an important topic in semantic analysis. Abstract Meaning Representation (AMR) is an emerging graph-based semantic representation of a sentence. Since the core semantic roles defined in the predicate lexicon form the backbone of an AMR graph, constructing the lexicon becomes the key issue. Existing lexicons blur the senses and frames of predicates and need to be refined to support tasks such as word sense disambiguation and event extraction. This paper introduces an ongoing project to construct a novel predicate lexicon for the Chinese AMR corpus. The new lexicon includes 14,389 senses and 10,800 frames of 8,470 words. Because some senses can be aligned to more than one frame, and vice versa, we found that the alignment between senses and frames is not simply one frame per sense. A detailed analysis is given of the multiply aligned relations, which demonstrates the necessity of the proposed lexicon for the AMR corpus and supplies real data for theoretical linguistic studies.

Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O'Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

MRP 2020: The Second Shared Task on Cross-Framework and Cross-Lingual Meaning Representation Parsing
Stephan Oepen | Omri Abend | Lasha Abzianidze | Johan Bos | Jan Hajič | Daniel Hershcovich | Bin Li | Tim O’Gorman | Nianwen Xue | Daniel Zeman
Proceedings of the CoNLL 2020 Shared Task: Cross-Framework Meaning Representation Parsing

The 2020 Shared Task at the Conference for Computational Language Learning (CoNLL) was devoted to Meaning Representation Parsing (MRP) across frameworks and languages. Extending a similar setup from the previous year, five distinct approaches to the representation of sentence meaning in the form of directed graphs were represented in the English training and evaluation data for the task, packaged in a uniform graph abstraction and serialization; for four of these representation frameworks, additional training and evaluation data was provided for one additional language per framework. The task received submissions from eight teams, of which two do not participate in the official ranking because they arrived after the closing deadline or made use of additional training data. All technical information regarding the task, including system submissions, official results, and links to supporting resources and software, is available from the task web site at: http://mrp.nlpl.eu

2019

Ellipsis in Chinese AMR Corpus
Yihuan Liu | Bin Li | Peiyi Yan | Li Song | Weiguang Qu
Proceedings of the First International Workshop on Designing Meaning Representations

Ellipsis is very common in language, and natural language processing needs to restore the elided elements in a sentence. However, only a few corpora annotate ellipsis, which holds back its automatic detection and recovery. This paper introduces the annotation of ellipsis in Chinese sentences using a novel graph-based representation, Abstract Meaning Representation (AMR), which has a good mechanism for restoring the elided elements manually. We annotate 5,000 sentences selected from the Chinese TreeBank (CTB). We find that 54.98% of sentences have ellipses. 92% of the ellipses are restored by copying the antecedents’ concepts, and 12.9% of them are newly added concepts. In addition, we find that the elided element is a word or phrase in most cases, but sometimes only the head of a phrase or part of a phrase, which makes automatic recovery of ellipsis rather hard.

Building a Chinese AMR Bank with Concept and Relation Alignments
Bin Li | Yuan Wen | Li Song | Weiguang Qu | Nianwen Xue
Linguistic Issues in Language Technology, Volume 18, 2019 - Exploiting Parsed Corpora: Applications in Research, Pedagogy, and Processing

Abstract Meaning Representation (AMR) is a meaning representation framework in which the meaning of a full sentence is represented as a single-rooted, acyclic, directed graph. In this article, we describe an ongoing project to build a Chinese AMR (CAMR) corpus, which currently includes 10,149 sentences from the newsgroup and weblog portion of the Chinese TreeBank (CTB). We describe the annotation specifications for the CAMR corpus, which follow the annotation principles of English AMR but make adaptations where needed to accommodate the linguistic facts of Chinese. The CAMR specifications also include a systematic treatment of sentence-internal discourse relations. One significant change we have made to the AMR annotation methodology is the inclusion of the alignment between word tokens in the sentence and the concepts/relations in the CAMR annotation, which makes it easier for automatic parsers to model the correspondence between a sentence and its meaning representation. We develop an annotation tool for CAMR, and the inter-annotator agreement between the two annotators, as measured by the Smatch score, is 0.83, indicating reliable annotation. We also present some quantitative analysis of the CAMR corpus. 46.71% of the AMRs of the sentences are non-tree graphs. Moreover, the AMRs of 88.95% of the sentences have concepts that are inferred from the context of the sentence but do not correspond to a specific word.

2018

Transition-Based Chinese AMR Parsing
Chuan Wang | Bin Li | Nianwen Xue
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers)

This paper presents the first AMR parser built on the Chinese AMR bank. By applying a transition-based AMR parsing framework to Chinese, we first investigate how well the transitions first designed for English AMR parsing generalize to Chinese and provide a comparative analysis between the transitions for English and Chinese. We then perform a detailed error analysis to identify the major challenges in Chinese AMR parsing that we hope will inform future research in this area.

2016

Annotating the Little Prince with Chinese AMRs
Bin Li | Yuan Wen | Weiguang Qu | Lijun Bu | Nianwen Xue
Proceedings of the 10th Linguistic Annotation Workshop held in conjunction with ACL 2016 (LAW-X 2016)

2015

Chinese CogBank: Where to See the Cognitive Features of Chinese Words
Bin Li | Xiaopeng Bai | Siqi Yin | Jie Xu
Proceedings of the Third Workshop on Metaphor in NLP

2012

Web Based Collection and Comparison of Cognitive Properties in English and Chinese
Bin Li | Jiajun Chen | Yingjie Zhang
Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction (AKBC-WEKEX)

Adapting Conventional Chinese Word Segmenter for Segmenting Micro-blog Text: Combining Rule-based and Statistic-based Approaches
Ning Xi | Bin Li | Guangchao Tang | Shujian Huang | Yinggong Zhao | Hao Zhou | Xinyu Dai | Jiajun Chen
Proceedings of the Second CIPS-SIGHAN Joint Conference on Chinese Language Processing

MIXCD: System Description for Evaluating Chinese Word Similarity at SemEval-2012
Yingjie Zhang | Bin Li | Xinyu Dai | Jiajun Chen
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

NJU-Parser: Achievements on Semantic Dependency Parsing
Guangchao Tang | Bin Li | Shuaishuai Xu | Xinyu Dai | Jiajun Chen
*SEM 2012: The First Joint Conference on Lexical and Computational Semantics – Volume 1: Proceedings of the main conference and the shared task, and Volume 2: Proceedings of the Sixth International Workshop on Semantic Evaluation (SemEval 2012)

2010

Improving Blog Polarity Classification via Topic Analysis and Adaptive Methods
Feifan Liu | Dong Wang | Bin Li | Yang Liu
Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics

2008

Nanjing Normal University Segmenter for the Fourth SIGHAN Bakeoff
Xiaohe Chen | Bin Li | Junzhi Lu | Hongdong Nian | Xuri Tang
Proceedings of the Sixth SIGHAN Workshop on Chinese Language Processing