Li Song


2024

Approaches and Challenges for Resolving Different Representations of Fictional Characters for Chinese Novels
Li Song | Ying Liu
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Given the sheer volume of literary works, literary studies in fields such as Digital Humanities urgently need automatic text analysis technologies. However, existing NLP technologies are built for general domains, which limits their effectiveness in in-depth literary studies, so it is valuable to explore how to adapt them to literature-specific tasks. Fictional characters are the most essential elements of a novel and are thus crucial to understanding its content. Collecting information about a character first requires resolving that character’s different representations. This is a special case of anaphora resolution, a classic open-domain NLP task. We adapt a state-of-the-art anaphora resolution model, with some modifications, to resolve character representations in Chinese novels, and train a widely used fine-tuned BERT model for speaker extraction to assist it. We also analyze the challenges and potential solutions for character resolution in Chinese novels, based on the resolution results for one specific Chinese novel.

2020

Construct a Sense-Frame Aligned Predicate Lexicon for Chinese AMR Corpus
Li Song | Yuling Dai | Yihuan Liu | Bin Li | Weiguang Qu
Proceedings of the Twelfth Language Resources and Evaluation Conference

The study of predicate frames is an important topic in semantic analysis. Abstract Meaning Representation (AMR) is an emerging graph-based semantic representation of a sentence. Since the core semantic roles defined in the predicate lexicon form the backbone of an AMR graph, constructing the lexicon becomes the key issue. Existing lexicons blur the senses and frames of predicates and need to be refined to support tasks such as word sense disambiguation and event extraction. This paper introduces an ongoing project to construct a novel predicate lexicon for the Chinese AMR corpus. The new lexicon includes 14,389 senses and 10,800 frames for 8,470 words. Because some senses can be aligned to more than one frame, and vice versa, the alignment between senses and frames is not simply one frame per sense. We give an explicit analysis of these multiple alignment relations, which demonstrates the need for the proposed lexicon for the AMR corpus and supplies real data for theoretical linguistic studies.
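The many-to-many alignment described in this abstract can be illustrated with a small sketch. The abstract does not give the lexicon's actual storage format, so the representation below (a list of `(sense, frame)` pairs), the identifiers such as `s1` and `f1`, and the function name `classify_alignments` are all hypothetical and serve only to show how one might label each alignment by type.

```python
from collections import defaultdict


def classify_alignments(pairs):
    """Label each (sense, frame) pair by the kind of alignment it takes part in.

    `pairs` is a hypothetical representation of the lexicon: one tuple per
    aligned sense and frame. It is not the format used by the authors.
    """
    frames_of = defaultdict(set)  # sense -> frames aligned to it
    senses_of = defaultdict(set)  # frame -> senses aligned to it
    for sense, frame in pairs:
        frames_of[sense].add(frame)
        senses_of[frame].add(sense)

    kinds = {}
    for sense, frame in pairs:
        many_frames = len(frames_of[sense]) > 1
        many_senses = len(senses_of[frame]) > 1
        if many_frames and many_senses:
            kind = "many-to-many"
        elif many_frames:
            kind = "one-sense-to-many-frames"
        elif many_senses:
            kind = "many-senses-to-one-frame"
        else:
            kind = "one-to-one"
        kinds[(sense, frame)] = kind
    return kinds


# Made-up identifiers, for illustration only.
example_pairs = [
    ("s1", "f1"),  # one sense, one frame
    ("s2", "f2"),  # sense s2 aligns to two frames
    ("s2", "f3"),
    ("s3", "f4"),  # frame f4 is shared by two senses
    ("s4", "f4"),
]
example_kinds = classify_alignments(example_pairs)
```

Counting the labels in `example_kinds` would give a rough breakdown of how often senses and frames depart from a one-to-one correspondence in a lexicon stored this way.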

用计量风格学方法考察《水浒传》的作者争议问题——以罗贯中《平妖传》为参照(Quantitive Stylistics Based Research on the Controversy of the Author of “Tales of the Marshes”: Comparing with “Pingyaozhuan” of Luo Guanzhong)
Li Song (宋丽) | Ying Liu (刘颖)
Proceedings of the 19th Chinese National Conference on Computational Linguistics

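Whether “Tales of the Marshes” (《水浒传》) was written by a single author or by several, and how Shi Nai'an and Luo Guanzhong relate to each other in its authorship, has long been disputed. This paper roughly groups the authorship hypotheses into five cases: written by Shi Nai'an; written by Luo Guanzhong; written by Shi and continued by Luo; written by Luo and continued by another author; and written by Shi and revised by Luo. Taking Luo Guanzhong's “Pingyaozhuan” (《平妖传》) as a reference, we examine the writing style of “Tales of the Marshes” using hypothesis testing, text clustering, text classification, and fluctuation-based stylometric measurement, combined with analysis of the text content, in an attempt to provide evidence for determining its authorship. The results show that only the hypothesis of Luo writing and another author continuing is likely, that is, the first 70 chapters were written by Luo Guanzhong and the rest were continued by someone else; the other four cases are all unlikely.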

2019

Building a Chinese AMR Bank with Concept and Relation Alignments
Bin Li | Yuan Wen | Li Song | Weiguang Qu | Nianwen Xue
Linguistic Issues in Language Technology, Volume 18, 2019 - Exploiting Parsed Corpora: Applications in Research, Pedagogy, and Processing

Abstract Meaning Representation (AMR) is a meaning representation framework in which the meaning of a full sentence is represented as a single-rooted, acyclic, directed graph. In this article, we describe an ongoing project to build a Chinese AMR (CAMR) corpus, which currently includes 10,149 sentences from the newsgroup and weblog portion of the Chinese TreeBank (CTB). We describe the annotation specifications for the CAMR corpus, which follow the annotation principles of English AMR but make adaptations where needed to accommodate the linguistic facts of Chinese. The CAMR specifications also include a systematic treatment of sentence-internal discourse relations. One significant change we have made to the AMR annotation methodology is the inclusion of the alignment between word tokens in the sentence and the concepts/relations in the CAMR annotation, which makes it easier for automatic parsers to model the correspondence between a sentence and its meaning representation. We develop an annotation tool for CAMR, and the inter-annotator agreement between the two annotators, as measured by the Smatch score, is 0.83, indicating reliable annotation. We also present some quantitative analysis of the CAMR corpus: 46.71% of the AMRs of the sentences are non-tree graphs. Moreover, the AMRs of 88.95% of the sentences contain concepts that are inferred from the context of the sentence and do not correspond to a specific word.
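To make the notion of a "non-tree graph" concrete, here is a minimal sketch of one way to test for it. It assumes the graph is already single-rooted, connected, and acyclic, as the abstract states, so the only remaining question is whether some node has more than one parent (a re-entrancy). The edge-list representation and the function names `is_tree` and `reentrant_nodes` are assumptions made for illustration, not part of the CAMR tooling.

```python
from collections import Counter


def reentrant_nodes(edges):
    """Return the nodes that have more than one incoming edge, sorted by name.

    `edges` is a list of (parent, relation, child) triples; this format is
    an assumption for the sketch, not the CAMR file format.
    """
    incoming = Counter(child for _parent, _relation, child in edges)
    return sorted(node for node, count in incoming.items() if count > 1)


def is_tree(edges):
    """A single-rooted, connected, acyclic graph is a tree exactly when
    no node has more than one parent."""
    return not reentrant_nodes(edges)


# Toy graph for "The boy wants to go": the boy (b) is an argument of both
# want (w) and go (g), so b has two parents and the graph is not a tree.
wants_to_go = [
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
    ("g", ":ARG0", "b"),
]

# Toy graph without the shared argument: every node has at most one parent.
no_sharing = [
    ("w", ":ARG0", "b"),
    ("w", ":ARG1", "g"),
]
```

Applying `is_tree` to every graph in a corpus and counting the failures would yield the kind of non-tree percentage reported in the abstract, under the stated assumptions.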

Ellipsis in Chinese AMR Corpus
Yihuan Liu | Bin Li | Peiyi Yan | Li Song | Weiguang Qu
Proceedings of the First International Workshop on Designing Meaning Representations

Ellipsis is very common in language, and natural language processing needs to restore the elided elements in a sentence. However, only a few corpora annotate ellipsis, which holds back its automatic detection and recovery. This paper introduces the annotation of ellipsis in Chinese sentences using a novel graph-based representation, Abstract Meaning Representation (AMR), which has a good mechanism for restoring the elided elements manually. We annotate 5,000 sentences selected from the Chinese TreeBank (CTB). We find that 54.98% of the sentences contain ellipses. 92% of the ellipses are restored by copying the antecedents’ concepts, and 12.9% of them are restored with newly added concepts. In addition, we find that the elided element is a word or phrase in most cases, but sometimes it is only the head of a phrase or part of a phrase, which makes automatic recovery of ellipsis rather hard.