Kuicai Dong


2024

pdf
MC-indexing: Effective Long Document Retrieval via Multi-view Content-aware Indexing
Kuicai Dong | Derrick Goh Xin Deik | Yi Quan Lee | Hao Zhang | Xiangyang Li | Cong Zhang | Yong Liu
Findings of the Association for Computational Linguistics: EMNLP 2024

Long document question answering (DocQA) aims to answer questions from long documents over 10k words. They usually contain content structures such as sections, sub-sections, and paragraph demarcations. However, the indexing methods of long documents remain under-explored, while existing systems generally employ fixed-length chunking. As they do not consider content structures, the resultant chunks can exclude vital information or include irrelevant content. Motivated by this, we propose the **M**ulti-view **C**ontent-aware indexing (**MC-indexing**) for more effective long DocQA via (i) segment structured document into content chunks, and (ii) represent each content chunk in raw-text, keywords, and summary views. We highlight that MC-indexing requires neither training nor fine-tuning. Having plug-and-play capability, it can be seamlessly integrated with any retrievers to boost their performance. Besides, we propose a long DocQA dataset that includes not only question-answer pair, but also document structure and answer scope. When compared to state-of-art chunking schemes, MC-indexing has significantly increased the recall by **42.8%**, **30.0%**, **23.9%**, and **16.3%** via top k = 1.5, 3, 5, and 10 respectively. These improved scores are the average of 8 widely used retrievers (2 sparse and 6 dense) via extensive experiments.

2023

pdf
Open Information Extraction via Chunks
Kuicai Dong | Aixin Sun | Jung-jae Kim | Xiaoli Li
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Open Information Extraction (OIE) aims to extract relational tuples from open-domain sentences. Existing OIE systems split a sentence into tokens and recognize token spans as tuple relations and arguments. We instead propose Sentence as Chunk sequence (SaC) and recognize chunk spans as tuple relations and arguments. We argue that SaC has better properties for OIE than sentence as token sequence, and evaluate four choices of chunks (i.e., CoNLL chunks, OIA simple phrases, noun phrases, and spans from SpanOIE). Also, we propose a simple end-to-end BERT-based model, Chunk-OIE, for sentence chunking and tuple extraction on top of SaC. Chunk-OIE achieves state-of-the-art results on multiple OIE datasets, showing that SaC benefits the OIE task.

pdf
From Speculation Detection to Trustworthy Relational Tuples in Information Extraction
Kuicai Dong | Aixin Sun | Jung-jae Kim | Xiaoli Li
Findings of the Association for Computational Linguistics: EMNLP 2023

Speculation detection is an important NLP task to identify text factuality. However, the extracted speculative information (e.g., speculative polarity, cue, and scope) lacks structure and poses challenges for direct utilization in downstream tasks. Open Information Extraction (OIE), on the other hand, extracts structured tuples as facts, without examining the certainty of these tuples. Bridging this gap between speculation detection and information extraction becomes imperative to generate structured speculative information and trustworthy relational tuples. Existing studies on speculation detection are defined at sentence level; but even if a sentence is determined to be speculative, not all factual tuples extracted from it are speculative. In this paper, we propose to study speculations in OIE tuples and determine whether a tuple is speculative. We formally define the research problem of tuple-level speculation detection. We then conduct detailed analysis on the LSOIE dataset which provides labels for speculative tuples. Lastly, we propose a baseline model SpecTup for this new research task.

2022

pdf
Syntactic Multi-view Learning for Open Information Extraction
Kuicai Dong | Aixin Sun | Jung-Jae Kim | Xiaoli Li
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing

Open Information Extraction (OpenIE) aims to extract relational tuples from open-domain sentences. Traditional rule-based or statistical models were developed based on syntactic structure of sentence, identified by syntactic parsers. However, previous neural OpenIE models under-explored the useful syntactic information. In this paper, we model both constituency and dependency trees into word-level graphs, and enable neural OpenIE to learn from the syntactic structures. To better fuse heterogeneous information from the two graphs, we adopt multi-view learning to capture multiple relationships from them. Finally, the finetuned constituency and dependency representations are aggregated with sentential semantic representations for tuple generation. Experiments show that both constituency and dependency information, and the multi-view learning are effective.

2021

pdf
DocOIE: A Document-level Context-Aware Dataset for OpenIE
Kuicai Dong | Zhao Yilin | Aixin Sun | Jung-Jae Kim | Xiaoli Li
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021