Recent works on knowledge base question answering (KBQA) retrieve subgraphs for easier reasoning. The desired subgraph is crucial as a small one may exclude the answer but a large one might introduce more noises. However, the existing retrieval is either heuristic or interwoven with the reasoning, causing reasoning on the partial subgraphs, which increases the reasoning bias when the intermediate supervision is missing. This paper proposes a trainable subgraph retriever (SR) decoupled from the subsequent reasoning process, which enables a plug-and-play framework to enhance any subgraph-oriented KBQA model. Extensive experiments demonstrate SR achieves significantly better retrieval and QA performance than existing retrieval methods. Via weakly supervised pre-training as well as the end-to-end fine-tuning, SR achieves new state-of-the-art performance when combined with NSM (He et al., 2021), a subgraph-oriented reasoner, for embedding-based KBQA methods. Codes and datasets are available online (https://github.com/RUCKBReasoning/SubgraphRetrievalKBQA)
Stories or narratives are comprised of a sequence of events. To compose interesting stories, professional writers often leverage a creative writing technique called *flashback* that inserts past events into current storylines as we commonly observe in novels and plays. However, it is challenging for machines to generate *flashback* as it requires a solid understanding of event **temporal order** (e.g. *feeling hungry* before *eat*, not vice versa), and the creativity to arrange storylines so that earlier events do not always appear first in **narrative order**. Two major issues in existing systems that exacerbate the challenges: 1) temporal bias in pertaining and story datasets that leads to monotonic event temporal orders; 2) lack of explicit guidance that helps machines decide where to insert *flashbacks*. We propose to address these issues using structured storylines to encode events and their pair-wise temporal relations (before, after and vague) as **temporal prompts** that guide how stories should unfold temporally. We leverage a Plan-and-Write framework enhanced by reinforcement learning to generate storylines and stories end-to-end. Evaluation results show that the proposed method can generate more interesting stories with *flashbacks* while maintaining textual diversity, fluency, and temporal coherence.
Story generation is a task that aims to automatically generate a meaningful story. This task is challenging because it requires high-level understanding of the semantic meaning of sentences and causality of story events. Naivesequence-to-sequence models generally fail to acquire such knowledge, as it is difficult to guarantee logical correctness in a text generation model without strategic planning. In this study, we focus on planning a sequence of events assisted by event graphs and use the events to guide the generator. Rather than using a sequence-to-sequence model to output a sequence, as in some existing works, we propose to generate an event sequence by walking on an event graph. The event graphs are built automatically based on the corpus. To evaluate the proposed approach, we incorporate human participation, both in event planning and story generation. Based on the largescale human annotation results, our proposed approach has been shown to provide more logically correct event sequences and stories compared with previous approaches.
Latent Dirichlet allocation (LDA), a widely used topic model, is often employed as a fundamental tool for text analysis in various applications. However, the training process of the LDA model typically requires massive text corpus data. On one hand, such massive data may expose private information in the training data, thereby incurring significant privacy concerns. On the other hand, the efficiency of the LDA model training may be impacted, since LDA training often needs to handle these massive text corpus data. To address the privacy issues in LDA model training, some recent works have combined LDA training algorithms that are based on collapsed Gibbs sampling (CGS) with differential privacy. Nevertheless, these works usually have a high accumulative privacy budget due to vast iterations in CGS. Moreover, these works always have low efficiency due to handling massive text corpus data. To improve the privacy guarantee and efficiency, we combine a subsampling method with CGS and propose a novel LDA training algorithm with differential privacy, SUB-LDA. We find that subsampling in CGS naturally improves efficiency while amplifying privacy. We propose a novel metric, the efficiency–privacy function, to evaluate improvements of the privacy guarantee and efficiency. Based on a conventional subsampling method, we propose an adaptive subsampling method to improve the model’s utility produced by SUB-LDA when the subsampling ratio is small. We provide a comprehensive analysis of SUB-LDA, and the experiment results validate its efficiency and privacy guarantee improvements.
Few-shot knowledge graph completion is to infer the unknown facts (i.e., query head-tail entity pairs) of a given relation with only a few observed reference entity pairs. Its general process is to first encode the implicit relation of an entity pair and then match the relation of a query entity pair with the relations of the reference entity pairs. Most existing methods have thus far encoded an entity pair and matched entity pairs by using the direct neighbors of concerned entities. In this paper, we propose the P-INT model for effective few-shot knowledge graph completion. First, P-INT infers and leverages the paths that can expressively encode the relation of two entities. Second, to capture the fine grained matches, P-INT calculates the interactions of paths instead of mix- ing them for each entity pair. Extensive experimental results demonstrate that P-INT out- performs the state-of-the-art baselines by 11.2– 14.2% in terms of Hits@1. Our codes and datasets are online now.
Generating texts in scientific papers requires not only capturing the content contained within the given input but also frequently acquiring the external information called context. We push forward the scientific text generation by proposing a new task, namely context-aware text generation in the scientific domain, aiming at exploiting the contributions of context in generated texts. To this end, we present a novel challenging large-scale Scientific Paper Dataset for ConteXt-Aware Text Generation (SciXGen), consisting of well-annotated 205,304 papers with full references to widely-used objects (e.g., tables, figures, algorithms) in a paper. We comprehensively benchmark, using state-of-the-arts, the efficacy of our newly constructed SciXGen dataset in generating description and paragraph. Our dataset and benchmarks will be made publicly available to hopefully facilitate the scientific text generation research.
Knowledge Base Question Answering (KBQA) is to answer natural language questions posed over knowledge bases (KBs). This paper targets at empowering the IR-based KBQA models with the ability of numerical reasoning for answering ordinal constrained questions. A major challenge is the lack of explicit annotations about numerical properties. To address this challenge, we propose a pretraining numerical reasoning model consisting of NumGNN and NumTransformer, guided by explicit self-supervision signals. The two modules are pretrained to encode the magnitude and ordinal properties of numbers respectively and can serve as model-agnostic plugins for any IR-based KBQA model to enhance its numerical reasoning ability. Extensive experiments on two KBQA benchmarks verify the effectiveness of our method to enhance the numerical reasoning ability for IR-based KBQA models.
We introduce PreCo, a large-scale English dataset for coreference resolution. The dataset is designed to embody the core challenges in coreference, such as entity representation, by alleviating the challenge of low overlap between training and test sets and enabling separated analysis of mention detection and mention clustering. To strengthen the training-test overlap, we collect a large corpus of 38K documents and 12.5M words which are mostly from the vocabulary of English-speaking preschoolers. Experiments show that with higher training-test overlap, error analysis on PreCo is more efficient than the one on OntoNotes, a popular existing dataset. Furthermore, we annotate singleton mentions making it possible for the first time to quantify the influence that a mention detector makes on coreference resolution performance. The dataset is freely available at https://preschool-lab.github.io/PreCo/.