Long-term chatbots are expected to develop relationships with users. A major trend in recent long-term chatbot studies is to train systems on virtual long-term chat data called Multi-Session Chat (MSC), which is collected by having crowd workers, each playing a speaker with a defined persona, chat in text over multiple sessions. However, no investigation has examined whether such virtual long-term chat successfully simulates relationship-building between speakers. To clarify the difference between an actual long-term intimacy process and the intimacy process in MSC, this study collects real long-term chats and MSC in Japanese and compares them in terms of speech form and dialogue acts. The analysis of these factors suggests that speakers in MSC behave unnaturally in two opposing ways: they use non-polite speech levels as if their relationship were already close, yet they ask more questions than speakers in real long-term chats, as if their relationship were still shallow.
To augment the datasets used in research on scientific-document writing support, we extract the text of “Related Work” sections and citation information from PDF-formatted papers published in English. The previous dataset was constructed entirely from TeX-formatted papers, from which citation information is easy to extract. However, since many publicly available papers in various fields are provided only in PDF format, a dataset built solely from TeX papers has limited utility. To resolve this problem, we augment the existing dataset by extracting section titles using the visual features of PDF documents and then extracting the Related Work section text using this explicit title information. Since text generated from figures and footnotes appearing in the extraction target areas is considered noise, we remove such text. Moreover, we map cited-paper information obtained with existing tools to citation marks detected by regular-expression rules, yielding pairs of cited-paper information and Related Work text. Evaluations of body-text extraction and citation mapping show that the accuracy of the proposed dataset is close to that of the previous dataset. Accordingly, we demonstrate the feasibility of building a significantly augmented dataset.
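The regular-expression detection of citation marks mentioned above can be illustrated with a minimal sketch. The two patterns below (numeric brackets and author–year parentheses) are illustrative assumptions about common citation styles, not the rules actually used in the dataset construction:

```python
import re

# Hypothetical patterns for two common citation-mark styles:
#   numeric:     [1], [2, 3], [4-6]
#   author-year: (Smith et al., 2020), (Tanaka and Sato, 2019)
NUMERIC = re.compile(r"\[\d+(?:\s*[,-]\s*\d+)*\]")
AUTHOR_YEAR = re.compile(
    r"\([A-Z][A-Za-z-]+(?:\s+(?:et al\.|and\s+[A-Z][A-Za-z-]+))?,\s*\d{4}[a-z]?\)"
)

def find_citation_marks(text: str) -> list[str]:
    """Return all citation marks found in a Related Work sentence."""
    return NUMERIC.findall(text) + AUTHOR_YEAR.findall(text)

# Each detected mark would then be mapped to the cited paper's
# metadata (e.g., title and authors) obtained from an existing
# PDF-parsing tool, forming (cited-paper info, text) pairs.
```

In practice the actual rule set would need to cover more styles (superscript numbers, square-bracketed author names, etc.); the sketch only shows the mapping principle.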
With the increase in the number of published academic papers, growing expectations have been placed on research that supports the process of writing scientific papers. Recently, various tasks have been studied, such as citation worthiness (judging whether a sentence requires a citation), citation recommendation, and citation-text generation. However, since each task has been studied and evaluated on independently developed data, it is currently impossible to verify whether these tasks can be successfully pipelined for effective use in scientific-document writing. In this paper, we first define a series of tasks related to scientific-document writing that can be pipelined. We then create a dataset of academic papers that can be used to evaluate each task individually as well as the series of tasks as a whole. Finally, using the dataset, we evaluate citation worthiness and citation recommendation both separately and as an integrated pipeline. The results of our evaluations show that the proposed approach is promising.
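The pipelining of citation worthiness and citation recommendation described above can be sketched as follows. The classifiers here are hypothetical keyword-based stand-ins for illustration only, not the trained models evaluated in the paper:

```python
# Toy pipeline: a sentence is first judged for citation worthiness;
# only worthy sentences are passed on to citation recommendation.
# Both components below are hypothetical heuristic stand-ins.

def is_citation_worthy(sentence: str) -> bool:
    """Toy citation-worthiness check: flag sentences that attribute
    a claim or method to prior work (keyword heuristic)."""
    cues = ("proposed", "prior work", "previous work", "have shown")
    return any(cue in sentence.lower() for cue in cues)

def recommend_citations(sentence: str, index: dict[str, list[str]]) -> list[str]:
    """Toy citation recommendation: look up candidate paper titles
    via keyword overlap with a prebuilt keyword-to-titles index."""
    return [
        title
        for keyword, titles in index.items()
        if keyword in sentence.lower()
        for title in titles
    ]

def pipeline(sentence: str, index: dict[str, list[str]]) -> list[str]:
    # Recommend citations only for sentences judged citation-worthy.
    if not is_citation_worthy(sentence):
        return []
    return recommend_citations(sentence, index)
```

The point of the sketch is the interface between the two stages: evaluating them jointly, as the paper does, reveals whether errors in the worthiness stage propagate into the recommendation stage.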