Tarec Fares
2021
FANATIC: FAst Noise-Aware TopIc Clustering
Ari Silburt | Anja Subasic | Evan Thompson | Carmeline Dsilva | Tarec Fares
Findings of the Association for Computational Linguistics: EMNLP 2021
Extracting salient topics from a collection of documents can be a challenging task when a) the amount of data is large, b) the number of topics is not known a priori, and/or c) “topic noise” is present. We define “topic noise” as the collection of documents that are irrelevant to any coherent topic and should be filtered out. By design, most clustering algorithms (e.g., k-means, hierarchical clustering) assign all input documents to one of the available clusters, which guarantees that any topic noise propagates into the result. To address these challenges, we present a novel algorithm, FANATIC, that efficiently distinguishes documents that belong to genuine topics from those that are topic noise. We also introduce a new Reddit dataset to showcase FANATIC, as it contains short, noisy data that is difficult to cluster using most clustering algorithms. We find that FANATIC clusters 500k Reddit titles (of which 20% are topic noise) in 2 minutes and achieves an AMI score of 0.59, in contrast with hdbscan (McInnes et al., 2017), a popular algorithm suited for this type of task, which requires over 7 hours and achieves an AMI of 0.03. Finally, we test FANATIC on a Twitter dataset and find again that it outperforms the other algorithms, with an AMI score of 0.60. We make our code and data publicly available.
A Practical 2-step Approach to Assist Enterprise Question-Answering Live Chat
Ling-Yen Liao | Tarec Fares
Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Live chat in customer service platforms is critical for serving clients online. For multi-turn question-answering live chat, typical Question Answering systems are single-turn and focus on factoid questions; alternatively, modeling the task as goal-oriented dialogue limits us to narrower domains. Motivated by these challenges, we develop a new approach based on a framework from a different discipline: Community Question Answering. Specifically, we divide the task into two sub-tasks: (1) Question-Question Similarity, where we gain more than 9% absolute improvement in F1 over the baseline; and (2) Answer Utterances Extraction, where we achieve a high F1 score of 87% for this new sub-task. Further, our user engagement metrics reveal how enterprise support representatives benefit from the 2-step approach we deployed to production.