K Dhanush Reddy


2020

Optimized Web-Crawling of Conversational Data from Social Media and Context-Based Filtering
Annapurna P Patil | Rajarajeswari Subramanian | Gaurav Karkal | Keerthana Purushotham | Jugal Wadhwa | K Dhanush Reddy | Meer Sawood
Proceedings of the Workshop on Joint NLP Modelling for Conversational AI @ ICON 2020

Building chatbots requires a large amount of conversational data. In this paper, a web crawler is designed to fetch multi-turn dialogues from websites such as Twitter, YouTube and Reddit in the form of a JavaScript Object Notation (JSON) file. Tools such as the Twitter Application Programming Interface (API), the LXML library, and the JSON library are used to crawl Twitter, YouTube and Reddit and collect conversational chat data. The data obtained in raw form cannot be used directly, as it contains only text and metadata, such as author name and time, that provide more information on the chat data being scraped. The collected data has to be formatted for the intended use case, and Python's JSON library makes this formatting easy. The scraped dialogues are further filtered based on the context of a search keyword, without introducing bias and with flexible strictness of classification.
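The formatting and filtering steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the record fields (`author`, `time`, `text`), the function names, and the scoring rule are all assumptions made for illustration. In particular, the case-insensitive substring match stands in for the paper's context-based filter, and the `strictness` threshold only mimics the idea of adjustable classification strictness.

```python
import json


def build_dialogue(raw_posts):
    """Order raw posts by timestamp and keep only the fields of interest.

    Assumes each post is a dict with 'author', 'time' (ISO 8601 string)
    and 'text' keys; these field names are hypothetical.
    """
    ordered = sorted(raw_posts, key=lambda post: post["time"])
    return {
        "turns": [
            {
                "author": post["author"],
                "time": post["time"],
                "text": post["text"].strip(),
            }
            for post in ordered
        ]
    }


def keyword_score(dialogue, keyword):
    """Return the fraction of turns whose text contains the keyword."""
    turns = dialogue["turns"]
    if not turns:
        return 0.0
    needle = keyword.lower()
    hits = sum(1 for turn in turns if needle in turn["text"].lower())
    return hits / len(turns)


def filter_dialogues(dialogues, keyword, strictness=0.5):
    """Keep dialogues whose keyword score is at least `strictness`."""
    return [d for d in dialogues if keyword_score(d, keyword) >= strictness]


if __name__ == "__main__":
    raw = [
        {"author": "b", "time": "2020-01-02T10:05:00",
         "text": " I like the new phone camera "},
        {"author": "a", "time": "2020-01-02T10:00:00",
         "text": "Anyone tried the new phone?"},
    ]
    dialogue = build_dialogue(raw)
    kept = filter_dialogues([dialogue], "phone", strictness=0.5)
    # Serialize the kept dialogues as a JSON document.
    print(json.dumps(kept, indent=2))
```

Raising `strictness` towards 1.0 keeps only dialogues in which nearly every turn mentions the keyword, while lowering it admits loosely related conversations.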