Henry Gorelick


2021

Syntax and Themes: How Context Free Grammar Rules and Semantic Word Association Influence Book Success
Henry Gorelick | Biddut Sarker Bijoy | Syeda Jannatus Saba | Sudipta Kar | Md Saiful Islam | Mohammad Ruhul Amin
Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021)

In this paper, we attempt to improve upon the state of the art in predicting a novel’s success by modeling the lexical semantic relationships of its contents. We created the largest dataset used in such a project, containing lexical data from 17,962 books from Project Gutenberg. We utilized domain-specific feature reduction techniques to implement the most accurate models to date for predicting book success, with our best model achieving an average accuracy of 94.0%. By analyzing the model parameters, we extracted the successful semantic relationships from books of 12 different genres. Finally, we mapped those semantic relations to a set of themes, as defined in Roget’s Thesaurus, and discovered the themes that successful books of a given genre prioritize. At the end of the paper, we further show that our models demonstrate similar performance for book success prediction even when Goodreads rating is used instead of download count to measure success.

A Study on Using Semantic Word Associations to Predict the Success of a Novel
Syeda Jannatus Saba | Biddut Sarker Bijoy | Henry Gorelick | Sabir Ismail | Md Saiful Islam | Mohammad Ruhul Amin
Proceedings of *SEM 2021: The Tenth Joint Conference on Lexical and Computational Semantics

Many new books are published every year, and only a fraction of them become popular among readers. Predicting a book’s success can therefore be a very useful input for publishers making decisions. This article presents a study of semantic word associations, using the word embedding of book content for a set of Roget’s Thesaurus concepts, for book success prediction. In this work, we discuss a method to represent a book as a spectrum of concepts based on the association score between its content embedding and a global embedding (i.e. fastText) for a set of semantically linked word clusters. We show that the semantic word associations outperform the previous methods for book success prediction. In addition, we show that semantic word associations also provide better results than features such as the frequency of word groups in Roget’s Thesaurus, LIWC (a popular tool for linguistic inquiry and word count), NRC (word association emotion lexicon), and part of speech (PoS). Our study reports that concept associations based on Roget’s Thesaurus, using the word embeddings of individual novels, achieve state-of-the-art performance of 0.89 average weighted F1-score for book success prediction. Finally, we present a set of dominant themes that contribute to the popularity of a book in a specific genre.
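The weighted F1-score reported above is, in its standard definition, the per-class F1 averaged with each class weighted by its share of the true labels. The sketch below illustrates that metric only; it is not the authors' evaluation code, and the function name `weighted_f1` and the label lists are hypothetical.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted average of per-class F1 scores."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one label is required")

    support = Counter(y_true)  # number of true instances per class
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = n - tp
        # F1 = 2TP / (2TP + FP + FN); the denominator is at least n > 0.
        f1 = 2 * tp / (2 * tp + fp + fn)
        score += (n / total) * f1
    return score


# Hypothetical labels: 1 = successful book, 0 = unsuccessful book.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(round(weighted_f1(y_true, y_pred), 3))  # prints 0.667
```

In this toy example the successful class has F1 = 0.75 with support 4 and the unsuccessful class has F1 = 0.5 with support 2, giving (4/6)(0.75) + (2/6)(0.5) ≈ 0.667.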