A First Dataset for Film Age Appropriateness Investigation

Emad Mohamed, Le An Ha


Abstract
Film age appropriateness classification is an important problem with significant societal impact, yet it has so far fallen outside the interests of Natural Language Processing and Machine Learning researchers. To address this gap, we have collected a corpus of 17,000 films along with their age ratings. We use the textual content in an experiment to predict the correct age classification for the United States (G, PG, PG-13, R and NC-17) and the United Kingdom (U, PG, 12A, 15, 18 and R18). Our experiments indicate that gradient boosting machines outperform FastText and various Deep Learning architectures. For the US ratings, we reach an overall accuracy of 79.3%, compared to a projected superhuman accuracy of 84%. For the UK ratings, we reach an overall accuracy of 65.3%, compared to a projected superhuman accuracy of 80.0%.
Anthology ID:
2020.lrec-1.164
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1311–1317
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.164
Cite (ACL):
Emad Mohamed and Le An Ha. 2020. A First Dataset for Film Age Appropriateness Investigation. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1311–1317, Marseille, France. European Language Resources Association.
Cite (Informal):
A First Dataset for Film Age Appropriateness Investigation (Mohamed & Ha, LREC 2020)
PDF:
https://preview.aclanthology.org/remove-xml-comments/2020.lrec-1.164.pdf