WordWars: A Dataset to Examine the Natural Selection of Words

Saif M. Mohammad


Abstract
There is a growing body of work on how word meaning changes over time: mutation. In contrast, there is very little work on how different words compete to represent the same meaning, and how the degree of success of words in that competition changes over time: natural selection. We present a new dataset, WordWars, with historical frequency data from the early 1800s to the early 2000s for monosemous English words in over 5000 synsets. We explore three broad questions with the dataset: (1) to what degree have the predominant words in these synsets changed, (2) how do prominent word features such as frequency, length, and concreteness impact natural selection, and (3) what are the differences between the predominant words of the 2000s and the predominant words of the early 1800s. We show that close to one third of the synsets undergo a change in the predominant word in this time period. Manual annotation of the resulting pairs of old and new predominant words shows that about 15% are orthographic variations, 25% involve affix changes, and 60% have completely different roots. We find that frequency, length, and concreteness all impact natural selection, albeit in different ways.
Anthology ID:
2020.lrec-1.377
Volume:
Proceedings of the Twelfth Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
3087–3095
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.377
Cite (ACL):
Saif M. Mohammad. 2020. WordWars: A Dataset to Examine the Natural Selection of Words. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 3087–3095, Marseille, France. European Language Resources Association.
Cite (Informal):
WordWars: A Dataset to Examine the Natural Selection of Words (Mohammad, LREC 2020)
PDF:
https://preview.aclanthology.org/ingestion-script-update/2020.lrec-1.377.pdf