Human Language Modeling

Nikita Soni, Matthew Matero, Niranjan Balasubramanian, H. Andrew Schwartz


Abstract
Natural language is generated by people, yet traditional language modeling treats words or documents as if they were generated independently. Here, we propose human language modeling (HuLM), a hierarchical extension to the language modeling problem whereby a human level exists to connect sequences of documents (e.g., social media messages) and to capture the notion that human language is moderated by changing human states. We introduce HaRT, a large-scale transformer model for solving HuLM, pre-trained on approximately 100,000 social media users, and demonstrate its effectiveness in terms of both language modeling (perplexity) for social media and fine-tuning for four downstream tasks spanning the document and user levels. Results on all tasks meet or surpass the current state-of-the-art.
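To make the hierarchical formulation concrete, the sketch below contrasts the standard autoregressive factorization with one way the HuLM objective can be written. The user-state symbol u_i and the update function f are illustrative labels chosen here, not necessarily the paper's exact notation.

% Standard language modeling: each document d = (w_1, ..., w_T) is
% factored over its own tokens, independently of the author's other documents.
\[
  p(d) = \prod_{t=1}^{T} p\left(w_t \mid w_{1:t-1}\right)
\]

% HuLM (illustrative notation): a user's documents d_1, ..., d_n are
% temporally ordered, and a recurrent human state u_{i-1} additionally
% conditions the tokens of document d_i, capturing changing human states.
\[
  p(d_1, \dots, d_n) = \prod_{i=1}^{n} \prod_{t=1}^{T_i}
    p\left(w_{i,t} \mid w_{i,1:t-1},\, u_{i-1}\right),
  \qquad u_i = f(u_{i-1}, d_i)
\]

Under this reading, HaRT's role is to implement both the state update f and the state-conditioned token distribution within a single transformer; the factorization above is a sketch of the problem statement, not a specification of the architecture.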
Anthology ID: 2022.findings-acl.52
Volume: Findings of the Association for Computational Linguistics: ACL 2022
Month: May
Year: 2022
Address: Dublin, Ireland
Editors: Smaranda Muresan, Preslav Nakov, Aline Villavicencio
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 622–636
URL: https://aclanthology.org/2022.findings-acl.52
DOI: 10.18653/v1/2022.findings-acl.52
Cite (ACL): Nikita Soni, Matthew Matero, Niranjan Balasubramanian, and H. Andrew Schwartz. 2022. Human Language Modeling. In Findings of the Association for Computational Linguistics: ACL 2022, pages 622–636, Dublin, Ireland. Association for Computational Linguistics.
Cite (Informal): Human Language Modeling (Soni et al., Findings 2022)
PDF: https://preview.aclanthology.org/nschneid-patch-5/2022.findings-acl.52.pdf
Video: https://preview.aclanthology.org/nschneid-patch-5/2022.findings-acl.52.mp4
Code: humanlab/hart