Language as a fingerprint: Self-supervised learning of user encodings using transformers

Roberta Rocca, Tal Yarkoni


Abstract
The way we talk carries information about who we are. Demographics, personality, clinical conditions, and political preferences influence what we talk about and how, suggesting that many individual attributes could be inferred from adequate encodings of linguistic behavior. Conversely, conditioning text representations on author attributes has been shown to improve model performance in many NLP tasks. Previous research on individual differences and language representations has mainly focused on predicting selected attributes from text, or on conditioning text representations on such attributes for author-based contextualization. Here, we present a self-supervised approach to learning language-based user encodings using transformers. Using a large corpus of Reddit submissions, we fine-tune DistilBERT with a user-based triplet loss. We show that fine-tuned models can pick up on complex linguistic signatures of users, and that they are able to infer rich information about them. Through a series of intrinsic analyses and probing tasks, we provide evidence that fine-tuning enhances models’ ability to abstract generalizable user information, which yields performance advantages for user-based downstream tasks. We discuss applications in language-based assessment and in contextualized and personalized NLP.
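The abstract does not spell out the form of the user-based triplet loss. As a minimal sketch, assuming the standard margin-based formulation with Euclidean distance, the objective can be illustrated as follows: the anchor and the positive are encodings of texts by the same user, and the negative is an encoding of a text by a different user. The function names, the margin value, and the toy vectors are illustrative assumptions and are not taken from the paper.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(
    anchor: Sequence[float],
    positive: Sequence[float],
    negative: Sequence[float],
    margin: float = 1.0,  # assumed value, chosen only for illustration
) -> float:
    """Margin-based triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin`.
    """
    return max(
        0.0,
        euclidean(anchor, positive) - euclidean(anchor, negative) + margin,
    )


# Toy encodings (hypothetical): same-user text at distance 5 from the anchor,
# different-user text at distance 10.
anchor = [0.0, 0.0]
same_user = [3.0, 4.0]
other_user = [6.0, 8.0]

well_separated = triplet_loss(anchor, same_user, other_user)
violated = triplet_loss(anchor, other_user, same_user)
```

Under these assumptions, a triplet contributes no loss when the same-user encoding is already closer to the anchor than the other-user encoding by more than the margin, and contributes a positive loss otherwise.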
Anthology ID:
2022.findings-emnlp.123
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2022
Month:
December
Year:
2022
Address:
Abu Dhabi, United Arab Emirates
Editors:
Yoav Goldberg, Zornitsa Kozareva, Yue Zhang
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
1701–1714
URL:
https://preview.aclanthology.org/icon-24-ingestion/2022.findings-emnlp.123/
DOI:
10.18653/v1/2022.findings-emnlp.123
Cite (ACL):
Roberta Rocca and Tal Yarkoni. 2022. Language as a fingerprint: Self-supervised learning of user encodings using transformers. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1701–1714, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
Cite (Informal):
Language as a fingerprint: Self-supervised learning of user encodings using transformers (Rocca & Yarkoni, Findings 2022)
PDF:
https://preview.aclanthology.org/icon-24-ingestion/2022.findings-emnlp.123.pdf