Roberta Rocca
2025
S³ - Semantic Signal Separation
Márton Kardos | Jan Kostkan | Kenneth Enevoldsen | Arnault-Quentin Vermillet | Kristoffer Nielbo | Roberta Rocca
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Recent efforts have been oriented at incorporating contextual representations into topic modeling, and such approaches have been shown to outperform classical topic models. These approaches, however, are typically slow, volatile, and require heavy preprocessing for optimal results. We present Semantic Signal Separation (S³), a theory-driven topic modeling approach in neural embedding spaces. S³ conceptualizes topics as independent axes of semantic space and uncovers these by decomposing contextualized document embeddings using Independent Component Analysis. Our approach provides diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextual topic model, being, on average, 4.5x faster than the runner-up BERTopic. We offer an implementation of S³, and all contextual baselines, in the Turftopic Python package.
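As a rough illustration of the decomposition described in the abstract, the sketch below embeds documents with a sentence transformer and applies scikit-learn's FastICA to recover independent semantic axes, then ranks vocabulary terms along each axis. This is a minimal approximation of the idea, not the paper's implementation (that lives in the Turftopic package); the encoder name, vocabulary handling, and term-ranking step are illustrative assumptions.

```python
# Sketch of the core idea: topics as independent axes of embedding space,
# recovered by ICA over contextualized document embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The central bank raised interest rates again this quarter.",
    "The striker scored twice in the final minutes of the match.",
    "New telescope observations reveal a distant exoplanet atmosphere.",
    # ... more documents
]

# 1. Embed raw documents directly (no preprocessing required).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = encoder.encode(corpus)

# 2. Decompose embeddings into independent components (the topic axes).
n_topics = 3
ica = FastICA(n_components=n_topics, random_state=42)
doc_topic = ica.fit_transform(doc_embeddings)   # documents x topics

# 3. Interpret each axis by projecting vocabulary-term embeddings onto
#    the same components and reading off the highest-loading terms.
vocab = CountVectorizer().fit(corpus).get_feature_names_out()
word_embeddings = encoder.encode(list(vocab))
word_topic = ica.transform(word_embeddings)     # words x topics

for k in range(n_topics):
    top_words = vocab[np.argsort(word_topic[:, k])[::-1][:5]]
    print(f"Topic {k}: {', '.join(top_words)}")
```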
2022
Evaluating the role of non-lexical markers in GPT-2’s language modeling behavior
Roberta Rocca | Alejandro de la Vega
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Language as a fingerprint: Self-supervised learning of user encodings using transformers
Roberta Rocca | Tal Yarkoni
Findings of the Association for Computational Linguistics: EMNLP 2022
The way we talk carries information about who we are. Demographics, personality, clinical conditions, and political preferences influence what we speak about and how, suggesting that many individual attributes could be inferred from adequate encodings of linguistic behavior. Conversely, conditioning text representations on author attributes has been shown to improve model performance in many NLP tasks. Previous research on individual differences and language representations has mainly focused on predicting selected attributes from text, or on conditioning text representations on such attributes for author-based contextualization. Here, we present a self-supervised approach to learning language-based user encodings using transformers. Using a large corpus of Reddit submissions, we fine-tune DistilBERT on a user-based triplet loss. We show that fine-tuned models can pick up on complex linguistic signatures of users, and that they are able to infer rich information about them. Through a series of intrinsic analyses and probing tasks, we provide evidence that fine-tuning enhances models’ ability to abstract generalizable user information, which yields performance advantages for user-based downstream tasks. We discuss applications in language-based assessment and contextualized and personalized NLP.
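The training signal described in the abstract can be sketched as follows: fine-tune DistilBERT so that posts by the same user embed closer together than posts by different users, via a triplet margin loss. The data handling, pooling, and hyperparameters below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of user-based triplet fine-tuning of DistilBERT.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.train()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)


def embed(texts):
    """Mean-pool token embeddings into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # (B, H)


# One hypothetical training step: anchor and positive are posts by the
# same user, the negative is a post by a different user.
anchor_texts   = ["Finally fixed the rear derailleur on my bike last night."]
positive_texts = ["Anyone else obsess over getting their shifters indexed perfectly?"]
negative_texts = ["Just finished season two, the plot twist was unreal."]

loss = triplet_loss(embed(anchor_texts), embed(positive_texts), embed(negative_texts))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```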