Roberta Rocca
2025
S³ - Semantic Signal Separation
Márton Kardos | Jan Kostkan | Kenneth Enevoldsen | Arnault-Quentin Vermillet | Kristoffer Nielbo | Roberta Rocca
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Topic models are useful tools for discovering latent semantic structures in large textual corpora. Recent efforts have been oriented at incorporating contextual representations into topic modeling, and such approaches have been shown to outperform classical topic models. These approaches, however, are typically slow, volatile, and require heavy preprocessing for optimal results. We present Semantic Signal Separation (S³), a theory-driven topic modeling approach in neural embedding spaces. S³ conceptualizes topics as independent axes of semantic space and uncovers these by decomposing contextualized document embeddings using Independent Component Analysis. Our approach provides diverse and highly coherent topics, requires no preprocessing, and is demonstrated to be the fastest contextual topic model, being, on average, 4.5x faster than the runner-up BERTopic. We offer an implementation of S³, and all contextual baselines, in the Turftopic Python package.
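As a rough illustration of the decomposition described in the abstract, the sketch below embeds documents with a sentence transformer and applies scikit-learn's FastICA to recover independent semantic axes, then ranks vocabulary terms along each axis. This is a minimal approximation of the idea, not the paper's implementation (that lives in the Turftopic package); the encoder name, vocabulary handling, and term-ranking step are illustrative assumptions.

```python
# Sketch of the core idea: topics as independent axes of embedding space,
# recovered by ICA over contextualized document embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.decomposition import FastICA
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The central bank raised interest rates again this quarter.",
    "The striker scored twice in the final minutes of the match.",
    "New telescope observations reveal a distant exoplanet atmosphere.",
    # ... more documents
]

# 1. Embed raw documents directly (no preprocessing required).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = encoder.encode(corpus)

# 2. Decompose embeddings into independent components (the topic axes).
n_topics = 3
ica = FastICA(n_components=n_topics, random_state=42)
doc_topic = ica.fit_transform(doc_embeddings)   # documents x topics

# 3. Interpret each axis by projecting vocabulary-term embeddings onto
#    the same components and reading off the highest-loading terms.
vocab = CountVectorizer().fit(corpus).get_feature_names_out()
word_embeddings = encoder.encode(list(vocab))
word_topic = ica.transform(word_embeddings)     # words x topics

for k in range(n_topics):
    top_words = vocab[np.argsort(word_topic[:, k])[::-1][:5]]
    print(f"Topic {k}: {', '.join(top_words)}")
```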
2022
Evaluating the role of non-lexical markers in GPT-2’s language modeling behavior
Roberta Rocca | Alejandro de la Vega
Proceedings of the 3rd Workshop on Evaluation and Comparison of NLP Systems
Language as a fingerprint: Self-supervised learning of user encodings using transformers
Roberta Rocca | Tal Yarkoni
Findings of the Association for Computational Linguistics: EMNLP 2022
The way we talk carries information about who we are. Demographics, personality, clinical conditions, and political preferences influence what we speak about and how, suggesting that many individual attributes could be inferred from adequate encodings of linguistic behavior. Conversely, conditioning text representations on author attributes has been shown to improve model performance in many NLP tasks. Previous research on individual differences and language representations has mainly focused on predicting selected attributes from text, or on conditioning text representations on such attributes for author-based contextualization. Here, we present a self-supervised approach to learning language-based user encodings using transformers. Using a large corpus of Reddit submissions, we fine-tune DistilBERT on a user-based triplet loss. We show that fine-tuned models can pick up on complex linguistic signatures of users, and that they are able to infer rich information about them. Through a series of intrinsic analyses and probing tasks, we provide evidence that fine-tuning enhances models’ ability to abstract generalizable user information, which yields performance advantages for user-based downstream tasks. We discuss applications in language-based assessment and contextualized and personalized NLP.
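The training signal described in the abstract can be sketched as follows: fine-tune DistilBERT so that posts by the same user embed closer together than posts by different users, via a triplet margin loss. The data handling, pooling, and hyperparameters below are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of user-based triplet fine-tuning of DistilBERT.
import torch
from torch import nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
encoder = AutoModel.from_pretrained("distilbert-base-uncased")
encoder.train()
triplet_loss = nn.TripletMarginLoss(margin=1.0)
optimizer = torch.optim.AdamW(encoder.parameters(), lr=2e-5)


def embed(texts):
    """Mean-pool token embeddings into one vector per text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state      # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # (B, H)


# One hypothetical training step: anchor and positive are posts by the
# same user, the negative is a post by a different user.
anchor_texts   = ["Finally fixed the rear derailleur on my bike last night."]
positive_texts = ["Anyone else obsess over getting their shifters indexed perfectly?"]
negative_texts = ["Just finished season two, the plot twist was unreal."]

loss = triplet_loss(embed(anchor_texts), embed(positive_texts), embed(negative_texts))
loss.backward()
optimizer.step()
optimizer.zero_grad()
```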