Lamiece Hassan
2025
CAST: Corpus-Aware Self-similarity Enhanced Topic modelling
Yanan Ma
|
Chenghao Xiao
|
Chenhan Yuan
|
Sabine N Van Der Veer
|
Lamiece Hassan
|
Chenghua Lin
|
Goran Nenadic
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Topic modelling is a pivotal unsupervised machine learning technique for extracting valuable insights from large document collections. Existing neural topic modelling methods often encode contextual information of documents, while ignoring contextual details of candidate centroid words, leading to the inaccurate selection of topic words due to the *contextualization gap*. In parallel, it is found that functional words are frequently selected over topical words. To address these limitations, we introduce **CAST**: **C**orpus-**A**ware **S**elf-similarity Enhanced **T**opic modelling, a novel topic modelling method that builds upon candidate centroid word embeddings contextualized on the dataset, and a novel self-similarity-based method to filter out less meaningful tokens. Inspired by findings in contrastive learning that self-similarities of functional token embeddings in different contexts are much lower than topical tokens, we find self-similarity to be an effective metric to prevent functional words from acting as candidate topic words. Our approach significantly enhances the coherence and diversity of generated topics, as well as the topic model’s ability to handle noisy data. Experiments on news benchmark datasets and one Twitter dataset demonstrate the method’s superiority in generating coherent, diverse topics, and handling noisy data, outperforming strong baselines.
2021
A Joint Training Approach to Tweet Classification and Adverse Effect Extraction and Normalization for SMM4H 2021
Mohab Elkaref
|
Lamiece Hassan
Proceedings of the Sixth Social Media Mining for Health (#SMM4H) Workshop and Shared Task
In this work we describe our submissions to the Social Media Mining for Health (SMM4H) 2021 Shared Task. We investigated the effectiveness of a joint training approach to Task 1, specifically classification, extraction and normalization of Adverse Drug Effect (ADE) mentions in English tweets. Our approach performed well on the normalization task, achieving an above average f1 score of 24%, but less so on classification and extraction, with f1 scores of 22% and 37% respectively. Our experiments also showed that a larger dataset with more negative results led to stronger results than a smaller more balanced dataset, even when both datasets have the same positive examples. Finally we also submitted a tuned BERT model for Task 6: Classification of Covid-19 tweets containing symptoms, which achieved an above average f1 score of 96%.
Search
Fix data
Co-authors
- Mohab El-karef 1
- Chenghua Lin 1
- Yanan Ma 1
- Goran Nenadic 1
- Sabine N Van Der Veer 1
- show all...