Clustering LLM-based Word Embeddings to Determine Topics from Bangla Articles

Rifat Rahman

Abstract

Topic modeling methods identify the underlying themes of textual documents, making the insights they contain easier to understand. Traditional topic modeling approaches rely on a generative probabilistic process that assumes document-topic and topic-word distributions. As a result, they fail to capture semantic similarities among the words in a document and scale poorly as the number of topics and documents grows. This paper presents a method for capturing topics from Bangla documents by clustering word vectors derived from LLMs. Corpus statistics are integrated into both the clustering step and the reordering of words within each cluster (topic) to extract the top words. Additionally, we apply dimensionality reduction techniques, such as PCA, prior to clustering. Finally, we perform a comparative study to identify the best-performing combination of clustering and word embedding methods. Our top-performing combination outperforms the traditional probabilistic topic model in capturing topics and the top words per topic, and is notably more efficient in terms of computation and time complexity.
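The pipeline described in the abstract (reduce dimensionality, cluster the word vectors, then reorder the words in each cluster using corpus statistics) can be illustrated with a minimal sketch. The abstract does not specify the exact algorithms, so everything below is an assumption made for illustration: frequency-weighted k-means stands in for the clustering step, raw term frequency stands in for the corpus statistics, and the tiny hand-made vectors, romanized tokens, and function names (`pca_reduce`, `weighted_kmeans`, `top_words_per_topic`) are hypothetical rather than taken from the paper.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n_components].T

def weighted_kmeans(X, weights, k, n_iter=50, seed=0):
    """Lloyd's k-means where each point pulls its centroid in proportion
    to its weight (here: term frequency, an illustrative assumption)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            mask = labels == j
            if mask.any():  # leave an empty cluster's center unchanged
                centers[j] = np.average(X[mask], axis=0, weights=weights[mask])
    # Final assignment against the converged centers.
    dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1), centers

def top_words_per_topic(vocab, vectors, counts, k, n_components=2, n_top=3):
    """Cluster word vectors and list each cluster's most frequent words."""
    reduced = pca_reduce(vectors, n_components)
    labels, _ = weighted_kmeans(reduced, counts.astype(float), k)
    topics = []
    for j in range(k):
        members = np.flatnonzero(labels == j)
        # Reorder the cluster's words by descending corpus frequency.
        ranked = members[np.argsort(-counts[members], kind="stable")]
        topics.append([vocab[i] for i in ranked[:n_top]])
    return topics

# Toy stand-ins for LLM-derived embeddings (hypothetical values).
vocab = ["khela", "dol", "gol", "vote", "sorkar", "nirbachon"]
vectors = np.array([
    [1.0, 0.1, 0.0, 0.0],
    [0.9, 0.0, 0.1, 0.0],
    [1.1, 0.0, 0.0, 0.1],
    [0.0, 1.0, 0.1, 0.0],
    [0.1, 0.9, 0.0, 0.0],
    [0.0, 1.1, 0.0, 0.1],
])
counts = np.array([30, 10, 20, 5, 25, 15])  # hypothetical term frequencies

topics = top_words_per_topic(vocab, vectors, counts, k=2)
for words in topics:
    print(words)
```

In a real setting the toy matrix would be replaced by embeddings from the chosen model, and the reordering score could combine frequency with distance to the cluster centroid; the abstract leaves those choices open.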

Anthology ID: 2025.banglalp-1.25
Volume: Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025)
Month: December
Year: 2025
Address: Mumbai, India
Editors: Firoj Alam, Sudipta Kar, Shammur Absar Chowdhury, Naeemul Hassan, Enamul Hoque Prince, Mohiuddin Tasnim, Md Rashad Al Hasan Rony, Md Tahmid Rahman Rahman
Venues: BanglaLP | WS
Publisher: Association for Computational Linguistics
Pages: 309–321
URL: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.25/
Cite (ACL): Rifat Rahman. 2025. Clustering LLM-based Word Embeddings to Determine Topics from Bangla Articles. In Proceedings of the Second Workshop on Bangla Language Processing (BLP-2025), pages 309–321, Mumbai, India. Association for Computational Linguistics.
Cite (Informal): Clustering LLM-based Word Embeddings to Determine Topics from Bangla Articles (Rahman, BanglaLP 2025)
PDF: https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.banglalp-1.25.pdf