Cross-lingual alignment of nuanced sociological concepts is crucial for comparative cross-cultural research, for harmonising longitudinal studies, and for leveraging knowledge from social science taxonomies (e.g., ELSST). However, aligning these concepts is challenging because of cultural context-dependency, linguistic variation, and data scarcity, particularly for low-resource languages. Existing methods often fail to capture domain-specific subtleties or require extensive parallel data. We ground our approach in a Vector Decomposition Hypothesis, which posits that embeddings contain separable domain and language components and is supported by observed language-pair-specific geometric structures. On this basis we propose DLIR (Dual-Branch LoRA for Invariant Representation). DLIR employs two parallel Low-Rank Adaptation (LoRA) branches: one captures core sociological semantics (trained primarily on English data structured by the ELSST hierarchy), while the other learns language invariance by counteracting language-specific perturbations. These perturbations are modeled by Gaussian Mixture Models (GMMs) fitted on minimal parallel concept data using spherical geometry. DLIR significantly outperforms strong baselines on cross-lingual sociological concept retrieval across 10 languages. As evidence of strong zero-shot knowledge transfer, English-trained DLIR substantially surpasses target-language (French/German) LoRA fine-tuning even on monolingual tasks. DLIR learns disentangled, language-robust representations, advancing resource-efficient multilingual understanding and enabling reliable cross-lingual comparison of sociological constructs.
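The decomposition idea can be illustrated with a deliberately simplified sketch. It does not reproduce DLIR: there are no LoRA branches, and a single mean offset stands in for the GMMs described above. All vectors, the `shift` direction, and the function names are hypothetical toy values chosen for illustration. The sketch assumes that a target-language embedding is a unit-normalized sum of a concept vector and a shared language shift, estimates that shift from two parallel concept pairs, and removes it from a held-out concept.

```python
from math import sqrt

def normalize(v):
    """Project a vector onto the unit sphere."""
    n = sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(normalize(a), normalize(b)))

def mean_offset(source, target):
    """Average displacement from source-language to target-language embeddings."""
    dim = len(source[0])
    return [
        sum(t[i] - s[i] for s, t in zip(source, target)) / len(source)
        for i in range(dim)
    ]

def remove_offset(v, offset):
    """Subtract the estimated language offset and return to the unit sphere."""
    return normalize([x - o for x, o in zip(v, offset)])

# Hypothetical toy embeddings for three concepts (not real model outputs).
english = [normalize(v) for v in ([1.0, 0.1, 0.0], [0.1, 1.0, 0.0], [0.7, 0.7, 0.0])]

# Assumption: the target language adds one shared shift before normalization.
shift = [0.0, 0.0, 0.5]
french = [normalize([x + s for x, s in zip(v, shift)]) for v in english]

# Fit the offset on two parallel pairs, then evaluate on the held-out third.
offset = mean_offset(english[:2], french[:2])
held_out_before = cosine(english[2], french[2])
held_out_after = cosine(english[2], remove_offset(french[2], offset))

print(f"before: {held_out_before:.3f}  after: {held_out_after:.3f}")
```

In this toy setting the held-out concept moves closer to its English counterpart once the offset is removed. A mixture model, as in the abstract, would allow several distinct perturbation modes per language pair rather than one average direction.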
Social media is recognized as an important source of insight into public opinion dynamics and social impacts because of the vast volume of textual data generated daily and the ‘unconstrained’ behavior of people interacting on these platforms. However, such analyses are challenging because of semantic shift, the phenomenon whereby word meanings evolve over time. This paper proposes an unsupervised dynamic word embedding method that captures longitudinal semantic shifts in social media data without predefined anchor words. The method leverages word co-occurrence statistics and dynamic updating to adapt embeddings over time, addressing the challenges of data sparsity, imbalanced distributions, and synergistic semantic effects. Evaluated on a large COVID-19 Twitter dataset, the method reveals semantic evolution patterns of vaccine- and symptom-related entities across different pandemic stages, as well as their potential correlations with real-world statistics. Our key contributions are the dynamic embedding technique, an empirical analysis of COVID-19 semantic shifts, and a discussion of how to enhance semantic shift modeling for computational social science research. This study makes it possible to capture longitudinal semantic dynamics on social media in order to understand public discourse and collective phenomena.
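A minimal sketch of how co-occurrence statistics and dynamic updating might interact is given below. The abstract does not specify the update rule, so the exponential blend in `update`, the `decay` and `window` parameters, and the toy posts are all assumptions made for illustration. Because every time slice is represented over the same context vocabulary, vectors from different slices can be compared directly, without anchor words or a separate alignment step.

```python
from collections import Counter
from math import sqrt

def cooccurrence(docs, window=2):
    """Count context words within `window` tokens of each target word."""
    counts = {}
    for doc in docs:
        tokens = doc.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            ctx = counts.setdefault(word, Counter())
            for j in range(lo, hi):
                if j != i:
                    ctx[tokens[j]] += 1
    return counts

def update(prev, current, decay=0.5):
    """Blend the previous state with the current slice (hypothetical rule)."""
    merged = {}
    for word in set(prev) | set(current):
        vec = Counter()
        for ctx, c in prev.get(word, {}).items():
            vec[ctx] += decay * c
        for ctx, c in current.get(word, {}).items():
            vec[ctx] += (1 - decay) * c
        merged[word] = vec
    return merged

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Hypothetical toy posts from two consecutive time slices.
slice_1 = ["the vaccine trial starts soon", "vaccine trial results pending"]
slice_2 = ["the vaccine appointment is booked", "vaccine appointment available today"]

state_1 = cooccurrence(slice_1)
state_2 = update(state_1, cooccurrence(slice_2))

# Semantic shift of "vaccine" measured as cosine distance between slices.
shift = 1 - cosine(state_1["vaccine"], state_2["vaccine"])
print(f"shift for 'vaccine': {shift:.3f}")
```

The blend keeps part of the earlier context distribution, which is one simple way to soften the effect of sparse or imbalanced slices. A real system would typically reweight the raw counts and reduce their dimensionality before comparing vectors.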
Online misogyny is a pernicious social problem that risks making online platforms toxic and unwelcoming to women. We present a new hierarchical taxonomy for online misogyny, as well as an expert-labelled dataset to enable automatic classification of misogynistic content. The dataset consists of 6567 labels for Reddit posts and comments. Because previous research has found that untrained crowdsourced annotators struggle to identify misogyny, we hired and trained annotators and provided them with robust annotation guidelines. We report baseline performance on the binary classification task, achieving an accuracy of 0.93 and an F1 of 0.43. The codebook and datasets are made freely available for future researchers.
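The gap between an accuracy of 0.93 and an F1 of 0.43 is typical of imbalanced binary tasks, and a short worked example shows how the two figures can coexist. The confusion-matrix counts below are hypothetical, chosen only so that the resulting metrics land near the reported values. They are not the paper's actual results, and the sketch assumes that F1 is computed on the positive (misogynistic) class.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 for the positive class."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts for 1000 items, 76 of which are positive.
classifier = binary_metrics(tp=26, fp=20, fn=50, tn=904)

# A trivial baseline that labels every item as negative.
always_negative = binary_metrics(tp=0, fp=0, fn=76, tn=924)

for name, m in (("classifier", classifier), ("always negative", always_negative)):
    print(f"{name}: accuracy={m['accuracy']:.2f}  f1={m['f1']:.2f}")
```

With these illustrative counts the trivial baseline already reaches an accuracy above 0.92 while scoring an F1 of zero, which is why F1 is the more informative number when the positive class is rare.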