Nelvin Licona-Guevara

2026

Modeling Cultural and Subcultural Variation in Code-Switched Discourse with Topic Annotation
Nemika Tyagi | Nelvin Licona-Guevara | Olga Kellert
Proceedings of the 4th Workshop on Cross-Cultural Considerations in NLP (C3NLP 2026)

Code-switching is often modeled in NLP as a structural or token-level phenomenon, overlooking its role as a discourse practice shaped by social and cultural context. In this work, we propose topic-based annotation as a framework for analyzing cultural and subcultural variation in bilingual discourse. Using large language models, we annotate 3,691 code-switched sentences from Spanish-English (Miami) and Spanish-Guaraní (Paraguay) corpora with topic and discourse-level information, integrating sociolinguistic metadata. Our analysis reveals systematic relationships between discourse topics, language choice, and social variables such as gender and language dominance. We observe subcultural variation within the Miami community and a clear diglossic distribution in Paraguay, where Guaraní is associated with formal domains and Spanish with informal communication. These findings suggest that modeling code-switching through discourse-level categories provides a more complete representation of multilingual communication and enables both cross-cultural and intra-cultural comparison at scale.

2025


Parsing the Switch: LLM-Based UD Annotation for Complex Code-Switched and Low-Resource Languages
Olga Kellert | Nemika Tyagi | Muhammad Imran | Nelvin Licona-Guevara | Carlos Gómez-Rodríguez
Findings of the Association for Computational Linguistics: EMNLP 2025

Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that BiLingua Pipeline achieves up to 95.29% LAS after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments.
