Nelvin Licona-Guevara


2026

Code-switching is often modeled in NLP as a structural or token-level phenomenon, overlooking its role as a discourse practice shaped by social and cultural context. In this work, we propose topic-based annotation as a framework for analyzing cultural and subcultural variation in bilingual discourse. Using large language models, we annotate 3,691 code-switched sentences from Spanish-English (Miami) and Spanish-Guaraní (Paraguay) corpora with topic and discourse-level information, integrating sociolinguistic metadata. Our analysis reveals systematic relationships between discourse topics, language choice, and social variables such as gender and language dominance. We observe subcultural variation within the Miami community and a clear diglossic distribution in Paraguay, where Guaraní is associated with formal domains and Spanish with informal communication. These findings suggest that modeling code-switching through discourse-level categories provides a more complete representation of multilingual communication and enables both cross-cultural and intra-cultural comparison at scale.

2025

Code-switching presents a complex challenge for syntactic analysis, especially in low-resource language settings where annotated data is scarce. While recent work has explored the use of large language models (LLMs) for sequence-level tagging, few approaches systematically investigate how well these models capture syntactic structure in code-switched contexts. Moreover, existing parsers trained on monolingual treebanks often fail to generalize to multilingual and mixed-language input. To address this gap, we introduce the BiLingua Pipeline, an LLM-based annotation pipeline designed to produce Universal Dependencies (UD) annotations for code-switched text. First, we develop a prompt-based framework for Spanish-English and Spanish-Guaraní data, combining few-shot LLM prompting with expert review. Second, we release two annotated datasets, including the first Spanish-Guaraní UD-parsed corpus. Third, we conduct a detailed syntactic analysis of switch points across language pairs and communicative contexts. Experimental results show that the BiLingua Pipeline achieves up to 95.29% labeled attachment score (LAS) after expert revision, significantly outperforming prior baselines and multilingual parsers. These results show that LLMs, when carefully guided, can serve as practical tools for bootstrapping syntactic resources in under-resourced, code-switched environments.
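
For readers unfamiliar with the metric reported above, LAS is conventionally the percentage of tokens whose predicted head and dependency label both match the gold annotation; the unlabeled variant (UAS) checks the head only. The abstract does not say which evaluation script was used, so the following is only a minimal sketch of the standard definition. The function name `attachment_scores`, the `(head, label)` token representation, and the three-token toy sentence are illustrative assumptions, not part of the BiLingua Pipeline.

```python
from typing import List, Tuple

# Hypothetical token representation: (head_index, dependency_label).
# Head index 0 stands for the artificial root, as in CoNLL-U files.
Token = Tuple[int, str]


def attachment_scores(gold: List[Token], pred: List[Token]) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages for two aligned token sequences."""
    if len(gold) != len(pred):
        raise ValueError("gold and predicted sequences must be the same length")
    if not gold:
        raise ValueError("cannot score an empty sequence")

    # UAS: the predicted head matches the gold head.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    # LAS: both the head and the dependency label match.
    las_hits = sum(g == p for g, p in zip(gold, pred))

    n = len(gold)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Toy example (invented for illustration): "I want café".
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]  # correct head, wrong label

uas, las = attachment_scores(gold, pred)
print(f"UAS = {uas:.2f}, LAS = {las:.2f}")
```

In this toy case every head is correct but one label is wrong, so UAS is higher than LAS. A published score would normally be computed over a full test set with an established evaluation script rather than a hand-rolled function like this one.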