Yanting Li


2026

Concrete words (e.g., apple) are often described in the literature to share more semantic features across languages than abstract words (e.g., appetite). We test this hypothesis using multilingual aligned word embeddings by measuring the distance between words and their nearest neighbor in other languages, and examining whether shorter distances predicted higher concreteness ratings in six languages: Dutch, English, French, Cypriot Greek, Mandarin, and Portuguese. The relationship between concreteness and cross-linguistic distance varied across languages, suggesting that concreteness does not uniformly correspond to cross-linguistic semantic relatedness. Our attempt highlights the potential of using aligned word embeddings for operationalizing psycholinguistic constructs.
This paper investigates how word predictability influences code-switching probability. We analyze 1,010 code-switched instances drawn from naturalistic sociolinguistic interviews with 41 Cantonese–English bilinguals across three bilingual groups (homeland, immersed, and heritage). In particular, we examine whether the predictability of switch points, operationalized as surprisal, influences the likelihood of code-switching. Using pretrained transformer-based language models, we estimate surprisal at the switch point under different modeling conditions, including autoregressive and masked models and varying amounts of contextual information. Mixed-effects logistic regressionanalyses show that less predictable words are more likely to be code-switched. These effects are largely consistent across model types and bilingual groups. Overall, these findings highlight the role of predictability in bilingual speech production and provide new insights into code-switching among bilingual speakers with diverse language experiences.
Goal-directed molecular generation requires satisfying heterogeneous constraints such as protein–ligand compatibility and multi-objective drug-like properties, yet existing methods often optimize these constraints in isolation, failing to reconcile conflicting objectives (e.g., affinity vs. safety), and struggle to navigate the non-differentiable chemical space without compromising structural validity. To address these challenges, we propose CAGenMol, a condition-aware discrete diffusion framework over molecular sequences that formulates molecular design as conditional denoising guided by heterogeneous structural and property signals. By coupling discrete diffusion with reinforcement learning, the model aligns the generation trajectory with non-differentiable objectives while preserving chemical validity and diversity. The non-autoregressive nature of diffusion language model further enables iterative refinement of molecular fragments at inference time. Experiments on structure-conditioned, property-conditioned, and dual-conditioned benchmarks demonstrate consistent improvements over state-of-the-art methods in binding affinity, drug-likeness, and success rate, highlighting the effectiveness of our framework. The code is available at https://github.com/Lee612-1/CAGenMol.

2025

Proteins are critical for various molecular functions, relying on their precise tertiary structures. This structure-sequence relationship is complex and degenerate, meaning multiple sequences can fold into a similar structure. The challenges in protein prediction, design, and modification increase with sequence complexity, while research on RNA-protein interactions, especially RNA-binding proteins (RBPs), is gaining importance. Large-scale pre-trained language models (LLMs) have shown promising results in handling biological sequences by treating them as natural language, though integrating spatial structures remains complex due to the need for specialized visual and 3D modeling approaches. We introduce a method to integrate protein 3D structural data within a sequence processing framework, converting 3D coordinates into discrete structure tokens using a VQ-VAE-like network. This simplifies the handling of 3D data, avoiding complex pipelines and facilitating a unified sequence-to-sequence processing model. Our approach demonstrates strong performance across a range of tasks, achieving high sequence recovery in inverse folding and protein-conditioned RNA design. These outstanding results demonstrate significant potential for application in complex biological systems research.
Creolization and code-switching are closely related contact-induced linguistic phenomena, yet little attention has been paid to the connection between them. In this paper, we propose an agent-based cognitive model which provides a linkage between these two phenomena focusing on the statistical regularization of language use. That is, we identify that creolization as a conventionalization process and code-switching as flexible language choice can emerge from the same cognitive model in different social environments. Our model postulates a social structure of bilingual and monolingual populations, in which a set of agents seek for optimal communicative strategy shaped by multiple cognitive constraints. The simulation results show that our model successfully captures both phenomena as two ends of a continuum, characterized by varying degrees of regularization in the use of linguistic constructions from multiple source languages. The model also reveals a subtle dynamic between social structure and individual-level cognitive constraints.

2024

2021

2020

The advent of natural language understanding (NLU) benchmarks for English, such as GLUE and SuperGLUE allows new NLU models to be evaluated across a diverse set of tasks. These comprehensive benchmarks have facilitated a broad range of research and applications in natural language processing (NLP). The problem, however, is that most such benchmarks are limited to English, which has made it difficult to replicate many of the successes in English NLU for other languages. To help remedy this issue, we introduce the first large-scale Chinese Language Understanding Evaluation (CLUE) benchmark. CLUE is an open-ended, community-driven project that brings together 9 tasks spanning several well-established single-sentence/sentence-pair classification tasks, as well as machine reading comprehension, all on original Chinese text. To establish results on these tasks, we report scores using an exhaustive set of current state-of-the-art pre-trained Chinese models (9 in total). We also introduce a number of supplementary datasets and additional tools to help facilitate further progress on Chinese NLU. Our benchmark is released at https://www.cluebenchmarks.com