Yanxuan Yu
2026
SemToken: Semantic-Aware Tokenization for Efficient Long-Context Language Models
Dong Liu | Yanxuan Yu
Proceedings of the 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026)
Long-context language models face efficiency challenges as context lengths expand. Traditional tokenization methods like BPE operate on frequency statistics, ignoring semantic structure and over-tokenizing redundant spans. We propose SemToken, a semantic-aware tokenization framework that adaptively compresses token sequences based on semantic density. SemToken uses lightweight encoders to identify and merge semantically equivalent spans, allocates variable granularity based on local semantic density, and dynamically adjusts token budgets during generation. Evaluations on WikiText-103, LongBench, and BookSum demonstrate 2.4× token reduction, 1.9× inference speedup, and 67% memory reduction while preserving or improving model quality. SemToken integrates seamlessly with existing models and achieves multiplicative benefits when combined with FlashAttention (up to 2.7× total speedup).
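The idea of merging semantically equivalent spans can be illustrated with a minimal sketch. It assumes each token already has an embedding from some lightweight encoder, and it greedily merges adjacent tokens whose cosine similarity exceeds a threshold. The function names, the greedy adjacency rule, and the threshold of 0.9 are all hypothetical choices made for illustration; the abstract does not specify SemToken's actual merging procedure.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def merge_similar_spans(tokens, embeddings, threshold=0.9):
    """Greedily merge each token into the previous span when it is
    near-identical to that span's first token (an assumed rule)."""
    spans = []  # each entry: (tokens in the span, embedding of the span's first token)
    for tok, emb in zip(tokens, embeddings):
        if spans and cosine(spans[-1][1], emb) >= threshold:
            spans[-1][0].append(tok)
        else:
            spans.append(([tok], emb))
    return [" ".join(toks) for toks, _ in spans]


# Toy 2-d embeddings, invented for the example.
tokens = ["very", "very", "good", "movie"]
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
merged = merge_similar_spans(tokens, embeddings)
print(merged)  # ['very very', 'good', 'movie']
```

Here four input tokens collapse to three spans because the repeated token adds no new semantic content under the assumed similarity test.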
2025
HSGM: Hierarchical Segment-Graph Memory for Scalable Long-Text Semantics
Dong Liu | Yanxuan Yu
Proceedings of the 14th Joint Conference on Lexical and Computational Semantics (*SEM 2025)
Semantic parsing of long documents remains challenging due to quadratic growth in pairwise composition and memory requirements. We introduce Hierarchical Segment-Graph Memory (HSGM), a novel framework that decomposes an input of length N into M meaningful segments, constructs Local Semantic Graphs on each segment, and extracts compact summary nodes to form a Global Graph Memory. HSGM supports incremental updates (only newly arrived segments incur local graph construction and summary-node integration), while Hierarchical Query Processing locates relevant segments via top-K retrieval over summary nodes and then performs fine-grained reasoning within their local graphs. Theoretically, HSGM reduces worst-case complexity from O(N^2) to O(Nk + (N/k)^2), with segment size k ≪ N, and we derive Frobenius-norm bounds on the approximation error introduced by node summarization and sparsification thresholds. Empirically, on three benchmarks (long-document AMR parsing, segment-level semantic role labeling on OntoNotes, and legal event extraction), HSGM achieves 2–4× inference speedup, >60% reduction in peak memory, and ≥95% of baseline accuracy. Our approach unlocks scalable, accurate semantic modeling for ultra-long texts, enabling real-time and resource-constrained NLP applications.
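To make the complexity claim concrete, the short sketch below evaluates the two operation-count expressions from the abstract for one input size. It assumes all constant factors are 1 and that the input splits into equal segments of size k; the function names and the example values N = 10,000 and k = 100 are illustrative choices, not figures from the paper.

```python
import math


def full_pairwise_cost(n):
    """Pairwise composition over all n tokens: n^2 operations."""
    return n * n


def hsgm_cost(n, k):
    """Local graphs on n/k segments of size k (n*k in total),
    plus pairwise work over the n/k summary nodes."""
    segments = math.ceil(n / k)
    return n * k + segments ** 2


n, k = 10_000, 100
print(full_pairwise_cost(n))  # 100000000
print(hsgm_cost(n, k))        # 1010000
```

Under these assumptions the segmented form needs roughly 99 times fewer operations. This asymptotic ratio is not directly comparable to the measured 2–4× speedup, which also reflects constant factors and implementation costs.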
MT2ST: Adaptive Multi-Task to Single-Task Learning
Dong Liu | Yanxuan Yu
Proceedings of the 1st Workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR 2025)
We propose MT2ST, a general and efficient framework for accelerating multi-task training by progressively transitioning to single-task optimization. Unlike conventional multi-task learning (MTL) or single-task fine-tuning (STL), MT2ST dynamically adjusts the training focus via two complementary strategies: Diminish, which gradually down-weights auxiliary losses, and Switch, which explicitly switches to the primary task at a scheduled point. We demonstrate the effectiveness of MT2ST across three key paradigms: representation learning, transformers, and diffusion models, covering both unimodal (text/image) and multimodal (vision-language) tasks. Extensive experiments show that MT2ST significantly improves training efficiency—achieving up to 56% FLOPs compression—while maintaining or surpassing task performance. These results suggest MT2ST as a general-purpose solution for scalable and adaptive multi-task training. Although this work is general-purpose, it is especially suitable for multimodal settings such as VQA or vision-language retrieval, where auxiliary pretraining (e.g., masked language modeling or contrastive learning) often diverges from final objectives. We include a VQA case study and outline its efficiency for multimodal retrieval.
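The two strategies can be read as schedules for the weight placed on the auxiliary losses. The sketch below is one possible reading: Diminish is modeled as an exponential decay and Switch as a hard cutoff at a fixed step. The decay form, the `decay_rate` and `switch_step` parameters, and the function names are assumptions made for illustration; the abstract does not give the schedules actually used in MT2ST.

```python
import math


def diminish_weight(step, decay_rate=0.01):
    """Auxiliary-loss weight that shrinks smoothly as training proceeds
    (exponential decay is an assumed form)."""
    return math.exp(-decay_rate * step)


def switch_weight(step, switch_step=1000):
    """Auxiliary-loss weight that drops to zero at a scheduled step,
    leaving only the primary task."""
    return 1.0 if step < switch_step else 0.0


def total_loss(primary_loss, auxiliary_losses, aux_weight):
    """Primary loss plus the weighted sum of auxiliary losses."""
    return primary_loss + aux_weight * sum(auxiliary_losses)


# Example: the same losses early and late in training under the Switch schedule.
early = total_loss(1.0, [0.5, 0.5], switch_weight(10))    # auxiliary tasks active
late = total_loss(1.0, [0.5, 0.5], switch_weight(5000))   # primary task only
print(early, late)  # 2.0 1.0
```

Once the auxiliary weight reaches zero, the auxiliary forward and backward passes can be skipped entirely, which is one plausible source of the compute savings the abstract reports.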