Hongyu Sun

2025

Word frequency is a key variable in psycholinguistics, useful for modeling human familiarity with words even in the era of large language models (LLMs). Frequency in film subtitles has proved to be a particularly good approximation of everyday language exposure. For many languages, however, film subtitles are not easily available, or are overwhelmingly translated from English. We demonstrate that frequencies extracted from carefully processed YouTube subtitles provide an approximation comparable to, and often better than, the best currently available resources. Moreover, they are available for languages for which a high-quality subtitle or speech corpus does not exist. We use YouTube subtitles to construct frequency norms for five diverse languages, Chinese, English, Indonesian, Japanese, and Spanish, and evaluate their correlation with lexical decision time, word familiarity, and lexical complexity. In addition to being strongly correlated with two psycholinguistic variables, a simple linear regression on the new frequencies achieves a new high score on a lexical complexity prediction task in English and Japanese, surpassing both models trained on film subtitle frequencies and the LLM GPT-4. We publicly release our code, the frequency lists, fastText word embeddings, and statistical language models.

pdf bib abs
LoCt-Instruct: An Automatic Pipeline for Constructing Datasets of Logical Continuous Instructions
Hongyu Sun | Yusuke Sakai | Haruki Sakajo | Shintaro Ozaki | Kazuki Hayashi | Hidetaka Kamigaito | Taro Watanabe
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing

Continuous instruction following closely mirrors real-world tasks by requiring models to solve sequences of interdependent steps, yet existing multi-step instruction datasets suffer from three key limitations: (1) lack of logical coherence across turns, (2) narrow topical breadth and depth, and (3) reliance on rigid templates or heavy manual effort. We introduce LoCt-Pipeline, a novel pipeline that leverages modern LLMs’ reasoning capabilities to assemble rich, topic-related single-instruction data into multi-turn dialogues, producing chains that are logically coherent, progressively deepen in content, and span diverse domains without fixed templates or extensive human annotation. We employed this pipeline to construct LoCt-Instruct for assessing models’ problem-solving abilities. The generated chains serve as a testbed for benchmarking a variety of models, including reasoning-oriented architectures, instruction-tuned variants, and state-of-the-art closed-source LLMs on their capacity to follow and correctly respond to each step. Our results reveal a substantial performance gap between current LLMs and human solvers. These findings highlight the need for more robust continuous instruction following. We publicly release the dataset and end-to-end pipeline.

2024

pdf bib abs
Applying Contrastive Learning to Code Vulnerability Type Classification
Chen Ji | Su Yang | Hongyu Sun | Yuqing Zhang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing

Vulnerability classification is a crucial task in software security analysis, essential for identifying and mitigating potential security risks. Learning-based methods often perform poorly due to the long-tail distribution of vulnerability classification datasets. Recent approaches try to address the problem but treat each CWE class in isolation, ignoring their relationships. This results in non-scalable code vector representations, causing significant performance drops when handling complex real-world vulnerabilities. We propose a hierarchical contrastive learning framework for code vulnerability type classification to bring vector representations of related CWEs closer together. To address the issue of class collapse and enhance model robustness, we mix self-supervised contrastive learning loss into our loss function. Additionally, we employ max-pooling to enable the model to handle longer vulnerability code inputs. Extensive experiments demonstrate that our proposed framework outperforms state-of-the-art methods by 2.97%-17.90% on accuracy and 0.98%-22.27% on weighted-F1, with even better performance on higher-quality datasets. We also utilize an ablation study to prove each component’s contribution. These findings underscore the potential and advantages of our approach in the multi-class vulnerability classification task.

Co-authors

Hidetaka Kamigaito 1

Eunike Andriani Kardinata 1

Adam Nohejl 1

Maria Angelica Riera Machin 1

Su Yang 1

Venues

emnlp2
coling1

Fix author