Hong Ting Tsang
2026
AutoSchemaKG: Autonomous Knowledge Graph Construction through Dynamic Schema Induction from Web-Scale Corpora
Jiaxin Bai | Wei Fan | Qi Hu | Qing Zong | Chunyang Li | Hong Ting Tsang | Hongyu Luo | Yauwai Yim | Haoyu Huang | Xiao Zhou | Feng Qin | Tianshi Zheng | Xi Peng | Xin Yao | Huiwen Yang | Leijie Wu | JI Yi | Gong Zhang | Renhai Chen | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We present AutoSchemaKG, a framework for fully autonomous knowledge graph construction that eliminates the need for predefined schemas. Our system leverages large language models to simultaneously extract knowledge triples and induce comprehensive schemas directly from text, modeling both entities and events while employing conceptualization to organize instances into semantic categories. Processing over 50 million documents, we construct ATLAS (Automated Triple Linking And Schema induction), a family of knowledge graphs with 900+ million nodes and 5.9 billion edges. This approach outperforms state-of-the-art baselines on multi-hop QA tasks and enhances LLM factuality. Notably, our schema induction achieves 92% semantic alignment with human-crafted schemas with zero manual intervention, demonstrating that billion-scale knowledge graphs with dynamically induced schemas can effectively complement parametric knowledge in large language models.
ContextLens: Modeling Imperfect Privacy and Safety Context for Legal Compliance
Haoran Li | Yulin Chen | Huihao Jing | Wenbin Hu | Tsz Ho Li | Chanhou Lou | Hong Ting Tsang | Sirui Han | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Individuals’ concerns about data privacy and AI safety are highly contextualized and extend beyond sensitive patterns. Addressing these issues requires reasoning about the context to identify and mitigate potential risks. Though researchers have widely explored using large language models (LLMs) as evaluators for contextualized safety and privacy assessments, these efforts typically assume the availability of complete and clear context, whereas real-world contexts tend to be ambiguous and incomplete. In this paper, we propose ContextLens, a semi-rule-based framework that leverages LLMs to ground the input context in the legal domain and explicitly identify both known and unknown factors for legal compliance. Instead of directly assessing safety outcomes, ContextLens instructs LLMs to answer a set of crafted questions that span applicability, general principles, and detailed provisions to assess compliance with pre-defined priorities and rules. We conduct extensive experiments on existing compliance benchmarks that cover the General Data Protection Regulation (GDPR) and the EU AI Act. The results suggest that ContextLens can significantly improve LLMs’ compliance assessment and surpass existing baselines without any training. ContextLens can additionally identify ambiguous and missing factors.
AutoGraph-R1: End-to-End Reinforcement Learning for Knowledge Graph Construction
Hong Ting Tsang | Jiaxin Bai | Haoyu Huang | Qiao Xiao | Tianshi Zheng | Baixuan Xu | Shujie Liu | Yangqiu Song
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Building effective knowledge graphs (KGs) for Retrieval-Augmented Generation (RAG) is pivotal for advancing question answering (QA) systems. However, their effectiveness is hindered by a fundamental disconnect: the KG construction process is decoupled from its downstream application, yielding suboptimal graph structures. To bridge this gap, we introduce AutoGraph-R1, the first framework to directly optimize KG construction for task performance using Reinforcement Learning (RL). AutoGraph-R1 trains an LLM constructor by framing graph generation as a policy learning problem, where the reward is derived from the graph’s functional utility in a RAG pipeline. We design two novel, task-aware reward functions, one for graphs as knowledge carriers and another for graphs as knowledge indices. Across multiple QA benchmarks, AutoGraph-R1 consistently enables graph RAG methods to achieve significant performance gains over using task-agnostic baseline graphs. Our work shows it is possible to close the loop between construction and application, shifting the paradigm from building intrinsically ‘good’ graphs to building demonstrably ‘useful’ ones.
2025
From Automation to Autonomy: A Survey on Large Language Models in Scientific Discovery
Tianshi Zheng | Zheye Deng | Hong Ting Tsang | Weiqi Wang | Jiaxin Bai | Zihao Wang | Yangqiu Song
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) are catalyzing a paradigm shift in scientific discovery, evolving from task-specific automation tools into increasingly autonomous agents and fundamentally redefining research processes and human-AI collaboration. This survey systematically charts this burgeoning field, placing a central focus on the changing roles and escalating capabilities of LLMs in science. Through the lens of the scientific method, we introduce a foundational three-level taxonomy—Tool, Analyst, and Scientist—to delineate their escalating autonomy and evolving responsibilities within the research lifecycle. We further identify pivotal challenges and future research trajectories such as robotic automation, self-improvement, and ethical governance. Overall, this survey provides a conceptual architecture and strategic foresight to navigate and shape the future of AI-driven scientific discovery, fostering both rapid innovation and responsible advancement.