2025
Structural Patent Classification Using Label Hierarchy Optimization
Mengting Gui | Shufeng Hao | Chongyang Shi | Qi Zhang
Findings of the Association for Computational Linguistics: EMNLP 2025
Patent classification is a fundamental step in the patent examination process, directly impacting the efficiency and quality of substantive review. Existing methods mostly focus on general texts such as titles and abstracts, ignoring the claims that carry the key technical content and the citation relationships among them. Meanwhile, these approaches treat labels as independent targets, failing to exploit the semantic and structural information within the label taxonomy. To address these problems, we propose a Claim Structure based Patent Classification model with Label Awareness (CSPC-LA). The method first uses the citation relationships among patent claims to construct a citation graph and a co-reference graph. Structural graph learning is then applied to both graphs to mine the internal logic of the claims. Finally, we optimize the tree hierarchy of IPC labels and employ tree propagation learning to enhance the patent representation. Extensive experiments on the latest patent classification dataset from the USPTO demonstrate that the proposed method is more effective than state-of-the-art baselines.
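The abstract outlines a three-stage pipeline: building a citation graph and a co-reference graph over claims, running structural graph learning on both, and propagating information over the IPC label tree. The sketch below is only an illustrative reading of that pipeline in plain PyTorch; the module names, the graph-convolution form, the toy claim encodings, and the parent-mixing step for labels are assumptions, not the authors' released code.

```python
# Illustrative sketch only: a minimal reading of the CSPC-LA pipeline described
# above (claim citation graph + co-reference graph -> graph learning -> label
# tree propagation). Module names, shapes, and toy data are assumptions.
import torch
import torch.nn as nn


def normalized_adj(edges, n: int) -> torch.Tensor:
    """Build a symmetrically normalized adjacency matrix with self-loops."""
    a = torch.eye(n)
    for i, j in edges:
        a[i, j] = a[j, i] = 1.0
    d_inv_sqrt = torch.diag(a.sum(dim=1).pow(-0.5))
    return d_inv_sqrt @ a @ d_inv_sqrt


class ClaimGraphLayer(nn.Module):
    """One graph-convolution step over a claim graph (citation or co-reference)."""

    def __init__(self, dim: int):
        super().__init__()
        self.lin = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        return torch.relu(adj @ self.lin(x))


class CSPCLASketch(nn.Module):
    def __init__(self, dim: int, num_labels: int):
        super().__init__()
        self.cite_gnn = ClaimGraphLayer(dim)   # citation graph branch
        self.coref_gnn = ClaimGraphLayer(dim)  # co-reference graph branch
        self.label_emb = nn.Embedding(num_labels, dim)
        self.cls = nn.Linear(dim, num_labels)

    def forward(self, claims, cite_adj, coref_adj, label_parent):
        # Structural graph learning on both claim graphs, then mean pooling.
        h = self.cite_gnn(claims, cite_adj) + self.coref_gnn(claims, coref_adj)
        doc = h.mean(dim=0)
        # Tree propagation: each IPC label mixes in its parent's embedding.
        labels = self.label_emb.weight
        labels = labels + 0.5 * labels[label_parent]
        # Label-aware scoring of the pooled patent representation.
        return torch.sigmoid(labels @ doc + self.cls(doc))


# Toy usage: 4 claims, 6 IPC labels whose parent indices encode the tree.
model = CSPCLASketch(dim=32, num_labels=6)
claims = torch.randn(4, 32)                      # stand-in claim encodings
cite_adj = normalized_adj([(1, 0), (2, 0), (3, 2)], 4)
coref_adj = normalized_adj([(1, 2), (2, 3)], 4)
label_parent = torch.tensor([0, 0, 0, 1, 1, 2])  # parent index per label
print(model(claims, cite_adj, coref_adj, label_parent).shape)  # torch.Size([6])
```

Here the two claim graphs share the same convolution form and the label tree is flattened into a parent-index vector; the paper's actual hierarchy optimization is presumably more involved.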
GameTox: A Comprehensive Dataset and Analysis for Enhanced Toxicity Detection in Online Gaming Communities
Usman Naseem | Shuvam Shiwakoti | Siddhant Bikram Shah | Surendrabikram Thapa | Qi Zhang
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers)
The prevalence of toxic behavior in online gaming communities necessitates robust detection methods to ensure user safety. We introduce GameTox, a novel dataset comprising 53K game chat utterances annotated for toxicity detection through intent classification and slot filling. This dataset captures the complex relationship between user intent and specific linguistic features that contribute to toxic interactions. We extensively analyze the dataset to uncover key insights into the nature of toxic speech in gaming environments. Furthermore, we establish baseline performance metrics using state-of-the-art natural language processing and large language models, demonstrating the dataset’s contribution towards enhancing the detection of toxic behavior and revealing the limitations of contemporary models. Our results indicate that leveraging both intent detection and slot filling provides a significantly more granular and context-aware understanding of harmful messages. This dataset serves as a valuable resource to train advanced models that can effectively mitigate toxicity in online gaming and foster healthier digital spaces. Our dataset is publicly available at: https://github.com/shucoll/GameTox.
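Since the dataset pairs an utterance-level intent label with token-level slot annotations, a common baseline is a single encoder with two heads trained jointly. The sketch below illustrates that setup with a toy BiLSTM encoder; the intent and slot label sets, model sizes, and vocabulary are hypothetical placeholders, not the GameTox schema.

```python
# Minimal sketch of a joint intent-classification + slot-filling baseline for
# chat toxicity, in the spirit of the task described above. Label sets, model
# sizes, and the toy vocabulary are illustrative assumptions.
import torch
import torch.nn as nn

INTENTS = ["harassment", "hate_speech", "threat", "non_toxic"]  # assumed labels
SLOTS = ["O", "B-target", "I-target", "B-slur"]                 # assumed BIO tags


class JointIntentSlot(nn.Module):
    """Shared encoder with one utterance-level head and one token-level head."""

    def __init__(self, vocab_size: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.encoder = nn.LSTM(dim, dim, batch_first=True, bidirectional=True)
        self.intent_head = nn.Linear(2 * dim, len(INTENTS))
        self.slot_head = nn.Linear(2 * dim, len(SLOTS))

    def forward(self, token_ids: torch.Tensor):
        h, _ = self.encoder(self.emb(token_ids))         # (batch, seq, 2*dim)
        intent_logits = self.intent_head(h.mean(dim=1))  # one intent per message
        slot_logits = self.slot_head(h)                  # one tag per token
        return intent_logits, slot_logits


# Toy usage with a fake 5-token utterance; a real baseline would train with a
# summed cross-entropy loss over both heads.
model = JointIntentSlot(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 5))
intent_logits, slot_logits = model(tokens)
print(intent_logits.shape, slot_logits.shape)  # (1, 4) and (1, 5, 4)
```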
2023
Reducing Knowledge Noise for Improved Semantic Analysis in Biomedical Natural Language Processing Applications
Usman Naseem | Surendrabikram Thapa | Qi Zhang | Liang Hu | Anum Masood | Mehwish Nasim
Proceedings of the 5th Clinical Natural Language Processing Workshop
Graph-based techniques have gained traction for representing and analyzing data in various natural language processing (NLP) tasks. Knowledge graph-based language representation models have shown promising results in leveraging domain-specific knowledge for NLP tasks, particularly in the biomedical field. However, such models have limitations, including knowledge noise and neglect of contextual relationships, leading to potential semantic errors and reduced accuracy. To address these issues, this paper proposes two novel methods. The first combines a knowledge graph-based language model with nearest-neighbor models to incorporate semantic and category information from neighboring instances. The second integrates a knowledge graph-based language model with graph neural networks (GNNs) to leverage feature information from neighboring nodes in the graph. Experiments on relation extraction (RE) and classification tasks over English and Chinese datasets demonstrate significant performance improvements with both methods, highlighting their potential for enhancing language models and improving NLP applications in the biomedical domain.
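The first method described above, combining a knowledge graph-based language model with nearest-neighbor models, can be illustrated as a kNN interpolation over a datastore of encoded training instances. The sketch below is a generic version of that idea; the distance metric, temperature, mixing weight, and toy datastore are assumptions rather than the paper's configuration.

```python
# Illustrative sketch: interpolating a knowledge-graph-based language model's
# class probabilities with a nearest-neighbour vote over retrieved training
# instances. The datastore, temperature, and mixing weight are assumptions.
import torch
import torch.nn.functional as F


def knn_interpolate(query_vec, model_probs, store_vecs, store_labels,
                    k: int = 4, temperature: float = 10.0, lam: float = 0.5):
    """Mix model probabilities with a softmax-weighted vote of the k nearest
    stored instances (by Euclidean distance in representation space)."""
    dists = torch.cdist(query_vec.unsqueeze(0), store_vecs).squeeze(0)
    knn_d, knn_idx = dists.topk(k, largest=False)
    weights = F.softmax(-knn_d / temperature, dim=0)
    num_classes = model_probs.shape[-1]
    knn_probs = torch.zeros(num_classes).scatter_add_(
        0, store_labels[knn_idx], weights)
    return lam * model_probs + (1.0 - lam) * knn_probs


# Toy usage: 3 relation classes, a datastore of 8 encoded training instances.
torch.manual_seed(0)
store_vecs = torch.randn(8, 16)           # stand-in KG-LM representations
store_labels = torch.randint(0, 3, (8,))  # their gold relation labels
query_vec = torch.randn(16)               # representation of the test instance
model_probs = torch.tensor([0.2, 0.5, 0.3])
print(knn_interpolate(query_vec, model_probs, store_vecs, store_labels))
```

The second method, integrating the language model with GNNs over neighboring nodes, would follow the same pattern as the graph-convolution layer sketched for the patent model above, so it is not repeated here.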