Hai Jin

2024

Large Language Models (LLMs) have shown remarkable progress in automated code generation. Yet, LLM-generated code may contain errors in API usage, class, data structure, or missing project-specific information. As much of this project-specific context cannot fit into the prompts of LLMs, we must find ways to allow the model to explore the project-level code context. We present CoCoGen, a new code generation approach that uses compiler feedback to improve the LLM-generated code. CoCoGen first leverages static analysis to identify mismatches between the generated code and the project’s context. It then iteratively aligns and fixes the identified errors using information extracted from the code repository. We integrate CoCoGen with two representative LLMs, i.e., GPT-3.5-Turbo and Code Llama (13B), and apply it to Python code generation. Experimental results show that CoCoGen significantly improves the vanilla LLMs by over 80% in generating code dependent on the project context and consistently outperforms the existing retrieval-based code generation baselines.

Code Pre-trained Models (CodePTMs) based vulnerability detection have achieved promising results over recent years. However, these models struggle to generalize as they typically learn superficial mapping from source code to labels instead of understanding the root causes of code vulnerabilities, resulting in poor performance in real-world scenarios beyond the training instances. To tackle this challenge, we introduce VulLLM, a novel framework that integrates multi-task learning with Large Language Models (LLMs) to effectively mine deep-seated vulnerability features. Specifically, we construct two auxiliary tasks beyond the vulnerability detection task. First, we utilize the vulnerability patches to construct a vulnerability localization task. Second, based on the vulnerability features extracted from patches, we leverage GPT-4 to construct a vulnerability interpretation task. VulLLM innovatively augments vulnerability classification by leveraging generative LLMs to understand complex vulnerability patterns, thus compelling the model to capture the root causes of vulnerabilities rather than overfitting to spurious features of a single task. The experiments conducted on six large datasets demonstrate that VulLLM surpasses seven state-of-the-art models in terms of effectiveness, generalization, and robustness.

pdf abs
Correcting Pronoun Homophones with Subtle Semantics in Chinese Speech Recognition
Zhaobo Zhang | Rui Gan | Pingpeng Yuan | Hai Jin
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Speech recognition is becoming prevalent in daily life. However, due to the similar semantic context of the entities and the overlap of Chinese pronunciation, the pronoun homophone, especially “他/她/它 (he/she/it)”, (their pronunciation is “Tā”) is usually recognized incorrectly. It poses a challenge to automatically correct them during the post-processing of Chinese speech recognition. In this paper, we propose three models to address the common confusion issues in this domain, tailored to various application scenarios. We implement the language model, the LSTM model with semantic features, and the rule-based assisted Ngram model, enabling our models to adapt to a wide range of requirements, from high-precision to low-resource offline devices. The extensive experiments show that our models achieve the highest recognition rate for “Tā” correction with improvements from 70% in the popular voice input methods up to 90%. Further ablation analysis underscores the effectiveness of our models in enhancing recognition accuracy. Therefore, our models improve the overall experience of Chinese speech recognition of “Tā” and reduce the burden of manual transcription corrections.

pdf abs
Harnessing the Power of Large Language Model for Uncertainty Aware Graph Processing
Zhenyu Qian | Yiming Qian | Yuting Song | Fei Gao | Hai Jin | Chen Yu | Xia Xie
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)

Handling graph data is one of the most difficult tasks. Traditional techniques, such as those based on geometry and matrix factorization, rely on assumptions about the data relations that become inadequate when handling large and complex graph data. On the other hand, deep learning approaches demonstrate promising results in handling large graph data, but they often fall short of providing interpretable explanations. To equip the graph processing with both high accuracy and explainability, we introduce a novel approach that harnesses the power of a large language model (LLM), enhanced by an uncertainty-aware module to provide a confidence score on the generated answer. We experiment with our approach on two graph processing tasks: few-shot knowledge graph completion and graph classification. Our results demonstrate that through parameter efficient fine-tuning, the LLM surpasses state-of-the-art algorithms by a substantial margin across ten diverse benchmark datasets. Moreover, to address the challenge of explainability, we propose an uncertainty estimation based on perturbation, along with a calibration scheme to quantify the confidence scores of the generated answers. Our confidence measure achieves an AUC of 0.8 or higher on seven out of the ten datasets in predicting the correctness of the answer generated by LLM.

2023

Temporal Knowledge Graph (TKG) reasoning, which focuses on leveraging temporal information to infer future facts in knowledge graphs, plays a vital role in knowledge graph completion. Typically, existing works for this task design graph neural networks and recurrent neural networks to respectively capture the structural and temporal information in KGs. Despite their effectiveness, in our practice, we find that they tend to suffer the issues of low training efficiency and insufficient generalization ability, which can be attributed to the over design of model architectures. To this end, this paper aims to figure out whether the current complex model architectures are necessary for temporal knowledge graph reasoning. As a result, we put forward a simple yet effective approach (termed SiMFy), which simply utilizes multilayer perceptron (MLP) to model the structural dependencies of events and adopts a fixed-frequency strategy to incorporate historical frequency during inference. Extensive experiments on real-world datasets demonstrate that our SiMFy can reach state-of-the-art performance with the following strengths: 1) faster convergence speed and better generalization ability; 2) a much smaller time consumption in the training process; and 3) better ability to capture the structural dependencies of events in KGs. These results provide evidence that the substitution of complex models with simpler counterparts is a feasible strategy.

2022

pdf abs
Can Language Models Serve as Temporal Knowledge Bases?
Ruilin Zhao | Feng Zhao | Guandong Xu | Sixiao Zhang | Hai Jin
Findings of the Association for Computational Linguistics: EMNLP 2022

Recent progress regarding the use of language models (LMs) as knowledge bases (KBs) has shown that language models can act as structured knowledge bases for storing relational facts. However, most existing works only considered the LM-as-KB paradigm in a static setting, which ignores the analysis of temporal dynamics of world knowledge. Furthermore, a basic function of KBs, i.e., the ability to store conflicting information (i.e., 1-N, N-1, and N-M relations), is underexplored. In this paper, we formulate two practical requirements for treating LMs as temporal KBs: (i) The capacity to store temporally-scoped knowledge that contains conflicting information and (ii) the ability to use stored knowledge for temporally-scoped knowledge queries. We introduce a new dataset called LAMA-TK which is aimed at probing temporally-scoped knowledge, and investigate the two above requirements to explore the LM-as-KB paradigm in the temporal domain. On the one hand, experiments show that LMs can memorize millions of temporally-scoped facts with relatively high accuracy and transfer stored knowledge to temporal knowledge queries, thereby expanding the LM-as-KB paradigm to the temporal domain. On the other hand, we show that memorizing conflicting information, which has been neglected by previous works, is still challenging for LMs and hinders the memorization of other unrelated one-to-one relationships.

pdf abs
OpticE: A Coherence Theory-Based Model for Link Prediction
Xiangyu Gui | Feng Zhao | Langjunqing Jin | Hai Jin
Proceedings of the 29th International Conference on Computational Linguistics

Knowledge representation learning is a key step required for link prediction tasks with knowledge graphs (KGs). During the learning process, the semantics of each entity are embedded by a vector or a point in a feature space. The distance between these points is a measure of semantic similarity. However, in a KG, while two entities may have similar semantics in some relations, they have different semantics in others. It is ambiguous to assign a fixed distance to depict the variant semantic similarity of entities. To alleviate the semantic ambiguity in KGs, we design a new embedding approach named OpticE, which is derived from the well-known physical phenomenon of optical interference. It is a lightweight and relation-adaptive model based on coherence theory, in which each entity’s semantics vary automatically regarding different relations. In addition, a unique negative sampling method is proposed to combine the multimapping properties and self-adversarial learning during the training process. The experimental results obtained on practical KG benchmarks show that the OpticE model, with elegant structures, can compete with existing link prediction methods.