Haiyan Ding


2026

While Large Language Models (LLMs) excel in various general domains, they exhibit notable gaps in the highly specialized, knowledge-intensive, and legally regulated Chinese tax domain. Consequently, while tax-related benchmarks are gaining attention, many focus on isolated NLP tasks, neglecting real-world practical capabilities. To address this issue, we introduce TaxPraBen, the first dedicated benchmark for Chinese taxation practice. It combines 10 traditional application tasks, along with 3 pioneering real-world scenarios: tax risk prevention, tax inspection analysis, and tax strategy planning, sourced from 14 datasets totaling 7.3K instances. TaxPraBen features a scalable structured evaluation paradigm designed through process of "structured parsing—field alignment extraction—numerical and textual matching", enabling end-to-end tax practice assessment while being extensible to other domains. We evaluate 19 LLMs based on Bloom’s taxonomy. The results indicate significant performance disparities: all closed-source large-parameter LLMs excel, and Chinese LLMs like Qwen2.5 generally exceed multilingual LLMs, while the YaYi2 LLM, fine-tuned with some tax data, shows only limited improvement. TaxPraBen[<https://anonymous.4open.science/r/TaxPraBen/>] serves as a vital resource for advancing evaluations of LLMs in practical applications.

2023

This paper describes the method we submitted as the Janko team in the SemEval-2023 Task 2,Multilingual Complex Named Entity Recognition (MultiCoNER 2). We only participated in the Chinese track. In this paper, we implement the BERT-BiLSTM-RDrop model. We use the fine-tuned BERT models, take the output of BERT as the input of the BiLSTM network, and finally use R-Drop technology to optimize the loss function. Our submission achieved a macro-averaged F1 score of 0.579 on the testset.
Semi-supervised learning has promising performance in deep learning, one of the approaches is consistency training on a large amount of unlabeled data to constrain model predictions to be invariant to input noise. However, The degree of correlation between unlabeled data and task objective directly affects model prediction performance. This paper describes our system designed for SemEval-2023 Task 10: Explainable Detection of Online Sexism. We utilize a consistency training framework and data augmentation as the main strategy to train a model. The score obtained by our method is 0.8180 in subtask A, ranking 57 in all the teams.

2020

This article describes the system submitted to SemEval 2020 Task 5: Modelling Causal Reasoning in Language: Detecting Counterfactuals. In this task, we only participate in the subtask A which is detecting counterfactual statements. In order to solve this sub-task, first of all, because of the problem of data balance, we use the undersampling and oversampling methods to process the data set. Second, we used the ALBERT model and the maximum ensemble method based on the ALBERT model. Our methods achieved a F1 score of 0.85 in subtask A.
This article describes the system submitted to SemEval 2020 Task 4: Commonsense Validation and Explanation. We only participated in the subtask A, which is mainly to distinguish whether the sentence has meaning. To solve this task, we mainly used ALBERT model-based maximum ensemble with different training sizes and depths. To prove the validity of the model to the task, we also used some other neural network models for comparison. Our model achieved the accuracy score of 0.938(ranked 10/41) in subtask A.

2019

This paper describes the system submitted to SemEval 2019 Task 5: Multilingual detection of hate speech against immigrants and women in Twitter (hatEval). Its main purpose is to conduct hate speech detection on Twitter, which mainly includes two specific different targets, immigrants and women. We participate in both subtask A and subtask B for English. In order to address this task, we develope an ensemble of an attention-LSTM model based on HAN and an BiGRU-capsule model. Both models use fastText pre-trained embeddings, and we use this model in both subtasks. In comparison to other participating teams, our system is ranked 16th in the Sub-task A for English, and 12th in the Sub-task B for English.