2023
pdf
bib
abs
Enhancing Language Model with Unit Test Techniques for Efficient Regular Expression Generation
Chenhui Mao
|
Xiexiong Lin
|
Xin Jin
|
Xin Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Recent research has investigated the use of generative language models to produce regular expressions with semantic-based approaches. However, these approaches have shown shortcomings in practical applications, particularly in terms of functional correctness, which refers to the ability to reproduce the intended function inputs by the user. To address this issue, we present a novel method called Unit-Test Driven Reinforcement Learning (UTD-RL). Our approach differs from previous methods by taking into account the crucial aspect of functional correctness and transforming it into a differentiable gradient feedback using policy gradient techniques. In which functional correctness can be evaluated through Unit Tests, a testing method that ensures regular expressions meets its design and performs as intended. Experiments conducted on three public datasets demonstrate the effectiveness of the proposed method in generating regular expressions. This method has been employed in a regulatory scenario where regular expressions can be utilized to ensure that all online content is free from non-compliant elements, thereby significantly reducing the workload of relevant personnel.
pdf
abs
TMID: A Comprehensive Real-world Dataset for Trademark Infringement Detection in E-Commerce
Tongxin Hu
|
Zhuang Li
|
Xin Jin
|
Lizhen Qu
|
Xin Zhang
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
Annually, e-commerce platforms incur substantial financial losses due to trademark infringements, making it crucial to identify and mitigate potential legal risks tied to merchant information registered to the platforms. However, the absence of high-quality datasets hampers research in this area. To address this gap, our study introduces TMID, a novel dataset to detect trademark infringement in merchant registrations. This is a real-world dataset sourced directly from Alipay, one of the world’s largest e-commerce and digital payment platforms. As infringement detection is a legal reasoning task requiring an understanding of the contexts and legal rules, we offer a thorough collection of legal rules and merchant and trademark-related contextual information with annotations from legal experts. We ensure the data quality by performing an extensive statistical analysis. Furthermore, we conduct an empirical study on this dataset to highlight its value and the key challenges. Through this study, we aim to contribute valuable resources to advance research into legal compliance related to trademark infringement within the e-commerce sphere.
pdf
abs
TeamShakespeare at SemEval-2023 Task 6: Understand Legal Documents with Contextualized Large Language Models
Xin Jin
|
Yuchen Wang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
The growth of pending legal cases in populouscountries, such as India, has become a major is-sue. Developing effective techniques to processand understand legal documents is extremelyuseful in resolving this problem. In this pa-per, we present our systems for SemEval-2023Task 6: understanding legal texts (Modi et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model that considers the com-prehensive context information in both intra-and inter-sentence levels to predict rhetoricalroles (subtask A) and then train a Legal-LUKEmodel, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B).Our evaluations demonstrate that our designedmodels are more accurate than baselines, e.g.,with an up to 15.0% better F1 score in subtaskB. We achieved notable performance in the taskleaderboard, e.g., 0.834 micro F1 score, andranked No.5 out of 27 teams in subtask A.
2020
pdf
abs
On Construction of the ASR-oriented Indian English Pronunciation Dictionary
Xian Huang
|
Xin Jin
|
Qike Li
|
Keliang Zhang
Proceedings of the Twelfth Language Resources and Evaluation Conference
As a World English, a New English and a regional variety of English, Indian English (IE) has developed its own distinctive characteristics, especially phonologically, from other varieties of English. An Automatic Speech Recognition (ASR) system simply trained on British English (BE) /American English (AE) speech data and using the BE/AE pronunciation dictionary performs much worse when applied to IE. An applicable IEASR system needs spontaneous IE speech as training materials and a comprehensive, linguistically-guided IE pronunciation dictionary (IEPD) so as to achieve the effective mapping between the acoustic model and language model. This research builds a small IE spontaneous speech corpus, analyzes and summarizes the phonological variation features of IE, comes up with an IE phoneme set and complies the IEPD (including a common-English-word list, an Indian-word list, an acronym list and an affix list). Finally, two ASR systems are trained with 120 hours IE spontaneous speech data, using the IEPD we construct in this study and CMUdict separately. The two systems are tested with 50 audio clips of IE spontaneous speech. The result shows the system trained with IEPD performs better than the one trained with CMUdict with WER being 15.63% lower on the test data.