Xin Jin


2023

pdf
TeamShakespeare at SemEval-2023 Task 6: Understand Legal Documents with Contextualized Large Language Models
Xin Jin | Yuchen Wang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)

The growth of pending legal cases in populouscountries, such as India, has become a major is-sue. Developing effective techniques to processand understand legal documents is extremelyuseful in resolving this problem. In this pa-per, we present our systems for SemEval-2023Task 6: understanding legal texts (Modi et al., 2023). Specifically, we first develop the Legal-BERT-HSLN model that considers the com-prehensive context information in both intra-and inter-sentence levels to predict rhetoricalroles (subtask A) and then train a Legal-LUKEmodel, which is legal-contextualized and entity-aware, to recognize legal entities (subtask B).Our evaluations demonstrate that our designedmodels are more accurate than baselines, e.g.,with an up to 15.0% better F1 score in subtaskB. We achieved notable performance in the taskleaderboard, e.g., 0.834 micro F1 score, andranked No.5 out of 27 teams in subtask A.

2020

pdf
On Construction of the ASR-oriented Indian English Pronunciation Dictionary
Xian Huang | Xin Jin | Qike Li | Keliang Zhang
Proceedings of the Twelfth Language Resources and Evaluation Conference

As a World English, a New English and a regional variety of English, Indian English (IE) has developed its own distinctive characteristics, especially phonologically, from other varieties of English. An Automatic Speech Recognition (ASR) system simply trained on British English (BE) /American English (AE) speech data and using the BE/AE pronunciation dictionary performs much worse when applied to IE. An applicable IEASR system needs spontaneous IE speech as training materials and a comprehensive, linguistically-guided IE pronunciation dictionary (IEPD) so as to achieve the effective mapping between the acoustic model and language model. This research builds a small IE spontaneous speech corpus, analyzes and summarizes the phonological variation features of IE, comes up with an IE phoneme set and complies the IEPD (including a common-English-word list, an Indian-word list, an acronym list and an affix list). Finally, two ASR systems are trained with 120 hours IE spontaneous speech data, using the IEPD we construct in this study and CMUdict separately. The two systems are tested with 50 audio clips of IE spontaneous speech. The result shows the system trained with IEPD performs better than the one trained with CMUdict with WER being 15.63% lower on the test data.