Jheng-Long Wu


2022

Combining Word Vector Technique and Clustering Algorithm for Credit Card Merchant Detection
Fang-Ju Lee | Ying-Chun Lo | Jheng-Long Wu
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Extracting relevant user behavior from customers’ transaction descriptions is one way to collect customer information. In current text mining research, most studies focus on text classification, and only a few address text clustering. This study looks for the relationships between characters and words in unstructured transaction descriptions. It uses word embeddings and text mining techniques to overcome the limitation that classification categories must be defined in advance, to establish automatic identification and analysis methods, and to improve the accuracy of grouping. We use Jieba to segment the Chinese text of credit card transaction descriptions, extract features with Word2Vec, and run cross-combination experiments with Density-Based Spatial Clustering of Applications with Noise (DBSCAN) and Hierarchical Agglomerative Clustering. The resulting average F1 across the MUC, B3 and CEAF metrics is 67.58%.
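As a rough illustration of the density-based clustering step named in this abstract, the sketch below implements a minimal DBSCAN in plain Python. It is not the authors' pipeline: the 2-D `toy_vectors` are hypothetical stand-ins for Word2Vec embeddings of segmented transaction descriptions, and the `eps` and `min_pts` values are chosen only for the example.

```python
from math import dist

NOISE = -1  # label for points that belong to no dense region


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: returns one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already assigned to a cluster or marked as noise
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep growing the cluster
        cluster += 1
    return labels


# Hypothetical toy vectors: two tight groups and one isolated point.
toy_vectors = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0),
]
labels = dbscan(toy_vectors, eps=0.5, min_pts=2)
```

With these toy inputs the two tight groups receive labels 0 and 1 and the isolated point is marked as noise, which mirrors how unseen merchant descriptions can be grouped without predefined categories.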

A Dimensional Valence-Arousal-Irony Dataset for Chinese Sentence and Context
Sheng-Wei Huang | Wei-Yi Chung | Yu-Hsuan Wu | Chen-Chia Yu | Jheng-Long Wu
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Chinese multi-dimensional sentiment detection is a challenging task with a considerable impact on semantic understanding. Existing irony datasets annotate only the sentiment type of whole ironic sentences; they do not provide the corresponding valence and arousal intensities for the sentences and their context. However, an ironic statement is defined as a statement whose apparent meaning is the opposite of its actual meaning, so understanding the actual meaning of a sentence requires contextual information. Therefore, the dimensional sentiment intensities of ironic sentences and their context are important issues in the natural language processing field. This paper extends the NTU irony corpus with valence, arousal and irony intensities at the sentence level and with valence and arousal intensities at the context level; the result is called the Chinese Dimensional Valence-Arousal-Irony (CDVAI) dataset. The paper also analyzes the annotation differences between the human annotators and uses a deep learning model such as BERT to evaluate prediction performance on the CDVAI corpus.

SCU-NLP at ROCLING 2022 Shared Task: Experiment and Error Analysis of Biomedical Entity Detection Model
Sung-Ting Chiou | Sheng-Wei Huang | Ying-Chun Lo | Yu-Hsuan Wu | Jheng-Long Wu
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Named entity recognition generally refers to identifying entities with specific meanings in unstructured text, including names of people, places, organizations, dates, times, quantities, proper nouns and other words. In the medical field, these may be drug names, organ names, test items, nutritional supplements, etc. The purpose of named entity recognition in this study is to find such items in unstructured input text. Focusing on healthcare, we predict the named entity boundaries and categories in sentences based on ten entity types. We explore multiple fundamental NER approaches to this task, including Hidden Markov Models, Conditional Random Fields, Random Forest Classifier and BERT. Of these approaches, the CRF model yields the strongest F-score.
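To make "predicting named entity boundaries and categories" concrete, here is a small sketch that turns per-token tags into entity spans. It assumes a BIO tagging scheme, which the abstract does not state, and the entity type names `DRUG` and `ORGAN` are hypothetical placeholders rather than the shared task's actual label set.

```python
def bio_to_spans(tags):
    """Convert a list of BIO tags into (start, end, label) spans.

    `end` is exclusive. An I- tag that does not continue the current
    entity is ignored rather than treated as a new entity.
    """
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues_entity = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues_entity:
            spans.append((start, i, label))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    return spans


# Hypothetical tag sequence for a five-token sentence.
example_tags = ["B-DRUG", "I-DRUG", "O", "B-ORGAN", "O"]
example_spans = bio_to_spans(example_tags)
```

Here `example_spans` holds one two-token `DRUG` entity and one single-token `ORGAN` entity. Any of the sequence models listed in the abstract could produce such tag sequences; span-level comparison against gold spans is then a common basis for an F-score.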

2021

A Corpus for Dimensional Sentiment Classification on YouTube Streaming Service
Ching-Wen Hsu | Chun-Lin Chou | Hsuan Liu | Jheng-Long Wu
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

Streaming service platforms such as YouTube provide a discussion function for audiences worldwide to share comments. YouTubers who upload videos to the platform want to track the performance of those videos. However, YouTube's current analysis functions provide only a few performance indicators, such as average view duration, browsing history, and variance in audience demographics, and lack sentiment analysis of audience comments. Therefore, the paper proposes multi-dimensional sentiment indicators such as YouTuber preference, Video preference, and Excitement level to capture comprehensive sentiment in audience comments on videos and YouTubers. To evaluate the performance of different classifiers, we experiment with deep learning-based, machine learning-based, and BERT-based classifiers to automatically detect the three sentiment indicators in audience comments. Experimental results indicate that the BERT-based classifier outperforms the other classifiers in F1-score, and that the Excitement level indicator improves considerably. Therefore, the multiple sentiment detection tasks on video streaming platforms can be addressed by combining the proposed multi-dimensional sentiment indicators with a BERT classifier to obtain the best result.

Confiscation Detection of Criminal Judgment Using Text Classification Approach
Hsuan-Tzu Shih | Yu-Cheng Chiu | Hsiao-Shih Chen | Jheng-Long Wu
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

As the confiscation system matures, understanding the distribution of the types of confiscation actually ordered by the courts makes it possible to follow changing trends. In addition to assisting legislators in formulating laws, it can also give others an understanding of how the confiscation system actually operates. Identifying the distribution of confiscation by manual judgment consumes a great deal of manpower and time, which motivates using artificial intelligence to do so automatically. The purpose of this research is to establish an automated confiscation identification model that can quickly and accurately identify the multiple label categories of confiscation and meet the need across society for confiscation information, so as to facilitate subsequent law amendments or discretion. This research uses first-instance criminal cases as the main experimental data. According to current law, confiscation is divided into three categories: contraband, criminal tools and criminal proceeds, which are identified as multiple labels. The research uses Term Frequency–Inverse Document Frequency (TF-IDF) and Word2Vec as feature extraction algorithms with a random forest classifier, as well as the CKIPlabBERT pretrained model, for training and identification. The experimental results show that the CKIPlabBERT pretrained model achieves the best identification performance when only the sentences in the judgment that mention confiscation-related words are used. When the task is case confiscation, the Micro F1 score reaches 96.2716%; when the task is defendant confiscation, it reaches 95.5478%.
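The Micro F1 score reported here pools true positives, false positives and false negatives over all labels before computing precision and recall. The sketch below shows that calculation for a multi-label task; the three label names follow the categories in the abstract, but the gold and predicted label sets are invented purely for illustration and do not reproduce the paper's data or results.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 for multi-label data given as lists of label sets."""
    tp = fp = fn = 0
    for true_labels, pred_labels in zip(y_true, y_pred):
        tp += len(true_labels & pred_labels)   # labels correctly predicted
        fp += len(pred_labels - true_labels)   # labels predicted but not gold
        fn += len(true_labels - pred_labels)   # gold labels that were missed
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Invented example: three judgments with gold and predicted label sets.
gold = [{"contraband"}, {"criminal tools", "criminal proceeds"}, set()]
predicted = [{"contraband"}, {"criminal tools"}, {"criminal proceeds"}]
score = micro_f1(gold, predicted)
```

In this toy case there are two true positives, one false positive and one false negative, so precision, recall and Micro F1 all equal 2/3.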

SCUDS at ROCLING-2021 Shared Task: Using Pretrained Model for Dimensional Sentiment Analysis Based on Sample Expansion Method
Hsiao-Shih Chen | Pin-Chiung Chen | Shao-Cheng Huang | Yu-Cheng Chiu | Jheng-Long Wu
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

Sentiment analysis has become a popular research topic in recent years, especially for educational texts, where it is an important problem. According to the literature, generating similar sentences can improve the prediction performance of machine learning. Therefore, controlled sample expansion is a key component of prediction models. The paper proposes a sample expansion method that combines a part-of-speech filter with a Word2Vec similar-word finder. The generated samples are of high quality and have similar sentiment representations. The DistilBERT pretrained model is used to learn and predict Valence-Arousal scores from the expanded samples. Experimental results show that training the prediction model on the expanded samples outperforms training on the original data without expansion, reducing mean squared error by 80% and increasing the Pearson correlation coefficient by 28%.
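One way to picture the sample expansion idea, a part-of-speech filter combined with a similar-word finder, is the sketch below. It is an assumption-laden illustration, not the paper's method: the `similar_words` dictionary is a hypothetical stand-in for a Word2Vec nearest-neighbor lookup, the POS tags and allowed set are invented, and English tokens are used only for readability.

```python
def expand_sample(tagged_tokens, similar_words, allowed_pos):
    """Generate new token lists by swapping one word at a time.

    Only words whose POS tag is in `allowed_pos` are replaced, and each
    replacement comes from the `similar_words` lookup.
    """
    expanded = []
    for i, (word, pos) in enumerate(tagged_tokens):
        if pos not in allowed_pos:
            continue  # part-of-speech filter: leave this word untouched
        for candidate in similar_words.get(word, []):
            new_tokens = [w for w, _ in tagged_tokens]
            new_tokens[i] = candidate
            expanded.append(new_tokens)
    return expanded


# Hypothetical inputs: a tagged sentence and a similar-word lookup.
tagged = [("the", "DET"), ("lesson", "NOUN"), ("was", "VERB"), ("great", "ADJ")]
similar_words = {
    "great": ["wonderful", "excellent"],
    "lesson": ["class"],
    "was": ["is"],
}
new_samples = expand_sample(tagged, similar_words, allowed_pos={"ADJ", "NOUN"})
```

The verb is left alone because its tag is filtered out, so three new samples are produced. In a setup like the one described, each generated sentence would presumably inherit the Valence-Arousal label of its source sentence, though the abstract does not spell this out.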

A Pretrained YouTuber Embeddings for Improving Sentiment Classification of YouTube Comments
Ching-Wen Hsu | Hsuan Liu | Jheng-Long Wu
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021

2020

Building A Multi-Label Detection Model for Question classification of Auction Website
I-Ju Lin | Jheng-Long Wu
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)