Shu-Kai Hsieh

Also published as: Shu-kai Hsieh, ShuKai Hsieh

2024

pdf bib abs
The Semantic Relations in LLMs: An Information-theoretic Compression Approach
Yu-Hsiang Tseng | Pin-Er Chen | Da-Chen Lian | Shu-Kai Hsieh
Proceedings of the Workshop: Bridging Neurons and Symbols for Natural Language Processing and Knowledge Graphs Reasoning (NeusymBridge) @ LREC-COLING-2024

Compressibility is closely related to the predictability of the texts from the information theory viewpoint. As large language models (LLMs) are trained to maximize the conditional probabilities of upcoming words, they may capture the subtlety and nuances of the semantic constraints underlying the texts, and texts aligning with the encoded semantic constraints are more compressible than those that do not. This paper systematically tests whether and how LLMs can act as compressors of semantic pairs. Using semantic relations from English and Chinese Wordnet, we empirically demonstrate that texts with correct semantic pairings are more compressible than incorrect ones, measured by the proposed compression advantages index. We also show that, with the Pythia model suite and a fine-tuned model on Chinese Wordnet, compression capacities are modulated by the model’s seen data. These findings are consistent with the view that LLMs encode the semantic knowledge as underlying constraints learned from texts and can act as compressors of semantic information or potentially other structured knowledge.
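The link between predictability and compressibility in the abstract above can be illustrated with a minimal sketch. It assumes only the standard information-theoretic identity that a token with conditional probability p costs about -log2 p bits under an ideal coder. The probabilities below are invented for illustration, and `bits_saved` is a hypothetical quantity, not the paper's compression advantages index.

```python
import math

def code_length_bits(token_probs):
    """Ideal code length in bits: sum of -log2 p over the conditional
    probabilities a predictive model assigns to each token."""
    return sum(-math.log2(p) for p in token_probs)

# Hypothetical per-token probabilities (made up for illustration only).
correct_pair = [0.5, 0.25, 0.5]      # text with a correct semantic pairing
incorrect_pair = [0.5, 0.25, 0.125]  # same frame with an incorrect pairing

bits_correct = code_length_bits(correct_pair)
bits_incorrect = code_length_bits(incorrect_pair)

# Hypothetical illustration: bits saved when the pairing is correct.
bits_saved = bits_incorrect - bits_correct
```

In this toy case the more predictable text needs fewer bits, which is the sense in which a text that fits a model's learned constraints is more compressible.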

2023

pdf
Solving Linguistic Olympiad Problems with Tree-of-Thought Prompting
Zheng-Lin Lin | Chiao-Han Yen | Jia-Cheng Xu | Deborah Watty | Shu-Kai Hsieh
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

pdf
Evaluating Interfaced LLM Bias
Kai-Ching Yeh | Jou-An Chi | Da-Chen Lian | Shu-Kai Hsieh
Proceedings of the 35th Conference on Computational Linguistics and Speech Processing (ROCLING 2023)

pdf
Exploring Affordance and Situated Meaning in Image Captions: A Multimodal Analysis
Pin-Er Chen | Po-Ya Angela Wang | Hsin-Yu Chou | Yu-Hsiang Tseng | Shu-Kai Hsieh
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf
Vec2Gloss: definition modeling leveraging contextualized vectors with Wordnet gloss
Yu-Hsiang Tseng | Mao-Chang Ku | Wei-Ling Chen | Yu-Lin Chang | Shu-Kai Hsieh
Proceedings of the 37th Pacific Asia Conference on Language, Information and Computation

pdf
Lexical Retrieval Hypothesis in Multimodal Context
Po-Ya Angela Wang | Pin-Er Chen | Hsin-Yu Chou | Yu-Hsiang Tseng | Shu-Kai Hsieh
Proceedings of the 4th Conference on Language, Data and Knowledge

2022

pdf
Analyzing Discourse Functions with Acoustic Features and Phone Embeddings: Non-lexical Items in Taiwan Mandarin
Pin-Er Chen | Yu-Hsiang Tseng | Chi-Wei Wang | Fang-Chi Yeh | Shu-Kai Hsieh
International Journal of Computational Linguistics & Chinese Language Processing, Volume 27, Number 2, December 2022

pdf abs
Analyzing discourse functions with acoustic features and phone embeddings: non-lexical items in Taiwan Mandarin
Pin-Er Chen | Yu-Hsiang Tseng | Chi-Wei Wang | Fang-Chi Yeh | Shu-Kai Hsieh
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Non-lexical items are expressive devices used in conversations that are not words but are nevertheless meaningful. These items play crucial roles, such as signaling turn-taking or marking stances in interactions. However, as the non-lexical items do not stably correspond to written or phonological forms, past studies tend to focus on studying their acoustic properties, such as pitches and durations. In this paper, we investigate the discourse functions of non-lexical items through their acoustic properties and the phone embeddings extracted from a deep learning model. Firstly, we create a non-lexical item dataset based on the interpellation video clips from Taiwan’s Legislative Yuan. Then, we manually identify the non-lexical items and their discourse functions in the videos. Next, we analyze the acoustic properties of those items through statistical modeling and building classifiers based on phone embeddings extracted from a phone recognition model. We show that (1) the discourse functions have significant effects on the acoustic features; and (2) the classifiers built on phone embeddings perform better than the ones on conventional acoustic properties. These results suggest that phone embeddings may reflect the phonetic variations crucial in differentiating the discourse functions of non-lexical items.

Constructions are direct form-meaning pairs with possible schematic slots. These slots are simultaneously constrained by the embedded construction itself and the sentential context. We propose that the constraint could be described by a conditional probability distribution. However, as this conditional probability is inevitably complex, we utilize language models to capture this distribution. Therefore, we build CxLM, a deep learning-based masked language model explicitly tuned to constructions’ schematic slots. We first compile a construction dataset consisting of over ten thousand constructions in Taiwan Mandarin. Next, an experiment is conducted on the dataset to examine to what extent a pretrained masked language model is aware of the constructions. We then fine-tune the model specifically to perform a cloze task on the open slots. We find that the fine-tuned model predicts masked slots more accurately than baselines and generates both structurally and semantically plausible word samples. Finally, we release CxLM and its dataset as publicly available resources and hope they will serve as new quantitative tools in studying construction grammar.

pdf abs
Character Jacobian: Modeling Chinese Character Meanings with Deep Learning Model
Yu-Hsiang Tseng | Shu-Kai Hsieh
Proceedings of the 29th International Conference on Computational Linguistics

Compounding, a prevalent word-formation process, presents an interesting challenge for computational models. Indeed, the relations between compounds and their constituents are often complicated. It is particularly so in Chinese morphology, where each character is almost simultaneously bound and free when treated as a morpheme. To model such a word-formation process, we propose the Notch (NOnlinear Transformation of CHaracter embeddings) model and the character Jacobians. The Notch model first learns the non-linear relations between the constituents and words, and the character Jacobians further describe the character’s role in each word. In a series of experiments, we show that the Notch model not only predicts the embeddings of the real words from their constituents but also helps account for the behavioral data of the pseudowords. Moreover, we demonstrate that character Jacobians reflect the characters’ meanings. Taken together, the Notch model and character Jacobians may provide a new perspective on studying the word-formation process and morphology with modern deep learning.

2021

Ever-expanding evaluative texts on online forums have become an important source for sentiment analysis. This paper proposes an aspect-based annotated dataset consisting of telecom reviews on social media. We introduce a category, implicit evaluative texts, impevals for short, to investigate how the deep learning model works on these implicit reviews. We first compare two models, BertSimple and BertImpvl, and find that while both models are competent to learn simple evaluative texts, they are confused when classifying impevals. To investigate the factors underlying the correctness of the model’s predictions, we conduct a series of analyses, including qualitative error analysis and quantitative analysis of linguistic features with logistic regressions. The results show that local features that affect the overall sentential sentiment confuse the model: multiple target entities, transitional words, sarcasm, and rhetorical questions. Crucially, these linguistic features are independent of the model’s confidence measured by the classifier’s softmax probabilities. Interestingly, the sentence complexity indicated by syntax-tree depth is not correlated with the model’s correctness. In sum, this paper sheds light on the characteristics of the modern deep learning model and when it might need more supervision through linguistic evaluations.

pdf abs
Keyword-centered Collocating Topic Analysis
Yu-Lin Chang | Yongfu Liao | Po-Ya Angela Wang | Mao-Chang Ku | Shu-Kai Hsieh
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

The rapid flow of information and the abundance of text data on the Internet have brought about an urgent demand for monitoring resources and techniques that serve various purposes. Extracting facets of information useful for particular domains from such large and dynamically growing corpora requires unsupervised yet transparent ways of analyzing textual data. This paper proposes a hybrid collocation analysis as a potential method to retrieve and summarize Taiwan-related topics posted on Weibo and PTT. By grouping collocates of 臺灣 ‘Taiwan’ into clusters of topics via either word embedding clustering or Latent Dirichlet allocation, lists of collocates can be converted to probability distributions such that distances and similarities can be defined and computed. With this method, we conduct a diachronic analysis of the similarity between Weibo and PTT, providing a way to pinpoint when and how the topic similarity between the two rises or falls. A fine-grained view of the grammatical behavior and political implications is also attempted. This study thus sheds light on alternative, explainable routes for future social media listening methods aimed at understanding the cross-strait relationship.
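As a rough sketch of the step in which collocate lists become probability distributions that can be compared, the following assumes each collocate has already been assigned to a topic cluster. The abstract does not name the distance measure; Jensen-Shannon divergence is used here purely as an assumption, and the words, topics, and function names are hypothetical.

```python
import math
from collections import Counter

def topic_distribution(collocates, topic_of):
    """Turn a list of collocates into a probability distribution over topics."""
    counts = Counter(topic_of[w] for w in collocates if w in topic_of)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two topic distributions."""
    topics = set(p) | set(q)
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in topics}

    def kl_to_mixture(a):
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)

# Hypothetical cluster assignments and collocate lists for two platforms.
topic_of = {"election": "politics", "vote": "politics",
            "travel": "tourism", "food": "tourism"}
platform_a = topic_distribution(["travel", "food", "election", "travel"], topic_of)
platform_b = topic_distribution(["election", "vote", "election", "food"], topic_of)

divergence = js_divergence(platform_a, platform_b)
```

Computing such a value per time slice would give one possible way to track when the two platforms' topic profiles converge or diverge.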

pdf
Exploring sentiment constructions: connecting deep learning models with linguistic construction
Shu-Kai Hsieh | Yu-Hsiang Tseng
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

pdf
Examine persuasion strategies in Chinese on social media
Yu-Yun Chang | Po-Ya Angela Wang | Han-Tang Hung | Ka-Sîng Khóo | Shu-Kai Hsieh
Proceedings of the 35th Pacific Asia Conference on Language, Information and Computation

2020

pdf abs
Computational Modeling of Affixoid Behavior in Chinese Morphology
Yu-Hsiang Tseng | Shu-Kai Hsieh | Pei-Yi Chen | Sara Court
Proceedings of the 28th International Conference on Computational Linguistics

The morphological status of affixes in Chinese has long been a matter of debate. How one might apply the conventional criteria of free/bound and content/function features to distinguish word-forming affixes from bound roots in Chinese is still far from clear. Issues involving polysemy and diachronic dynamics further blur the boundaries. In this paper, we propose three quantitative features in a computational model of affixoid behavior in Mandarin Chinese. The results show that, except in very few cases, there are no clear criteria that can be used to identify an affix’s status in an isolating language like Chinese. A diachronic check using contextualized embeddings with the WordNet Sense Inventory also demonstrates the possible role of the polysemy of lexical roots across diachronic settings.

pdf
Exploring Discourse on Same-sex Marriage in Taiwan: A Case Study of Near-Synonym of HOMOSEXUAL in Opposing Stances
Han-Tang Hung | Shu-Kai Hsieh
Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation

pdf abs
Do You Believe It Happened? Assessing Chinese Readers’ Veridicality Judgments
Yu-Yun Chang | Shu-Kai Hsieh
Proceedings of the Twelfth Language Resources and Evaluation Conference

This work collects and studies Chinese readers’ veridicality judgments of news events (whether an event is viewed as happening or not). For instance, in “The FBI alleged in court documents that Zazi had admitted having a handwritten recipe for explosives on his computer”, do people believe that Zazi had a handwritten recipe for explosives? The goal is to observe the pragmatic behavior of linguistic features in context that affect readers’ veridicality judgments. Exploring the datasets, we find that features such as event-selecting predicates (ESP), modality markers, adverbs, temporal information, and statistics have an impact on readers’ veridicality judgments. We further find that modality markers with high certainty do not necessarily lead readers to have high confidence that an event happened. Additionally, the source of information introduced by an ESP has little effect on veridicality judgments, even when an event is attributed to an authority (e.g. “The FBI”). A corpus annotated with Chinese readers’ veridicality judgments is released as the Chinese PragBank for further analysis.

pdf
Mitigating Impacts of Word Segmentation Errors on Collocation Extraction in Chinese
Yongfu Liao | Shu-Kai Hsieh
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

pdf
Lectal Variation of the Two Chinese Causative Auxiliaries
Cing-Fang Shih | Mao-Chang Ku | Shu-Kai Hsieh
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

pdf
An Analysis of Multimodal Document Intent in Instagram Posts
Ying-Yu Chen | Shu-Kai Hsieh
Proceedings of the 32nd Conference on Computational Linguistics and Speech Processing (ROCLING 2020)

2019

pdf
Extracting Semantic Representations of Sexual Biases from Word Vectors
Ying-Yu Chen | Shu-Kai Hsieh
Proceedings of the 31st Conference on Computational Linguistics and Speech Processing (ROCLING 2019)

pdf abs
Augmenting Chinese WordNet semantic relations with contextualized embeddings
Yu-Hsiang Tseng | Shu-Kai Hsieh
Proceedings of the 10th Global Wordnet Conference

Constructing semantic relations in WordNet has been a labour-intensive task, especially in a dynamic and fast-changing language environment. Building on recent advances in contextualized embeddings, this paper proposes the concept of morphology-guided sense vectors, which can be used to semi-automatically augment semantic relations in Chinese Wordnet (CWN). This paper (1) built sense vectors with pre-trained contextualized embedding models; (2) demonstrated that the computed sense vectors were consistent with the sense distinctions made in CWN; and (3) predicted potential semantically related sense pairs with high accuracy using the sense vector model.

pdf abs
Eigencharacter: An Embedding of Chinese Character Orthography
Yu-Hsiang Tseng | Shu-Kai Hsieh
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)

Chinese characters are unique in their logographic nature, which inherently encodes world knowledge through thousands of years of evolution. This paper proposes an embedding approach, namely eigencharacter (EC) space, which helps NLP applications easily access the knowledge encoded in Chinese orthography. These EC representations are automatically extracted, encode both structural and radical information, and easily integrate with other computational models. We built EC representations of 5,000 Chinese characters, investigated orthography knowledge encoded in ECs, and demonstrated how these ECs identified visually similar characters with both structural and radical information.

2018

pdf abs
Sinitic Wordnet: Laying the Groundwork with Chinese Varieties Written in Traditional Characters
Chih-Yao Lee | Shu-Kai Hsieh
Proceedings of the 9th Global Wordnet Conference

The present work seeks to make the logographic nature of Chinese script a relevant research ground in wordnet studies. While wordnets are not so much about words as about the concepts represented in words, synset formation inevitably involves the use of orthographic and/or phonetic representations to serve as headword for a given concept. For wordnets of Chinese languages, if their synsets are mapped with each other, the connection from logographic forms to lexicalized concepts can be explored backwards to, for instance, help trace the development of cognates in different varieties of Chinese. The Sinitic Wordnet project is an attempt to construct such an integrated wordnet that aggregates three Chinese varieties that are widely spoken in Taiwan and all written in traditional Chinese characters.

pdf
Fluid Annotation: A Granularity-aware Annotation Tool for Chinese Word Fluidity
Shu-Kai Hsieh | Yu-Hsiang Tseng | Chih-Yao Lee | Chiung-Yu Chiang
Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018)

2017

pdf abs
ClassifierGuesser: A Context-based Classifier Prediction System for Chinese Language Learners
Nicole Peinelt | Maria Liakata | Shu-Kai Hsieh
Proceedings of the IJCNLP 2017, System Demonstrations

Classifiers are function words that are used to express quantities in Chinese and are especially difficult for language learners. In contrast to previous studies, we argue that the choice of classifiers is highly contextual and train context-aware machine learning models based on a novel publicly available dataset, outperforming previous baselines. We further present use cases for our database and models in an interactive demo system.

pdf
Exploring Lavender Tongue from Social Media Texts [In Chinese]
Hsiao-Han Wu | Shu-Kai Hsieh
Proceedings of the 29th Conference on Computational Linguistics and Speech Processing (ROCLING 2017)

2016

pdf abs
CogALex-V Shared Task: LOPE
Kanan Luce | Jiaxing Yu | Shu-Kai Hsieh
Proceedings of the 5th Workshop on Cognitive Aspects of the Lexicon (CogALex - V)

Automatic discovery of semantically-related words is one of the most important NLP tasks, and has great impact on the theoretical psycholinguistic modeling of the mental lexicon. In this shared task, we employ a word embedding model to test two assumptions made explicitly or implicitly by the NLP community: (1) word embedding models can reflect syntagmatic similarities in usage between words as distances in the projected vector space; and (2) word embedding models can reflect paradigmatic relationships between words.

pdf
Evaluative Pattern Extraction for Automated Text Generation
Chia-Chen Lee | Shu-Kai Hsieh
Proceedings of the 9th International Natural Language Generation conference

pdf
Crowdsourcing Experiment Designs for Chinese Word Sense Annotation
Tzu-Yun Huang | Hsiao-Han Wu | Chia-Chen Lee | Shao-Man Lee | Guan-Wei Li | Shu-Kai Hsieh
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016)

pdf
Sarcasm Detection in Chinese Using a Crowdsourced Corpus
Shih-Kai Lin | Shu-Kai Hsieh
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016)

2015

pdf
Linguistic Linked Data in Chinese: The Case of Chinese Wordnet
Chih-Yao Lee | Shu-Kai Hsieh
Proceedings of the 4th Workshop on Linked Data in Linguistics: Resources and Applications

pdf
An Arguing Lexicon for Stance Classification on Short Text Comments in Chinese
Ju-han Chuang | Shu-Kai Hsieh
Proceedings of the 29th Pacific Asia Conference on Language, Information and Computation: Posters

2014

pdf abs
Why Chinese Web-as-Corpus is Wacky? Or: How Big Data is Killing Chinese Corpus Linguistics
Shu-Kai Hsieh
Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14)

This paper aims to examine and evaluate the current development of using Web-as-Corpus (WaC) paradigm in Chinese corpus linguistics. I will argue that the unstable notion of wordhood in Chinese and the resulting diverse ideas of implementing word segmentation systems have posed great challenges for those who are keen on building web-scaled corpus data. Two lexical measures are proposed to illustrate the issues and methodological discussions are provided.

pdf
Leveraging Morpho-semantics for the Discovery of Relations in Chinese Wordnet
Shu-Kai Hsieh | Yu-Yun Chang
Proceedings of the Seventh Global Wordnet Conference

pdf
Skillex: a graph-based lexical score for measuring the semantic efficiency of used verbs by human subjects describing actions
Bruno Gaume | Karine Duvignau | Emmanuel Navarro | Yann Desalle | Hintat Cheung | Shu-Kai Hsieh | Pierre Magistry | Laurent Prévot
Traitement Automatique des Langues, Volume 55, Numéro 3 : Traitement automatique du langage naturel et sciences cognitives [Natural Language Processing and Cognitive Sciences]

pdf
Public Opinion Toward CSSTA: A Text Mining Approach
Yi-An Wu | Shu-Kai Hsieh
Proceedings of the 26th Conference on Computational Linguistics and Speech Processing (ROCLING 2014)

pdf
Sketching the Dependency Relations of Words in Chinese
Meng-Hsien Shih | Shu-Kai Hsieh
Proceedings of the 26th Conference on Computational Linguistics and Speech Processing (ROCLING 2014)

pdf bib
Public Opinion Toward CSSTA: A Text Mining Approach
Yi-An Wu | Shu-Kai Hsieh
International Journal of Computational Linguistics & Chinese Language Processing, Volume 19, Number 4, December 2014 - Special Issue on Selected Papers from ROCLING XXVI

We outline work performed within the framework of a current EC project. The goal is to construct a language-independent information system for a specific domain (environment/ecology/biodiversity) anchored in a language-independent ontology that is linked to wordnets in seven languages. For each language, information extraction and identification of lexicalized concepts with ontological entries is carried out by text miners (Kybots). The mapping of language-specific lexemes to the ontology allows for crosslinguistic identification and translation of equivalent terms. The infrastructure developed within this project enables long-range knowledge sharing and transfer across many languages and cultures, addressing the need for global and uniform transition of knowledge beyond the specific domains addressed here.

pdf abs
Extracting Concrete Senses of Lexicon through Measurement of Conceptual Similarity in Ontologies
Siaw-Fong Chung | Laurent Prévot | Mingwei Xu | Kathleen Ahrens | Shu-Kai Hsieh | Chu-Ren Huang
Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC'08)

The measurement of conceptual similarity in a hierarchical structure has been proposed by studies such as Wu and Palmer (1994), which have been summarized and evaluated in Budanitsky and Hirst (2006). The present study applies the measurement of conceptual similarity to conceptual metaphor research by comparing concreteness of ontological resource nodes to several prototypical concrete nodes selected by human subjects. Here, the purpose of comparing conceptual similarity between nodes is to select a concrete sense for a word which is used metaphorically. Through the use of a WordNet-SUMO interface such as SinicaBow (Huang, Chang and Lee, 2004), concrete senses of a lexical item will be selected once its SUMO nodes have been compared in terms of conceptual similarity with the prototypical concrete nodes. This study has strong implications for the interaction of psycholinguistic and computational linguistic fields in conceptual metaphor research.
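The selection procedure described above can be sketched with the Wu and Palmer (1994) measure, commonly stated as 2·depth(LCS) / (depth(a) + depth(b)), where LCS is the deepest shared ancestor of the two nodes. The hierarchy below is a toy example whose node names only loosely echo upper-ontology categories; it is not the actual SUMO structure, and the prototype node, candidate senses, and helper names are all hypothetical.

```python
def path_to_root(node, parent):
    """Return [node, ..., root] by following parent links."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def depth(node, parent):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node, parent))

def wu_palmer(a, b, parent):
    """Wu-Palmer similarity: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b, parent))
    # The first ancestor of a that is also an ancestor of b is the deepest shared one.
    lcs = next(n for n in path_to_root(a, parent) if n in ancestors_b)
    return 2 * depth(lcs, parent) / (depth(a, parent) + depth(b, parent))

# Toy hierarchy (child -> parent); NOT the real SUMO ontology.
PARENT = {"Physical": "Entity", "Abstract": "Entity", "Object": "Physical",
          "Artifact": "Object", "Proposition": "Abstract"}
PROTOTYPES = ["Artifact"]  # hypothetical prototypical concrete node
candidate_senses = {"sense_1": "Object", "sense_2": "Proposition"}

def concreteness(node):
    """Best similarity between a node and any prototypical concrete node."""
    return max(wu_palmer(node, p, PARENT) for p in PROTOTYPES)

most_concrete = max(candidate_senses, key=lambda s: concreteness(candidate_senses[s]))
```

Here the sense mapped to `Object` scores 6/7 against the prototype while the one mapped to `Proposition` scores 2/7, so the former would be picked as the concrete sense in this toy setting.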

Corpus-based approaches and statistical approaches have been the mainstream of natural language processing research for the past two decades. Language resources play a key role in such approaches, but there is an insufficient amount of language resources in many Asian languages. In this situation, standardisation of language resources would be of great help in developing resources in new languages. This paper presents the latest development efforts of our project, which aims at creating a common standard for Asian language resources that is compatible with an international standard. In particular, the paper focuses on i) lexical specification and data categories relevant for building multilingual lexical resources for Asian languages; ii) a core upper-layer ontology needed for ensuring multilingual interoperability; and iii) the evaluation platform used to test the entire architectural framework.