Keh-Yih Su

2022

This paper constructs CMDQA, a Chinese dialogue-based information-seeking question answering dataset, which mainly targets the scenario of obtaining information about Chinese movies. It contains 10K QA dialogs (40K turns in total). All questions and background documents are compiled from Wikipedia via an Internet crawler. The answers to the questions are obtained by extracting the corresponding answer spans from the related text passages. In CMDQA, in addition to requiring a search for related documents, pronouns are added to the questions to better mimic real dialog scenarios. This dataset can test the individual performance of the information retrieval, question answering, and question rewriting modules. This paper also provides a baseline system and reports its performance on this dataset. The experiments show that there is still a large gap between the baseline and human performance, so the dataset poses a sufficient challenge for related research.

Is Character Trigram Overlapping Ratio Still the Best Similarity Measure for Aligning Sentences in a Paraphrased Corpus?
Aleksandra Smolka | Hsin-Min Wang | Jason S. Chang | Keh-Yih Su
Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022)

Sentence alignment is an essential step in studying the mapping among different language expressions, and the character trigram overlapping ratio was reported to be the most effective similarity measure for aligning sentences in the text simplification dataset. However, the appropriateness of each similarity measure depends on the characteristics of the corpus to be aligned. This paper studies whether the character trigram is still a suitable similarity measure for the task of aligning sentences in a paragraph paraphrasing corpus. We compare several embedding-based and non-embedding model-agnostic similarity measures, including some that have not been studied previously. The evaluation is conducted on parallel paragraphs sampled from the Webis-CPC-11 corpus, which is a paragraph paraphrasing dataset. Our results show that modern BERT-based measures such as Sentence-BERT or BERTScore can lead to significant improvement in this task.
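As a rough illustration of the character trigram overlapping ratio mentioned in this abstract, the sketch below computes a Jaccard-style ratio over character trigram sets. The abstract does not give the exact formula, so the normalisation step and the choice of intersection over union are assumptions, and the function names are hypothetical.

```python
def char_trigrams(text):
    """Return the set of character trigrams of a lower-cased,
    whitespace-normalised string (normalisation is an assumption)."""
    normalised = " ".join(text.lower().split())
    return {normalised[i:i + 3] for i in range(len(normalised) - 2)}


def trigram_overlap_ratio(a, b):
    """Jaccard-style overlap of character trigram sets.

    Assumption: shared trigrams divided by all distinct trigrams;
    the measure used in the paper may be defined differently.
    """
    ta, tb = char_trigrams(a), char_trigrams(b)
    if not ta or not tb:
        return 0.0  # strings shorter than three characters have no trigrams
    return len(ta & tb) / len(ta | tb)


def best_match(sentence, candidates):
    """Index of the candidate sentence with the highest overlap ratio."""
    scores = [trigram_overlap_ratio(sentence, c) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)
```

A simple aligner could call `best_match` for every sentence of one paragraph against the sentences of its paraphrase; an embedding-based measure would replace only `trigram_overlap_ratio`.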

2021

How Fast can BERT Learn Simple Natural Language Inference?
Yi-Chung Lin | Keh-Yih Su
Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume

This paper empirically studies whether BERT can really learn to conduct natural language inference (NLI) without utilizing hidden dataset bias; and how efficiently it can learn if it could. This is done via creating a simple entailment judgment case which involves only binary predicates in plain English. The results show that the learning process of BERT is very slow. However, the efficiency of learning can be greatly improved (data reduction by a factor of 1,500) if task-related features are added. This suggests that domain knowledge greatly helps when conducting NLI with neural networks.

Nested Named Entity Recognition for Chinese Electronic Health Records with QA-based Sequence Labeling
Yu-Lun Chiang | Chih-Hao Lin | Cheng-Lung Sung | Keh-Yih Su
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

This study presents a novel QA-based sequence labeling (QASL) approach that naturally tackles both flat and nested Named Entity Recognition (NER) tasks on a Chinese Electronic Health Records (CEHRs) dataset. The proposed QASL approach asks, in parallel, a corresponding natural language question for each specific named entity type, and then identifies the associated NEs of that type with the BIO tagging scheme. The nested NEs are then formed by overlapping the results of the various types. In comparison with pure sequence-labeling (SL) approaches, the given question carries significant prior knowledge about the specified entity type and provides the capability of extracting NEs of different types, so the performance on the nested NER task is improved, reaching an F1-score of 90.70%. In addition, in comparison with the pure QA-based approach, our proposed approach retains the SL features, and can thus extract multiple NEs of the same type without knowing in advance how many NEs the passage contains. Experiments on our CEHR dataset demonstrate that QASL-based models greatly outperform the SL-based models by 6.12% to 7.14% in F1-score.
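The step of forming nested NEs by overlapping per-type BIO results can be sketched as follows. This is only a minimal illustration under stated assumptions: each entity type is assumed to yield one BIO tag sequence over the same tokens, the entity type names are hypothetical, and the model that produces the tags is not shown.

```python
def bio_to_spans(tags, entity_type):
    """Convert one BIO tag sequence into (start, end, type) spans,
    with `end` exclusive. A stray "I" without a preceding "B" is
    tolerated and treated as the start of a span (an assumption)."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            if start is not None:
                spans.append((start, i, entity_type))
            start = i
        elif tag == "I":
            if start is None:
                start = i
        else:  # "O" closes any open span
            if start is not None:
                spans.append((start, i, entity_type))
                start = None
    if start is not None:
        spans.append((start, len(tags), entity_type))
    return spans


def merge_types(tags_by_type):
    """Overlay the spans of every entity type; spans of different
    types may overlap, which is what yields nested entities."""
    spans = []
    for entity_type, tags in tags_by_type.items():
        spans.extend(bio_to_spans(tags, entity_type))
    return sorted(spans)


# Hypothetical example: a BODY mention nested inside a DISEASE mention.
nested = merge_types({
    "DISEASE": ["B", "I", "I", "O"],
    "BODY": ["B", "O", "O", "O"],
})
```

Because each type is tagged independently, no single sequence has to encode overlapping labels, and several entities of the same type are handled by ordinary BIO decoding.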

This paper presents a framework to answer the questions that require various kinds of inference mechanisms (such as Extraction, Entailment-Judgement, and Summarization). Most of the previous approaches adopt a rigid framework which handles only one inference mechanism. Only a few of them adopt several answer generation modules for providing different mechanisms; however, they either lack an aggregation mechanism to merge the answers from various modules, or are too complicated to be implemented with neural networks. To alleviate the problems mentioned above, we propose a divide-and-conquer framework, which consists of a set of various answer generation modules, a dispatch module, and an aggregation module. The answer generation modules are designed to provide different inference mechanisms, the dispatch module is used to select a few appropriate answer generation modules to generate answer candidates, and the aggregation module is employed to select the final answer. We test our framework on the 2020 Formosa Grand Challenge Contest dataset. Experiments show that the proposed framework outperforms the state-of-the-art RoBERTa-large model by about 11.4%.

Mining Commonsense and Domain Knowledge from Math Word Problems
Shih-Hung Tsai | Chao-Chun Liang | Hsin-Min Wang | Keh-Yih Su
Proceedings of the 33rd Conference on Computational Linguistics and Speech Processing (ROCLING 2021)

Current neural math solvers learn to incorporate commonsense or domain knowledge by utilizing pre-specified constants or formulas. However, as these constants and formulas are mainly human-specified, the generalizability of the solvers is limited. In this paper, we propose to explicitly retrieve the required knowledge from math problem datasets. In this way, we can determinedly characterize the required knowledge and improve the explainability of solvers. Our two algorithms take the problem text and the solution equations as input. Then, they try to deduce the required commonsense and domain knowledge by integrating information from both parts. We construct two math datasets and show that our algorithms can effectively retrieve the required knowledge for problem-solving.

Sequence to General Tree: Knowledge-Guided Geometry Word Problem Solving
Shih-hung Tsai | Chao-Chun Liang | Hsin-Min Wang | Keh-Yih Su
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)

With the recent advancements in deep learning, neural solvers have gained promising results in solving math word problems. However, these SOTA solvers only generate binary expression trees that contain basic arithmetic operators and do not explicitly use the math formulas. As a result, the expression trees they produce are lengthy and uninterpretable because they need to use multiple operators and constants to represent one single formula. In this paper, we propose sequence-to-general tree (S2G) that learns to generate interpretable and executable operation trees where the nodes can be formulas with an arbitrary number of arguments. With nodes now allowed to be formulas, S2G can learn to incorporate mathematical domain knowledge into problem-solving, making the results more interpretable. Experiments show that S2G can achieve a better performance against strong baselines on problems that require domain knowledge.

Answering Chinese Elementary School Social Studies Multiple Choice Questions
Chao-Chun Liang | Daniel Lee | Meng-Tse Wu | Hsin-Min Wang | Keh-Yih Su
International Journal of Computational Linguistics & Chinese Language Processing, Volume 26, Number 2, December 2021

2020

A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers
Shen-yun Miao | Chao-Chun Liang | Keh-Yih Su
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

We present ASDiv (Academia Sinica Diverse MWP Dataset), a diverse (in terms of both language patterns and problem types) English math word problem (MWP) corpus for evaluating the capability of various MWP solvers. Existing MWP corpora for studying AI progress remain limited either in language usage patterns or in problem types. We thus present a new English MWP corpus with 2,305 MWPs that cover more text patterns and most problem types taught in elementary school. Each MWP is annotated with its problem type and grade level (for indicating the level of difficulty). Furthermore, we propose a metric to measure the lexicon usage diversity of a given MWP corpus, and demonstrate that ASDiv is more diverse than existing corpora. Experiments show that our proposed corpus reflects the true capability of MWP solvers more faithfully.

2018

A Meaning-Based Statistical English Math Word Problem Solver
Chao-Chun Liang | Yu-Shiang Wong | Yi-Chung Lin | Keh-Yih Su
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers)

In this paper, we introduce MeSys, a meaning-based approach for solving English math word problems (MWPs) via understanding and reasoning. It first analyzes the text, transforms both body and question parts into their corresponding logic forms, and then performs inference on them. The associated context of each quantity is represented with proposed role-tags (e.g., nsubj, verb, etc.), which provide the flexibility for annotating an extracted math quantity with its associated context information (i.e., the physical meaning of this quantity). Statistical models are proposed to select the operator and operands. A noisy dataset is designed to assess whether a solver solves MWPs mainly via understanding or mechanical pattern matching. Experimental results show that our approach outperforms existing systems on both benchmark datasets and the noisy dataset, which demonstrates that the proposed approach better understands the meaning of each quantity in the text.

Supporting Evidence Retrieval for Answering Yes/No Questions
Meng-Tse Wu | Yi-Chung Lin | Keh-Yih Su
International Journal of Computational Linguistics & Chinese Language Processing, Volume 23, Number 2, December 2018

Adopting the Word-Pair-Dependency-Triplets with Individual Comparison for Natural Language Inference
Qianlong Du | Chengqing Zong | Keh-Yih Su
Proceedings of the 27th International Conference on Computational Linguistics

This paper proposes to perform natural language inference with Word-Pair-Dependency-Triplets. Most previous DNN-based approaches either ignore syntactic dependency among words, or directly use tree-LSTM to generate sentence representation with irrelevant information. To overcome the problems mentioned above, we adopt Word-Pair-Dependency-Triplets to improve alignment and inference judgment. To be specific, instead of comparing each triplet from one passage with the merged information of another passage, we first propose to perform comparison directly between the triplets of the given passage-pair to make the judgement more interpretable. Experimental results show that the performance of our approach is better than most of the approaches that use tree structures, and is comparable to other state-of-the-art approaches.

是非題之支持證據檢索 (Supporting Evidence Retrieval for Answering Yes/No Questions) [In Chinese]
Meng-Tse Wu | Yi-Chung Lin | Keh-Yih Su
Proceedings of the 30th Conference on Computational Linguistics and Speech Processing (ROCLING 2018)

2016

A Meaning-based English Math Word Problem Solver with Understanding, Reasoning and Explanation
Chao-Chun Liang | Shih-Hong Tsai | Ting-Yun Chang | Yi-Chung Lin | Keh-Yih Su
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations

This paper presents a meaning-based statistical math word problem (MWP) solver with understanding, reasoning and explanation. It comprises a web user interface and pipelined modules for analysing the text, transforming both body and question parts into their logic forms, and then performing inference on them. The associated context of each quantity is represented with proposed role-tags (e.g., nsubj, verb, etc.), which provides the flexibility for annotating the extracted math quantity with its associated syntactic and semantic information (which specifies the physical meaning of that quantity). Those role-tags are then used to identify the desired operands and filter out irrelevant quantities (so that the answer can be obtained precisely). Since the physical meaning of each quantity is explicitly represented with those role-tags and used in the inference process, the proposed approach could explain how the answer is obtained in a human comprehensible way.

A Tag-based English Math Word Problem Solver with Understanding, Reasoning and Explanation
Chao-Chun Liang | Kuang-Yi Hsu | Chien-Tsung Huang | Chung-Min Li | Shen-Yu Miao | Keh-Yih Su
Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations

構建一個中文國小數學文字問題語料庫 (Building a Corpus for Developing the Chinese Elementary School Math Word Problem Solver) [In Chinese]
Shen-Yun Miao | Su-Chu Lin | Wei-Yun Ma | Keh-Yih Su
Proceedings of the 28th Conference on Computational Linguistics and Speech Processing (ROCLING 2016)

Building A Case-based Semantic English-Chinese Parallel Treebank
Huaxing Shi | Tiejun Zhao | Keh-Yih Su
Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16)

We construct a case-based English-to-Chinese semantic constituent parallel Treebank for a Statistical Machine Translation (SMT) task by labelling each node of the Deep Syntactic Tree (DST) with our refined semantic cases. Since subtree span-crossing is harmful in tree-based SMT, DST is adopted to alleviate this problem. At the same time, we tailor an existing case set to represent bilingual shallow semantic relations more precisely. This Treebank is a part of a semantic corpus building project, which aims to build a semantic bilingual corpus annotated with syntactic, semantic cases and word senses. Data in our Treebank is from the news domain of Datum corpus. 4,000 sentence pairs are selected to cover various lexicons and part-of-speech (POS) n-gram patterns as much as possible. This paper presents the construction of this case Treebank. Also, we have tested the effect of adopting DST structure in alleviating subtree span-crossing. Our preliminary analysis shows that the compatibility between Chinese and English trees can be significantly increased by transforming the parse-tree into the DST. Furthermore, the human agreement rate in annotation is found to be acceptable (90% in English nodes, 75% in Chinese nodes).

In this paper, the major problems of current machine translation systems are first outlined. A new direction, highlighting the system's capability to be customizable and self-learnable, is then proposed for attacking the described problems, which mainly result from the very complicated characteristics of natural languages. The proposed solution adopts an unsupervised two-way training mechanism and a parameterized architecture to acquire the required statistical knowledge, such that the system can be easily adapted to different domains and various preferences of individual users.

A Level-Synchronous Approach to Ill-formed Sentence Parsing and Error Recovery
Yi-Chung Lin | Keh-Yih Su
International Journal of Computational Linguistics & Chinese Language Processing, Volume 4, Number 1, February 1999

1998

Error Recovery in Natural Language Parsing With a Level-Synchronous Approach
Yi-Chung Lin | Keh-Yih Su
Proceedings of Research on Computational Linguistics Conference XI

1997

Corpus-Based Statistics-Oriented (CBSO) Machine Translation Researches in Taiwan
Jing-Shin Chang | Keh-Yih Su
Proceedings of Machine Translation Summit VI: Papers

A brief introduction to the MT research projects in Taiwan is given in this paper. Special attention is given to the increasingly popular corpus-based statistics-oriented (CBSO) approaches in MT research. In particular, the parameterized two-way training philosophy in designing the second generation BehaviorTran, which is the first and the largest operational system in this area, is introduced.

A Level-synchronous Approach to Ill-formed Sentence Parsing
Yi-Chung Lin | Keh-Yih Su
Proceedings of the 10th Research on Computational Linguistics International Conference

A Multivariate Gaussian Mixture Model for Automatic Compound Word Extraction
Jing-Shin Chang | Keh-Yih Su
Proceedings of the 10th Research on Computational Linguistics International Conference

Computational Tools and Resources for Linguistic Studies
Yu-Ling Una Hsu | Jing-Shin Chang | Keh-Yih Su
International Journal of Computational Linguistics & Chinese Language Processing, Volume 2, Number 1, February 1997: Special Issue on Computational Resources for Research in Chinese Linguistics

An Unsupervised Iterative Method for Chinese New Lexicon Extraction
Jing-Shin Chang | Keh-Yih Su
International Journal of Computational Linguistics & Chinese Language Processing, Volume 2, Number 2, August 1997

1996

Proceedings of Rocling IX Computational Linguistics Conference IX
Chung-Hsien Wu | Keh-Yih Su
Proceedings of Rocling IX Computational Linguistics Conference IX

An Overview of Corpus-Based Statistics-Oriented (CBSO) Techniques for Natural Language Processing
Keh-Yih Su | Tung-Hui Chiang | Jing-Shin Chang
International Journal of Computational Linguistics & Chinese Language Processing, Volume 1, Number 1, August 1996

Statistical Models for Deep-structure Disambiguation
TungHui Chiang | Keh-Yih Su
Fourth Workshop on Very Large Corpora

1995

Automatic Construction of a Chinese Electronic Dictionary
Jing-Shin Chang | Yi-Chung Lin | Keh-Yih Su
Third Workshop on Very Large Corpora

The New Generation BehaviorTran: Design Philosophy And System Architecture
Yu-Ling Una Hsu | Keh-Yih Su
Proceedings of Rocling VIII Computational Linguistics Conference VIII

A Corpus-based Two-Way Design for Parameterized MT Systems: Rationale, Architecture and Training Issues
Keh-Yih Su | Jing-Shin Chang | Yu-Ling Una Hsu
Proceedings of the Sixth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

Robust Learning, Smoothing, and Parameter Tying on Syntactic Ambiguity Resolution
Tung-Hui Chiang | Yi-Chung Lin | Keh-Yih Su
Computational Linguistics, Volume 21, Number 3, September 1995

1994

A Corpus-based Approach to Automatic Compound Extraction
Keh-Yih Su | Ming-Wen Wu | Jing-Shin Chang
32nd Annual Meeting of the Association for Computational Linguistics

An Automatic Treebank Conversion Algorithm for Corpus Sharing
Jong-Nae Wang | Jing-Shin Chang | Keh-Yih Su
32nd Annual Meeting of the Association for Computational Linguistics

AUTOMATIC MODEL REFINEMENT - with an application to tagging
Yi-Chung Lin | Tung-Hui Chiang | Keh-Yih Su
COLING 1994 Volume 1: The 15th International Conference on Computational Linguistics

1993

A Corpus-Based Statistics-Oriented Transfer and Generation Model for Machine Translation
Jing-Shin Chang | Keh-Yih Su
Proceedings of the Fifth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

A Preliminary Study On Unknown Word Problem In Chinese Word Segmentation
Ming-Yu Lin | Tung-Hui Chiang | Keh-Yih Su
Proceedings of Rocling VI Computational Linguistics Conference VI

Corpus-based Automatic Rule Selection in Designing a Grammar Checker
Yuan-Ling Liu | Shih-ping Wang | Keh-Yih Su
Proceedings of Rocling VI Computational Linguistics Conference VI

Corpus-based Automatic Compound Extraction with Mutual Information and Relative Frequency Count
Ming-Wen Wu | Keh-Yih Su
Proceedings of Rocling VI Computational Linguistics Conference VI

1992

Why corpus-based statistics-oriented machine translation
Keh-Yih Su | Jing-Shin Chang
Proceedings of the Fourth Conference on Theoretical and Methodological Issues in Machine Translation of Natural Languages

Syntactic Ambiguity Resolution Using A Discrimination and Robustness Oriented Adaptive Learning Algorithm
Tung-Hui Chiang | Yi-Chung Lin | Keh-Yih Su
COLING 1992 Volume 1: The 14th International Conference on Computational Linguistics

A New Quantitative Quality Measure for Machine Translation Systems
Keh-Yih Su | Ming-Wen Wu | Jing-Shin Chang
COLING 1992 Volume 2: The 14th International Conference on Computational Linguistics

GPSM: A Generalized Probabilistic Semantic Model for Ambiguity Resolution
Jing-Shin Chang | Yih-Fen Luo | Keh-Yih Su
30th Annual Meeting of the Association for Computational Linguistics

Discrimination Oriented Probabilistic Tagging
Yi-Chung Lin | Tung-Hui Chiang | Keh-Yih Su
Proceedings of Rocling V Computational Linguistics Conference V

Statistical Models for Word Segmentation And Unknown Word Resolution
Tung-Hui Chiang | Jing-Shin Chang | Ming-Yu Lin | Keh-Yih Su
Proceedings of Rocling V Computational Linguistics Conference V

1991

Constructing A Phrase Structure Grammar By Incorporating Linguistic Knowledge And Statistical Log-Likelihood Ratio
Keh-Yih Su | Yu-Ling Hsu | Claire Saillard
Proceedings of Rocling IV Computational Linguistics Conference IV

1990

The Semantic Score Approach to the Disambiguation of PP Attachment Problem
Chao-Lin Liu | Jing-Shin Chang | Keh-Yih Su
Proceedings of Rocling III Computational Linguistics Conference III

1989

A Unification-based Approach to Lexicography for Machine Translation System
Shu-Chuan Chen | Mei-Hui Wang | Keh-Yih Su
Proceedings of Rocling II Computational Linguistics Conference II

A Quantitative Comparison Between an LR Parser and an ATN Interpreter
Chao-Lin Liu | Keh-Yih Su
Proceedings of Rocling II Computational Linguistics Conference II

Smoothing Statistic Databases in a Machine Translation System
Keh-Yih Su | Mei-Hui Su | Li-Mei Kuan
Proceedings of Rocling II Computational Linguistics Conference II

A Sequential Truncation Parsing Algorithm Based on the Score Function
Keh-Yih Su | Jong-Nae Wang | Mei-Hui Su | Jing-Shin Chang
Proceedings of the First International Workshop on Parsing Technologies

In a natural language processing system, a large amount of ambiguity and a large branching factor hinder obtaining the desired analysis for a given sentence in a short time. In this paper, we propose a sequential truncation parsing algorithm to reduce the search space and thus lower the parsing time. The algorithm is based on a score function which takes advantage of the probabilistic characteristics of syntactic information in the sentences. A preliminary test of this algorithm was conducted with a special version of our machine translation system, the ARCHTRAN, and an encouraging result was observed.