Taesung Lee


Full-Stack Information Extraction System for Cybersecurity Intelligence
Youngja Park | Taesung Lee
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track

Due to rapidly growing cyber-attacks and security vulnerabilities, many reports on cyber-threat intelligence (CTI) are being published daily.While these reports can help security analysts to understand on-going cyber threats,the overwhelming amount of information makes it difficult to digest the information in a timely manner.This paper presents, SecIE, an industrial-strength full-stack information extraction (IE) system for the security domain.SecIE can extract a large number of security entities, relations and the temporal information of the relations, which is critical for cyberthreat investigations.Our evaluation with 133 labeled threat reports containing 108,021 tokens shows thatSecIE achieves over 92% F1-score for entity extraction and about 70% F1-score for relation extraction.We also showcase how SecIE can be used for downstream security applications.


Utilizing Multimodal Feature Consistency to Detect Adversarial Examples on Clinical Summaries
Wenjie Wang | Youngja Park | Taesung Lee | Ian Molloy | Pengfei Tang | Li Xiong
Proceedings of the 3rd Clinical Natural Language Processing Workshop

Recent studies have shown that adversarial examples can be generated by applying small perturbations to the inputs such that the well- trained deep learning models will misclassify. With the increasing number of safety and security-sensitive applications of deep learn- ing models, the robustness of deep learning models has become a crucial topic. The robustness of deep learning models for health- care applications is especially critical because the unique characteristics and the high financial interests of the medical domain make it more sensitive to adversarial attacks. Among the modalities of medical data, the clinical summaries have higher risks to be attacked because they are generated by third-party companies. As few works studied adversarial threats on clinical summaries, in this work we first apply adversarial attack to clinical summaries of electronic health records (EHR) to show the text-based deep learning systems are vulnerable to adversarial examples. Secondly, benefiting from the multi-modality of the EHR dataset, we propose a novel defense method, MATCH (Multimodal feATure Consistency cHeck), which leverages the consistency between multiple modalities in the data to defend against adversarial examples on a single modality. Our experiments demonstrate the effectiveness of MATCH on a hospital readmission prediction task comparing with baseline methods.


Supervising Unsupervised Open Information Extraction Models
Arpita Roy | Youngja Park | Taesung Lee | Shimei Pan
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

We propose a novel supervised open information extraction (Open IE) framework that leverages an ensemble of unsupervised Open IE systems and a small amount of labeled data to improve system performance. It uses the outputs of multiple unsupervised Open IE systems plus a diverse set of lexical and syntactic information such as word embedding, part-of-speech embedding, syntactic role embedding and dependency structure as its input features and produces a sequence of word labels indicating whether the word belongs to a relation, the arguments of the relation or irrelevant. Comparing with existing supervised Open IE systems, our approach leverages the knowledge in existing unsupervised Open IE systems to overcome the problem of insufficient training data. By employing multiple unsupervised Open IE systems, our system learns to combine the strength and avoid the weakness in each individual Open IE system. We have conducted experiments on multiple labeled benchmark data sets. Our evaluation results have demonstrated the superiority of the proposed method over existing supervised and unsupervised models by a significant margin.


Probabilistic Prototype Model for Serendipitous Property Mining
Taesung Lee | Seung-won Hwang | Zhongyuan Wang
Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers

Besides providing the relevant information, amusing users has been an important role of the web. Many web sites provide serendipitous (unexpected but relevant) information to draw user traffic. In this paper, we study the representative scenario of mining an amusing quiz. An existing approach leverages a knowledge base to mine an unexpected property then find quiz questions on such property, based on prototype theory in cognitive science. However, existing deterministic model is vulnerable to noise in the knowledge base. Therefore, we instead propose to leverage probabilistic approach to build a prototype that can overcome noise. Our extensive empirical study shows that our approach not only significantly outperforms baselines by 0.06 in accuracy, and 0.11 in serendipity but also shows higher relevance than the traditional relevance-pursuing baseline using TF-IDF.


Understanding Relation Temporality of Entities
Taesung Lee | Seung-won Hwang
Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)

Map Translation Using Geo-tagged Social Media
Sunyou Lee | Taesung Lee | Seung-won Hwang
Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, volume 2: Short Papers


Bootstrapping Entity Translation on Weakly Comparable Corpora
Taesung Lee | Seung-won Hwang
Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)