Dongjun Lee

2025

MCS-SQL: Leveraging Multiple Prompts and Multiple-Choice Selection For Text-to-SQL Generation
Dongjun Lee | Choongwon Park | Jaehyuk Kim | Heesoo Park
Proceedings of the 31st International Conference on Computational Linguistics

Recent advancements in large language models (LLMs) have enabled in-context learning (ICL)-based methods that significantly outperform fine-tuning approaches for text-to-SQL tasks. However, their performance is still considerably lower than that of human experts on benchmarks that include complex schemas and queries, such as BIRD. This study considers the sensitivity of LLMs to the prompts and introduces a novel approach that leverages multiple prompts to explore a broader search space for possible answers and effectively aggregate them. Specifically, we robustly refine the database schema through schema linking using multiple prompts. Thereafter, we generate various candidate SQL queries based on the refined schema and diverse prompts. Finally, the candidate queries are filtered based on their confidence scores, and the optimal query is obtained through a multiple-choice selection that is presented to the LLM. When evaluated on the BIRD and Spider benchmarks, the proposed method achieved execution accuracies of 65.5% and 89.6%, respectively, significantly outperforming previous ICL-based methods.
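The final filtering and selection steps in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not define the confidence score, so the sketch assumes it is the share of candidates whose execution result matches. The function names, the `top_k` parameter, and the prompt wording are all hypothetical.

```python
from collections import defaultdict


def filter_by_confidence(candidates, top_k=2):
    """Group candidate queries by execution result and keep the top_k groups.

    candidates: list of (sql, result) pairs, where result is any hashable
    stand-in for the rows returned by executing the query.
    ASSUMPTION: confidence = fraction of candidates that produced the same
    result; the abstract does not say how the score is computed.
    """
    groups = defaultdict(list)
    for sql, result in candidates:
        groups[result].append(sql)
    total = len(candidates)
    # One representative query (the first seen) per distinct result.
    scored = [(len(sqls) / total, sqls[0]) for sqls in groups.values()]
    # Sort by confidence, highest first; ties keep first-seen order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


def build_choice_prompt(question, scored):
    """Format the surviving candidates as a multiple-choice question.

    The prompt wording is hypothetical; the resulting text would be sent
    to an LLM, which is outside the scope of this sketch.
    """
    lines = [f"Question: {question}", "Choose the correct SQL query:"]
    for idx, (_, sql) in enumerate(scored):
        lines.append(f"{chr(ord('A') + idx)}. {sql}")
    lines.append("Answer with a single letter.")
    return "\n".join(lines)
```

Grouping by execution result means that syntactically different queries returning the same rows reinforce one another, which is one plausible way to aggregate the outputs of diverse prompts.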

Dunamu ML at the Financial Misinformation Detection Challenge Task: Improving Supervised Fine-Tuning with LLM-based Data Augmentation
Dongjun Lee | Heesoo Park
Proceedings of the Joint Workshop of the 9th Financial Technology and Natural Language Processing (FinNLP), the 6th Financial Narrative Processing (FNP), and the 1st Workshop on Large Language Models for Finance and Legal (LLMFinLegal)

In this paper, we describe Dunamu ML’s submission to the Financial Misinformation Detection (FMD) 2025 shared task. To address the low-resource challenge in FMD, we augmented a general domain misinformation detection dataset for training. We first collected claims, contexts, and misinformation labels from a public dataset. Then, we generated evidence for each label based on a closed LLM with few-shot examples extracted from the FMD training dataset. Finally, we oversampled the training data specific to the financial domain and augmented it with the generated data to perform supervised fine-tuning (SFT) on the LLM. When evaluated on the blind test dataset, our model achieved an F1 score of 84.67 in misinformation classification and a ROUGE-1 score of 81.21 in evidence generation, ranking first on the leaderboard in both aspects.

2024

Dunamu-ml’s Submissions on AVERITEC Shared Task
Heesoo Park | Dongjun Lee | Jaehyuk Kim | ChoongWon Park | Changhwa Park
Proceedings of the Seventh Fact Extraction and VERification Workshop (FEVER)

This paper presents Dunamu-ml’s submission to the AVERITEC shared task of the 7th Fact Extraction and VERification (FEVER) workshop. The task focused on determining whether each claim is factual. Our method combines an LLM with a non-parametric lexicon-based method (i.e., BM25). Essentially, we used a powerful LLM to augment the list of evidence, which contains the queries and their corresponding answers, and then retrieved the relevant documents using the generated evidence. Our method substantially improved on the baseline results, achieving a gain of 0.33 in AveriTec score.
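The retrieval step, in which generated evidence serves as the query for BM25, might look roughly like the sketch below. It uses the standard Okapi BM25 formula; the whitespace tokenization, the `k1` and `b` values, and the function names are assumptions for illustration, since the abstract names only BM25 itself.

```python
import math
from collections import Counter


def bm25_scores(query_tokens, docs_tokens, k1=1.5, b=0.75):
    """Score each tokenized document against the query with Okapi BM25."""
    n_docs = len(docs_tokens)
    avg_len = sum(len(doc) for doc in docs_tokens) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter()
    for doc in docs_tokens:
        df.update(set(doc))
    scores = []
    for doc in docs_tokens:
        tf = Counter(doc)
        score = 0.0
        for term in query_tokens:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Length normalization penalizes longer-than-average documents.
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(generated_evidence, documents, top_n=1):
    """Return the top_n documents ranked against LLM-generated evidence.

    ASSUMPTION: lowercased whitespace tokenization; a real system would
    likely use a proper tokenizer.
    """
    query = generated_evidence.lower().split()
    docs = [doc.lower().split() for doc in documents]
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(documents)), key=lambda i: scores[i], reverse=True)
    return [documents[i] for i in ranked[:top_n]]
```

Because BM25 relies on lexical overlap, querying with generated evidence rather than the bare claim gives the retriever more terms to match against candidate documents.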

2021

IntelliCAT: Intelligent Machine Translation Post-Editing with Quality Estimation and Translation Suggestion
Dongjun Lee | Junhyeong Ahn | Heesoo Park | Jaemin Jo
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing: System Demonstrations

We present IntelliCAT, an interactive translation interface with neural models that streamline the post-editing process on machine translation output. We leverage two quality estimation (QE) models at different granularities: sentence-level QE, to predict the quality of each machine-translated sentence, and word-level QE, to locate the parts of the machine-translated sentence that need correction. Additionally, we introduce a novel translation suggestion model conditioned on both the left and right contexts, providing alternatives for specific words or phrases for correction. Finally, with word alignments, IntelliCAT automatically preserves the original document’s styles in the translated document. The experimental results show that post-editing based on the proposed QE and translation suggestions can significantly improve translation quality. Furthermore, a user study reveals that three features provided in IntelliCAT significantly accelerate the post-editing task, achieving a 52.9% speedup in translation time compared to translating from scratch. The interface is publicly available at https://intellicat.beringlab.com/.

Bering Lab’s Submissions on WAT 2021 Shared Task
Heesoo Park | Dongjun Lee
Proceedings of the 8th Workshop on Asian Translation (WAT2021)

This paper presents Bering Lab’s submission to the shared tasks of the 8th Workshop on Asian Translation (WAT 2021) on JPC2 and NICT-SAP. We participated in all tasks on JPC2 and the IT domain tasks on NICT-SAP. Our approach for all tasks focused mainly on building NMT systems on domain-specific corpora. We crawled patent document pairs for English-Japanese, Chinese-Japanese, and Korean-Japanese. After cleaning noisy data, we built a parallel corpus by aligning the sentences using sentence-level similarity scores. For the SAP test data, we also collected the OPUS dataset, including three IT domain corpora. We then trained a Transformer on the collected dataset. Our submission ranked 1st in eight out of fourteen tasks, achieving BLEU score improvements of up to 2.87 for JPC2 and 8.79 for NICT-SAP.

2020

Cross-Lingual Transformers for Neural Automatic Post-Editing
Dongjun Lee
Proceedings of the Fifth Conference on Machine Translation

In this paper, we describe the Bering Lab’s submission to the WMT 2020 Shared Task on Automatic Post-Editing (APE). First, we propose a cross-lingual Transformer architecture that takes a concatenation of a source sentence and a machine-translated (MT) sentence as an input to generate the post-edited (PE) output. For further improvement, we mask incorrect or missing words in the PE output based on word-level quality estimation and then predict the actual word for each mask based on the fine-tuned cross-lingual language model (XLM-RoBERTa). Finally, to address the over-correction problem, we select the final output among the PE outputs and the original MT sentence based on a sentence-level quality estimation. When evaluated on the WMT 2020 English-German APE test dataset, our system improves the NMT output by -3.95 and +4.50 in terms of TER and BLEU, respectively.
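The final selection step, which guards against over-correction, can be illustrated with a short sketch. It assumes a hypothetical `qe_score(source, hypothesis)` callable in which a higher value means better predicted quality; the abstract does not specify the scoring convention, and a model that predicts an error rate would require flipping the comparison.

```python
def select_final_output(source, mt_sentence, pe_candidates, qe_score):
    """Pick the output with the best sentence-level QE score.

    The original MT sentence competes with the post-edited candidates, so
    an edit is kept only when it is predicted to be strictly better.
    qe_score is a hypothetical callable (source, hypothesis) -> float,
    ASSUMED to be higher-is-better.
    """
    best, best_score = mt_sentence, qe_score(source, mt_sentence)
    for candidate in pe_candidates:
        score = qe_score(source, candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best
```

Starting from the MT sentence and requiring a strict improvement means that ties leave the machine translation untouched.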

Two-Phase Cross-Lingual Language Model Fine-Tuning for Machine Translation Quality Estimation
Dongjun Lee
Proceedings of the Fifth Conference on Machine Translation

In this paper, we describe the Bering Lab’s submission to the WMT 2020 Shared Task on Quality Estimation (QE). For word-level and sentence-level translation quality estimation, we fine-tune XLM-RoBERTa, the state-of-the-art cross-lingual language model, with a few additional parameters. Model training consists of two phases. We first pre-train our model on a huge artificially generated QE dataset, and then we fine-tune the model with a human-labeled dataset. When evaluated on the WMT 2020 English-German QE test set, our systems achieve the best result on the target-side of word-level QE and the second best results on the source-side of word-level QE and sentence-level QE among all submissions.

2019

Clause-Wise and Recursive Decoding for Complex and Cross-Domain Text-to-SQL Generation
Dongjun Lee
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

Most deep learning approaches for text-to-SQL generation are limited to the WikiSQL dataset, which only supports very simple queries over a single table. We focus on the Spider dataset, a complex and cross-domain text-to-SQL task, which includes complex queries over multiple tables. In this paper, we propose a SQL clause-wise decoding neural architecture with a self-attention-based database schema encoder to address the Spider task. Each of the clause-specific decoders consists of a set of sub-modules, which are defined by the syntax of each clause. Additionally, our model works recursively to support nested queries. When evaluated on the Spider dataset, our approach achieves accuracy gains of 4.6% and 9.8% on the test and dev sets, respectively. In addition, we show that our model is significantly more effective at predicting complex and nested queries than previous work.
