2025
pdf
bib
abs
CodeComplex: Dataset for Worst-Case Time Complexity Prediction
SeungYeop Baik
|
Joonghyuk Hahn
|
Jungin Kim
|
Aditi
|
Mingi Jeon
|
Yo-Sub Han
|
Sang-Ki Ko
Findings of the Association for Computational Linguistics: EMNLP 2025
Reasoning ability of large language models (LLMs) is a crucial ability,especially in complex decision-making tasks. One significant task to show LLMs’reasoning capability is code time complexity prediction, which involves variousintricate factors such as the input range of variables and conditional loops.Current benchmarks fall short of providing a rigorous assessment due to limiteddata, language constraints, and insufficient labeling. They do not consider timecomplexity based on input representation and merely evaluate whether predictionsfall into the same class, lacking a measure of how close incorrect predictionsare to the correct ones.To address these dependencies, we introduce CodeComplex, the first robust andextensive dataset designed to evaluate LLMs’ reasoning abilities in predictingcode time complexity. CodeComplex comprises 4,900 Java codes and an equivalentnumber of Python codes, overcoming language and labeling constraints, carefullyannotated with complexity labels based on input characteristics by a panel ofalgorithmic experts. Additionally, we propose specialized evaluation metrics forthe reasoning of complexity prediction tasks, offering a more precise andreliable assessment of LLMs’ reasoning capabilities. We release our dataset andbaseline models publicly to encourage the relevant (NLP, SE, and PL) communitiesto utilize and participate in this research. Our code and data are available athttps://github.com/sybaik1/CodeComplex.
2023
pdf
bib
abs
GDA: Grammar-based Data Augmentation for Text Classification using Slot Information
Joonghyuk Hahn
|
Hyunjoon Cheon
|
Elizabeth Orwig
|
Su-Hyeon Kim
|
Sang-Ki Ko
|
Yo-Sub Han
Findings of the Association for Computational Linguistics: EMNLP 2023
Recent studies propose various data augmentation approaches to resolve the low-resource problem in natural language processing tasks. Data augmentation is a successful solution to this problem and recent strategies give variation on sentence structures to boost performance. However, these approaches can potentially lead to semantic errors and produce semantically noisy data due to the unregulated variation of sentence structures. In an effort to combat these semantic errors, we leverage slot information, the representation of the context of keywords from a sentence, and form a data augmentation strategy which we propose, called GDA. Our strategy employs algorithms that construct and manipulate rules of context-aware grammar, utilizing this slot information. The algorithms extract recurrent patterns by distinguishing words with slots and form the “rules of grammar”—a set of injective relations between a sentence’s semantics and its syntactical structure—to augment the dataset. The augmentation is done in an automated manner with the constructed rules and thus, GDA is explainable and reliable without any human intervention. We evaluate GDA with state-of-the-art data augmentation techniques, including those using pre-trained language models, and the result illustrates that GDA outperforms all other data augmentation methods by 19.38%. Extensive experiments show that GDA is an effective data augmentation strategy that incorporates word semantics for more accurate and diverse data.
2021
pdf
bib
abs
MultiFix: Learning to Repair Multiple Errors by Optimal Alignment Learning
HyeonTae Seo
|
Yo-Sub Han
|
Sang-Ki Ko
Findings of the Association for Computational Linguistics: EMNLP 2021
We consider the problem of learning to repair erroneous C programs by learning optimal alignments with correct programs. Since the previous approaches fix a single error in a line, it is inevitable to iterate the fixing process until no errors remain. In this work, we propose a novel sequence-to-sequence learning framework for fixing multiple program errors at a time. We introduce the edit-distance-based data labeling approach for program error correction. Instead of labeling a program repair example by pairing an erroneous program with a line fix, we label the example by paring an erroneous program with an optimal alignment to the corresponding correct program produced by the edit-distance computation. We evaluate our proposed approach on a publicly available dataset (DeepFix dataset) that consists of erroneous C programs submitted by novice programming students. On a set of 6,975 erroneous C programs from the DeepFix dataset, our approach achieves the state-of-the-art result in terms of full repair rate on the DeepFix dataset (without extra data such as compiler error message or additional source codes for pre-training).
2019
pdf
bib
abs
SoftRegex: Generating Regex from Natural Language Descriptions using Softened Regex Equivalence
Jun-U Park
|
Sang-Ki Ko
|
Marco Cognetta
|
Yo-Sub Han
Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)
We continue the study of generating se-mantically correct regular expressions from natural language descriptions (NL). The current state-of-the-art model SemRegex produces regular expressions from NLs by rewarding the reinforced learning based on the semantic (rather than syntactic) equivalence between two regular expressions. Since the regular expression equivalence problem is PSPACE-complete, we introduce the EQ_Reg model for computing the simi-larity of two regular expressions using deep neural networks. Our EQ_Reg mod-el essentially softens the equivalence of two regular expressions when used as a reward function. We then propose a new regex generation model, SoftRegex, us-ing the EQ_Reg model, and empirically demonstrate that SoftRegex substantially reduces the training time (by a factor of at least 3.6) and produces state-of-the-art results on three benchmark datasets.