Abstract
Decompilation aims to convert binary code to high-level source code, but traditional tools like Ghidra often produce results that are difficult to read and execute. Motivated by the advancements in Large Language Models (LLMs), we propose LLM4Decompile, the first and largest open-source LLM series (1.3B to 33B) trained to decompile binary code. We optimize the LLM training process and introduce the LLM4Decompile-End models to decompile binary directly. The resulting models significantly outperform GPT-4o and Ghidra on the HumanEval and ExeBench benchmarks by over 100% in terms of re-executability rate. Additionally, we improve the standard refinement approach to fine-tune the LLM4Decompile-Ref models, enabling them to effectively refine the decompiled code from Ghidra and achieve a further 16.2% improvement over the LLM4Decompile-End. LLM4Decompile demonstrates the potential of LLMs to revolutionize binary code decompilation, delivering remarkable improvements in readability and executability while complementing conventional tools for optimal results.
- Anthology ID:
- 2024.emnlp-main.203
- Volume:
- Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
- Month:
- November
- Year:
- 2024
- Address:
- Miami, Florida, USA
- Editors:
- Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
- Venue:
- EMNLP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 3473–3487
- URL:
- https://aclanthology.org/2024.emnlp-main.203/
- DOI:
- 10.18653/v1/2024.emnlp-main.203
- Cite (ACL):
- Hanzhuo Tan, Qi Luo, Jing Li, and Yuqun Zhang. 2024. LLM4Decompile: Decompiling Binary Code with Large Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3473–3487, Miami, Florida, USA. Association for Computational Linguistics.
- Cite (Informal):
- LLM4Decompile: Decompiling Binary Code with Large Language Models (Tan et al., EMNLP 2024)
- PDF:
- https://aclanthology.org/2024.emnlp-main.203.pdf
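
For readers who want a concrete sense of the decompile-then-refine pipeline the abstract describes, the sketch below shows how an LLM4Decompile-style checkpoint might be queried from Python via the Hugging Face transformers library. This is a minimal illustration, not the authors' evaluation harness: the model identifier, the prompt template, and the generation settings are assumptions based on typical causal-LM usage, so consult the official LLM4Decompile release for the exact format the checkpoints expect.

```python
# Minimal sketch of querying an LLM4Decompile-style model with Hugging Face
# transformers. The checkpoint id and prompt template below are illustrative
# assumptions; check the official LLM4Decompile repository for the real ones.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "LLM4Binary/llm4decompile-1.3b-v1.5"  # assumed checkpoint id

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16
).eval()

# Input for a -Ref model: pseudo-C emitted by Ghidra for the target function.
# (A -End model would instead take the disassembled binary directly.)
ghidra_output = """
undefined8 func0(long param_1, int param_2)
{ /* ... Ghidra pseudo-C for the target function ... */ }
"""
prompt = (
    "# This is the decompiled pseudo-code:\n"
    f"{ghidra_output}\n"
    "# What is the source code?\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, i.e. the recovered C source.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```

The re-executability metric from the abstract can then be approximated by compiling the generated C with gcc and running it against the benchmark's test cases; a sample counts as successful only if it compiles and passes all tests.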