Debjit Pal


2025

pdf bib
AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation
Vaishnavi Pulavarthi | Deeksha Nandal | Soham Dan | Debjit Pal
Findings of the Association for Computational Linguistics: NAACL 2025

Assertions have been the de facto collateral for hardware for over a decade. The verification quality, i.e., detection and diagnosis of corner-case design bugs, is critically dependent on the assertion quality. There has been a considerable amount of research to generate high-quality assertions from hardware design source code and design execution trace data. With recent advent of generative AI techniques such as Large-Language Models (LLMs), there has been a renewed interest in deploying LLMs for assertion generation. However, there is little effort to quantitatively establish the effectiveness and suitability of various LLMs for assertion generation. In this paper, we present AssertionBench, a novel benchmark to evaluate LLMs’ effectiveness for assertion generation quantitatively. AssertionBench contains 100 curated Verilog hardware designs from OpenCores and formally verified assertions for each design, generated from GoldMine and HARM. We use AssertionBench to compare state-of-the-art LLMs, e.g., GPT-3.5, GPT-4o, CodeLLaMa-2, and LLaMa3-70B, to assess their effectiveness in inferring functionally correct assertions for hardware designs. Our experiments comprehensively demonstrate how LLMs perform relative to each other, the benefits of using more in-context exemplars in generating a higher fraction of functionally correct assertions, and the significant room for improvement for LLM-based assertion generators.