Sungchul Choi


2025

PILOT-Bench: A Benchmark for Legal Reasoning in the Patent Domain with IRAC-Aligned Classification Tasks
Yehoon Jang | Chaewon Lee | Hyun-seok Min | Sungchul Choi
Proceedings of the Natural Legal Language Processing Workshop 2025

The Patent Trial and Appeal Board (PTAB) of the USPTO adjudicates thousands of *ex parte* appeals each year, requiring the integration of technical understanding and legal reasoning. While large language models (LLMs) are increasingly applied in patent and legal practice, their use has remained limited to lightweight tasks, with no established means of systematically evaluating their capacity for structured legal reasoning in the patent domain. To address this gap, we introduce **PILOT-Bench** (**P**atent **I**nva**L**idati**O**n **T**rial Benchmark), a dataset and benchmark that aligns PTAB decisions with USPTO patent data at the case level and formalizes three IRAC-aligned classification tasks: Issue Type, Board Authorities, and Subdecision. We evaluate a diverse set of commercial and open-source LLMs and conduct analyses across multiple perspectives, including input-variation settings, model families, and error tendencies. Notably, on the Issue Type task, commercial models consistently exceed 0.75 in Exact Match, whereas the strongest open-source model (Qwen-8B) achieves only around 0.56, highlighting the substantial gap in reasoning capabilities. PILOT-Bench establishes a foundation for the systematic evaluation of patent-domain legal reasoning and points toward future directions for improving LLMs through dataset design and model alignment. All data, code, and benchmark resources are available at https://github.com/TeamLab/pilot-bench.
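
The headline metric here is Exact Match over per-case label sets. As a minimal sketch of what set-level Exact Match means for these classification tasks (the data format, labels, and field names below are hypothetical stand-ins, not the released schema at the GitHub repository):

```python
from typing import List

def exact_match(preds: List[List[str]], golds: List[List[str]]) -> float:
    """Fraction of cases whose predicted label set equals the gold set exactly."""
    assert len(preds) == len(golds)
    return sum(set(p) == set(g) for p, g in zip(preds, golds)) / len(golds)

# Hypothetical cases for the three IRAC-aligned tasks (labels are illustrative only).
cases = [
    {"task": "issue_type", "pred": ["obviousness"], "gold": ["obviousness"]},
    {"task": "board_authorities", "pred": ["KSR"], "gold": ["KSR", "Graham"]},
    {"task": "subdecision", "pred": ["affirmed"], "gold": ["affirmed"]},
]

for task in ("issue_type", "board_authorities", "subdecision"):
    rows = [c for c in cases if c["task"] == task]
    em = exact_match([c["pred"] for c in rows], [c["gold"] for c in rows])
    print(f"{task}: EM = {em:.2f}")
```

Under this definition, a prediction that recovers only part of a multi-label gold set (as in the Board Authorities row above) scores zero for that case, which is why Exact Match separates the models so sharply.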

2023

Bag of Tricks for In-Distribution Calibration of Pretrained Transformers
Jaeyoung Kim | Dongbin Na | Sungchul Choi | Sungbin Lim
Findings of the Association for Computational Linguistics: EACL 2023

While pre-trained language models (PLMs) have become the de-facto standard for accurate text classification, recent studies find that PLMs often make over-confident predictions. Although calibration methods such as ensemble learning and data augmentation have been proposed, most of them have been verified on computer vision benchmarks rather than on PLM-based text classification tasks. In this paper, we present an empirical study of confidence calibration for PLMs covering three categories of methods: confidence penalty losses, data augmentations, and ensembles. We find that an ensemble model overfitted to the training set shows sub-par calibration performance, and we also observe that PLMs trained with a confidence penalty loss exhibit a trade-off between calibration and accuracy. Building on these observations, we propose the Calibrated PLM (CALL), a combination of calibration techniques. CALL complements the shortcomings that may arise when each calibration method is used individually and boosts both classification and calibration accuracy. Design choices in CALL’s training procedure are studied extensively, and we provide a detailed analysis of how calibration techniques affect the calibration performance of PLMs.
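
One ingredient named in the abstract, the confidence penalty loss, subtracts a weighted entropy bonus from the cross-entropy objective so that low-entropy (over-confident) predictive distributions are penalized. A minimal PyTorch sketch of that loss, assuming a standard classification head and a hypothetical penalty weight `beta` (this is not the CALL training code):

```python
import torch
import torch.nn.functional as F

def confidence_penalty_loss(logits: torch.Tensor, labels: torch.Tensor,
                            beta: float = 0.1) -> torch.Tensor:
    """Cross-entropy minus a weighted entropy bonus: L = CE - beta * H(p).

    Low-entropy (over-confident) predictive distributions incur a higher loss,
    discouraging the over-confidence the paper observes in PLMs.
    """
    ce = F.cross_entropy(logits, labels)
    log_probs = F.log_softmax(logits, dim=-1)
    entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
    return ce - beta * entropy

# Toy usage: a batch of 4 examples over 3 classes.
logits = torch.randn(4, 3)
labels = torch.tensor([0, 2, 1, 0])
print(confidence_penalty_loss(logits, labels).item())
```

The trade-off the paper reports follows directly from this form: larger `beta` flattens the predictive distribution, improving calibration but eventually hurting accuracy.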

Pseudo Outlier Exposure for Out-of-Distribution Detection using Pretrained Transformers
Jaeyoung Kim | Kyuheon Jung | Dongbin Na | Sion Jang | Eunbin Park | Sungchul Choi
Findings of the Association for Computational Linguistics: ACL 2023

For real-world language applications, detecting out-of-distribution (OOD) samples is helpful for alerting users or rejecting unreliable inputs. However, modern over-parameterized language models often produce overconfident predictions for both in-distribution (ID) and OOD samples. In particular, language models struggle with OOD samples whose semantic representations are similar to those of ID samples, since such OOD samples lie near the ID manifold. A rejection network can be trained on ID and diverse outlier samples to detect OOD samples at test time, but explicitly collecting auxiliary OOD datasets imposes an additional data collection burden. In this paper, we propose a simple but effective method called Pseudo Outlier Exposure (POE) that constructs a surrogate OOD dataset by sequentially masking tokens related to ID classes. The surrogate OOD samples produced by POE have representations similar to ID data, which makes them especially effective for training a rejection network. Our method requires no external OOD data and can be easily implemented with off-the-shelf Transformers. A comprehensive comparison with state-of-the-art algorithms demonstrates POE’s competitiveness on several text classification benchmarks.
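
The core move is turning ID training texts into surrogate outliers by masking the tokens that carry class information, so the result stays near the ID manifold while losing the evidence for its label. A simplified, hypothetical sketch of that idea (not the authors' implementation: it uses a crude frequency-ratio score and whitespace tokenization in place of a real Transformer tokenizer and the paper's sequential masking procedure):

```python
from collections import Counter
from typing import List, Set

MASK = "[MASK]"

def class_indicative_tokens(texts: List[str], labels: List[int], top_k: int) -> Set[str]:
    """Crude proxy for ID-class-related tokens: score each token by how skewed
    its frequency is toward a single class, then keep the top-k."""
    per_class: dict = {}
    overall: Counter = Counter()
    for text, y in zip(texts, labels):
        toks = text.lower().split()
        per_class.setdefault(y, Counter()).update(toks)
        overall.update(toks)
    scored: dict = {}
    for counts in per_class.values():
        for tok, c in counts.items():
            # Fraction of a token's occurrences that fall in one class.
            scored[tok] = max(scored.get(tok, 0.0), c / overall[tok])
    return {tok for tok, _ in sorted(scored.items(), key=lambda kv: -kv[1])[:top_k]}

def make_pseudo_outlier(text: str, indicative: Set[str]) -> str:
    """Mask class-indicative tokens; shared, non-indicative tokens survive."""
    return " ".join(MASK if tok.lower() in indicative else tok for tok in text.split())

texts = ["the movie was a thrilling drama",
         "the markets fell as the rates rose"]
labels = [0, 1]
indicative = class_indicative_tokens(texts, labels, top_k=10)
for t in texts:
    print(make_pseudo_outlier(t, indicative))
```

In this toy run the class-specific content words are masked while the shared token "the" survives, illustrating why such pseudo-outliers sit close to the ID distribution yet no longer support their original label.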