2025
UORA: Uniform Orthogonal Reinitialization Adaptation in Parameter Efficient Fine-Tuning of Large Models
Xueyan Zhang | Jinman Zhao | Zhifei Yang | Yibo Zhong | Shuhao Guan | Linbo Cao | Yining Wang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces UoRA, a novel parameter-efficient fine-tuning (PEFT) approach for large language models (LLMs). UoRA achieves state-of-the-art efficiency by leveraging a low-rank approximation method that reduces the number of trainable parameters without compromising performance. Unlike existing methods such as LoRA and VeRA, UoRA employs a re-parametrization mechanism that eliminates the need to adapt frozen projection matrices while maintaining shared projection layers across the model. This halves the trainable parameters compared to LoRA and outperforms VeRA in computation and storage efficiency. Comprehensive experiments across various benchmarks demonstrate that UoRA achieves competitive fine-tuning performance with minimal computational overhead. We demonstrate its performance on the GLUE and E2E benchmarks and its effectiveness in instruction-tuning large language models and in image classification models. Our contributions establish a new paradigm for scalable and resource-efficient fine-tuning of LLMs.
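The abstract does not spell out UoRA's exact parametrization, so the sketch below assumes a VeRA-style construction consistent with the description: frozen random projections shared across all adapted layers, with only a small per-layer scaling vector trained. All names (`A`, `B`, `lam`) and the scaling placement are illustrative assumptions, not the paper's definition.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 4  # hidden size and adapter rank

# Frozen, randomly initialized projections shared across every adapted
# layer (hypothetical structure; the paper's construction may differ).
A = rng.standard_normal((r, d)) / np.sqrt(d)
B = rng.standard_normal((d, r)) / np.sqrt(r)

# The only trainable parameters: one small scaling vector per layer.
# LoRA would instead train full B (d*r) and A (r*d) matrices per layer.
lam = np.zeros(r)

def adapted_forward(W, x, lam):
    """Forward pass through a frozen weight W plus the low-rank update."""
    delta_w = B @ np.diag(lam) @ A  # rank-r update, zero at initialization
    return (W + delta_w) @ x

W = rng.standard_normal((d, d))  # frozen pretrained weight
x = rng.standard_normal(d)
y0 = adapted_forward(W, x, lam)          # identical to the base model
y1 = adapted_forward(W, x, np.ones(r))   # after (toy) training of lam
```

Because `lam` starts at zero, the adapter is a no-op at initialization, and the trainable-parameter count per layer is `r` rather than LoRA's `r * 2d`.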
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
Shuhao Guan | Moule Lin | Cheng Xu | Xinyi Liu | Jinman Zhao | Jiexin Fan | Qi Xu | Derek Greene
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper introduces PreP-OCR, a two-stage pipeline that combines document image restoration with semantic-aware post-OCR correction to enhance both visual clarity and textual consistency, thereby improving text extraction from degraded historical documents. First, we synthesize document-image pairs from plaintext, rendering them with diverse fonts and layouts and then applying a randomly ordered set of degradation operations. An image restoration model is trained on this synthetic data, using multi-directional patch extraction and fusion to process large images. Second, a ByT5 post-OCR model, fine-tuned on synthetic historical text pairs, addresses remaining OCR errors. Detailed experiments on 13,831 pages of real historical documents in English, French, and Spanish show that the PreP-OCR pipeline reduces character error rates by 63.9-70.3% compared to OCR on raw images. Our pipeline demonstrates the potential of integrating image restoration with linguistic error correction for digitizing historical archives.
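The headline result above is stated in terms of character error rate (CER). For readers unfamiliar with the metric, a minimal self-contained implementation is sketched below: edit distance between reference and hypothesis, normalized by reference length.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    return levenshtein(reference, hypothesis) / max(len(reference), 1)
```

A 63.9-70.3% CER reduction means, roughly, that two-thirds of the character-level edits needed to fix the raw OCR output are no longer needed after the pipeline runs.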
SSA: Semantic Contamination of LLM-Driven Fake News Detection
Cheng Xu | Nan Yan | Shuhao Guan | Yuke Mei | Tahar Kechadi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Benchmark data contamination (BDC) silently inflates the evaluation performance of large language models (LLMs), yet current work on BDC has centered on direct token overlap (data/label level), leaving the subtler and equally harmful semantic-level BDC largely unexplored. This gap is critical in the fake news detection task, where prior exposure to semantic BDC lets a model “remember” the answer instead of reasoning. In this work, (1) we are the first to formally define semantic contamination for this task, and (2) we introduce the Semantic Sensitivity Amplifier (SSA), a lightweight, model-agnostic framework that detects BDC risks from the semantic to the label level via an entity shift perturbation and a comprehensive, interpretable metric, the SSA Factor. Evaluating 45 variants of nine LLMs (0.5B-72B parameters) across four BDC levels, we find that LIAR2 accuracy climbs monotonically with injected contamination, while the SSA Factor escalates in near-perfect lockstep (r ≥ .97 for models ≥ 3B, p < .05; ρ ≥ .9 overall, p < .05). These results show that SSA provides a sensitive and scalable audit of comprehensive BDC risk and paves the way for a higher-integrity evaluation of LLM-driven fake news detection.
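The core mechanism named above, entity shift perturbation, can be pictured with a toy sketch: swap the named entities in a claim while keeping its structure, then measure how often the model's verdict changes. The helper names, the lookup-table entity swapper, and the flip-rate score below are all illustrative stand-ins; the paper's actual perturbation and SSA Factor computation are more involved.

```python
def entity_shift(claim, entities):
    """Swap each named entity for a different one (toy lookup table;
    a real system would use NER and curated entity pools)."""
    out = claim
    for original, replacement in entities.items():
        out = out.replace(original, replacement)
    return out

def sensitivity(model, claims, entity_maps):
    """Fraction of verdicts that flip when only the entities change.
    A model that memorized the benchmark keeps its verdict tied to the
    surface entities; a model that reasons about the claim may not."""
    flips = sum(model(c) != model(entity_shift(c, m))
                for c, m in zip(claims, entity_maps))
    return flips / len(claims)

# Toy "memorizing" classifier that keys on a single entity string.
toy_model = lambda c: "fake" if "Mars" in c else "real"
score = sensitivity(toy_model,
                    ["NASA found water on Mars"],
                    [{"Mars": "Venus"}])
```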
DCR: Quantifying Data Contamination in LLMs Evaluation
Cheng Xu | Nan Yan | Shuhao Guan | Changhong Jin | Yuke Mei | Yibing Guo | Tahar Kechadi
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
The rapid advancement of large language models (LLMs) has heightened concerns about benchmark data contamination (BDC), where models inadvertently memorize evaluation data during training, inflating performance metrics and undermining genuine generalization assessment. This paper introduces the Data Contamination Risk (DCR) framework, a lightweight, interpretable pipeline designed to detect and quantify BDC risk across four granular levels: semantic, informational, data, and label. By synthesizing contamination scores via a fuzzy inference system, DCR produces a unified DCR Factor that adjusts raw accuracy to reflect contamination-aware performance. Validated on 9 LLMs (0.5B-72B) across sentiment analysis, fake news detection, and arithmetic reasoning tasks, the DCR framework reliably diagnoses contamination severity, and accuracy adjusted with the DCR Factor stays within a 4% average error of the uncontaminated baseline across the three benchmarks. Emphasizing computational efficiency and transparency, DCR provides a practical tool for integrating contamination assessment into routine evaluations, fostering fairer comparisons and enhancing the credibility of LLM benchmarking practices.
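The fuzzy-inference step can be illustrated with a deliberately simplified sketch: four per-level contamination scores pass through a triangular membership function and are max-aggregated into a single risk factor, which then discounts raw accuracy. The rule base, membership shapes, and defuzzification below are placeholders, not the paper's actual fuzzy system.

```python
def tri(x, a, b, c):
    """Triangular membership function on [a, c], peaking at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def dcr_factor(semantic, informational, data, label):
    """Aggregate four per-level contamination scores (each in [0, 1])
    into one risk factor; toy Mamdani-style max-aggregation, where the
    worst offending level dominates (illustrative only)."""
    scores = [semantic, informational, data, label]
    high = [tri(s, 0.3, 1.0, 1.7) for s in scores]  # "high risk" degree
    return max(high)

def adjusted_accuracy(raw_acc, factor):
    """Discount raw accuracy by the estimated contamination risk."""
    return raw_acc * (1.0 - factor)
```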
2024
Advancing Arabic Sentiment Analysis: ArSen Benchmark and the Improved Fuzzy Deep Hybrid Network
Yang Fang | Cheng Xu | Shuhao Guan | Nan Yan | Yuke Mei
Proceedings of the 28th Conference on Computational Natural Language Learning
Sentiment analysis is pivotal in Natural Language Processing for understanding opinions and emotions in text. While advancements in sentiment analysis for English are notable, Arabic Sentiment Analysis (ASA) lags, despite the growing Arabic online user base. Existing ASA benchmarks are often outdated and lack comprehensive evaluation capabilities for state-of-the-art models. To bridge this gap, we introduce ArSen, a meticulously annotated COVID-19-themed Arabic dataset, and the IFDHN, a novel model incorporating fuzzy logic for enhanced sentiment classification. ArSen provides a contemporary, robust benchmark, and IFDHN achieves state-of-the-art performance on ASA tasks. Comprehensive evaluations demonstrate the efficacy of IFDHN using the ArSen dataset, highlighting future research directions in ASA.
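To give a flavor of "fuzzy logic for sentiment classification", the toy sketch below maps a scalar polarity score onto graded memberships for three sentiment classes and defuzzifies by taking the strongest one. IFDHN's actual fuzzy layer operates on learned deep features, not a scalar score; this is only an assumed illustration of the general idea.

```python
def memberships(polarity):
    """Map a polarity score in [-1, 1] to fuzzy membership degrees for
    three sentiment classes (toy piecewise-linear membership functions)."""
    neg = max(0.0, min(1.0, -polarity))
    pos = max(0.0, min(1.0, polarity))
    neu = max(0.0, 1.0 - abs(polarity))
    return {"negative": neg, "neutral": neu, "positive": pos}

def classify(polarity):
    """Defuzzify by picking the class with the highest membership."""
    m = memberships(polarity)
    return max(m, key=m.get)
```

The point of the fuzzy view is that borderline inputs (polarity near zero) carry partial membership in several classes instead of a hard label.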
Effective Synthetic Data and Test-Time Adaptation for OCR Correction
Shuhao Guan | Cheng Xu | Moule Lin | Derek Greene
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Post-OCR technology is used to correct errors in the text produced by OCR systems. This study introduces a method for constructing post-OCR synthetic data with different noise levels using weak supervision. We define Character Error Rate (CER) thresholds for “effective” and “ineffective” synthetic data, allowing us to create more useful multi-noise level synthetic datasets. Furthermore, we propose Self-Correct-Noise Test-Time Adaptation (SCN-TTA), which combines self-correction and noise generation mechanisms. SCN-TTA allows a model to dynamically adjust to test data without relying on labels, effectively handling proper nouns in long texts and further reducing CER. In our experiments, we evaluate a range of models, including multiple PLMs and LLMs. Results indicate that our method yields models that are effective across diverse text types. Notably, the ByT5 model achieves a CER reduction of 68.67% without relying on manually annotated data.
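The idea of CER thresholds for "effective" synthetic data can be sketched as a simple filter over (clean, noisy) pairs: keep only those whose noise level falls inside a target band. The threshold values below are placeholders, not the ones defined in the paper.

```python
def edit_distance(a, b):
    """Character-level edit distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[-1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def filter_effective(pairs, lo=0.02, hi=0.35):
    """Keep synthetic (clean, noisy) pairs whose CER lies inside the
    'effective' band; pairs that are near-clean or hopelessly garbled
    contribute little training signal. Thresholds are illustrative."""
    kept = []
    for clean, noisy in pairs:
        rate = edit_distance(clean, noisy) / max(len(clean), 1)
        if lo <= rate <= hi:
            kept.append((clean, noisy))
    return kept
```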
Advancing Post-OCR Correction: A Comparative Study of Synthetic Data
Shuhao Guan | Derek Greene
Findings of the Association for Computational Linguistics: ACL 2024
This paper explores the application of synthetic data in the post-OCR domain on multiple fronts by conducting experiments to assess the impact of data volume, augmentation, and synthetic data generation methods on model performance. Furthermore, we introduce a novel algorithm that leverages computer vision feature detection algorithms to calculate glyph similarity for constructing post-OCR synthetic data. Through experiments conducted across a variety of languages, including several low-resource ones, we demonstrate that models like ByT5 can significantly reduce Character Error Rates (CER) without the need for manually annotated data, and our proposed synthetic data generation method shows advantages over traditional methods, particularly in low-resource languages.
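The glyph-similarity idea above can be pictured with a crude stand-in: compare rendered glyph bitmaps by pixel overlap, then rank each character's most visually similar neighbors to seed substitution noise. The paper's algorithm uses computer-vision feature detection rather than raw pixel IoU, so this sketch only conveys the shape of the approach.

```python
import numpy as np

def glyph_similarity(g1, g2):
    """Crude stand-in for feature-based glyph similarity:
    intersection-over-union of two binary glyph bitmaps."""
    inter = np.logical_and(g1, g2).sum()
    union = np.logical_or(g1, g2).sum()
    return float(inter) / union if union else 1.0

def confusion_ranking(glyphs):
    """For each character, rank the others by visual similarity;
    the top pairs seed substitution errors in synthetic OCR data."""
    chars = list(glyphs)
    return {c: sorted((o for o in chars if o != c),
                      key=lambda o: -glyph_similarity(glyphs[c], glyphs[o]))
            for c in chars}
```

In practice the bitmaps would come from rendering each character in the target font; visually confusable pairs (e.g. "rn" vs "m", "l" vs "1") then receive higher substitution probability when noising clean text.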