Reward Models (RMs), vital for large model alignment, are underexplored for complex embodied tasks like Embodied Question Answering (EQA) where nuanced evaluation of agents’ spatial, temporal, and logical understanding is critical yet not considerred by generic approaches. We introduce EQA-RM, a novel generative multimodal reward model specifically architected for EQA, trained via our innovative Contrastive Group Relative Policy Optimization (C-GRPO) strategy to learn fine-grained behavioral distinctions. The generative nature of EQA-RM provides interpretable, structured reward feedback (beyond simple scalars), uniquely enabling test-time scaling to dynamically adjust evaluation granularity, from concise scores to detailed critiques of reasoning and grounding, at inference without retraining. Concurrently, we introduce EQARewardBench, a new benchmark built on OpenEQA for standardized EQA reward model assessment. Demonstrating high sample efficiency, EQA-RM (fine-tuning Qwen2-VL-2B-Instruct) achieves 61.9% accuracy on EQA-RM-Bench with 700 samples, outperforming strong proprietary baselines, including Gemini-2.5-Flash, GPT-4o, Claude-3.5-Haiku, and open-sourced state-of-the-art models such as RoVRM and VisualPRM.
Bit-flip errors (BFEs) are hardware faults where individual bits in memory or processing units are unintentionally flipped. These errors pose a significant threat to neural network reliability because even small changes in model parameters can lead to large shifts in outputs. Large language models (LLMs) are particularly vulnerable on resource-constrained or outdated hardware. Such hardware often lacks error-correction mechanisms and faces aging issues, leading to instability under the vast parameter counts and heavy computational loads of LLMs. While the impact of BFEs on traditional networks like CNNs is relatively well-studied, their effect on the complex architecture of transformers remains largely unexplored. Firstly, this paper presents a comprehensive systematic analysis of BFE vulnerabilities in key LLM components, revealing distinct sensitivities across parameters, activations, and gradients during fine-tuning and inference. Secondly, based on our findings, we introduce a novel defense strategy FlipGuard: (i) exponent bit protection, and (ii) a self-correction based fine-tuning mechanism, to address BFE consequences. FlipGuard minimizes performance degradation while significantly enhancing robustness against BFEs. Experiments demonstrate a 9.27 reduction in accuracy drop under 1 BFEs on the SST-2 dataset using BERT, and a 36.35-point improvement in perplexity on the Wikitext-103 dataset using GPT-2, compared to unprotected models. These results show the potential of our approach in enabling reliable LLM deployment on diverse and less reliable hardware platforms.
Image mass cytometry (IMC) enables high-dimensional spatial profiling by combining mass cytometry’s analytical power with spatial distributions of cell phenotypes. Recent studies leverage large language models (LLMs) to extract cell states by translating gene or protein expression into biological context. However, existing single-cell LLMs face two major challenges: (1) Integration of spatial information—they struggle to generalize spatial coordinates and effectively encode spatial context as text, and (2) Treating each cell independently—they overlook cell-cell interactions, limiting their ability to capture biological relationships. To address these limitations, we propose Spatial2Sentence, a novel framework that integrates both single-cell expression and spatial information into natural language using a multi-sentence approach. Given an expression matrix and spatial coordinates, Spatial2Sentence constructs expression similarity and distance matrices, pairing spatially adjacent and expressionally similar cells as positive pairs while using distant and dissimilar cells as negatives. These multi-sentence representations are processed by LLMs, enabling them to learn cellular interactions in both expression and spatial contexts. Equipped with multi-task learning, Spatial2Sentence outperforms existing single-cell LLMs on preprocessed IMC datasets for diabetes and brain tumors, improving cell-type classification by 5.98% and clinical status prediction by 4.18% on the diabetes dataset while enhancing interpretability. The source code can be found here:
https://github.com/UNITES-Lab/Spatial2Sentence.
未经整理的古代典籍不含任何标点,不符合当代人的阅读习惯,古籍断句标点之后有助于阅读、研究和出版。本文提出了一种基于预训练语言模型的繁体古文自动句读框架。本文整理了约10亿字的繁体古文语料,对于训练语言模型进行增量训练,在此基础上上实现古文自动句读和标点。实验表明经过大规模繁体古文语料增量训练后的语言模型具备更好的古文语义表示能力,能够有助提升繁体古文自动句读和自动标点的效果。融合了增量训练模型之后,古文断句F1值达到95.03%,古文标点F1值达到了80.18%,分别比使用未增量训练的语言模型提升1.83%和2.21%。为解决现有篇章级句读方案效率低的问题,本文改进了前人的串行滑动窗口方式,在一定程度上提高了句读效率,并提出一种新的并行滑动窗口方式,能够高效准确地进行长文本自动句读。