Qi Bi

2026

Segment Anything Model 3 (SAM3) advances open-vocabulary segmentation through promptable concept segmentation, enabling users to segment all instances associated with a given concept using short noun-phrase (NP) prompts. While effective for concept-level grounding, real-world interactions often involve far richer natural-language instructions that combine attributes, relations, actions, states, or implicit reasoning. Currently, SAM3 relies on external multi-modal agents to convert complex instructions into NPs and to filter masks iteratively, which leads to coarse representations and limited instance specificity. In this work, we present SAM3-I, an instruction-following extension of the SAM family that unifies concept-level grounding and instruction-level reasoning within a single segmentation framework. Built upon SAM3, SAM3-I introduces an instruction-aware cascaded adaptation mechanism with dedicated alignment losses that progressively aligns expressive instruction semantics with SAM3’s vision-language representations, enabling direct interpretation of natural-language instructions while preserving SAM3’s strong concept recall. To enable instruction-following learning, we introduce HMPL-Instruct, a large-scale instruction-centric dataset that systematically covers hierarchical instruction semantics and diverse target granularities. Experiments demonstrate that SAM3-I achieves strong performance across referring and reasoning-based segmentation, showing that SAM3 can be effectively extended to follow complex natural-language instructions without sacrificing its original concept-driven strengths. Code and dataset are available at https://github.com/debby-0527/SAM3-I.

2025

MaCP: Minimal yet Mighty Adaptation via Hierarchical Cosine Projection
Yixian Shen | Qi Bi | Jia-hong Huang | Hongyi Zhu | Andy D. Pimentel | Anuj Pathania
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)

We present a new adaptation method MaCP, Minimal yet Mighty adaptive Cosine Projection, that achieves exceptional performance while requiring minimal parameters and memory for fine-tuning large foundation models. Its general idea is to exploit the superior energy compaction and decorrelation properties of cosine projection to improve both model efficiency and accuracy. Specifically, it projects the weight change from the low-rank adaptation into the discrete cosine space. Then, the weight change is partitioned over different levels of the discrete cosine spectrum, and each partition’s most critical frequency components are selected. Extensive experiments demonstrate the effectiveness of MaCP across a wide range of single-modality tasks, including natural language understanding, natural language generation, text summarization, as well as multi-modality tasks such as image classification and video understanding. MaCP consistently delivers superior accuracy, significantly reduced computational complexity, and lower memory requirements compared to existing alternatives.
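
The pipeline described in this abstract (low-rank weight change, projection into the discrete cosine space, partitioning of the spectrum into levels, per-level selection) can be illustrated with a small pure-Python sketch. This is not the authors' implementation: the banding rule based on `u + v`, the use of coefficient magnitude as the "most critical" criterion, and the names `select_per_level`, `n_levels`, and `keep_per_level` are all assumptions made for illustration.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct_matrix(n):
    """Orthonormal DCT-II basis; row k holds the k-th cosine basis vector."""
    basis = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        basis.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                      for i in range(n)])
    return basis

def dct2(w):
    """2D DCT: transform along the columns, then along the rows."""
    return matmul(matmul(dct_matrix(len(w)), w), transpose(dct_matrix(len(w[0]))))

def idct2(s):
    """Inverse 2D DCT (the basis is orthonormal, so its inverse is its transpose)."""
    return matmul(matmul(transpose(dct_matrix(len(s))), s), dct_matrix(len(s[0])))

def select_per_level(spectrum, n_levels, keep_per_level):
    """Zero all but the largest-magnitude coefficients in each frequency band."""
    rows, cols = len(spectrum), len(spectrum[0])
    bands = [[] for _ in range(n_levels)]
    for u in range(rows):
        for v in range(cols):
            # Hypothetical banding rule: low u + v means low frequency.
            level = (u + v) * n_levels // (rows + cols - 1)
            bands[level].append((abs(spectrum[u][v]), u, v))
    kept = [[0.0] * cols for _ in range(rows)]
    for band in bands:
        for _, u, v in sorted(band, reverse=True)[:keep_per_level]:
            kept[u][v] = spectrum[u][v]
    return kept

# Toy rank-1 update: delta_w = B @ A with B of shape 4x1 and A of shape 1x4.
B = [[0.5], [-1.0], [0.25], [2.0]]
A = [[1.0, 0.0, -0.5, 0.75]]
delta_w = matmul(B, A)

# Keep at most 2 coefficients in each of 3 spectral levels, then map back.
sparse_spectrum = select_per_level(dct2(delta_w), n_levels=3, keep_per_level=2)
approx_delta_w = idct2(sparse_spectrum)
```

In this toy setting only the retained coefficients (at most six here) would need to be stored or trained, and `approx_delta_w` is the weight change reconstructed from them.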

SSH: Sparse Spectrum Adaptation via Discrete Hartley Transformation
Yixian Shen | Qi Bi | Jia-hong Huang | Hongyi Zhu | Andy D. Pimentel | Anuj Pathania
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)

Low-rank adaptation (LoRA) has proven effective in reducing the number of trainable parameters when fine-tuning a large foundation model (LLM). However, it still encounters computational and memory challenges when scaling to larger models or addressing more complex task adaptation. In this work, we introduce Sparse Spectrum Adaptation via Discrete Hartley Transformation (SSH), a novel approach that significantly reduces the number of trainable parameters while enhancing model performance. It selects the most informative spectral components across all layers, under the guidance of the initial weights after a discrete Hartley transformation (DHT). The lightweight inverse DHT then projects the spectrum back into the spatial domain for updates. Extensive experiments across both single-modality tasks (such as language understanding and generation) and multi-modality tasks (such as video-text understanding) demonstrate that SSH outperforms existing parameter-efficient fine-tuning (PEFT) methods while achieving substantial reductions in computational cost and memory requirements. For instance, during instruction tuning on the LLaMA3.1 8B model, SSH achieves higher accuracy with only 0.048M trainable parameters compared to LoRA’s 33.5M, while reducing computational intensity by up to 55% compared to FourierFT.
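
The select-then-invert idea in this abstract can be sketched in a few lines of pure Python. This is an illustrative sketch under stated assumptions, not the SSH implementation: it uses a separable 2D DHT, it ranks spectral positions by the magnitude of the initial weights' Hartley coefficients, and the names `top_indices` and `delta_from_coeffs` are hypothetical.

```python
import math

def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def cas_matrix(n):
    """DHT kernel matrix with entries cas(t) = cos(t) + sin(t); it is symmetric."""
    return [[math.cos(2 * math.pi * k * i / n) + math.sin(2 * math.pi * k * i / n)
             for i in range(n)] for k in range(n)]

def dht2(x):
    """Separable 2D discrete Hartley transform (an assumed variant)."""
    return matmul(matmul(cas_matrix(len(x)), x), cas_matrix(len(x[0])))

def idht2(h):
    """Inverse transform: the DHT is its own inverse up to a 1/(rows*cols) factor."""
    rows, cols = len(h), len(h[0])
    return [[v / (rows * cols) for v in row] for row in dht2(h)]

def top_indices(spectrum, n_keep):
    """Positions of the n_keep largest-magnitude coefficients (assumed criterion)."""
    ranked = sorted(((abs(v), u, w) for u, row in enumerate(spectrum)
                     for w, v in enumerate(row)), reverse=True)
    return [(u, w) for _, u, w in ranked[:n_keep]]

def delta_from_coeffs(shape, indices, coeffs):
    """Place trainable coefficients at the chosen positions and invert the DHT."""
    rows, cols = shape
    sparse = [[0.0] * cols for _ in range(rows)]
    for (u, w), c in zip(indices, coeffs):
        sparse[u][w] = c
    return idht2(sparse)

# Toy "initial weights"; their spectrum only guides which positions are trained.
w0 = [[0.9, -0.2, 0.1, 0.0],
      [0.3, 0.8, -0.4, 0.2],
      [-0.1, 0.5, 0.7, -0.3],
      [0.0, -0.6, 0.2, 1.1]]
indices = top_indices(dht2(w0), n_keep=4)

# Only these 4 numbers would be trained; zeros give a zero update at the start.
coeffs = [0.0, 0.0, 0.0, 0.0]
delta_w = delta_from_coeffs((4, 4), indices, coeffs)
```

Here a 4x4 layer is updated through 4 spectral coefficients instead of 16 dense entries; the updated weight would be `w0` plus `delta_w` once `coeffs` has been trained.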

Co-authors

Wei Ji 1

Venues

ACL 2
NAACL 1
