Effective SAM Combination for Open-Vocabulary Semantic Segmentation

Authors
Lee, Minhyeok; Cho, Suhwan; Lee, Jungho; Yang, Sunghun; Choi, Heeseung; Kim, Ig-Jae; Lee, Sangyoun
Issue Date
2025-06-10
Publisher
IEEE
Citation
2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26081-26090
Abstract
Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. Additionally, a Vision-Language Fusion (VLF) module enhances the final mask prediction through image and text guidance. ESC-Net achieves superior performance on standard benchmarks, including PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
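
The abstract describes a one-stage data flow: image-text correlation maps are turned into pseudo prompts for SAM-style decoder blocks, and a Vision-Language Fusion module refines the mask features with text guidance. The paper itself is the authoritative source; the following is only a minimal PyTorch-style sketch of that flow, in which every module name (PseudoPromptGenerator, VLFusion), tensor shape, and fusion detail is a hypothetical assumption, not the authors' implementation.

    # Hypothetical sketch of the pipeline described in the abstract.
    # All names, shapes, and design details below are illustrative
    # assumptions, not the authors' ESC-Net implementation.
    import torch
    import torch.nn as nn

    class PseudoPromptGenerator(nn.Module):
        """Converts image-text correlations into prompt embeddings
        for a SAM-style promptable decoder (hypothetical)."""
        def __init__(self, dim: int, num_prompts: int = 4):
            super().__init__()
            self.proj = nn.Linear(dim, dim)
            self.num_prompts = num_prompts

        def forward(self, img_feats: torch.Tensor, txt_embeds: torch.Tensor):
            # img_feats: (B, HW, D) patch features; txt_embeds: (C, D) class embeddings
            corr = img_feats @ txt_embeds.t()  # (B, HW, C) image-text correlation
            # Use the most text-correlated locations as pseudo point prompts.
            topk = corr.max(dim=-1).values.topk(self.num_prompts, dim=1).indices
            prompts = torch.gather(
                img_feats, 1,
                topk.unsqueeze(-1).expand(-1, -1, img_feats.size(-1)))
            return self.proj(prompts)  # (B, num_prompts, D), fed to the mask decoder

    class VLFusion(nn.Module):
        """Fuses text guidance into the mask features via cross-attention
        (hypothetical stand-in for the VLF module)."""
        def __init__(self, dim: int):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

        def forward(self, mask_feats: torch.Tensor, txt_embeds: torch.Tensor):
            txt = txt_embeds.unsqueeze(0).expand(mask_feats.size(0), -1, -1)
            fused, _ = self.attn(mask_feats, txt, txt)  # attend mask features to text
            return mask_feats + fused  # residual refinement before mask prediction

Under these assumptions, prompt generation and mask refinement happen in a single forward pass, which is the efficiency argument the abstract makes against two-stage proposal-then-classify pipelines.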
URI
https://pubs.kist.re.kr/handle/201004/153920
DOI
10.1109/cvpr52734.2025.02429
Appears in Collections:
KIST Conference Paper > 2025
