Effective SAM Combination for Open-Vocabulary Semantic Segmentation
- Authors
- Lee, Minhyeok; Cho, Suhwan; Lee, Jungho; Yang, Sunghun; Choi, Heeseung; Kim, Ig-Jae; Lee, Sangyoun
- Issue Date
- 2025-06-10
- Publisher
- IEEE
- Citation
- 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 26081-26090
- Abstract
- Open-vocabulary semantic segmentation aims to assign pixel-level labels to images across an unlimited range of classes. Traditional methods address this by sequentially connecting a powerful mask proposal generator, such as the Segment Anything Model (SAM), with a pre-trained vision-language model like CLIP. However, these two-stage approaches often suffer from high computational costs and memory inefficiencies. In this paper, we propose ESC-Net, a novel one-stage open-vocabulary segmentation model that leverages the SAM decoder blocks for class-agnostic segmentation within an efficient inference framework. By embedding pseudo prompts generated from image-text correlations into SAM's promptable segmentation framework, ESC-Net achieves refined spatial aggregation for accurate mask predictions. Additionally, a Vision-Language Fusion (VLF) module enhances the final mask prediction through image and text guidance. ESC-Net achieves strong performance on standard benchmarks, including PASCAL-Context, outperforming prior methods in both efficiency and accuracy. Comprehensive ablation studies further demonstrate its robustness across challenging conditions.
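- Editor's note: the abstract describes a one-stage flow (image-text correlation → pseudo prompts → SAM-style promptable decoding → vision-language fusion). Below is a minimal PyTorch sketch of that flow for orientation only. All module names, tensor shapes, and the top-k prompt-selection heuristic are assumptions made for illustration; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


class VisionLanguageFusion(nn.Module):
    """Hypothetical VLF module: refines mask tokens with text guidance via cross-attention."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, mask_tokens: torch.Tensor, text_embeds: torch.Tensor) -> torch.Tensor:
        # mask_tokens: (B, P, D) prompt/mask tokens; text_embeds: (B, C, D) class embeddings.
        fused, _ = self.cross_attn(mask_tokens, text_embeds, text_embeds)
        return self.norm(mask_tokens + fused)


class ESCNetSketch(nn.Module):
    """Illustrative one-stage pipeline: correlation -> pseudo prompts -> promptable decoder -> VLF."""

    def __init__(self, dim: int = 256, num_prompts: int = 16):
        super().__init__()
        self.num_prompts = num_prompts
        self.prompt_proj = nn.Linear(dim, dim)  # turns selected image features into prompt tokens
        # Stand-in for SAM's decoder blocks: prompt tokens cross-attend to image features.
        self.decoder = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.vlf = VisionLanguageFusion(dim)

    def forward(self, img_feats: torch.Tensor, text_embeds: torch.Tensor):
        # img_feats: (B, HW, D) patch features; text_embeds: (B, C, D) CLIP-style text embeddings.
        corr = img_feats @ text_embeds.transpose(1, 2)       # (B, HW, C) image-text correlation
        scores = corr.max(dim=-1).values                     # best-class affinity per location
        topk = scores.topk(self.num_prompts, dim=1).indices  # strongest-correlation locations
        idx = topk.unsqueeze(-1).expand(-1, -1, img_feats.size(-1))
        pseudo_prompts = self.prompt_proj(torch.gather(img_feats, 1, idx))  # (B, P, D)

        mask_tokens = self.decoder(pseudo_prompts, img_feats)  # promptable, class-agnostic decoding
        mask_tokens = self.vlf(mask_tokens, text_embeds)       # text-guided refinement

        masks = mask_tokens @ img_feats.transpose(1, 2)        # (B, P, HW) mask logits per prompt
        cls_logits = mask_tokens @ text_embeds.transpose(1, 2) # (B, P, C) class logits per mask
        return masks, cls_logits


if __name__ == "__main__":
    # Smoke test with dummy features: 14x14 patches, 10 candidate classes, 256-d embeddings.
    masks, cls_logits = ESCNetSketch()(torch.randn(2, 196, 256), torch.randn(2, 10, 256))
    print(masks.shape, cls_logits.shape)  # torch.Size([2, 16, 196]) torch.Size([2, 16, 10])
```

- The single-pass design is what the abstract contrasts with two-stage SAM+CLIP pipelines: prompts are derived from the correlation map itself rather than from an external proposal stage, so one decoder pass yields both masks and their class scores.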
- URI
- https://pubs.kist.re.kr/handle/201004/153920
- DOI
- 10.1109/cvpr52734.2025.02429
- Appears in Collections:
- KIST Conference Paper > 2025