Textual Attention RPN for Open-Vocabulary Object Detection
- Authors
- Choi, Tae-Min; Yoon, Inug; Kim, Jong-Hwan; Park, Ju youn
- Issue Date
- 2024-11-25
- Publisher
- The British Machine Vision Association and Society for Pattern Recognition
- Citation
- The 35th British Machine Vision Conference
- Abstract
- Open-vocabulary object detection (OVD) is a computer vision task that detects and classifies objects from categories not seen during training. While recent OVD methods primarily focus on aligning region embeddings with vision-language pre-trained models such as CLIP for classification, object detection also requires effective localization. However, existing methods often use a proposal generator biased toward the training data, which limits further performance gains. To address this challenge, we introduce the Textual Attention Region Proposal Network (TA-RPN). This network enhances proposal generation by fusing visual features with textual features from the CLIP text encoder, using pixel-wise attention so that the fusion covers the entire image space. Our approach also incorporates prompt learning to optimize the textual features for better localization. Evaluated on the COCO and LVIS benchmarks, TA-RPN outperforms existing state-of-the-art methods, demonstrating its effectiveness in detecting novel object categories.
- Appears in Collections:
- KIST Conference Paper > 2024
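The abstract states that TA-RPN fuses visual features with CLIP text features through pixel-wise attention but does not say how that attention is computed. The sketch below is a minimal, hypothetical illustration assuming standard scaled dot-product cross-attention: every spatial location of a visual feature map attends over a set of text embeddings (for example, one per category prompt) and the attended text feature is added back residually. The function name `pixelwise_text_attention`, the residual fusion, and the assumption that visual and text features share one dimension are illustrative choices, not the authors' design. Prompt learning and the proposal heads are not modeled.

```python
import math
from typing import List, Tuple

Vector = List[float]
FeatureMap = List[List[Vector]]  # H x W x C


def softmax(scores: Vector) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pixelwise_text_attention(
    feature_map: FeatureMap, text_embeddings: List[Vector]
) -> Tuple[FeatureMap, List[List[Vector]]]:
    """Hypothetical pixel-wise cross-attention from image features to text features.

    Assumption: visual and text features already share dimension C
    (a real model would typically learn projections for this).
    Returns (fused H x W x C map, H x W x K attention weights).
    """
    dim = len(text_embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    fused: FeatureMap = []
    attention: List[List[Vector]] = []
    for row in feature_map:
        fused_row, attn_row = [], []
        for pixel in row:
            # Scaled dot-product score between this pixel and each text embedding.
            scores = [
                scale * sum(p * t for p, t in zip(pixel, text))
                for text in text_embeddings
            ]
            weights = softmax(scores)
            # Attention-weighted sum of the text embeddings for this pixel.
            attended = [
                sum(w * text[c] for w, text in zip(weights, text_embeddings))
                for c in range(dim)
            ]
            # Residual fusion: keep the visual feature and add the textual context.
            fused_row.append([v + a for v, a in zip(pixel, attended)])
            attn_row.append(weights)
        fused.append(fused_row)
        attention.append(attn_row)
    return fused, attention


if __name__ == "__main__":
    # Toy 1 x 2 feature map with 2-D features and two text embeddings.
    fused, attention = pixelwise_text_attention(
        [[[1.0, 0.0], [0.0, 1.0]]], [[1.0, 0.0], [0.0, 1.0]]
    )
    print(attention)
    print(fused)
```

In this toy example each pixel places more weight on the text embedding it is most similar to. Per-pixel weights of this kind could act as a text-conditioned cue for where objects of the prompted categories are likely; how TA-RPN actually turns fused features into proposals is not described in this record.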
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.