<?xml version="1.0" encoding="utf-8" standalone="no"?>
<dublin_core schema="dc">
<dcvalue element="contributor" qualifier="author">Choi,&#x20;Tae-Min</dcvalue>
<dcvalue element="contributor" qualifier="author">Yoon,&#x20;Inug</dcvalue>
<dcvalue element="contributor" qualifier="author">Kim,&#x20;Jong-Hwan</dcvalue>
<dcvalue element="contributor" qualifier="author">Park,&#x20;Ju&#x20;youn</dcvalue>
<dcvalue element="date" qualifier="accessioned">2024-12-12T08:30:06Z</dcvalue>
<dcvalue element="date" qualifier="available">2024-12-12T08:30:06Z</dcvalue>
<dcvalue element="date" qualifier="created">2024-12-11</dcvalue>
<dcvalue element="date" qualifier="issued">2024-11-25</dcvalue>
<dcvalue element="identifier" qualifier="uri">https:&#x2F;&#x2F;pubs.kist.re.kr&#x2F;handle&#x2F;201004&#x2F;151350</dcvalue>
<dcvalue element="identifier" qualifier="uri">https:&#x2F;&#x2F;bmvc2024.org&#x2F;proceedings&#x2F;85&#x2F;</dcvalue>
<dcvalue element="description" qualifier="abstract">Open-vocabulary&#x20;object&#x20;detection&#x20;(OVD)&#x20;is&#x20;a&#x20;computer&#x20;vision&#x20;task&#x20;that&#x20;detects&#x20;and&#x20;classifies&#x20;objects&#x20;from&#x20;categories&#x20;not&#x20;seen&#x20;during&#x20;training.&#x20;While&#x20;recent&#x20;OVD&#x20;methods&#x20;primarily&#x20;focus&#x20;on&#x20;aligning&#x20;region&#x20;embeddings&#x20;with&#x20;visual-language&#x20;pre-trained&#x20;models&#x20;like&#x20;CLIP&#x20;for&#x20;classification,&#x20;object&#x20;detection&#x20;requires&#x20;effective&#x20;localization&#x20;as&#x20;well.&#x20;However,&#x20;existing&#x20;methods&#x20;often&#x20;use&#x20;a&#x20;proposal&#x20;generator&#x20;biased&#x20;toward&#x20;the&#x20;training&#x20;data,&#x20;which&#x20;creates&#x20;a&#x20;bottleneck&#x20;in&#x20;performance&#x20;improvement.&#x20;To&#x20;address&#x20;this&#x20;challenge,&#x20;we&#x20;introduce&#x20;the&#x20;Textual&#x20;Attention&#x20;Region&#x20;Proposal&#x20;Network&#x20;(TA-RPN).&#x20;This&#x20;network&#x20;enhances&#x20;proposal&#x20;generation&#x20;by&#x20;integrating&#x20;visual&#x20;and&#x20;textual&#x20;features&#x20;from&#x20;the&#x20;CLIP&#x20;text&#x20;encoder,&#x20;utilizing&#x20;pixel-wise&#x20;attention&#x20;for&#x20;a&#x20;comprehensive&#x20;fusion&#x20;across&#x20;the&#x20;image&#x20;space.&#x20;Our&#x20;approach&#x20;also&#x20;incorporates&#x20;prompt&#x20;learning&#x20;to&#x20;optimize&#x20;textual&#x20;features&#x20;for&#x20;better&#x20;localization.&#x20;Evaluated&#x20;on&#x20;the&#x20;COCO&#x20;and&#x20;LVIS&#x20;benchmarks,&#x20;TA-RPN&#x20;outperforms&#x20;existing&#x20;state-of-the-art&#x20;methods,&#x20;demonstrating&#x20;its&#x20;effectiveness&#x20;in&#x20;detecting&#x20;novel&#x20;object&#x20;categories.</dcvalue>
<dcvalue element="language" qualifier="none">English</dcvalue>
<dcvalue element="publisher" qualifier="none">The&#x20;British&#x20;Machine&#x20;Vision&#x20;Association&#x20;and&#x20;Society&#x20;for&#x20;Pattern&#x20;Recognition</dcvalue>
<dcvalue element="title" qualifier="none">Textual&#x20;Attention&#x20;RPN&#x20;for&#x20;Open-Vocabulary&#x20;Object&#x20;Detection</dcvalue>
<dcvalue element="type" qualifier="none">Conference</dcvalue>
<dcvalue element="description" qualifier="journalClass">1</dcvalue>
<dcvalue element="identifier" qualifier="bibliographicCitation">The&#x20;35th&#x20;British&#x20;Machine&#x20;Vision&#x20;Conference</dcvalue>
<dcvalue element="citation" qualifier="title">The&#x20;35th&#x20;British&#x20;Machine&#x20;Vision&#x20;Conference</dcvalue>
<dcvalue element="citation" qualifier="conferencePlace">UK</dcvalue>
<dcvalue element="citation" qualifier="conferencePlace">Glasgow,&#x20;UK</dcvalue>
<dcvalue element="citation" qualifier="conferenceDate">2024-11-25</dcvalue>
<dcvalue element="relation" qualifier="isPartOf">The&#x20;35th&#x20;British&#x20;Machine&#x20;Vision&#x20;Conference</dcvalue>
</dublin_core>
