Knowing Where to Focus: Event-aware Transformer for Video Grounding
- Authors
- Jang, Jinhyun; Park, Jungin; Kim, Jin; Kwon, Hyeongjun; Sohn, Kwanghoon
- Issue Date
- 2023-10
- Publisher
- IEEE Computer Society
- Citation
- IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13800-13810
- Abstract
- Recent DETR-based video grounding models learn moment queries to predict moment timestamps directly, without hand-crafted components such as pre-defined proposals or non-maximum suppression. However, their input-agnostic moment queries inevitably overlook the intrinsic temporal structure of a video and provide only limited positional information. In this paper, we formulate an event-aware dynamic moment query that enables the model to take the input-specific content and positional information of the video into account. To this end, we present two levels of reasoning: 1) event reasoning, which captures the distinctive event units constituting a given video using a slot attention mechanism; and 2) moment reasoning, which fuses the moment queries with a given sentence through a gated fusion transformer layer and learns interactions between the moment queries and video-sentence representations to predict moment timestamps. Extensive experiments demonstrate the effectiveness and efficiency of the event-aware dynamic moment queries, outperforming state-of-the-art approaches on several video grounding benchmarks. The code is publicly available at https://github.com/jinhyunj/EaTR.
- ISSN
- 1550-5499
- URI
- https://pubs.kist.re.kr/handle/201004/149641
- DOI
- 10.1109/ICCV51070.2023.01273
- Appears in Collections:
- KIST Conference Paper > 2023
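The abstract above describes two reasoning levels: event reasoning, which groups frame features into event units with a slot attention mechanism, and moment reasoning, which fuses the resulting queries with the sentence through a gated fusion transformer layer. As a rough illustration of the first step only, here is a minimal, generic slot-attention sketch in PyTorch. It is not the authors' EaTR implementation (see the linked repository for the official code), and the module name `EventSlotAttention` and the hyperparameters `num_slots`, `dim`, and `iters` are placeholders chosen for illustration.

```python
import torch
import torch.nn as nn


class EventSlotAttention(nn.Module):
    """Minimal, generic slot attention (Locatello et al., 2020) used here as a
    stand-in for the event reasoning step: frame features are iteratively
    grouped into `num_slots` event-level slots. The residual MLP of the
    original formulation is omitted for brevity."""

    def __init__(self, num_slots: int = 10, dim: int = 256, iters: int = 3):
        super().__init__()
        self.iters = iters
        self.scale = dim ** -0.5
        self.slots_init = nn.Parameter(torch.randn(1, num_slots, dim))  # learned slot init
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, num_frames, dim) video features
        b, _, d = frames.shape
        frames = self.norm_in(frames)
        k, v = self.to_k(frames), self.to_v(frames)
        slots = self.slots_init.expand(b, -1, -1)
        for _ in range(self.iters):
            q = self.to_q(self.norm_slots(slots))
            # Slots compete for frames: softmax over the slot dimension.
            attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
            attn = attn / (attn.sum(dim=-1, keepdim=True) + 1e-8)  # weighted mean per slot
            updates = attn @ v                                     # (batch, num_slots, dim)
            slots = self.gru(updates.reshape(-1, d),
                             slots.reshape(-1, d)).view(b, -1, d)
        return slots  # event-level representations


# Usage: 128 frame features of dimension 256 grouped into 10 event slots.
frames = torch.randn(2, 128, 256)
events = EventSlotAttention(num_slots=10, dim=256, iters=3)(frames)
print(events.shape)  # torch.Size([2, 10, 256])
```

In the paper, event units captured this way inform the input-specific (dynamic) moment queries; the full architecture, including the gated fusion transformer layer used for moment reasoning, is available in the official repository at https://github.com/jinhyunj/EaTR.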