Full metadata record
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Leem, Saebom | - |
dc.contributor.author | Seo, Hyunseok | - |
dc.date.accessioned | 2024-09-19T02:30:08Z | - |
dc.date.available | 2024-09-19T02:30:08Z | - |
dc.date.created | 2024-09-19 | - |
dc.date.issued | 2024-02 | - |
dc.identifier.issn | 2159-5399 | - |
dc.identifier.uri | https://pubs.kist.re.kr/handle/201004/150630 | - |
dc.description.abstract | Vision Transformer (ViT) is one of the most widely used models in computer vision, owing to its strong performance on a variety of tasks. To fully utilize ViT-based architectures in various applications, proper visualization methods with decent localization performance are necessary, but the methods employed for CNN-based models are not directly applicable to ViT because of its unique structure. In this work, we propose an attention-guided visualization method for ViT that provides a high-level semantic explanation of its decisions. Our method selectively aggregates the gradients propagated directly from the classification output to each self-attention layer, collecting the contribution of image features extracted at each location of the input image. These gradients are additionally guided by the normalized self-attention scores, i.e., the pairwise patch-correlation scores, which supplement the gradients with the patch-level contextual information efficiently detected by the self-attention mechanism. As a result, our method produces elaborate high-level semantic explanations with strong localization performance using only class labels. It outperforms the previous leading explainability methods for ViT on the weakly supervised localization task and shows a strong ability to capture all instances of the target-class object. Moreover, our method provides visualizations that faithfully explain the model, as demonstrated in a perturbation comparison test. | - |
dc.language | English | - |
dc.publisher | ASSOC ADVANCEMENT ARTIFICIAL INTELLIGENCE | - |
dc.title | Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention | - |
dc.type | Conference | - |
dc.identifier.doi | 10.1609/aaai.v38i4.28077 | - |
dc.description.journalClass | 1 | - |
dc.identifier.bibliographicCitation | 38th AAAI Conference on Artificial Intelligence (AAAI) / 36th Conference on Innovative Applications of Artificial Intelligence / 14th Symposium on Educational Advances in Artificial Intelligence, pp.2956 - 2964 | - |
dc.citation.title | 38th AAAI Conference on Artificial Intelligence (AAAI) / 36th Conference on Innovative Applications of Artificial Intelligence / 14th Symposium on Educational Advances in Artificial Intelligence | - |
dc.citation.startPage | 2956 | - |
dc.citation.endPage | 2964 | - |
dc.citation.conferencePlace | Vancouver, CANADA | - |
dc.citation.conferenceDate | 2024-02-20 | - |
dc.relation.isPartOf | THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 4 | - |
dc.identifier.wosid | 001239884400009 | - |
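The mechanism summarized in the abstract (class-score gradients aggregated per self-attention layer and guided by the normalized attention scores) can be sketched roughly as below. This is an illustrative assumption of how such an aggregation might look, not the paper's actual implementation; the function name, tensor shapes, and the ReLU-then-multiply aggregation are all hypothetical choices made for the sketch.

```python
import numpy as np

def attention_guided_relevance(attn_scores, attn_grads):
    """Illustrative sketch (NOT the authors' exact method): combine each
    layer's softmaxed self-attention maps with the gradients of the class
    score w.r.t. those maps to obtain a patch-level relevance vector.

    attn_scores: list of (heads, patches, patches) attention maps, rows sum to 1
    attn_grads:  list of same-shaped gradients of the class score
    Returns a (patches,) relevance vector normalized to [0, 1].
    """
    relevance = np.zeros(attn_scores[0].shape[-1])
    for A, dA in zip(attn_scores, attn_grads):
        # Keep only positively contributing gradients, then guide them by
        # the attention (pairwise patch-correlation) scores.
        guided = np.maximum(dA, 0.0) * A
        # Average over heads, then accumulate the relevance each patch receives.
        relevance += guided.mean(axis=0).sum(axis=0)
    # Min-max normalize for visualization as a heatmap.
    relevance -= relevance.min()
    if relevance.max() > 0:
        relevance /= relevance.max()
    return relevance

# Toy example: 2 layers, 2 heads, 4 patches of random attention/gradients.
rng = np.random.default_rng(0)
scores = [rng.dirichlet(np.ones(4), size=(2, 4)) for _ in range(2)]
grads = [rng.normal(size=(2, 4, 4)) for _ in range(2)]
cam = attention_guided_relevance(scores, grads)
print(cam.shape)
```

In practice the resulting patch-level vector would be reshaped to the patch grid and upsampled to image resolution to produce the heatmap used for localization.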
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.