SSTRAC: Skeleton-Based Dual-Stream Spatio-Temporal Transformer for Repetitive Action Counting in Videos

Authors
LIM, JUNGJUN; KANG, DONG HOON; RYU, KANGHYUN; HONG, JE HYEONG
Issue Date
2025-10
Publisher
Institute of Electrical and Electronics Engineers Inc.
Citation
IEEE Access, v.13, pp.184046 - 184058
Abstract
Most existing approaches to counting repetitive actions in videos focus on improving model accuracy but overlook important issues such as robustness to changes in human body size and to occlusion of human body parts. To achieve robustness to these variations, we propose a novel network, the Skeleton-based dual-stream Spatio-temporal Transformer for Repetitive Action Counting (SSTRAC), which reconstructs defective human skeletons as a preprocessing step and then encodes the spatial and temporal information of repetitive actions into per-frame embeddings through a dual-stream spatio-temporal transformer. To capture both high- and low-frequency actions in short and long videos, the per-frame embeddings are abstracted in the form of a multi-scale self-attention matrix. In the final step, the period predictor estimates a density map, which yields the number of repetitive actions in each video. We performed extensive experiments comparing the proposed model with other recent state-of-the-art models. The experimental results demonstrate the superiority of our model in terms of robustness to changes in human size and occlusion of human body parts in videos. Code and models are available at https://github.com/imjjun/SSTRAC_public.
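
For illustration, below is a minimal sketch of the density-map counting step described in the abstract: per-frame embeddings are passed through self-attention and a small head that predicts a non-negative per-frame density, whose sum over frames is the repetition count. The module structure, dimensions, and the single-attention simplification are assumptions made for this sketch, not the authors' released implementation (see the GitHub repository for that).

import torch
import torch.nn as nn

class PeriodPredictorSketch(nn.Module):
    """Maps per-frame embeddings to a per-frame density map whose sum
    approximates the number of repetitions in the clip (sketch only)."""
    def __init__(self, embed_dim=128, num_heads=4):
        super().__init__()
        # A single self-attention layer over frames stands in for the
        # multi-scale dual-stream spatio-temporal transformer of SSTRAC.
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.head = nn.Sequential(
            nn.Linear(embed_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, 1),
            nn.Softplus(),  # density values must be non-negative
        )

    def forward(self, frame_embeddings):
        # frame_embeddings: (batch, num_frames, embed_dim)
        attended, _ = self.attn(frame_embeddings, frame_embeddings, frame_embeddings)
        density = self.head(attended).squeeze(-1)  # (batch, num_frames)
        count = density.sum(dim=1)                 # predicted repetition count
        return density, count

if __name__ == "__main__":
    model = PeriodPredictorSketch()
    clip = torch.randn(2, 64, 128)  # 2 clips, 64 frames, 128-d skeleton embeddings
    density, count = model(clip)
    print(density.shape, count)     # torch.Size([2, 64]) and one count per clip

Summing a density map rather than regressing a single integer lets the model localize each repetition in time, which is what makes the approach usable for videos of varying length and action frequency.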
URI
https://pubs.kist.re.kr/handle/201004/153401
DOI
10.1109/ACCESS.2025.3624029
Appears in Collections:
KIST Article > 2025