Overview of the FineGym dataset. We provide coarse-to-fine annotations both temporally and semantically. There are three levels of categorical labels. The temporal dimension (represented by the two bars) is likewise divided into two levels, i.e., actions and sub-actions. A sub-action can be described generally with a set category, or precisely with an element category. The ground-truth element category of a sub-action instance is obtained via manually built decision trees.
Abstract
Current action recognition techniques have achieved great success on public benchmarks. However, when used in real-world applications, e.g., sports analysis, which requires parsing an activity into phases and differentiating between subtly different actions, their performance remains far from satisfactory. To raise action recognition to a new level, we develop FineGym, a new dataset built on top of gymnastics videos. Compared to existing action recognition datasets, FineGym stands out in richness, quality, and diversity. In particular, it provides temporal annotations at both the action and sub-action levels, with a three-level semantic hierarchy. For example, a "balance beam" event will be annotated as a sequence of elementary sub-actions derived from five sets: "leap-jump-hop", "beam-turns", "flight-salto", "flight-handspring", and "dismounts", where the sub-actions in each set will be further annotated with finely defined class labels. This new level of granularity presents significant challenges for action recognition, e.g., how to parse the temporal structure of a coherent action, and how to distinguish between subtly different action classes. We systematically investigate representative methods on this dataset and obtain a number of interesting findings. We hope this dataset could advance research towards action understanding.
Dataset hierarchy
FineGym organizes both semantic and temporal annotations hierarchically. The upper part shows the three levels of categorical labels, i.e., events (e.g., balance beam), sets (e.g., dismounts), and elements (e.g., salto forward tucked). The lower part depicts the two levels of temporal annotations, i.e., the temporal boundaries of actions (top bar) and of sub-action instances (bottom bar).
Sub-action examples
We present several examples of fine-grained sub-action instances. Each group contains instances from three element categories of the same event (BB, FX, UB, and VT). As can be seen, such fine-grained instances exhibit subtle and challenging differences. (Hover over a GIF for a 0.25x slow-down.)
Empirical Studies and Analysis
(1) Element-level action recognition poses great challenges to existing methods.
Element-level action recognition results of representative methods.
(2) Sparse sampling is insufficient for fine-grained action recognition.
Performance of TSN when varying the number of sampled frames during training.
(3) How important is temporal information?
(a) Motion features (e.g., optical flow) capture the temporal dynamics across frames, leading to better performance of TSN.
(b) Temporal dynamics play an important role in FineGym, and TRN can capture them.
(c) The performance of TSM drops dramatically when the number of testing frames differs greatly from the number of training frames, while TSN retains its performance since it applies only temporal average pooling.
(a) Per-class performance of TSN with motion and appearance features on 6 element categories.
(b) Performance of TRN on the UB-circle set with ordered or shuffled testing frames.
(c) Mean class accuracy of TSM and TSN on Gym99 when trained with 3 frames and tested with more frames.
(4) Does pre-training on large-scale video datasets help?
Pre-training on Kinetics does not always help on FineGym. A potential reason is the large gap in temporal patterns between coarse-grained and fine-grained actions.
Per-class performance of I3D pre-trained on Kinetics and on ImageNet across different element categories.
(5) Why doesn't pose information help?
The skeleton-based ST-GCN struggles due to the challenges of skeleton estimation on gymnastics instances.
Results of person detection and pose estimation with AlphaPose on a vault action. As can be seen, detection and pose estimation of the gymnast are missed in multiple frames, especially those with intense motion, and these frames are crucial for fine-grained recognition. (Hover over the GIF for a 0.25x slow-down.)
Updates
[23/07/2020] We have made pre-extracted features available on GitHub. Check them out here.
[16/04/2020] We fixed a small issue in the naming of the subaction identifier "A_{ZZZZ}_{WWWW}" to avoid ambiguity.
(Thanks to Haodong Duan for pointing this out.)
[16/04/2020] We added new subsections to track updates and address FAQs.
FAQs
Q0: What is the license?
A0: The annotations of FineGym are copyrighted by us and published under the Creative Commons Attribution-NonCommercial 4.0 International License.
Q1: Some links are no longer valid on YouTube. How can I obtain the missing videos?
Q1': I am located in mainland China and cannot access YouTube. How can I get the dataset?
A1: Please submit the Google form at this link. We will get back to you shortly.
Q2: Are the event-/element-level instances in your dataset cut at integral seconds?
A2: No. Instances at all levels (actions and sub-actions) are annotated with exact timestamps (in milliseconds) in pursuit of frame-level precision.
The numbers in the identifiers are derived from integral seconds only for conciseness.
Please refer to the instructions below for details.
Q3: What is the difference between the Mean and Top-1 accuracy in Tables 2 & 3?
A3: The Top-K accuracy is the fraction of instances whose correct label falls within the top-k most confident predictions; in our case we take K=1.
The mean accuracy is the averaged per-class accuracy. To be specific, we compute the top-1 accuracy A_i of each class i; the mean accuracy is then the arithmetic mean of A_{1...N}, i.e. (A_1 + A_2 + ... + A_N)/N, where N is the number of classes.
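For concreteness, here is a minimal NumPy sketch of the two metrics; the function name and array layout are our own illustration, not part of any released evaluation code:

```python
import numpy as np

def top1_and_mean_accuracy(scores, labels, num_classes):
    """scores: (num_instances, num_classes) confidence scores;
    labels: (num_instances,) ground-truth class indices."""
    correct = scores.argmax(axis=1) == labels
    # Top-1 accuracy: fraction of instances whose most confident
    # prediction is the correct label.
    top1 = correct.mean()
    # Mean accuracy: arithmetic mean of the per-class top-1
    # accuracies A_1 ... A_N over the N classes.
    per_class = [correct[labels == i].mean()
                 for i in range(num_classes) if (labels == i).any()]
    return top1, float(np.mean(per_class))
```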
How to read the temporal annotation files (JSON)?
Below, we show an example entry from the above JSON annotation file:
"0LtLS9wROrk": { "E_002407_002435": { "event": 4, "segments": { "A_0003_0005": { "stages": 1, "timestamps": [ [ 3.45, 5.64 ] ] }, "A_0006_0008": { ... }, "A_0023_0028": { ... }, ... }, "timestamps": [ [ 2407.32, 2435.28 ] ] }, "E_002681_002688": { "event": 1, "segments": { "A_0000_0006": { "stages": 3, "timestamps": [ [ 0.04, 3.2 ], [ 3.2, 4.49 ], [ 4.49, 6.57 ] ] } }, "timestamps": [ [ 2681.88, 2688.48 ] ] }, "E_002710_002737": { ... }, ... }
The example shows the annotations associated with this video.
First of all, we assign the unique identifier "0LtLS9wROrk" to the video, which corresponds to its 11-character YouTube identifier.
It contains all action (event-level) instances, whose names follow the format "E_{XXXXXX}_{YYYYYY}".
Here, "E" indicates "Event", and "XXXXXX"/"YYYYYY" indicate the zero-padded starting and ending timestamps (in seconds, truncated to integers).
Each action instance includes (1) the exact timestamps in the original video ('timestamps', in seconds),
(2) the event label ('event'), and
(3) a list of annotated subaction (element-level) instances ('segments').
The annotated subaction instances follow the format "A_{ZZZZ}_{WWWW}".
Here, "A" indicates "subAction", and "ZZZZ"/"WWWW" indicate the zero-padded starting and ending timestamps (in seconds, truncated to integers, relative to the event start).
Each subaction instance includes (1) the number of stages of this subaction instance ('stages', 3 for Vault and 1 for other events), and
(2) the exact timestamps of each stage, relative to the starting time of the event ('timestamps', in seconds).
As a result, each subaction instance has the unique identifier "{VIDEO_ID}_E_{XXXXXX}_{YYYYYY}_A_{ZZZZ}_{WWWW}".
This identifier serves as the instance name in the train/val splits of Gym99 and Gym288.
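Putting this together, a minimal Python sketch for traversing the temporal annotation file might look as follows (the file name is a placeholder for the downloaded annotation file; the structure follows the example above):

```python
import json

with open("finegym_annotation.json") as f:  # placeholder file name
    annotations = json.load(f)

for video_id, events in annotations.items():             # e.g. "0LtLS9wROrk"
    for event_id, event in events.items():               # e.g. "E_002407_002435"
        event_label = event["event"]                     # event-level class label
        start, end = event["timestamps"][0]              # seconds in the original video
        for subaction_id, seg in event.get("segments", {}).items():  # e.g. "A_0003_0005"
            num_stages = seg["stages"]                   # 3 for Vault, 1 otherwise
            stage_times = seg["timestamps"]              # per-stage [start, end], relative to the event start
            # Unique instance name used in the Gym99/Gym288 train/val splits.
            instance = f"{video_id}_{event_id}_{subaction_id}"
            print(instance, event_label, num_stages, stage_times)
```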
How to read the question annotation files (JSON)?
Below, we show an example entry from the above JSON annotation file:
"0": { "BTcode": "1111111", "questions": [ "round-off onto the springboard?", "turning entry after round-off (turning in first flight phase)?", "Facing the coming direction when handstand on vault (0.5 turn in first flight phase)?", "Body keep stretched during salto (stretched salto)?", "Salto with turn?", "Facing vault table after landing?", "Salto with 1.5 turn?" ], "code": "6.00" }, "1": { "BTcode": "1111110", "questions": [ "round-off onto the springboard?", "turning entry after round-off (turning in first flight phase)?", "Facing the coming direction when handstand on vault (0.5 turn in first flight phase)?", "Body keep stretched during salto (stretched salto)?", "Salto with turn?", "Facing vault table after landing?", "Salto with 1.5 turn?" ], "code": "5.20" }, ...
The example shows the questions associated with each class.
The identifier corresponds to the label name provided in the Gym530 category list.
Each class includes (1) a list of the questions that are asked ('questions'),
(2) a string of binary codes ('BTcode'), where 1 means 'yes' and 0 means 'no',
and (3) the original code in the official codebook.
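As a sketch, the questions and answers for each class can be read back like this (again, the file name is a placeholder; the keys follow the example above):

```python
import json

with open("gym530_questions.json") as f:  # placeholder file name
    question_anno = json.load(f)

for class_id, info in question_anno.items():  # class_id matches the Gym530 category list
    print(f"class {class_id} (official code {info['code']}):")
    # BTcode is a string of '1'/'0' bits, one per question, in order.
    for question, bit in zip(info["questions"], info["BTcode"]):
        print(f"  {question} -> {'yes' if bit == '1' else 'no'}")
```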
Cite
```bibtex
@inproceedings{shao2020finegym,
  title     = {FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding},
  author    = {Shao, Dian and Zhao, Yue and Dai, Bo and Lin, Dahua},
  booktitle = {IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2020}
}
```
Acknowledgements
We sincerely thank the outstanding annotation team for their excellent work.
This work is partially supported by SenseTime Collaborative Grant on Large-scale Multi-modality Analysis
and the General Research Funds (GRF) of Hong Kong (No. 14203518 and No. 14205719).
The template of this webpage is borrowed from Richard Zhang.
Contact
For further questions and suggestions, please contact Dian Shao (sd017@ie.cuhk.edu.hk).