TY - JOUR
T1 - SST
T2 - Spatial and Semantic Transformers for Multi-Label Image Recognition
AU - Chen, Zhao Min
AU - Cui, Quan
AU - Zhao, Borui
AU - Song, Renjie
AU - Zhang, Xiaoqin
AU - Yoshie, Osamu
N1 - Funding Information:
This work was supported in part by the Zhejiang Provincial Natural Science Foundation of China under Grant LQ22F020006 and in part by the National Natural Science Foundation of China under Grant 61922064, Grant U2033210, and Grant 62101387.
Publisher Copyright:
© 1992-2012 IEEE.
PY - 2022
Y1 - 2022
AB - Multi-label image recognition has attracted considerable research attention and achieved great success in recent years. Capturing label correlations is an effective way to improve the performance of multi-label image recognition. Two types of label correlations have been principally studied, i.e., spatial and semantic correlations. However, previous methods in the literature considered only one of them. In this work, inspired by the great success of the Transformer, we propose a plug-and-play module, named the Spatial and Semantic Transformers (SST), to simultaneously capture spatial and semantic correlations in multi-label images. Our proposal mainly comprises two independent transformers, which aim to capture the spatial and semantic correlations, respectively. Specifically, our Spatial Transformer is designed to model the correlations between features from different spatial positions, while the Semantic Transformer is leveraged to capture the co-existence of labels without manually defined rules. Beyond these methodological contributions, we also prove that spatial and semantic correlations complement each other and deserve to be leveraged simultaneously in multi-label image recognition. Benefiting from the Transformer's ability to capture long-range correlations, our method remarkably outperforms state-of-the-art methods on four popular multi-label benchmark datasets. In addition, extensive ablation studies and visualizations are provided to validate the essential components of our method.
KW - Multi-label image recognition
KW - label correlation
KW - transformer
UR - http://www.scopus.com/inward/record.url?scp=85126275326&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126275326&partnerID=8YFLogxK
U2 - 10.1109/TIP.2022.3148867
DO - 10.1109/TIP.2022.3148867
M3 - Article
C2 - 35275814
AN - SCOPUS:85126275326
SN - 1057-7149
VL - 31
SP - 2570
EP - 2583
JO - IEEE Transactions on Image Processing
JF - IEEE Transactions on Image Processing
ER -