TY - GEN
T1 - Hierarchical Unified Spectral-Spatial Aggregated Transformer for Hyperspectral Image Classification
AU - Zhou, Weilian
AU - Kamata, Sei Ichiro
AU - Luo, Zhengbo
AU - Chen, Xiaoyue
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - The Vision Transformer (ViT), with its self-attention mechanism, has recently been introduced into the computer vision (CV) field and has achieved remarkable performance. However, directly applying ViT to hyperspectral image (HSI) classification is problematic because 1) ViT is a spatial-only self-attention model, whereas HSI contains rich spectral information; 2) ViT requires sufficient training samples, whereas HSI suffers from limited samples; 3) ViT does not learn local features well; and 4) multi-scale features are not considered in ViT. Furthermore, methods that combine a convolutional neural network (CNN) with ViT generally incur a large computational burden. Hence, this paper designs a suitable pure ViT-based model for HSI classification with the following components: 1) a spectral-only vision transformer with aggregation of all tokens; 2) a spatial-only local-global transformer; 3) cross-scale local-global feature fusion; and 4) a cooperative loss function to unify the spectral and spatial features. As a result, the proposed method achieves competitive classification performance on three public datasets compared with other state-of-the-art methods.
AB - The Vision Transformer (ViT), with its self-attention mechanism, has recently been introduced into the computer vision (CV) field and has achieved remarkable performance. However, directly applying ViT to hyperspectral image (HSI) classification is problematic because 1) ViT is a spatial-only self-attention model, whereas HSI contains rich spectral information; 2) ViT requires sufficient training samples, whereas HSI suffers from limited samples; 3) ViT does not learn local features well; and 4) multi-scale features are not considered in ViT. Furthermore, methods that combine a convolutional neural network (CNN) with ViT generally incur a large computational burden. Hence, this paper designs a suitable pure ViT-based model for HSI classification with the following components: 1) a spectral-only vision transformer with aggregation of all tokens; 2) a spatial-only local-global transformer; 3) cross-scale local-global feature fusion; and 4) a cooperative loss function to unify the spectral and spatial features. As a result, the proposed method achieves competitive classification performance on three public datasets compared with other state-of-the-art methods.
UR - http://www.scopus.com/inward/record.url?scp=85143618580&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85143618580&partnerID=8YFLogxK
U2 - 10.1109/ICPR56361.2022.9956396
DO - 10.1109/ICPR56361.2022.9956396
M3 - Conference contribution
AN - SCOPUS:85143618580
T3 - Proceedings - International Conference on Pattern Recognition
SP - 3041
EP - 3047
BT - 2022 26th International Conference on Pattern Recognition, ICPR 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 26th International Conference on Pattern Recognition, ICPR 2022
Y2 - 21 August 2022 through 25 August 2022
ER -