TY - GEN
T1 - Hyperspectral Image Classification Based on Multi-stage Vision Transformer with Stacked Samples
AU - Chen, Xiaoyue
AU - Kamata, Sei Ichiro
AU - Zhou, Weilian
N1 - Publisher Copyright:
© 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - Hyperspectral image classification (HSIC) is the task of assigning the correct label to each pixel. It is an active topic in remote sensing and has been addressed with several deep learning methods. Recent works apply Vision Transformer (ViT) methods to HSIC, but their performance falls short of some CNN-based methods, given that ViT uses attention to capture global information but ignores local characteristics. This paper proposes a multi-stage Vision Transformer model that follows the feature extraction structure of CNNs, and the results show its feasibility and reliability. Experiments also show that the modified ViT structure needs more samples for training. An innovative data augmentation method is therefore used to generate extended samples with virtual yet reliable labels. The generated samples are combined with the original ones to form stacked samples, which are used in the subsequent feature extraction process. Experiments show that the multi-stage Vision Transformer structure with stacked samples improves accuracy compared with other methods.
AB - Hyperspectral image classification (HSIC) is the task of assigning the correct label to each pixel. It is an active topic in remote sensing and has been addressed with several deep learning methods. Recent works apply Vision Transformer (ViT) methods to HSIC, but their performance falls short of some CNN-based methods, given that ViT uses attention to capture global information but ignores local characteristics. This paper proposes a multi-stage Vision Transformer model that follows the feature extraction structure of CNNs, and the results show its feasibility and reliability. Experiments also show that the modified ViT structure needs more samples for training. An innovative data augmentation method is therefore used to generate extended samples with virtual yet reliable labels. The generated samples are combined with the original ones to form stacked samples, which are used in the subsequent feature extraction process. Experiments show that the multi-stage Vision Transformer structure with stacked samples improves accuracy compared with other methods.
KW - Hyperspectral image classification
KW - Vision Transformer
KW - data augmentation
KW - deep learning
KW - image processing
UR - http://www.scopus.com/inward/record.url?scp=85125966551&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85125966551&partnerID=8YFLogxK
U2 - 10.1109/TENCON54134.2021.9707289
DO - 10.1109/TENCON54134.2021.9707289
M3 - Conference contribution
AN - SCOPUS:85125966551
T3 - IEEE Region 10 Annual International Conference, Proceedings/TENCON
SP - 441
EP - 446
BT - TENCON 2021 - 2021 IEEE Region 10 Conference
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE Region 10 Conference, TENCON 2021
Y2 - 7 December 2021 through 10 December 2021
ER -