TY - GEN
T1 - Vision-Touch Fusion for Predicting Grasping Stability Using Self Attention and Past Visual Images
AU - Yan, Gang
AU - Qin, Zhida
AU - Funabashi, Satoshi
AU - Schmitz, Alexander
AU - Tomo, Tito Pradhono
AU - Somlor, Sophon
AU - Jamone, Lorenzo
AU - Sugano, Shigeki
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Predicting grasp stability before lifting an object, that is, whether a gripped object will move with respect to the gripper, gives more time to correct unstable grasps than after-lift slip detection. Recently, deep learning relying on visual and tactile information has become increasingly popular. However, how to combine visual and tactile data effectively is still an open question. In this paper, we propose to fuse visual and tactile data by introducing self-attention (SA) mechanisms for predicting grasp stability. In our experiments, we use uSkin tactile sensors and a Spresense camera sensor. An image of the object that is not collected immediately before or during grasping is used, as such an image might be more readily available. The dataset is collected by grasping and lifting 35 daily objects 1050 times in total with various forces and grasping positions. As a result, the prediction accuracy improves by over 9% compared to previous attention-based visual-tactile fusion research. Furthermore, our analysis reveals that the introduction of self-attention mechanisms enables more effective and widespread feature extraction for both visual and tactile data.
AB - Predicting grasp stability before lifting an object, that is, whether a gripped object will move with respect to the gripper, gives more time to correct unstable grasps than after-lift slip detection. Recently, deep learning relying on visual and tactile information has become increasingly popular. However, how to combine visual and tactile data effectively is still an open question. In this paper, we propose to fuse visual and tactile data by introducing self-attention (SA) mechanisms for predicting grasp stability. In our experiments, we use uSkin tactile sensors and a Spresense camera sensor. An image of the object that is not collected immediately before or during grasping is used, as such an image might be more readily available. The dataset is collected by grasping and lifting 35 daily objects 1050 times in total with various forces and grasping positions. As a result, the prediction accuracy improves by over 9% compared to previous attention-based visual-tactile fusion research. Furthermore, our analysis reveals that the introduction of self-attention mechanisms enables more effective and widespread feature extraction for both visual and tactile data.
UR - http://www.scopus.com/inward/record.url?scp=85182943387&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85182943387&partnerID=8YFLogxK
U2 - 10.1109/ICDL55364.2023.10364461
DO - 10.1109/ICDL55364.2023.10364461
M3 - Conference contribution
AN - SCOPUS:85182943387
T3 - 2023 IEEE International Conference on Development and Learning, ICDL 2023
SP - 339
EP - 345
BT - 2023 IEEE International Conference on Development and Learning, ICDL 2023
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2023 IEEE International Conference on Development and Learning, ICDL 2023
Y2 - 9 November 2023 through 11 November 2023
ER -
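
A minimal sketch of self-attention-based vision-touch fusion for grasp-stability prediction, of the kind described in the abstract above. The feature dimensions, token layout, projection layers, and two-class head are illustrative assumptions rather than the architecture reported in the paper; PyTorch's nn.MultiheadAttention stands in for the self-attention (SA) mechanism.

# Illustrative sketch only: fusing visual and tactile features with self-attention.
# All sizes below (512-d image features, 48-d tactile readings, 2-class head) are
# assumptions for illustration, not the authors' actual network.
import torch
import torch.nn as nn


class VisionTouchSAFusion(nn.Module):
    def __init__(self, dim=128, heads=4):
        super().__init__()
        # Project each modality into a shared embedding space (assumed sizes).
        self.visual_proj = nn.Linear(512, dim)   # e.g. flattened CNN image features
        self.tactile_proj = nn.Linear(48, dim)   # e.g. one tactile-array reading per token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.classifier = nn.Linear(dim, 2)      # stable vs. unstable grasp

    def forward(self, visual_feat, tactile_feat):
        # visual_feat: (B, 512); tactile_feat: (B, T, 48) with T tactile tokens.
        tokens = torch.cat(
            [self.visual_proj(visual_feat).unsqueeze(1), self.tactile_proj(tactile_feat)],
            dim=1,
        )                                              # (B, 1 + T, dim)
        fused, _ = self.attn(tokens, tokens, tokens)   # self-attention across both modalities
        return self.classifier(fused.mean(dim=1))      # pooled logits: (B, 2)


# Usage with random tensors (batch of 8, 3 tactile tokens):
model = VisionTouchSAFusion()
logits = model(torch.randn(8, 512), torch.randn(8, 3, 48))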