A Visual-Tactile Fusion Method for Estimating the Grasping Force on Flexible Objects
 
Graphical Abstract
 
Abstract
    To address the problem of manipulating flexible objects, this paper proposes a visual-tactile fusion method for estimating the grasping force, named the MultiSense Local-Enhanced Transformer (MSLET). The model learns low-dimensional features from each sensor modality, infers the physical characteristics of the grasped object, and fuses the modality-specific feature vectors to predict the grasping result. Knowledge of safe grasping practices is then used to infer the optimal grasping force. First, a Feature-to-Patch module extracts shallow features from both visual and tactile images and generates image patches from these features. Because the patches capture edge characteristics, the model can learn feature information from the data effectively and infer the physical properties of the grasped object. Second, a Local-Enhanced module strengthens local features: depth-wise separable convolution is applied to the image patches produced by the multi-head self-attention mechanism, which increases the correlation between spatially adjacent tokens and improves the prediction accuracy of grasping results. Finally, comparative experiments show that the proposed algorithm improves grasping accuracy by 10.19% over state-of-the-art models while maintaining operational efficiency, demonstrating its effectiveness in estimating the grasping force on flexible objects.
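
    As an illustration of the Local-Enhanced idea only, the following minimal pure-Python sketch applies a depth-wise separable convolution (a per-channel 3x3 convolution followed by a 1x1 point-wise convolution) to a token sequence laid out on its spatial grid. It is not the authors' implementation: the function names, the 3x3 kernel size, the zero padding, and the hand-set weights are all assumptions made for this example, and in a trained model the weights would be learned.

```python
def tokens_to_grid(tokens, height, width):
    """Reshape height*width tokens (each a list of C floats) into C feature maps."""
    channels = len(tokens[0])
    return [[[tokens[r * width + c][ch] for c in range(width)]
             for r in range(height)]
            for ch in range(channels)]


def depthwise_conv3x3(grid, kernels):
    """Apply one 3x3 kernel per channel with zero padding (no channel mixing)."""
    out = []
    for fmap, kernel in zip(grid, kernels):
        h, w = len(fmap), len(fmap[0])
        result = [[0.0] * w for _ in range(h)]
        for r in range(h):
            for c in range(w):
                acc = 0.0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if 0 <= rr < h and 0 <= cc < w:  # outside the grid counts as zero
                            acc += kernel[dr + 1][dc + 1] * fmap[rr][cc]
                result[r][c] = acc
        out.append(result)
    return out


def pointwise_conv(grid, weights):
    """1x1 convolution: weights[o][i] mixes input channel i into output channel o."""
    h, w = len(grid[0]), len(grid[0][0])
    return [[[sum(weights[o][i] * grid[i][r][c] for i in range(len(grid)))
              for c in range(w)]
             for r in range(h)]
            for o in range(len(weights))]


def local_enhance(tokens, height, width, dw_kernels, pw_weights):
    """Hypothetical helper: depth-wise separable convolution over a token grid."""
    grid = tokens_to_grid(tokens, height, width)
    grid = pointwise_conv(depthwise_conv3x3(grid, dw_kernels), pw_weights)
    # Flatten the feature maps back into a row-major token sequence.
    return [[grid[ch][r][c] for ch in range(len(grid))]
            for r in range(height) for c in range(width)]


# Usage: four 2-channel tokens on a 2x2 grid, with an all-ones kernel per channel
# and an identity point-wise mix, so each token aggregates its spatial neighbours.
tokens = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
box = [[1.0] * 3 for _ in range(3)]
enhanced = local_enhance(tokens, 2, 2, [box, box], [[1.0, 0.0], [0.0, 1.0]])
```

    In this sketch the depth-wise step is what couples each token to its spatial neighbours, while the point-wise step only mixes channels within a token. Splitting the two is why a depth-wise separable convolution is cheaper than a full convolution.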