EfficientSAM: Leveraged Masked Image Pretraining for Efficient Segment Anything
Abstract
Segment Anything Model (SAM) has emerged as a powerful tool for numerous vision applications. A key component that drives its impressive zero-shot transfer performance and high versatility is a super large Transformer model trained on the extensive, high-quality SA-1B dataset. While beneficial, the huge computation cost of the SAM model has limited its use in wider real-world applications. To address this limitation, we propose EfficientSAMs, lightweight SAM models that exhibit decent performance with largely reduced complexity. Our idea is based on leveraging masked image pretraining, SAMI, which learns to reconstruct features from the SAM image encoder for effective visual representation learning. Further, we take SAMI-pretrained lightweight image encoders and a mask decoder to build EfficientSAMs, and finetune the models on SA-1B for the segment anything task. We perform evaluations on multiple vision tasks including image classification, object detection, instance segmentation, and semantic segmentation, and find that our proposed pretraining method, SAMI, consistently outperforms other masked image pretraining methods. On segment anything tasks such as zero-shot instance segmentation, our EfficientSAMs with SAMI-pretrained lightweight image encoders perform favorably with a significant gain (e.g., ~4 AP on COCO/LVIS) over other fast SAM models. Our EfficientSAM code and models are available here.
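The feature-reconstruction idea behind SAMI can be illustrated with a toy sketch. This is not the authors' implementation: the helper names `random_patch_mask` and `reconstruction_loss`, the mean-squared-error objective, the mask ratio, and the option of scoring only masked patches are all assumptions made for illustration. The abstract states only that a lightweight encoder learns to reconstruct features from the SAM image encoder.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return one boolean per patch; True marks a patch hidden from the student.
    (Hypothetical helper; the mask ratio is an illustrative assumption.)"""
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]

def reconstruction_loss(student_feats, teacher_feats, mask=None):
    """Mean squared error between student and teacher per-patch features.
    If a mask is given, only masked patches contribute (an assumption)."""
    total, count = 0.0, 0
    for i, (s, t) in enumerate(zip(student_feats, teacher_feats)):
        if mask is not None and not mask[i]:
            continue
        total += sum((a - b) ** 2 for a, b in zip(s, t))
        count += len(t)
    return total / count if count else 0.0

# Toy data: 4 patches with 2-dimensional features each.
# "teacher" stands in for features from the SAM image encoder,
# "student" for the lightweight encoder's reconstruction.
teacher = [[1.0, 2.0], [0.0, 1.0], [3.0, 3.0], [2.0, 0.0]]
student = [[1.0, 2.0], [0.5, 1.0], [3.0, 2.0], [2.0, 0.0]]

mask = random_patch_mask(len(teacher), mask_ratio=0.5)
loss_all = reconstruction_loss(student, teacher)
loss_masked = reconstruction_loss(student, teacher, mask)
print(loss_all, loss_masked)
```

In a real training setup the teacher features would come from a frozen SAM image encoder and the loss would be minimized by gradient descent over the student's parameters; the pure-Python lists here only show how the objective is formed.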