EdgeTAM: On-Device Track Anything Model
On top of the Segment Anything Model (SAM), SAM 2 further extends its capability from image to video inputs through a memory bank mechanism and achieves remarkable performance compared with previous methods, making it a foundation model for the video segmentation task. In this paper, we aim to make SAM 2 much more efficient, so that it runs even on mobile devices while maintaining comparable performance. Despite several works optimizing SAM for better efficiency, we find they are not sufficient for SAM 2, because all of them focus on compressing the image encoder, while our benchmark shows that the newly introduced memory attention blocks are also a latency bottleneck. Given this observation, we propose EdgeTAM, which leverages a novel 2D Spatial Perceiver to reduce the computational cost. Specifically, the proposed 2D Spatial Perceiver encodes the densely stored frame-level memories with a lightweight Transformer that contains a fixed set of learnable queries.
Given that video segmentation is a dense prediction task, we find that preserving the spatial structure of the memories is crucial, so the queries are split into global-level and patch-level groups. We also propose a distillation pipeline that further improves performance without inference overhead. EdgeTAM performs competitively on DAVIS 2017, MOSE, SA-V val, and SA-V test, while running at 16 FPS on iPhone 15 Pro Max. SAM 2 extends SAM to handle both image and video inputs with a memory bank mechanism, and is trained on a new large-scale, multi-grained video tracking dataset (SA-V). Despite achieving astonishing performance compared to previous video object segmentation (VOS) models and allowing more diverse user prompts, SAM 2, as a server-side foundation model, is not efficient for on-device inference on a mobile CPU or NPU. Throughout the paper, we interchangeably use iPhone and iPhone 15 Pro Max for simplicity. Previous works that optimize SAM for better efficiency only consider squeezing its image encoder, since the mask decoder is extremely lightweight; this is not the case for SAM 2. Specifically, SAM 2 encodes past frames with a memory encoder, and these frame-level memories, along with object-level pointers (obtained from the mask decoder), serve as the memory bank.
These memories are then fused with the features of the current frame through memory attention blocks. Because the memories are densely encoded, this results in a huge matrix multiplication during the cross-attention between current-frame features and memory features. Therefore, despite containing relatively fewer parameters than the image encoder, the computational complexity of memory attention is not affordable for on-device inference. This hypothesis is further supported by Fig. 2, where reducing the number of memory attention blocks cuts the overall decoding latency almost linearly, and within each memory attention block, removing the cross-attention gives the largest speed-up. To make such a video-based tracking model run on device, EdgeTAM exploits the redundancy in videos. In practice, we propose to compress the raw frame-level memories before performing memory attention. We start with naïve spatial pooling and observe a significant performance degradation, especially when using low-capacity backbones.
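As a rough illustration of why compressing the memories helps, the sketch below (not the authors' code; shapes, names, and the pooling factor are assumptions) shows the naïve spatial-pooling baseline: cross-attention cost scales with the number of memory tokens, so pooling by a factor f cuts the query-key matmul by roughly f², at the price of the fine spatial detail whose loss causes the degradation noted above.

```python
# Minimal sketch (assumed shapes): compress frame-level memories by average
# pooling before they enter memory attention.
import torch
import torch.nn.functional as F

def pool_memory(mem: torch.Tensor, factor: int = 4) -> torch.Tensor:
    """mem: (B, C, H, W) frame-level memory features -> (B, (H*W)/factor^2, C) tokens."""
    pooled = F.avg_pool2d(mem, kernel_size=factor, stride=factor)  # (B, C, H/f, W/f)
    return pooled.flatten(2).transpose(1, 2)                       # (B, N', C)

mem = torch.randn(1, 256, 64, 64)
tokens = pool_memory(mem, factor=4)  # 4096 -> 256 memory tokens
print(tokens.shape)                  # torch.Size([1, 256, 256])
```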
However, naïvely incorporating a Perceiver also leads to a severe drop in performance. We hypothesize that, as a dense prediction task, video segmentation requires preserving the spatial structure of the memory bank, which a naïve Perceiver discards. Given these observations, we propose a novel lightweight module, named 2D Spatial Perceiver, that compresses frame-level memory feature maps while preserving their 2D spatial structure. Specifically, we split the learnable queries into two groups. One group functions like the original Perceiver: each query performs global attention over the input features and outputs a single vector as a frame-level summarization. In the other group, the queries have 2D priors, i.e., each query is responsible only for compressing a non-overlapping local patch, so the output maintains the spatial structure while reducing the total number of tokens. In addition to the architectural improvement, we further propose a distillation pipeline that transfers the knowledge of the powerful teacher, SAM 2, to our student model, improving accuracy at no cost in inference overhead.
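The following is an illustrative sketch of this two-branch query design, not the official implementation; dimensions, layer choices, and names are assumptions. The global queries attend over all memory tokens, while each patch query summarizes one non-overlapping patch so the compressed output keeps an (H/p, W/p) spatial layout.

```python
# Hypothetical sketch of a 2D Spatial Perceiver-style module.
import torch
import torch.nn as nn

class SpatialPerceiver2D(nn.Module):
    def __init__(self, dim=256, num_global=16, patch=4, heads=8):
        super().__init__()
        self.patch = patch
        self.global_q = nn.Parameter(torch.randn(num_global, dim))  # frame-level summaries
        self.patch_q = nn.Parameter(torch.randn(1, dim))             # one query per local patch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, mem):                      # mem: (B, C, H, W) frame-level memory
        B, C, H, W = mem.shape
        tokens = mem.flatten(2).transpose(1, 2)  # (B, H*W, C)

        # Global branch: a fixed set of queries attends over all memory tokens.
        gq = self.global_q.unsqueeze(0).expand(B, -1, -1)
        global_out, _ = self.attn(gq, tokens, tokens)        # (B, num_global, C)

        # Patch branch: each query only summarizes one non-overlapping patch,
        # so the output keeps a (H/p) x (W/p) spatial structure.
        p = self.patch
        patches = mem.unfold(2, p, p).unfold(3, p, p)        # (B, C, H/p, W/p, p, p)
        patches = patches.permute(0, 2, 3, 4, 5, 1).reshape(B * (H // p) * (W // p), p * p, C)
        pq = self.patch_q.unsqueeze(0).expand(patches.size(0), -1, -1)
        patch_out, _ = self.attn(pq, patches, patches)       # (B*(H/p)*(W/p), 1, C)
        patch_out = patch_out.reshape(B, (H // p) * (W // p), C)

        # Compressed memory: global summaries + spatially structured patch tokens.
        return torch.cat([global_out, patch_out], dim=1)     # (B, num_global + HW/p^2, C)
```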
We find that in both stages, aligning the features from the image encoders of the original SAM 2 and our efficient variant benefits performance. Besides, in the second stage we further align the feature output of memory attention between the teacher SAM 2 and our student model, so that in addition to the image encoder, the memory-related modules also receive supervision signals from the SAM 2 teacher. This improves performance on SA-V val and test by 1.3 and 3.3, respectively. Putting these together, we propose EdgeTAM (Track Anything Model for Edge devices), which adopts a 2D Spatial Perceiver for efficiency and knowledge distillation for accuracy. Through a comprehensive benchmark, we show that the latency bottleneck lies in the memory attention module. Given this latency analysis, we propose a 2D Spatial Perceiver that significantly cuts down the computational cost of memory attention with comparable performance, and which can be integrated with any SAM 2 variant. We experiment with a distillation pipeline that performs feature-wise alignment with the original SAM 2 in both the image and video segmentation stages, and observe performance improvements without any extra cost during inference.
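A minimal sketch of this feature-wise alignment is given below, under assumed interfaces: the teacher and student expose image-encoder features in both stages and memory-attention outputs in the video stage. The MSE loss and the optional projection layer are assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical distillation losses: align student features to a frozen teacher.
import torch.nn.functional as F

def distill_losses(student_feats, teacher_feats, proj=None):
    """student_feats / teacher_feats: dicts with keys
       'enc' - image encoder feature map (both stages)
       'mem' - memory attention output (second, video stage only)
    proj: optional module mapping student channels to teacher channels."""
    losses = {}
    s_enc = proj(student_feats['enc']) if proj is not None else student_feats['enc']
    losses['enc'] = F.mse_loss(s_enc, teacher_feats['enc'].detach())
    if 'mem' in student_feats and 'mem' in teacher_feats:
        losses['mem'] = F.mse_loss(student_feats['mem'], teacher_feats['mem'].detach())
    return losses

# Usage: total loss = segmentation task loss + weighted sum of alignment terms,
# e.g. total = task_loss + losses['enc'] + losses.get('mem', 0.0)
```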