ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking

1 National University of Singapore
2 Southern University of Science and Technology
3 University of Oxford

ReSurgSAM2 online demo — hands-free, text-driven referring video segmentation (61.2 FPS).

Abstract

Surgical scene segmentation is critical in computer-assisted surgery and is vital for enhancing surgical quality and patient outcomes. Recently, referring surgical segmentation has emerged, offering surgeons an interactive way to segment the target object. However, existing methods are limited by low efficiency and short-term tracking, hindering their applicability in complex real-world surgical scenarios. In this paper, we introduce ReSurgSAM2, a two-stage surgical referring segmentation framework that leverages Segment Anything Model 2 to perform text-referred target detection, followed by tracking with reliable initial frame identification and diversity-driven long-term memory. For the detection stage, we propose a cross-modal spatial-temporal Mamba to generate precise detection and segmentation results. Based on these results, our credible initial frame selection strategy identifies a reliable frame for the subsequent tracking. Upon selecting the initial frame, our method transitions to the tracking stage, where it incorporates a diversity-driven memory mechanism that maintains a credible and diverse memory bank, ensuring consistent long-term tracking. Extensive experiments demonstrate that ReSurgSAM2 achieves substantial improvements in accuracy and efficiency compared to existing methods, operating in real time at 61.2 FPS. Our code and datasets are available at https://github.com/jinlab-imvr/ReSurgSAM2.

Proposed Method

Overview of ReSurgSAM2.

ReSurgSAM2 is a two-stage framework for accurate, real-time referring segmentation with robust long-term tracking in surgical videos. It integrates text grounding with credible initialization and diversity-driven memory to sustain performance over long surgical procedures.

Two-stage design. Stage-1 performs text-referred target detection and selects a credible initial frame. Stage-2 switches to long-term tracking with a diversity-driven memory on top of SAM2.
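The two-stage switch amounts to a simple control loop: run text-referred detection until a credible frame appears, then hand over to the tracker. The sketch below is a minimal illustration under our own assumptions; detect_with_text, tracker, and the threshold tau are hypothetical stand-ins, not the released ReSurgSAM2 API.

# Minimal sketch of the two-stage control flow (hypothetical helpers,
# not the released ReSurgSAM2 API).
def run_referring_segmentation(frames, text, detect_with_text, tracker, tau=0.8):
    """Stage-1: detect the text-referred target until a credible frame is
    found; Stage-2: switch to memory-based tracking. tau is an assumed
    credibility threshold, not the paper's value."""
    masks, tracking = [], False
    for frame in frames:
        if not tracking:
            mask, score = detect_with_text(frame, text)  # Stage-1: detection
            if score >= tau:                             # credible initial frame
                tracker.initialize(frame, mask)          # hand over to Stage-2
                tracking = True
        else:
            mask = tracker.track(frame)                  # Stage-2: tracking
        masks.append(mask)
    return masks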

Stage-1: Cross-modal Spatial-Temporal Mamba (CSTMamba) + Credible Initial Frame Selection (CIFS). CSTMamba fuses the text reference with spatial-temporal visual features to produce precise detection and segmentation results; CIFS then scores these predictions to identify a reliable frame for initializing the tracker.
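The following CIFS sketch is only a rough stand-in for the paper's strategy: we assume a per-frame credibility score derived from mask confidence, and require it to stay high for a few consecutive frames before initialization. credibility_score, tau, and k are our assumptions, not the published criterion.

import numpy as np

def credibility_score(mask_probs):
    """Assumed per-frame credibility: mean confidence over predicted
    foreground pixels of the soft mask."""
    fg = mask_probs[mask_probs > 0.5]
    return float(fg.mean()) if fg.size else 0.0

def select_initial_frame(per_frame_probs, tau=0.8, k=3):
    """Pick the first frame whose credibility stays above tau for k
    consecutive frames (tau and k are illustrative values)."""
    streak = 0
    for t, probs in enumerate(per_frame_probs):
        streak = streak + 1 if credibility_score(probs) >= tau else 0
        if streak >= k:
            return t  # credible initial frame for Stage-2 tracking
    return None       # stay in detection mode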

Stage-2: Diversity-driven Long-term Memory (DLM). After initialization, tracking maintains a memory bank that admits only credible and mutually diverse entries, ensuring consistent long-term tracking.
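A diversity-driven memory bank can be approximated as below. The cosine-similarity admission rule, the thresholds, and the oldest-first eviction are illustrative assumptions; the paper's DLM mechanism may differ.

import numpy as np

class DiverseMemoryBank:
    """Sketch of a credible-and-diverse memory bank: a frame embedding is
    admitted only if its prediction is confident and it is sufficiently
    dissimilar from every stored entry (thresholds are illustrative)."""

    def __init__(self, capacity=8, conf_thresh=0.8, sim_thresh=0.9):
        self.capacity = capacity
        self.conf_thresh = conf_thresh
        self.sim_thresh = sim_thresh
        self.entries = []  # list of L2-normalized feature vectors

    def maybe_add(self, feat, confidence):
        feat = feat / (np.linalg.norm(feat) + 1e-8)
        if confidence < self.conf_thresh:
            return False  # not credible enough to memorize
        if any(float(feat @ e) > self.sim_thresh for e in self.entries):
            return False  # too similar to an existing entry: no diversity gain
        if len(self.entries) >= self.capacity:
            self.entries.pop(0)  # simplistic oldest-first eviction
        self.entries.append(feat)
        return True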

Results

Datasets. We evaluate on Ref-EndoVis17 (instruments) and Ref-EndoVis18 (instruments and tissues).

Metrics. We use J for region similarity, F for boundary accuracy; J&F denotes their mean, and FPS measures runtime speed.
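J and F follow the standard DAVIS-style definitions: J is the intersection-over-union between predicted and ground-truth masks, and F is the F-measure between their boundaries. A minimal NumPy version for reference (the boundary extraction here is a simple 4-neighbor approximation, not the official evaluation code, which tolerates small boundary offsets):

import numpy as np

def region_similarity_J(pred, gt):
    """J: intersection over union between binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 1.0

def _boundary(mask):
    """Mask pixels whose 4-neighborhood leaves the mask (approximation)."""
    m = mask.astype(bool)
    pad = np.pad(m, 1, mode="constant")
    interior = pad[:-2, 1:-1] & pad[2:, 1:-1] & pad[1:-1, :-2] & pad[1:-1, 2:]
    return m & ~interior

def boundary_accuracy_F(pred, gt):
    """F: F-measure between predicted and ground-truth boundary pixels."""
    bp, bg = _boundary(pred), _boundary(gt)
    tp = (bp & bg).sum()
    precision = tp / bp.sum() if bp.sum() else 1.0
    recall = tp / bg.sum() if bg.sum() else 1.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# J&F is simply the mean of the two scores:
# jf = 0.5 * (region_similarity_J(pred, gt) + boundary_accuracy_F(pred, gt))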

Qualitative analysis. ReSurgSAM2 excels in complex surgical scenes, delivering precise segmentation and stable tracking even under occlusions and rapid movements.

Quantitative results on Ref-EndoVis17/18. ReSurgSAM2 delivers state-of-the-art J&F while running online in real time at 61.2 FPS.

Poster

BibTeX

@inproceedings{resurgsam2,
  title={ReSurgSAM2: Referring Segment Anything in Surgical Video via Credible Long-term Tracking},
  author={Haofeng Liu and Mingqi Gao and Xuxiao Luo and Ziyue Wang and Guanyi Qin and Junde Wu and Yueming Jin},
  booktitle={International Conference on Medical Image Computing and Computer-Assisted Intervention},
  year={2025},
}