R²-Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection

Shen, Shuaike; Liu, Ke; Xie, Jiaqing; Gao, Shangde; Shen, Chunhua; Liu, Ge; Crispin-Ortuzar, Mireia; Gao, Shangqi

Shuaike Shen¹*, Ke Liu³*, Jiaqing Xie⁴, Shangde Gao³,
Chunhua Shen³, Ge Liu⁵, Mireia Crispin-Ortuzar², Shangqi Gao²✉

¹Carnegie Mellon University · ²University of Cambridge · ³Zhejiang University
⁴ETH Zurich · ⁵University of Illinois Urbana-Champaign

CVPR 2026 Oral
*Equal Contribution · ✉Corresponding Author

LLM-Guided Reasoning

Anatomical planning via LLM to localize anchor organs and generate multi-scale ROI proposals.
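As a rough illustration of the multi-scale ROI idea, the sketch below expands an anchor-organ bounding box about its centre at several scale factors and clips each result to the image. The function name `multi_scale_rois`, the `(y0, x0, y1, x1)` box convention, and the scale factors are all assumptions made for this example; the page does not specify how R²-Seg constructs its ROIs or how the LLM planner outputs anchors.

```python
from typing import List, Sequence, Tuple

# Hypothetical box convention: (y0, x0, y1, x1), end-exclusive pixel coordinates.
Box = Tuple[int, int, int, int]


def multi_scale_rois(anchor: Box, image_shape: Tuple[int, int],
                     scales: Sequence[float] = (1.0, 1.5, 2.0)) -> List[Box]:
    """Expand an anchor box about its centre at each scale, clipped to the image."""
    y0, x0, y1, x1 = anchor
    height, width = image_shape
    cy, cx = (y0 + y1) / 2.0, (x0 + x1) / 2.0
    half_h, half_w = (y1 - y0) / 2.0, (x1 - x0) / 2.0
    rois: List[Box] = []
    for s in scales:
        roi = (
            max(0, int(round(cy - s * half_h))),
            max(0, int(round(cx - s * half_w))),
            min(height, int(round(cy + s * half_h))),
            min(width, int(round(cx + s * half_w))),
        )
        if roi not in rois:  # clipping can make two scales collapse to the same box
            rois.append(roi)
    return rois


# Example: an anchor box in a 128 x 128 slice (values chosen for illustration only).
for roi in multi_scale_rois((40, 60, 80, 120), (128, 128)):
    print(roi)
```

Running a frozen segmenter on each ROI, rather than on the full image, is one way a planner's anatomical prior could narrow the search region; the exact mechanism used by the authors may differ.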

Statistical Rejection

GPU-backed MMD permutation tests with FDR control to suppress false-positive candidates.
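To make the two-sample test concrete, here is a minimal sketch of an MMD permutation test with a Gaussian kernel. It runs in plain Python on the CPU for portability, whereas the page describes a GPU-backed version. The kernel choice, the `bandwidth` value, the number of permutations, and the use of generic feature vectors are assumptions for illustration, not details confirmed by the source.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def rbf_kernel_matrix(points: List[Vector], bandwidth: float) -> List[List[float]]:
    """Gaussian kernel k(a, b) = exp(-||a - b||^2 / (2 * bandwidth^2)) for all pairs."""
    n = len(points)
    gram = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            sq = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            gram[i][j] = gram[j][i] = math.exp(-sq / (2.0 * bandwidth ** 2))
    return gram


def mmd2(gram: List[List[float]], idx_x: List[int], idx_y: List[int]) -> float:
    """Biased (V-statistic) estimate of squared MMD from a precomputed Gram matrix."""
    def mean_k(rows: List[int], cols: List[int]) -> float:
        return sum(gram[i][j] for i in rows for j in cols) / (len(rows) * len(cols))
    return mean_k(idx_x, idx_x) + mean_k(idx_y, idx_y) - 2.0 * mean_k(idx_x, idx_y)


def mmd_permutation_pvalue(x: List[Vector], y: List[Vector], bandwidth: float = 1.0,
                           n_perm: int = 200, seed: int = 0) -> float:
    """p-value for H0: x and y come from the same distribution."""
    rng = random.Random(seed)
    pooled = list(x) + list(y)
    gram = rbf_kernel_matrix(pooled, bandwidth)  # computed once, reused per permutation
    indices = list(range(len(pooled)))
    n_x = len(x)
    observed = mmd2(gram, indices[:n_x], indices[n_x:])
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(indices)  # random relabelling of the pooled samples
        if mmd2(gram, indices[:n_x], indices[n_x:]) >= observed:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)  # add-one correction keeps p > 0


# Toy data: "normal tissue" features vs. a clearly shifted candidate (synthetic).
gen = random.Random(1)
normal = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(20)]
candidate = [(gen.gauss(3, 1), gen.gauss(3, 1)) for _ in range(20)]
print(mmd_permutation_pvalue(normal, candidate))
```

A small p-value indicates that the candidate's features differ from the reference sample, which is the condition under which a candidate would be retained rather than rejected.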

Training-Free

Zero parameter updates, so the method is compatible with test-time augmentation and avoids catastrophic forgetting.

Abstract

Foundation models for medical image segmentation struggle under out-of-distribution (OOD) shifts, often producing fragmented false positives on OOD tumors. We introduce R²-Seg, a training-free framework for robust OOD tumor segmentation that operates via a two-stage Reason-and-Reject process. First, the Reason step employs an LLM-guided anatomical reasoning planner to localize organ anchors and generate multi-scale ROIs. Second, the Reject step applies two-sample statistical testing to candidates generated by a frozen foundation model (BiomedParse) within these ROIs. This statistical rejection filter retains only candidates significantly different from normal tissue, effectively suppressing false positives. Our framework requires no parameter updates, making it compatible with zero-update test-time augmentation and avoiding catastrophic forgetting. On multi-center and multi-modal tumor segmentation benchmarks, R²-Seg substantially improves Dice, specificity, and sensitivity over strong baselines and the original foundation models.

OOD Test-Time Adaptation

When vision embeddings are well separated, the model can distinguish foreground from background by aligning text embeddings with a single decision boundary (left).

In medical imaging, however, protocols vary across scanners, tumor sites, and modalities. Vision embeddings for OOD samples become poorly separated, biasing the decision boundary so that background structures are misclassified as tumors. The result is a high false-positive rate and potentially harmful over-diagnosis (right).

In-distribution vs out-of-distribution embedding separation

R²-Seg Pipeline

Top row: LLM-based segmentation planning and ROI construction. Middle row: BiomedParse-based tumor segmentation and candidate extraction. Bottom row: Statistical two-sample test and false discovery rate control.
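The final stage tests many candidates at once, so some correction for multiple comparisons is needed. The page states that false discovery rate control is applied but does not name the procedure; the sketch below assumes the standard Benjamini-Hochberg step-up rule, and the function name, `alpha` level, and p-values are illustrative only.

```python
from typing import List, Sequence


def benjamini_hochberg(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which hypotheses are rejected under the Benjamini-Hochberg step-up rule.

    True means the candidate is declared significantly different from the
    reference sample (and would be kept); False means it is suppressed.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k with p_(k) <= alpha * k / m defines the rejection set.
        if p_values[i] <= alpha * rank / m:
            cutoff_rank = rank
    keep = [False] * m
    for i in order[:cutoff_rank]:
        keep[i] = True
    return keep


# Hypothetical p-values, one per candidate region.
candidate_p = [0.001, 0.30, 0.012, 0.04, 0.75]
print(benjamini_hochberg(candidate_p, alpha=0.05))
```

Note that the candidate with p = 0.04 is below 0.05 yet is not kept here: its rank-adjusted threshold is 0.05 × 3 / 5 = 0.03, which shows how the correction is stricter than testing each candidate in isolation.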

R²-Seg pipeline overview showing the Reason and Reject stages

Experiments

Visualization of segmentation results for both in-distribution and out-of-distribution tumor types.

Segmentation results for ID and OOD tumors

BibTeX

@misc{shen2025r2segtrainingfreeoodmedical,
      title={R$^{2}$Seg: Training-Free OOD Medical Tumor Segmentation via Anatomical Reasoning and Statistical Rejection},
      author={Shuaike Shen and Ke Liu and Jiaqing Xie and Shangde Gao and Chunhua Shen and Ge Liu and Mireia Crispin-Ortuzar and Shangqi Gao},
      year={2025},
      eprint={2511.12691},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2511.12691},
}
