# SkySense++ (Few-Shot Checkpoint)

SkySense++ foundation model checkpoint for few-shot inference, converted from `skysensepp_release.ckpt` to Hugging Face format.

A multi-modal remote sensing model that fuses high-resolution optical (HR), Sentinel-2 (S2), and Sentinel-1 SAR (S1) imagery through modality-specific backbones, a modality-completion VAE, and a fusion encoder.

## Model Metadata

| Attribute | Value |
|---|---|
| Model type | Multi-modal segmentation (HR + S2 + S1) |
| Use case | Few-shot inference, representation extraction |
| Paper | SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation |
| License | Apache-2.0 |
| Input modalities | High-resolution optical, Sentinel-2, Sentinel-1 |
| HR input size | 512×512 |
| S2/S1 patch size | 16×16 |

## Repository Structure

```
.
├── config.json
├── model.safetensors             # Main weights (~6.4 GB)
├── modality_vae/                 # VAE (ConvVQVAEv2 legacy)
│   ├── config.json
│   └── diffusion_pytorch_model.safetensors
├── modeling_skysensepp.py
├── configuration_skysensepp.py
├── pipeline_skysensepp.py
└── sky_sensepp_impl/             # ModalityCompletionVAE, necks
```
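When downloading the checkpoint manually, it can help to confirm the layout above is complete before loading. A minimal sketch in plain Python; `missing_files` is our own helper, not part of the repo:

```python
from pathlib import Path

# Files the checkpoint directory should contain, per the tree above.
EXPECTED_FILES = [
    "config.json",
    "model.safetensors",
    "modality_vae/config.json",
    "modality_vae/diffusion_pytorch_model.safetensors",
    "modeling_skysensepp.py",
    "configuration_skysensepp.py",
    "pipeline_skysensepp.py",
]

def missing_files(checkpoint_dir):
    """Return the expected files that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [f for f in EXPECTED_FILES if not (root / f).is_file()]

# Example: missing_files("path/to/SkySensepp-fewshot") -> [] when complete
```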

## Installation

```bash
pip install transformers torch safetensors diffusers
```

## Usage

### Load from Hugging Face

```python
from transformers import AutoModel

# Replace with your HF model ID, e.g. username/SkySensepp-fewshot
model = AutoModel.from_pretrained("path/to/SkySensepp-fewshot", trust_remote_code=True)
model = model.eval().to("cuda")
```

### Few-Shot Inference (SkySensePlusPlus repo)

```bash
# Clone SkySensePlusPlus and run 1-shot evaluation
bash tools/run_1shot.sh <gpu_idx> flood3i
```

Point your config's `model_path` at this checkpoint, or load it via `AutoModel.from_pretrained()` in your predictor.
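On top of extracted features, the 1-shot idea can be sketched as a prototype match: average the support image's masked features into a class prototype, then score each query location by cosine similarity. This is a generic prototype-based illustration under our own assumptions, not the repo's exact predictor (`one_shot_logits` is a hypothetical helper):

```python
import torch
import torch.nn.functional as F

def one_shot_logits(support_feats, support_mask, query_feats):
    """Prototype-based 1-shot scoring.

    support_feats, query_feats: (1, C, H, W) feature maps
    support_mask: (1, 1, H, W) binary mask of the target class
    Returns (1, H, W) cosine-similarity scores for the query.
    """
    masked = support_feats * support_mask
    # Masked average pooling -> class prototype of shape (1, C)
    prototype = masked.sum(dim=(2, 3)) / support_mask.sum(dim=(2, 3)).clamp(min=1)
    # Cosine similarity between each query location and the prototype
    return F.cosine_similarity(query_feats, prototype[:, :, None, None], dim=1)

logits = one_shot_logits(
    torch.randn(1, 64, 32, 32),
    (torch.rand(1, 1, 32, 32) > 0.5).float(),
    torch.randn(1, 64, 32, 32),
)
print(logits.shape)  # torch.Size([1, 32, 32])
```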

### Feature Extraction

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained("path/to/SkySensepp-fewshot", trust_remote_code=True)
model = model.eval().to("cuda")

hr_img = torch.randn(1, 3, 512, 512, device="cuda")
s2_img = torch.randn(1, 10, 2, 256, 256, device="cuda")
s1_img = torch.randn(1, 2, 2, 256, 256, device="cuda")
modalities = torch.ones(1, 3, dtype=torch.bool, device="cuda")

with torch.no_grad():
    out = model(
        hr_img=hr_img,
        s2_img=s2_img,
        s1_img=s1_img,
        modality_flag_hr=modalities[:, :1],
        modality_flag_s2=modalities[:, 1:2],
        modality_flag_s1=modalities[:, 2:],
        return_features=True,
    )

features_fusion = out["features_fusion"]  # (B, 1024, H, W)
```
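The fused map can then be reduced to one embedding per image, e.g. for retrieval or building few-shot prototypes. A minimal sketch assuming only the `(B, 1024, H, W)` output shape above; the pooling choice is ours, not prescribed by the repo:

```python
import torch

def pool_features(features_fusion: torch.Tensor) -> torch.Tensor:
    """Global-average-pool a (B, 1024, H, W) feature map to (B, 1024) embeddings."""
    return features_fusion.mean(dim=(2, 3))

# Demonstrated on a random tensor; in practice pass out["features_fusion"].
emb = pool_features(torch.randn(2, 1024, 16, 16))
print(emb.shape)  # torch.Size([2, 1024])
```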

## Input Formats

| Modality | Shape | Description |
|---|---|---|
| `hr_img` | (B, 3, H, W) | RGB; H = W = 512 typical |
| `s2_img` | (B, 10, S, H, W) | Sentinel-2, 10 bands, S time steps |
| `s1_img` | (B, 2, S, H, W) | Sentinel-1 VV/VH, S time steps |
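Before a forward pass, the shapes in the table can be sanity-checked in plain Python. A minimal sketch; `validate_inputs` is our own helper, not part of the repo:

```python
def validate_inputs(hr_shape, s2_shape, s1_shape):
    """Check shapes against the table: hr (B,3,H,W), s2 (B,10,S,H,W), s1 (B,2,S,H,W)."""
    b = hr_shape[0]
    assert len(hr_shape) == 4 and hr_shape[1] == 3, "hr_img must be (B, 3, H, W)"
    assert len(s2_shape) == 5 and tuple(s2_shape[:2]) == (b, 10), "s2_img must be (B, 10, S, H, W)"
    assert len(s1_shape) == 5 and tuple(s1_shape[:2]) == (b, 2), "s1_img must be (B, 2, S, H, W)"
    assert s2_shape[2] == s1_shape[2], "S2 and S1 should share the same number of time steps S"
    return True

validate_inputs((1, 3, 512, 512), (1, 10, 2, 256, 256), (1, 2, 2, 256, 256))  # passes
```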

## Citation

```bibtex
@article{skysensepp2025,
  title={SkySense++: A Semantic-Enhanced Multi-Modal Remote Sensing Foundation Model Beyond SkySense for Earth Observation},
  journal={Nature Machine Intelligence},
  year={2025},
  url={https://www.nature.com/articles/s42256-025-01078-8}
}
```