GeoRSCLIP-ViT-H-14

This model is a mirror/redistribution of the original GeoRSCLIP model.

Original Repository and Links

Description

GeoRSCLIP is a vision-language foundation model for remote sensing, trained on a large-scale dataset of remote sensing image-text pairs (RS5M). It is based on the CLIP architecture and is designed to handle the unique characteristics of remote sensing imagery.

How to use

With transformers

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

# Load model and processor
model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")

# Load and process image
image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=["a photo of a building", "a photo of vegetation", "a photo of water"],
    images=image,
    return_tensors="pt",
    padding=True
)

# Get image-text similarity scores
with torch.inference_mode():
    outputs = model(**inputs)
    logits_per_image = outputs.logits_per_image
    probs = logits_per_image.softmax(dim=1)

print(f"Similarity scores: {probs}")

Zero-shot image classification:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")

# Define candidate labels
candidate_labels = [
    "a satellite image of urban area",
    "a satellite image of forest",
    "a satellite image of agricultural land",
    "a satellite image of water body"
]

image = Image.open("path/to/your/image.jpg")
inputs = processor(
    text=candidate_labels,
    images=image,
    return_tensors="pt",
    padding=True
)

with torch.inference_mode():
    outputs = model(**inputs)
    probs = outputs.logits_per_image.softmax(dim=1)

# Get the predicted label
predicted_idx = probs.argmax().item()
print(f"Predicted label: {candidate_labels[predicted_idx]}")
print(f"Confidence: {probs[0][predicted_idx]:.4f}")

Extracting individual features:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image
import torch

model = CLIPModel.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")
processor = CLIPProcessor.from_pretrained("BiliSakura/GeoRSCLIP-ViT-H-14")

# Get image features only
image = Image.open("path/to/your/image.jpg")
image_inputs = processor(images=image, return_tensors="pt")

with torch.inference_mode():
    image_features = model.get_image_features(**image_inputs)

# Get text features only
text_inputs = processor(
    text=["a satellite image of urban area"],
    return_tensors="pt",
    padding=True,
    truncation=True
)

with torch.inference_mode():
    text_features = model.get_text_features(**text_inputs)

print(f"Image features shape: {image_features.shape}")
print(f"Text features shape: {text_features.shape}")

With diffusers

This model's text encoder can be used with Stable Diffusion and other diffusion models:

from transformers import CLIPTextModel, CLIPTokenizer
import torch

# Load the text encoder and tokenizer
text_encoder = CLIPTextModel.from_pretrained(
    "BiliSakura/GeoRSCLIP-ViT-H-14",
    subfolder="diffusers/text_encoder",  # text encoder stored under the repo's diffusers/ folder
    torch_dtype=torch.float16
)
tokenizer = CLIPTokenizer.from_pretrained(
    "BiliSakura/GeoRSCLIP-ViT-H-14"
)

# Encode text prompt
prompt = "a satellite image of a city with buildings and roads"
text_inputs = tokenizer(
    prompt,
    padding="max_length",
    max_length=77,
    truncation=True,
    return_tensors="pt"
)

with torch.inference_mode():
    text_outputs = text_encoder(text_inputs.input_ids)
    text_embeddings = text_outputs.last_hidden_state

print(f"Text embeddings shape: {text_embeddings.shape}")

Using with Stable Diffusion:

from diffusers import StableDiffusionPipeline
import torch

# Load the pipeline, swapping in the text encoder and tokenizer loaded above
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    text_encoder=text_encoder,
    tokenizer=tokenizer,
    torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

# Generate image
prompt = "a high-resolution satellite image of urban area"
image = pipe(prompt, num_inference_steps=50, guidance_scale=7.5).images[0]
image.save("generated_image.png")
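
For reproducible outputs, diffusers accepts a seeded generator and an optional negative prompt; the sketch below continues from the pipeline above, and the seed value and negative prompt are arbitrary choices.

# Reproducible generation with a fixed seed and a negative prompt
generator = torch.Generator(device="cuda").manual_seed(42)
image = pipe(
    "a high-resolution satellite image of urban area",
    negative_prompt="blurry, low quality",
    num_inference_steps=50,
    guidance_scale=7.5,
    generator=generator
).images[0]
image.save("generated_image_seed42.png")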

Citation

If you use this model in your research, please cite the original work:

@article{zhangRS5MGeoRSCLIPLargeScale2024,
  title = {{RS5M} and {GeoRSCLIP}: A Large-Scale Vision-Language Dataset and a Large Vision-Language Model for Remote Sensing},
  shorttitle = {{RS5M} and {GeoRSCLIP}},
  author = {Zhang, Zilun and Zhao, Tiancheng and Guo, Yulong and Yin, Jianwei},
  year = {2024},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  volume = {62},
  pages = {1--23},
  issn = {1558-0644},
  doi = {10.1109/TGRS.2024.3449154},
  keywords = {Computational modeling,Data models,Domain VLM (DVLM),general VLM (GVLM),image-text paired dataset,Location awareness,parameter efficient tuning,Remote sensing,remote sensing (RS),RS cross-modal text-image retrieval (RSCTIR),semantic localization (SeLo),Semantics,Tuning,vision-language model (VLM),Visualization,zero-shot classification (ZSC)}
}