arXiv:2601.11310

Context-Aware Semantic Segmentation via Stage-Wise Attention

Published on Jan 16

AI-generated summary

CASWiT is a dual-branch Swin-based transformer architecture that combines global context and high-resolution features for semantic segmentation of ultra-high-resolution remote sensing images through cross-scale fusion and SimMIM-style pretraining.

Abstract

Semantic segmentation of ultra-high-resolution (UHR) images is essential in remote sensing applications such as aerial mapping and environmental monitoring. Transformer-based models struggle in this setting because memory grows quadratically with token count, constraining either the contextual scope or the spatial resolution. We introduce CASWiT (Context-Aware Stage-Wise Transformer), a dual-branch, Swin-based architecture that injects global cues into fine-grained UHR features. A context encoder processes a downsampled neighborhood to capture long-range dependencies, while a high-resolution encoder extracts detailed features from UHR patches. A cross-scale fusion module, combining cross-attention and gated feature injection, enriches high-resolution tokens with context. Beyond the architecture, we propose a SimMIM-style pretraining scheme: we mask 75% of the high-resolution image tokens as well as the low-resolution center region that spatially corresponds to the UHR patch, then train the shared dual encoder with a small decoder to reconstruct the original UHR image. Extensive experiments on the large-scale IGN FLAIR-HUB aerial dataset demonstrate the effectiveness of CASWiT: our method achieves 65.83% mIoU, outperforming RGB baselines by 1.78 points. On URUR, CASWiT achieves 49.1% mIoU, surpassing the current SoTA by 0.9 points under the official evaluation protocol. All code is available at https://huggingface.co/collections/heig-vd-geo/caswit.
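To make the fusion step concrete, here is a minimal PyTorch sketch of a cross-scale fusion block in the spirit the abstract describes: high-resolution tokens attend to context tokens via cross-attention, and a learned gate controls how much of the retrieved context is injected. Module structure, normalization placement, and the exact gating form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class CrossScaleFusion(nn.Module):
    """Hypothetical sketch: enrich high-resolution (HR) tokens with global
    context via cross-attention followed by gated feature injection."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_hr = nn.LayerNorm(dim)
        self.norm_ctx = nn.LayerNorm(dim)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # The gate decides, per token and channel, how much context to inject.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, hr_tokens: torch.Tensor, ctx_tokens: torch.Tensor):
        # hr_tokens:  (B, N_hr, C)  from the high-resolution encoder
        # ctx_tokens: (B, N_ctx, C) from the downsampled-context encoder
        q = self.norm_hr(hr_tokens)
        kv = self.norm_ctx(ctx_tokens)
        ctx, _ = self.cross_attn(q, kv, kv)   # HR queries attend to context
        g = self.gate(torch.cat([hr_tokens, ctx], dim=-1))
        return hr_tokens + g * ctx            # gated injection, residual form
```

In a dual-branch setup, `hr_tokens` would come from the encoder run over the UHR patch and `ctx_tokens` from the encoder run over the downsampled neighborhood; the residual form leaves the high-resolution stream untouched when the gate closes.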
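The pretraining masks can be sketched in the same spirit. The hypothetical helper below masks 75% of the HR patch tokens at random and deterministically masks the center of the low-resolution context grid that spatially corresponds to the UHR patch. Grid sizes and the center window are illustrative assumptions; the abstract only specifies the 75% ratio and the center-region masking.

```python
import torch

def simmim_style_masks(batch: int, hr_grid=(64, 64), ctx_grid=(16, 16),
                       mask_ratio: float = 0.75, center=(4, 12)):
    """Hypothetical masking sketch: True = masked token.
    Returns (hr_mask: (B, N_hr), ctx_mask: (B, N_ctx))."""
    n_hr = hr_grid[0] * hr_grid[1]
    n_keep = int(n_hr * (1.0 - mask_ratio))
    # Random per-sample HR mask: un-mask the n_keep highest-scoring tokens.
    scores = torch.rand(batch, n_hr)
    keep = scores.topk(n_keep, dim=1).indices
    hr_mask = torch.ones(batch, n_hr, dtype=torch.bool)
    hr_mask[torch.arange(batch).unsqueeze(1), keep] = False
    # Deterministic center mask over the context grid (rows/cols r0:r1).
    r0, r1 = center
    ctx_mask = torch.zeros(batch, *ctx_grid, dtype=torch.bool)
    ctx_mask[:, r0:r1, r0:r1] = True
    return hr_mask, ctx_mask.flatten(1)
```

During pretraining, masked tokens would be replaced by a learned mask token before encoding, and the small shared decoder would reconstruct the original UHR image with a loss on the masked regions, following the SimMIM recipe (SimMIM uses an L1 loss on masked pixels).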
