The Aerial-D Dataset for Generalized Referring Expression Segmentation on Aerial Photos

The Largest Referring Expression Segmentation Dataset for Aerial Imagery
Luís Pedro Soares Marnoto
Instituto Superior Técnico, Universidade de Lisboa

Dataset Statistics

1.5M+ Referring Expressions
37,288 Image Patches
259K+ Annotated Targets
21 Object Classes

Abstract

Referring expression segmentation is a fundamental task in computer vision that couples natural language understanding with precise visual localization of target regions. Aerial imagery (for example, modern drone surveys, historic aerial archives, and high-resolution satellite captures) introduces unique challenges: spatial resolution varies widely across datasets, color usage is inconsistent, targets often shrink to only a few pixels, and scenes exhibit extreme object densities with frequent occlusions. This work presents Aerial-D, a large-scale referring expression segmentation dataset for aerial imagery comprising 37,288 image patches with 1,522,523 referring expressions covering 259,709 annotated targets. Targets span individual instances, coherent groups, and semantic categories across 21 distinct classes, ranging from vehicles and infrastructure to land-cover types.

The dataset is constructed through a fully automatic pipeline that combines systematic rule-based expression generation with Large Language Model enhancement, enriching both the linguistic variety and the visual detail captured in each description. The pipeline also produces a dedicated historic counterpart for every scene, supporting archival analyses such as monitoring urban change across decades. We adopt the RSRefSeg architecture, which pairs a SigLIP2 vision-language encoder with a SAM segmentation decoder, and train models on Aerial-D together with prior aerial datasets, yielding unified instance and semantic segmentation from text for both modern and historic imagery. Results show that this combined training achieves competitive performance on contemporary benchmarks while maintaining strong accuracy under the monochrome, sepia, and grainy degradations characteristic of archival aerial photography.

Dataset Examples

Aerial-D Dataset Examples

Key Features

🎯 Comprehensive Coverage
Includes single instances, groups, and full semantic categories across 21 distinct classes from vehicles to land-cover types.
🤖 Fully Automated Pipeline
First fully automatic construction pipeline combining rule-based generation with LLM enhancement and distillation.
🌍 Large-Scale
Over 1.5 million expressions across 37K+ patches—significantly larger than prior RRSIS benchmarks.

Acknowledgements

This dataset builds upon two foundational aerial imagery datasets:

iSAID Dataset
Zamir, S. W., Arora, A., Gupta, A., et al. (2019). iSAID: A Large-scale Dataset for Instance Segmentation in Aerial Images. CVPR Workshops. We thank the authors for making their dataset publicly available.
LoveDA Dataset
Wang, J., Zheng, Z., Ma, A., Lu, X., & Zhong, Y. (2021). LoveDA: A Remote Sensing Land-Cover Dataset for Domain Adaptive Semantic Segmentation. NeurIPS Datasets and Benchmarks Track. We thank the authors for making their dataset publicly available.

Citation

If you use this dataset or code, please cite:

@article{marnoto2025aeriald,
  title={The Aerial-D Dataset for Generalized Referring Expression Segmentation on Aerial Photos},
  author={Marnoto, Luís Pedro Soares},
  journal={IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing (J-STARS)},
  year={2025},
  note={Submitted}
}

Contact: luis.marnoto.gaspar.lopes@tecnico.ulisboa.pt