Datasets

A curated guide to spatial omics datasets commonly used for benchmarking, method development, and biological discovery. It is organized by use case to help you identify the right dataset for a given analysis.

Standard benchmark datasets

These datasets are the most frequently used in spatial omics benchmarking papers. They have well-characterized ground truth annotations and are widely available.

Name	Technology	Tissue	Spots/Cells	Genes	Ground truth	Key paper	Download
DLPFC	Visium	Human dorsolateral prefrontal cortex	~3,500/sample (12 samples)	~33,000	Manual layer annotations (L1--L6, WM)	Maynard et al. 2021	spatialLIBD
Mouse brain MERFISH	MERFISH	Mouse whole brain	~5 million cells	500	Allen Brain Cell Atlas cell types	Allen Institute	Allen Brain Cell Atlas
Mouse embryo seqFISH	seqFISH	Mouse embryo (E8.5--E8.75)	~20,000 cells	351	Expert cell-type annotations	Lohoff et al. 2022	ArrayExpress
Slide-seqV2 cerebellum	Slide-seqV2	Mouse cerebellum	~40,000 beads	~20,000	Known cerebellar architecture	Stickels et al. 2021	Broad Single Cell Portal
MIBI-TOF breast	MIBI-TOF	Human breast cancer	~200,000 cells	36 proteins	Expert phenotyping	Keren et al. 2018	Zenodo
CosMx NSCLC	CosMx SMI	Human non-small cell lung cancer	~800,000 cells	960	Pathologist annotations	He et al. 2022	NanoString
Stereo-seq mouse embryo	Stereo-seq	Mouse embryo (E9.5--E16.5)	~100 million spots	~25,000	Developmental stage annotations	Chen et al. 2022	STOmicsDB
Xenium breast cancer	Xenium	Human breast cancer	~100,000 cells	313	Pathologist annotations	Janesick et al. 2023	10x Genomics
CODEX mouse spleen	CODEX	Mouse spleen	~70,000 cells	30 proteins	Known splenic architecture	Goltsev et al. 2018	HuBMAP

The DLPFC caveat

The DLPFC dataset is by far the most common benchmark for spatial domain detection. Because its layered cortical architecture makes it relatively easy to segment into domains, performance on DLPFC may not generalize to tissues with irregular or overlapping spatial patterns. See Benchmark Synthesis for discussion.
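
As a rough illustration of how predicted domains might be scored against the manual layer annotations, the sketch below computes the adjusted Rand index (ARI) in plain Python. The choice of ARI is an assumption here (it is a common metric in domain-detection benchmarks, but this guide does not prescribe one), and the labels are toy values, not real DLPFC data.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """Adjusted Rand index between two labelings of the same spots."""
    if len(truth) != len(pred):
        raise ValueError("label lists must have the same length")
    n = len(truth)
    if n < 2:
        raise ValueError("need at least two spots")
    # Contingency counts: number of spots sharing each (truth, pred) pair.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    sum_rows = sum(comb(c, 2) for c in Counter(truth).values())
    sum_cols = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every spot in one group.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# Toy labels (hypothetical): six spots annotated with three layers.
manual_layers = ["L1", "L1", "L2", "L2", "WM", "WM"]
perfect_clusters = [0, 0, 1, 1, 2, 2]   # same partition, different names
coarse_clusters = [0, 0, 0, 1, 1, 1]    # merges and splits layers

ari_perfect = adjusted_rand_index(manual_layers, perfect_clusters)
ari_coarse = adjusted_rand_index(manual_layers, coarse_clusters)
print(f"perfect: {ari_perfect:.3f}, coarse: {ari_coarse:.3f}")
```

ARI ignores cluster names and corrects for chance agreement, so a high score on a cleanly layered tissue such as DLPFC says little about how a method handles irregular domains.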

Large-scale atlases

These atlas-scale datasets provide comprehensive spatial maps of organs or organisms. They are valuable for reference-based analyses, atlas-level questions, and training foundation models.

Name	Technology	Scope	Scale	Access
Allen Brain Cell Atlas	MERFISH + scRNA-seq	Mouse whole brain	~5 million cells, 500 genes (MERFISH) + full transcriptome (scRNA-seq)	portal.brain-map.org
HuBMAP	Multiple (Visium, CODEX, MALDI, etc.)	Human organs (kidney, intestine, spleen, etc.)	Millions of cells across tissues	hubmapconsortium.org
Human Cell Atlas (spatial)	Multiple	Human organs	Varies by consortium	humancellatlas.org
CZ CELLxGENE Census	scRNA-seq + spatial	Human + mouse, many tissues	>50 million cells (Census), spatial collections growing	cellxgene.cziscience.com
Mouse Brain MERFISH (BICCN)	MERFISH	Mouse motor cortex	~300,000 cells, 258 genes	BICCN portal

Disease-specific datasets

These datasets are particularly relevant for translational research and clinical spatial omics.

Name	Technology	Disease/Tissue	Key finding	Reference
Visium FFPE breast cancer	Visium	Breast cancer (FFPE)	Spatial heterogeneity of tumor subtypes within single sections	10x Genomics
Spatial PDAC	Spatial Transcriptomics (ST) arrays + scRNA-seq	Pancreatic ductal adenocarcinoma	Spatially resolved tumor-stroma interactions	Moncada et al. 2020
Glioblastoma spatial	Visium + scRNA-seq	Glioblastoma	Spatial organization of tumor cell states	Ravi et al. 2022
Alzheimer's spatial	Visium + MERFISH	Alzheimer's disease (human brain)	Spatial patterns of neurodegeneration	Multiple studies
CRC spatial	CODEX	Colorectal cancer	Immune microenvironment architecture	Schürch et al. 2020
COVID-19 lung	Visium + scRNA-seq	COVID-19 lung tissue	Spatial immune response in SARS-CoV-2 infection	Melms et al. 2021
Melanoma spatial	Visium + scRNA-seq	Melanoma	Spatial organization of immune evasion	Biermann et al. 2022

Data repositories

Repository	Content	Format	Access
SODB	30+ spatial datasets, standardized	AnnData/SpatialData	Free, web interface
STOmicsDB	Stereo-seq and other spatial datasets	Varies	Free registration
SpatialDB	Curated spatial transcriptomics datasets	Varies	Free
10x Genomics Datasets	Visium, Xenium, Visium HD demo datasets	Vendor formats (10x HDF5, Zarr)	Free
CZ CELLxGENE	scRNA-seq + growing spatial collection	AnnData	Free
Broad Single Cell Portal	scRNA-seq + spatial (Slide-seq)	Varies	Free registration
GEO	Raw data for most published spatial studies	Raw counts + coordinates	Free
Zenodo	Processed datasets from publications	Varies	Free
SpatialData (scverse)	Framework for spatial data + example datasets	SpatialData/Zarr	Free, Python API

Dataset selection guide

Analysis need	Recommended dataset	Why
Benchmarking spatial clustering	DLPFC (Visium)	Most widely used, enables comparison with published results
Benchmarking deconvolution	DLPFC + matching scRNA-seq	Well-annotated layers provide pseudo-ground truth
Testing segmentation methods	CosMx NSCLC or Xenium breast	High-quality imaging + transcript coordinates
Multi-resolution analysis	Mouse brain MERFISH + Visium	Same tissue, different technologies
Testing at scale (>1M cells)	Stereo-seq mouse embryo	Among the largest publicly available spatial datasets
Tumor microenvironment	Xenium breast or CRC CODEX	Well-characterized immune landscapes
Reference atlas for mapping	Allen Brain Cell Atlas	Gold-standard brain reference
Protein-level spatial analysis	CODEX mouse spleen or MIBI-TOF breast	Well-characterized protein panels
Multi-modal (RNA + protein)	HuBMAP datasets	Multiple modalities, same tissue
Quick exploration / tutorials	10x Genomics demo datasets	Small, well-documented, easy to download
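
The same kind of lookup can be scripted. The sketch below encodes the standard benchmark table from this page as records and filters them by modality and approximate size. The `Benchmark` class, the `find_benchmarks` helper, and the RNA/protein labels are hypothetical names introduced for illustration; the counts are the approximate figures from the table, with DLPFC taken as ~3,500 spots times 12 samples.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Benchmark:
    name: str
    technology: str
    modality: str      # "RNA" or "protein", inferred from the Genes column
    approx_units: int  # approximate spots, beads, or cells
    n_features: int    # approximate genes or proteins measured


# Approximate values copied from the standard benchmark table.
BENCHMARKS = [
    Benchmark("DLPFC", "Visium", "RNA", 3_500 * 12, 33_000),
    Benchmark("Mouse brain MERFISH", "MERFISH", "RNA", 5_000_000, 500),
    Benchmark("Mouse embryo seqFISH", "seqFISH", "RNA", 20_000, 351),
    Benchmark("Slide-seqV2 cerebellum", "Slide-seqV2", "RNA", 40_000, 20_000),
    Benchmark("MIBI-TOF breast", "MIBI-TOF", "protein", 200_000, 36),
    Benchmark("CosMx NSCLC", "CosMx SMI", "RNA", 800_000, 960),
    Benchmark("Stereo-seq mouse embryo", "Stereo-seq", "RNA", 100_000_000, 25_000),
    Benchmark("Xenium breast cancer", "Xenium", "RNA", 100_000, 313),
    Benchmark("CODEX mouse spleen", "CODEX", "protein", 70_000, 30),
]


def find_benchmarks(modality=None, min_units=0, max_units=None):
    """Return benchmarks matching a modality and an approximate size range."""
    hits = []
    for b in BENCHMARKS:
        if modality is not None and b.modality != modality:
            continue
        if b.approx_units < min_units:
            continue
        if max_units is not None and b.approx_units > max_units:
            continue
        hits.append(b)
    return hits


protein_sets = [b.name for b in find_benchmarks(modality="protein")]
large_sets = [b.name for b in find_benchmarks(min_units=1_000_000)]
print("Protein panels:", protein_sets)
print("More than 1M units:", large_sets)
```

Because the counts are order-of-magnitude estimates, treat the size filter as a coarse screen and check each dataset's own documentation before committing storage or compute.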

Practical notes

Starting with a benchmark? Use the DLPFC.

Despite its limitations, the DLPFC dataset is the easiest way to compare results with published methods. Use it for initial validation, then test on more challenging datasets.

Data formats

Most datasets are available in, or can be converted to, AnnData (H5AD) format, the standard for the scverse ecosystem (Scanpy, Squidpy, SpatialData). The SpatialData framework, through its companion spatialdata-io readers, loads many common vendor formats and wraps spatial coordinates, images, and expression data into a unified structure.

Data size considerations

Spatial datasets can be very large. Stereo-seq and MERFISH whole-brain datasets exceed 100 GB. Plan storage and memory requirements before downloading. Many repositories offer subsets or downsampled versions for initial exploration.
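
A back-of-the-envelope estimate can help with that planning. The sketch below compares the in-memory size of a dense expression matrix with a compressed sparse row (CSR) layout. The helper names are hypothetical, the cell and gene counts are the approximate figures from the benchmark table, and the 1% density for Stereo-seq is an assumed value for illustration, not a measured one.

```python
def dense_bytes(n_obs, n_vars, itemsize=4):
    """Bytes for a dense matrix (itemsize=4 corresponds to float32)."""
    return n_obs * n_vars * itemsize


def csr_bytes(n_obs, n_vars, density, itemsize=4, index_size=4):
    """Approximate bytes for a CSR matrix with the given nonzero fraction.

    Each stored value needs the value itself plus a column index; each row
    needs one row pointer, plus one extra pointer at the end.
    """
    nnz = round(n_obs * n_vars * density)
    return nnz * (itemsize + index_size) + (n_obs + 1) * index_size


def to_gb(n_bytes):
    """Convert bytes to decimal gigabytes."""
    return n_bytes / 1e9


# Mouse brain MERFISH: ~5 million cells x 500 genes, dense float32.
merfish_dense = dense_bytes(5_000_000, 500)

# Stereo-seq mouse embryo: ~100 million spots x ~25,000 genes.
stereo_dense = dense_bytes(100_000_000, 25_000)
# Assumed 1% density; 8-byte indices because the number of nonzeros
# exceeds what a 32-bit index can address.
stereo_sparse = csr_bytes(100_000_000, 25_000, density=0.01, index_size=8)

print(f"MERFISH dense:     {to_gb(merfish_dense):,.1f} GB")
print(f"Stereo-seq dense:  {to_gb(stereo_dense):,.1f} GB")
print(f"Stereo-seq sparse: {to_gb(stereo_sparse):,.1f} GB")
```

These figures cover only the expression matrix. Images, transcript coordinate tables, and intermediate results add to the total, so treat the output as a lower bound.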