Datasets
A curated guide to spatial omics datasets commonly used for benchmarking, method development, and biological discovery. It is organized by use case to help you identify the right dataset for a given analysis need.
Standard benchmark datasets
These datasets are the most frequently used in spatial omics benchmarking papers. They have well-characterized ground truth annotations and are widely available.
| Name | Technology | Tissue | Spots/Cells | Genes | Ground truth | Key paper | Download |
|---|---|---|---|---|---|---|---|
| DLPFC | Visium | Human dorsolateral prefrontal cortex | ~3,500/sample (12 samples) | ~33,000 | Manual layer annotations (L1--L6, WM) | Maynard et al. 2021 | spatialLIBD |
| Mouse brain MERFISH | MERFISH | Mouse whole brain | ~5 million cells | 500 | Allen Brain Cell Atlas cell types | Allen Institute | Allen Brain Cell Atlas |
| Mouse embryo seqFISH | seqFISH | Mouse embryo (E8.5--E8.75) | ~20,000 cells | 351 | Expert cell-type annotations | Lohoff et al. 2022 | ArrayExpress |
| Slide-seqV2 cerebellum | Slide-seqV2 | Mouse cerebellum | ~40,000 beads | ~20,000 | Known cerebellar architecture | Stickels et al. 2021 | Broad Single Cell Portal |
| MIBI-TOF breast | MIBI-TOF | Human breast cancer | ~200,000 cells | 36 proteins | Expert phenotyping | Keren et al. 2018 | Zenodo |
| CosMx NSCLC | CosMx SMI | Human non-small cell lung cancer | ~800,000 cells | 960 | Pathologist annotations | He et al. 2022 | NanoString |
| Stereo-seq mouse embryo | Stereo-seq | Mouse embryo (E9.5--E16.5) | ~100 million spots | ~25,000 | Developmental stage annotations | Chen et al. 2022 | STOmicsDB |
| Xenium breast cancer | Xenium | Human breast cancer | ~100,000 cells | 313 | Pathologist annotations | Janesick et al. 2023 | 10x Genomics |
| CODEX mouse spleen | CODEX | Mouse spleen | ~70,000 cells | 30 proteins | Known splenic architecture | Goltsev et al. 2018 | HuBMAP |
The DLPFC caveat
The DLPFC dataset is by far the most common benchmark for spatial domain detection. Its layered cortical architecture makes it relatively easy to segment into domains, so strong performance on DLPFC may not generalize to tissues with irregular or overlapping spatial patterns. See Benchmark Synthesis for discussion.
Large-scale atlases
These atlas-scale datasets provide comprehensive spatial maps of organs or organisms. They are valuable for reference-based analyses, atlas-level questions, and training foundation models.
| Name | Technology | Scope | Scale | Access |
|---|---|---|---|---|
| Allen Brain Cell Atlas | MERFISH + scRNA-seq | Mouse whole brain | ~5 million cells, 500 genes (MERFISH) + full transcriptome (scRNA-seq) | portal.brain-map.org |
| HuBMAP | Multiple (Visium, CODEX, MALDI, etc.) | Human organs (kidney, intestine, spleen, etc.) | Millions of cells across tissues | hubmapconsortium.org |
| Human Cell Atlas (spatial) | Multiple | Human organs | Varies by consortium | humancellatlas.org |
| CZ CELLxGENE Census | scRNA-seq + spatial | Human + mouse, many tissues | >50 million cells (Census), spatial collections growing | cellxgene.cziscience.com |
| Mouse Brain MERFISH (BICCN) | MERFISH | Mouse motor cortex | ~300,000 cells, 252 genes | BICCN portal |
Disease-specific datasets
These datasets are particularly relevant for translational research and clinical spatial omics.
| Name | Technology | Disease/Tissue | Key finding | Reference |
|---|---|---|---|---|
| Visium FFPE breast cancer | Visium | Breast cancer (FFPE) | Spatial heterogeneity of tumor subtypes within single sections | 10x Genomics |
| Spatial PDAC | Spatial Transcriptomics (ST) arrays + scRNA-seq | Pancreatic ductal adenocarcinoma | Spatially resolved tumor-stroma interactions | Moncada et al. 2020 |
| Glioblastoma spatial | Visium + scRNA-seq | Glioblastoma | Spatial organization of tumor cell states | Ravi et al. 2022 |
| Alzheimer's spatial | Visium + MERFISH | Alzheimer's disease (human brain) | Spatial patterns of neurodegeneration | Multiple studies |
| CRC spatial | CODEX + scRNA-seq | Colorectal cancer | Immune microenvironment architecture | Pelka et al. 2021 |
| COVID-19 lung | Visium + scRNA-seq | COVID-19 lung tissue | Spatial immune response in SARS-CoV-2 infection | Melms et al. 2021 |
| Melanoma spatial | Visium + scRNA-seq | Melanoma | Spatial organization of immune evasion | Biermann et al. 2022 |
Data repositories
| Repository | Content | Format | Access |
|---|---|---|---|
| SODB | 30+ spatial datasets, standardized | AnnData/SpatialData | Free, web interface |
| STOmicsDB | Stereo-seq and other spatial datasets | Varies | Free registration |
| SpatialDB | Curated spatial transcriptomics datasets | Varies | Free |
| 10x Genomics Datasets | Visium, Xenium, Visium HD demo datasets | Vendor formats (e.g., HDF5 feature-barcode matrices) | Free |
| CZ CELLxGENE | scRNA-seq + growing spatial collection | AnnData | Free |
| Broad Single Cell Portal | scRNA-seq + spatial (Slide-seq) | Varies | Free registration |
| GEO | Raw data for most published spatial studies | Raw counts + coordinates | Free |
| Zenodo | Processed datasets from publications | Varies | Free |
| SpatialData (scverse) | Framework for spatial data + example datasets | SpatialData/Zarr | Free, Python API |
Dataset selection guide
| Analysis need | Recommended dataset | Why |
|---|---|---|
| Benchmarking spatial clustering | DLPFC (Visium) | Most widely used, enables comparison with published results |
| Benchmarking deconvolution | DLPFC + matching scRNA-seq | Well-annotated layers provide pseudo-ground truth |
| Testing segmentation methods | CosMx NSCLC or Xenium breast | High-quality imaging + transcript coordinates |
| Multi-resolution analysis | Mouse brain MERFISH + Visium | Same tissue, different technologies |
| Testing at scale (>1M cells) | Stereo-seq mouse embryo | Among the largest publicly available spatial datasets (~100 million spots) |
| Tumor microenvironment | Xenium breast or CRC CODEX | Well-characterized immune landscapes |
| Reference atlas for mapping | Allen Brain Cell Atlas | Gold-standard brain reference |
| Protein-level spatial analysis | CODEX mouse spleen or MIBI-TOF breast | Well-characterized protein panels |
| Multi-modal (RNA + protein) | HuBMAP datasets | Multiple modalities, same tissue |
| Quick exploration / tutorials | 10x Genomics demo datasets | Small, well-documented, easy to download |
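The selection criteria above can also be applied programmatically. The following is a minimal sketch, assuming the approximate sizes from the standard benchmark table; the `Dataset` record, the `CATALOG` list, and the `select` helper are hypothetical names introduced for illustration, and the `modality` labels are inferred from each row's "Genes" column (RNA panels versus protein panels).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dataset:
    name: str
    technology: str
    modality: str    # "RNA" or "protein" (inferred from the benchmark table)
    approx_obs: int  # approximate spots/cells/beads quoted in the table


# Figures are the rough counts from the benchmark table, not exact sizes.
CATALOG = [
    Dataset("DLPFC", "Visium", "RNA", 3_500 * 12),
    Dataset("Mouse brain MERFISH", "MERFISH", "RNA", 5_000_000),
    Dataset("Mouse embryo seqFISH", "seqFISH", "RNA", 20_000),
    Dataset("Slide-seqV2 cerebellum", "Slide-seqV2", "RNA", 40_000),
    Dataset("MIBI-TOF breast", "MIBI-TOF", "protein", 200_000),
    Dataset("CosMx NSCLC", "CosMx SMI", "RNA", 800_000),
    Dataset("Stereo-seq mouse embryo", "Stereo-seq", "RNA", 100_000_000),
    Dataset("Xenium breast cancer", "Xenium", "RNA", 100_000),
    Dataset("CODEX mouse spleen", "CODEX", "protein", 70_000),
]


def select(catalog, modality=None, min_obs=0):
    """Return names of datasets matching a modality and a minimum size."""
    return [
        d.name
        for d in catalog
        if (modality is None or d.modality == modality)
        and d.approx_obs >= min_obs
    ]


# "Testing at scale (>1M cells)" and "Protein-level spatial analysis"
at_scale = select(CATALOG, min_obs=1_000_000)
protein_level = select(CATALOG, modality="protein")
```

Filtering on size and modality reproduces two rows of the guide; the remaining rows depend on judgment (annotation quality, tissue type) that a simple filter does not capture.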
Practical notes
Starting with a benchmark? Use the DLPFC.
Despite its limitations, the DLPFC dataset is the easiest way to compare results with published methods. Use it for initial validation, then test on more challenging datasets.
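Comparing with published methods requires a shared score. This guide does not name one; the sketch below assumes the adjusted Rand index (ARI), which is commonly reported for spatial domain detection, computed between manual layer labels and predicted cluster labels. The function name and the toy labels are illustrative, not taken from any specific benchmark.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(truth, pred):
    """ARI between two labelings of the same spots (1.0 = identical partitions)."""
    if len(truth) != len(pred):
        raise ValueError("truth and pred must label the same spots")
    total_pairs = comb(len(truth), 2)
    if total_pairs == 0:
        raise ValueError("need at least two spots")

    # Pairs of spots grouped together in both, in truth only, in pred only.
    together_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    together_truth = sum(comb(c, 2) for c in Counter(truth).values())
    together_pred = sum(comb(c, 2) for c in Counter(pred).values())

    expected = together_truth * together_pred / total_pairs
    maximum = (together_truth + together_pred) / 2
    if maximum == expected:
        # Degenerate case, e.g. both labelings put every spot in one cluster.
        return 1.0
    return (together_both - expected) / (maximum - expected)


# Toy example: cluster IDs need not match layer names, only the grouping matters.
layers = ["L1", "L1", "L2", "L2", "WM", "WM"]
clusters = [0, 0, 1, 1, 2, 2]
score = adjusted_rand_index(layers, clusters)
```

In practice, `sklearn.metrics.adjusted_rand_score` computes the same quantity; the pure-Python version is shown only to make the calculation explicit. Spots without a manual annotation are usually excluded before scoring.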
Data formats
Most datasets are available in or convertible to AnnData (H5AD) format, the standard for the scverse ecosystem (Scanpy, Squidpy, SpatialData). The SpatialData framework provides loaders for many common formats and wraps spatial coordinates, images, and expression data into a unified structure.
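As a rough illustration of moving between these formats, the sketch below picks a reader from the file extension. The readers named (`anndata.read_h5ad`, `spatialdata.read_zarr`, `scanpy.read_10x_h5`) are existing functions in those libraries; the `READERS` mapping, the helper names, and the assumption that the extension alone identifies the format are illustrative simplifications. The libraries are imported lazily, so only the one needed for a given file has to be installed.

```python
import importlib
from pathlib import Path

# Extension -> (module, reader function). The mapping is an assumption for
# illustration; real downloads may need technology-specific loaders.
READERS = {
    ".h5ad": ("anndata", "read_h5ad"),      # AnnData on disk
    ".zarr": ("spatialdata", "read_zarr"),  # SpatialData store
    ".h5": ("scanpy", "read_10x_h5"),       # 10x HDF5 count matrix
}


def reader_for(path):
    """Return the (module, function) pair assumed to read this file."""
    suffix = Path(path).suffix.lower()
    if suffix not in READERS:
        raise ValueError(f"no reader registered for '{suffix}' files")
    return READERS[suffix]


def load(path):
    """Import the matching library on demand and read the file."""
    module_name, func_name = reader_for(path)
    module = importlib.import_module(module_name)
    return getattr(module, func_name)(path)
```

A call such as `load("sample.h5ad")` returns an AnnData object, while a `.zarr` store returns a SpatialData object, so downstream code still needs to know which structure it received.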
Data size considerations
Spatial datasets can be very large: Stereo-seq and MERFISH whole-brain datasets can exceed 100 GB. Plan storage and memory requirements before downloading. Many repositories offer subsets or downsampled versions for initial exploration.
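A back-of-the-envelope estimate helps with planning memory. The sketch below assumes 4-byte values and 4-byte indices and covers only the expression matrix; images, raw transcript coordinates, and intermediate files are not included and often account for most of the download size. The function names and the `density` parameter (the fraction of nonzero entries) are assumptions for illustration.

```python
def dense_matrix_gb(n_obs, n_genes, bytes_per_value=4):
    """Memory for a dense observations-by-genes matrix, in GB (10**9 bytes)."""
    return n_obs * n_genes * bytes_per_value / 1e9


def sparse_csr_gb(n_obs, n_genes, density, bytes_per_value=4, bytes_per_index=4):
    """Approximate memory for a CSR sparse matrix, in GB.

    CSR stores one value and one column index per nonzero entry, plus one
    row pointer per row (and one extra). Very large matrices may need
    8-byte indices, which raises the estimate.
    """
    nnz = int(n_obs * n_genes * density)
    n_bytes = nnz * (bytes_per_value + bytes_per_index) + (n_obs + 1) * bytes_per_index
    return n_bytes / 1e9


# Rough figures from the benchmark table above.
merfish_dense = dense_matrix_gb(5_000_000, 500)    # whole-brain MERFISH
dlpfc_dense = dense_matrix_gb(3_500, 33_000)       # one DLPFC sample
dlpfc_sparse = sparse_csr_gb(3_500, 33_000, 0.10)  # assuming 10% nonzero
```

With these assumptions, about 5 million cells by 500 genes comes to roughly 10 GB as a dense float32 matrix, while a single DLPFC sample is under 0.5 GB even when dense. The 10% density is a placeholder; measure it on a subset before relying on the sparse estimate.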
Further reading
- Benchmark Synthesis for how these datasets are used in benchmarks
- Technologies Overview for understanding which technology produced each dataset
- The Pipeline Problem for challenges in processing these datasets