State of Benchmarking
The Core Problem
The most commonly used benchmark in spatial omics — the human DLPFC dataset — tests domain detection, not niche identification. Methods optimized for DLPFC learn to find contiguous cortical layers. This is a valid task, but it is not the same as finding recurring cellular microenvironments.
As a result, we have extensive benchmarks for a task that is adjacent to, but distinct from, the one most niche researchers care about.
Existing Benchmarks
Domain-Focused Benchmarks
| Benchmark | Methods | Datasets | What It Tests |
|---|---|---|---|
| NAR 2025 | 19 | 30 | Spatial domain detection (ARI against layer annotations) |
| Yuan et al. (Nature Methods, 2024) | 13 | 34 | Spatial clustering (ARI, NMI against manual annotations) |
| Genome Biology 2024 | 16 clustering + 5 alignment + 5 integration | Multiple | Clustering, alignment, and integration |
These benchmarks consistently use DLPFC layer annotations as ground truth. A method that scores high on DLPFC may or may not identify distributed niches in tumor tissue.
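The scoring convention these benchmarks share is simple: treat the manual layer annotations as ground truth and compute agreement metrics such as ARI and NMI against a method's cluster labels. A minimal sketch of that loop (the label vectors below are toy placeholders, not real DLPFC annotations):

```python
# Sketch of the standard domain-benchmark scoring step: compare a method's
# predicted cluster labels against manual layer annotations with ARI and NMI.
# The labels here are illustrative placeholders, not real DLPFC data.
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

layer_annotations = ["L1", "L1", "L2", "L2", "L3", "L3", "WM", "WM"]
predicted_domains = [0, 0, 1, 1, 1, 2, 3, 3]

# ARI corrects for chance agreement; NMI normalizes shared information.
ari = adjusted_rand_score(layer_annotations, predicted_domains)
nmi = normalized_mutual_info_score(layer_annotations, predicted_domains)
print(f"ARI={ari:.3f}  NMI={nmi:.3f}")
```

Note that both metrics are label-permutation invariant, which is why they can compare integer cluster IDs against named layers directly.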
Niche-Focused Benchmarks
| Benchmark | Methods | Datasets | What It Tests |
|---|---|---|---|
| Niche ID benchmark (bioRxiv, 2026) | 16 | Multiple | Niche identification via domain segmentation — closer to niche testing but still relies on domain-like annotations |
| CellCharter (Nature Genetics, 2024) | 6 | Cross-platform | Multi-resolution niche detection |
| NicheCompass (Nature Genetics, 2025) | 5 | Cross-sample | Communication-aware niche atlas building |
These are closer to evaluating niche methods but still limited by the availability of ground-truth niche annotations.
The Annotation Problem
True niche benchmarking requires ground-truth niche labels — and those barely exist. The CRC CODEX dataset (Schürch et al., 2020) is the closest thing: 9 cellular neighborhoods manually validated by pathologists. But:
- Nine neighborhoods is a small number of classes for a benchmark.
- The neighborhoods were defined by the same composition-based approach being benchmarked, so using them as ground truth is circular.
- The dataset is multiplexed protein imaging (CODEX), not spatial transcriptomics.
No spatial transcriptomics dataset has expert-validated niche annotations suitable for benchmarking.
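The composition-based approach in question (in the style of the Schürch et al. cellular-neighborhood workflow) can be sketched as: for each cell, compute the cell-type composition of its k nearest spatial neighbors, then cluster those composition vectors with k-means. The window size, cluster count, and data below are illustrative, not the published values:

```python
# Sketch of composition-based neighborhood detection, in the style of the
# Schürch et al. CODEX workflow. All data and parameters are illustrative.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_cells, n_types = 500, 4
coords = rng.uniform(0, 100, size=(n_cells, 2))      # toy x/y positions
cell_types = rng.integers(0, n_types, size=n_cells)  # toy cell-type labels

# 1. For each cell, find its k nearest spatial neighbors (self included).
k = 10
nbrs = NearestNeighbors(n_neighbors=k).fit(coords)
_, idx = nbrs.kneighbors(coords)

# 2. Build per-cell composition vectors: fraction of each type in the window.
comp = np.zeros((n_cells, n_types))
for t in range(n_types):
    comp[:, t] = (cell_types[idx] == t).mean(axis=1)

# 3. Cluster the composition vectors into neighborhoods.
n_niches = 3
niches = KMeans(n_clusters=n_niches, n_init=10, random_state=0).fit_predict(comp)
print(niches[:10])
```

The circularity concern follows directly from this sketch: if ground-truth labels were produced by step 3, a method built on the same composition vectors is being graded against its own definition.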
What a Good Niche Benchmark Would Look Like
- Distributed ground truth: Niche annotations at multiple disconnected locations (not contiguous regions).
- Multiple niche types: More than binary (niche vs not-niche) — a hierarchy of niche definitions tested simultaneously.
- Cross-definition evaluation: Test whether methods using different niche definitions (composition, expression, communication) find consistent structure.
- Biological validation: Functional readouts (perturbation data, clinical outcome) to validate that identified niches are biologically meaningful, not just statistically separable.
- Multi-platform: Evaluate across imaging and sequencing platforms.
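The cross-definition criterion above can be made concrete: run each definition's method on the same cells and compare the resulting label sets pairwise with a permutation-invariant metric. A sketch with toy label vectors standing in for real method outputs:

```python
# Sketch of cross-definition consistency checking: pairwise ARI between
# niche labelings produced under different definitions. The label vectors
# are toy stand-ins for real method outputs on the same cells.
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

labelings = {
    "composition":   [0, 0, 1, 1, 2, 2, 0, 1],
    "expression":    [1, 1, 0, 0, 2, 2, 1, 0],
    "communication": [0, 1, 1, 0, 2, 2, 0, 1],
}

for (a, la), (b, lb) in combinations(labelings.items(), 2):
    # ARI is permutation-invariant: relabeled-but-identical partitions score 1.
    print(f"{a} vs {b}: ARI={adjusted_rand_score(la, lb):.3f}")
```

High pairwise ARI would suggest the definitions recover the same underlying structure; low ARI would suggest they are measuring genuinely different things, which is itself a finding.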
Current State
We are in a situation where:
- Domain benchmarks are mature but test the wrong task for niche researchers.
- Niche benchmarks are nascent and lack proper ground truth.
- Cross-method comparison is almost impossible because different methods define niches differently — comparing ARI scores across methods that use different niche definitions is not meaningful.
The field needs a DLPFC-equivalent for niches: a well-annotated dataset with expert-validated niche labels that the community agrees to benchmark against. The Schürch CRC dataset is the closest candidate but has the limitations noted above.