Key Datasets
Datasets are organized by the analyses they support. Where the information is available, each entry notes the phosphoproteomics platform, the sample count, and what makes the dataset uniquely useful.
Pan-Cancer Atlases
CPTAC Pan-Cancer Phosphoproteome (Geffen et al., 2023)
- Cancer types: 10 (breast, ovarian, endometrial, colon, lung adenocarcinoma, lung squamous, head and neck, glioblastoma, renal clear cell, pancreatic)
- Samples: ~1,100 tumors with matched normal
- Phosphosites: 110,274 quantified
- Platform: TMT-based DDA (Orbitrap)
- Access: CPTAC Data Portal
- Best for: Pan-cancer kinase activity analysis, phospho-subtype discovery, cross-cancer convergence studies
- Caveat: TMT ratio compression pulls measured fold changes toward 1, so small differences are the hardest to quantify accurately
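The ratio-compression caveat can be illustrated with a toy model. The sketch below assumes that a fixed fraction of each channel's reporter signal comes from co-isolated peptides that do not change between conditions. The function name `observed_ratio` and the `interference` parameter are illustrative only and are not part of any CPTAC pipeline.

```python
def observed_ratio(true_ratio: float, interference: float) -> float:
    """Observed TMT ratio under a toy interference model.

    Assumption: in the reference channel, a fraction `interference` of the
    reporter signal is unchanging background and the rest is the target
    peptide. The treated channel carries the same background plus the
    target signal scaled by `true_ratio`.
    """
    if not 0.0 <= interference < 1.0:
        raise ValueError("interference must be in [0, 1)")
    signal = 1.0 - interference
    treated = true_ratio * signal + interference
    reference = signal + interference
    return treated / reference


# Compare true and observed ratios at an assumed 30% interference.
for true_ratio in (1.2, 2.0, 4.0):
    obs = observed_ratio(true_ratio, interference=0.3)
    print(f"true {true_ratio:.2f} -> observed {obs:.2f}")
```

Under this model every ratio shrinks toward 1, which is why a modest true change can fall below the noise floor after compression.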
CPTAC Proteogenomic Ecosystem (Vasaikar et al., 2023)
- Cancer types: 14 (extends the Geffen et al. cohort with additional tumor types)
- Samples: 1,524 tumors
- Data layers: Proteomics, phosphoproteomics, genomics, transcriptomics, clinical
- Access: LinkedOmicsKB
- Best for: Multi-omic integration, proteogenomic driver discovery
Single-Cancer Datasets
Breast Cancer CPTAC (Krug et al., 2020)
- Samples: 122 treatment-naive breast tumors
- Phosphosites: ~40,000 quantified
- Key value: PAM50 subtype-specific kinase activities; foundational for phospho-subtype concept
- Best for: Subtype-specific kinase activity, benchmark dataset for method development
Ovarian Cancer CPTAC (McDermott et al., 2020)
- Samples: 169 high-grade serous ovarian tumors
- Key value: Links phospho-signaling to platinum sensitivity
- Best for: Therapeutic response prediction, platinum resistance mechanisms
Lung Adenocarcinoma CPTAC (Gillette et al., 2020)
- Samples: 110 tumors
- Key value: Connects driver mutations to phospho-signaling consequences
- Best for: Mutation-to-phospho linking, EGFR/ALK/KRAS signaling studies
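One simple way to approach mutation-to-phospho linking is to compare a phosphosite's abundance between mutant and wild-type tumors. The sketch below is a minimal illustration with invented sample IDs and log2 values. The helper `median_log2_difference` is hypothetical, and a real analysis would add a statistical test and multiple-testing correction.

```python
from statistics import median


def median_log2_difference(abundance: dict[str, float], mutant_samples: set[str]) -> float:
    """Median log2 abundance in mutant tumors minus median in wild-type tumors."""
    mutant = [v for s, v in abundance.items() if s in mutant_samples]
    wild_type = [v for s, v in abundance.items() if s not in mutant_samples]
    if not mutant or not wild_type:
        raise ValueError("need at least one mutant and one wild-type sample")
    return median(mutant) - median(wild_type)


# Invented log2 abundances of one phosphosite across five tumors.
site_abundance = {"T1": 1.2, "T2": 0.9, "T3": -0.3, "T4": -0.1, "T5": 0.1}
mutant_tumors = {"T1", "T2"}  # hypothetical tumors carrying the driver mutation

diff = median_log2_difference(site_abundance, mutant_tumors)
print(f"median log2 difference (mutant - wild type): {diff:.2f}")
```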
Kinase Reference Datasets
KinaseLibrary Positional Scanning Data (Johnson et al., 2023)
- Kinases: 303 Ser/Thr kinases
- Method: Positional scanning peptide arrays
- Access: kinase-library.phosphosite.org
- Best for: Motif-based kinase-substrate prediction, motif comparison
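To show what motif-based prediction looks like in practice, here is a minimal sketch that scores a sequence window against a positional preference matrix. The matrix values are invented for a made-up kinase, and the log2-sum scoring is a simplification chosen for illustration. It is not the KinaseLibrary data or its scoring procedure.

```python
import math

# Hypothetical positional preferences for a made-up proline-directed kinase.
# Keys are positions relative to the phospho-acceptor (0). Values map amino
# acids to relative favorability (1.0 = neutral). All numbers are invented.
TOY_MATRIX = {
    -3: {"R": 1.5, "K": 1.3},
    -2: {"P": 1.8},
    1: {"P": 6.0},
    2: {"K": 1.4, "R": 1.4},
}


def motif_score(window: str, center: int, matrix: dict[int, dict[str, float]]) -> float:
    """Sum log2 favorabilities over matrix positions; unlisted residues count as neutral."""
    score = 0.0
    for offset, prefs in matrix.items():
        idx = center + offset
        if 0 <= idx < len(window):
            score += math.log2(prefs.get(window[idx], 1.0))
    return score


# Two toy 10-residue windows with the phospho-acceptor at index 5.
for name, window in {"matching": "GRAPLSPKRA", "neutral": "GAAGLSGAAA"}.items():
    print(name, round(motif_score(window, 5, TOY_MATRIX), 2))
```

Ranking every kinase's score for the same site, rather than reading one score in isolation, is the usual way such scores are compared across kinases.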
PhosphoSitePlus Curated Annotations
- Phosphosites: ~240,000 human sites
- Kinase-substrate pairs: ~10,000 curated
- Access: phosphosite.org
- Best for: Training and evaluating kinase-substrate inference tools
- Caveat: Heavy bias toward well-studied kinases (AKT, ERK, CDK families); ~130 kinases have zero annotated substrates
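Before training or evaluating on curated pairs, it can help to quantify this bias directly. The sketch below assumes the annotations have already been loaded as `(kinase, substrate_site)` tuples. The pairs, the placeholder kinase names, and the `annotation_coverage` helper are illustrative and do not reflect the PhosphoSitePlus file format.

```python
from collections import Counter


def annotation_coverage(pairs: list[tuple[str, str]], kinome: list[str]) -> dict:
    """Summarize how unevenly kinase-substrate annotations are distributed."""
    if not pairs:
        raise ValueError("no kinase-substrate pairs provided")
    counts = Counter(kinase for kinase, _ in pairs)
    top_kinase, top_n = counts.most_common(1)[0]
    return {
        # Kinases in the reference list with no annotated substrate at all.
        "unannotated": sorted(k for k in kinome if counts[k] == 0),
        "top_kinase": top_kinase,
        # Share of all pairs contributed by the single best-annotated kinase.
        "top_share": top_n / len(pairs),
    }


# Toy input: three pairs for one well-studied kinase, none for two placeholders.
toy_pairs = [
    ("AKT1", "SITE_A"), ("AKT1", "SITE_B"), ("AKT1", "SITE_C"),
    ("MAPK1", "SITE_D"), ("CDK1", "SITE_E"),
]
toy_kinome = ["AKT1", "MAPK1", "CDK1", "KINASE_X", "KINASE_Y"]
print(annotation_coverage(toy_pairs, toy_kinome))
```

A model evaluated only on such pairs is never tested on the unannotated kinases, so its benchmark score says little about performance on them.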
Perturbation Datasets (for Benchmarking)
Kinase Inhibitor Panels
Several studies provide perturbation data with known kinase targets, either by profiling phosphoproteome changes after treatment or by measuring drug-kinase binding directly:
| Study | Perturbation | System | Value |
|---|---|---|---|
| Wilkes et al. 2015 | 28 kinase inhibitors | Jurkat, A549 | benchmarKIN evaluation dataset |
| Klaeger et al. 2017 | 243 clinical kinase inhibitors | Cell-free kinobeads | Drug-kinase binding ground truth |
| Hijazi et al. 2020 | EGF/HGF time courses | HeLa | Temporal dynamics benchmark |
Best for: Evaluating kinase activity inference methods; the controlled perturbation provides ground truth for which kinases should change.
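A common way to use this ground truth is to check how highly an inference method ranks the inhibited kinase among all kinases it scores. The sketch below is one possible per-experiment metric, not the scoring used by any specific benchmark. The activity scores and the `target_rank_percentile` helper are invented for illustration.

```python
def target_rank_percentile(scores: dict[str, float], target: str) -> float:
    """Fraction of non-target kinases scored as less decreased than the target.

    1.0 means the inhibited kinase has the most negative inferred activity
    change; 0.5 is what random ordering gives on average. Ties count as half.
    """
    if target not in scores:
        raise ValueError(f"{target} has no inferred activity score")
    others = [v for k, v in scores.items() if k != target]
    if not others:
        raise ValueError("need at least one non-target kinase")
    t = scores[target]
    wins = sum(1.0 for v in others if v > t) + 0.5 * sum(1 for v in others if v == t)
    return wins / len(others)


# Invented activity changes inferred after treatment with an EGFR inhibitor.
inferred = {"EGFR": -2.1, "MAPK1": -1.0, "AKT1": -0.4, "CDK1": 0.3}
print(target_rank_percentile(inferred, "EGFR"))
```

Averaging this value over many inhibitor experiments gives a single number for comparing inference methods, with the caveat that off-target effects mean the nominal target is not always the only kinase that truly changes.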
Dataset Selection Guide
| Analysis Goal | Recommended Dataset | Why |
|---|---|---|
| Pan-cancer kinase landscape | CPTAC Geffen 2023 | Largest standardized phospho dataset |
| Method benchmarking (kinase inference) | Kinase inhibitor panels | Controlled perturbation = ground truth |
| Motif-based prediction | KinaseLibrary | Most comprehensive motif data |
| Subtype-specific analysis | Krug 2020 (breast) | Best-characterized phospho subtypes |
| Multi-omic integration | Vasaikar 2023 ecosystem | Matched multi-omic layers |
| Training ML models | PhosphoSitePlus + dbPTM | Largest labeled datasets |