State of Benchmarking¶
The Problem¶
Benchmarking kinase-substrate inference and phosphosite analysis tools is fundamentally harder than benchmarking most bioinformatics methods. The ground truth is sparse, biased toward well-studied kinases, and the definition of "correct" depends on the application.
Major Benchmarks¶
benchmarKIN (Briefings in Bioinformatics, 2024)¶
The most comprehensive kinase activity inference benchmark to date. Compares 11 methods across multiple perturbation datasets.
Key findings:
- No single method dominates across all evaluation scenarios
- Enrichment-based methods (KSEAapp, decoupleR/ULM) perform well on known kinase-inhibitor perturbation data
- Motif-based methods (PhosX, KinaseLibrary-based) excel at predicting kinases for individual sites but not at activity scoring
- Network-augmented methods (RoKAI) improve performance when high-quality interaction data is available
- Ensemble approaches combining motif + enrichment consistently outperform individual methods
Evaluation Datasets Used¶
| Dataset | Design | Strengths | Weaknesses |
|---|---|---|---|
| Kinase inhibitor perturbations | Treat cells with specific inhibitor, measure phospho changes | Clear ground truth (inhibited kinase should show reduced activity) | Tests only inhibition, not activation; limited to druggable kinases |
| CPTAC tumor data | Compare kinase activity across cancer subtypes | Clinically relevant; large sample sizes | No controlled perturbation; ground truth is indirect |
| Synthetic benchmarks | Simulate phosphosite changes with known kinase activities | Full control over ground truth | May not capture biological complexity |
What Metrics Matter¶
For Kinase Activity Inference¶
- AUROC for known perturbations — Can the method identify which kinase was inhibited? This is the most direct test but only covers druggable kinases.
- Recall of known kinase-substrate pairs — Does the method recover PhosphoSitePlus annotations? Circular if the method uses these annotations as input.
- Consistency across replicates — Does the method give the same answer with biological replicates?
- Coverage — How many kinases can the method score? Methods limited to well-annotated kinases miss the long tail.
For Phosphosite Identification¶
- Localization accuracy — Phospho(STY) probability scores from MaxQuant/Spectronaut. Sites with probability >0.75 are standard cutoff.
- Quantitative reproducibility — CV across replicates. DIA methods typically show lower CVs than DDA.
- Depth — Number of quantified phosphosites per sample. Current state-of-the-art: 15,000-40,000 sites per sample.
The Circularity Problem¶
Most kinase-substrate inference methods are evaluated against PhosphoSitePlus annotations. Many of these same methods use PhosphoSitePlus as training data or prior knowledge. This creates an evaluation circularity:
- Method is trained/configured using known kinase-substrate pairs from PhosphoSitePlus
- Method is evaluated on its ability to recover kinase-substrate pairs from PhosphoSitePlus
- Performance on truly novel kinase-substrate relationships remains unknown
The benchmarKIN study partially addresses this by using kinase inhibitor perturbation data as an orthogonal evaluation, but even this tests activity inference rather than substrate prediction directly.
Recommendations¶
- Use multiple evaluation frameworks (perturbation data + known substrates + cross-validation)
- Report coverage alongside accuracy — a method scoring 300 kinases at 60% accuracy may be more useful than one scoring 30 kinases at 90%
- Distinguish between activity inference (which kinase is active?) and substrate prediction (which sites does a kinase phosphorylate?) — these are different tasks requiring different benchmarks