What to Measure
Two Different Questions
Phosphoproteomics benchmarking conflates two fundamentally different tasks:
| Task | Question | Ground Truth | Typical Methods |
|---|---|---|---|
| Substrate prediction | Which kinase phosphorylates this site? | Curated kinase-substrate databases | KinaseLibrary, GPS, NetworKIN |
| Activity inference | Which kinases are active in this sample? | Perturbation experiments | KSEAapp, decoupleR, INKA |
A method can excel at one and fail at the other. GPS 6.0 predicts kinase-substrate relationships well but does not score kinase activity from differential phosphoproteomics. Conversely, KSEAapp infers kinase activity effectively but relies entirely on pre-existing substrate annotations, so it cannot propose new kinase-substrate relationships.
Metrics by Task
Substrate Prediction Metrics
- Precision at k — Among the top-k predicted substrates for a kinase, what fraction are true substrates? More informative than AUROC when the positive-to-negative ratio is extreme, which it almost always is: most kinases have fewer than 100 known substrates among ~240K sites.
- Kinase coverage — How many kinases can the method make predictions for? Methods relying on known substrates are limited to well-studied kinases.
- Cross-validated AUROC — Standard, but misleading if train and test sets share the same kinase annotations, because the score then rewards memorized annotations rather than generalization.
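To make the precision-at-k argument concrete, the following is a minimal sketch, not any tool's actual implementation. The function names `precision_at_k` and `fold_over_random` and the site IDs are hypothetical; the base rate uses the figures quoted above (about 100 known substrates among about 240,000 sites).

```python
from typing import Sequence, Set


def precision_at_k(ranked_sites: Sequence[str], true_substrates: Set[str], k: int) -> float:
    """Fraction of the top-k ranked sites that are annotated substrates."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    hits = sum(1 for site in ranked_sites[:k] if site in true_substrates)
    return hits / k


def fold_over_random(precision: float, n_positives: int, n_sites: int) -> float:
    """Precision divided by the base rate expected from a random ranking."""
    base_rate = n_positives / n_sites
    return precision / base_rate


# Hypothetical ranking for one kinase (site IDs are placeholders, not real data).
ranked = ["site_1", "site_2", "site_3", "site_4", "site_5"]
known = {"site_1", "site_4", "site_9"}

p_at_5 = precision_at_k(ranked, known, k=5)  # 2 of the top 5 are known substrates

# Figures from the text: roughly 100 known substrates among roughly 240,000 sites.
base_rate = 100 / 240_000
enrichment = fold_over_random(p_at_5, n_positives=100, n_sites=240_000)
print(f"P@5 = {p_at_5:.2f}, base rate = {base_rate:.5f}, enrichment = {enrichment:.0f}x")
```

Reporting precision alongside the base rate shows how far a ranking is from random, which a single AUROC value tends to hide when negatives outnumber positives by several orders of magnitude.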
Activity Inference Metrics
- Perturbation recovery — After kinase inhibitor treatment, does the inferred activity of the target kinase decrease? The gold standard for activity inference.
- Specificity — Among all scored kinases, is only the target kinase affected, or do many kinases show spurious activity changes?
- Sensitivity to input size — How many differential phosphosites are needed for reliable inference? Methods requiring >500 sites are impractical for small experiments.
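Perturbation recovery and specificity can be checked together from a single table of inferred activity changes. The sketch below assumes a hypothetical input: a mapping from kinase name to an activity-change score (treated minus control, negative meaning decreased activity). The function name, the threshold of 1.0, and the example scores are illustrative assumptions, not the output format of KSEAapp, decoupleR, or INKA.

```python
from typing import Dict


def perturbation_recovery(scores: Dict[str, float], target: str, threshold: float = 1.0) -> dict:
    """Summarize whether the inhibited kinase is recovered and how specific the result is.

    scores: kinase -> inferred activity change (negative = decreased activity).
    target: the kinase hit by the inhibitor.
    threshold: absolute score treated as a real change (an arbitrary assumption here).
    """
    if target not in scores:
        # The method could not score the target at all (a coverage failure).
        return {"scored": False, "recovered": False, "target_rank": None, "n_off_target": None}

    target_score = scores[target]
    # Rank 1 means no other kinase shows a stronger decrease than the target.
    target_rank = 1 + sum(1 for s in scores.values() if s < target_score)
    # Off-target kinases whose inferred activity also changes beyond the threshold.
    off_target = [k for k, s in scores.items() if k != target and abs(s) >= threshold]

    return {
        "scored": True,
        "recovered": target_score <= -threshold,
        "target_rank": target_rank,
        "n_off_target": len(off_target),
    }


# Made-up scores for an mTOR inhibitor experiment, for illustration only.
example_scores = {"MTOR": -2.4, "AKT1": -1.3, "CDK1": 0.2, "MAPK1": 0.5}
result = perturbation_recovery(example_scores, target="MTOR")
print(result)
```

Repeating the same summary on random subsamples of the differential phosphosites would give a rough view of sensitivity to input size, although how to subsample fairly is a design choice this sketch leaves open.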
The Missing Benchmark
No current benchmark adequately tests novel substrate prediction — predicting kinase-substrate relationships not in any training database. This matters because:
- The annotation bottleneck means most real-world phosphosites have no known kinase
- Methods trained on known substrates may simply memorize patterns of well-studied kinases
- The field needs tools that generalize to the ~95% of phosphosites without annotations
A proper novel-substrate benchmark would require:
- Holding out entire kinase families from training
- Testing prediction on those held-out families
- Validating predictions experimentally (not against other databases)
Until this benchmark exists, claims about kinase-substrate prediction accuracy should be interpreted cautiously.
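The family-level hold-out described above could be set up roughly as follows. This is a sketch under stated assumptions: the function name `split_by_family`, the kinase-to-family mapping, and the site IDs are hypothetical, and kinases without a family label are dropped so they cannot leak into either split. It covers only the first two requirements; experimental validation cannot be expressed in code.

```python
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]  # (kinase, phosphosite)


def split_by_family(
    pairs: Iterable[Pair],
    kinase_family: Dict[str, str],
    held_out_families: Set[str],
) -> Tuple[List[Pair], List[Pair]]:
    """Split kinase-substrate pairs so held-out families never appear in training."""
    train: List[Pair] = []
    test: List[Pair] = []
    for kinase, site in pairs:
        family = kinase_family.get(kinase)
        if family is None:
            continue  # Assumption: skip kinases with no family label to avoid leakage.
        if family in held_out_families:
            test.append((kinase, site))
        else:
            train.append((kinase, site))
    return train, test


# Illustrative family labels and placeholder sites, not curated annotations.
kinase_family = {"CDK1": "CDK", "CDK2": "CDK", "MAPK1": "MAPK", "MAPK3": "MAPK", "AKT1": "AKT"}
pairs = [
    ("CDK1", "site_a"), ("CDK2", "site_b"), ("MAPK1", "site_c"),
    ("MAPK3", "site_d"), ("AKT1", "site_e"), ("UNLABELED", "site_f"),
]

train, test = split_by_family(pairs, kinase_family, held_out_families={"CDK"})
train_families = {kinase_family[k] for k, _ in train}
test_families = {kinase_family[k] for k, _ in test}
print(sorted(train_families), sorted(test_families))
```

A site phosphorylated by both a held-out and a training kinase would still appear on both sides of this split, so a stricter benchmark might also need to partition by site.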
Practical Guidance
| Scenario | Recommended Evaluation | Watch Out For |
|---|---|---|
| Choosing a kinase inference tool for CPTAC data | Perturbation recovery + kinase coverage | Methods that only score 30-50 kinases |
| Predicting kinase for a novel phosphosite | Cross-validated substrate prediction + motif quality | Circularity with PhosphoSitePlus |
| Scoring functional importance of a site | funscoR benchmark + experimental validation rate | Bias toward well-studied proteins |
| Comparing tools in a paper | benchmarKIN framework with matched datasets | Cherry-picked evaluation datasets |