A statistical framework to evaluate virtual screening

Wei Zhao, Kirk Hevener, Stephen W. White, Richard E. Lee, James M. Boyett

Research output: Contribution to journal › Article

52 Citations (Scopus)

Abstract

Background: The receiver operating characteristic (ROC) curve is widely used to evaluate virtual screening (VS) studies. However, the method fails to address the "early recognition" problem specific to VS. Although many other metrics that emphasize early recognition, such as RIE, BEDROC, and pROC, have been proposed, there are no rigorous statistical guidelines for determining thresholds or performing significance tests, and no comparisons have been made between these metrics under a statistical framework to better understand their performance. Results: We propose a statistical framework for evaluating VS studies in which the threshold for deciding whether a ranking method is better than random ranking can be derived by bootstrap simulations, and two ranking methods can be compared by a permutation test. We found that different metrics emphasize early recognition to different degrees. BEDROC and RIE are statistically equivalent metrics, and our newly proposed metric, SLR, is superior to pROC. Through extensive simulations, we observed a "seesaw effect": overemphasizing early recognition reduces the statistical power of a metric to detect true early recognition. Conclusion: The statistical framework developed and tested here is applicable to any other metric as well, even if its exact distribution is unknown. Under this framework, a threshold can be easily selected according to a pre-specified type I error rate, and statistical comparisons between two ranking methods become possible. The theoretical null distribution of the SLR metric is available, so its threshold can be determined exactly without resorting to bootstrap simulations, which makes SLR easy to use in practical virtual screening studies.
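The two ingredients of the framework described in the abstract — a null threshold for a metric derived by simulating random rankings, and a permutation test comparing two ranking methods — can be sketched as below. This is a minimal illustration, not the paper's implementation: the RIE formula follows the commonly used Truchon–Bayly definition, and the paired sign-flip permutation scheme is an assumption; the function names are hypothetical.

```python
import numpy as np

def rie(active_ranks, n_total, alpha=20.0):
    """Robust Initial Enhancement for the 1-based ranks of the actives.

    Uses the common Truchon-Bayly definition as an assumption; the
    paper's exact formulation may differ.
    """
    r = np.asarray(active_ranks, dtype=float)
    n_act = r.size
    observed = np.exp(-alpha * r / n_total).sum()
    # Expected value of the same sum under a uniform random ranking.
    expected = (n_act / n_total) * (1.0 - np.exp(-alpha)) / (np.exp(alpha / n_total) - 1.0)
    return observed / expected

def null_threshold(n_act, n_total, metric, level=0.05, n_sim=2000, seed=0):
    """Upper (1 - level) quantile of the metric under random ranking."""
    rng = np.random.default_rng(seed)
    all_ranks = np.arange(1, n_total + 1)
    vals = [metric(rng.choice(all_ranks, size=n_act, replace=False), n_total)
            for _ in range(n_sim)]
    return float(np.quantile(vals, 1.0 - level))

def permutation_test(ranks_a, ranks_b, n_total, metric, n_perm=2000, seed=0):
    """Paired sign-flip permutation test of metric(A) vs. metric(B).

    Assumes both methods ranked the same actives; per-active rank labels
    are swapped under the null of no difference between methods.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(ranks_a), np.asarray(ranks_b)
    observed = abs(metric(a, n_total) - metric(b, n_total))
    hits = 0
    for _ in range(n_perm):
        swap = rng.random(a.size) < 0.5
        pa, pb = np.where(swap, b, a), np.where(swap, a, b)
        if abs(metric(pa, n_total) - metric(pb, n_total)) >= observed:
            hits += 1
    return hits / n_perm
```

For example, with 5 actives among 1000 compounds, a ranking whose RIE exceeds `null_threshold(5, 1000, rie)` would be declared better than random at the pre-specified 5% type I error rate, matching the thresholding step the abstract describes.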

Original language: English (US)
Article number: 225
Journal: BMC Bioinformatics
Volume: 10
DOI: 10.1186/1471-2105-10-225
State: Published - Jul 20 2009

All Science Journal Classification (ASJC) codes

  • Structural Biology
  • Biochemistry
  • Molecular Biology
  • Computer Science Applications
  • Applied Mathematics

Cite this

A statistical framework to evaluate virtual screening. / Zhao, Wei; Hevener, Kirk; White, Stephen W.; Lee, Richard E.; Boyett, James M.

In: BMC Bioinformatics, Vol. 10, 225, 20.07.2009.

@article{93c7306e9a07489c8d60fcaf6d27bf2d,
title = "A statistical framework to evaluate virtual screening",
author = "Wei Zhao and Kirk Hevener and White, {Stephen W.} and Lee, {Richard E.} and Boyett, {James M.}",
year = "2009",
month = "7",
day = "20",
doi = "10.1186/1471-2105-10-225",
language = "English (US)",
volume = "10",
journal = "BMC Bioinformatics",
issn = "1471-2105",
publisher = "BioMed Central",

}

TY - JOUR

T1 - A statistical framework to evaluate virtual screening

AU - Zhao, Wei

AU - Hevener, Kirk

AU - White, Stephen W.

AU - Lee, Richard E.

AU - Boyett, James M.

PY - 2009/7/20

Y1 - 2009/7/20

UR - http://www.scopus.com/inward/record.url?scp=68949158347&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=68949158347&partnerID=8YFLogxK

U2 - 10.1186/1471-2105-10-225

DO - 10.1186/1471-2105-10-225

M3 - Article

C2 - 19619306

AN - SCOPUS:68949158347

VL - 10

JO - BMC Bioinformatics

JF - BMC Bioinformatics

SN - 1471-2105

M1 - 225

ER -