I was very pleased to be asked to teach on the recent EMBL Plant and Pathogen Genomics training course - a four-day introduction to bioinformatics for (largely) experimental plant scientists. The course was (excepting my bits, which others can judge) excellent and, even after 11 years working in the area, I learned new things from every presentation.
In my session (materials at GitHub and slides at SlideShare), I covered some of the material I presented at an earlier EMBO course (and in an older blog post) on the base rate fallacy. When a predictive method for classifying protein function, with nominal performance statistics such as sensitivity and false positive rate, is applied to a large number of samples with a small base rate (or prevalence) of positive samples, the results may contain a surprisingly large proportion of false positives.
These statistics of binary (yes/no) classifiers behave the same way whatever the classifier's subject. The topic is occasionally timely in the UK news, as screening tests for conditions such as Alzheimer's Disease gain publicity, sometimes with excessive or incorrect statistical claims in the media (as discussed in David Colquhoun's blog and the NHS Choices website).
For the presentation at EMBL I had prepared an interactive IPython notebook, which you can see in preview at nbviewer or download at GitHub. It briefly discusses the statistical theory, and provides some example Python code for calculating and visualising the way in which predictive performance varies with the base rate (prevalence) of the positive examples you hope to detect.
- You can get the notebook by following this link.
The notebook in action
The notebook does one simple thing: it plots the probability that any individual positive result from your test is actually a positive case, across all base rates (prevalences) of positive examples in the dataset being tested. Of course, the curve that is plotted depends on the sensitivity and false positive rate (FPR) of your test, so the function that plots this curve takes those values as arguments. If you tell it your base rate too, it will put an arrow on the plot, to show how well your test is expected to perform.
|Notebook plot of variation in classifier performance with base rate, for 50% sensitivity, 50% FPR, indicating the expected probability that any individual positive test is for a positive case, given a 50% base rate of positive examples.|
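The calculation behind such a curve is Bayes' theorem: P(positive case | positive test) = sensitivity × base rate / (sensitivity × base rate + FPR × (1 − base rate)). The following is a minimal sketch of that calculation, not the notebook's own code. The function names `positive_predictive_value` and `ppv_curve` are hypothetical, and the sketch only computes the curve's points rather than plotting them.

```python
def positive_predictive_value(sensitivity, fpr, base_rate):
    """Return P(truly positive | test positive) using Bayes' theorem.

    All three arguments are proportions between 0 and 1.
    """
    true_pos = sensitivity * base_rate      # P(test positive and truly positive)
    false_pos = fpr * (1.0 - base_rate)     # P(test positive and truly negative)
    if true_pos + false_pos == 0:
        return float("nan")                 # the test never reports a positive
    return true_pos / (true_pos + false_pos)


def ppv_curve(sensitivity, fpr, steps=100):
    """Return (base_rate, ppv) pairs for base rates from 0 to 1 inclusive."""
    rates = [i / steps for i in range(steps + 1)]
    return [(r, positive_predictive_value(sensitivity, fpr, r)) for r in rates]


# The case in the figure: 50% sensitivity, 50% FPR, 50% base rate.
curve = ppv_curve(sensitivity=0.5, fpr=0.5)
print(positive_predictive_value(0.5, 0.5, 0.5))  # prints 0.5
```

With sensitivity equal to FPR, a positive result carries no information, so the probability simply equals the base rate.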
If you're using IPython v2 (or greater), then the last cell in the notebook is interactive, and you can use sliders to set the method's sensitivity and FPR, and also the prevalence of positive examples in the dataset being analysed or screened.
For the Arnold et al. (2009) paper used in the older blog post, the test's sensitivity and FPR are 71% and 15% respectively, and we might expect a 3% prevalence of type III effectors in our genome (if we're being generous). Using the sliders appropriately, we can see that the expected probability of any positive T3 effector prediction truly being a T3 effector is only around 0.13.
|Probability of T3 effector prediction indicating a true T3 effector, for the Arnold et al. (2009) predictor, with 3% prevalence of T3 effectors.|
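One way to see where the 0.13 comes from is to count expected predictions directly. The sketch below assumes a genome of 5000 proteins. That figure is hypothetical and chosen only to make the arithmetic concrete, and the final probability does not depend on it.

```python
# Sensitivity, FPR and prevalence are the values quoted in the text for the
# Arnold et al. (2009) predictor; the genome size is an illustrative assumption.
n_proteins = 5000      # hypothetical genome size
prevalence = 0.03      # proportion of proteins that are T3 effectors
sensitivity = 0.71     # P(predicted effector | true effector)
fpr = 0.15             # P(predicted effector | not an effector)

n_effectors = n_proteins * prevalence        # about 150 true effectors
n_non_effectors = n_proteins - n_effectors   # about 4850 other proteins

expected_tp = sensitivity * n_effectors      # effectors correctly predicted
expected_fp = fpr * n_non_effectors          # non-effectors wrongly predicted

# Proportion of all positive predictions that are true effectors.
ppv = expected_tp / (expected_tp + expected_fp)

print(f"expected true positives:  {expected_tp:.1f}")
print(f"expected false positives: {expected_fp:.1f}")
print(f"P(effector | positive prediction) = {ppv:.2f}")
```

Under these assumptions, false positives outnumber true positives by roughly seven to one, even though the false positive rate itself looks modest.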
The recently publicised test for Alzheimer's Disease noted at NHS Choices has an indicated sensitivity of 85% and FPR of 12%, and the indicated prevalence of Alzheimer's Disease in the likely test population is 10-15%. With these values, we can use the notebook to confirm that site's conclusion: the probability that a positive test indicates progression to Alzheimer's is expected to be between 0.44 and 0.56.
|Probability that the AD test indicates progression to AD, with 10% prevalence|
|Probability that the AD test indicates progression to AD, with 15% prevalence|
It shouldn't be a surprise that the base rate is critical to interpretation of the usefulness of any such test. As David Colquhoun notes in his blog:
The author [of the primary paper] makes it clear that the results are not intended to be a screening test for Alzheimer’s. It’s obvious from what’s been said that it would be a lousy test. Rather, the paper was intended to identify patients who would eventually (well, within only 18 months) get dementia. The denominator (always the key to statistical problems) in this case is the highly atypical patients that who come to memory clinics in trials centres (the potential trials population). The prevalence in this very restricted population may indeed be higher that the 10 percent that I used above.
In this quote, and the discussion in the comments, it is noted that the appropriate base rate/prevalence to know is the proportion of the tested population who are expected to develop Alzheimer's in the timescale considered. If this was as high as 60% of the vulnerable population who were being tested (stratified by already showing mild cognitive impairment [MCI]), we would expect a very promising 0.91 probability that a positive test would indicate progression to Alzheimer's within the projected timescale.
|Probability that the AD test indicates progression to AD, with 60% prevalence|
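The dependence on prevalence can also be written in odds form, which some readers may find easier to reason about: posterior odds = prior odds × positive likelihood ratio, where the likelihood ratio is sensitivity / FPR. The sketch below is an illustration under that formulation, not the notebook's code. The function name `ppv_from_odds` is hypothetical, and it assumes FPR > 0 and prevalence < 1.

```python
def ppv_from_odds(sensitivity, fpr, prevalence):
    """Return P(positive case | positive test) via the odds form of Bayes' theorem.

    Assumes fpr > 0 and prevalence < 1, so that neither division is by zero.
    """
    lr_positive = sensitivity / fpr                  # positive likelihood ratio
    prior_odds = prevalence / (1.0 - prevalence)     # odds before testing
    posterior_odds = prior_odds * lr_positive        # odds after a positive test
    return posterior_odds / (1.0 + posterior_odds)   # convert odds to probability


# Sensitivity 85% and FPR 12% as quoted for the AD test, at the three
# prevalences discussed in the text.
for prevalence in (0.10, 0.15, 0.60):
    ppv = ppv_from_odds(0.85, 0.12, prevalence)
    print(f"prevalence {prevalence:.0%}: P(progression | positive test) = {ppv:.2f}")
```

The likelihood ratio (about 7) is a fixed property of the test, so the probability after a positive result is driven entirely by the prior odds: the same test gives roughly 0.44, 0.56 and 0.91 at 10%, 15% and 60% prevalence.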
NHS Choices estimate the actual base rate of MCI progression to Alzheimer's to be around 10-15%, but the discussion on DC's blog points to a paper that suggests the rate is higher. I'm not an expert in that field, and there are many considerations regarding the sampling and biology, so I'm not in a position to judge between the two.