Why do experienced researchers make this mistake?
Simply put, p-values are counterintuitive to interpret, and they suffer from being described too simply! It is very easy to get away with telling a non-researcher that a study needs a p-value below 0.05 to demonstrate significance or success.
I posit that this simple number, the pass-or-fail target, is what has allowed the p-value to become the universal yardstick in research. But to a researcher the number implies a lot more. Reaching it with a realistic chance of passing peer review and being published requires a very thorough experimental process, whatever the domain. To say that science is difficult is an understatement.
Each topic below gives a reason why calculating a p-value, or inferring results from one, is difficult.
THE EXPERIMENTAL PROCESS
This wonderful summary from data scientist Ichan Michaeli is the best description I have come across in my last six months of reading and writing on the subject. It also offers a very strong candidate for explaining why a p-value’s meaning is so widely misinterpreted.
The experimental process includes hypothesis testing, which “tests not for the ‘optimistic’ case in which our alternative variation B is really better than the baseline.” Instead, it checks “that the newly introduced variation B is not any better than the existing baseline A, and that the observed differences represent no more than random noise. We then try to reject this hypothesis by calculating how rare our empirical findings are if the above Null Hypothesis is correct. The p-value represents that probability.” (Link)
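To make the quoted idea concrete, here is a minimal sketch of one common way to compute such a p-value for an A/B comparison: a pooled two-proportion z-test. The conversion counts are made up for illustration, the function name is my own, and the normal approximation is an assumption that only holds for reasonably large samples.

```python
import math

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """One-sided p-value for the null hypothesis 'B is no better than A'.

    Uses a pooled two-proportion z-test (normal approximation).
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Under the null, A and B share one underlying conversion rate.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / std_err
    # Probability of a z-score at least this large if the null is true.
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical data: 100/1000 conversions for A, 120/1000 for B.
p = two_proportion_p_value(conv_a=100, n_a=1000, conv_b=120, n_b=1000)
print(f"p-value: {p:.3f}")
```

With these invented numbers B converts two percentage points better than A, yet the p-value stays above 0.05. Note what the number is: how rare a gap this size would be if B were really no better, not the probability that B is better.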
To give an insight into the difficulty of statistics, here is a summary ‘cheat sheet’ from UCLA (link) that demonstrates just how many options face researchers during their experimentation. Without statistical qualifications I can’t detail which are directly related to p-values, but there are clearly different methods suited to different data. So this is our first potential reason for misinterpretation: statistics is difficult.
In 2016, the American Statistical Association published a statement (link) on p-values, setting out the following principles in an attempt to maintain the standards of the scientific process. They are:
- Principle 1: P-values can indicate how incompatible the data are with a specified statistical model.
- Principle 2: P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
- Principle 3: Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
- Principle 4: Proper inference requires full reporting and transparency.
- Principle 5: A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
- Principle 6: By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
A phrase that came up often in my research is “the garden of forking paths”, which well-known statistician Andrew Gelman used to describe the numerous choices facing the researcher. At each fork a researcher has to make a decision. How much is he or she affected by all the various biases at play? His lengthy paper linked above provides some wonderful practical examples and concludes that the way forward depends heavily on starting with quality data. When talking about his opinions he states (link): “People say p-hacking and it sounds like someone’s cheating. The flip side is that people know they didn’t cheat, so they don’t think they did anything wrong. But even if you don’t cheat, it’s still a moral error to misanalyze data on a problem of consequence.”
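A deliberately simplified way to see why the number of available analysis choices matters (this is my own illustration, not Gelman’s argument, and it assumes the analyses are independent and that there is no real effect at all): the chance that at least one of k analyses crosses the 0.05 threshold by luck alone grows quickly with k.

```python
def chance_of_false_alarm(num_analyses, alpha=0.05):
    """Probability that at least one of several independent analyses
    of pure noise comes out 'significant' at the given threshold."""
    return 1 - (1 - alpha) ** num_analyses

for k in (1, 5, 20):
    print(f"{k:>2} analyses -> {chance_of_false_alarm(k):.1%} chance of a false alarm")
```

Real forks in an analysis are rarely independent, so treat the exact figures as a rough guide rather than a measurement.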
BASE RATE FALLACY
One common mistake in research and statistics is failing to account properly for false positives. Judging how much a significant result really tells you demands accounting for the false positive rate and for how rare the true effect is to begin with. This blog provides a wonderful example:
“If mammograms have a false positive rate of 5% and a 90% chance of accurately identifying cancer then if you test 1000 people and 50 of them test positive then it is still quite unlikely that most of those people have cancer. Only 10 people in that sample have cancer and we expect 9 of them to be accurately identified but more than 50 will test positive!”
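The arithmetic behind that quote can be checked in a few lines. This is a minimal sketch using the figures from the quote; the 1% prevalence is inferred from “only 10 people in that sample have cancer”, and the function name is my own.

```python
def screening_outcomes(population, prevalence, sensitivity, false_positive_rate):
    """Expected true positives, false positives, and the share of
    positive tests that are real (positive predictive value)."""
    sick = population * prevalence
    healthy = population - sick
    true_positives = sick * sensitivity
    false_positives = healthy * false_positive_rate
    ppv = true_positives / (true_positives + false_positives)
    return true_positives, false_positives, ppv

tp, fp, ppv = screening_outcomes(
    population=1000, prevalence=0.01, sensitivity=0.90, false_positive_rate=0.05
)
print(f"true positives: {tp:.1f}, false positives: {fp:.1f}")
print(f"share of positive tests that are real: {ppv:.1%}")
```

Roughly 9 real cases sit among about 58 positive tests, so a positive result means cancer only around 15% of the time, despite the test’s 90% accuracy on actual cases.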
The author of that blog goes on to link this to ‘the Law of Large Numbers’ and to how a large enough sample size can all but ensure you hit your p-value threshold, even if it is set more strictly than the typical significance level of 0.05.
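A rough sketch of the sample size point, under assumptions of my own: two groups whose averages differ by a tiny, practically meaningless amount (0.01 standard deviations), a known standard deviation, and an observed difference that lands exactly on the true one. The p-value from a simple two-sided z-test then depends almost entirely on how many people you measure.

```python
import math

def two_sided_p(effect, sigma, n_per_group):
    """Two-sided z-test p-value for a difference in means between two
    equally sized groups with a known standard deviation."""
    std_err = sigma * math.sqrt(2 / n_per_group)
    z = effect / std_err
    return math.erfc(abs(z) / math.sqrt(2))

# Same tiny effect every time; only the sample size changes.
for n in (100, 10_000, 100_000, 1_000_000):
    print(f"n per group = {n:>9,}: p = {two_sided_p(0.01, 1.0, n):.3g}")
```

The effect never changes, yet the result moves from nowhere near significant to overwhelmingly significant. This is also why a p-value says nothing about the size or importance of an effect.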
So many of the headline-grabbing stories that cite ‘research’ on a very dubious hypothesis are due to an overly simplistic approach to the variables at play. This blog’s inspiration, NN Taleb, an expert in complexity theory, cites many examples in his books where being proved right comes not from a precise prediction but from an un-prediction. To fully prove the influence of the variables in an experiment you need to be able to test them independently. You need to actively search for a hidden, unaccounted-for variable. In other words, the best quality research appears to involve spending most of your time researching in the opposite direction to your intended positive outcome!
To finish, I will unashamedly re-use this quote from an earlier linked article titled ‘Science isn’t broken’, as I feel it sums up this blog so well:
“The scientific method is the most rigorous path to knowledge, but it’s also messy and tough. Science deserves respect exactly because it is difficult — not because it gets everything correct on the first try. (link)”