Thus far we have discussed the principle, the associated problems, the code and the statistics behind the p-value, but we have yet to discuss solutions. It is important to remember that although the p-value, as the tool which enables it, is a major part of the replication crisis, it isn’t 100% responsible.
I will admit, and indeed highlight, my status as an undergraduate and a new entrant to the world of data science. This means I have little experience with which to truly evaluate the solutions and ideas below, or to judge which of them may create more ‘replicative’ science.
Inclusion below was warranted by convincing prose, powerful evidence or persuasive recommendations from well known figures. We can imagine these recommendations acting like an extensive series of pre-flight safety checks: a combination of protective layers designed to make something foolproof.
P-hacking falls into two very broad categories: 1) the unconsciously biased, naive, or poorly skilled practitioner; 2) the conscious hacker, desperate to be published, using different resources to hit the magic number of significance. Solutions must attempt to address one or both of these. None of these proposals solves the problem in its entirety, but a combination of them appears to have a real chance.
(The following suggestions are in no particular order)
SOLUTION 1: REWARDS TO PREVENT PUBLICATION BIAS
There are so many potential reasons for a case of p-hacking to occur that it is very difficult to isolate the root cause. (Here is a mathematical model of the bias, p4 link). But if we go just by the frequency of mentions, the term ‘publication bias’ wins when discussing p-values. We can define this type of bias as the tendency to publish only positive or dramatic findings. Studies that don’t find anything significant or newsworthy tend not to be published nearly as much, despite their potential future value.
Evidence of publication bias: the change in expected published studies as the p-value fails to reach the magic 0.05 (from this paper: The Extent and Consequences of P-Hacking in Science)
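To make the effect of publication bias concrete, here is a minimal simulation sketch. It is my own illustration, not the model from the linked paper: it assumes a true effect of exactly zero, a known standard deviation of 1 (so a simple z-test applies), and a hypothetical journal that publishes only results with p < 0.05. The function names and parameters are invented for this example.

```python
import math
import random


def two_sided_p_from_z(z: float) -> float:
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate_studies(num_studies=2000, sample_size=30, true_effect=0.0, seed=0):
    """Run many identical studies and return (observed mean, p-value) pairs.

    Assumption: data are normal with a known standard deviation of 1.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(num_studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(sample_size)]
        mean = sum(sample) / sample_size
        z = mean * math.sqrt(sample_size)  # z statistic against a null of zero
        results.append((mean, two_sided_p_from_z(z)))
    return results


results = simulate_studies()

# Hypothetical journal policy: only 'significant' studies get published.
published = [mean for mean, p in results if p < 0.05]

avg_abs_all = sum(abs(mean) for mean, _ in results) / len(results)
avg_abs_published = sum(abs(mean) for mean in published) / len(published)

print(f"share of studies published:        {len(published) / len(results):.1%}")
print(f"average |effect|, all studies:     {avg_abs_all:.3f}")
print(f"average |effect|, published only:  {avg_abs_published:.3f}")
```

Even though the true effect is zero, the published subset reports a noticeably larger average effect than the full set of studies, because only the luckiest samples clear the threshold.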
In an exponential age of communication and limitless information there should be no ceiling on what can be known through a research process. Of course the committees, groups and traditions of the peer review and publishing process have a purpose, but there is significant evidence of the process’s failure. We would want to safeguard against poor research being widely available or cited. So rather than simply suggesting a transparent process from beginning to end, could we not separate out and encapsulate the stages of the process more? More encapsulation would mean more opportunities to appraise and reward a study’s progress.
‘The institutions of science need to get better at rewarding failure’ is a proposal (link) backed by John Ioannidis, the author who published the meta-analysis that helped define the replication crisis. He states that “failures, on average, are more valuable than positive studies,” which is also a pillar of NN Taleb’s ‘via negativa’ philosophy. The article raises another common theme: “young scientists need publications to get jobs.” How can we change the metrics to reward quality research?
SOLUTION 2: INTEGRITY
Well known mathematician Hannah Fry has proposed (article link) that integrity in the tech industry’s research process could be improved by requiring practitioners to go on record as taking a ‘Hippocratic oath’. This is a specific angle, with a focus on ethics rather than the p-value, but perhaps something similarly substantial and public could exist for research publishing in general?
SOLUTION 3: AUTOMATION
Could an open source software tool be developed that becomes a necessary step before reaching the peer review stage? I imagine it could be built piece by piece over time, first targeting the analysis of raw data sets, for example. It would need a level of automation built in to decipher the context and relationships of multiple variables. Could it simply match patterns of successful studies in particular domains? Could it search for signs of bias by looking at, or requiring, the full p-value distribution (see Taleb’s paper)? Could it run the maths?
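As a rough idea of what one small piece of such a tool might look like, here is a sketch of a check for p-values ‘bunching’ just under 0.05. It is a hypothetical illustration, not Taleb’s method and not an existing tool: the function names and bin edges are my own assumptions. The idea is that, without hacking, p-values just below 0.05 should be no more common than those slightly further below it, so a clear excess in the upper bin is a warning sign.

```python
from math import comb


def binomial_tail(successes: int, trials: int, prob: float = 0.5) -> float:
    """P(X >= successes) for X ~ Binomial(trials, prob)."""
    return sum(
        comb(trials, k) * prob**k * (1 - prob) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def bunching_check(p_values, lower=0.04, mid=0.045, upper=0.05):
    """One-sided binomial test for an excess of p-values in (mid, upper).

    Compares the count in (mid, upper) against the count in (lower, mid].
    Returns the probability of seeing at least this many in the upper bin
    if both bins were equally likely, or None if both bins are empty.
    """
    near_threshold = sum(1 for p in p_values if mid < p < upper)
    further_below = sum(1 for p in p_values if lower < p <= mid)
    total = near_threshold + further_below
    if total == 0:
        return None
    return binomial_tail(near_threshold, total)


# Made-up p-values for illustration only, not real study data.
reported = [0.012, 0.046, 0.047, 0.048, 0.049, 0.041, 0.2]
print(bunching_check(reported))
```

A small returned value would suggest more results sitting just under the threshold than chance alone explains. With only a handful of p-values, as in this toy input, the check has very little power, so a real tool would need a whole literature’s worth of results.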
SOLUTION 4: PRE-REGISTRATION
Making pre-registration a standard element of research could help prevent the situation described in this article:
“You really believe your hypothesis and you get the data and there’s ambiguity about how to analyze it. When the first analysis you try doesn’t spit out the result you want, you keep trying until you find one that does.”
Defining methodology, goals and expectations upfront could make hacking more difficult, especially for those without devious intentions. The details would include defining the statistical parameters, how much data is required and what it is expected to look like, particularly with regard to outliers.
This paper goes on to define pre-registration guidelines with the following recommendations:
- Authors must list all variables collected in a study.
- Authors must report all experimental conditions, including failed manipulations.
- If observations are eliminated, authors must also report what the statistical results are if those observations are included.
- If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.
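To see why fixing the analysis upfront matters, consider the ‘keep trying until you find one that does’ scenario from the quote above. The sketch below works out the chance of at least one false positive when several analyses are tried on data with no real effect. It assumes the analyses are independent, which is a simplification of mine: analyses run on the same data set are usually correlated, so treat the numbers as an illustration rather than an exact rate.

```python
def chance_of_false_positive(num_analyses: int, alpha: float = 0.05) -> float:
    """Probability that at least one of `num_analyses` independent tests
    comes out 'significant' at level `alpha` when there is no real effect."""
    if num_analyses < 0:
        raise ValueError("num_analyses must be non-negative")
    return 1.0 - (1.0 - alpha) ** num_analyses


for k in (1, 5, 10, 20):
    print(f"{k:>2} analyses tried: {chance_of_false_positive(k):.1%}")
```

Under these assumptions, trying twenty different analyses gives roughly a 64% chance of finding something ‘significant’ in pure noise, which is why committing to one analysis in advance is such a strong safeguard.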
SOLUTION 5: A NEW INTERPRETATION
In this 2019 article in the well respected science journal ‘Nature’, over 800 scientists signed a petition stating, “we are calling for a stop to the use of P values in the conventional, dichotomous way — to decide whether a result refutes or supports a scientific hypothesis”, essentially an abandonment of the term “statistical significance”. The three authors continue: “We must learn to embrace uncertainty. One practical way to do so is to rename confidence intervals as ‘compatibility intervals’ and interpret them in a way that avoids overconfidence.”
From my perspective this approach to solving p-hacking rests on two key points. Firstly, it relies on the power of language, and secondly, it attempts to remind researchers of the exacting nature of the scientific method. This is admirable, but having looked at other solutions, I feel it is not a blanket solution and can easily be bypassed by the determined hacker.
More interestingly, this paper with 72 authors proposed enhancing reproducibility by changing the p-value threshold for statistical significance from 0.05 to 0.005, with the potential effect shown in their graph below.
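A back-of-the-envelope calculation helps show why a stricter threshold could matter. The sketch below is my own illustration rather than a reproduction of the paper’s graph. It assumes that 1 in 11 tested hypotheses is actually true (prior odds of 1:10) and that studies have 80% power at both thresholds. The second assumption is generous, since keeping the same power at 0.005 would in practice need larger samples.

```python
def false_positive_risk(alpha: float, power: float, prior_probability: float) -> float:
    """Share of 'significant' results that are false positives.

    alpha: significance threshold (chance a null effect tests significant)
    power: chance a real effect tests significant
    prior_probability: assumed share of tested hypotheses that are true
    """
    if not 0 < prior_probability < 1:
        raise ValueError("prior_probability must be strictly between 0 and 1")
    false_positives = alpha * (1 - prior_probability)
    true_positives = power * prior_probability
    return false_positives / (false_positives + true_positives)


assumed_prior = 1 / 11  # prior odds of 1:10, an assumption for illustration
assumed_power = 0.8     # held fixed for both thresholds, also an assumption

for alpha in (0.05, 0.005):
    risk = false_positive_risk(alpha, assumed_power, assumed_prior)
    print(f"threshold {alpha}: {risk:.1%} of significant results are false")
```

Under these assumptions, roughly 38% of ‘significant’ findings at 0.05 would be false, against roughly 6% at 0.005. Different priors or power levels change the numbers, but not the direction of the effect.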
While reading up on this suggested solution, which isn’t seen as the ultimate long term option yet is easily applicable in the short term, I learned (link) that “a lot of cutting-edge genetics research uses p-values of less than .00000005, and astrophysics requires, well, astronomical levels of significance.” Could we perhaps see each area of science develop its own p-value standard based on historical data?
Ronald Fisher, grandfather of the p-value (see my earlier article), “never intended it to be the final word on scientific evidence”. To him it “meant the hypothesis is worthy of a follow-up investigation.” (quote source). What comes after? Is this a new (or founding) interpretation, in which the p-value is demoted to a mere indicator?
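One reason a field can end up with a threshold as small as the genetics figure quoted above is the sheer number of tests it runs. As a minimal sketch, assuming a Bonferroni-style correction and a round figure of one million tests (both are my assumptions for illustration, not taken from the quoted article), the usual 0.05 is simply divided by the number of tests:

```python
def bonferroni_threshold(family_alpha: float, num_tests: int) -> float:
    """Per-test threshold that keeps the overall false positive rate
    at or below `family_alpha` across `num_tests` tests."""
    if num_tests < 1:
        raise ValueError("num_tests must be at least 1")
    return family_alpha / num_tests


# Assumed round number of tests, chosen only for illustration.
print(bonferroni_threshold(0.05, 1_000_000))
```

This lands on the same order of magnitude as the .00000005 in the quote, which suggests that a field-specific standard can follow from how many comparisons that field typically makes.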
SOLUTION 6: RETRACTIONS
The website https://retractionwatch.com/ has gained fame and notoriety in the world of science for attempting to publicize the studies that had to be retracted. But this is more than an armchair blog:
“The mission of the Center for Scientific Integrity, the parent organization of Retraction Watch, is to promote transparency and integrity in science and scientific publishing, and to disseminate best practices and increase efficiency in science.”
With such ambition, it will be interesting to see how this develops. Is the negative publicity enough of a deterrent in case you get ‘found out’? With significant funding and over 125,000 unique views a month (source) it looks like a very promising area for preventing malpractice. (They even have a leaderboard of retractions!)
Currently science suffers from an element of Goodhart’s Law with p-values: “When a measure becomes a target, it ceases to be a good measure.” Hopefully the suggested measures make their way into the scientific process of the future. A probable outcome is that the p-value’s importance will decrease and it will simply be seen as one tool of many for evaluating an experiment. What do you think?
Another angle not discussed in this post is patience. What if peer review becomes part of a larger ‘open review’? Simply through the continuing advancement in the quantity and quality of online communication, will the science community be ‘ready and waiting’ with ever better tools (ML powered apps, for example) to validate a paper? We can hope that over time the ‘predatory publishers’ get shut down and true scientific progress doesn’t rely on a retweet for exposure!