“When the concept of P-values was developed in the 1920s and 1930s … the intention was to use them to guide decision making, not make the decision, which is what so many researchers do today,” says Alan Hackshaw, an epidemiologist and medical statistician at University College London.
Pharmacists are familiar with the idea that a P-value less than or equal to 0.05, or any similarly arbitrary cut-off, means results are statistically significant; it is a key statistic for interpreting clinical literature. At its most consequential, the P-value plays a role in shaping current clinical practice and even in approving drugs for licensed use. However, recent debate is causing scientists and clinicians to question the prominence given to P-values, and whether they have a place in clinical interpretation at all.
Decisive action was first taken in 2015, when David Trafimow, editor of Basic and Applied Social Psychology, banned P-values from the journal, citing the invalidity of null hypothesis significance testing (NHST), of which P-values are an integral part. The NHST procedure is the traditional scientific process of setting a null hypothesis that proposes the tested intervention has no effect, and using the P-value as a measure of how consistent the results are with that null hypothesis.
Such direct action against a fundamental part of scientific research galvanised the statistical profession to consider its stance on this issue. In 2016, experts from the American Statistical Association published a statement in The American Statistician, consolidating the main problems of P-values. It did not support widespread dismissal of P-values, but instead provided recommendations for their correct interpretation, proposing that the issue was rooted in misunderstanding. The P-value’s history shows how this confusion arose.
“P-values have been around since the 1700s, but really became popular in the 1920s, and there’s a statistician called Ronald Fisher, who really popularised their use,” explains Janet Peacock, an epidemiologist and biomedical data scientist at Dartmouth College, New Hampshire, United States, and a medical statistician at King’s College London. Fisher’s book, Statistical Methods for Research Workers, published in 1925, disseminated the significance test across the scientific community; he proposed a guide value of 0.05 or below for researchers to have reasonable confidence that their results disagreed with the null hypothesis.
But how P-values are used today has gradually morphed further and further away from Fisher’s original intention. “Fisher was very much of the view that P-values were probabilities, used as an additional tool for analysing data,” says Peacock. “Then [fellow statisticians] Neyman and Pearson arrived about ten years later and built on his work, but they introduced the idea that P-values could be used to make decisions through a cut-off value. While that’s useful, that’s where the potential difficulties start to creep in.”
This issue, referred to as the dichotomisation of the P-value, is a core argument against statistical significance — using P≤0.05 as a cut-off to sift clinical research into binary categories, with statistically significant results considered meaningful and the rest discarded. Using P-values in this manner influences what is published in medical journals, and even what enters clinical practice.
“What people are concerned about is that you’ve got something that might be effective and might be useful, but it’s thrown away because the P-value is not quite small enough, so 0.06,” explains Hackshaw. “Or you’ve got things that are put into practice because, yes, the P-value might be 0.048, but the treatment effect isn’t that strong at all.” He adds: “I can’t imagine anything much that’s gotten through the European Medicines Agency, the Medicines and Healthcare products Regulatory Agency or the US Food and Drug Administration in recent years that’s got a P-value above 0.05. If it’s got something like 0.06 it might have scraped through, but if it’s 0.08, I doubt it.”
Using P-values like this loses sight of the fact that they provide a measure of the strength of the evidence against the null hypothesis, not the strength of the treatment itself. P-values therefore do not indicate whether the treatment works.
This confusion comes from mixing up the factors of a clinical trial that P-values are derived from. “The P-value is influenced by two things separately, almost like a weighing scale. So, on one side is the size of the treatment effect, and the bigger the treatment effect the smaller the P-value becomes,” explains Hackshaw. “On the other side of the weighing scale is the size of the study, and the bigger that is, that makes the P-value smaller. So, they can both work in tandem or they can work differently.”
This means that a study can have a large sample size that lowers the P-value to statistically significant levels, while the treatment effect remains clinically negligible. Choosing to repeat a test to increase sample size, and therefore artificially reduce the P-value, is just one example of how P-values can be manipulated — part of the popularly termed ‘P-hacking’ toolkit.
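Hackshaw's weighing scale can be illustrated with a rough sketch. It assumes a simple two-arm comparison of means with a known standard deviation and a normal approximation; the function name and every number are hypothetical, chosen only for illustration and not taken from any trial.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p(mean_diff, sd, n_per_arm):
    """Two-sided P-value for a difference in means between two equal-sized arms.

    Assumes a z-test with a known, common standard deviation (a simplification).
    """
    se = sd * sqrt(2 / n_per_arm)   # standard error of the difference
    z = mean_diff / se              # bigger effect or bigger study -> bigger z
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Same small, hypothetical treatment effect (1 unit, SD 10), growing study size.
for n in (50, 500, 5000):
    print(f"n per arm = {n:>4}: P = {two_sided_p(1.0, 10.0, n):.4f}")

# Same study size, growing hypothetical treatment effect.
for effect in (1.0, 3.0, 5.0):
    print(f"effect = {effect}: P = {two_sided_p(effect, 10.0, 50):.4f}")
```

In this sketch the treatment effect in the first loop never changes, yet the P-value falls below 0.05 once the study is large enough, which is why a small P-value alone says little about clinical importance.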
P-hacking is an umbrella term for actions that tweak the P-value towards statistical significance, intentionally or not. P-hacking techniques often take advantage of the two factors that make up the P-value; for instance, in addition to altering the sample size, P-values can be manipulated by emphasising treatment effect size, such as by removing outlying results. One of the more insidious forms of P-hacking is multiplicity: because a P-value is a frequency-based probability, if enough P-values are calculated, one is likely to be statistically significant by chance alone. With a cut-off of 0.05, the risk of at least one false positive reaches about 40% when just 10 independent P-values are calculated.
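The 40% figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical. The Bonferroni adjustment shown at the end is one common correction for multiplicity, used here as an illustration rather than as the method any particular journal requires.

```python
def familywise_false_positive_risk(n_tests, alpha=0.05):
    """Chance of at least one 'significant' result when no real effect exists.

    Assumes independent tests, each judged against the same cut-off alpha.
    """
    return 1 - (1 - alpha) ** n_tests


for n in (1, 5, 10, 15):
    risk = familywise_false_positive_risk(n)
    print(f"{n:>2} P-values: {risk:.0%} risk of at least one false positive")

# Bonferroni correction: divide the cut-off by the number of tests.
n_tests = 10
corrected_alpha = 0.05 / n_tests
print(f"Bonferroni cut-off for {n_tests} tests: {corrected_alpha}")
print(f"Risk after correction: {familywise_false_positive_risk(n_tests, corrected_alpha):.1%}")
```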
Medical journals screen for this type of abuse, and often insist statistical corrections for multiplicity are applied. Howard Bauchner, editor-in-chief of JAMA, details a classic example of multiplicity P-hacking. “Usually, the problems are with multiple secondary outcomes, so an author will have 1 primary outcome and then they’ll have 15 secondary outcomes, and they won’t control for multiple comparisons. So then, they’ll choose 2 or 3 to present, and you won’t know about the other 10 or 12.” Here, the issue of multiplicity is compounded by the selective reporting of only statistically significant analyses; another variation of P-hacking that journals seek to control. “So, what we do is, we may say in the abstract, 2 of 10 secondary outcomes were significant. That makes it clear to the reader that 8 weren’t significant, and they need to think about those as being more hypothesis-generating than necessarily definitive.”
Some of the major medical journals, such as JAMA and The New England Journal of Medicine, reviewed their guidance on reporting statistical significance in 2019. Prompted by the recent debate among statisticians, they now recommend ensuring P-values are provided in context with the treatment effect and confidence intervals (CIs).
The CI is the range of values within which the ‘true’ treatment effect can reasonably be expected to lie; a narrow CI indicates a smaller range of values that the ‘true’ effect could take. Bringing the treatment effect, whether a risk ratio or a difference of means, and its CI to the forefront gives clinicians room to properly consider edge cases that may otherwise have been dismissed, such as a treatment that reduces the risk of dying by 50% but has a P-value of 0.055.
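To see what such an edge case might look like, the sketch below back-calculates an approximate 95% CI from a risk ratio of 0.5 and a P-value of 0.055. It assumes the P-value came from a two-sided test using a normal approximation on the log risk ratio scale, which may not match how a real trial was analysed; the function name is hypothetical.

```python
from math import exp, log
from statistics import NormalDist


def ci_from_ratio_and_p(ratio, p_value, level=0.95):
    """Approximate CI for a ratio measure, recovered from its two-sided P-value.

    Assumes a normal approximation on the log scale (an illustrative shortcut).
    """
    norm = NormalDist()
    z_observed = norm.inv_cdf(1 - p_value / 2)    # z statistic implied by P
    se = abs(log(ratio)) / z_observed             # implied standard error
    z_critical = norm.inv_cdf(1 - (1 - level) / 2)
    lower = exp(log(ratio) - z_critical * se)
    upper = exp(log(ratio) + z_critical * se)
    return lower, upper


lower, upper = ci_from_ratio_and_p(0.5, 0.055)
print(f"Risk ratio 0.50, 95% CI {lower:.2f} to {upper:.2f}")
```

Under these assumptions the interval runs from roughly a three-quarters reduction in risk to a very slight increase, so most of the values compatible with the data favour the treatment even though the result narrowly misses the 0.05 cut-off.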
Paul Young, a critical care expert and deputy director of the Medical Research Institute of New Zealand, also speaks out against using arbitrary ‘cut-offs’ in those edge cases in a recent article in JAMA. Young recommends considering the wider context of the decision and altering the burden of proof levels to match the clinical situation. “I think in some ways you can use the legal analogy, right, so it’s like a difference between a criminal proceeding and a civil proceeding, where a criminal proceeding requires proof beyond reasonable doubt and a civil proceeding requires proof on the balance of probability.”
Young points to how comparative studies that are cost-neutral between two treatments should be held to a lower burden of proof than those for introducing new, complex or expensive treatments; after all, “there’s a fundamental difference between looking at things where people are already doing one thing, and looking at therapies that are novel or completely off the wall”. Young extends this same logic to de-adoption of invasive procedures, an area where the evidence base is rarely unilateral, and therefore where weighing up the benefits to patients from reducing medical intervention is particularly tricky.
In all edge-case scenarios, Young advocates the importance of decoupling decisions from the P-value ‘cut-off’ and considering the treatment effect, CIs, overall context of the evidence base and nature of the decision at hand.
So, the current advice points to the treatment effect and CI, while giving the P-value its appropriate — reduced — emphasis. However, there remain additional questions around whether this goes far enough.
In 2019, zoologist and science writer Valentin Amrhein found himself lead authoring a comment piece for Nature that proposed scrapping the phrase ‘statistical significance’ as a way to tackle dichotomisation of the P-value. Written together with statistical experts Sander Greenland and Blake McShane, and gathering 800 signatures from further statistical and scientific experts, they also suggested renaming the CI, ditching ‘confidence’ in favour of ‘compatibility’. This was driven by recent debate in the statistical profession that considered the name ‘confidence intervals’ to falsely imply that all error was accounted for within the interval when, actually, additional uncertainty could be introduced by factors such as poor study design or bias.
“The best way to really look at the results is to derive the compatibility interval, saying the values from the lower to upper bound are most compatible with the data. And the most compatible value is the measured point estimate (treatment effect), which does not mean this is the true value, but this is just what we observed in our study,” explains Amrhein. The hope is that thinking in terms of compatibility will prevent the current overreliance on the P-value from simply being transferred to overconfidence in a newly prominent CI.
Other statistical approaches
Going further yet, some argue that P-values and NHST should be set aside in favour of other statistical approaches, the most popular alternative being Bayesian statistics. P-values are a ‘frequentist’ statistic that, as Fisher first proposed, views probability as a frequency, as a chance of an event occurring over time. Bayesian methods instead look at probability as a confidence in the chances of an event occurring, formed from the clinician’s assessment of how likely the hypothesis is to be true, but also drawing upon previous data to support this (see Box 1). Like all statistical methods, however, the Bayesian approach has its own limitations that mean it is not appropriate for every situation, such as where previous data are minimal or heavily flawed.
Box 1: What are Bayesian statistics?
Bayes’ theorem was first proposed by mathematician Thomas Bayes in the mid-1700s and has been slowly popularised in different fields over the centuries. Perhaps most famously used by Alan Turing to crack the Enigma code in the Second World War, it outlines the incorporation of relevant prior information to refine predicted results.
Bayes’ theorem can be translated for use in a clinical setting: existing data, such as those from a meta-analysis or earlier randomised controlled trial results, are defined as prior information or the ‘informative prior’. If evidence is limited or lacking, the data are instead categorised as weakly informative or non-informative priors. The quality of this prior evidence is also considered, categorised as ‘enthusiastic’, ‘neutral’ or ‘sceptical’, which establishes the ‘confidence’ in whether an effect is expected from the intervention. These assignments are combined to form a ‘prior probability’, which manifests as a distribution of possible, expected values. After the trial is completed, the observed data are combined with the prior probability to form a ‘posterior probability’, again a distribution of values. By comparing the prior and posterior distributions, the clinician can assess whether the results agree with, and add strength to, the existing evidence base or not. This way of thinking is similar to what clinicians naturally seek to do with a P-value, assessing it against the context of the existing evidence base, but is supported by statistical methodology.
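The step from prior to posterior can be sketched with the simplest textbook case, in which both the prior and the trial result are treated as normal distributions on the log risk ratio scale. This is an illustrative assumption, not the method of any analysis mentioned here; the function name, the ‘sceptical’ prior and the trial figures are all hypothetical.

```python
from math import exp, log


def normal_posterior(prior_mean, prior_sd, data_mean, data_se):
    """Combine a normal prior with a normal trial estimate (conjugate update).

    Each source is weighted by its precision (1 / variance).
    """
    prior_precision = 1 / prior_sd ** 2
    data_precision = 1 / data_se ** 2
    post_variance = 1 / (prior_precision + data_precision)
    post_mean = post_variance * (prior_precision * prior_mean
                                 + data_precision * data_mean)
    return post_mean, post_variance ** 0.5


# Hypothetical sceptical prior: centred on 'no effect' (log risk ratio of 0).
prior_mean, prior_sd = 0.0, 0.35
# Hypothetical trial result: risk ratio 0.5 with a standard error of 0.36.
trial_mean, trial_se = log(0.5), 0.36

post_mean, post_sd = normal_posterior(prior_mean, prior_sd, trial_mean, trial_se)
print(f"Posterior risk ratio: {exp(post_mean):.2f}")
print(f"Posterior SD on the log scale: {post_sd:.2f}")
```

In this sketch the posterior estimate sits between the sceptical prior and the trial result, and its spread is narrower than either, reflecting how the new data add to, rather than replace, the existing evidence.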
Bayesian statistics are on the rise, but for the most part, the P-value remains a staple statistic in clinical trials. The scale of data generated for each trial means applying any statistical method is lengthy and difficult, and introducing change or more complex techniques can be slow. Hackshaw, however, believes that this will change as technology advances.
“If you think of 1,000 people in a clinical trial, reducing all that data into one effect size and one P-value, and making decisions on that, that’s only because it’s easy for us as humans to interpret one or two numbers. But my guess is, in years to come, you’ll have artificial intelligence that won’t need to do that data reduction.”
In the meantime, the message is clear: let’s stop dichotomising. Instead, advice is to assess the treatment effect for how valuable that effect is clinically, and then use the CI to get a sense of the possible values of that treatment effect that are compatible with the overall data collected (see Box 2).
And as for the much-maligned P-value, one simple way to think of it, says Amrhein, is to remember that “all the P-value really gives you is a measure of surprise. This is what Fisher said, a small P-value means a result is worth another look — nothing more”.
Box 2: Key questions to ask when assessing clinical data
- What are the study’s limitations? Looking at the level of blinding (e.g. open label, single-blind, double-blind), the number of participants and the confidence interval (i.e. a wide confidence interval indicates uncertainty in the results) will help here. There is also a risk of bias if the trial was cut short or if a lot of patients were lost to follow-up.
- Does this apply to my patient? Looking at the exclusion criteria for the trial will help here.
- Is my patient sufficiently similar to the patients in the studies examined? Looking at the variance of the treatment effects will help here; the larger the variance, the less certain the treatment effect will be for an individual.
- Does the treatment have a clinically relevant benefit that outweighs the harms? Looking at the statistical significance will help here.
- Is another treatment better? Looking at how the benefit and harm profiles compare will help here.
- Is this treatment feasible in my setting? Looking at how the interventions were administered will help here.
- Is there publication bias? Looking at trial registers and conflicts of interest could help here.
Source: Pharm J online. 2019. Available at: https://www.pharmaceutical-journal.com/acute-pain/how-to-understand-and-interpret-clinical-data/20206305.article
Fisher R. Statistical Methods for Research Workers. Edinburgh: Oliver & Boyd; 1925.
Amrhein V, Greenland S & McShane B. Nature Comment 2019;567:305–307. Available at: https://www.nature.com/articles/d41586-019-00857-9