Over the past decades, the scientific enterprise has heavily relied on frequentist statistical inference tests of hypothesis,1,2 whose end-product, as commonly used in scientific articles, is the P value. The P value is, by far, the most common statistic reported in the scientific literature.3,4
Conventionally, when a P value is less than 0.05, the results are considered “significant.” However, there is no reasonable justification for this cut-off as the P value significance threshold; the value was proposed arbitrarily by Ronald A. Fisher in 1925. Since then, the P value has been used ever more widely in the scientific literature. The rate of reporting P values in biomedical articles published in PubMed Central doubled from 7.3% in 1990 to 15.6% in 2014, with an average of nine P values in each article published during the study period.5 Over the past years, several researchers have reported that the P value is often incorrectly calculated, and that even when correctly calculated, it is frequently misinterpreted.6 They have also shown that a significant P value (< 0.05) can easily be attained merely by chance, implying that the results of many studies are false positives, which is tantamount to low replicability of the findings.7 A recent article examining more than 23,000 clinical trials shows that for P values ranging from 0.001 to 0.05, the probability that a replication of the study also ends with a significant P value (< 0.05) and an observed effect in the same direction is only slightly more than 40%.8
To overcome the problem with replicability, some investigators have proposed a lower significance threshold for the P value. For instance, Ioannidis has proposed lowering the P cut-off from the conventional value of 0.05 to 0.005.9 None of the proposed values has, nonetheless, gained universal acceptance. All these proposals call for a smaller but still fixed P value significance threshold, without noting that the trouble is not with the significance threshold itself; it is mainly with presuming that a unique P value significance threshold works for different study designs, when it really does not: “one size does not fit all.”
Based on the analogy between diagnostic tests and statistical inference tests of hypothesis, a recent article shows that the most appropriate P value significance threshold is not a fixed value; it depends on various aspects of a study, including the study sample size, the effect size of interest, and the prior odds of the alternative hypothesis (H1) relative to the null hypothesis (H0).10
The most appropriate P cut-off value computed for most study designs is generally far less than the conventional threshold of 0.05.10 For example, suppose the effectiveness of a drug is being compared with a placebo in a two-arm parallel-design randomized clinical trial with 300 participants in each arm. Assume that we are looking for a medium effect (Cohen’s d of 0.5)11 and that the prior odds of H1 relative to H0 are 1 (ie, a probability of 50%, meaning that prior to conducting the study, the drug is presumed to be effective with a probability of 50%). It can be shown that the most appropriate significance threshold for a one-sided Student’s t test is then 5.3 × 10⁻⁴, far less than the conventional P cut-off value of 0.05. If the sample size increases to 500 in each study arm (keeping all other things unchanged), the cut-off decreases to 2.0 × 10⁻⁵.10
It can be shown that the most appropriate P cut-off value not only changes with the study design but also varies from one statistical test to another (e.g., between Student’s t test and the χ² test), even within a single study. Worse, it can be proved that there is no way to determine the most appropriate P value significance threshold a priori, before the study results are obtained, as the value also depends on the distribution of the resultant data. It can only be determined a posteriori, once the results are available.10 This raises a serious fundamental problem with frequentist statistical inference. The most appropriate P significance threshold changes even between replicas of the same study, owing to the sampling error present in all experimental studies. This inherent conflict leaves us no choice but to abandon the P value in particular and frequentist statistical methods at large. We need to replace the prevailing frequentist statistical analysis used in biomedical sciences with alternative approaches.
Currently, most courses on biostatistics and research methodology taught in medical schools across the globe focus solely on frequentist statistics. The focus would better be shifted to alternative methods such as Bayesian approaches. In the meantime, editors of biomedical journals should encourage researchers to employ Bayesian methods and to avoid reporting P values, for lack of scientific credibility. Nor should editors consider confidence intervals a replacement for the P value, as the intervals do not deliver the coverage they promise. A recent article examining a large number of randomized clinical trials retrieved from the Cochrane Database of Systematic Reviews has shown that the 95% confidence intervals reported in those studies contained the real value of the statistic of interest only about 90% (not the expected 95%) of the time.8
Although Bayesian statistics would, for the time being, be an appropriate substitute for frequentist statistics in analyzing biomedical data, it is not perfect either; it has its own limitations. Selection of inappropriate priors results in incorrect conclusions. The method is typically computationally intensive, particularly when there are many variables in the model. Nevertheless, the right way is not always the easier one. The increasing application of artificial intelligence in most areas of biomedical research and data analysis provides the computational capacity necessary for the complex analyses involved in Bayesian methods and paves the way for their application in biomedical research.12
Nowadays, some research studies include millions of participants. Given such very large sample sizes, even a trivial difference in a variable between two study groups, one of no clinical significance at all, yields a significant P value if frequentist statistical inference methods are used for data analysis. This causes problems both with applying the methods and with interpreting the results.
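As an illustration, consider the following minimal simulation (with made-up numbers, not from any real study): two groups of one million observations whose true means differ by a clinically negligible 0.01 standard deviations. A frequentist t test nevertheless flags the difference as highly significant.

```python
# Sketch: with very large samples, a clinically negligible difference
# still yields a "significant" P value (illustrative simulated data).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 1_000_000                      # one million participants per group
a = rng.normal(0.00, 1.0, n)       # control group
b = rng.normal(0.01, 1.0, n)       # true difference: 0.01 SD, trivial clinically

t, p = stats.ttest_ind(a, b)
print(f"P = {p:.2e}")              # P << 0.05 despite a negligible effect
```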
Technically, reducing the entire data set to a one-dimensional summary measure such as the P value, as is done in frequentist statistics, always results in information loss. Worse, using a fixed cut-off for the P value significance threshold to dichotomize the results of a study into “significant” and “not significant,” as is also done in frequentist statistics, has been strongly criticized. As Sir William Osler once asserted, “medicine is a science of uncertainty and an art of probability.” I believe the most appropriate way to present the results of biomedical research is thus not to interpret them so as to dichotomize the findings into rejecting or retaining a hypothesis, the way frequentist methods work; it is instead to revise the probability that a hypothesis is correct in light of the new results obtained from a study, the way Bayesian methods work (a toy sketch follows below). Given that the necessary infrastructure exists, it seems the time is ripe to choose the right statistical methods for the analysis of biomedical data, once and for all. Of course, further studies should be conducted to identify more appropriate approaches to data analysis.
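To make the contrast concrete, here is a minimal sketch of the Bayesian style of reporting, using hypothetical trial counts invented for illustration: a conjugate beta-binomial model for the response rates in two trial arms, whose output is not a verdict of “significant” versus “not significant” but an updated probability that the drug outperforms the placebo.

```python
# Sketch: Bayesian updating with a conjugate beta-binomial model
# (hypothetical trial data; all counts are made up for illustration).
import numpy as np
from scipy import stats

# Hypothetical results: responders / participants in each arm
drug_success, drug_n = 180, 300
placebo_success, placebo_n = 150, 300

# Uniform Beta(1, 1) priors on each response rate; posteriors are Beta too
post_drug = stats.beta(1 + drug_success, 1 + drug_n - drug_success)
post_placebo = stats.beta(1 + placebo_success, 1 + placebo_n - placebo_success)

# Monte Carlo estimate of P(drug response rate > placebo response rate | data)
rng = np.random.default_rng(0)
samples = 1_000_000
prob = np.mean(post_drug.rvs(samples, random_state=rng) >
               post_placebo.rvs(samples, random_state=rng))
print(f"P(drug better than placebo | data) = {prob:.3f}")
```

Rather than rejecting or retaining H0 at a fixed cut-off, the reader sees how the data shift the probability that the drug is the better treatment, which is exactly the mode of reasoning advocated above.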
References
1. Gasparyan AY, Ayvazyan L, Mukanova U, Yessirkepov M, Kitas GD. Scientific hypotheses: writing, promoting, and predicting implications. J Korean Med Sci. 2019; 34(45):e300. PMID: 31760713.
2. Misra DP, Gasparyan AY, Zimba O, Yessirkepov M, Agarwal V, Kitas GD. Formulating hypotheses for different study designs. J Korean Med Sci. 2021; 36(50):e338. PMID: 34962112.
3. Habibzadeh F. How to report the results of public health research. J Public Health Emerg. 2017; 1:90.
4. Habibzadeh F. Statistical data editing in scientific articles. J Korean Med Sci. 2017; 32(7):1072–1076. PMID: 28581261.
5. Chavalarias D, Wallach JD, Li AH, Ioannidis JP. Evolution of reporting P values in the biomedical literature, 1990-2015. JAMA. 2016; 315(11):1141–1148. PMID: 26978209.
6. Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008; 45(3):135–140. PMID: 18582619.
7. Ioannidis JP. Why most published research findings are false. PLoS Med. 2005; 2(8):e124. PMID: 16060722.
8. van Zwet E, Gelman A, Greenland S, Imbens G, Schwab S, Goodman SN. A new look at p values for randomized clinical trials. NEJM Evid. 2024; 3(1):EVIDoa2300003. PMID: 38320512.
9. Ioannidis JPA. The proposal to lower P value thresholds to .005. JAMA. 2018; 319(14):1429–1430. PMID: 29566133.
10. Habibzadeh F. On the use of receiver operating characteristic curve analysis to determine the most appropriate p value significance threshold. J Transl Med. 2024; 22(1):16. PMID: 38178182.
11. Cohen J. Handbook of Clinical Psychology. New York, NY, USA: McGraw-Hill;1965.
12. Habibzadeh F. The future of scientific journals: the rise of UniAI. Learn Publ. 2023; 36(2):326–330.