Authors’ response
* For correspondence: a.indrayan@gmail.com
Sir,
We thank Yadav and Kumar for their feedback1 on our article2 and for appreciating our efforts to align biostatistical methods with clinical relevance.
P values indeed have several limitations, and they are commonly misused to draw inferences they do not support. These limitations have been widely discussed, and we ourselves cited several references (nos. 3 to 8 in our article), including the compilation of views on the ASA’s 2016 statement mentioned by the reviewers. We have also acknowledged in our article that P values are losing relevance for clinical decisions, but they cannot be ignored either. Since all studies are based on samples – even complete coverage of the target population cannot cover future cases, to which, ironically, the inferences are generally applied – it is necessary to try to rule out sampling fluctuation as a cause of the result. P values do precisely that: a small P value signifies a small chance that the result is a false positive arising from sampling fluctuations, assuming the data are correct.
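To make this role concrete, the following sketch (our own illustration, with purely hypothetical numbers) simulates repeated studies drawn from a single population in which no effect exists; sampling fluctuation alone then yields P<0.05 in roughly 5 per cent of them, which is exactly the false-positive rate the threshold is meant to cap.

```python
# Our illustrative sketch (hypothetical numbers): repeated studies drawn from a
# single population with NO real effect. Sampling fluctuation alone yields
# P < 0.05 in roughly 5% of them - the false-positive rate the threshold caps.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=0)
n_sim, false_positives = 2000, 0
for _ in range(n_sim):
    a = rng.normal(0.0, 1.0, size=30)   # both samples come from the same population
    b = rng.normal(0.0, 1.0, size=30)
    if ttest_ind(a, b).pvalue < 0.05:   # "significant" purely by chance
        false_positives += 1

print(f"Fraction of null studies with P < 0.05: {false_positives / n_sim:.3f}")
# Comes out near 0.05: the threshold limits how often fluctuation alone
# masquerades as a real effect.
```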
We agree with their assertion that ‘the way forward is not to redefine significance threshold but to adopt a multi-component inferential framework’, but this framework must include the P value as one component among others. This is necessary because P value-based statistical significance provides fair insurance against disruption by sampling fluctuations. In an earlier article3, we explained the necessity of statistical significance while emphasizing the medical significance of a result. Statistical significance against the null of no effect assures us that some effect is present, and only after this does the question arise of examining whether its magnitude reaches a medically important threshold. If the null is set at a medically significant effect, as pleaded in our article, this step could become redundant. Either way, a P value is required. Use of confidence intervals (CIs) instead of P values is good for assessing the variation around the estimated effect size, but to conclude whether or not the required effect is present, we must again fall back on examining whether the CI contains the effect size under the null hypothesis. This is the same test of significance by another method, and possibly more involved than the direct use of P values. Most clinical decisions are binary – disease present or absent, operate or not, admit to the ICU or not, discharge or not. For such binary decisions, tests of significance have a definite edge over CIs. Moreover, tests give an exact P value, a facility not available with a CI: if P<0.001, what CI can convey this information?
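For readers who prefer to see the equivalence spelt out, the short sketch below (our illustration, using assumed figures for the effect estimate and its standard error) computes both the two-sided P value and the 95% CI for a normal-approximation test: the CI excludes the null value exactly when P<0.05, so the binary verdict is the same, but only the P value quantifies how incompatible the data are with the null.

```python
# Our sketch with assumed figures: a normal-approximation test of an effect
# estimate. The 95% CI excludes the null value exactly when the two-sided
# P value falls below 0.05, so both lead to the same binary verdict.
from scipy.stats import norm

def p_value_and_ci(effect, se, null=0.0, level=0.95):
    """Two-sided P value against `null` and the matching confidence interval."""
    z = (effect - null) / se
    p = 2 * norm.sf(abs(z))                        # two-sided tail probability
    half_width = norm.ppf(1 - (1 - level) / 2) * se
    return p, (effect - half_width, effect + half_width)

# Assumed numbers: an observed difference of 4.0 units with standard error 1.5
p, (lo, hi) = p_value_and_ci(effect=4.0, se=1.5)
print(f"P = {p:.4f}, 95% CI = ({lo:.2f}, {hi:.2f})")
# P ~ 0.008 and the CI (1.06, 6.94) excludes 0: the verdicts agree, but only
# the exact P value conveys how strongly the data contradict the null.
```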
We fully agree that reducing the P value threshold from 0.05 to 0.01 would raise the chance of a Type-II error, as already acknowledged in our article. However, it must be realized that a false positive result (Type-I error) is the more serious error, just as convicting an innocent person in a court of law is more serious, whereas a false negative result (Type-II error), which is like acquitting a criminal for lack of evidence, is less serious. In a clinical context, a 5 per cent rate of false positive results is too high, particularly in research, where the design is under control; this error may be magnified when the result is applied in practical situations. Thus, there is a strong case for reducing it to 1 per cent, even at the cost of missing an effect. We agree that this may be problematic with small samples, but, as explained previously4, small samples allow us to investigate each case intensively with sophisticated tools for valid results without falling back on P values.
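The trade-off can be made concrete with a rough power calculation (our sketch, using assumed values for the true effect and its standard error): for the same data, tightening the threshold from 0.05 to 0.01 lowers the chance of a false positive but also lowers the power, that is, it raises the Type-II error.

```python
# Our sketch with assumed figures: approximate power of a two-sided z-test
# for a fixed true effect and standard error, at thresholds 0.05 and 0.01.
from scipy.stats import norm

def power_two_sided(effect, se, alpha):
    """Probability of detecting a true effect of the given size (two-sided z-test)."""
    z_crit = norm.ppf(1 - alpha / 2)
    z_effect = effect / se
    return norm.sf(z_crit - z_effect) + norm.cdf(-z_crit - z_effect)

for alpha in (0.05, 0.01):
    power = power_two_sided(effect=3.0, se=1.5, alpha=alpha)
    print(f"alpha = {alpha}: power = {power:.2f}")
# With these assumed numbers, power falls from about 0.52 at alpha = 0.05 to
# about 0.28 at alpha = 0.01: the Type-II error grows as the Type-I error shrinks.
```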
Their second concern is regarding our advice to use the PPV-NPV-P-index, instead of ROC-based indices, for assessing the predictive performance of a model. Yes, PPV and NPV are highly dependent on the prevalence and thus lack portability across populations, but that does not justify the use of the wrong indices. PPV and NPV are the correct measures of predictivity of the unknown, whereas sensitivity-specificity and the area under the ROC curve assess the classification of the known; these are not the same metric. Prediction should remain a prediction of the unknown and not be reduced to classification of the known. Realize that P(Test+|Disease+) is not the same as P(Disease+|Test+), and this distinction is crucial for correct clinical decisions. Similarly, the existing Youden index provides a cut-off for the best classification of the known, whereas the P index proposed by us provides a cut-off for the best prediction of the unknown5. The two can be very different depending on the prevalence. We reiterate that the use of ROC-based indices for assessing predictivity is wrong and must be discontinued in favour of predictivity-based indices.
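The distinction is easy to verify numerically. The sketch below (our illustration, with an assumed test of 90% sensitivity and 90% specificity) applies Bayes' theorem to show that P(Disease+|Test+), i.e., the PPV, can be far smaller than P(Test+|Disease+) and changes sharply with prevalence, which is precisely why classification indices cannot stand in for predictivity.

```python
# Our sketch with an assumed test (90% sensitivity, 90% specificity): Bayes'
# theorem shows that PPV = P(Disease+|Test+) depends strongly on prevalence
# and can be far smaller than sensitivity = P(Test+|Disease+).
def ppv_npv(sensitivity, specificity, prevalence):
    """Predictive values of a test at a given disease prevalence (Bayes' theorem)."""
    ppv = (sensitivity * prevalence) / (
        sensitivity * prevalence + (1 - specificity) * (1 - prevalence))
    npv = (specificity * (1 - prevalence)) / (
        specificity * (1 - prevalence) + (1 - sensitivity) * prevalence)
    return ppv, npv

for prev in (0.01, 0.10, 0.50):
    ppv, npv = ppv_npv(0.90, 0.90, prev)
    print(f"prevalence = {prev:.2f}: PPV = {ppv:.2f}, NPV = {npv:.2f}")
# At 1% prevalence the PPV is only about 0.08 despite 90% sensitivity:
# classification of the known and prediction of the unknown are not the same.
```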
References
- Some newer & simpler biostatistical approaches for more credible clinical research. Indian J Med Res. 2025;162:414-8.
- Attack on statistical significance: a balanced approach for medical research. Indian J Med Res. 2020;151:275-8.
- The importance of small samples in medical research. J Postgrad Med. 2021;67:219-23.
- Assessing the adequacy of a prediction model. Indian J Community Med. 2025;50:739-44.