Perspective
ARTICLE IN PRESS
doi: 10.25259/IJMR_457_2025

Some newer & simpler biostatistical approaches for more credible clinical research

Department of Clinical Research, Max Healthcare, New Delhi, India

* For correspondence: a.indrayan@gmail.com

Licence
This is an open-access article distributed under the terms of the Creative Commons Attribution-Non Commercial-Share Alike 4.0 License, which allows others to remix, transform, and build upon the work non-commercially, as long as the author is credited and the new creations are licensed under the identical terms.

Biostatistical validation of results is often considered essential for their credibility. Without it, results are suspect and rarely published; yet results with claimed validity frequently fail to produce the expected outcome in real-world applications1. Statistical fallacies in clinical research can jeopardize the life and health of many and need to be addressed more seriously than is currently done2.

Among the reasons for the discrepancy between published research and practical applications, one important factor could be the use of inadequate methods of statistical evaluation, leading to incomplete or erroneous conclusions. Newer approaches are now available that can increase the applicability of medical research results. In this communication, we describe some relatively little-known approaches that have wide applicability and can help in conducting more credible clinical research. A summary is provided at the end.

Assess medical significance besides statistical significance

Cohen3 has discussed the uses and misuses of P values. Extensive discussions have taken place on statistical significance, culminating in a passionate plea to go beyond "P<0.05"4. However, the medical significance of the results has not received similar attention, despite the use of this term at least since 1945 in a clinical context5.

The P value measures the probability of obtaining the observed results, or results more extreme, when the null hypothesis is true. This value depends on many largely uncontrollable quantities, such as the standard deviation (SD) and the effect size, but also on the sample size, which is in our control. It is well known that a trivial effect becomes statistically significant, with a small P value, when the sample size is large6. However, a P value represents neither the importance of the results nor the effect size7. Thus, statistical significance by itself is losing significance, and attention is moving to the medical significance of the results8.

Clinical research is more useful when the target is a sizeable effect that would justify adopting a new regimen. For instance, few clinicians would abandon a well-tested existing regimen in favour of a new one for a minor gain of 1-2 per cent in efficacy, even when that gain is statistically significant. An effect large enough to change current medical practice and displace the existing one is called a medically significant effect. Thus, the effect size deserves more importance than it is currently given in most clinical research studies. The null hypothesis of no effect may soon vanish from the literature, and even P values risk fading into obscurity unless they are suitably modified.

A pragmatic approach is to pre-fix a target effect size considered medically significant and check whether the results achieve statistical significance with respect to that effect size. The difficulty, however, lies in reaching a consensus on the threshold of effect size to be considered medically significant. Besides variation in perception across clinicians, it would most likely differ from situation to situation. The researcher may fix a threshold with justification.

A pre-determined target for the detection of effect size has two disparate statistical implications that require newer approaches:

(i) The test of hypothesis

The null hypothesis under test in this new approach is H0: effect size ≤ δ against the alternative H1: effect size > δ, where δ is the threshold of effect size for medical significance and better clinical applicability. This is a one-tailed test, since the objective is to find whether the effect size exceeds δ. The conventional test of significance for δ = 0 has little clinical relevance. For example, an observed mean reduction of 7 per cent in leak with a nasal airway compared with conventional ventilation in cases undergoing bariatric surgery, with SD = 6 per cent in a series of 64 cases, would be statistically significant for H0: mean reduction = 0. If the researcher decides that the reduction should be at least δ = 20 per cent to justify switching to a nasal airway in these patients, because it requires some extra effort, the observed reduction of 7 per cent is not medically significant. The H0: effect size ≤ δ is easy to test by computing Student-t as

t = (observed effect size − δ) / (estimated SE of the effect size under H0).

This applies particularly to quantitative parameters, provided other conditions, such as independence of the observations and a Gaussian (Normal) distribution of the values, are met. Gaussianity is almost assured for large samples because of the central limit theorem. A similar null can be set for other tests, such as chi-square. When the null of effect size = δ is rejected in favour of the alternative that it is larger, any effect size < δ is automatically rejected as well.
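As a minimal sketch of this test (the function name and the normal approximation to the t distribution are our own choices; the numbers are from the bariatric-surgery example above):

```python
from math import erf, sqrt

def one_sided_t_vs_delta(mean_effect, sd, n, delta):
    """Test H0: effect size <= delta against H1: effect size > delta.

    Returns the t statistic and an approximate one-tailed P value.
    A normal approximation to the t distribution is used, which is
    reasonable for n of about 30 or more.
    """
    se = sd / sqrt(n)               # estimated SE of the mean effect
    t = (mean_effect - delta) / se  # Student-t under H0
    p_one_tail = 1 - 0.5 * (1 + erf(t / sqrt(2)))  # upper-tail area
    return t, p_one_tail

# Numbers from the example: mean leak reduction 7%, SD 6%, n = 64
t0, p0 = one_sided_t_vs_delta(7, 6, 64, delta=0)   # conventional null
t1, p1 = one_sided_t_vs_delta(7, 6, 64, delta=20)  # medically significant null
```

Here t0 ≈ 9.3 rejects the conventional null of zero effect, while t1 ≈ −17.3 shows that the 7 per cent reduction falls far short of the medically significant threshold.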

(ii) Sample size

The sample size formulae require specification of the effect size to be detected (or not to be missed) at a pre-fixed level of significance with a specified power, where power is the probability of detecting that effect size when present. "Detecting" in our context means that the effect is found statistically significant. The square of the effect size (δ²) occurs in the denominator of all sample size formulae for testing-of-hypothesis situations. Some formulae appearing in the literature make the error of taking the observed effect size in a previous study as the size to be detected. This error was made by Kang9 for the G*Power software and by Egbuchulem10 while discussing the basics of sample size determination. For comparing two proportions (one-tail), the formula generally used is

n ≥ [Z(1−α) √(2p̄(1−p̄)) + Z(1−β) √(p0(1−p0) + p1(1−p1))]² / (p1 − p0)²  per group,

with the usual notations. This is fallacious, since the effect size to be detected must be the medically significant effect that has the potential to change current medical practice, not the effect observed in a previous study. The denominator needs to be the medically significant effect δ. For example, the correct formula for comparing two proportions (one-tail) is

n ≥ [Z(1−α) √(2π̄(1−π̄)) + Z(1−β) √(π0(1−π0) + π1(1−π1))]² / δ²  per group,

where π1 = π0 ± δ, depending on whether an increase or a decrease is anticipated. This is now being increasingly realized and adopted. It should not be confused with the sample size formula for equivalence trials, where the denominator is [|π1 − π0| − δ]². Note that the correct formula has the population proportion π and not the sample estimate p.
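The corrected calculation can be sketched as follows; the function, the baseline proportion, and δ are hypothetical, and the z-values correspond to a one-tailed α = 0.01 with 80 per cent power:

```python
from math import ceil, sqrt

def n_two_proportions(pi0, delta, increase=True,
                      z_alpha=2.326, z_beta=0.842):
    """Per-group sample size to detect a medically significant
    difference delta between two population proportions (one-tailed).

    Defaults: z_alpha = 2.326 (one-tailed alpha = 0.01) and
    z_beta = 0.842 (80 per cent power). Population proportions (pi),
    not sample estimates (p), enter the formula.
    """
    pi1 = pi0 + delta if increase else pi0 - delta
    pibar = (pi0 + pi1) / 2
    num = (z_alpha * sqrt(2 * pibar * (1 - pibar))
           + z_beta * sqrt(pi0 * (1 - pi0) + pi1 * (1 - pi1))) ** 2
    return ceil(num / delta ** 2)  # delta**2, not an observed difference

# Hypothetical: baseline efficacy 60%, medically significant gain 15%
n = n_two_proportions(0.60, 0.15)
```

Because δ² sits in the denominator, halving the medically significant difference roughly quadruples the required sample size, which is why substituting a large observed effect from a previous study understates n.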

Almost all sample size formulae assume simple random sampling and a Gaussian distribution of the estimate of the parameter involved. Both tend to be ignored in practice. It is essential to adjust for sampling schemes other than simple random and for distributions other than Gaussian. For sampling other than simple random, the sample size is typically adjusted by the design effect11. For sampling from non-Gaussian distributions, the central limit theorem is invoked and a large sample is advised. These adjustments are in addition to the well-known adjustment for expected non-response. A further adjustment is required when sample estimates (such as s for SD and p for a proportion) are used, since the calculation of sample size requires the population value of the parameter. Many popular texts, such as that by Chow et al12, wrongly use sample estimates in the sample size formulae instead of the population parameters. Since the parameter value is not known, it is replaced by its estimate for practical purposes; this approximation requires an adjustment that is almost never made (for example, see Egbuchulem10). We subjectively advise that the sample size be inflated by 10 per cent to account for this approximation.

Use P<0.01 for statistical significance and not P<0.05

A P value measures the degree to which the study data are consistent with the null hypothesis. Whether testing the null H0: effect size ≤ δ against the alternative that it is more, or indeed for all tests of statistical significance, there is a growing realization that P<0.01 should be adopted for statistical significance in medical research in place of the prevalent P<0.05. Benjamin and Berger13 made a plea to use even P<0.005 for a novel discovery, because people's lives and health are at stake. P<0.01 provides greater assurance that the required effect size is indeed present. Although the chance of missing the effect size when present (Type-II error) increases with P<0.01, this is not a big loss, since it merely returns us to the status quo. The setback is temporary: if the researcher is confident that the required effect is present, statistical significance will appear when a repeat study with better methodology is done.

Prefer individual comparisons over group comparisons

In a paired setup, such as values of a liver function parameter before and after a treatment, the usual procedure is the paired t-test for means of quantitative data and the McNemar test for counts of qualitative data. Whereas the McNemar test is built on one-to-one matching, the paired t-test, despite being based on individual differences, compares means. A small mean difference is likely to be adjudged not significant even when some differences are extremely large on the positive side and some on the negative side. In a study on blood characteristics pre- and post-chemotherapy in cancer patients14, a value of 10.38 in one case improving to 14.05 has great clinical significance, but the paired t-test would disregard this finding because it is based on the mean difference. Large differences in individual subjects may contain cardinal information and can be flagged for further investigation, which may suggest new mechanisms causing such large differences in specific cases. This is rarely done. The new approach for paired data is to examine each difference and count how many agree within the clinical tolerance. For example, in a study of Hb level before and after one month of a supplement, a gain of less than 0.5 g/dL could be considered trivial, and the proportion of cases with a gain of 0.5 g/dL or more would be the right criterion for the percentage of cases in whom the supplement was effective. This procedure is simple and direct. Tests of hypothesis and confidence intervals can be built for the percentage agreement (or disagreement, as needed) using the binomial distribution, even for small samples. Currently, this is rarely done.
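The counting procedure can be sketched as follows, with hypothetical Hb values; a Wald approximation is used for the confidence interval, although an exact binomial interval is preferable for small samples:

```python
from math import sqrt

def paired_individual_gain(before, after, tolerance=0.5):
    """Per cent of pairs whose individual gain meets the clinical
    tolerance (here, an Hb gain of at least 0.5 g/dL), with an
    approximate 95% Wald confidence interval for that percentage.
    """
    gains = [a - b for b, a in zip(before, after)]
    k = sum(g >= tolerance for g in gains)   # pairs meeting the tolerance
    n = len(gains)
    p = k / n
    se = sqrt(p * (1 - p) / n)
    return 100 * p, (100 * max(0.0, p - 1.96 * se),
                     100 * min(1.0, p + 1.96 * se))

# Hypothetical Hb (g/dL) before and after one month of a supplement
before = [10.2, 11.5, 9.8, 12.0, 10.9, 11.1, 10.4, 9.6]
after  = [11.0, 11.6, 10.6, 12.8, 11.0, 12.0, 11.2, 10.3]
pct, ci = paired_individual_gain(before, after)
```

Unlike the paired t-test, each subject is judged individually, and the answer is directly clinical: the percentage of patients in whom the supplement achieved a worthwhile gain.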

The same kinds of errors and other problems have been listed for the Bland-Altman method, and the method of one-to-one agreement within clinical tolerance has been proposed in its place15. This new approach is simple, nonparametric, and measures the extent of agreement exactly. In addition, flexible clinical tolerance limits can be set in this method, a facility not available with the Bland-Altman method.

Assess predictive performance of a model by PPV, NPV, and agreement, instead of the area under the ROC curve

The predictive performance of a model for a qualitative outcome is generally, and wrongly, assessed by the area under the ROC curve, called the C-index; for an example, see Neuman et al16. This area is based on sensitivity and specificity, which assess discrimination between already known outcomes, not predictive performance for unknown outcomes. Positive predictive value (PPV) and negative predictive value (NPV) are the right indicators of predictive performance but are rarely used. For adequate prediction, both predictive values should be at least 90 per cent, with a 95 per cent lower bound of at least 85 per cent, to minimise error. The cut-off for best predictivity should be where PPV + NPV is maximum, called the P-index17, and not the sensitivity-specificity based Youden index. Secondly, even a low C-index, such as 0.735, has been considered sufficient, and many authors consider a C-index of 0.7 and above acceptable18. The remaining 0.3 is too much error and is unacceptable for clinical applications. A stricter value, such as at least 0.9 with a 95 per cent lower bound of 0.85, should be the norm even for adequate discrimination19. This may require extra effort, but it is worth it to move from mediocrity to excellence.
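Choosing the cut-off by the P-index can be sketched as follows; the function name and toy data are our own, and a real application would use the model's predicted probabilities:

```python
def p_index_cutoff(scores, labels, cutoffs):
    """Return the cutoff maximising PPV + NPV (the P-index),
    rather than the sensitivity/specificity-based Youden index.
    scores: model outputs; labels: 1 = outcome present, 0 = absent.
    """
    best = None
    for c in cutoffs:
        tp = sum(s >= c and y == 1 for s, y in zip(scores, labels))
        fp = sum(s >= c and y == 0 for s, y in zip(scores, labels))
        tn = sum(s < c and y == 0 for s, y in zip(scores, labels))
        fn = sum(s < c and y == 1 for s, y in zip(scores, labels))
        if tp + fp == 0 or tn + fn == 0:
            continue  # PPV or NPV undefined at this cutoff
        ppv = tp / (tp + fp)
        npv = tn / (tn + fn)
        if best is None or ppv + npv > best[1]:
            best = (c, ppv + npv, ppv, npv)
    return best  # (cutoff, P-index, PPV, NPV)

scores = [0.1, 0.2, 0.3, 0.7, 0.8, 0.9]  # hypothetical model outputs
labels = [0, 0, 0, 1, 1, 1]
cutoff, p_index, ppv, npv = p_index_cutoff(scores, labels, [0.25, 0.5, 0.75])
```

In this toy data the cutoff 0.5 separates the two groups perfectly, so the P-index reaches its maximum of 2.0 there.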

For quantitative outcomes, the predictive performance of a model should again be assessed by one-to-one agreement, within clinical tolerance, between the predicted and observed values for each subject, since clinical applications are based on individual values, not averages. A review of 20 recent articles indexed in PubMed from a variety of journals around the world found that none used this method19. Flexible clinical tolerance limits can be set in this method, such as correct prediction within 10 per cent of the value in place of an absolute margin. For the duration of surgery, for example, this would imply prediction within ±12 min when the expectation is 2 h and within ±24 min when it is 4 h.
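One-to-one agreement within a relative clinical tolerance can be sketched as follows, with hypothetical surgery durations:

```python
def pct_within_tolerance(predicted, observed, rel_tol=0.10):
    """Per cent of subjects whose predicted value agrees with the
    observed value within a relative clinical tolerance (here 10%
    of the observed value)."""
    ok = sum(abs(p - o) <= rel_tol * o for p, o in zip(predicted, observed))
    return 100 * ok / len(observed)

# Surgery durations (min): 10% tolerance means ±12 min at 120 min
# and ±24 min at 240 min, as in the example above
obs  = [120, 240, 180, 150]
pred = [130, 250, 210, 155]
share = pct_within_tolerance(pred, obs)
```

The third case misses (a 30 min error against an 18 min tolerance), so three of the four predictions, or 75 per cent, agree within tolerance.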

Shift from medical statistics to statistical medicine

Statistical methods such as confidence intervals and various tests of hypothesis are commonly used in most empirical research. All these methods come under the generic heading of medical statistics, or biostatistics. An unappreciated difficulty with these methods is that they are based on group averages and trends, which can be deceptive when applied to individuals. For example, the regression y = x, with intercept 0 and regression coefficient 1, can be obtained even when the values are widely scattered, since the fitted relationship is actually E(y) = x, where E(y) is the average of the y values for a fixed value of x. Many researchers, including biostatisticians, do not realize the potential for misinterpreting the clinical implications of these group-based statistical methods. Some results fail in actual application to individuals in clinics because of such inadequacies.
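A hypothetical illustration of this point: ordinary least squares fits the same line y = x to a tight dataset and to a widely scattered one, so the fitted average relationship says nothing about agreement for individuals.

```python
def ols(x, y):
    """Ordinary least squares slope and intercept."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    return b, my - b * mx  # (slope, intercept)

x = [1, 2, 3, 4, 5]
tight = [1.0, 2.0, 3.0, 4.0, 5.0]   # individual values on the line
loose = [3.0, -2.0, 3.0, 8.0, 3.0]  # individual deviations up to 4
b1, a1 = ols(x, tight)              # slope 1, intercept 0
b2, a2 = ols(x, loose)              # also slope 1, intercept 0
```

Both fits report y = x, yet in the second dataset individual subjects deviate from that line by as much as 4 units, which is exactly the kind of discrepancy that group-based summaries conceal.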

Besides focusing on individuals, a new approach is to use statistical tools for diagnosis, treatment, and prognosis in individual patients or persons, instead of averages and proportions in groups. These tools are models, scoring systems, scales, indexes, decision trees, and now machine-learning (AI) based approaches. They are increasingly used for establishing diagnosis, prescribing treatment, and assessing prognosis in clinics at the individual level, and have been found to perform better than clinical assessment20. Decision support systems, such as those for improved primary health care21, and clinical calculators, such as those for cardiovascular risk22, have shown promise in the healthcare of individuals. More such tools, and tools with higher validity, can be developed to enhance objectivity in clinical assessment. This paradigm shift is called personalized statistical medicine and is projected as a branch of medicine, whereas medical statistics is a branch of statistics23.

Small sample size is better in some cases

Despite the advantages of large-sample studies, a small sample may be more relevant in some situations, particularly for research on the biological mechanism of an outcome24. Studies by many Nobel prize winners in physiology are glaring examples. For rare diseases, and for health conditions in elite athletes25, small samples are assets. Small-sample research in any setup allows the use of highly sophisticated instruments and highly qualified professionals to obtain and understand the data. Every subject in the sample can be thoroughly investigated for antecedents, mediators, confounders, and outcomes to establish a cause-effect relationship or to explain the biological mechanism. Such a facility is generally not available in large-sample studies. A large sample is helpful when multifactorial entities are investigated, where some factors cannot be studied or are unknown, but such studies generally address group averages and proportions, not individual values, and their results assume that the effects of unaccounted factors average out. For example, Bogl et al26 conducted a meta-analysis comprising 68,494 same-sex and 53,808 opposite-sex dizygotic twins to study the effect of sex on adult height and found no evidence of a major effect. Perhaps an in-depth study of one pair each of M-M, M-F, and F-F twins with minimal difference in height, and one pair of each with a substantial difference, could provide better evidence. Small-sample studies can bring out the details of mechanism, thus helping to tailor the regimen to an individual's need.

Overall, simple new statistical approaches are now available that enhance the credibility of results, particularly for clinical research studies (Table). These approaches include testing for medically significant effect instead of the null of no effect, using P<0.01 in place of the current P<0.05 for statistical significance for greater assurance, assessing predictive performance of a model by PPV and NPV instead of the area under the ROC curve, using individual agreement within clinical tolerance in place of Bland-Altman method, using statistical tools such as scoring systems for diagnosis, treatment, and prognosis for personalized assessment in place of group averages, and recognizing the importance of small samples in some situations for obtaining better quality data with intensive investigations.

Table. Existing and the newer approaches for better credibility of the results.
Problem | Existing approach | Newer approach
Significance of results | Statistical significance for any effect (H0: Effect = 0; H1: Effect ≠ 0) | Medical significance in addition to statistical significance for a minimum medically important effect (H0: Effect ≤ δ; H1: Effect > δ)
P value for statistical significance | P<0.05 | P<0.01
Paired comparison | Average difference | Difference in individual pairs
Agreement | Bland-Altman method | Per cent agreement within clinical tolerance
Predictive performance of a model | Area under the ROC curve (C-index) | PPV and NPV ≥0.90
Adequate discrimination | C-index ≥0.70 | C-index ≥0.90
Best cutoff for prediction | Youden index | P-index
Statistical tools | Estimation & testing of hypothesis (medical statistics) | Models, scoring systems, decision trees for diagnosis, treatment & prognosis (statistical medicine)
Sample size | Prefer large sample | Prefer small sample for intensive research in some cases

ROC, receiver operating characteristic; PPV, positive predictive value; NPV, negative predictive value

Financial support & sponsorship

None.

Conflicts of Interest

None.

Use of Artificial Intelligence (AI)-Assisted Technology for manuscript preparation

The authors confirm that there was no use of AI-assisted technology for assisting in the writing of the manuscript and no images were manipulated using AI.

References

  1. Why most published research findings are false. PLoS Med. 2005;2:e124.
  2. Statistical fallacies & errors can also jeopardize life & health of many. Indian J Med Res. 2018;148:677-79.
  3. P values: Use and misuse in medical literature. Am J Hypertens. 2011;24:18-23.
  4. Moving to a world beyond "p < 0.05". The American Statistician. 2019;73:1-19.
  5. Medical significance of ocular torticollis. Bull Hosp Joint Dis. 1945;6:99-109.
  6. Research commentary - Too big to fail: Large samples and the p-value problem. Information Systems Research. 2013;24:906-17.
  7. Is it time to stop using statistical significance? Aust Prescr. 2021;44:16-8.
  8. There is life beyond the statistical significance. Reprod Health. 2021;18:80.
  9. Sample size determination and power analysis using the G*Power software. J Educ Eval Health Prof. 2021;18:17.
  10. The basics of sample size estimation: An editor's view. Ann Ib Postgrad Med. 2023;21:5-10.
  11. Practical considerations for sample size calculation for cluster randomized trials. J Epidemiol Popul Health. 2024;72:202198.
  12. Sample size calculations in clinical research (3rd edition). New York: CRC Press.
  13. Three recommendations for improving the use of p-values. The American Statistician. 2019;73:186-91.
  14. Differences in the count of blood cells pre- and post-chemotherapy in patients with cancer: A retrospective study (2022). Front Med (Lausanne). 2025;12:1485676.
  15. Direct use of clinical tolerance limits for assessing the degree of agreement between two methods of measuring blood pressure. South Med J. 2023;116:435-9.
  16. A machine-learning model for prediction of Acinetobacter baumannii hospital acquired infection. PLoS One. 2024;19:e0311576.
  17. Use of ROC curve analysis for prediction gives fallacious results: Use predictivity-based indices. J Postgrad Med. 2024;70:91-6.
  18. Interpreting area under the receiver operating characteristic curve. Lancet Digit Health. 2022;4:e853-5.
  19. Assessing the adequacy of a prediction model. Indian J Community Med. 2025;50:739-44.
  20. Clinical versus mechanical prediction: A meta-analysis. Psychol Assess. 2000;12:19-30.
  21. Digital tracking, provider decision support systems, and targeted client communication via mobile devices to improve primary health care. Cochrane Database Syst Rev. 2025;4:CD012925.
  22. Evaluation of novel cardiovascular risk calculators in patients with rheumatoid arthritis. J Rheum Dis. 2025;32:145-7.
  23. Personalized statistical medicine. Indian J Med Res. 2023;157:104-8.
  24. The corruption of power: On the use and abuse of a pre-trial concept. Exp Physiol. 2024;109:317-9.
  25. Coping with the "small sample-small relevant effects" dilemma in elite sport research. Int J Sports Physiol Perform. 2021;16:1559-60.
  26. Does the sex of one's co-twin affect height and BMI in adulthood? A study of dizygotic adult twins from 31 cohorts. Biol Sex Differ. 2017;8:14.