Multiple testing

Multiple testing, also known as multiplicity, occurs in randomized controlled trials (RCTs) when multiple statistical comparisons are made. This increases the risk of a Type I error (false positives), potentially leading to incorrect conclusions about an intervention’s effectiveness or safety.

Why Is Multiple Testing a Concern?

Each statistical test carries a risk of a false positive. As the number of tests increases, so does the overall chance of observing at least one false positive result.

Example:

  • Significance level (α): 0.05
  • Number of independent tests (n): 10
  • Probability of at least one false positive:
 P = 1 − (1 − α)^n  
 P = 1 − (0.95)^10 ≈ 0.40

This means a 40% chance of at least one false positive, compared to the intended 5%.
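
The inflation can be checked directly. The short Python sketch below simply recomputes the formula above for a few values of n; the α of 0.05 and the counts are just the values from the example:

  # Probability of at least one false positive among n independent tests,
  # each carried out at significance level alpha: P = 1 - (1 - alpha)**n.
  alpha = 0.05
  for n in (1, 5, 10, 20):
      p_any = 1 - (1 - alpha) ** n
      print(f"n = {n:2d}: P(at least one false positive) = {p_any:.2f}")

For n = 10 this prints 0.40, matching the calculation above.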

Common Sources of Multiple Testing in RCTs

Multiple Primary Outcomes

Trials may include more than one primary endpoint (e.g., survival and quality of life). Example: A cardiovascular trial measuring both heart attacks and strokes.

Multiple Secondary Outcomes

Additional secondary outcomes (e.g., blood pressure, cholesterol) increase the number of statistical tests and the potential for false positives.

Interim Analyses

Planned interim analyses, such as those conducted for early stopping due to efficacy or futility, can inflate the Type I error rate if not properly adjusted.

Subgroup Analyses

Exploring treatment effects across different subgroups (e.g., by age, sex, or comorbidity status) increases the number of comparisons and the likelihood of spurious findings.

Multiple Treatment Arms

Trials with multiple intervention groups (e.g., placebo vs. low-dose vs. high-dose) involve several pairwise comparisons, each of which requires control for Type I error.

Methods to Control for Multiple Testing

Bonferroni Correction

This method adjusts the significance level by dividing α by the number of comparisons. Example: With 5 tests, the new significance threshold becomes 0.05 / 5 = 0.01. Pros: Simple and widely used. Cons: Very conservative; may increase the risk of Type II error (false negatives).
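
As a minimal illustration in Python (the five p-values below are hypothetical, not taken from any real trial), the correction can be applied either by shrinking the threshold to α/m or, equivalently, by inflating each p-value by the number of tests:

  # Bonferroni: compare each p-value to alpha / m, or equivalently compare
  # the adjusted p-value min(m * p, 1) to alpha. Hypothetical p-values.
  alpha = 0.05
  p_values = [0.003, 0.012, 0.030, 0.041, 0.210]
  m = len(p_values)
  threshold = alpha / m                      # 0.05 / 5 = 0.01
  for p in p_values:
      adjusted = min(m * p, 1.0)
      print(f"p = {p:.3f}  adjusted = {adjusted:.3f}  reject: {p <= threshold}")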

Holm’s Step-Down Method

A sequential version of the Bonferroni method that offers more power while still controlling the family-wise error rate. Pros: Less conservative than Bonferroni. Cons: Still relatively strict.
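
A sketch of the step-down procedure on the same hypothetical p-values: the smallest p-value is compared with α/m, the next smallest with α/(m − 1), and so on, stopping at the first non-rejection.

  # Holm step-down over hypothetical p-values.
  alpha = 0.05
  p_values = [0.003, 0.012, 0.030, 0.041, 0.210]
  m = len(p_values)
  order = sorted(range(m), key=lambda i: p_values[i])
  reject = [False] * m
  for rank, i in enumerate(order):
      if p_values[i] <= alpha / (m - rank):  # alpha/m, then alpha/(m-1), ...
          reject[i] = True
      else:
          break                              # all larger p-values also fail
  for i, p in enumerate(p_values):
      print(f"p = {p:.3f}  reject: {reject[i]}")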

Hochberg’s Step-Up Method

A step-up counterpart to Holm’s method that is uniformly more powerful; its control of the family-wise error rate is guaranteed when the tests are independent or positively correlated.
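
If the statsmodels package is available, its multipletests helper implements this procedure under the method name "simes-hochberg"; the p-values below are again hypothetical:

  from statsmodels.stats.multitest import multipletests

  p_values = [0.003, 0.012, 0.030, 0.041, 0.210]
  reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="simes-hochberg")
  print(reject)      # which hypotheses are rejected at the 0.05 family-wise level
  print(p_adjusted)  # Hochberg-adjusted p-values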

False Discovery Rate (FDR) Control

Controls the expected proportion of false discoveries among the rejected hypotheses. Often implemented using the Benjamini-Hochberg procedure. Pros: More flexible for exploratory analyses. Cons: Does not strictly control the family-wise error rate.
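
A sketch of the Benjamini-Hochberg step-up rule on the same hypothetical p-values: sort them in ascending order, find the largest k with p_(k) ≤ (k/m)·q, and declare the k smallest p-values discoveries.

  # Benjamini-Hochberg at target false discovery rate q. Hypothetical p-values.
  q = 0.05
  p_values = [0.003, 0.012, 0.030, 0.041, 0.210]
  m = len(p_values)
  p_sorted = sorted(p_values)
  k_star = max((k for k in range(1, m + 1) if p_sorted[k - 1] <= k / m * q), default=0)
  cutoff = p_sorted[k_star - 1] if k_star else 0.0
  for p in p_values:
      print(f"p = {p:.3f}  discovery: {k_star > 0 and p <= cutoff}")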

Gatekeeping Procedures

Establish a predefined testing hierarchy. Secondary outcomes are only tested if the primary outcome is significant. Pros: Maintains error control while allowing multiple testing. Cons: Requires strict pre-specification in the trial protocol.
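
A minimal sketch of a serial (fixed-sequence) gatekeeping strategy in Python; the outcome names, their order, and the p-values are hypothetical. Note that the last outcome would have looked "significant" in isolation, but it is never tested confirmatorily because an earlier gate failed.

  # Serial gatekeeping: test outcomes in the pre-specified order, each at the
  # full alpha, and stop at the first non-significant result.
  alpha = 0.05
  ordered_outcomes = [
      ("primary: overall survival", 0.020),
      ("key secondary: progression-free survival", 0.030),
      ("secondary: quality of life", 0.080),
      ("secondary: biomarker response", 0.001),
  ]
  for name, p in ordered_outcomes:
      if p <= alpha:
          print(f"{name}: p = {p:.3f} -> significant")
      else:
          print(f"{name}: p = {p:.3f} -> not significant; gate closed, remaining outcomes not tested")
          break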

Group Sequential Methods

Used in interim analyses, these methods (e.g., O’Brien-Fleming or Pocock boundaries) adjust the significance threshold at each look at the data. Pros: Controls for inflated Type I error in interim analyses. Cons: Requires careful planning and statistical expertise.
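
The two boundary shapes can be illustrated with a rough Monte Carlo sketch in Python (using numpy). It calibrates a constant c so that, under the null hypothesis, the chance of crossing the boundary at any of K equally spaced looks is approximately α. The look count, α, and simulation size are arbitrary choices for illustration; an actual trial would use validated group sequential design software rather than a simulation like this.

  import numpy as np

  # Monte Carlo calibration of two-sided group sequential boundaries for K
  # equally spaced interim looks. Illustrative only.
  rng = np.random.default_rng(0)
  K, alpha, n_sim = 5, 0.05, 200_000
  t = np.arange(1, K + 1) / K                      # information fractions

  # Null distribution of the interim z-statistics: standardised partial sums
  # of independent normal increments.
  z = np.cumsum(rng.standard_normal((n_sim, K)) / np.sqrt(K), axis=1) / np.sqrt(t)

  def boundary(c, shape):
      # O'Brien-Fleming-type boundaries start strict and relax as information
      # accrues; Pocock-type boundaries are constant across looks.
      return c / np.sqrt(t) if shape == "obf" else np.full(K, c)

  def crossing_prob(c, shape):
      return np.mean((np.abs(z) > boundary(c, shape)).any(axis=1))

  def calibrate(shape, lo=1.0, hi=6.0):
      for _ in range(40):                          # bisection on the constant c
          mid = (lo + hi) / 2
          lo, hi = (mid, hi) if crossing_prob(mid, shape) > alpha else (lo, mid)
      return (lo + hi) / 2

  for shape in ("obf", "pocock"):
      print(shape, np.round(boundary(calibrate(shape), shape), 2))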

Best Practices for Managing Multiple Testing in RCTs

Pre-specify Outcomes

Clearly define one primary outcome in the trial protocol. This helps maintain analytical focus and ensures proper control of the family-wise error rate. Limit the number of secondary outcomes to those that are clinically meaningful and hypothesis-driven. Avoid exploratory outcomes that introduce unnecessary multiplicity.

Use Hierarchical Testing

Establish a predefined order for testing outcomes (e.g., primary → key secondary → exploratory). Each subsequent outcome is only tested if the preceding one is statistically significant. This approach preserves the overall Type I error rate.

Plan Interim Analyses with Proper Adjustments

If interim analyses are planned, specify their timing and statistical boundaries in advance. Use group sequential designs to control the overall error rate across multiple analyses (e.g., with O’Brien-Fleming or Pocock boundaries).

Limit Subgroup Analyses

Restrict subgroup analyses to those that are biologically plausible and pre-specified in the protocol. Avoid post hoc analyses that may produce misleading or non-reproducible results.

Adjust for Multiple Comparisons

Apply correction methods appropriate to the number and nature of the hypotheses being tested. Use Bonferroni or Holm’s methods for confirmatory testing and FDR control for exploratory work. Gatekeeping procedures can help prioritize testing while maintaining validity.

Ensure Transparent Reporting

Clearly describe how multiple testing was handled in the protocol, analysis, and publication. Follow CONSORT guidelines to report:

  • the number of hypotheses tested,
  • any interim or subgroup analyses performed,
  • and the statistical methods used to control for multiplicity.

Example Application

Study Design: A diabetes RCT compares a new drug vs. placebo across three primary outcomes: HbA1c, weight loss, and cholesterol.

Problem: Testing all three outcomes at α = 0.05 inflates the probability of a false positive.

Solutions:

  • Apply Bonferroni correction (adjusted α = 0.05 / 3 = 0.017).
  • Use hierarchical testing: test weight loss and cholesterol only if the HbA1c result is significant.

Outcome: These approaches help ensure that trial findings are robust and not driven by chance.
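
For concreteness, here is a small Python sketch of both options, with hypothetical p-values for the three outcomes (0.010 for HbA1c, 0.030 for weight loss, 0.200 for cholesterol). With these particular numbers the two strategies disagree about weight loss, which illustrates why the multiplicity strategy should be pre-specified rather than chosen after seeing the data.

  alpha = 0.05
  results = {"HbA1c": 0.010, "weight loss": 0.030, "cholesterol": 0.200}  # hypothetical

  # Option 1: Bonferroni - each outcome is tested at alpha / 3 (about 0.017).
  bonferroni_threshold = alpha / len(results)
  for outcome, p in results.items():
      print(f"Bonferroni    {outcome}: p = {p:.3f}, significant: {p <= bonferroni_threshold}")

  # Option 2: hierarchical testing - each outcome is tested at the full alpha,
  # but only if every outcome earlier in the sequence was significant.
  for outcome, p in results.items():
      significant = p <= alpha
      print(f"Hierarchical  {outcome}: p = {p:.3f}, significant: {significant}")
      if not significant:
          break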

Conclusion

Multiple testing is a common and important consideration in RCT design and analysis. Without proper correction, it can lead to misleading results and overestimation of treatment effects.

Using appropriate statistical adjustments—such as Bonferroni, Holm’s, or FDR control—combined with pre-specification of outcomes, controlled interim analyses, and transparent reporting, ensures valid and reproducible conclusions.


Bibliography

  1. Bender R, Lange S. Adjusting for multiple testing—when and how? Journal of Clinical Epidemiology. 2001;54(4):343–349.
  2. Rothman KJ. No adjustments are needed for multiple comparisons. Epidemiology. 1990;1(1):43–46.
  3. Cook RJ, Farewell VT. Multiplicity considerations in the design and analysis of clinical trials. Journal of the Royal Statistical Society: Series A. 1996;159(1):93–110.
  4. Dmitrienko A, Tamhane AC, Bretz F. Multiple Testing Problems in Pharmaceutical Statistics. CRC Press; 2009.
  5. EMEA. Points to Consider on Multiplicity Issues in Clinical Trials. European Medicines Agency; 2002. CPMP/EWP/908/99.

Adapted for educational use. Please cite relevant trial methodology sources when using this material in research or teaching.