Possible limitations in a randomized clinical trial's capacity to determine the scientific truth: An analysis of the HERS and WHI trials
--------------------------------------------------------------------------------------------
This essay explores possible limitations in a randomized clinical trial's capacity to confidently determine the scientific truth in situations where three phenomena occur together: the clinical outcome event rate is low, the therapeutic (or harmful) effect of the drug is non-existent or relatively small in magnitude, and the patient drop-out rate is very high. I have focused on two recently published RCTs [1,2] that evaluate hormone replacement therapy (HRT), because both RCTs are characterized by low event rates, neutral or equivocally positive/negative results, and high patient drop-out rates.
Introduction:
For the past two decades, it was presumed, based on observational studies, that hormone replacement therapy (HRT) reduced the likelihood of coronary heart disease (CHD) events in post-menopausal women. The published figures for many observational studies of HRT therapy showed a large (35-45%) relative risk reduction of subsequent CHD events (MI or CV death) in post-menopausal women. Many researchers have discounted those findings as being due to selection bias. In 1998, the HERS study [1], which was the first RCT to study the value of HRT in the secondary prevention of CAD events, demonstrated that HRT had no beneficial effect in the secondary prevention of MI, CV death and stroke. In July 2002, the WHI study was published [2], and it unexpectedly demonstrated that HRT therapy increased the risk of CAD events and stroke. This unexpected finding dismayed many proponents of HRT therapy and led to recommendations that HRT therapy should not be used in post-menopausal women.
In his August 20th 2002 CMAJ commentary article on the WHI study [3], Salim Yusuf concluded by stating:-
"In conclusion, the WHI is a large, well-designed, and carefully conducted study that will have a tremendous impact of the health of women. The message for healthy women without severe symptoms of menopause is now clear: to avoid as far as possible HRT, which on balance does more harm than good."
It is my belief that there are many reasons to question the validity of that conclusion with respect to coronary heart disease events. Although a RCT is the optimum method of determining the scientific truth, a RCT has certain limitations that can make the accuracy of a trial's final conclusions questionable -- especially if the absolute number of events is low, the efficacy (or harm) of the tested drug is marginal, and the patient drop-out rate is high.
Analysis of the HERS study
Consider the details of the HERS study.
Randomized Trial of Estrogen Plus Progestin for Secondary Prevention of Coronary Heart Disease in Postmenopausal Women
[Original Contributions] Hulley, Stephen MD; Grady, Deborah MD; Bush, Trudy PhD; Furberg, Curt MD, PhD; Herrington, David MD; Riggs, Betty MD; Vittinghoff, Eric PhD
The Heart and Estrogen/progestin Replacement Study (HERS) Research Group. From the University of California, San Francisco (Drs Hulley, Grady, and Vittinghoff); The Johns Hopkins University, Baltimore, Md (Dr Bush); Wake Forest University School of Medicine, Winston-Salem, NC (Drs Furberg and Herrington); and Wyeth-Ayerst Research, Radnor, Pa (Dr Riggs). A complete list of the HERS Research Group participants appears at the end of this article.
Context: Observational studies have found lower rates of coronary heart disease (CHD) in postmenopausal women who take estrogen than in women who do not, but this potential benefit has not been confirmed in clinical trials.
Objective: To determine if estrogen plus progestin therapy alters the risk for CHD events in postmenopausal women with established coronary disease.
Design: Randomized, blinded, placebo-controlled secondary prevention trial.
Setting: Outpatient and community settings at 20 US clinical centers.
Participants: A total of 2763 women with coronary disease, younger than 80 years, and postmenopausal with an intact uterus. Mean age was 66.7 years.
Intervention: Either 0.625 mg of conjugated equine estrogens plus 2.5 mg of medroxyprogesterone acetate in 1 tablet daily (n=1380) or a placebo of identical appearance (n=1383). Follow-up averaged 4.1 years; 82% of those assigned to hormone treatment were taking it at the end of 1 year, and 75% at the end of 3 years.
Main Outcome Measures: The primary outcome was the occurrence of nonfatal myocardial infarction (MI) or CHD death. Secondary cardiovascular outcomes included coronary revascularization, unstable angina, congestive heart failure, resuscitated cardiac arrest, stroke or transient ischemic attack, and peripheral arterial disease. All-cause mortality was also considered.
Results: Overall, there were no significant differences between groups in the primary outcome or in any of the secondary cardiovascular outcomes: 172 women in the hormone group and 176 women in the placebo group had MI or CHD death (relative hazard [RH], 0.99; 95% confidence interval [CI], 0.80-1.22). The lack of an overall effect occurred despite a net 11% lower low-density lipoprotein cholesterol level and 10% higher high-density lipoprotein cholesterol level in the hormone group compared with the placebo group (each P<.001). Within the overall null effect, there was a statistically significant time trend, with more CHD events in the hormone group than in the placebo group in year 1 and fewer in years 4 and 5. More women in the hormone group than in the placebo group experienced venous thromboembolic events (34 vs 12; RH, 2.89; 95% CI, 1.50-5.58) and gallbladder disease (84 vs 62; RH, 1.38; 95% CI, 1.00-1.92). There were no significant differences in several other end points for which power was limited, including fracture, cancer, and total mortality (131 vs 123 deaths; RH, 1.08; 95% CI, 0.84-1.38).
Conclusions: During an average follow-up of 4.1 years, treatment with oral conjugated equine estrogen plus medroxyprogesterone acetate did not reduce the overall rate of CHD events in postmenopausal women with established coronary disease. The treatment did increase the rate of thromboembolic events and gallbladder disease. Based on the finding of no overall cardiovascular benefit and a pattern of early increase in risk of CHD events, we do not recommend starting this treatment for the purpose of secondary prevention of CHD. However, given the favorable pattern of CHD events after several years of therapy, it could be appropriate for women already receiving this treatment to continue.
JAMA. 1998;280:605-613
The HERS study investigators found that treatment with HRT, during an average follow-up period of 4.1 years, did not reduce the overall rate of CHD events in post-menopausal patients with established CHD.
However, the HERS investigators noted an increased rate of CHD events in the first year of the study in the HRT treated group (42.5 events/1000 person-years), and a tendency towards a decreased rate in the 3rd and 4th years of the study (21.4-28.7 events/1000 person-years). They therefore decided to continue the study for another 2.7 years (total of 6.8 years) to see if the tendency towards lower event rates persisted in the HRT patients, and they published the results of their extended study (HERS II) in July 2002 [4].
The final results for the 6.8 year time period confirmed that there was no overall benefit of HRT in the prevention of CV events in post-menopausal patients with a history of CHD.
Note that there was a minimally higher incidence of CHD events (MI or CHD death) in the HRT patients during the first 3 years of the trial, that the curves crossed over during the mid-trial period, and that there was a minimally higher event rate in the placebo patients in the latter years of the trial. The adjusted overall relative hazard (RH) figure for the entire trial was 0.97 with a relatively narrow confidence interval (95% CI 0.82-1.14). That figure suggested that HRT therapy had no effect on CHD events or, with a lower probability, a slightly positive or slightly negative effect.
However, consider the HERS trial's results in greater detail -- by looking at the results from year-to-year.
Note that the yearly rate of a primary CHD event in the placebo patients varied between 28-44.8 events per 1000 patient-years, and between 21.4-44.3 events per 1000 patient-years in the HRT patients.
If the overall results of the entire trial showed no difference between the two groups (average yearly rate of approximately 37 events per 1000 patient-years), then how does one account for the large variations from year-to-year within each group (HRT and placebo groups), and between the HRT and placebo groups?
Consider the following relative hazard (RH) figures for the different years of the HERS study:-
Year 1 -- 1.52
Year 2 -- 0.98
Year 3 -- 0.85
Year 4 -- 0.60
Year 5 -- 1.09
Year 6 -- 0.99
The "average" RH figure for the entire 6-year trial period was 1.00, and the RH figure was relatively close to that "average" figure in 4 of the 6 years (years 2, 3, 5 and 6). However, the RH figure in year 1 (1.52) was much higher than the average RH figure (1.0); and the RH figure in year 4 (0.60) was much lower than the "average" RH figure (1.0). Were those differences due to chance, other confounders, or both?
The HERS investigators were concerned about the theoretical possibility that HRT therapy had a marked prothrombotic tendency in the first year of the trial -- because the RH was 1.52. However, note that the CHD event rate in placebo patients was 28.8 in the first year of the trial, and that figure was much lower than the "average" event rate for all the other years of the trial (average of 38 for all the other trial years).
Could the low event rate in placebo patients in the first year of the trial be a chance event that artefactually inflated the calculated RH figure for the first year?
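A rough sketch of that question, using only the event rates quoted above. It takes the crude ratio of the two first-year rates, then recomputes it with the placebo rate replaced by the average quoted for the other trial years. The crude ratio is an assumption standing in for the investigators' relative hazard model, so it need not match the reported figure of 1.52 exactly.

```python
# Event rates per 1000 person-years, as quoted in the text above.
HRT_YEAR1_RATE = 42.5
PLACEBO_YEAR1_RATE = 28.8
PLACEBO_OTHER_YEARS_RATE = 38.0  # quoted average for the other trial years


def crude_rate_ratio(treated_rate, placebo_rate):
    """Crude ratio of two event rates (an approximation, not a modelled relative hazard)."""
    return treated_rate / placebo_rate


observed_ratio = crude_rate_ratio(HRT_YEAR1_RATE, PLACEBO_YEAR1_RATE)
counterfactual_ratio = crude_rate_ratio(HRT_YEAR1_RATE, PLACEBO_OTHER_YEARS_RATE)

print(f"Year 1 crude ratio with quoted placebo rate:      {observed_ratio:.2f}")
print(f"Year 1 crude ratio with 'average' placebo rate:   {counterfactual_ratio:.2f}")
```

Under this crude calculation the first-year ratio falls from about 1.48 to about 1.12 when the placebo rate is set to its average for the other years, which is the sense in which an unusually low placebo rate could inflate the first-year RH.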
Also, note that the HERS investigators performed a secondary analysis of their trial's results and came up with the following figures.
The *adjusted figures were based on an unspecified adjustment that considered all those listed variables (eg. age, ethnicity, smoking, etc.).
Note that their *adjusted RH figures didn't differ very much from their unadjusted RH results. Although there was little difference between the unadjusted and *adjusted figures, it is important to note that it is impossible to critique the accuracy of their *adjusted figures, because the trial investigators supplied no details regarding the underlying assumptions that they used to make their adjustments.
Also, note that they performed another analysis involving "as treated" patients. In their analysis of "as treated" patients (patients who were known to have taken their medications), their RH results were significantly different to the unadjusted RH figures. For instance, although there was no significant difference in years 1, 2, and 4, there were large differences in years 3, 5 and 6.8. Why was there such a large variation? Were the differences due to confounders or chance?
Are their "as treated" figures more meaningful than the unadjusted figures? Why, and to what degree?
I personally think that the variability in those estimated RH figures (unadjusted RH, RH adjusted for potential confounders, and RH adjusted for "as treated" patients) demonstrates the "real life" weakness of RCTs that have borderline (equivocal) RH results -- the "subjective" interpretation of the final results is markedly affected by "arbitrary" corrections for known (or unknown) confounders, and by the magnitude of the drop-out rate (number of patients who stop taking their medications during the trial).
Before I expand on that particular point, I'd like to make the problem-issue of the "as treated" analysis clearer.
Consider a hypothetical example of the potential weaknesses inherent in a RCT in "real-life" terms.
Presume that 1000 ACS patients (500 treated patients and 500 placebo patients), who had positive serum troponin test results, were randomized to a RCT, and that the treated patients would be treated with drug X for a 3 month period to determine whether the drug would decrease the 3-month CV mortality rate. Presume that 10% of the treated patients stopped taking drug X every week. How accurate would the final results be, and what potential errors could occur when analysing the trial's results?
The first potential error could theoretically occur during the randomization process, and be due to an imbalance in a critically important prognostic variable between the treated and placebo patients -- the severity of the primary ACS event and its effect on future mortality.
It is an obvious error to assume that all ACS patients with a positive serum troponin result have the same (or similar) likelihood of short-term mortality. The likelihood of short-term mortality in ACS patients depends on the magnitude of the serum troponin elevation.
Consider the following graph:-
Note that the short-term mortality is markedly affected by the serum troponin level at the time of the ACS event -- to a maximum difference of approximately 7x (700%). Therefore, it would be critically important to ensure that the drug X-treated and placebo patients have a perfectly balanced group of patients -- based on the magnitude of elevation of the serum troponin level -- at the time of randomization. The trial would be highly flawed, and internally invalid, if all the treated patients had serum troponin levels in the range of 0.4-2.0, while all the placebo patients had serum troponin levels in the range of 5.0-10.0.
However, an equal balance in serum troponin levels at the time of starting the trial doesn't ensure continued perfect randomization if the drop-out rate is 10% per week. At the end of the 4th week, 40% of the patients would not be taking their drugs, and there is no guarantee that the number of treated patients still taking their drugs (60% of the original set) would have the same proportional balance in the number of high versus low serum troponin levels as the treatment group at the start of the trial. Theoretically, the balance in the treated patients could be changing from week-to-week, which would create a completely unbalanced situation at all time-points during the trial (with respect to the placebo patients).
The problem-situation of a varying baseline imbalance would be compounded if the number of treated (or placebo) patients taking other drugs (eg. aspirin, clopidogrel, statins), that could significantly affect short-term CV mortality, also changed from week-to-week. An adjusted "as treated" analysis would then also have to correct for the constantly changing drug combinations + constantly changing balance in a baseline prognostic variable (number of patients still taking the drug, who have a high serum troponin versus a low serum troponin level).
An analysis of the trial's final results would obviously be very different if it was analysed from an "as treated" rather than an "intention to treat" perspective. Also, the final "as treated" result-interpretation would vary depending on how one corrected for the constantly-changing baseline prognostic variables (number of treated patients with particular baseline serum troponin levels) and other constantly-changing confounders (different drug combinations) on an ongoing basis.
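The week-by-week drift in balance described in this hypothetical example can be sketched with a few lines of arithmetic. Everything specific here is an assumption made for illustration: the treated arm is taken to start with 250 high-troponin and 250 low-troponin patients, and the two strata are taken to drop out at 14% and 6% of their starting size per week, which averages to the 10% per week used in the example. The function names are hypothetical.

```python
START_PER_STRATUM = 250  # hypothetical: 500 treated patients, split evenly by troponin level

# Hypothetical weekly drop-out, as a percentage of each stratum's starting size.
# The two values average to the 10% per week used in the example.
DROPOUT_PCT_HIGH = 14
DROPOUT_PCT_LOW = 6


def still_on_drug(n_start, weekly_dropout_pct, week):
    """Patients still taking the drug after `week` weeks of linear drop-out."""
    remaining_pct = max(0, 100 - weekly_dropout_pct * week)
    return n_start * remaining_pct // 100


def high_troponin_share(week):
    """Fraction of patients still on drug who had a high baseline troponin."""
    high = still_on_drug(START_PER_STRATUM, DROPOUT_PCT_HIGH, week)
    low = still_on_drug(START_PER_STRATUM, DROPOUT_PCT_LOW, week)
    return high / (high + low)


for week in range(5):
    high = still_on_drug(START_PER_STRATUM, DROPOUT_PCT_HIGH, week)
    low = still_on_drug(START_PER_STRATUM, DROPOUT_PCT_LOW, week)
    print(f"week {week}: on drug = {high + low}, high-troponin share = {high_troponin_share(week):.1%}")
```

With these assumed drop-out rates, 60% of the treated patients are still on drug X at the end of week 4, but only about 37% of them are high-troponin patients, compared with 50% in an unchanged placebo group. An "as treated" comparison at that point would be comparing groups with different baseline risk.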
Could those confounding phenomena (as described in the above hypothetical scenario) have affected the HERS trial?
In the HERS trial, at the time of randomisation, patients were selected on the basis of a history of established CHD, which was defined as evidence of one or more of the following:- i) MI, ii) coronary artery bypass graft surgery, iii) percutaneous coronary revascularization, iv) angiographic evidence of at least 50% occlusion of one or more coronary arteries.
Do patients in each of those categories have the same risk of a subsequent CHD event? I cannot easily imagine that a patient, who only has angiographic evidence of coronary artery occlusion of 50%, is necessarily as high-risk-a-patient as a patient who has had a large anterior MI, or who has required a coronary vascular procedure, with respect to future CHD events.
What about the claim that the treatment and placebo groups were adequately balanced for those 4 different CHD subgroups at the time of randomization? I think that even if the 4 subgroups were balanced at the start of the trial, there is no guarantee that the balance between them remains the same during the entire trial period if the drop-out rate is so high. In the HERS trial, the number of treated patients taking HRT decreased as the trial proceeded, and in the 6th year of the trial period only 45% of the treated patients were still taking HRT (drop-out rate was 55%).
Consider the following graph:-
Note that the drop-out rate changes non-linearly from year-to-year, and there is no guarantee that, at the beginning of any particular year, the patients still taking HRT have the same balance of baseline coronary risk factors as the placebo group.
As the trial proceeded, the HRT-treated patients may have had a constantly changing balance in the severity-pattern of known coronary disease (compared to the placebo group, and the treated group at trial onset), a constantly changing pattern of drug-drug interactions, and a constantly changing pattern of other confounders (lifestyle changes -- dieting, exercise) that can affect the treated patients' overall coronary risk profile at any point-in-time. Those cumulative changes may not be significant if the magnitude of the drug's effect is considerable, but those cumulative changes could presumably have a significant confounding effect if the effect of the drug is small (or nearly non-existent).
In the HERS II article [4], the authors performed a secondary analysis of time-dependent covariates (eg. other drug use), in an attempt to adjust for potential confounders. They also performed secondary "as-treated" analyses, but they were appropriately concerned that their conclusions had limited value, because of the small number of CHD events in the "as-treated" groups. Note that their calculated figure for the "as-treated" RH for primary CHD events in the HERS II sample was 0.82, which was significantly lower than the unadjusted RH of 1.0, and that it had wide confidence intervals (95% CI, 0.52-1.32).
How can one be sure which RH figure more accurately represents the results of the HERS trial -- the unadjusted RH figure or their adjusted RH figure for "as-treated" patients?
I think that this could be a major limitation of RCTs that have final unadjusted RH scores close to 1.0 plus high drop-out rates plus low event rates. The adjusted RH result (after "appropriate" adjustments are made for potential confounders and high drop-out rates) could be slightly higher or slightly lower than 1.0, and there is apparently no objective method of determining if the adjusted RH figure is valid. Different trial interpreters could review the same data and come to different conclusions, because they would "weigh" the facts differently.
Steven Goodman analysed the mammography controversy in a recent article [5] and he commented on the significant dilemma that occurs when two experts perform a systematic review of a series of mammography RCTs. Their choice of which RCTs to include in their systematic review may be very different, and it is therefore not surprising that different experts come to different conclusions.
In that article, he stated:-
"One of the tenets of evidence-based medicine is that scientific demonstrations of efficacy, from randomized, controlled trials (RCTs) or carefully designed population studies, should supersede expert opinions about efficacy. However, this controversy shows that the justification for why studies are included or excluded from the evidence base can rest on competing claims of methodologic authority that look little different from the traditional claims of medical authority that proponents of evidence-based medicine have criticized ....... Some may argue that this is exactly the kind of debate in which the explicitness of the methods of evidence-based medicine shows its value, making it clear where and why people differ. The problem with this position is that every scientific methodology has untestable foundational assumptions, and these assumptions are precisely what are being contested here. In the mammography debate, the untestable assumptions relate to assessment of RCT quality and how that assessment should affect an analysis. The evaluations of RCT quality in the USPSTF and Cochrane reviews are not nearly as different as their final decisions about study inclusion or exclusion. The reasons for these decisions are far from explicit; each group merely asserts that its approach brings us closer to the truth.
There can be no definitive resolution of this question because RCTs are their own gold standard, making any empirical test about the validity of a particular one either impossible or circular.
An influential group of philosophers in the 1930s called the “logical positivists” (originally including Karl Popper) presaged the problem that evidence-based medicine faces here by trying to establish a secure foundation for scientific knowledge based only on observed facts. They failed, in part because observation requires tools (for example, microscopes) whose interpretation requires belief in underlying theories (for example, optics) whose components are not all directly observable. In other words, observational facts are always dependent on an underlying theory about what makes the observation reliable. In evidence-based medicine, when the observation is made through an RCT, the theory is about what constitutes a “sufficiently valid” RCT. This debate shows us that when there is no consensus on that theory, there cannot be consensus on the facts."
I think that the same problem of what constitutes a "sufficiently valid" RCT applies to the internal analysis of a single trial, like the HERS trial -- how can one determine that the HERS trialists' adjusted analysis of the HERS trial is "sufficiently valid", and would different expert trial-interpreters come to the same conclusion if they reviewed the same data?
Before I consider that issue further, I would like to analyse the CHD event results of the WHI trial -- a RCT involving healthy patients, rather than known CAD patients.
Analysis of the WHI trial
Consider the evidence from the WHI trial.
Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial -- Writing Group for the Women's Health Initiative Investigators [2]
Context:
Despite decades of accumulated observational evidence, the balance of risks and benefits for hormone use in healthy postmenopausal women remains uncertain.
Objective:
To assess the major health benefits and risks of the most commonly used combined hormone preparation in the United States.
Design:
Estrogen plus progestin component of the Women's Health Initiative, a randomized controlled primary prevention trial (planned duration, 8.5 years) in which 16608 postmenopausal women aged 50-79 years with an intact uterus at baseline were recruited by 40 US clinical centers in 1993-1998.
Interventions:
Participants received conjugated equine estrogens, 0.625 mg/d, plus medroxyprogesterone acetate, 2.5 mg/d, in 1 tablet (n = 8506) or placebo (n = 8102).
Main Outcomes Measures:
The primary outcome was coronary heart disease (CHD) (nonfatal myocardial infarction and CHD death), with invasive breast cancer as the primary adverse outcome. A global index summarizing the balance of risks and benefits included the 2 primary outcomes plus stroke, pulmonary embolism (PE), endometrial cancer, colorectal cancer, hip fracture, and death due to other causes.
Results:
On May 31, 2002, after a mean of 5.2 years of follow-up, the data and safety monitoring board recommended stopping the trial of estrogen plus progestin vs placebo because the test statistic for invasive breast cancer exceeded the stopping boundary for this adverse effect and the global index statistic supported risks exceeding benefits. This report includes data on the major clinical outcomes through April 30, 2002. Estimated hazard ratios (HRs) (nominal 95% confidence intervals [CIs]) were as follows: CHD, 1.29 (1.02–1.63) with 286 cases; breast cancer, 1.26 (1.00–1.59) with 290 cases; stroke, 1.41 (1.07–1.85) with 212 cases; PE, 2.13 (1.39–3.25) with 101 cases; colorectal cancer, 0.63 (0.43–0.92) with 112 cases; endometrial cancer, 0.83 (0.47–1.47) with 47 cases; hip fracture, 0.66 (0.45–0.98) with 106 cases; and death due to other causes, 0.92 (0.74–1.14) with 331 cases. Corresponding HRs (nominal 95% CIs) for composite outcomes were 1.22 (1.09–1.36) for total cardiovascular disease (arterial and venous disease), 1.03 (0.90–1.17) for total cancer, 0.76 (0.69–0.85) for combined fractures, 0.98 (0.82–1.18) for total mortality, and 1.15 (1.03–1.28) for the global index. Absolute excess risks per 10000 person-years attributable to estrogen plus progestin were 7 more CHD events, 8 more strokes, 8 more PEs, and 8 more invasive breast cancers, while absolute risk reductions per 10000 person-years were 6 fewer colorectal cancers and 5 fewer hip fractures. The absolute excess risk of events included in the global index was 19 per 10000 person-years.
Conclusions:
Overall health risks exceeded benefits from use of combined estrogen plus progestin for an average 5.2-year follow-up among healthy postmenopausal US women. All-cause mortality was not affected during the trial. The risk-benefit profile found in this trial is not consistent with the requirements for a viable intervention for primary prevention of chronic diseases, and the results indicate that this regimen should not be initiated or continued for primary prevention of CHD.
First of all, note that the point estimate RH figure for CHD events was 1.29 (95% CI 1.02-1.63), and that the 95% confidence interval was very wide for the unadjusted RH figure -- the "true" RH figure could be anywhere between 1.02 (no effect) and 1.63 (moderately harmful effect). That confidence interval was based on the unadjusted RH figure. The adjusted RH's 95% confidence interval is even wider (95% CI 0.85-1.97), but the WHI investigators did not provide sufficient details about the adjustment methodology to enable one to understand why there was such a large difference between the unadjusted and adjusted confidence intervals.
I have noted that some commentators (and newspaper reporters) have entirely focused their attention on the point estimate RH figure of 1.29, and they have seemingly concluded that HRT therapy definitely has deleterious cardiac effects (although the WHI investigators only claimed in their original article that their study failed to demonstrate that HRT therapy had any beneficial effect in the primary prevention of CHD events). Those commentators may become less confident in their definite conclusion if they studied the year-by-year CHD event rates in greater depth.
Year of study (number of participants) -- HRT therapy: number of patients with CHD event (annualized percentage) -- Placebo therapy: number of patients with CHD event (annualized percentage) -- Hazard ratio
Year 1 (8435 treated, 8050 placebo) -- 43 (0.51%) -- 23 (0.29%) -- 1.78
Year 2 (8353 treated, 7980 placebo) -- 36 (0.43%) -- 30 (0.38%) -- 1.15
Year 3 (8268 treated, 7888 placebo) -- 20 (0.24%) -- 18 (0.23%) -- 1.06
Year 4 (7926 treated, 7562 placebo) -- 25 (0.32%) -- 24 (0.32%) -- 0.99
Year 5 (5964 treated, 5566 placebo) -- 23 (0.39%) -- 9 (0.16%) -- 2.38
Year 6 plus (5129 treated, 4243 placebo) -- 17 (0.33%) -- 18 (0.42%) -- 0.78
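To give a feel for how much uncertainty sits behind each yearly hazard ratio in the table above, the sketch below recomputes a crude rate ratio for each year from the tabulated counts and attaches an approximate 95% interval. Two assumptions are mine rather than the WHI investigators': the number of participants in each year is used as the denominator in place of person-years, and the interval is the usual large-sample (Wald) interval on the log rate ratio, with standard error sqrt(1/a + 1/b) for event counts a and b. The names WHI_CHD_BY_YEAR and rate_ratio_with_ci are illustrative.

```python
import math

# (label, HRT events, HRT participants, placebo events, placebo participants)
WHI_CHD_BY_YEAR = [
    ("Year 1", 43, 8435, 23, 8050),
    ("Year 2", 36, 8353, 30, 7980),
    ("Year 3", 20, 8268, 18, 7888),
    ("Year 4", 25, 7926, 24, 7562),
    ("Year 5", 23, 5964, 9, 5566),
    ("Year 6 plus", 17, 5129, 18, 4243),
]


def rate_ratio_with_ci(events_a, n_a, events_b, n_b, z=1.96):
    """Crude rate ratio with an approximate Wald interval on the log scale."""
    ratio = (events_a / n_a) / (events_b / n_b)
    se_log = math.sqrt(1 / events_a + 1 / events_b)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


for label, hrt_events, hrt_n, placebo_events, placebo_n in WHI_CHD_BY_YEAR:
    ratio, lower, upper = rate_ratio_with_ci(hrt_events, hrt_n, placebo_events, placebo_n)
    print(f"{label}: ratio {ratio:.2f} (approx. 95% interval {lower:.2f} to {upper:.2f})")
```

On these assumptions the crude ratios fall within about 0.01 of the tabulated hazard ratios, and the year 5 interval runs from roughly 1.1 to 5.2. That illustrates how little a single year's ratio pins down when it rests on 23 and 9 events.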
First of all, note that the "average" CHD event rate was very low -- roughly 0.37% per year (in both the placebo and treated patients). The importance of the low number of CHD events per year is that a small number of chance CHD events can have a much greater effect on the final results than would be the case if the yearly number of CHD events was 10x larger.
Is there any evidence that chance was playing a major role in the trial's results?
Look at the placebo patients' results from years 4 - 6.
Year 4 -- 0.32% annualized percentage CHD event rate
Year 5 -- 0.16% annualized percentage CHD event rate
Year 6 -- 0.42% annualized percentage CHD event rate
Is there any physiological or pathophysiological reason that could explain why the placebo patients in year 5 had a >50% relative reduction in CHD events in that particular year (compared to the neighbouring years)?
I strongly suspect that chance played a major role in the placebo patients' CHD outcome results in year 5, and that the "true" expected CHD event rate figure for the placebo patients in year 5 should probably be somewhere between 0.32% and 0.42%, and maybe reasonably close to the figure of 0.37% (half-way between 0.32% and 0.42%) -- based on simple physiological expectations (a belief that there is no obvious pathophysiological reason why the placebo patients should have a CHD event rate in year 5 that is significantly different to the immediate neighbouring years).
What effect would it have on the overall results of the trial's 5 year time-period -- from the second year to the end of the study -- if the placebo patients' CHD event rate figure for the 5th year was 0.37%?
The trial's results would look like this:-
Year -- treated patients -- placebo patients
Year 2 -- 0.43% -- 0.38%
Year 3 -- 0.24% -- 0.23%
Year 4 -- 0.32% -- 0.32%
Year 5 -- 0.39% -- 0.37%
Year 6 -- 0.33% -- 0.42%
Yearly average -- 0.34% -- 0.34%
It is surprising to realise that if a mere 11 additional placebo patients had had a CHD event in the 5th year of the trial, there would be zero difference in the overall rate of CHD events between placebo and treated patients for the last 5 years of the trial, and that the study's conclusion could be that HRT therapy had a neutral effect on cardiac events after the first year.
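The arithmetic behind that statement can be checked directly from the yearly counts in the WHI table. The sketch below is only a recalculation under the assumption made in the text, namely that the "true" placebo rate in year 5 was 0.37%; annualized rates are approximated as events divided by participants, and the variable names are illustrative.

```python
# (HRT events, HRT participants, placebo events, placebo participants), years 2 to 6 plus
WHI_CHD = {
    "Year 2": (36, 8353, 30, 7980),
    "Year 3": (20, 8268, 18, 7888),
    "Year 4": (25, 7926, 24, 7562),
    "Year 5": (23, 5964, 9, 5566),
    "Year 6": (17, 5129, 18, 4243),
}
ASSUMED_PLACEBO_YEAR5_RATE = 0.0037  # the 0.37% figure proposed in the text


def mean(values):
    values = list(values)
    return sum(values) / len(values)


treated_rates = [he / hn for he, hn, _, _ in WHI_CHD.values()]
placebo_rates = [pe / pn for _, _, pe, pn in WHI_CHD.values()]
adjusted_placebo_rates = [
    ASSUMED_PLACEBO_YEAR5_RATE if year == "Year 5" else pe / pn
    for year, (_, _, pe, pn) in WHI_CHD.items()
]

# Extra placebo events in year 5 needed to reach the assumed rate.
_, _, year5_events, year5_n = WHI_CHD["Year 5"]
extra_events_needed = ASSUMED_PLACEBO_YEAR5_RATE * year5_n - year5_events

print(f"Mean yearly rate, treated:            {100 * mean(treated_rates):.2f}%")
print(f"Mean yearly rate, placebo (observed): {100 * mean(placebo_rates):.2f}%")
print(f"Mean yearly rate, placebo (adjusted): {100 * mean(adjusted_placebo_rates):.2f}%")
print(f"Extra year 5 placebo events needed:   {extra_events_needed:.1f}")
```

The observed placebo average for years 2 to 6 is about 0.30% against 0.34% in the treated patients; moving between 11 and 12 events into the placebo column in year 5 is enough to close that gap.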
Could a few chance events (slightly decreased number of CHD events in placebo patients in the 5th year of the trial) really have biased the results of the trial, and radically affected the way that some people would interpret the trial's results?
I think that any answer to that question may be entirely subjective, and that there is no objective method of definitely proving that chance played a significant role.
I think that it is very important that clinicians realize that a single RCT (even if it contains a large number of patients) cannot necessarily determine the scientific truth with a sufficient degree of confidence -- if the study population has a low number of outcome events + the drug is minimally efficacious (or near-neutral in its effect) + the drop-out rate is high (which causes the "between-group contrasts" to be confounded by the fact that the treated group may no longer be balanced in terms of baseline prognostic variables at various time-points during the trial).
How can one establish a level of confidence in the conclusion of the WHI trial?
In a highly instructive article [6], David Sackett described a simple formula for determining the level of confidence in the conclusion of a clinical trial.
If expressed in words, the formula states that the confidence in the conclusion of an RCT is the ratio of the magnitude of the signal to the magnitude of the noise, times the square root of the sample size. To selectively quote from David Sackett's article, which was targeted at clinical trialists who would be designing clinical trials:-
Confidence describes how narrow the confidence interval is (the narrower the better) around the effect of treatment, whether expressed as an absolute or relative risk reduction, or as some other measure of efficacy.
The signal describes the differences between the effects of the experimental and control treatments.
The noise (or uncertainty) in an RCT is the sum of all the factors - "sources of variation" - that can affect the absolute risk reduction or absolute difference.
The sample size is the number of patients in the trial. The influence of sample size on the confidence level of a trial is a function of its square root - if a trial designer wants to cut the confidence interval around a study's absolute risk reduction in half by adding more patients to it, he needs to quadruple the number of recruited patients. That is why trial designers may choose to concentrate their efforts on increasing the signal and decreasing the noise, rather than having to significantly increase the sample size in order to achieve the same effect.
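The square-root relationship described above (confidence = signal / noise x square root of sample size) can be illustrated with a short sketch. It assumes a standard Wald-type 95% confidence interval for a difference between two proportions, which is a common textbook approximation and not necessarily the method used in either trial. The event rates and patient numbers are hypothetical and chosen only for illustration.

```python
import math

def arr_ci_half_width(cer, eer, n_per_arm, z=1.96):
    """Approximate half-width of the 95% CI around the absolute risk
    reduction, using the Wald interval for a difference in proportions.
    cer = control event rate, eer = experimental event rate."""
    variance = cer * (1 - cer) / n_per_arm + eer * (1 - eer) / n_per_arm
    return z * math.sqrt(variance)

# Hypothetical event rates and arm size (not taken from HERS or WHI).
cer, eer = 0.04, 0.03
n_per_arm = 8000

wide = arr_ci_half_width(cer, eer, n_per_arm)
narrow = arr_ci_half_width(cer, eer, 4 * n_per_arm)

# Quadrupling the sample size halves the width of the interval.
ratio = wide / narrow
print(f"half-width with n: {wide:.5f}")
print(f"half-width with 4n: {narrow:.5f}")
print(f"ratio: {ratio:.2f}")
```

Under these assumptions the ratio of the two half-widths is 2, which is the point Sackett makes: halving the interval by recruitment alone requires four times as many patients.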
Four determinants affect the magnitude of the signal generated in a RCT:
- "baseline" or control group's risk of an outcome event
- response of experimental patients to the treatment
- potency of the experimental treatment
- completeness with which outcome events are ascertained and included in the analysis
Restricting eligibility to patients who are at higher than average "baseline" risk of outcome events leads to higher "control event rates" (CER) among those receiving placebo or the treatment. Because the absolute risk reduction signal is equivalent to the product of this control event rate and the relative risk reduction from therapy (ARR = CER x RRR) it follows that, if the relative risk reduction achieved by the experimental treatment is both true and constant over different control event rates, the experimental treatment will generate a larger absolute risk reduction signal when the control event rate is high than when it is low.
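The relationship ARR = CER x RRR can be made concrete with a small sketch. The two control event rates and the relative risk reduction below are hypothetical values chosen only to show the effect of a tenfold difference in baseline risk; they are not figures from the HERS or WHI trials.

```python
def absolute_risk_reduction(cer, rrr):
    """ARR = CER x RRR, assuming the relative risk reduction is
    constant across different control event rates."""
    return cer * rrr

# Hypothetical values for illustration only.
rrr = 0.25            # assumed constant relative risk reduction
low_risk_cer = 0.03   # low-risk population
high_risk_cer = 0.30  # high-risk population (10x the baseline risk)

low_signal = absolute_risk_reduction(low_risk_cer, rrr)
high_signal = absolute_risk_reduction(high_risk_cer, rrr)

print(f"ARR in low-risk population:  {low_signal:.4f}")
print(f"ARR in high-risk population: {high_signal:.4f}")
```

With the same relative effect, the absolute signal scales in direct proportion to the control event rate, so the high-risk population yields a signal ten times larger.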
The second way that one can increase the ARR signal and the confidence in a positive result, is by selectively enrolling highly responsive patients who are more likely (than average) to respond to the experimental therapy.
The noise element in a trial is reduced by eliminating or minimizing sources of uncertainty.
Variations in the outcome of study patients can be reduced by making the patients more homogeneous - using the same strategies used to improve signal: assembling patients with similar risks and similar responsiveness, and making experimental and control patients as similar as possible.
Ensuring high compliance and minimizing sloppiness and inconsistency in the ascertainment of outcomes are other ways of reducing noise.
How do Sackett's confidence-generating principles apply to the CHD events in the WHI trial?
Note that the WHI trial had a very small signal level with respect to CHD events -- because i) the "baseline" or control group's risk of an outcome CHD event was very low, ii) the potency of the experimental treatment (HRT) in influencing the CHD event rate was very limited, and iii) there was a high drop-out rate (many patients stopped taking the experimental treatment during the trial) which significantly affects the signal power of the experimental treatment.
Also, note that the WHI trial had a very high noise level with respect to CHD events -- because i) the degree of homogeneity between the treated group and the placebo group in terms of prognostic variables was not necessarily consistent throughout the trial (high drop-out rate) and the degree of homogeneity probably diminished as the trial proceeded, ii) the value of the adjusted RH would vary depending on an arbitrary (subjective) choice of selected adjustments that would be needed to correct for the high drop-out rate and other confounders, and iii) chance events could have a much greater effect as a confounder, because the CHD outcome event rate was low and the anticipated signal small.
The cumulative effect of all those multiple factors would suggest that the WHI trial had a low signal/noise ratio with respect to CHD events.
By contrast, the HERS trial recruited patients with known cardiac disease, their study population was much more homogeneous than the WHI population, and the CHD event rate was much higher than in the WHI study (~10x greater). Those factors increased the potential power of the signal in the HERS trial. However, there was still a considerable noise problem due to the high drop-out rate, and the trial's final conclusion depends on whether one chooses to use an unadjusted "intention to treat" analysis, or an adjusted "as treated" analysis (a major source of interpretative noise).
Overall, I would be more inclined to accord the HERS trial's CHD outcome event results a greater level of confidence than the WHI trial's CHD outcome event results (i.e., more inclined to regard the HERS trial as having a higher signal/noise ratio), and I am therefore slightly more inclined to believe that HRT probably has a neutral effect on CHD events (or perhaps a slightly beneficial effect depending on the choice of adjustments used in an "as treated" analysis).
If HRT really has a neutral effect on CHD events in high risk patients (based on the conclusions of the HERS trial), what is the likelihood that it would have a significantly deleterious effect in low risk patients (healthy post-menopausal women)?
I think that one may need to consider the results of the WHI trial with a certain a priori bias, based on one's knowledge of the results of the HERS trial, which demonstrated that HRT therapy had a neutral effect on CHD events in high risk patients. Some people would strongly object to the deliberate assumption of an a priori bias, and they would state that the WHI trial should be independently judged on its own merits -- depending on the P and CI values that support its results. I don't agree with that viewpoint, because I am very sympathetic to the idea that P values are mainly reflective of the statistical significance of a trial's results, and that P values offer limited useful information regarding the clinical significance of a trial's results. P values also do not indicate whether the trial's results are clinically plausible, and a final decision regarding clinical plausibility requires a judgement-call that is primarily based on a clinician's in-depth knowledge of the relevant medical literature.
Steven Goodman has repeatedly critiqued the use of P values in clinical trials, and in a recent article [7] he stated:-
"The root cause of our problem is a philosophy of scientific inference that is supported by the statistical methodology in dominant use. This philosophy might best be described as a form of “naïve inductivism,” a belief that all scientists seeing the same data should come to the same conclusions. By implication, anyone who draws a different conclusion must be doing so for nonscientific reasons. It takes as given the statistical models we impose on data, and treats the estimated parameters of such models as direct mirrors of reality rather than as highly filtered and potentially distorted views. It is a belief that scientific reasoning requires little more than statistical model fitting, or in our case, reporting odds ratios, P-values and the like, to arrive at the truth.
How is this philosophy manifest in research reports? One merely has to look at their organization. Traditionally, the findings of a paper are stated at the beginning of the discussion section. It is as if the finding is something derived directly from the results section. Reasoning and external facts come afterward, if at all. That is, in essence, naïve inductivism. This view of the scientific enterprise is aided and abetted by the P-value in a variety of ways, some obvious, some subtle. The obvious way is in its role in the reject/accept hypothesis test machinery. The more subtle way is in the fact that the P-value is a probability – something absolute, with nothing external needed for its interpretation.
Now let us imagine another world – a world in which we use an inferential index that does not tell us where we stand, but how much distance we have covered. Imagine a number that does not tell us what we know, but how much we have learned. Such a number could lead us to think very differently about the role of data in making inferences, and in turn lead us to write about our data in a profoundly different manner.
This is not an imaginary world; such a number exists. It is called the Bayes factor. It is the data component of Bayes Theorem. The odds we put on the null hypothesis (relative to others) using data external to a study is called the “prior odds,” and the odds after seeing the data is the “posterior odds.” The Bayes factor tells us how far apart those odds are, ie, the degree to which the data from a study move us from our initial position. It is quite literally an epistemic odds ratio, the ratio of posterior to prior odds, although it is calculable from the data, without those odds. It is the ratio of the data’s probability under two competing hypotheses.........The Bayes factor is a measure of evidence in the same way evidence is viewed in a legal setting, or informally by scientists. Evidence moves us in the direction of greater or lesser doubt, but except in extreme cases it does not dictate guilt or innocence, truth or falsity."
I am very sympathetic to Goodman's idea that we always have an epistemic "prior odds" mental idea in our minds prior to reviewing the results of a clinical trial, and that the revised odds after reviewing a trial's data is the epistemic "posterior odds". The evidence from a trial does not exist in isolation as an absolute event with absolute authority. Evidence from a trial only moves one's prior belief in the direction of greater or lesser doubt, depending on one's level of confidence in the conclusions of the trial. Virtually all the HRT trials (whether observational or RCT) suggest that HRT predisposes the treated patient to venous thromboembolism, and the published RH figures are uniformly high (RH >2). Under those circumstances (high RH values with narrow confidence intervals), the trial's evidence is very convincing because the trial's signal/noise ratio is high. However, the situation is very different when a trial has a low signal/noise ratio.
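The relationship between prior odds, Bayes factor and posterior odds that Goodman describes can be sketched in a few lines. The prior odds and Bayes factors used here are hypothetical numbers chosen for illustration; they are not estimates for the HERS or WHI trials, and the function names are my own.

```python
def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds on the null hypothesis = prior odds x Bayes factor,
    where the Bayes factor is P(data | null) / P(data | alternative)."""
    return prior_odds * bayes_factor

def odds_to_probability(odds):
    """Convert odds in favour of a hypothesis to a probability."""
    return odds / (1 + odds)

# Hypothetical prior: odds of 4 to 1 that the null hypothesis is true.
prior = 4.0

# Weak evidence against the null (Bayes factor 0.5) barely shifts belief.
weak = posterior_odds(prior, 0.5)

# Strong evidence against the null (Bayes factor 0.01) shifts it a lot.
strong = posterior_odds(prior, 0.01)

print(f"prior probability of null:      {odds_to_probability(prior):.3f}")
print(f"after weak evidence (BF 0.5):   {odds_to_probability(weak):.3f}")
print(f"after strong evidence (BF 0.01): {odds_to_probability(strong):.3f}")
```

The same prior belief ends up in very different places depending on the strength of the evidence, which is the sense in which a low signal/noise trial should move one's prior belief only a short distance.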
I think that the issue of "level of confidence in the conclusions of a trial" is a major issue, especially when trial interpreters have to interpret the results of a low signal/noise ratio RCT -- due to the combination of a low control event rate, limited drug efficacy and a significant drug drop-out/drop-in rate involving multiple effective drugs (eg, aspirin, beta-blockers, statins, ACEIs in CHD trials). It would seem to me that an adjusted analysis based on the "as treated" patients is a necessary requirement in clinical trials that have high drop-out/drop-in rates, but I have not seen any discussion in the mainstream medical literature regarding the optimum method of performing that adjusted analysis. I also have not seen any concrete recommendations in the mainstream medical literature on how best to interpret clinical trials from a Bayesian perspective, and it would seem to me that a Bayesian interpretation may be especially important when evaluating the conclusions of randomised trials that have a low signal/noise ratio.
Jeff Mann, MD. Retired physician.
September 2002.
E-mail: jmannemg@earthlink.net
References:
1. Hulley S, Grady D, Bush T, et al, for the HERS Research Group. Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women: Heart and Estrogen/progestin Replacement Study (HERS) Research Group. JAMA. 1998;280:605–613.
2. Writing Group for the Women's Health Initiative Investigators. Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial. JAMA. 288(3):321-333, July 17, 2002.
3. Yusuf, Salim. Anand, Sonia. Hormone replacement therapy: a time for pause. CMAJ. 167(4):357-359, August 20, 2002.
4. Grady, Deborah MD, MPH. Herrington, David MD, MHS. Bittner, Vera MD. Blumenthal, Roger MD. Davidson, Michael MD. Hlatky, Mark MD. Hsia, Judith MD. Hulley, Stephen MD, MPH. Herd, Alan MD. Khan, Steven MD. Newby, L. Kristin MD. Waters, David MD. Vittinghoff, Eric PhD. Wenger, Nanette MD. for the HERS Research Group. Cardiovascular Disease Outcomes During 6.8 Years of Hormone Therapy: Heart and Estrogen/Progestin Replacement Study Follow-up (HERS II). JAMA. 288(1):49-57, July 3, 2002
5. Goodman Steven. The Mammography Dilemma: A Crisis for Evidence-Based Medicine? Annals of Internal Medicine Volume 137:5 (Part 1) 3 September 2002 pp 363-365.
6. Sackett David. Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!) CMAJ. 2001 165: 1226-1237.
7. Goodman SN. Of P-values and bayes: A modest proposal. Epidemiology. Volume 12(3) May 2001 pp295-297.