Can small RCTs produce results that are clinically significant and scientifically conclusive?
Introduction and background:
This essay deals with the question of whether small RCTs (that test a drug's efficacy) can produce results that are both clinically significant and scientifically conclusive. How does one define the terms "clinical significance" and "scientifically conclusive", and what is the difference between statistical significance and clinical significance?
Difference between statistical significance and clinical significance
It is interesting that the FDA does not have a standard operational definition for the term "clinical significance". If one reviews the transcripts of many FDA Drug Advisory Committee meetings, one notes that committee members frequently refer to a p-value <0.05 as indicating that a RCT's result is significant. However, p-values only deal with statistical significance, and a p-value <0.05 only means that there is a <1/20 chance of obtaining a result as large as, or larger than, the trial's measured point estimate result, under the assumption that the null hypothesis (that there is no difference between the treatments) is true. In other words, from a definitional perspective, one 'a priori' assumes that the null hypothesis is true, and the p-value estimates the likelihood of getting a result as large as, or larger than, the point estimate result by chance if the trial were repeated many times. Because that probability is low (<1/20) when the p-value is <0.05, people automatically assume that the null hypothesis cannot be true, and that there must be a difference between the treatments. In fact, some people incorrectly think that the p-value is actually a test of whether the null hypothesis is true, and that a p-value <0.05 means that there is a <5% probability that the null hypothesis is true. That's not correct! The p-value cannot give the probability of the null hypothesis because its calculation assumes that the null hypothesis is true. Also, the p-value doesn't deal with the likelihood of the alternate hypothesis, which is the opposite of the null hypothesis -- the hypothesis that there is a difference between the treatments.
The p-value doesn't measure the likelihood of the alternate hypothesis being as true, or less/more true, than the null hypothesis, and it certainly doesn't attempt to quantify the absolute degree of efficacy difference between the two treatments. I therefore believe that the p-value does not help one determine whether an individual RCT's point estimate result is likely to be the "true" point estimate result, and it certainly cannot establish whether that specific result is clinically significant.
I think that the concept of "clinical significance" requires some quantification of a drug's degree of clinical efficacy and one needs to numerically define an OR value (or RR value or RRR value or ARR value) that represents a clinically significant result. There is no standard method of defining what degree of drug efficacy (eg. 10%, 20%, 25%) represents a clinically significant result, and it depends on individual physician and individual patient values. It must logically therefore also depend on the relationship between the life-threatening nature of the disease and the likelihood of serious drug side-effects. For example, one is much more likely to regard a 10% absolute risk difference in mortality as being clinically significant when one is dealing with a life-threatening disease such as AIDS than if one was dealing with a RCT that was testing an investigational drug against placebo for the treatment of the common cold (a much greater degree of clinical benefit may be required for the result to be universally regarded as being clinically significant). One is also much more likely to mandate a higher efficacy figure when defining "clinical significance" if a drug has a high likelihood of causing serious side-effects, because one will be subsequently calculating the drug's overall risk:benefit ratio and a high likelihood of serious risk needs to be balanced by a high likelihood of great potential benefit. People may even consider cost when defining a clinically significant result, and cost:benefit calculations attempt to establish whether a defined degree of clinical benefit is worth the financial cost. In other words, subjective value-driven judgements are needed to define "clinical significance" and there is no single drug efficacy figure that defines a clinically significant result.
What does the term "scientifically conclusive" mean?
Let's presume that one defines a clinically significant result as a 25% relative improvement when the new investigational agent is tested against standard therapy (either a placebo agent or a control agent) in a RCT. What is meant by the term "scientifically conclusive" with respect to that RCT's results? I think that the term "scientifically conclusive" refers to the objective certainty that a drug's measured efficacy (RCT's signal) is due to the drug's intrinsic effect rather than being due to noise (all other factors that increase or decrease the size of the RCT's signal), and Sackett defines a scientifically conclusive RCT result as a high signal/noise ratio RCT result [1]. There are many causes of noise and interested readers should read Sackett's paper [1] to become better informed about many factors that may produce noise during the conduct of a RCT (eg. enrollment of low risk patients who have also a low likelihood of being responsive to the tested drug, low drug compliance, cross-over effects which occur when enrolled patients take the opposite therapy rather than the assigned therapy, drop-out effects due to individual patients dropping out of the study before study completion). This essay will primarily focus on a major source of noise in RCTs -- chance events that cause the enrolled control patients and treated patients to have a different baseline likelihood of experiencing the outcome-event-of-interest (control event), because I think that many people do not realise that the process of fair randomisation does not guarantee that enrolled control/treated patients have the same baseline likelihood of a control event at the time of trial inception.
How does one define a high signal/noise ratio RCT? Sackett deals with this issue in his paper [1], and he basically asserts that a high signal/noise ratio RCT is a RCT that has a narrow 95%CI range around the point estimate value. If a RCT has a narrow 95%CI range, then the tested drug's "true" point estimate value (which truly reflects the tested drug's efficacy and which can also be defined as the no-bias value) is likely to be very close to the RCT's measured point estimate value. For example, if a RCT's measured point estimate result is a 30% RRR and the 95%CI range extends from a 20% RRR to a 40% RRR, then one can infer with 95% confidence that the drug's "true" efficacy (measured as a RRR) is somewhere between a 20% RRR and a 40% RRR. That means that the "true" efficacy result is very likely to be at least a 20% RRR (although it may in fact be as high as a 40% RRR). Therefore, if one defines a clinically significant result as a RRR of at least 20%, then this RCT's result (point estimate result of 30% RRR and 95%CI range extending from 20-40% RRR) can be considered to be a scientifically conclusive result that is also clinically significant. I will expand on this point in the next section, and I will use a practical example from a real-life clinical trial to make this issue much clearer and much easier to understand.
Determining whether a small RCT produced clinically significant results that are also scientifically conclusive
Consider the following "real life" RCT that tested the value of magnesium as add-on therapy for patients with sudden hearing loss.
MAGNESIUM THERAPY FOR SUDDEN HEARING LOSS. Nageris, BI. et al.
Ann Otol Rhinol Laryng. 113 (8) 672. August 2004.
Background: The precise cause of sudden sensorineural hearing loss is uncertain. Spontaneous recovery has been reported in about 65% of affected patients. No single cause-specific treatment has been identified, but moderate doses of oral steroids appear to be the most widely accepted treatment. Magnesium plays an important role in intracellular metabolic processes. There is some evidence from animal and human studies to suggest that treatment with magnesium may have a therapeutic effect.
Methods: In this double-blind Israeli study, 28 patients, aged 22-75 years, referred within 48 hours of the onset of idiopathic sudden sensorineural hearing loss were randomised to treatment with steroids (1 mg/kg) plus either 167 mg of oral magnesium aspartate or placebo.
Results: Improvement in hearing (defined as improvement of at least 10 dB at each frequency tested) at the three lowest frequencies occurred in approximately 67% of magnesium treated patients and 47% of control patients (p-value <0.05).
Conclusion: Magnesium is a relatively safe and convenient adjunct to steroid therapy for enhancing improvement in hearing, especially in the low frequency range, in patients with sudden sensorineural hearing loss.
Note that the authors of the paper basically conclude that magnesium therapy is useful as add-on therapy for patients with idiopathic sudden sensorineural hearing loss on the basis of the results of a small (28-patient) RCT -- improvement in hearing occurred in approximately 67% of magnesium treated patients versus 47% of controls.
The trialists report that their results are statistically significant (p-value <0.05). However, they do not attempt to demonstrate that their results are clinically significant and they do not provide concrete evidence that their results are scientifically conclusive (by providing 95%CIs).
I think that the most useful method of analysing these results is to use a Clinical Trial Simulator tool (available from http://randomization.org). Interested readers can refer to reference number [2] for a detailed description of the Clinical Trial Simulator tool.
Let's use the Clinical Trial Simulator tool to perform 1,000 simulated trials that have a sample size of 28 patients, a control event rate of 47%, and an experimental event rate of 67% -- and then plot the 95%CI range for the point estimate *RR result and the scattering of those 1,000 point estimate RR results.
(* I have decided to use RR values, rather than OR values, because the Clinical Trial Simulator tool does not provide OR values)
Figure 1: Scattergram of 1,000 simulated trials.
Note that the point estimate RR value is 1.44 and the 95%CI range extends from 0.74-3.25. The 95%CI range is very wide, and this means that the RCT's point estimate RR result of 1.44 cannot be deemed to be scientifically conclusive -- the true RR value may be <1.0 (which disfavors magnesium therapy), or it may be 1.0-2.0 (which somewhat favors magnesium therapy), or it may be 2.0-3.25 (which strongly favors magnesium therapy).
Note that only 11% of the 1,000 simulated trials would have a p-value <0.05, and their RR point estimate values are >1.7. What does that mean? That means that if a particular trial (28 patient sample size and control event rate of 47%) has a point estimate RR value of 1.7 (or larger than 1.7) then there is a <5% probability of that "event" happening by chance if the null hypothesis (that there is no difference between the treatments) is assumed to be true. It doesn't mean that there is a <5% probability of the null hypothesis being true, and it certainly doesn't provide any useful information as to whether that particular result is clinically significant or scientifically conclusive!
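The kind of simulation described above can be approximated in a few lines of Python. This is only a rough sketch and not the Clinical Trial Simulator tool itself: it assumes two equal arms of 14 patients, independent binomial outcomes, and a simple percentile rule for the central 95% of simulated RR values. The function names and the random seed are my own illustrative choices, so the numbers it prints will not exactly match the figures quoted above.

```python
import random

def simulate_relative_risks(n_per_arm, cer, eer, n_trials=1000, seed=2005):
    """Simulate n_trials two-arm trials and return the sorted RR point estimates."""
    rng = random.Random(seed)
    rrs = []
    for _ in range(n_trials):
        control_events = sum(rng.random() < cer for _ in range(n_per_arm))
        treated_events = sum(rng.random() < eer for _ in range(n_per_arm))
        if control_events == 0:
            # 0/0 is undefined and is skipped; x/0 is recorded as an infinite RR.
            if treated_events == 0:
                continue
            rrs.append(float("inf"))
        else:
            # Equal arm sizes, so the RR is simply the ratio of event counts.
            rrs.append(treated_events / control_events)
    return sorted(rrs)

def percentile(sorted_values, fraction):
    """Simple empirical percentile: the value at the given fraction of the sorted list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]

# 28 patients (assumed 14 per arm), control event rate 47%, experimental event rate 67%.
rrs = simulate_relative_risks(14, 0.47, 0.67)
low, mid, high = (percentile(rrs, f) for f in (0.025, 0.5, 0.975))
share_below_one = sum(rr < 1.0 for rr in rrs) / len(rrs)

print(f"median RR {mid:.2f}; central 95% of simulated RRs {low:.2f} to {high:.2f}")
print(f"share of simulated trials with RR < 1.0: {share_below_one:.1%}")
```

Changing the seed or the number of simulated trials shifts the printed limits a little, which is itself a reminder of how much chance affects a 28-patient trial.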
How should one establish whether this RR 1.44 (95%CI 0.74-3.25) result is clinically significant and scientifically conclusive?
I think that Sackett's approach [3] may be the best method of making this determination.
Figure 2: Adapted from figure 6.2 from reference number 3.
Note that Sackett uses the term MIB, which stands for minimally important benefit, and which can be regarded as a synonym for clinically significant benefit. Any RR (or ARR or OR) result that is better than the MIB value (to the left of the MIB threshold line) represents a clinically significant benefit result. Note in example D, that both tails of the 95%CI range are to the left of the MIB threshold line. That is a superiority conclusion -- which basically means that the RCT has definitively demonstrated that the treatment provides a clinically significant benefit. The likely degree of clinically significant benefit depends on the width of the 95%CI range. If the 95%CI range is narrow, then a more precise estimation (scientifically conclusive estimation) of superiority can be established. If the 95%CI range is wide, then the estimated degree of superiority is less precise (less scientifically conclusive), but an overall superiority conclusion may still be unquestionably valid.
In example A, the 95%CI range lies between the MIB and MIH threshold lines. That represents a true negative (or equivalence) conclusion. If the MIB and MIH threshold lines are very close to the zero point (RR value of 1.0), then only a narrow 95%CI range (scientifically conclusive) RCT result would fall between the MIB and MIH threshold lines. If the RCT has a low signal/noise ratio (scientifically inconclusive RCT), then it would have a wide 95%CI range, and one end of the 95%CI range could extend beyond the MIB (and/or MIH) threshold line. Example B demonstrates that type of phenomenon, and that RCT result would be classified as an indeterminate conclusion -- a clinically significant benefit is possible, and cannot be ruled out.
Finally, the 95%CI range may straddle the MIB threshold line, as in example C, and this represents a RCT result that suggests the possibility of a clinically significant benefit, but cannot tell whether the benefit is clinically significant.
So, how would one classify the magnesium for sudden hearing loss RCT's RR result of 1.44 (95%CI 0.74-3.25) using Sackett's approach? Figure 3 graphically illustrates this result.
Figure 3: Magnesium for sudden hearing loss RCT.
Note that one first has to establish a MIB and MIH threshold value, which depends on individual physician and/or individual patient values. I have arbitrarily set the MIB threshold value as a RR value of 1.25, and the MIH threshold value as a RR value of 0.75. Then one can classify the magnesium for sudden hearing loss RCT's RR 1.44 (95%CI 0.74-3.25) result as an indeterminate conclusion (a clinically significant benefit is possible, and cannot be ruled out) from the perspective of its potential clinical significance. The wide 95%CI range indicates the extent of the uncertainty (scientific inconclusiveness) surrounding the potential magnitude of any clinically significant benefit that could be obtained from magnesium therapy.
In conclusion, this magnesium for sudden hearing loss RCT did not demonstrate a superiority conclusion with a high degree of certainty (scientific conclusiveness), and it cannot therefore be used as valid EBM-evidence supporting the routine use of magnesium therapy for sudden hearing loss in clinical practice.
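The classification just described can be written down as a small decision rule. The sketch below is my own simplification rather than Sackett's formal scheme: it works on the RR scale used in this essay (benefit is an RR above 1.0), it uses the arbitrary thresholds chosen above (MIB 1.25, MIH 0.75), and the function name, the strict inequalities at the boundaries, and the mirror-image "harm" category are assumptions added for illustration.

```python
def classify_ci(lower, upper, mib=1.25, mih=0.75):
    """Classify a 95%CI for the RR against the MIB and MIH thresholds."""
    if lower > mib:
        # The whole interval lies beyond the minimally important benefit.
        return "superiority"
    if upper < mih:
        # Mirror-image case (assumed): the whole interval lies beyond minimally important harm.
        return "harm"
    if lower > mih and upper < mib:
        # The whole interval lies between the two thresholds.
        return "equivalence"
    # The interval crosses at least one threshold line.
    return "indeterminate"

magnesium_trial = classify_ci(0.74, 3.25)
print("RR 1.44 (95%CI 0.74-3.25):", magnesium_trial)
```

Because the thresholds are value judgements, the same interval can change category if a physician or patient picks a different MIB or MIH; passing other values for `mib` and `mih` shows how sensitive the conclusion is to that choice.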
What would happen if magnesium therapy was actually more efficacious in a 28-patient magnesium for sudden hearing loss RCT, and if 80% of magnesium treated patients had hearing improvement compared to 47% of control patients (trial 2)? Would trial 2's result represent a clinically significant result, and would it be scientifically conclusive? Figure 4 shows the histogram results of 1,000 simulated trials -- which have a sample size of 28 patients, a control event rate of 47% hearing improvement, and an experimental event rate of 80% hearing improvement.
Figure 4: Histogram of 1,000 simulated 'hypothetical' magnesium for sudden hearing loss trials (trial 2).
Note that the estimated point estimate RR value is 1.74 (compared to 1.44 in the "real life" magnesium for sudden hearing loss trial). However, the 95%CI range still extends beyond the MIB threshold line (towards a RR value of <1.0) and it would therefore still be classified as an indeterminate conclusion from a clinical significance perspective. Also, note that the 95%CI range is very wide (extends from a RR value of 0.95-4.6), and one cannot therefore be certain regarding the likely magnitude of any clinically significant benefit that could potentially be obtained in clinical practice. From the perspective of knowing the precise degree of clinical benefit that could potentially be obtained from magnesium therapy, this RCT result represents a scientifically inconclusive result.
What would it take to get a superiority conclusion from a 28-patient magnesium for sudden hearing loss RCT if the control event rate is 47%?
Let's consider a hypothetical magnesium for sudden hearing loss RCT situation (RCT sample size of 28 patients) where 94% of magnesium treated patients had hearing improvement compared to 47% of control patients -- trial 3. Figure 5 shows the results of 1,000 simulated trials -- which have a sample size of 28 patients, a control event rate of 47% hearing improvement, and an experimental event rate of 94% hearing improvement.
Figure 5: Histogram of 1,000 simulated 'hypothetical' magnesium for sudden hearing loss trials (trial 3).
Note that trial 3 (RR 2.0, 95%CI 1.28-4.89) can be classified as a superiority conclusion because both ends of the 95%CI range are to the right of the MIB threshold line. Trial 3 demonstrates that magnesium therapy could certainly produce a clinically significant benefit, and the only uncertainty is the degree of potential benefit. The uncertainty arises because the trial has a wide 95%CI range due to the trial's small sample size. It would take a much larger sample size to be able to precisely estimate the actual degree of clinically significant benefit that could be obtained from magnesium therapy with a high degree of scientific conclusiveness (presuming that its efficacy was unchanged and the RR was still 2.0, and the control event rate was still 47%).
These trial simulation examples show that a small sample sized RCT can only produce a superiority conclusion if the drug's effect size is relatively large (eg. RR 2.0, reflecting a doubling of the control event rate of 47% hearing improvement to produce an experimental event rate of 94% hearing improvement).
Lesson number 1: A small (28-patient) sample sized RCT having a control event rate of 47% can only produce a superiority conclusion if the effect size is relatively large (RR = 2.0).
Corollary to lesson number 1 -- it is important to realise that the likelihood of a small sample sized RCT producing a superiority conclusion steadily diminishes as the baseline control event rate becomes progressively smaller -- presuming that the drug's effect size is constant at a RR value of 2.0.
Consider what would happen if a 28-patient RCT has a control event rate of 50%, 10% and 2% respectively -- presuming that the experimental drug is seemingly equally efficacious in all three situations (RR of 2.0).
Trial 1 - CER of 50% and EER of 100%; RR = 2.0
Trial 2 - CER of 10% and EER of 20%; RR = 2.0
Trial 3 - CER of 2% and EER of 4%; RR = 2.0
What would the 95%CI range (surrounding the point estimate RR value of 2.0) be in each of those trials?
These are the 95%CI range results for each of those hypothetical trials.
Trial 1 - RR 2.0 (95%CI = 1.3-4.0)
Trial 2 - RR 2.0 (95%CI = zero-infinity)
Trial 3 - RR 2.0 (95%CI = zero-infinity)
Note that trial 1 produced a superiority conclusion (if the MIB threshold is presumed to be a RR of 1.25). Note that trials 2 and 3 did not produce a superiority conclusion even though the drug was seemingly equally efficacious in those trials (point estimate RR value of 2.0). In fact, their results are so scientifically inconclusive that they cannot even be classified as being a valid scientific experiment.
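One way to see why the 95%CI range becomes "zero-infinity" at low control event rates is to work out how often an arm of the trial records no events at all. The sketch below is my own back-of-envelope arithmetic, not output from the Clinical Trial Simulator tool: it assumes 14 patients per arm and independent binomial outcomes, and the function and variable names are illustrative. A trial with no treated events gives a RR of zero, and a trial with no control events gives an infinite RR; if each of those happens in more than 2.5% of trials, the central 95% range has no finite limits.

```python
def prob_no_events(event_rate, n_patients):
    """Probability that none of n_patients experiences the event."""
    return (1.0 - event_rate) ** n_patients

N_PER_ARM = 14  # assumed equal split of the 28 patients
trials = {"Trial 1": (0.50, 1.00), "Trial 2": (0.10, 0.20), "Trial 3": (0.02, 0.04)}

summary = {}
for label, (cer, eer) in trials.items():
    p_control_zero = prob_no_events(cer, N_PER_ARM)
    p_treated_zero = prob_no_events(eer, N_PER_ARM)
    summary[label] = {
        # RR is zero when only the treated arm has no events.
        "rr_zero": p_treated_zero * (1.0 - p_control_zero),
        # RR is infinite when only the control arm has no events.
        "rr_infinite": p_control_zero * (1.0 - p_treated_zero),
    }
    print(f"{label}: P(RR = 0) = {summary[label]['rr_zero']:.3f}, "
          f"P(RR infinite) = {summary[label]['rr_infinite']:.3f}")
```

Under these assumptions both probabilities are far above 2.5% for trials 2 and 3 but negligible for trial 1, which is consistent with the intervals listed above.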
Consider the scattergram of 1,000 simulated trials where the sample size is 28 patients, the CER is 10%, and the EER is 20% -- see figure 6.
Figure 6: Scattergram of 1,000 simulated trials having a sample size of 28 patients, CER of 10%, and an apparent 'no bias' EER of 20%.
Note the wide scattering of potential point estimate RR results -- they are all over the map.
In other words, it is not possible to know what the "true" point estimate RR result is when a small sample size (28 patient) RCT has such a low control event rate. The point estimate RR result of 2.0 could be a *chance event, and there is no way of knowing if the investigational drug is beneficial, or even possibly harmful, if a RCT has a sample size of 28 patients and a control event rate of 10%.
(* see reference number [2] to understand why the possibility of chance events producing considerable amounts of noise in a RCT increases dramatically as the control event rate steadily decreases -- even if the randomisation process is fair)
Further difficulties/complexities that arise when trying to obtain a definitive superiority conclusion from a small sample sized RCT
I previously demonstrated that it is possible to obtain a superiority conclusion from a small sample sized RCT consisting of 28 patients if the control event rate is ~50% and the RR is 2.0.
However, it is important to realise that the superiority conclusion of trial 3 (RR = 2.0; 95%CI 1.28-4.89) is only borderline, because the lower limit of the 95%CI range is just above the MIB threshold RR value of 1.25. It is not possible to obtain a much greater level of superiority (RR>2.0) from that 28-patient trial because it is not possible to more than roughly double a control event rate of 47% -- the maximum experimental event rate is 100% (2 x 50%). If the 28-patient trial had a lower control event rate (eg. 33%) then it would theoretically be possible to triple the success rate (if the experimental event rate were >99%). Would that result in a more definitive superiority conclusion?
The answer is affirmative, because the trial's point estimate result would be a RR of 3.0 and the 95%CI would extend from 1.65-11.0, and the lower limit of the 95%CI is well above the MIB threshold RR value of 1.25.
Another important point about a single small sample sized RCT's borderline superiority conclusion (eg. RR 2.0 95%CI 1.28-4.89) is that it cannot necessarily provide definitive proof of superiority when considered in isolation. There is always a small chance that it could be a statistical outlier result (false superiority conclusion) and the 28-patient trial would have to be repeated many times to discover that the "true" point estimate RR value is definitely >1.25 (MIB threshold RR value).
Consider the following graphical representation of 100 hypothetical magnesium for sudden hearing loss RCT results that have a small sample size of only 28 patients (which will result in a RR point estimate result with a wide 95%CI range).
Let's presume, for argument's sake, that the Truth (magnesium's true efficacy) is a point estimate RR result of 1.44 -- and that we are 95% confident regarding that fact. Note that the Truth is only discovered after repeating the trial 100x, and subsequently discovering that 95% of the trials have a single RR value in common -- a RR value of 1.44.
Note that each individual trial has a 95%CI range, and that 95 of the 100 trials incorporate the Truth (RR value of 1.44), while 5 trials do not incorporate the Truth (RR value of 1.44) -- note that 2 trials (highlighted in orange) have a false superiority result and three trials (highlighted in green) have a false indeterminate/negative result. The two false positive results are spurious superiority conclusions -- false superiority conclusions that occur by chance (2% chance). There are three true superiority conclusions that incorporate the Truth (highlighted in pink) and they represent ~3% of the trials. Note that it is much more likely that a small 28-patient trial will produce an indeterminate result -- the wide 95%CI range incorporates the Truth (RR result of 1.44) but it also extends far beyond the MIB threshold line, often into negative territory.
It is highly unlikely that a meta-analyst would necessarily discover the "Truth" if he only combined the results of a small number (eg. 5-10 trials) of 28-patient sample sized trials, and a meta-analyst would probably need to combine the results of many small sample sized trials if he wanted to accurately determine the "Truth".
Concluding remarks:
What is the primary purpose of performing a drug-RCT on a new investigational agent?
The answer is obvious -- to determine whether the new investigational drug is more efficacious than standard drug therapy (control agent or placebo). However, when performing a drug-RCT, one doesn't only want to know whether the new investigational drug is better than standard therapy, one wants to know how much better it is, and whether the difference is clinically significant. Secondly, one expects to be reassured that the measured degree of clinical benefit is due to the drug and not due to chance events (reassured that the RCT's results are scientifically conclusive).
The magnesium for sudden hearing loss trial (which had a sample size of 28 patients) was markedly underpowered and it cannot produce scientifically conclusive results. Why?
Consider a simple analogy. Let's presume that a person tosses a coin 28x and documents whether each coin toss is heads-or-tails. If the coin is fair (unbiased), then ~50% of 1 million coin tosses would come up heads. However, what is the chance that ~50% of coin tosses would come up heads if the number of coin tosses was only 28x, and if one repeated the sequence of 28x coin tosses a number of times? Even the most unsophisticated layperson knows that the results could vary considerably, and that it would not be unusual for 67% of 28x coin tosses to come up heads -- due to chance alone. One wouldn't automatically presume that the coin was unfair (biased) and presume that the difference between 67% and 50% accurately reflected the absolute degree of bias.
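The coin analogy can be put into numbers with the binomial distribution. The sketch below is my own illustration (the function name is hypothetical): it assumes a fair coin and independent tosses, and it treats "67% heads" as 19 or more heads, the smallest whole number of heads that reaches 67% of 28 tosses.

```python
from math import ceil, comb

def prob_at_least(k, n, p=0.5):
    """Probability of k or more successes in n independent trials with success rate p."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

tosses = 28
heads_needed = ceil(0.67 * tosses)  # smallest head count that is at least 67% of 28
p_high = prob_at_least(heads_needed, tosses)

print(f"{heads_needed} or more heads in {tosses} fair tosses: {p_high:.3f}")
```

Under these assumptions the probability is a little over 4% for a single sequence of 28 tosses (about 1 sequence in 23), so such a run is uncommon in any one sequence but turns up regularly when the sequence is repeated.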
So, why are so many clinical trialists unaware of this basic fact?
I think that a clinical trialist, and/or a clinical trial interpreter, should always estimate to what degree chance events could affect any drug trial's results, before they attempt to make definitive judgements about the drug's effect size.
Imagine a 28-patient sample sized RCT testing an inert agent against an inert placebo. Because the investigational agent is inert (by design), it obviously cannot produce any benefit or harm, and the measured RR should theoretically be 1.0. What is the range of RR results that could occur due to chance alone if the control event rate is 47%? Consider the results of 1,000 simulated randomised trials where the 'no bias' RR value is pre-designed to be 1.0, because the tested agent is totally inert.
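The inert-agent scenario can also be examined without random simulation, by listing every possible pair of event counts and its probability. The sketch below uses that exact enumeration in place of the 1,000 simulated trials; it assumes 14 patients per arm with independent binomial outcomes at 47% in both arms, and the function names and the quantile rule are my own illustrative choices.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k events among n patients with event rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def exact_rr_distribution(n_per_arm, cer, eer):
    """Sorted (RR, probability) pairs over all outcomes, skipping the undefined 0/0 case."""
    outcomes = {}
    for c in range(n_per_arm + 1):
        for t in range(n_per_arm + 1):
            if c == 0 and t == 0:
                continue
            rr = float("inf") if c == 0 else t / c
            prob = binomial_pmf(c, n_per_arm, cer) * binomial_pmf(t, n_per_arm, eer)
            outcomes[rr] = outcomes.get(rr, 0.0) + prob
    total = sum(outcomes.values())
    return sorted((rr, prob / total) for rr, prob in outcomes.items())

def quantile(distribution, q):
    """Smallest RR whose cumulative probability reaches q."""
    cumulative = 0.0
    for rr, prob in distribution:
        cumulative += prob
        if cumulative >= q:
            return rr
    return distribution[-1][0]

null_dist = exact_rr_distribution(14, 0.47, 0.47)  # inert agent: same rate in both arms
prob_rr_144_or_more = sum(prob for rr, prob in null_dist if rr >= 1.44)
low, high = quantile(null_dist, 0.025), quantile(null_dist, 0.975)

print(f"P(RR >= 1.44 by chance alone) = {prob_rr_144_or_more:.3f}")
print(f"central 95% of chance-only RR results: {low:.2f} to {high:.2f}")
```

Under these assumptions an inert agent produces a RR of 1.44 or more in a little under one in five trials. The exact limits differ slightly from simulated ones because the possible RR values in a trial this small are coarse.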
Note the wide range of possible RR results -- which are all due to chance (because the investigational agent is inert). Note that a RR result of 1.44 could easily occur, and that it would entirely be due to chance events (despite a fair randomisation process). Note that there is a 95% probability that any individual RCT would produce a RR point estimate result that is between 0.43-2.6, and that all of those different RR results would entirely be due to chance events (due to the fact that the group of 14 placebo patients would have a different baseline likelihood of having a control event than the group of 14 treated patients). Would it be rational to run the trial? How would one know if any RR result was due to the drug's intrinsic effect rather than due to chance?
I think that it is irrational to perform a drug-RCT under conditions of such great uncertainty. I think that a drug-RCT needs to be sufficiently large, so that any measured RR result is likely to be due to the drug's effect, rather than due to chance.
Consider how the chance-caused scattering of RR results would be much less if the above trial's sample size was 2,800 patients, rather than 28 patients.
Note how the RR results are closely clustered around the 'no bias' RR value of 1.0, reflecting the narrow 95%CI range.
Under these circumstances, a RR value of 1.44 is extremely unlikely to be a chance event phenomenon and is almost certainly due to the drug's effect (if one were dealing with a drug-RCT testing a potentially effective drug, rather than an inert agent).
What is particularly interesting, and important to understand, is that the required sample size is critically dependent on the control event rate, and a much larger sample size is required for trials that have a low control event rate [2]. For example, a sample size of 2,800 patients is adequate if the control event rate is 47%, but it's totally inadequate if the control event rate is 2%. Consider the results of the above hypothetical RCT of an inert agent if the sample size is 2,800 patients and the control event rate is 2%.
Note that we are in a similar situation to the hypothetical trial where the sample size was 28 patients and the CER/EER was 47% -- the 95%CI range is very wide, reflecting the marked potential influence of chance events. If the investigational drug's effect is unknown (rather than pre-arranged to be inert), then it would be impossible to know whether any particular RR result is due to chance, or whether it is due to the difference in drug efficacy between the investigational agent and the control agent. I think that one cannot rationally interpret this trial's results under conditions of such great uncertainty (scientific inconclusiveness). A "real life" example of this type of great uncertainty can be found in the secondary endpoint mortality results of the ICTUS trial, which was reported in the September 15th 2005 issue of the NEJM [4].
The ICTUS trialists were attempting to determine whether a treatment strategy of early invasive therapy would be superior to a treatment strategy of selective invasive management for acute coronary syndrome. They used a composite endpoint (death, MI, and re-hospitalisations for angina within one year of randomisation) as the primary endpoint. The ICTUS trialists made the following statement in the statistical analysis section of the paper: "We calculated that, given a 21 percent incidence of the primary end point in the group assigned to an early invasive strategy, 1200 patients would be needed to provide the study with 80 percent power to detect a relative risk reduction of 25 percent between the two groups, at an alpha level of 0.05." In other words, the 1200-patient study was adequately powered, from a sample size perspective, for a control event rate of 21%, but not adequately powered for a control event rate of 2%. The one year mortality was 2.5% for both groups, and the calculated 95%CI was 0.49-2.0 surrounding a point estimate RR result of 0.99 (similar to the situation above).
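The dependence of the 95%CI width on both sample size and event rate can be checked with a standard large-sample approximation for the confidence interval of a relative risk (the log method). This is a sketch under stated assumptions: it is a different technique from the Clinical Trial Simulator tool, it is not necessarily the method the ICTUS trialists used, and the event counts are hypothetical round numbers chosen to match the percentages quoted above (equal arms, identical event rates in both arms), not the trials' actual data.

```python
import math

def relative_risk_ci(events_treated, n_treated, events_control, n_control, z=1.96):
    """Point estimate and approximate 95%CI for a relative risk (log method)."""
    rr = (events_treated / n_treated) / (events_control / n_control)
    se_log_rr = math.sqrt(
        1 / events_treated - 1 / n_treated + 1 / events_control - 1 / n_control
    )
    return rr, rr * math.exp(-z * se_log_rr), rr * math.exp(z * se_log_rr)

# Hypothetical round-number counts (assumptions, not the trials' actual data):
# (treated events, treated patients, control events, control patients)
scenarios = {
    "2,800 patients, 47% event rate": (658, 1400, 658, 1400),
    "2,800 patients, 2% event rate": (28, 1400, 28, 1400),
    "1,200 patients, 2.5% event rate": (15, 600, 15, 600),
}
results = {name: relative_risk_ci(*counts) for name, counts in scenarios.items()}

for name, (rr, low, high) in results.items():
    print(f"{name}: RR {rr:.2f} (approx. 95%CI {low:.2f}-{high:.2f})")
```

With 15 deaths assumed in each arm of 600 patients, the approximate interval comes out close to the 0.49-2.0 range quoted above, while the 47% scenario gives limits within a few percent of 1.0. The width is driven mainly by the number of events, not by the number of patients.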
Therefore, it is impossible to know whether the lack of a mortality difference between the two treatment strategies truly reflects the fact that there is no real difference between the treatment strategies (that the RR value of 0.99 is really the true 'no bias' RR value), or whether a significant mortality difference is masked by chance events.
Meta-analysts attempt to reduce the uncertainty by pooling the results from a number of trials, but considerable uncertainty can still remain in low control event situations. Consider a meta-analysis of a number of cardiology trials that tested routine early invasive therapy versus selective invasive therapy for Acute Coronary Syndrome [5].
Note the significant fact that the point estimate OR results from the seven different RCTs vary widely -- this could be a reflection of significant chance event bias, which frequently plagues small sample-sized trials that have a low control event rate. Note the very wide 95%CI range of the MATE and VINO trials, which had a ridiculously small sample size for a trial having a control event rate of 10-13%.
The final meta-analytic result from a cumulative sample size of 9,212 patients was an OR of 0.92 (95%CI 0.77-1.09). If one uses the Clinical Trial Simulator tool to produce a dot plot of 1,000 simulated trials having a sample size of 9,212 patients, a CER of 6.0% and an EER of 5.5%, then one can see that a small amount of uncertainty still exists (presuming that it is rational to compare trials having considerable degrees of heterogeneity due to many confounding variables).
In their paper [5], the meta-analysts concluded: "The advantage of combining the results of all trials addressing the same broad question is that the results reflect a wider range of practices and technical skills, and a more robust estimate of risk is likely to be obtained. This also minimizes the play of chance affecting a particular end point in any one trial. Consequently, if a real difference (or lack of a difference) exists, it is likely to be detected."
That statement is generally true, but one wonders if an arbitrary cumulative sample of seven low signal/noise ratio trials would produce a point estimate RR result that is truly reflective of the "true" RR value that could theoretically be obtained if the meta-analysts had ready access to a much larger number of similar low signal/noise ratio trials (as in the 100 trial-example previously described). Nonetheless, a small cumulative meta-analysis of seven trials is more likely to produce scientifically conclusive results than a single inadequately powered RCT.
In conclusion, I think that trialists, and trial interpreters, should become more fully aware of the potential influence of chance event bias, due to a combination of small sample size and/or low control event rates, on the scientific conclusiveness of randomised controlled trials. At the very least, I think that all trial interpreters should always take into consideration the trial's sample size and control event rate before unthinkingly accepting any claim that a single trial's point estimate RR result may accurately reflect an investigational drug's "true" comparative efficacy effect.
Jeff Mann, MD.
Retired physician.
First draft version: September 2005.
References:
1. Sackett, David L. Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!) CMAJ 165(9):1226-1237, October 30, 2001.
Available online at http://www.cmaj.ca/cgi/content/full/165/9/1226
2. Mann J. Quantifying the potential magnitude of chance event noise in randomised controlled trials.
Available at http://jeffmann.net/soapbox/chanceevents.htm
3. Sackett D. From chapter 6:5 (Superiority, Equivalence and Non-inferiority Trials) drafted for the third edition of "Clinical Epidemiology", which is due to be published in the near future.
Available at http://nsite.ca/index.cfm?page=221&CFID=35490&CFTOKEN=18728222
4. de Winter RJ, Windhausen F, Cornel JH, et al., for the Invasive versus Conservative Treatment in Unstable Coronary Syndromes (ICTUS) Investigators. Early invasive versus selectively invasive management for acute coronary syndromes. NEJM 2005;353(11):1095-1104.
5. Mehta SR, Cannon CP, Fox KA, et al. Routine vs selective invasive strategies in patients with acute coronary syndromes: a collaborative meta-analysis of randomized trials. JAMA 2005;293:2908-2917.