Quantifying the potential magnitude of chance event noise in randomised controlled trials

 

-----------------------------------------------------------

 

Introduction:

 

In recent months, I have become very interested in the problem of quantifying to what extent "chance event bias" can influence the scientific validity of a randomised controlled trial (RCT), but until recently I could not quantify its likely effect. This predicament changed radically when I discovered a powerful tool for estimating the likely amount of chance event bias in RCTs -- a Clinical Trial Simulator tool [1]. In this essay, I will describe this Clinical Trial Simulator tool in considerable detail and I will also demonstrate how this tool can be used to establish to what degree chance events can potentially affect the scientific validity of a RCT's results. I will also use a few examples from the recent RCT-literature to demonstrate the usefulness of this powerful analytical tool in "real life" clinical practice.

Many trialists and clinicians simply accept the positive/negative results of a RCT at face value without paying enough attention to the trial's confidence intervals. Although trialists may concede that a RCT with wide confidence intervals cannot produce scientifically conclusive results, they often ignore the "scientific validity" significance of wide confidence intervals when they discuss a trial's point estimate effect at medical research meetings and in interviews with medical journalists. In this essay, I will demonstrate how chance events can cause a RCT to have wide confidence intervals, and how chance events therefore affect the scientific validity of a RCT's results.

 

Background information:

 

What do I mean by the term "scientific validity"? My use of the term "scientific validity" applies to a consideration of the level of confidence that a trialist/trial interpreter can rationally adopt with respect to a single RCT's results. I believe that only high signal/noise RCTs can produce scientifically conclusive results (trial results characterised by a narrow confidence interval). Please note that I will often be using the terms "scientifically valid", "scientifically legitimate" and "scientifically conclusive" interchangeably -- implying that one can only be confident in the scientific legitimacy of a RCT's results if the trial can generate trial results characterised by a high signal/noise ratio and narrow confidence intervals.

A RCT's results reflect the absolute difference in effect between a control agent (or placebo agent) and an investigational agent (experimental agent) on the control event rate. The absolute difference is usually expressed as the risk difference (RD) or absolute risk reduction (ARR), and this absolute difference represents a RCT's signal. If the ARR is large, then a RCT has a large signal. However, a large signal-RCT can only generate scientifically conclusive (scientifically valid) results if the RCT's signal/noise ratio is also high. Noise can be considered to be due to random variations in a RCT's absolute risk difference measurement (ARR) that are not due to the investigational agent's intrinsic therapeutic/harmful effect. In other words, a RCT's measured signal should be considered to have two components -- a component due to the investigational agent's therapeutic (or harmful) effect and a component due to noise. The noise component can either increase the size of the RCT's measured signal (if its effect is in the same direction as the investigational agent's effect) or it can decrease the size of the measured signal (if its effect is in the opposite direction to the investigational agent's effect).

Practical example:

If one uses a RCT to quantify the ability of an investigational antibiotic agent (experimental antibiotic agent) to reduce the mortality rate due to an infectious disease, then the control event rate (CER) would be the percentage of patients who die without treatment (placebo treatment) or who die when treated with standard antibiotic therapy (control antibiotic agent). If the RCT lasted 1 month, then the control event rate (CER) would be the average mortality rate of control group patients during that 1 month trial period. The CER would be 80% if the average mortality rate for the control group patients is 80%. If the experimental agent reduced the mortality rate by 25%, then the relative risk reduction (RRR) would be 25%, and the experimental event rate (EER) would be 60% (a 25% RRR would reduce the average mortality event rate by 1/4 from 80% to 60%). The absolute risk reduction (ARR) would be a measure of the absolute difference between the control agent's effect and the investigational agent's effect on the average mortality rate. In this particular example, the ARR would be 20% (CER-EER= 80%-60%=20%) in favor of the investigational agent. However, note that the power of the investigational antibiotic agent to reduce the mortality rate is best expressed by the RRR value. In this particular example, the RRR is 25%. If the investigational agent was actually twice as powerful (relative to the control agent), then it would double the RRR to 50%. If the investigational agent was only half as powerful, then the RRR would be 12.5%.

The absolute magnitude of the RCT's signal (ARR) depends on two factors - the investigational agent's power to reduce the mortality rate (RRR) and the control event rate (CER). For example, if the CER is 80% and the RRR is 25%, then the EER would be 60%. The ARR would be 20% (CER-EER). However, if the CER is only 40% (because the RCT's enrolled patients are less sick and therefore half as likely to die during the trial's 1 month observation period) and the RRR is 25% (same as before), then the EER would be 30%. The ARR would be 10% (CER-EER). In both situations, the investigational antibiotic agent could be considered to be equally efficacious (same RRR) but the ARR (CER-EER) is very different. Therefore, for a given RRR value, the magnitude of a RCT's signal (ARR) is directly proportional to the CER. The lower the CER, the smaller the absolute magnitude of the signal.
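The arithmetic in the two scenarios above can be checked with a few lines of Python. This is only an illustrative sketch: the function names `experimental_event_rate` and `absolute_risk_reduction` are hypothetical labels chosen for this example and are not part of the Clinical Trial Simulator tool.

```python
def experimental_event_rate(cer, rrr):
    """EER implied by a control event rate (CER) and a relative risk reduction (RRR)."""
    return cer * (1.0 - rrr)


def absolute_risk_reduction(cer, rrr):
    """ARR = CER - EER, which simplifies to CER * RRR."""
    return cer - experimental_event_rate(cer, rrr)


if __name__ == "__main__":
    # Same RRR (25%), two different control event rates (80% and 40%).
    for cer in (0.80, 0.40):
        eer = experimental_event_rate(cer, 0.25)
        arr = absolute_risk_reduction(cer, 0.25)
        print(f"CER={cer:.0%}  EER={eer:.0%}  ARR={arr:.0%}")
```

Because ARR = CER x RRR, halving the CER halves the signal even though the agent's relative efficacy is unchanged.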

Take the above situation of an ARR of 10% when the CER was 40%, the RRR 25% and the EER 30%. In this situation, the magnitude of the signal (ARR=CER-EER) is 10%, and this value represents the investigational antibiotic's absolute therapeutic advantage if there is no chance event bias. Chance event bias can occur if certain chance events either favor (or disfavor) the treatment group patients relative to the control group patients. Consider the effect of chance event bias on this RCT's results. 

Let's presume, for example, that all the control group patients are, by chance, less sick than the treatment group patients and that the average likely mortality rate of the control group patients is 40% (when treated with the control antibiotic agent) while the average likely mortality rate of the treatment group patients (if treated with the control antibiotic agent, and not the new investigational antibiotic agent) is 60%. If the investigational antibiotic treatment was equally as powerful as before (25% RRR compared to the control agent), then the EER would be 45% [60% - {25% of 60%} = 60%-15%=45%]. The trial's ARR would be -5% (CER-EER=40%-45%= minus 5%). The RCT's signal (ARR of minus 5%) would suggest that the investigational antibiotic agent is less efficacious than the control antibiotic agent. However, this is not true; the interpretative error is due to chance event bias, which masks the 25% RRR therapeutic power of the investigational agent over the control agent. The absolute magnitude of the chance event bias noise factor is *15% (10% ARR if the RCT was free of chance event bias, compared to -5% ARR if the RCT had this absolute amount of chance event bias).

(* Note that the chance bias noise factor has two elements that affect the signal in opposing directions -- i) a 20% baseline difference in likely average mortality rate between the treated and control group patients {which favors the control group}; and ii) a 5% ARR value due to the fact that 25% RRR of 60% is greater than 25% RRR of 40% by an absolute amount of 5% {which favors the treatment group}).

The magnitude of the measured signal is 5%, but the magnitude of the chance event noise factor is 15%. The RCT's signal/noise ratio is 5%/15%. The RCT therefore has a low signal/noise ratio.

Sackett [2] used the following formulae to express confidence in a RCT's results.

Sackett stated that a low signal/noise ratio RCT cannot confidently generate scientifically conclusive results (trial results that have a narrow confidence interval). Sackett described many causes of noise (eg. noise due to low treatment compliance, noise due to cross-over effects whereby the treated and control patients take the opposite drug, noise due to patients lost to followup, noise due to inconsistency in the correct  ascertainment of the CER and/or the EER). I will, however, only be focusing on chance event noise in this essay - noise due to chance events causing the control patients and treated patients to have a different baseline likelihood of a control outcome event. In the above example, the RCT could be regarded as having such a low signal/noise ratio that it cannot produce scientifically valid results (scientifically conclusive results that are scientifically legitimate because they are accurately reflective of the investigational agent's true causal effect).   

Consider how chance event bias can work in the opposite direction, and favor the treatment group rather than the control group. Let's presume, for example, that all the control group patients are, by chance, more sick than the treatment group patients and that the average likely mortality rate of the control group patients is 60% (when treated with the control antibiotic agent) while the average likely mortality rate of the treatment group patients (if treated with the control antibiotic agent, and not the new investigational antibiotic agent) is 40%. If the investigational antibiotic treatment was equally as powerful as before (25% RRR compared to the control agent), then the EER would be 30% [40% - {25% of 40%} = 40%-10%=30%]. The trial's ARR would be 30% (CER-EER=60%-30%=30%) and this exaggerated ARR value (30% instead of 10%) is due to chance event bias. Although the magnitude of the RCT's signal is 30%, the magnitude of the chance bias noise factor is 20% (due to a 20% absolute increase in baseline mortality rate in the control group patients compared to the treated patients). 

If the magnitude of the measured signal is 30%, but the magnitude of the chance event noise factor is 20%, then the RCT's signal/noise ratio is 30%/20%. The RCT therefore has a low signal/noise ratio (two thirds of the measured signal is due to chance event noise) and it cannot produce scientifically conclusive results.
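Both imbalance scenarios above can be reproduced with a short calculation. This is a minimal sketch under the essay's own simplification that each group has a single average baseline risk; the function name `measured_arr` and its parameter names are hypothetical and chosen only for illustration.

```python
def measured_arr(control_baseline, treated_baseline, rrr):
    """ARR a trial would report when the two groups differ in baseline risk.

    control_baseline: event rate of the control group on the control agent.
    treated_baseline: event rate the treated group would have had on the
                      control agent, before the investigational agent acts.
    rrr:              relative risk reduction of the investigational agent.
    """
    eer = treated_baseline * (1.0 - rrr)
    return control_baseline - eer


if __name__ == "__main__":
    unbiased_arr = measured_arr(0.40, 0.40, 0.25)  # no imbalance: 10%
    for control, treated in ((0.40, 0.60), (0.60, 0.40)):
        arr = measured_arr(control, treated, 0.25)
        noise = abs(arr - unbiased_arr)
        print(f"control={control:.0%} treated={treated:.0%} "
              f"measured ARR={arr:+.0%} deviation from 10%={noise:.0%}")
```

The deviation from the unbiased 10% ARR is what the text calls the chance event noise factor: 15% in the first scenario and 20% in the second.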

Can randomisation reduce the chance event noise bias factor to insignificant amounts?

A reader could immediately assert that significant amounts of chance event bias could not occur in "real life" RCTs if the randomisation process is fair and unbiased, and that my previous examples are deliberately exaggerated and not reflective of true reality.

However, I will take up the challenge of dealing with this assertion, and I will shortly demonstrate (using the clinical trial simulator tool) that a perfectly fair randomisation process cannot guarantee that chance event bias is reduced to insignificant levels if the RCT has a low control event rate and a small signal, and that many "real life" RCTs have such low signal/noise ratios (due to significant amounts of chance event noise) that they cannot possibly produce scientifically conclusive results.

 

Use of the Clinical Trial Simulator tool to quantify the potential amount of chance event bias in a RCT

 

How does the Clinical Trial Simulator tool work [1]?

Here is a screenshot of the Clinical Trial Simulator's opening screen.
 


Consider a RCT consisting of 2,000 patients (1,000 control patients and 1,000 treated patients) and a control event rate of 80%.

Before running the *Clinical Trial Simulator tool, one first has to input i) the number of patients in the trial; ii) the anticipated CER; and iii) the anticipated EER.

[* for simplicity's sake, I am not going to consider the effect of noise due to i) crossover effects; ii) lost to followup effects; and iii) non-compliance effects]

After inputting the necessary data in the appropriate boxes, one then chooses the number of trial simulations. Then one clicks on "run simulation" and the software program performs the necessary computing tasks. It only takes a modern-day laptop computer between 30 seconds and a few minutes to run 1,000 simulated RCTs.

The results are then expressed as follows.


Note that the Clinical Trial Simulator tool calculates the "average" RR and RRR values (with 95% confidence intervals) for those 1,000 simulated trials. The RR/RRR values in green represent the RR/RRR values that would exist if there was no chance event bias.

The magnitude of potential chance event bias is reflected by the width of the 95% confidence intervals.

This tool also informs one of the percentage of trials that have a P value <0.05. Note that, in this example, 100% of the 1,000 simulated trials have a P value <0.05 and that the 95% confidence interval range is very narrow (95% of the 1,000 simulated trials have a RRR value between 20-29% with an average value of ~25%).
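The internals of the Clinical Trial Simulator tool are not described here, so the following is only a minimal sketch of how a simulation of this kind could work. It assumes that every patient's outcome is an independent random draw at the group's event rate, and that the 95% range is simply the central 95% of the simulated RRR values. The function names, the `seed` parameter and the percentile method are assumptions made for illustration.

```python
import random
import statistics


def simulate_rrr_values(n_per_arm, cer, rrr, n_trials, seed=0):
    """Simulate n_trials two-group trials and return each trial's observed RRR."""
    rng = random.Random(seed)
    eer = cer * (1.0 - rrr)  # event rate in the treated group
    observed = []
    for _ in range(n_trials):
        control_events = sum(rng.random() < cer for _ in range(n_per_arm))
        treated_events = sum(rng.random() < eer for _ in range(n_per_arm))
        if control_events == 0:
            continue  # RRR is undefined when the control group has no events
        # With equal group sizes, RR is simply the ratio of event counts.
        observed.append(1.0 - treated_events / control_events)
    return observed


def central_interval(values, coverage=0.95):
    """Range containing the central `coverage` fraction of the values."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0
    low_index = int(tail * (len(ordered) - 1))
    high_index = int((1.0 - tail) * (len(ordered) - 1))
    return ordered[low_index], ordered[high_index]


if __name__ == "__main__":
    values = simulate_rrr_values(n_per_arm=1000, cer=0.80, rrr=0.25, n_trials=1000)
    low, high = central_interval(values)
    print(f"mean RRR={statistics.mean(values):.1%}  95% range={low:.1%} to {high:.1%}")
```

Changing `cer` while keeping `n_per_arm` and `rrr` fixed is enough to explore how the spread of simulated RRR values depends on the control event rate.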

One can then click on "Make graph" after selecting from a variety of different graphical display choices.

Two useful graphical displays are as follows.


Histogram of the RRR results from the 1,000 simulated trials



The Y axis notes the number of trials that have a particular RRR value.

The green "X" value represents the RRR value if there is no chance event bias.

The confidence interval box represents 75% of the trial results, and the bold black line represents the 95% CI limit range.

Note that ~400 of the 1,000 simulated trials (40%) would generate a RRR value of 25% (equal to the "no bias" RRR value) and that chance event bias can cause some of the trials to have RRR values that are slightly more/less than 25%. However, the chance event bias factor is so small that there is a 95% likelihood that any particular RCT will have a RRR value between 20-30% (note that the base of the bell-shaped curve is very narrow).

One can therefore conclude that all these potential RCTs would have a high signal/noise ratio (presuming that there are no other significant sources of noise other than chance event noise) and that they would all produce scientifically conclusive results -- from both a qualitative and quantitative perspective. 

Another graphical display that is very informative is the following graphical display.



This graphical display plots the RRR values of the 1,000 simulated trials against the P values.

In this particular example, all the trial results are clumped together in one small clump, and all the simulated trials have a P value of <0.001. One can therefore conclude, with a high degree of confidence, that the measured RRR value of any potential trial will be reflective of true causal reality - reflective of the true intrinsic therapeutic value of the new agent (investigational agent) compared to the old agent (control agent).

OK! Now, let's make this subject more interesting. What would happen if the control event rate decreased to 20%, but the RRR and the sample size remained the same?


Effect of a control event rate of 20% on the magnitude of the chance event bias noise factor

 

If one inputs a sample size number of 2,000 patients (1,000 control patients, 1,000 treated patients), a CER of 20%, and an anticipated RRR of 25% into the Clinical Trial Simulator tool - the following results would occur if the trial simulation number was 1,000.





Note that the number of trials that have a P value <0.05 is only 84% and that the 95% confidence interval range is wider. To better appreciate how much wider the confidence intervals are when the CER is 20%, consider the following graphical display.



Note that the bell-shaped curve has a wider base, and that the 95% confidence interval range extends from 10% to 38% RRR (compared to 20% to 30% RRR when the CER is 80%).

The following graphical display demonstrates the results of the 1,000 simulated trials.




Note that most of the 1,000 trial results generate a P value <0.05, and that one can still be very confident that any particular trial will have a low chance event bias noise factor, and therefore a high signal/noise ratio (if there are no other significant sources of noise). One can therefore rationally conclude, with a high degree of confidence, that any of these RCTs will produce scientifically conclusive results.
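The proportion of trials expected to reach P <0.05 can also be approximated without simulation. The sketch below assumes a two-sided test of two proportions using the normal approximation; the test actually used by the Clinical Trial Simulator tool is not stated, so its figures may differ somewhat, especially at low event rates. The function names are hypothetical.

```python
import math


def normal_cdf(z):
    """Cumulative distribution function of the standard normal."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def approx_power(n_per_arm, cer, rrr, z_crit=1.959964):
    """Approximate chance that a trial reaches two-sided P < 0.05.

    Assumption: normal approximation to the difference of two proportions.
    """
    eer = cer * (1.0 - rrr)
    se = math.sqrt(cer * (1.0 - cer) / n_per_arm + eer * (1.0 - eer) / n_per_arm)
    z = (cer - eer) / se
    return normal_cdf(z - z_crit) + normal_cdf(-z - z_crit)


if __name__ == "__main__":
    for cer in (0.80, 0.20):
        print(f"CER={cer:.0%}  approximate power={approx_power(1000, cer, 0.25):.0%}")
```

Under these assumptions, a CER of 20% with 1,000 patients per group and a 25% RRR should give a value close to the 84% reported above, and a CER of 80% should give essentially 100%.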

Now, consider the effect of a CER of 2% on a RCT's likely signal/noise ratio.
 

Effect of a control event rate of 2% on the magnitude of the chance event bias noise factor 
 

If one inputs a sample size number of 2,000 patients (1,000 control patients, 1,000 treated patients), a CER of 2%, and an anticipated RRR of 25% into the Clinical Trial Simulator tool - the following results would occur if the trial simulation number was 1,000. 



Note that the bell-shaped curve has a much wider base. Note that the 95% confidence interval range limit line extends across the zero RRR value and that it extends deeply into negative territory. Note that the 95% confidence interval range is very wide, from -48% to +65%.

Note that only 50 out of 1,000 simulated trials will have a RRR value of exactly 25% (the "no bias" value).

Consider the graphical display that plots the RRR against the P value.


 

Note the wide scattering of the 1,000 trial RRR results.

Note that only 11% of the trials have a P value <0.05.

Note that a significant minority of trials generate contrary results (a negative RRR favoring the control agent rather than a positive RRR favoring the investigational agent).

The wide scattering of likely RRR results, and the wide confidence intervals suggest that many of these 2% CER trials are handicapped by a large amount of chance event noise, which could potentially cause any particular RCT to have such a low signal/noise ratio that it cannot possibly produce scientifically conclusive results. Note that only a minority of the 1,000 trials will accurately quantify the RRR value (the "no bias" RRR value) and that many trials will produce inaccurate quantitative results due to significant amounts of chance event noise.
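The size of the "significant minority" of contrary results can be estimated directly. The sketch below assumes that the event count in each group follows a binomial distribution (independent patients, fixed event rate) and computes the exact probability that the treated group ends up with more events than the control group, which is the condition for a negative RRR. The function names are hypothetical.

```python
import math


def binom_pmf(k, n, p):
    """Binomial probability of exactly k events among n patients (log-space)."""
    log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
               + k * math.log(p) + (n - k) * math.log1p(-p))
    return math.exp(log_pmf)


def prob_contrary(n_per_arm, cer, rrr):
    """Probability that the treated group has more events than the control group."""
    eer = cer * (1.0 - rrr)
    prob = 0.0
    tail = 1.0  # becomes P(treated events > k) inside the loop
    for k in range(n_per_arm + 1):
        tail -= binom_pmf(k, n_per_arm, eer)
        prob += binom_pmf(k, n_per_arm, cer) * max(tail, 0.0)
    return prob


if __name__ == "__main__":
    for cer in (0.80, 0.20, 0.02):
        print(f"CER={cer:.0%}  P(negative RRR)={prob_contrary(1000, cer, 0.25):.3f}")
```

Under these assumptions the probability of a contrary result is negligible at a CER of 80%, but it is far from negligible at a CER of 2%, even though the true RRR is 25% in every case.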

Why would the potential amount of chance event noise be so much larger when the CER is 2% (compared to a CER of 80%)?

My explanation will be found between the horizontal lines.
 


To make this chance event issue more understandable, I have created an imaginary RCT scenario.

Presume that a trialist wants to perform a 2,000 sample-sized RCT (1,000 control group patients and 1,000 treated group patients) that only enrolls low risk patients who have a likely CER of 2%. Presume that 1 million potential enrollees arrive and that 20,000 of them are likely to have a control event. Presume that the trialist corrals all the potential enrollees into a large field paddock and randomly selects 2,000 patients out of those 1,000,000 patients who are randomly milling about in the large field paddock. If the randomisation process does not suffer from selection bias, then there should be, on average, 20 patients in each 1,000 patient group who are likely to have a control outcome event. However, the potential likelihood of that "average" 2% CER value (20 out of every 1,000 enrolled patients) always occurring in any particular 1,000 patient control/treated patient group is actually very small - because of the play of chance.

It is easy to imagine that if the trialist selects another 2,000 patients for another RCT, and then performs the same random selection process repeatedly to a maximum of 500x (in order to run 500 equivalent randomised trials), that each 1,000-patient group will have a slightly different average chance likelihood of having a control event. The absolute difference may not be large, but the relative difference could be highly significant.

For example, if the enrolled control patient group has, on average, a 1.5% likelihood of experiencing a control outcome event, and the treated patient group has, on average, a 2.5% likelihood of experiencing a control outcome event, then the 1% absolute difference (2.5%-1.5%) represents the magnitude of the chance event noise factor. If the RCT's anticipated signal is a RRR of 25%, then the absolute magnitude of the anticipated signal is 0.5% (25% of an anticipated CER of 2%). It doesn't take much imagination to realise that this particular RCT has such a low signal/noise ratio (due to significant amounts of chance event noise) that it cannot generate scientifically conclusive results.

If you have difficulty understanding this random chance event phenomenon, imagine placing a red ribbon around the necks of 20,000 sheep and then herding those 20,000 red ribboned sheep into a large field paddock containing 980,000 sheep that do not have a red ribbon necklace. Imagine allowing the sheep to randomly mill about for 24 hours before photographing the field paddock scene from the air using a high resolution satellite-based camera. It is obvious that random chance events will cause the pattern-distribution of red ribboned sheep to vary significantly in different sections of the large field, so that any particular local grouping of 1,000 sheep may contain a slightly different number of red ribboned sheep. That's what the play of chance is all about! It also doesn't take much imagination to realise that the "relative degree of chance variation" would be proportionately much less if 80% of the sheep were red ribboned and 20% of the sheep were not red ribboned.
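The sheep analogy can be put into numbers. Assuming the number of "red ribboned" individuals in a random group of 1,000 follows a binomial distribution, the relative spread of that count (its standard deviation divided by its mean) depends strongly on the event rate. The function name `relative_spread` is a hypothetical label for this sketch.

```python
import math


def relative_spread(n_patients, event_rate):
    """Standard deviation of the event count divided by its mean (binomial model)."""
    mean = n_patients * event_rate
    sd = math.sqrt(n_patients * event_rate * (1.0 - event_rate))
    return sd / mean


if __name__ == "__main__":
    for rate in (0.80, 0.20, 0.02, 0.005):
        print(f"event rate={rate:.1%}  relative spread={relative_spread(1000, rate):.1%}")
```

With 1,000 patients and a 2% event rate, the expected count is 20 with a standard deviation of about 4.4, so the count typically varies by roughly 22% of its mean. At an 80% event rate the same calculation gives a relative spread of under 2%.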


 

Effect of a control event rate of 0.5% on the magnitude of the chance event bias noise factor   


What would happen if the anticipated average CER is only 0.5%?

If one inputs a sample size number of 2,000 patients (1,000 control patients, 1,000 treated patients), a CER of 0.5%, and an anticipated RRR of 25% into the Clinical Trial Simulator tool - the following results would occur if the trial simulation number was 1,000.


 

Note that only ~5% of the 1,000 trials would have a P value of <0.05.

Note that the 95% confidence interval range for the "average" RRR value of 23.9% is very wide, and that it varies from -300% to 88.4%.

Consider the RRR histogram display.



Note that the bell-shaped curve has such a wide base that it doesn't even appear to be bell-shaped.  

Note that there is only a small probability that any particular trial will accurately identify the "no bias" 25% RRR value.

Note that a significant minority of trials will generate a contrary result (favoring the control group) and that the magnitude of the contrary result will vary widely.

Consider the graphical display that plots the RRR against the P value.




Note that the majority of trials (~95%) will generate a P value of >0.05.

Note that one cannot be confident that any particular trial's RRR results will accurately reflect the "no bias" RRR value (if it is unknown).

The wide variation of RRR results displayed in these two graphs should surely convince a sceptic that one cannot expect a single very low CER (0.5% CER) randomised controlled trial to be capable of producing a scientifically conclusive result.

I could imagine a reader stating that it should be possible to identify those particular trials that have a large chance event noise element by examining the RCT's baseline characteristics table (table 1 in official journal reports of RCTs) - imagining that trials that have a large chance event bias noise factor would have readily apparent differences in baseline characteristics between the control and treated groups. However, that is not necessarily true! I will be demonstrating that important fact when I analyse the TARGET trial, which had a CER of ~0.5%.

Can one decrease the chance event noise factor in these 0.5% CER trials by increasing the sample size?

Consider the RRR histogram results of 1,000 simulated trials having a 0.5% CER and an anticipated RRR of 25%, but varying sample size numbers.




Note that one gains very little advantage, from the perspective of one's ability to reduce the likelihood of chance event bias, by increasing a trial's sample size number from 1,000 patients to 5,000 patients. Even a sample size number of 50,000 cannot avoid a low signal/noise ratio situation due to significant amounts of chance event noise (the bell-shaped curve would still have a wide base).

A randomised trial that can generate a quantitatively accurate point estimate value with narrow confidence intervals (bell-shaped curve with a narrow base), and thereby confidently produce scientifically conclusive results, would require a sample size of ~200,000 patients!
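A rough feel for how the interval narrows with sample size can be obtained from a standard large-sample approximation for the relative risk (standard error of log RR). This is only a sketch: the approximation becomes unreliable when very few events are expected, it is not necessarily what the Clinical Trial Simulator tool computes, and the function name `approx_rrr_interval` is hypothetical.

```python
import math


def approx_rrr_interval(n_per_arm, cer, rrr, z=1.96):
    """Approximate 95% range of observed RRR values around the true RRR.

    Assumption: log(RR) is roughly normal with the usual large-sample
    standard error, evaluated at the expected event rates.
    """
    eer = cer * (1.0 - rrr)
    se_log_rr = math.sqrt((1.0 - eer) / (n_per_arm * eer)
                          + (1.0 - cer) / (n_per_arm * cer))
    rr = eer / cer
    rr_low = rr * math.exp(-z * se_log_rr)
    rr_high = rr * math.exp(z * se_log_rr)
    return 1.0 - rr_high, 1.0 - rr_low  # (lowest RRR, highest RRR)


if __name__ == "__main__":
    # Total sample sizes; half of the patients go to each group.
    for total in (1000, 5000, 50000, 200000):
        low, high = approx_rrr_interval(total // 2, cer=0.005, rrr=0.25)
        print(f"N={total:>7,}  approximate 95% RRR range: {low:+.0%} to {high:+.0%}")
```

Under these assumptions, a 0.5% CER trial needs a total sample size in the region of 200,000 patients before its interval becomes comparable in width to that of the 2,000-patient trial with a 20% CER.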

 

Analysis of a few "real life" RCTs that have a low control event rate of 0.5%-2%



Example number 1
The WHS trial of aspirin therapy for the primary prevention of cardiovascular events in women

 

A number of primary prevention trials have demonstrated that aspirin is useful for the primary prevention of coronary heart disease events (MI events) in men. The average relative risk reduction of MI events in those RCTs was 25-30%. Based on the previous experience in men, the WHS trialists devised a RCT hoping to demonstrate that aspirin would have a similar degree of therapeutic efficacy in reducing the MI rate in women.

Consider a passage from the WHS trial's official report [3].

"The Women’s Health Study was a large randomized, double-blind, placebo-controlled trial of low-dose aspirin in the primary prevention of cardiovascular disease among 39,876 apparently healthy women followed for a mean of 10 years for the major cardiovascular events of myocardial infarction, stroke, and death from cardiovascular causes. The primary end point was a combination of major cardiovascular events, including nonfatal myocardial infarction, nonfatal stroke, and death from cardiovascular causes, and the trial was initially designed to have a statistical power of 86 percent to detect a 25 percent reduction in this end point."

By using a combination endpoint of MI/stroke/death, the WHS trialists could generate a greater amount of statistical power for a given sample size number. However, I think that one really needs to determine whether a RCT has sufficient statistical power to generate scientifically valid conclusions for each component endpoint -- MI, stroke, and CV death. There is no " a priori" guarantee that aspirin will affect those individual endpoints to the same degree (or even in the same direction), and I think that one needs to examine the WHS trial's efficacy results from the perspective of each individual component endpoint.

Consider the WHS trial's design from the perspective of aspirin's ability to effectively reduce the incidence of MI events.

 

 

Note that the WHS trialists enrolled 39,876 patients. Note that 60.2% of the patients (24025 patients) were between 45-54 years of age. Note that only 10.3% of enrolled patients (4097 patients) were >65 years of age.

Note that the MI-CER in women aged 45-54 years was 0.46%, and note that aspirin caused the MI-EER to increase to 0.57%.

Note that the MI-CER in women aged 55-64 years was 1.27%, and note that aspirin caused the MI-EER to increase to 1.49%.

Note that the MI-CER in women aged >65 years was 3.02%, and note that aspirin caused the MI-EER to decrease to 2.0%.
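The relative effect in each age subgroup follows directly from the event rates quoted above. The short sketch below computes RRR = 1 - EER/CER for the three subgroups; the function name and the dictionary layout are hypothetical, while the percentages are the ones stated in the text.

```python
def relative_risk_reduction(cer, eer):
    """RRR = 1 - EER/CER; a negative value means more events on treatment."""
    return 1.0 - eer / cer


# (MI-CER %, MI-EER %) for each age subgroup, as quoted in the text.
subgroups = {
    "45-54 years": (0.46, 0.57),
    "55-64 years": (1.27, 1.49),
    ">65 years": (3.02, 2.00),
}

if __name__ == "__main__":
    for label, (cer, eer) in subgroups.items():
        print(f"{label}: ARR={cer - eer:+.2f} percentage points, "
              f"RRR={relative_risk_reduction(cer, eer):+.1%}")
```

The two younger subgroups show a negative RRR (more MI events on aspirin), while the oldest subgroup shows a positive RRR of roughly one third.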

Was the varying signal strength (CER-EER) in the three different patient subgroups due to aspirin's beneficial/harmful effect being different in the three subgroups, or was it due to chance events?

I believe that the exact answer is unknowable because one cannot determine to what degree this particular RCT (WHS trial) was affected by chance event noise. One can only determine the potential chance event noise factor for a series of identically structured trials (which are theoretically identical to the WHS trial) by running the Clinical Trial Simulator tool. 

This is the result of 1,000 simulated WHS-type trials (having the same sample size numbers and CER/RRR values as the WHS trial).

 

 

Note that a WHS-type trial (having those same sample size numbers and CER/RRR values as the WHS trial) would only have a ~5% chance of generating a P value <0.05.

Note the wide 95% confidence interval range values surrounding the overall study group's RRR point estimate value of -2.4% (the 95% CI range values vary from * -25.3% to 14.7%).

(*These particular CI values are obviously not the same as those obtained in the WHS trial because I have not corrected for many other noise confounders eg. variable compliance rates, variable drop-out rates, variable cross-over rates, variable lost to followup rates)

Consider the RRR histogram results for the different subgroups.




 

Note that each subgroup's bell-shaped curve has such a wide base that one cannot expect any particular RRR point estimate result to be scientifically conclusive (defined as a point estimate result that is associated with narrow confidence intervals). 

The MI results for the entire trial suggest that aspirin has a neutral effect overall, but the histogram curve for all the groups combined (black curve) has such a broad base that the 95% confidence interval values extend all the way from -25% to +15%.

Secondly, and more importantly, is it rational to simply "average" the results from different patient subgroups if they lie on opposite sides of the zero RRR point estimate value? *Would an "internal" meta-analysis of three low signal/noise ratio subgroup MI results really improve the trial's overall MI signal/noise ratio from the perspective of scientific legitimacy?

(* I am sceptical of the scientific validity of combining scientifically inconclusive subgroup results that have an opposing signal and then implying that the investigational agent has a neutral effect because the opposing signals cancel-out each other -- especially if a major component of some/all of the subgroup results could be primarily due to chance events)

In my personal analysis of the WHS trial [4], I concluded that the WHS trial has too low a signal/noise ratio to generate scientifically valid results, and I concluded that one simply does not know if the point estimate effect result is due to a multiplicity of chance events or whether it is mainly reflective of the drug's intrinsic cardiovascular effects. The above histogram results solidify my initial impression.

Consider the RRR-P value scattergram graphical display.

 


Note that 95% of those simulated WHS-type trials have a predicted P value of >0.05.

Note that a WHS-type trial that does reach a P value of <0.05 is as likely to generate a negative result as a positive result.

It is my general impression that even though the WHS trial had two satisfactory trial design features -- a reasonably large sample size of ~40,000 patients and a long study duration of 10 years -- it still could not hope to generate scientifically conclusive results, because too many enrolled patients had a low control event rate (60% of enrolled patients had a MI-CER of 0.5%) plus those low risk patients were not likely to be sufficiently responsive to such a low dose of aspirin (100mg every other day).

In his Powerpoint presentation on RCTs [1], Sackett specifically advises RCT-trialists to avoid enrolling low risk patients who are likely to have a low response to the study drug.

Slide from Sackett's Powerpoint presentation [1].



I think that the WHS trialists broke Sackett's cardinal trial design rule, which is an essential rule if a trialist wants to design a high signal/noise ratio RCT that can generate scientifically conclusive results.

The WHS trialists should have enrolled higher risk patients, who would be more likely to be highly responsive to aspirin, if the trialists wanted to minimise the likelihood of significant amounts of chance event noise causing their RCT to have a low signal/noise ratio.

Consider the likely RRR histogram results if the WHS trialists only recruited elderly patients >65 years, who have an anticipated CER of ~3% (6x greater than the CER for low risk patients) and an anticipated RRR of 33% for MI events, into their trial.

 

 

Note that an aspirin trial consisting of 40,000 higher risk/higher responsive patients could potentially generate a 33% RRR point estimate result with narrow confidence intervals, which would probably be widely considered to be scientifically conclusive (in contrast to the questionably conclusive results of the WHS trial's 4,000-patient elderly patient subgroup).
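The contrast between a 40,000-patient trial and a 4,000-patient subgroup can be sketched with a standard large-sample approximation for the confidence interval of a relative risk (the log-RR method). This is only an illustration under stated assumptions and not the Clinical Trial Simulator's own algorithm: it assumes equal-sized treated and control groups and event rates that exactly match the anticipated values (CER 3%, RRR 33%), and the function name rr_confidence_interval is hypothetical.

```python
import math

def rr_confidence_interval(n_per_group, cer, rr, z=1.96):
    """Approximate 95% CI for a relative risk, assuming the anticipated
    event rates are observed exactly (large-sample log-RR method)."""
    eer = cer * rr                          # experimental event rate
    # Variance of log(RR) for two independent binomial proportions
    var_log_rr = ((1 - eer) / (n_per_group * eer)
                  + (1 - cer) / (n_per_group * cer))
    half_width = z * math.sqrt(var_log_rr)
    return rr * math.exp(-half_width), rr * math.exp(half_width)

# Anticipated values from the text: CER 3%, RRR 33% (RR = 0.67).
# ASSUMPTION: patients are split equally between the two groups.
for total_patients in (40_000, 4_000):
    low, high = rr_confidence_interval(total_patients // 2, cer=0.03, rr=0.67)
    print(f"{total_patients:>6} patients: RR 0.67, "
          f"approx. 95% CI {low:.2f} to {high:.2f}")
```

Under this approximation the interval width on the log scale shrinks with the square root of the number of patients, so the tenfold larger trial gives an interval roughly three times narrower than the subgroup.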

The WHS trial of aspirin therapy for the primary prevention of cardiovascular events in women is a poster-child example of a randomised controlled trial that cannot produce scientifically conclusive results, because it probably has too low a signal/noise ratio due to significant amounts of chance event noise.

 

Example number 2: The APPROVe study's analysis of rofecoxib's adverse cardiovascular effects

 

When reviewing the many COX-2 inhibitor randomised trials that are reputed to have demonstrated that COX-2 inhibitors increase the risk of adverse cardiovascular events, I noted that many of those RCTs had a CER between 0.5% and 1.0%, with an average CER value of 0.8% [5]. At the FDA's Arthritis Drug Advisory Committee meeting in February 2005, I noted that the Advisory Committee Members' overall impression (after reviewing multiple observational and RCT studies) was that rofecoxib may increase the risk of cardiovascular events by an estimated RR value of 1.5. OK! So, let's presume that the anticipated RR value is 1.5 when running the Clinical Trial Simulator program.

Then consider the results of 1,000 simulated APPROVe-type trials that also have a sample size of 2500 patients, a CER of 0.8% and an anticipated RR value of 1.5.

 

 

Note that the bell-shaped curve has a very broad base, and note that the 95% confidence interval range is very wide.
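For readers without access to the Clinical Trial Simulator, the same kind of experiment can be approximated with a short Monte Carlo sketch in Python. It is not the tool's actual algorithm: it assumes each patient independently has an event with the stated probability, that the 2500 patients are split equally between the groups, and it uses a hypothetical function name, simulate_rr.

```python
import random

def simulate_rr(n_per_group, cer, rr, n_trials=1000, seed=1):
    """Simulate trials and return the sorted list of observed RR values.

    Trials with zero control events have an undefined RR and are skipped.
    """
    rng = random.Random(seed)               # fixed seed for reproducibility
    eer = cer * rr
    observed = []
    for _ in range(n_trials):
        control_events = sum(rng.random() < cer for _ in range(n_per_group))
        treated_events = sum(rng.random() < eer for _ in range(n_per_group))
        if control_events > 0:
            observed.append(treated_events / control_events)
    return sorted(observed)

# Inputs from the text: 2500 patients, CER 0.8%, anticipated RR 1.5.
# ASSUMPTION: equal allocation, i.e. 1250 patients per group.
rrs = simulate_rr(n_per_group=1250, cer=0.008, rr=1.5)
low = rrs[int(0.025 * len(rrs))]
high = rrs[int(0.975 * len(rrs)) - 1]
print(f"{len(rrs)} usable trials")
print(f"middle 95% of simulated RR values: {low:.2f} to {high:.2f}")
print(f"share of trials with RR below 1.0: "
      f"{sum(r < 1.0 for r in rrs) / len(rrs):.1%}")
```

With only about 10 expected control events and 15 expected treated events per trial, the spread of simulated RR values is large even though every simulated trial shares the same underlying RR of 1.5.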

The APPROVe trial had a RR result of 1.92. Is that particular RR result due to chance events or is it due to rofecoxib's intrinsic harmful cardiovascular effects? I analysed the APPROVe trial in a previous essay [5] and I concluded that most of the signal was primarily due to chance events (see reference number [5] for my detailed analysis of likely chance events in the APPROVe trial). The APPROVe trial's RR result only represents a single possible RR result, and there are many other equally valid (or equally invalid) potential RR values. How can one rationally determine which particular simulated RCT's RR results (out of these 1,000 simulated trials) represents a true "no bias" RR value rather than an anticipated RR value -- if one didn't posit in advance a "no bias" RR value of 1.5? I personally suspect that it is unknowable! Therefore, what is the scientific value of performing (or interpreting) a single low signal/noise ratio RCT that cannot generate scientifically conclusive results -- because it has too small a sample size and too low a control event rate to generate a point estimate result with narrow confidence intervals?

 

Example number 3: The TARGET trial
 


The TARGET trial was an 18,000-patient study of a COX-2 inhibitor (lumiracoxib) versus two non-selective NSAIDs (ibuprofen and naproxen) [6]. The study was arbitrarily divided into two sub-studies -- lumiracoxib versus ibuprofen and lumiracoxib versus naproxen. The study had two endpoints -- to determine whether there was a difference in i) gastrointestinal ulceration complications and ii) cardiovascular complications between the comparative drug treatment groups.

Here is the description of the randomisation allocation process from the original Lancet report. 

"For logistical and masking reasons, TARGET was divided into two substudies, one with naproxen as the comparator and the other with ibuprofen. Within each substudy randomisation was stratified by age and low-dose aspirin use. The sponsor prepared a computer generated randomisation list with appropriate blocks. The study was centrally randomised according to strata with an interactive voice response system in all countries to ensure age and low-dose aspirin stratification. Allocation of treatment was done via the interactive system and all information was verified by this system before allocation of the patient to a treatment and assignment of the drug packs. To ensure allocation concealment all treatment packs were identically designed and all study drugs were supplied as tablets with matching placebo. We prespecified that data from the two substudies would be pooled for analysis."

When I personally analysed the TARGET trial [5], I reproduced the following graphical display image [7].
 




I then made the following series of statements:

"Note how widely separated the two curves are for the two lumiracoxib subgroups. Note that the wide separation is apparent from the time of trial inception (even when the absolute number of events is very small).

Note that the degree of separation between the two lumiracoxib curves is larger than the degree of separation between any of the lumiracoxib curves and either the ibuprofen or naproxen curves -- this implies that the magnitude of the chance effect is larger than the magnitude of the signal.

My conclusion is that this randomised trial is a prime example of a RCT that has a low signal/noise ratio, and I conclude that one cannot therefore be confident in the trial's conclusion. In fact, one could easily conclude that the trial's interpretative conclusions have near-zero scientific validity, because the comparison between ibuprofen (or naproxen) to one of the lumiracoxib subgroups could be fairly changed to the other lumiracoxib subgroup, resulting in a totally contrary conclusion."

Now that I have access to the Clinical Trial Simulator tool, it is very easy to re-examine this trial from a different perspective.

Consider the TARGET trial as its actual results are run through the Clinical Trial Simulator program.

 


 


Note that the control event rate using a non-selective NSAID was 0.52% in substudy number 1, and 0.57% in substudy number 2.

Note that lumiracoxib decreased the cardiovascular event rate to 0.43% in substudy number 1 and increased the cardiovascular event rate to 0.84% in substudy number 2.

Consider the histogram results for the two substudies.



Note that the two substudies produced totally opposite results - even though they tested the same investigational agent against a control NSAID agent (either naproxen or ibuprofen), which had a consistent CER of ~0.5%, in the same trial.

Note that the trialists concluded that lumiracoxib does not significantly increase the cardiovascular event rate (as compared to a non-selective NSAID agent), and note that this fundamental conclusion is based on an "averaging" of the two low signal/noise ratio substudies. 

Consider figure 2 from the official TARGET trial report [6].



I have previously argued that this simple "averaging" of two contrary low signal/noise ratio sub-trials is not scientifically valid, especially if the disparate point estimate RR values are likely to be due to chance events. I think that the Clinical Trial Simulator's histogram graphical display image demonstrates the fundamental logic of my argument -- considering the fact that the bell-shaped curves for the two lumiracoxib substudies have a very broad base due to a wide 95% confidence interval range. I think that it is irrational to simply average the RR results of two low signal/noise ratio substudies and then conclude that the overall RR value is scientifically conclusive!

To better illustrate the irrationality of this line-of-thinking, consider the results of a 1,000-trial simulation run using the same sample size, the same CER, but an anticipated RR of 1.0.





If the RR of lumiracoxib (relative to a non-selective NSAID) is perceived to be 1.0, then how does one account for the actual RR results derived from the TARGET trial's two substudies?  The only rational answer must be that the disparate actual RR values are primarily due to chance event phenomena. Is it scientifically legitimate to "average" the results from two substudies, which could be under the influence of considerable amounts of chance event noise, and then conclude that the "average" result is truly reflective of lumiracoxib's intrinsic cardiovascular side-effects (as compared to a non-selective NSAID agent)? How can one prove that any one-of-the-two substudies' RR value is not truly representative of the "no bias" RR value, and that the "average" RR value is more reflective of lumiracoxib's intrinsic cardiovascular side-effects (as compared to a non-selective NSAID agent)?
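How often a substudy of this kind would show RR values like those observed, if the underlying RR really were 1.0, can be roughly estimated with a normal approximation for log(RR). The sketch below rests on loudly stated assumptions: the substudy arm sizes are not given in this essay, so an even split of ~18,000 patients into four arms of 4,500 is assumed purely for illustration, and the CER of 0.55% is the rough average of the two quoted control rates. The function name prob_rr_at_least is hypothetical.

```python
import math

def prob_rr_at_least(rr_threshold, n_per_group, cer, true_rr=1.0):
    """Approximate chance that one trial shows RR >= rr_threshold,
    using a normal approximation for log(RR)."""
    eer = cer * true_rr
    se = math.sqrt((1 - eer) / (n_per_group * eer)
                   + (1 - cer) / (n_per_group * cer))
    z = (math.log(rr_threshold) - math.log(true_rr)) / se
    return 0.5 * math.erfc(z / math.sqrt(2))   # upper-tail normal probability

# ASSUMPTION: ~18,000 patients split evenly into four arms of 4,500.
# ASSUMPTION: CER 0.55%, the rough average of the quoted 0.52% and 0.57%.
N_PER_GROUP = 4500
CER = 0.0055

# Observed RR values implied by the event rates quoted in the text
rr_substudy_1 = 0.43 / 0.52     # lumiracoxib decreased the event rate
rr_substudy_2 = 0.84 / 0.57     # lumiracoxib increased the event rate

p_high = prob_rr_at_least(rr_substudy_2, N_PER_GROUP, CER)
p_low = 1 - prob_rr_at_least(rr_substudy_1, N_PER_GROUP, CER)
print(f"observed RRs: {rr_substudy_1:.2f} and {rr_substudy_2:.2f}")
print(f"P(RR >= {rr_substudy_2:.2f} | true RR 1.0) ~ {p_high:.2f}")
print(f"P(RR <= {rr_substudy_1:.2f} | true RR 1.0) ~ {p_low:.2f}")
```

Under these assumed arm sizes, neither observed RR would be a rare result for a drug with a true RR of 1.0, which is consistent with the chance event explanation discussed above.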

I personally believe that it is more rational to adopt the position that, when there is a high likelihood that a single RCT's results could be plagued by considerable amounts of chance event noise, one should simply regard that particular RCT's results as scientifically inconclusive. A meta-analyst may be able to utilise the raw data from multiple low signal/noise ratio RCTs to generate a likely point estimate RR value with narrower 95% confidence intervals. However, the scientific value/validity of meta-analyses is a different topic. In this essay, I am mainly focusing on the issue of whether it is possible to obtain scientifically conclusive results from a single RCT, and I believe that it is not possible to obtain scientifically conclusive results from a single RCT if it has a low signal/noise ratio due to considerable amounts of chance event noise. I think that the TARGET trial is a poster-child example of a RCT that has an overwhelming amount of chance event noise.

I previously stated that one cannot necessarily identify RCTs, which are significantly plagued by chance event noise, by examining the baseline characteristics presented in table 1.

Consider the baseline characteristics of the two lumiracoxib subgroups (from reference number [6]).



If one compares the baseline characteristics of the two lumiracoxib subgroups, one notices that they have roughly the same incidence rate of any particular cardiovascular characteristic. I don't think that this is surprising. Consider my previous argument about chance events occurring during the randomisation process. If the randomisation process is perfectly fair and free of selection bias, then one would expect the treated and control groups to have roughly the same chance likelihood of any particular prognostic variable - in direct proportion to their incidence rate of occurrence in the enrolled patients. The higher the incidence rate of occurrence of any baseline characteristic, the less likely that there will be a large baseline imbalance in that variable due to the play of chance. For example, if hypertension is found in ~40% of enrolled patients, then it is very likely that the incidence rate of hypertension will be ~40% in both groups (within +/- 10-20% range). Chance is much more likely to create larger relative degrees of imbalance if the incidence rate of occurrence is low. For example, if angina is found in only ~3% of enrolled patients, then it is not surprising that there could be a large relative difference of 100% in the incidence rate of angina between the two lumiracoxib subgroups (2% versus 4%). It is very unlikely that there would be such a large relative difference in incidence rates (~100% relative difference) if the baseline characteristic occurred more frequently (eg. 40% average incidence rate of hypertension among all trial enrollees, but a 27% incidence rate in the control group and a 54% incidence rate in the treated group).

In other words, the fact that the two lumiracoxib groups have a similar overall rate of baseline characteristics, doesn't imply that chance events couldn't have influenced the outcome event rate to a highly significant degree. As I have repeatedly demonstrated, chance events are much more likely to significantly affect the outcome event rate when the incidence rate of outcome events is low (0.5%-2%) and overt evidence of that chance event phenomenon will not necessarily be discernible in the data set of baseline characteristics.
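The direction of the baseline imbalance argument above (a rare characteristic shows a larger relative imbalance than a common one) can be illustrated with a small simulation. This is a minimal sketch under stated assumptions: the group size of 1,000 patients is hypothetical and is not taken from the TARGET report, so the printed magnitudes are illustrative only, and the function name median_relative_imbalance is invented for this example.

```python
import random
import statistics

def median_relative_imbalance(prevalence, n_per_group, n_trials=1000, seed=1):
    """Median relative difference in a baseline characteristic between
    two randomly formed groups, as a fraction of the true prevalence."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_trials):
        a = sum(rng.random() < prevalence for _ in range(n_per_group))
        b = sum(rng.random() < prevalence for _ in range(n_per_group))
        # Absolute difference in group rates, relative to the true prevalence
        diffs.append(abs(a - b) / (n_per_group * prevalence))
    return statistics.median(diffs)

# ASSUMPTION: 1,000 patients per group is a hypothetical group size.
for label, prevalence in (("hypertension", 0.40), ("angina", 0.03)):
    rel = median_relative_imbalance(prevalence, n_per_group=1000)
    print(f"{label} ({prevalence:.0%} prevalence): "
          f"median relative imbalance {rel:.1%}")
```

The same logic carries over to outcome events: the lower the event rate, the larger the relative swing that chance alone can produce between two fairly randomised groups.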
 


Example number 4: The APC trial of celecoxib

 

The APC trial of celecoxib is similar to the APPROVe trial in the sense that it was primarily designed to test whether a COX-2 inhibitor (celecoxib) would be useful in preventing colonic adenomatous polyp development in the target population. Like the APPROVe trial, it unexpectedly demonstrated that a COX-2 inhibitor (celecoxib) apparently caused a higher incidence rate of major cardiovascular side-effects (compared to placebo).

Consider the adverse cardiovascular outcome event results (from reference number [8]).



When reviewing the FDA's transcripts of the Arthritis Drug Advisory Committee's February 16-18th 2005 meeting [7], I noted that the Advisory Committee Members appeared to be totally flummoxed by these unexpected results. The APC trial is the first RCT that has demonstrated that celecoxib is associated with an increased risk of adverse cardiovascular events, and there even appears to be a dose-response effect suggesting a causal relationship. I believe that the FDA's Advisory Board Members' state of mental perplexity regarding the APC trial's result was due to the fact that they were mentally trapped by their particular mindset -- they simply accepted the APC trial's results at face value and they didn't seriously consider whether chance event bias could more plausibly account for the unanticipated results.

I think that the Clinical Trial Simulator tool provides an end-user with many useful insights regarding this problematic issue.

Consider the APC trial's MI results. 



Note that the APC trial's sample size was very small - there were only 679 placebo patients and ~679 patients in each celecoxib subgroup. *Note that the MI control event rate was very low: the MI-CER was 0.4% and the MI-EER was 1.3% in each celecoxib subgroup.

(* I am again focusing on an individual component from the composite cardiovascular endpoint, because I believe that it is unscientific to combine the different component endpoints in a single analysis when analysing a trial's results)

Let's presume that there is no "a priori" reason to anticipate that celecoxib would increase the risk of MI events based on an extensive review of large amounts of evidence from the EBM literature. Then, from a Bayesian perspective, one could assume that the anticipated RR value will be 1.0, and the Clinical Trial Simulator's opening screenshot would appear as follows. 



Note that there are 680 patients in each group. Note that the MI-CER is 0.4%, and that the MI-EER is anticipated to be identical to the MI-CER because the anticipated RR value is 1.0.

Consider the histogram results from 1,000 simulated APC-type trials.



Note that the RR histogram provides us with many useful facts.

i) Note that the confidence interval range is huge and that it extends off the graph on both sides (0.000 to infinity) -- due to the fact that the sample size is too small (1360 patients); the CER is too low (0.4%); and the anticipated signal (ARR) is too small.

ii) Note that the APC trial's actual RR result of 3.0 (1.3% versus 0.4%) could be a statistical outlier result, presumably due to a considerable amount of *chance event noise.

(* To those readers who are interested in searching for clues that suggest the presence of chance event bias, consider the fact that the major cardiovascular (MI/stroke/CV death) CER in the placebo group was only 0.29 events/100 patient years. That control event rate value is significantly lower than the CV-CER values found in similar low CV risk patient populations like those enrolled in the APPROVe and PreSAP trials. In fact, the CV-EER value (number of events/per 100 patient years) in the APC trial was not significantly higher than the average CV-CER found in those other trials)

iii) Note that there is no rational reason to expect any single APC-type trial result to be definitively capable of conclusively identifying a "no bias" RR value (if that value is unknown).
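The three observations above can be explored with a short Monte Carlo sketch that mirrors the stated inputs (680 patients per group, MI-CER 0.4%, anticipated RR 1.0). It is an illustration rather than the Clinical Trial Simulator's own method: it assumes independent events with a fixed probability per patient, and simulate_counts is a hypothetical helper name. Trials with no control-group events correspond to the "infinity" end of the RR range, and trials with no treated-group events to the 0.000 end.

```python
import random

def simulate_counts(n_per_group, cer, rr, n_trials=1000, seed=1):
    """Return (control_events, treated_events) pairs for simulated trials."""
    rng = random.Random(seed)
    eer = cer * rr
    return [(sum(rng.random() < cer for _ in range(n_per_group)),
             sum(rng.random() < eer for _ in range(n_per_group)))
            for _ in range(n_trials)]

# Inputs from the text: 680 patients per group, MI-CER 0.4%, anticipated RR 1.0
trials = simulate_counts(n_per_group=680, cer=0.004, rr=1.0)

zero_control = sum(c == 0 for c, t in trials)      # RR undefined or infinite
zero_treated = sum(t == 0 for c, t in trials)      # RR equals 0 (if c > 0)
rr_3_or_more = sum(c > 0 and t / c >= 3.0 for c, t in trials)

print(f"trials with no control-group MI events: {zero_control / len(trials):.1%}")
print(f"trials with no treated-group MI events: {zero_treated / len(trials):.1%}")
print(f"trials with a finite RR of 3.0 or more: {rr_3_or_more / len(trials):.1%}")
```

With fewer than three expected MI events per group, a non-trivial share of simulated trials lands at or beyond an RR of 3.0 even though the programmed RR is 1.0.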

I think that there is only one attitude that a rational person can reasonably adopt with respect to the APC trial's MI results -- the APC trial's sample size and CER are too low to allow the trial to generate scientifically conclusive results.

Therefore, I am baffled that the FDA's Arthritis Drug Advisory Board [7] did not reason along the following lines:

i) There is no significant "a priori" evidence (from observational studies and other RCTs) that celecoxib increases the risk of MI.

ii) The APC randomised controlled trial potentially has too low a signal/noise ratio (due to considerable amounts of potential chance event noise) to generate scientifically conclusive results.

iii) Therefore, because we have to heavily discount the scientific validity of the evidence derived from the APC trial, we have no reason to modify our "a priori" perception that celecoxib does not significantly increase the risk of MI events.

 

Example number 5: The WHI study of hormonal therapy in menopausal women

 

In my analysis of the APC trial, I implied that a rational Bayesian thinker, when faced with a low signal/noise ratio RCT's results, would heavily discount the scientific value of the RCT's scientifically inconclusive results when weighing all the evidence from multiple studies. In other words, a rational Bayesian thinker would approach any scientific evidence with an "a priori" bias that is based on the weight of all the previous scientific evidence, and a rational Bayesian thinker would only radically change his baseline position if any "new" contrary evidence is scientifically conclusive (scientifically incontestable).

Let's consider the controversial issue of whether the WHI study conclusively demonstrated that the administration of hormonal therapy (HRT) to menopausal women increases the risk of MI events - from a Bayesian perspective. Prior to the availability of direct RCT evidence, there was substantial evidence from multiple observational studies suggesting that HRT does not increase the risk of future MI events, and that it may actually have a moderate cardioprotective effect. Therefore, a rational Bayesian thinker should adopt an "a priori" position that favors the belief that HRT does not increase the risk of MI events, and that it may possibly even have a cardioprotective effect. Should that Bayesian thinker modify his position if RCT-evidence became available -- considering the fact that RCT evidence is widely regarded as being more scientifically valid than evidence from observational studies? I have previously argued that RCT evidence only has a substantial degree of scientific validity if the RCT has a high signal/noise ratio due to low amounts of chance event noise (and other noise elements). What is the signal/noise ratio of HRT-RCTs when viewed from the perspective of potential chance event noise?

The first HRT-RCT evidence that became available was the results of the HERS trial [9]. The HERS trialists studied the MI event rate in women who already had prior evidence of coronary heart disease. In other words, the HERS trial was a secondary prevention study, and it therefore had the advantage of a higher control event rate, which should decrease the potential amount of chance event noise that could theoretically influence the trial's results. However, the absolute amount of potential chance event noise also depends on the trial's sample size and the anticipated RR (or actual RR) value. The HERS trial (which was subsequently extended to 7 years and finally reported as the HERS II trial [10]) had a sample size of 2763 patients and it demonstrated that HRT does not increase the risk of coronary heart disease events in high risk patients.


HERS trial's coronary heart disease event results -- from reference number [10]


 


Can the HERS trial's *RR value of 1.0 for CHD events be regarded as being scientifically conclusive?

(*In this essay, I am not considering the other sources of noise that plagued the HERS trial -- noise due to variable crossover, low compliance and drop-out effects).

Consider a series of 1,000 simulated HERS-type trials using a sample size of 2763 patients, a *control event rate of 21%, and a RR value of 1.0.

(* the overall CER value of 21% is not precisely accurate because it doesn't take into account the number of patients at risk, which varied from year-to-year. However, any slight imprecision in the CER value doesn't affect the overall message-value of the following Clinical Trial Simulator's results)

These are the histogram results of 1,000 simulated HERS-type trials.




Note that the "point estimate" RR value is 1.0.

Note that the 95% CI range is narrow, and that the bell-shaped curve of possible RR results has a narrow base.

Therefore, one can reasonably conclude that this histogram demonstrates that a HERS-type RCT can potentially establish, with a great deal of confidence, that HRT does not significantly increase (or decrease) the future risk of a coronary heart disease event in high risk patients. Note that the narrow base of the bell-shaped curve suggests that a HERS-type trial has a small chance event noise problem, and that potential chance event noise would not significantly affect the scientific legitimacy of the "average" HERS-type trial's overall conclusion that HRT does not significantly increase/decrease the risk of coronary heart disease events -- because most HERS-type RCTs (if run) would have a RR value that is very close to a RR value of 1.0.

I think that the HERS II trial's CHD results should not induce a Bayesian thinker to modify his "a priori" position that HRT does not increase the risk of CHD events when administered to menopausal women -- because the HERS-RCT's results are consonant with his "a priori" belief and the RCT's evidence can be deemed to have a high degree of "scientific conclusiveness" from a chance event noise perspective.

Should the WHI trial's results induce a Bayesian thinker to modify his position that HRT does not increase the risk of coronary heart disease events? The WHI randomised controlled trial's results first became available in mid-2002 [11] and the WHI-RCT demonstrated that HRT increases the risk of CHD events (RR = 1.29).

I noticed that many editorialists, writing at the time of official publication of the WHI trial's results, concluded that the WHI trial definitely proved that HRT increases the risk of CHD events in menopausal women, and they expressed their viewpoints without evincing any doubts.

Writing in the same issue of JAMA [12], the editorialists Fletcher and Colditz wrote "The methods of the WHI appear strong. A total of 16,608 women entered the randomized double/blind trial, and the active treatment group and the placebo group appeared comparable at baseline. --- How should practicing clinicians and millions of women taking estrogen/progestin combination react to the unexpected and disquieting results of this study. First, although the trial results are reported primarily in terms of relative risk, which is appropriate for studies of cause, when applying the results to practice, they must be translated into absolute risk. The absolute harm to an individual woman is very small. As the authors point out, the increased risk of the estrogen/progestin combination means that in 10,000 women taking the drug for a year (10000 must be used to register risk in whole integers) there will be 7 more coronary heart disease events. ---- Given the results, we recommend that clinicians stop prescribing the combination for long-term use".

In their August 20th 2002 CMAJ commentary article on the WHI study [13], the editorialists Salim Yusuf and Sonia Anand concluded their commentary article by stating:- "The results of the WHI may be viewed as "unwelcome news" by some, but for vast numbers of physicians and their patients, the information simplifies what has been a confusing past decade. We should not use HRT for its purported preventive effects, because it causes more harm than good. ---- The WHI also confirms the importance of well-designed, large randomized trials as the only reliable method to evaluate most common interventions. The direct and indirect costs related to the use of HRT probably run into a few billion dollars worldwide each year, with the cumulative costs over the last 2 decades probably in excess of a $100 billion. Had studies such as the WHI been conducted earlier, a significant proportion of this waste could have been avoided, not to mention the avoidance of adverse effects in several million women. The costs of conducting even "relatively expensive" trials pale in comparison to the economic costs saved and human suffering avoided. ---- In conclusion, the WHI is a large, well-designed, and carefully conducted study that will have a tremendous impact of the health of women. The message for healthy women without severe symptoms of menopause is now clear: to avoid as far as possible HRT, which on balance does more harm than good."

(my italics)

Note that both editorialists seemingly concluded that the WHI trial's results are solid and scientifically incontestable. Should a rational Bayesian thinker therefore modify his "a priori" position that HRT does not increase the risk of coronary heart disease events in menopausal women?

These were the WHI trial's CHD results as reported in reference number [11].


 

Note that the sample size was 16,608 patients, the annualised CER 0.30%, and the annualised EER 0.37%. Note that the nominal 95% CI range extends from 1.02-1.67 and the adjusted 95% CI range extends from 0.85-1.97 (based on a point estimate value of 1.29).

If a Bayesian thinker anticipates a RR value of 1.0 from a WHI-type trial (based on his "a priori" belief that there is no prior evidence that HRT increases the risk of CHD events when administered long-term to menopausal women), should the WHI trial's RR 1.29 result provoke him to modify his position? Consider the Clinical Trial Simulator's histogram results when we input a sample size of 16,608 patients, a CER of 0.30%, and an *anticipated EER of 0.30%.

(* the anticipated EER of 0.3% is based on an "a priori" anticipation that the EER should not be more, or less, than the CER)

I programmed the Clinical Trial Simulator to produce simulated histogram curves for the HERS and WHI trials at the same time, so that one could compare the histogram and CI results on the same scale.


 

Note that the bell-shaped curve for the WHI trial has a broader base due to a wider 95% CI range.

Note that the WHI trial's RR value of 1.29 could be a statistical outlier result due to a significant amount of chance event noise.
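The difference between the two curves can be approximated without the simulator by using the large-sample standard error of log(RR) under a true RR of 1.0. The sketch below mirrors the inputs described in the text, including the use of the annualised CER of 0.30% as the per-patient event probability for the WHI-type design, which is a simplification. Equal-sized groups are assumed, and null_rr_interval is a hypothetical function name.

```python
import math

def null_rr_interval(total_patients, cer, z=1.96):
    """Approximate range holding 95% of RR results when the true RR is 1.0
    (log-RR normal approximation, equal-sized groups assumed)."""
    n = total_patients / 2
    expected_events = n * cer                       # per group
    se_log_rr = math.sqrt(2 * (1 - cer) / expected_events)
    return expected_events, math.exp(-z * se_log_rr), math.exp(z * se_log_rr)

# Inputs as quoted in the text; ASSUMPTION: equal allocation between groups.
designs = {
    "HERS-type": (2763, 0.21),      # sample size, CER
    "WHI-type": (16608, 0.0030),
}
for name, (patients, cer) in designs.items():
    events, low, high = null_rr_interval(patients, cer)
    print(f"{name}: ~{events:.0f} expected events per group, "
          f"95% of RR results between {low:.2f} and {high:.2f}")
```

The width of the range is driven mainly by the expected number of events per group rather than by the number of patients, which is why the smaller HERS-type design gives the narrower curve. With these inputs, an RR of 1.29 lies inside the chance range for the WHI-type design but outside it for the HERS-type design.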

What should a rational Bayesian thinker conclude given the following two choices?

1) That the WHI trial's RR 1.29 value is definitely reflective of HRT's absolute harmful coronary effects, and that this conclusion is scientifically incontestable.

2) That the WHI trial had a combination of "too small a sample size and too low a control event rate" to yield a scientifically conclusive result, and that the estimated RR 1.29 value could possibly be reflective of a *significant amount of chance event noise.

* To help open-minded readers make up their minds regarding the possible presence of a significant amount of chance event noise, consider the WHI trial's year-to-year CHD event rate results.

 

Year of study                             HRT therapy                Placebo therapy            Hazard ratio
(number of participants)                  Number of patients with    Number of patients with
                                          CHD event                  CHD event
                                          (annualized percentage)    (annualized percentage)

Year 1 (8435 treated, 8050 placebo)       43 (0.51%)                 23 (0.29%)                 1.78
Year 2 (8353 treated, 7980 placebo)       36 (0.43%)                 30 (0.38%)                 1.15
Year 3 (8268 treated, 7888 placebo)       20 (0.24%)                 18 (0.23%)                 1.06
Year 4 (7926 treated, 7562 placebo)       25 (0.32%)                 24 (0.32%)                 0.99
Year 5 (5964 treated, 5566 placebo)       23 (0.39%)                 9 (0.16%)                  2.38
Year 6 plus (5129 treated, 4243 placebo)  17 (0.33%)                 18 (0.42%)                 0.78


The WHI trialists reported that the average annualised CHD event rate was 0.30% in the placebo group and 0.37% in the HRT group (RR 1.29), and those average annualised event rate figures represent the average value for the entire study period of 6+ years.

However, note that HRT was only "apparently" harmful in year 1 (HR 1.78) and year 5 (HR 2.38).

Now, do you really believe that HRT was only harmful in those two years and not harmful in the other four 1-year time periods (year 2, year 3, year 4, year 6+)? I think that an unyielding belief that the WHI trial's "actual recorded HR numbers in year 1 and year 5 are entirely due to HRT's intrinsic harmful effects" is a logically indefensible belief that stretches far beyond the limits of pathophysiological plausibility. I think that a much more plausible explanation for the "apparent" harm phenomenon in year 1 and year 5 is that there was a disproportionately high 0.51% event rate in the HRT group in year 1 and a disproportionately low 0.16% event rate in the placebo group in year 5 -- due to chance events.
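The year-to-year figures can be re-derived from the raw counts in the table above. The following sketch computes the crude ratio of annual event rates for each year, together with an approximate 95% confidence interval from the log-RR method. These are crude risk ratios rather than the hazard ratios reported by the trialists, so small differences from the tabulated values are to be expected; the function name risk_ratio_with_ci is hypothetical.

```python
import math

# (label, HRT events, HRT patients, placebo events, placebo patients),
# copied from the year-by-year table above
YEARS = [
    ("Year 1", 43, 8435, 23, 8050),
    ("Year 2", 36, 8353, 30, 7980),
    ("Year 3", 20, 8268, 18, 7888),
    ("Year 4", 25, 7926, 24, 7562),
    ("Year 5", 23, 5964, 9, 5566),
    ("Year 6 plus", 17, 5129, 18, 4243),
]

def risk_ratio_with_ci(events_t, n_t, events_c, n_c, z=1.96):
    """Crude risk ratio and approximate 95% CI (log-RR method)."""
    rr = (events_t / n_t) / (events_c / n_c)
    se = math.sqrt(1 / events_t - 1 / n_t + 1 / events_c - 1 / n_c)
    return rr, rr * math.exp(-z * se), rr * math.exp(z * se)

for label, et, nt, ec, nc in YEARS:
    rr, low, high = risk_ratio_with_ci(et, nt, ec, nc)
    print(f"{label:<12} ratio {rr:.2f}  (approx. 95% CI {low:.2f} to {high:.2f})")
```

Each yearly ratio rests on a few dozen events at most, and in year 5 on only 9 placebo events, so the interval around every yearly value is wide.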

Finally, when choosing whether to accept the RR results of the HERS trial or the WHI trial as being reflective of true reality, why should a rational person favor the wide 95% confidence interval RR result from a potentially low signal/noise ratio RCT performed in low risk patients (WHI trial) rather than the narrow 95% confidence interval RR result from a likely high signal/noise ratio trial performed in high risk patients (HERS trial)?

 

Concluding remarks:

 

I have presented substantial "real life" evidence from the medical literature that suggests that many "real life" RCTs have control event rates too low, and/or sample sizes too small, to generate scientifically conclusive results. I think that too many trialists have such overabundant faith in a RCT's ability to determine the scientific truth, that they do not seem to recognise that their unquestioning faith is unjustified when the control event rate is low. Chance events are a major source of potential noise when a RCT has a small sample size and a low control event rate, and many of those RCTs may therefore have too low a signal/noise ratio to be able to generate scientifically conclusive results.

I think that one major cause of the problem of "scientifically inconclusive RCTs" is the propensity of trialists to use composite endpoints, so that they can justify using a smaller sample size when designing their trial. Composite endpoints confound the issue because the individual component endpoints are not necessarily affected to the same degree, or even in the same direction, by the investigational agent. Each individual endpoint result has to have a high signal/noise value if trial interpreters are expected to have confidence in the scientific conclusiveness of the individual endpoint results. However, the control event rate of many individual component endpoints (eg. MI, stroke, CV death) in many RCTs is too low, thus resulting in a point estimate result that has wide 95% confidence intervals. Those wide 95% confidence interval point estimate results cannot be regarded as being scientifically conclusive.

I also think that too many trialists and clinicians search for clinically meaningful results by data-dredging a RCT's subgroup data. They don't seem to realise that if a "positive" result is found during a subgroup analysis, that two chance event noise factors can radically devalue the scientific validity of their subgroup analysis - i) the sample size of the subgroups is often too small to generate scientifically conclusive results, and ii) there is no guarantee that the treated and control subgroups are balanced for known (or unknown) prognostic variables even if the overall study is deemed to be well balanced at the time of trial inception.

I have also noticed that many trialists imply that their RCT's results underestimate the magnitude of the therapeutic effect when the trial has a large cross-over effect, a low compliance effect, or a large drop-out effect. However, they completely ignore the fact that those same effects can potentially increase the trial's chance event noise level to an even greater degree, which could cause the RCT's results to be even more scientifically inconclusive [5].

I also think that too many trialists and clinicians don't understand the significance of a RCT's point estimate result, and they automatically assume that the point estimate result is a "no bias" result (no chance event bias present). However, a point estimate result is not necessarily a "no bias" result. A trialist has to prove that zero chance event bias is present if he wants to convince people that a RCT's point estimate result is quantitatively reflective of true reality -- a "no bias" point estimate result that accurately reflects the investigational agent's quantitative therapeutic (or harmful) effect, and that is not affected by chance events. For example, in the APPROVe trial, the point estimate RR result was 1.92. However, that point estimate RR result can only be equal to the "no bias" point estimate RR result if the APPROVe trialists can prove that the treated and placebo group patients (if both groups remained untreated) had the same risk of a control event for the entire duration of the trial. In the following figure (from reference number [14]), one can readily note that the placebo group patients had a disproportionately low control event rate during the last 18 months of the trial.



Is that "low control event rate" phenomenon during the last 18 months of the trial pathophysiologically plausible? Can one automatically assume that the treated group patients, if untreated, would have had the same low event rate as the placebo group patients during the last 18 months of the trial? Surely one needs to validate that basic assumption if one wants to legitimise the trial's results? If one cannot conclusively demonstrate that the rofecoxib patients (if untreated) would have had the same number of events as the placebo patients during the last 18 months of the trial, then one cannot legitimately claim that the trial's signal (absolute risk difference in serious thrombotic endpoint events) is entirely due to the drug's harmful cardiovascular effects rather than due to chance events [5]. Too many trialists automatically presume that a fair randomisation process should always produce balanced groups, but the APPROVe trial (and many other low control event rate trials) demonstrates that this is not necessarily the case. I think that the burden of proof lies with the APPROVe trialists -- to legitimise their trial's results, they have to demonstrate that chance event bias was not a major problem in their trial.

In this essay, I have often underlined the word "potential" when using that word before certain phrases, e.g. "chance event bias". My reason for frequently prefacing a statement with the word "potential" is cautionary. I am fully aware that chance events can distort the results of a low control event rate trial when the sample size is small. However, I cannot prove that a significant amount of chance event bias has occurred in any particular trial (other than the TARGET trial, which was arbitrarily subdivided into two supposedly equivalent substudies). For example, in the APPROVe trial, the placebo group's low event rate during the last 18 months of the trial suggests the presence of chance event noise. However, I cannot prove that chance event noise is present, and I therefore cannot prove that the trial has a low signal/noise ratio due to significant amounts of chance event noise. Evidence of chance event noise is often suggestive and/or inferential. I think that the Clinical Trial Simulator tool is a powerful tool because it provides one with an inferential view of potential chance event noise. When looking at a histogram of 1,000 simulated trials, one can estimate the potential likelihood of chance event bias affecting a particular RCT that has a particular sample size, a particular control event rate and a particular experimental event rate. If the histogram's bell-shaped curve has a wide base, it indicates a wide 95% confidence interval and a significant possibility of chance event bias. However, to rationally imply that a particular RCT's point estimate RR value is not likely to be the "no bias" RR value, one needs to search for additional clues that suggest the presence of chance event noise.
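The kind of histogram analysis described above can be approximated in a few lines of Python. The sketch below is not the Clinical Trial Simulator tool [1] itself and makes no claim to reproduce its algorithm; it simply assumes that each patient's outcome is an independent random draw at the stated event rate. The function names, the arm sizes of 500 and 5,000 patients and the 2% and 4% event rates are hypothetical values chosen for illustration, not data from any trial discussed in this essay.

```python
import random


def simulate_relative_risks(n_per_arm, control_rate, experimental_rate,
                            n_trials=1000, seed=0):
    """Simulate n_trials two-arm trials and return the sorted relative risks."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    relative_risks = []
    for _ in range(n_trials):
        control_events = sum(rng.random() < control_rate
                             for _ in range(n_per_arm))
        treated_events = sum(rng.random() < experimental_rate
                             for _ in range(n_per_arm))
        if control_events == 0:
            continue  # RR is undefined when the control arm has no events
        # The arms are equal in size, so RR reduces to a ratio of event counts.
        relative_risks.append(treated_events / control_events)
    return sorted(relative_risks)


def central_range(sorted_values, coverage=0.95):
    """Return the empirical range that holds the central `coverage` share."""
    tail = (1 - coverage) / 2
    lower = sorted_values[int(tail * len(sorted_values))]
    upper = sorted_values[int((1 - tail) * len(sorted_values)) - 1]
    return lower, upper


# Same true RR of 2.0 (4% vs 2%), but two very different sample sizes.
for n_per_arm in (500, 5000):
    rrs = simulate_relative_risks(n_per_arm, 0.02, 0.04)
    lower, upper = central_range(rrs)
    print(f"{n_per_arm} per arm: central 95% of simulated RRs "
          f"runs from {lower:.2f} to {upper:.2f}")
```

Both runs share the same underlying RR of 2.0, so any difference in the spread of the simulated RR values comes from chance events alone. The smaller trial gives the wide-based histogram described above, and the larger trial gives a much narrower one.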

In conclusion, I think that the quality of published randomised controlled trial reports could radically improve if medical journal editors always performed a Clinical Trial Simulator analysis of submitted RCT manuscripts, and if they routinely asked pointed questions when the histogram results suggest that significant amounts of chance event noise could potentially be present. I think that trialists should be obliged to demonstrate that their RCTs are unlikely to be significantly affected by chance event noise before medical journal editors consider publishing their RCT's official report. The primary purpose of a RCT is to determine the scientific truth with a high degree of scientific conclusiveness, and RCTs that cannot generate scientifically conclusive results have little scientific value. Trialists should preferably not design or perform scientifically inconclusive trials and medical journal editors should not publish the results of scientifically inconclusive trials (other than as a cautionary tale). Considerable amounts of human effort, patients' health interests, money and time are wasted on scientifically inconclusive RCTs and I think that this wasteful situation needs an urgent remedy. 

 

Jeff Mann, MD.

Retired physician.

First version: April 2005.

E-mail address: jmannemg@earthlink.net

 

References:

 

1. Clinical Trial Simulator tool. Available for download at http://randomization.org

2. Sackett, David L. Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!) CMAJ 165(9):1226-1237, October 30, 2001.

Available online at http://www.cmaj.ca/cgi/content/full/165/9/1226

3. Ridker PM, Cook NR, Lee IM, et al. A randomized trial of low-dose aspirin in the primary prevention of cardiovascular disease in women. N Engl J Med 2005; 352: 1293-1304.

4. Mann J. An analysis of the WHS trial of aspirin therapy for the primary prevention of cardiovascular events in women.

Available at  http://jeffmann.net/soapbox/Aspirin-WHSanalysis.htm

An Adobe PDF version is available at http://jeffmann.net/soapbox/Aspirin-WHSanalysis.pdf

5. Mann J. Questioning the scientific validity of the randomised trials of COX-2 inhibitors showing an increased risk of adverse cardiovascular events.

Available at http://jeffmann.net/soapbox/vioxx-cox2critique.htm

An Adobe PDF version is available at http://jeffmann.net/soapbox/vioxx-cox2critiqueadobe.pdf

6. Michael E Farkouh, Howard Kirshner, Robert A Harrington, Sean Ruland, Freek W A Verheugt, Thomas J Schnitzer, Gerd R Burmester, Eduardo Mysler, Marc C Hochberg, Michael Doherty, Elena Ehrsam, Xavier Gitton, Gerhard Krammer, Bernhard Mellein, Alberto Gimona, Patrice Matchaba, Christopher J Hawkey, James H Chesebro, on behalf of the TARGET Study Group* Comparison of lumiracoxib with naproxen and ibuprofen in the Therapeutic Arthritis Research and Gastrointestinal Event Trial (TARGET), cardiovascular outcomes: randomised controlled trial. Lancet Vol 364 p 675-84. August 21, 2004

7. FDA public website. CDER Meeting Documents. Arthritis Drug Advisory Committee. February 16-18, 2005 Joint Meeting with the Drug Safety and Risk Management Advisory Committee. Available at http://www.fda.gov/ohrms/dockets/ac/cder05.html

8. Solomon SD, McMurray JJ, Pfeffer MA, Wittes J, Fowler R, Finn P, Anderson WF, Zauber A, Hawk E, Bertagnolli M; Adenoma Prevention with Celecoxib (APC) Study Investigators. Cardiovascular risk associated with celecoxib in a clinical trial for colorectal adenoma prevention. NEJM 2005 Mar 17;352(11):1071-80. Epub 2005 Feb 15.

9. Hulley S, Grady D, Bush T, et al, for the HERS Research Group. Randomized trial of estrogen plus progestin for secondary prevention of coronary heart disease in postmenopausal women: Heart and Estrogen/progestin Replacement Study (HERS) Research Group. JAMA. 1998;280:605–613. 

10. Grady, Deborah MD, MPH. Herrington, David MD, MHS. Bittner, Vera MD. Blumenthal, Roger MD. Davidson, Michael MD. Hlatky, Mark MD. Hsia, Judith MD. Hulley, Stephen MD, MPH. Herd, Alan MD. Khan, Steven MD. Newby, L. Kristin MD. Waters, David MD. Vittinghoff, Eric PhD. Wenger, Nanette MD. for the HERS Research Group. Cardiovascular Disease Outcomes During 6.8 Years of Hormone Therapy: Heart and Estrogen/Progestin Replacement Study Follow-up (HERS II). JAMA. 288(1):49-57, July 3, 2002

11. Writing Group for the Women's Health Initiative Investigators. Risks and Benefits of Estrogen Plus Progestin in Healthy Postmenopausal Women: Principal Results From the Women's Health Initiative Randomized Controlled Trial. JAMA. 288(3):321-333, July 17, 2002.

12. Fletcher SW and Colditz GA. Failure of Estrogen and Progestin Therapy for Prevention. JAMA. 288(3):366-7, July 17, 2002.

13. Yusuf, Salim. Anand, Sonia. Hormone replacement therapy: a time for pause. CMAJ. 167(4):357-359, August 20, 2002.

14. Bresalier RS, Sandler RS, Quan H, Bolognese JA, Oxenius B, Horgan K, Lines C, Riddell R, Morton D, Lanas A, Konstam MA, Baron JA. Cardiovascular Events Associated with Rofecoxib in a Colorectal Adenoma Chemoprevention Trial. NEJM March 17th 2005. Vol 352. 1092-1102.