Questioning the scientific validity of the randomised trials of COX-2 inhibitors showing an increased risk of adverse cardiovascular events.

---------------------------------------------------------------------------------


An Adobe PDF version of this essay is available at http://jeffmann.net/soapbox/vioxx-cox2critiqueadobe.pdf


Introduction:
 

In September 2004, Merck, the manufacturer of rofecoxib, withdrew the drug from the market because the APPROVe trial demonstrated that the drug was associated with an increased incidence of adverse cardiovascular events. Since then, other randomised trial evidence has surfaced suggesting that other COX-2 inhibitors may be associated with an increased risk of adverse cardiovascular events -- in particular, the APC trial of celecoxib. The official reports of the APPROVe trial [1] and the APC trial [2] were both published in the March 17th issue of the NEJM. In that same issue of the journal, the NEJM editor made the following series of statements [3]:

"In the VIGOR study, there was a higher incidence of myocardial infarction in the rofecoxib group than in the control group treated with naproxen. Because the study lacked a placebo group, it was unclear whether the effect was due to an increased cardiovascular risk with rofecoxib or a protective effect of naproxen, or whether this was merely a chance finding. --------

In September 2004, Merck withdrew rofecoxib from the market because its trial, designed to test the hypothesis that COX-2 inhibitors could prevent recurrent colonic polyps, showed increased cardiovascular toxicity (one of the articles in this issue of the Journal presents the cardiovascular data from this study). The National Cancer Institute stopped a similar trial of celecoxib when an independent panel of cardiovascular experts reviewed the data and also found a greater risk of cardiovascular events among patients treated with celecoxib; the data on cardiovascular events from that trial are reported in this issue of the Journal. Also reported in this issue are the cardiovascular toxicity data from a trial of another COX-2 inhibitor, valdecoxib (and its intravenous prodrug, parecoxib). This trial, which examined pain relief in patients recovering from coronary-artery bypass surgery, showed an increased incidence of cardiovascular end points at 30 days among patients who had received a total of only 10 days of COX-2 inhibition.

Taken together, these three large, randomized, controlled trials designed to test the efficacy of different COX-2 inhibitors for a variety of indications confirmed the cardiovascular toxicity that had been suggested five years earlier."

Note that Drazen used the word "confirmed" (which I have highlighted). From my perspective, "confirmation" implies an incontestable state of scientific proof of causation, a state of near-100% certainty that COX-2 inhibitors increase the risk of adverse cardiovascular events. After examining the two trials' reports, I decided that the evidence from the APPROVe and APC trials suggesting that COX-2 inhibitors cause an increased risk of adverse cardiovascular events is very weak and definitely contestable. In fact, I think that the APPROVe trial will go down in history as the type of randomised trial that demonstrates the fundamental weakness of randomised clinical trials as a clinical research tool for establishing the scientific truth -- when the control event rate is low, the signal small, and the noise due to confounding variables large.

Sackett [4] stated that confidence in the results of a randomised clinical trial (RCT) is directly proportional to the trial's signal/noise ratio, and that one cannot be confident in the results of an RCT that has a low signal/noise ratio.

I particularly like Sackett's simple physiological formula because it doesn't depend on complicated statistical formulae (P values and confidence intervals), which most clinicians cannot really understand. Basically, Sackett is saying that confidence in the scientific validity of an RCT's results is directly proportional to the signal/noise ratio, and that one can only be confident that an RCT is reflecting the scientific truth regarding a drug's effect (either therapeutic efficacy or harm) if the magnitude of the signal is much larger than the magnitude of the noise. In an RCT, the signal is the absolute difference in the measured outcome event rate between the control agent (or placebo) and the investigational agent, and this difference is best represented by the ARR (absolute risk reduction) value. Noise in an RCT is that portion of the signal (ARR) that is due to random chance variations, and not specifically due to the drug being tested. If the magnitude of the noise increases relative to the magnitude of the signal, then the portion of the signal that is definitely due to the tested drug must decrease. If the magnitude of the noise exceeds the magnitude of the drug's effect, the signal/noise ratio will be very low, and one can no longer be confident in the scientific validity of the study's results (that is, that the signal, the ARR value, reflects the tested drug's true effect).

In this essay, I will demonstrate that the RCTs of COX-2 inhibitors, which demonstrate an increased risk of adverse cardiovascular events, have such a low signal/noise ratio that they cannot possibly generate scientifically valid evidence with a high level of confidence. Before I present my case, I will provide some background information on RCTs and signal/noise ratios so that interested readers, who have a limited background knowledge of evidence-based medicine terminology/methodology, can better understand my arguments. Although I will be using the terms signal and noise in a slightly different manner to Sackett, I still believe that Sackett's simple formula for confidence in a clinical trial's results will apply to my use of the terms.
 

Background section:
 

What is the major difference between scientific research in the physical sciences and randomised clinical trial research?

In the physical sciences, laboratory researchers routinely perform research on homogeneous materials, while RCTs in clinical research use a heterogeneous population of patients who have a variable likelihood of generating a signal. Consider the following imaginary example of a laboratory research experiment on a homogeneous substance -- plastic.

Let's presume that a plastic manufacturer is producing plastic shopping bags for supermarkets and he wants to produce the strongest plastic bag that can maximally support the weight of a loaded grocery bag without the need for double-bagging. Presume that the plastic manufacturer tests two plastics -- substance A and substance B -- which are homogeneous plastic substances free of contamination. Let's presume that he tests the two materials and he finds that shopping bags made from plastic substance A can support 60 lbs of weight and shopping bags made from plastic substance B can support 50 lbs of weight. He can then conclude that i) plastic substance A is qualitatively stronger than plastic substance B and ii) the difference in strength is a weight support difference of 10 lbs. That difference represents the experimental study's signal. I arbitrarily define noise as any confounding variable that can affect the experiment's signal and that is not specifically due to the intrinsic properties of plastic substance A or plastic substance B. For example, if an aggrieved employee adds contaminants to the plastic mix of either substance A and/or substance B to produce a heterogeneous plastic substance, then the ability of those contaminants to alter the strength of the plastic material would represent noise in any strength testing procedure. In the absence of such noise, the plastic manufacturer's testing procedure is characterised by a high signal/noise ratio and one can be extremely confident in the scientific validity of his experiment's results -- that substance A produces a stronger plastic bag than substance B by a margin of 10 lbs of extra weight support.

Consider what effect contaminants would have on the experimental results if an aggrieved employee decided to disrupt the manufacturer's scientific testing process by adding either contaminant substance C or contaminant substance D to the two plastics A and B. Presume that substance C increases the strength of any plastic material by 3 lbs of weight support, and that substance D decreases the strength of any plastic material by 3 lbs of weight support. Then 4 possible experimental result-situations would exist.

Scenario 1:

1) Substance A + substance C compared to substance B + substance D = 60+3 compared to 50-3 = 63-47 = 16 lbs weight support difference.

2) Substance A + substance C compared to substance B + substance C = 60+3 compared to 50+3 = 63-53 = 10 lbs weight support difference.

3) Substance A + substance D compared to substance B + substance C = 60-3 compared to 50+3 = 57-53 = 4 lbs weight support difference.

4) Substance A + substance D compared to substance B + substance D = 60-3 compared to 50-3 = 57-47 = 10 lbs weight support difference.

What effect does the noise (effect of contaminants that either inflate or deflate the size of the signal) have on the experiment's signal/noise ratio?

Approximately 50% of the time [subscenarios 2) and 4)], the presence of noise doesn't affect the magnitude of the signal, because the same contaminant is present in both plastics and its effect on each cancels out. The experimental test results in subscenario 2) and subscenario 4) therefore still have a very high signal/noise ratio, and one can be confident that the measured strength difference (ARR) truly reflects the strength difference between substance A and substance B. Approximately 25% of the time, the contaminants inflate the signal by 6 lbs (ARR of 16 lbs strength difference) and that extra 6 lbs strength difference is due to noise and not due to the true differential strength property of substance A compared to substance B. Approximately 25% of the time, the contaminants deflate the signal by 6 lbs (ARR of 4 lbs strength difference) and the apparent 6 lbs loss of strength is due to noise and not due to the true differential strength property of substance A compared to substance B. One can still conclude that substance A is qualitatively stronger than substance B each time the experiment is performed, but the quantitative estimation of strength will vary from trial to trial (the estimated ARR will vary from 4 to 16 lbs of strength difference). From a confidence perspective, one can still be 100% confident in the scientific validity of any of those 4 trials' results that concludes that substance A is qualitatively stronger than plastic substance B, although one can only be 50% confident that the measured effect (ARR) represents the true magnitude of the strength difference between substance A and substance B.

Now consider the same experimental situation if substance C could increase the strength of any plastic material by 12 lbs, and substance D could decrease the strength of any plastic material by 18 lbs. Then 4 possible experimental result-situations would exist.

Scenario 2:

1) Substance A + substance C compared to substance B + substance D = 60+12 compared to 50-18 = 72-32 = 40 lbs weight support difference.

2) Substance A + substance C compared to substance B + substance C = 60+12 compared to 50+12 = 72-62 = 10 lbs weight support difference.

3) Substance A + substance D compared to substance B + substance C = 60-18 compared to 50+12 = 42-62 = minus 20 lbs weight support difference.

4) Substance A + substance D compared to substance B + substance D = 60-18 compared to 50-18 = 42-32 = 10 lbs weight support difference.

Now there is a 25% likelihood that an experimental test will produce a result showing a qualitative difference in the opposite direction, suggesting that substance A is weaker than substance B by 20 lbs, even though substance A is actually stronger than substance B by 10 lbs of weight support (subscenario 3). In this situation, the signal/noise ratio of the experimental situation can be deemed to be very low because the noise swamps the signal, and the measured ARR result does not reflect the true reality of the comparative strength of substance A compared to substance B. In fact, the effect of the noise is so large that it overpowers the true strength difference between substances A and B to such a degree that the measured ARR value makes it appear that substance A is weaker than substance B by a moderate amount (20 lbs strength difference) when in actuality substance A is stronger than substance B by 10 lbs. The signal/noise ratio of subscenario 1) is also very low, but in this instance the noise factor has inflated (rather than deflated) the ARR value, and the test result makes it appear that the strength difference between substance A and substance B (40 lbs measured strength difference) is 4 times the true value (10 lbs strength difference).
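The four contaminant pairings in each scenario can be enumerated mechanically. The following Python sketch is only an illustration of the arithmetic above: the function name measured_differences and the constant TRUE_DIFFERENCE are hypothetical names chosen for this example, and the strengths (60 and 50 lbs) and contaminant effects are the imaginary values used in the text.

```python
def measured_differences(strength_a, strength_b, boost, penalty):
    """Return the measured A-minus-B difference (lbs) for each contaminant pairing.

    Contaminant C adds `boost` lbs and contaminant D removes `penalty` lbs.
    The pairings are listed in the same order as in the essay.
    """
    effects = {"C": boost, "D": -penalty}
    pairings = [("C", "D"), ("C", "C"), ("D", "C"), ("D", "D")]
    results = []
    for in_a, in_b in pairings:
        diff = (strength_a + effects[in_a]) - (strength_b + effects[in_b])
        results.append((f"A+{in_a} vs B+{in_b}", diff))
    return results


TRUE_DIFFERENCE = 60 - 50  # lbs: the signal when no contaminant is present

for name, boost, penalty in [("Scenario 1", 3, 3), ("Scenario 2", 12, 18)]:
    print(name)
    for label, diff in measured_differences(60, 50, boost, penalty):
        noise = diff - TRUE_DIFFERENCE  # part of the measurement not due to A vs B
        print(f"  {label}: measured {diff:+d} lbs, of which noise {noise:+d} lbs")
```

Run as written, the sketch lists a noise contribution of zero whenever the same contaminant is in both plastics, and a non-zero contribution whenever the contaminants differ.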

What is particularly interesting is that the plastic manufacturer cannot really determine the scientific truth when faced with 4 disparate experimental results of this order of magnitude, even if he suspects that contaminants are affecting the strength of his plastics, because he cannot accurately quantify the extent of the contaminant's confounding effect if he doesn't know which contaminants are present and to what degree they affect the strength of the plastic material. If this confusing situation is compounded multifold by having a situation of 20+ contaminants that have varying effects on the strength of the plastic substances A and B, then the scientific testing process will produce endlessly variable results and the plastic manufacturer may learn nothing concrete about the relative strength of plastic substances A and B by performing a comparative scientific testing procedure. In this situation, one's confidence in the scientific validity of the strength-testing experiments could be close to zero!

These imaginary examples demonstrate why it is necessary for experimenters to always ensure that they are dealing with a homogeneous substance when performing scientific testing -- if the experimenter wants to be extremely confident that the test's results accurately reflect true reality. Marked degrees of heterogeneity in a substance's physical properties can significantly confound a scientific test's results if the confounding variables have a capacity to change the magnitude of the measured signal to a large degree.

I think that it is especially important that scientists perform tests on homogeneous materials when dealing with critically important materials. For example, the manufacturer of heat-shield tiles for the space shuttle cannot afford to be unsure about the properties of the material used to manufacture the tiles. The results of the heat-resistance experimental tests have to be absolutely foolproof; they have to demonstrate 100% consistency and 100% repeatability. The heat-shield tile manufacturer would never accept experimental scenarios like scenarios 1/2 above as being scientifically acceptable. The same applies to the entire field of aeronautics. Would you fly in a commercial airplane if you were not 100% confident that all the scientific tests utilised in airplane design and manufacturing were 100% scientifically valid?

What has all this to do with RCTs involving the COX-2 inhibitors?

I'll cut to the chase.

During a three-day period (February 16th-18th), the FDA held an advisory committee meeting to discuss the evidence that COX-2 inhibitors are associated with an increased risk of adverse cardiovascular events. The committee concluded that there is evidence to suggest that rofecoxib may be associated with an increased risk of adverse cardiovascular events and that the estimated RR value is 1.5-2.0.

Let's presume, for argument's sake, that rofecoxib is associated with an RR value of 2.0, which implies a doubling of the baseline risk. Now consider the following two RCT scenarios involving rofecoxib where the yearly control event rate (yearly rate of MI/stroke/CV death) is 0.5% (low risk patients) and 5% (high risk patients) respectively. If rofecoxib doubles the risk, then the yearly event rate in the rofecoxib group would be 1% and 10% respectively, an absolute increase of 0.5% and 5% respectively. If the treatment and placebo groups enrolled in these hypothetical rofecoxib RCTs were homogeneous and had the same risk of an outcome event at baseline, which is the fundamental purpose of randomisation, and chance played an insignificant role in the trial, then the results should be as predicted. However, heterogeneity and chance effects occur in all RCTs. Consider the influence of heterogeneity/chance effects on these two RCTs if random variation in prognostic variables, combined with chance variability, creates a potential yearly chance variability effect of 0.5% -- causing either a 0.5% absolute increase, or a 0.5% absolute decrease, in the yearly event rate of each group.

Scenario for high risk placebo and rofecoxib patients

1) Yearly baseline risk for rofecoxib patients + increased yearly risk due to heterogeneity effects = 10% + 0.5% = 10.5%

2) Yearly baseline risk for rofecoxib patients + decreased yearly risk due to heterogeneity effects = 10% - 0.5% = 9.5%

3) Yearly baseline risk for placebo patients + increased yearly risk due to heterogeneity effects = 5% + 0.5% = 5.5%

4) Yearly baseline risk for placebo patients + decreased yearly risk due to heterogeneity effects = 5% - 0.5% = 4.5%

This theoretical exercise demonstrates that heterogeneity chance effects of this order of magnitude will not significantly affect the accuracy of estimation of rofecoxib's harmful effect -- both qualitatively and quantitatively.

Scenario for low risk placebo and rofecoxib patients

1) Yearly baseline risk for rofecoxib patients + increased yearly risk due to heterogeneity effects = 1% + 0.5% = 1.5%

2) Yearly baseline risk for rofecoxib patients + decreased yearly risk due to heterogeneity effects = 1% - 0.5% = 0.5%

3) Yearly baseline risk for placebo patients + increased yearly risk due to heterogeneity effects = 0.5% + 0.5% = 1.0%

4) Yearly baseline risk for placebo patients + decreased yearly risk due to heterogeneity effects = 0.5% - 0.5% = 0%

There are now 4 possible results for an RCT involving those low risk patients.

1) Rofecoxib 1.5% versus placebo 1.0% = absolute increase in yearly risk of 0.5%

2) Rofecoxib 1.5% versus placebo 0% = absolute increase in yearly risk of 1.5%

3) Rofecoxib 0.5% versus placebo 1.0% = absolute decrease in yearly risk of 0.5%

4) Rofecoxib 0.5% versus placebo 0% = absolute increase in yearly risk of 0.5%

Aren't the results somewhat surprising?

The results demonstrate that a mere 0.5% absolute chance increase/decrease in the yearly event rate of adverse cardiovascular events (equivalent to 0.5 events/100 patient years) in both the placebo and treated patient groups can markedly affect the results of a COX-2 inhibitor RCT that enrolls "low risk" patients -- both qualitatively and quantitatively.

Note that under these circumstances there is only a 50% probability that an RCT will demonstrate the true harmful influence of the COX-2 inhibitor (presuming that it really exists).

Note that under these circumstances there is a 25% probability that an RCT will exaggerate the harmful influence of the COX-2 inhibitor by a factor of 3.

Note that under these circumstances there is a 25% probability that an RCT will demonstrate that the COX-2 inhibitor decreases the risk of adverse cardiovascular events, thereby implying that it may have a beneficial effect!

Obviously, chance events do not necessarily have to affect the placebo and treated patient groups to the same degree, so the potential outcome results can vary markedly between these extremes (I have only used those 4 possible theoretical scenarios for simplicity's sake, in order to demonstrate a basic point that one cannot automatically regard COX-2 inhibitor RCTs, which enroll low risk patients, as having a high signal/noise ratio).
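The same four-way enumeration can be written out for both the high risk and the low risk scenario. This Python sketch is only an illustration under the essay's own simplifying assumptions (a true relative risk of 2.0 and a chance shift of exactly plus or minus 0.5 percentage points in each arm); the function name trial_outcomes is a hypothetical name chosen for this example. Decimal arithmetic is used so that the percentages are exact.

```python
from decimal import Decimal


def trial_outcomes(placebo_rate, drug_rate, chance_shift):
    """List (drug rate, placebo rate, absolute difference) for the four
    combinations in which chance moves each arm's yearly rate up or down.
    All values are yearly event rates in percent, passed as strings."""
    shift = Decimal(chance_shift)
    outcomes = []
    for drug_sign in (+1, -1):
        for placebo_sign in (+1, -1):
            drug = Decimal(drug_rate) + drug_sign * shift
            placebo = Decimal(placebo_rate) + placebo_sign * shift
            outcomes.append((drug, placebo, drug - placebo))
    return outcomes


scenarios = {
    "High risk (placebo 5%, rofecoxib 10%)": ("5", "10"),
    "Low risk (placebo 0.5%, rofecoxib 1%)": ("0.5", "1"),
}

for name, (placebo_rate, drug_rate) in scenarios.items():
    true_difference = Decimal(drug_rate) - Decimal(placebo_rate)
    print(f"{name}: true absolute difference {true_difference:+.1f}%")
    for drug, placebo, diff in trial_outcomes(placebo_rate, drug_rate, "0.5"):
        print(f"  rofecoxib {drug:.1f}% vs placebo {placebo:.1f}%: {diff:+.1f}%")
```

In the high risk scenario every measured difference keeps the same sign as the true difference, whereas in the low risk scenario one of the four combinations reverses it.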

Do I have substantial "evidence" to demonstrate that chance event variations of this order of magnitude (~0.5 events/100 patient years) actually happen in "real life" RCTs involving COX-2 inhibitors? Consider the following evidence.


Chance effects in RCTs involving COX-2 inhibitors -- the TARGET trial


Consider the results from the TARGET trial [5]. This was a large trial involving ~18,000 patients which lasted approximately 12 months.

The TARGET trial was a study of a COX-2 inhibitor (lumiracoxib) versus two non-selective NSAIDs (ibuprofen and naproxen). The study was arbitrarily divided into two sub-studies -- lumiracoxib versus ibuprofen and lumiracoxib versus naproxen. The study had two endpoints -- to determine whether there was a difference in i) gastrointestinal ulceration complications and ii) cardiovascular complications between the comparative drug treatment groups.

Here is the description of the randomisation allocation process from the original Lancet report.

"For logistical and masking reasons, TARGET was divided into two substudies, one with naproxen as the comparator and the other with ibuprofen. Within each substudy randomisation was stratified by age and low-dose aspirin use. The sponsor prepared a computer generated randomisation list with appropriate blocks. The study was centrally randomised according to strata with an interactive voice response system in all countries to ensure age and low-dose aspirin stratification. Allocation of treatment was done via the interactive system and all information was verified by this system before allocation of the patient to a treatment and assignment of the drug packs. To ensure allocation concealment all treatment packs were identically designed and all study drugs were supplied as tablets with matching placebo. We prespecified that data from the two substudies would be pooled for analysis."

The fundamental purpose of randomisation is to create 4 subgroups of patients who have the same baseline risk of a control event. Because I am only considering cardiovascular complications in this essay, let's consider whether the four groups were adequately balanced at baseline in terms of cardiovascular prognostic variables.

Here is a copy of table 1 (baseline characteristics) from the Lancet paper [5].

Note that the four groups are well balanced at baseline by the usual standards of general acceptability.

Therefore, there should be no* reason why the two lumiracoxib groups should have a significantly different adverse cardiovascular event outcome rate -- unless chance and other confounding effects played a highly significant role in the trial.

(* Note that even if lumiracoxib has a slight propensity to increase the risk of an adverse cardiovascular event when compared to placebo, the additional risk should theoretically be the same for the two lumiracoxib subgroups.)

Consider the cardiovascular outcome results for the two lumiracoxib groups in the following table from the Lancet paper [5].

I have yellow-highlighted the primary cardiovascular outcome results for APTC events (MI/stroke/CV death).

Note that the primary cardiovascular outcome event rate for the lumiracoxib subgroup patients varied from 0.43% (substudy 1) to 0.84% (substudy 2). The absolute difference is 0.41%. This relative doubling of the cardiovascular event rate between the lumiracoxib subgroups from substudy 1 and substudy 2 is a pure chance occurrence! An absolute difference of 0.41% may not seem large, but it is actually larger than the magnitude of the measured signal (the difference between lumiracoxib and either of the non-selective NSAIDs), which is the measure of relative cardiovascular harm between a COX-2 inhibitor and a non-selective NSAID drug.

To fully dramatise this critically important fact (and to emphasise the fact that this trial has a very low signal/noise ratio as a result of this chance variability), I am reproducing the Kaplan-Meier curves for cumulative cardiovascular events for the entire 12 month duration of the study.

From reference [6].

Note how widely separated the two curves are for the two lumiracoxib subgroups. Note that the wide separation is apparent from the time of trial inception (even when the absolute number of events is very small).

Note that the degree of separation between the two lumiracoxib curves is larger than the degree of separation between any of the lumiracoxib curves and either the ibuprofen or naproxen curves -- this implies that the magnitude of the chance effect is larger than the magnitude of the signal. This situation is somewhat similar to scenario 2 in the imaginary "plastics for supermarket shopping bag" example.

My conclusion is that this randomised trial is a prime example of an RCT that has a low signal/noise ratio, and that one cannot therefore be confident in the trial's conclusion. In fact, one could easily conclude that the trial's interpretative conclusions have near-zero scientific validity, because ibuprofen (or naproxen) could just as fairly be compared with the other lumiracoxib subgroup instead of its own, resulting in a totally contrary conclusion.

Do I have other evidence that chance event variations of this order of magnitude (~0.5 events/100 patient years) actually happen in "real life" RCTs involving COX-2 inhibitors? I do have more evidence, but before I present that evidence, let's consider what factors may have caused the comparatively large chance variation in the rate of adverse cardiovascular events between the two lumiracoxib subgroups in the TARGET trial (absolute difference of 0.41 events/100 patient years).

I can think of three reasons that could explain the chance variability.

i) Chance occurrences.

ii) Pattern-distribution variability in baseline prognostic variables.

iii) Constantly changing state of risk as a result of a differential drop-out phenomenon.

I will discuss each of these chance-type phenomena separately.

Chance occurrences

What is the likelihood that the lumiracoxib patients would have an adverse cardiovascular event in the 1 year time period of the trial?

To answer that question, let's consider the type of patients recruited into the TARGET trial.

I have taken the baseline characteristics of the lumiracoxib subgroups and presented some of the most relevant baseline characteristics in the following table.

Characteristic              Lumiracoxib subgroups
Age (mean)                  63 years
Sex                         76% female
High cardiovascular risk    2%
Low-dose aspirin use        24%
History of MI               2%
Current smoker              10%
Hypertension                46%
Dyslipidemia                20%

On average, if one considers all the patients as being a "single" individual, who is representative of the "average" enrolled patient, then one could state that the patient is likely to be a 63-year-old female, who has a relatively low 10 year risk of a future adverse CAD event. It is theoretically possible to estimate the 10 year likelihood of a future adverse CAD event in a "single" individual if one inputs the required data into a coronary risk prediction tool. Let's presume, for argument's sake, that the *predicted risk of a future adverse CAD event is 10% over 10 years. That works out to an average yearly risk of 1%/year.

(* Note that it is actually impossible to accurately calculate the "average" risk of the trial patients as a single "representative" value for a variety of reasons. First of all, the risk prediction tables require input of the specific level of hypertension and serum cholesterol, and that information is not provided in this table. Also, one cannot pretend to know how to combine all the different factors into a single "representative" risk factor figure for an entire group of patients. However, that is not important, because I have only chosen this arbitrary predicted risk value of 10% over 10 years for illustrative purposes, and my argument wouldn't be any different if the "average" predicted risk figure for the entire group of patients was 6%, or 8%, or 12% over 10 years. It also doesn't matter if one thinks of the risk as being a risk of a CAD event or as the risk of an APTC cardiovascular event, which also includes stroke and sudden CV death.)

Therefore, let me reframe my basic position. If the "average" predicted 10 year risk of an APTC event for each lumiracoxib subgroup of patients (taken as a whole) is 10%, then although the "average" yearly value will be 1%, there is no method of precisely determining when those events will actually occur. If all the underlying risk factors remain unchanged throughout the 10 year period, then there has to be a slightly higher incidence in the last 5 years (compared to the first 5 years) because the patients are aging by 10 years over that same time period, and increasing age is a major risk factor for a future APTC event. Let's presume, for argument's sake, that 6 percentage points of the predicted 10% risk would fall in the last 5 years and 4 percentage points in the first 5 years. How would that 4% risk be distributed during the first 5 years? We obviously have no idea! There is no risk prediction tool that can accurately predict how that pattern distribution will occur.

Here are some possible chance variations of distribution of those cardiovascular events over a 5 year time period (4% overall risk distributed over 5 years).

Variation number 1 (yearly event rate for 5 consecutive years): 0.75%/0.75%/1.0%/0.6%/0.9%

Variation number 2 (yearly event rate for 5 consecutive years): 0.4%/0.8%/1.2%/0.6%/1.0%

Variation number 3 (yearly event rate for 5 consecutive years): 0.5%/0.4%/1.2%/0.8%/1.1%

I could produce an endless number of possible variations, but I think that the average reader should already grasp the point that I am trying to make -- that if a trial only lasts 1 year, then the likelihood of a group of enrolled patients having an adverse cardiovascular event during that particular year can vary markedly due to chance alone. In fact, this chance variability phenomenon could alone account for the absolute chance variation of 0.41% that occurred between the two lumiracoxib subgroups, because there is no intrinsic reason why chance should cause both groups to have the same magnitude of chance effect in the same year.
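The three illustrative variations can be checked with a few lines of arithmetic. This Python sketch is only an illustration: it uses the made-up yearly rates listed above (not trial data), and the variable names are hypothetical. It confirms that each variation adds up to the assumed 4% five-year risk and, for each year, reports the gap between the highest and lowest rate across the variations.

```python
from decimal import Decimal

# Yearly APTC event rates (percent) for the three illustrative variations.
variations = {
    1: ["0.75", "0.75", "1.0", "0.6", "0.9"],
    2: ["0.4", "0.8", "1.2", "0.6", "1.0"],
    3: ["0.5", "0.4", "1.2", "0.8", "1.1"],
}
rates = {number: [Decimal(r) for r in row] for number, row in variations.items()}

# Each variation should add up to the assumed 4% risk over the first 5 years.
totals = {number: sum(row) for number, row in rates.items()}

# For each year, the gap between the highest and lowest rate across variations.
yearly_spread = [max(year) - min(year) for year in zip(*rates.values())]

for number, total in totals.items():
    print(f"Variation {number}: 5-year total {total}%")
for year, spread in enumerate(yearly_spread, start=1):
    print(f"Year {year}: spread across variations {spread}%")
```

With these particular numbers the largest single-year gap is 0.4 percentage points (year 2), which is of the same size as the 0.41% difference seen between the two lumiracoxib subgroups.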

Pattern-distribution variability in baseline prognostic variables

This chance variable is closely related to the first chance variable that I have already discussed.

Imagine that one enrolls 1 million patients into a TARGET-type trial, and subdivides those 1 million patients into one thousand subgroups, each containing 1,000 patients. Then, theoretically, each 1,000-patient subgroup should on "average" have the same distribution of baseline characteristics if the inclusion/exclusion criteria and randomisation scheme are identical to those used in the TARGET trial. Let's presume that the baseline characteristics of each 1,000-patient subgroup are as follows (identical to the TARGET trial).
 

Characteristic              Lumiracoxib subgroups
Age (mean)                  63 years
Sex                         76% female
High cardiovascular risk    2%
Low-dose aspirin use        24%
History of MI               2%
Current smoker              10%
Hypertension                46%
Dyslipidemia                20%


In my first argument, I stated that one could imagine that this entire group of patients (taken as a whole) has a finite 10-year predicted risk of an APTC event and I arbitrarily (for illustrative purposes) used a 10-year risk prediction figure of 10%. I then argued that it is impossible to accurately predict what the yearly risk figure would be for any particular year, and that this represented a chance variable for a trial that only lasted 1 year.

However, there is a second chance variable that applies to each of these one thousand 1,000-patient subgroups. Each of these subgroups could have a different pattern-distribution of these baseline characteristics. I will provide a few illustrative examples.

Although each 1,000-patient subgroup has a low-dose aspirin use value of 24%, each subgroup could have a different pattern-distribution of aspirin use. It is theoretically possible that in some of those 1,000-patient subgroups, disproportionately more males could be taking aspirin. Aspirin is reputed to decrease the risk of a future CAD event by about 23% in males, but not in females (according to the latest RCT evidence). Therefore the effect of the "low-dose aspirin" factor could be slightly different in each of those 1,000-patient subgroups. It is also theoretically possible that chance could result in disproportionately more patients being aspirin-resistant in some of those subgroups compared to the other subgroups. This aspirin-resistance phenomenon adds another chance element to the "low-dose aspirin" factor chance variable mix.

The same chance phenomenon applies to each of the coronary risk factors. If only a small subset of each subgroup's 1,000 patients has a particular coronary risk factor, then many different pattern-distribution variations are possible, and each pattern-distribution variation may result in a slightly different level of risk for the entire group of 1,000 patients taken as a whole.

The point that I am trying to make is that each of those one thousand 1,000-patient subgroups could have a slightly different chance likelihood of having an APTC event within a time period of 1 year, and this pattern-distribution chance variable is another chance factor that cannot be predicted, or controlled, despite an optimum randomisation process. Therefore, it should not be surprising that the two lumiracoxib subgroups had different adverse cardiovascular event rates in the TARGET trial.

Constantly changing state of risk as a result of a differential drop-out phenomenon

I have argued that even if the randomisation process is optimal, chance alone could affect the "real life" likelihood that two enrolled groups of patients are really perfectly balanced at baseline. However, even if the treated and control patient groups (or the two lumiracoxib subgroups) were perfectly balanced at baseline in terms of all prognostic variables, it is very unlikely that the two groups would remain perfectly balanced for the entire duration of the trial if the RCT has a high drop-out rate.

The TARGET trial had a 40% drop-out rate, which is a very high figure. If 40% of the enrolled patients drop out, then it is obviously possible that there could be a constantly changing pattern-distribution of prognostic variables in the remaining patients, and it would not be surprising if the treated and placebo groups (or the two lumiracoxib subgroups) each had a different pattern-distribution of prognostic variables at different time points throughout the trial.

Consider the following illustrative example of a drop-out pattern. I didn't have a graphical display for the TARGET trial, so I am using the VIGOR trial for illustrative purposes.

From reference [6].

In the VIGOR trial, the drop-out rate was similar for the rofecoxib and naproxen subgroups. However, this graph-pattern doesn't prove that the overall predicted risk of a control event for each of those groups for the remainder of the trial remains balanced. Chance could have affected one group more than the other group in terms of dropping-out the highest risk (or lowest risk) patients, and this could radically alter the prognostic balance in a low control event trial.

Also, in any particular RCT there is no guarantee that the slope angle of the drop-out phenomenon will be the same for the treated and control groups and this differential effect could also affect the prognostic balance between the two groups.

I think that it is impossible to know to what degree this drop-out phenomenon affects a randomised clinical trial's balance at different time points during the trial if the phenomenon is not studied in great depth. However, in the absence of evidence to the contrary, I think that it is reasonable to assume that the drop-out phenomenon could be a major confounding variable, especially in low control event RCTs that generate a small signal.

I find it interesting that the VIGOR trial showed the greatest harm from rofecoxib relative to naproxen in the last 3 months of the trial, when the drop-out rate was undergoing its maximal change.

Consider the Kaplan Meier curves for the VIGOR trial (from reference number [6]).
 


Note how steeply the curve for the rofecoxib patients increases after 8 months. This phenomenon of an abruptly increased rate of CV events occurs at exactly the same time that the "sampling size remaining in the trial" value plummets from 70% to 10% (see graph above). Could the differential adverse cardiovascular event rate after 8 months be due to the fact that a differential drop-out phenomenon left the rofecoxib group with a subset of patients who were at higher risk of an adverse cardiovascular event compared to the baseline situation that existed at the time of trial inception, and compared to the naproxen group, who could theoretically even be experiencing the opposite effect of having a lesser risk profile as a result of the drop-out phenomenon?

I noticed that this same phenomenon occurred in the APPROVe trial. I will discuss the drop-out phenomenon as it applies to the APPROVe trial in the next section.


Further evidence of random chance effects -- the APPROVe trial


Consider the official report of the APPROVe trial's results -- from the abstract published in the NEJM [1]. I have specifically highlighted the trial investigators' primary conclusion.


"Background: Selective inhibition of cyclooxygenase-2 (COX-2) may be associated with an increased risk of thrombotic events, but only limited long-term data have been available for analysis. We report on the cardiovascular outcomes associated with the use of the selective COX-2 inhibitor rofecoxib in a long-term, multicenter, randomized, placebo-controlled, double-blind trial designed to determine the effect of three years of treatment with rofecoxib on the risk of recurrent neoplastic polyps of the large bowel in patients with a history of colorectal adenomas.

Methods: A total of 2586 patients with a history of colorectal adenomas underwent randomization: 1287 were assigned to receive 25 mg of rofecoxib daily, and 1299 to receive placebo. All investigator-reported serious adverse events that represented potential thrombotic cardiovascular events were adjudicated in a blinded fashion by an external committee.

Results: A total of 46 patients in the rofecoxib group had a confirmed thrombotic event during 3059 patient-years of follow-up (1.50 events per 100 patient-years), as compared with 26 patients in the placebo group during 3327 patient-years of follow-up (0.78 event per 100 patient-years); the corresponding relative risk was 1.92 (95 percent confidence interval, 1.19 to 3.11; P=0.008). The increased relative risk became apparent after 18 months of treatment; during the first 18 months, the event rates were similar in the two groups. The results primarily reflect a greater number of myocardial infarctions and ischemic cerebrovascular events in the rofecoxib group. There was earlier separation (at approximately five months) between groups in the incidence of nonadjudicated investigator-reported congestive heart failure, pulmonary edema, or cardiac failure (hazard ratio for the comparison of the rofecoxib group with the placebo group, 4.61; 95 percent confidence interval, 1.50 to 18.83). Overall and cardiovascular mortality was similar in the two groups."

Consider the Kaplan Meier curves graph for cumulative adverse cardiovascular events for the APPROVe trial (from reference [6]).



The official interpretation of the APPROVe trial was as follows-:

1) There is no signal that rofecoxib was harmful in the first 18 months of the trial.

2) The harm signal became apparent in the second half of the trial.

3) The signal of increased risk of adverse cardiovascular events in the APPROVe trial primarily reflects an increased risk of events in the rofecoxib patients.

4) The overall increased degree of cardiovascular risk for the entire duration of the trial works out to a RR of 1.92.
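As a rough check on point 4, the event counts and patient-years quoted in the abstract can be turned into crude rates and a crude rate ratio. The sketch below is only an illustration: the Wald-type confidence interval on the log rate ratio is my own assumption about how such an interval can be approximated, and it is not necessarily the method the trialists used.

```python
import math

# Counts and follow-up as quoted in the APPROVe abstract above.
rofecoxib_events, rofecoxib_py = 46, 3059
placebo_events, placebo_py = 26, 3327


def rate_per_100_py(events, patient_years):
    """Crude event rate per 100 patient-years."""
    return 100.0 * events / patient_years


rofecoxib_rate = rate_per_100_py(rofecoxib_events, rofecoxib_py)
placebo_rate = rate_per_100_py(placebo_events, placebo_py)
rate_ratio = rofecoxib_rate / placebo_rate

# Assumption: approximate 95% interval on the log scale, treating the
# event counts as Poisson. The standard error depends only on the counts.
se_log = math.sqrt(1 / rofecoxib_events + 1 / placebo_events)
ci_low = math.exp(math.log(rate_ratio) - 1.96 * se_log)
ci_high = math.exp(math.log(rate_ratio) + 1.96 * se_log)

print(f"rofecoxib: {rofecoxib_rate:.2f} events/100 patient years")
print(f"placebo:   {placebo_rate:.2f} events/100 patient years")
print(f"crude rate ratio: {rate_ratio:.2f} ({ci_low:.2f} to {ci_high:.2f})")
```

Note that in this approximation the width of the interval is governed entirely by the two event counts (46 and 26), which is why a trial with few events produces a wide interval regardless of how many patients were enrolled.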

Do you agree with the official interpretation? Do you think that the RR value of 1.92 is truly a 100% accurate absolute reflection of rofecoxib's harmful CV effect?

Although I think that the official interpretation is vaguely accurate, I also think that the official interpretation is far too simplistic because it does not discuss the potential effect of any noise elements and it does not give us any idea whether this trial has a high or a low signal/noise ratio. Without estimating the trial's signal/noise ratio, one cannot quantify to what degree one should have confidence in the trial's official conclusion [7].

It is true that there were more adverse cardiovascular events in the rofecoxib group in the last 18 months of the trial compared to the first 18 months of the trial (1.71 events/100 patient years for the second half versus 1.33 events/100 patient years for the first half), but the absolute increase was not as great as the marked absolute decrease in adverse cardiovascular events in the placebo group in the second half of the trial compared to the first half of the trial (0.38 events/100 patient years for the second half versus 1.13 events/100 patient years for the first half).

From my perspective, the most dramatic event-phenomenon in the APPROVe trial's Kaplan Meier curves (seen clearly in the above graphical display slide) is the fact that the cumulative adverse cardiovascular events curve for the placebo patients flattened dramatically during the last 18 months of the trial, and that this marked flattening is the primary cause of the signal becoming readily apparent. In other words, I am suggesting that noise (due to random chance events) could be responsible for a major part of the APPROVe trial's signal. This fact suggests that the APPROVe trial has a low signal/noise ratio, which implies that one cannot therefore be confident in the trial's quantitative conclusion.

How does one quantify the amount of noise in the APPROVE trial? See the box section if you are interested in my hypothetical conjecturing regarding the magnitude of the noise factor. Bypass the box section if you want to stay connected to the overall thrust of my argument, and don't want to be diverted.
 

Quantifying the *magnitude of the noise factor in the APPROVe trial

(* I will be using APTC events in this hypothetical calculation instead of confirmed thrombotic events, which was the endpoint used by the APPROVe trialists in many of their result-presentations and in their Kaplan Meier graph presentation. The APTC results of the APPROVe trial were as follows: first 18 months -- placebo group 0.68 events/100 patient years, rofecoxib group 0.84 events/100 patient years; last 18 months -- placebo group 0.38 events/100 patient years, rofecoxib group 1.42 events/100 patient years)

I made the statement that "noise" (due to random chance events) could be responsible for a major part of the APPROVe trial's signal in the last 18 months of the trial.

On what basis can I make that claim?

I am making that claim on the basis that the placebo group's event rate figure for the last 18 months of the trial (0.38 APTC events/100 patient years) is too low, and that a low level inflates the size of the signal (ARR). However, when I state that the event rate level is too low, how do I know what the APTC event rate/100 patient years value should be during that 18 month time period? How would I answer a critic who claims that the event rate was apparently too low in the last 18 months because the event rate was actually too high in the first 18 months and that it all balances out anyway to an "average" value of 0.54 events/100 patient years?

First of all, is an average value of 0.54 events/100 patient years for a control event rate value what one would expect on "average" for a group of placebo patients who had the same baseline characteristics as the APPROVe trial's patients? I don't know how one can accurately answer that question considering that the APPROVe trial's sample size was so small (~1,200 patients). I think that one can only answer that question accurately if one had a sample size sufficiently large so as to markedly decrease the likelihood of chance events affecting the "average" APTC event rate value, eg. 12,000 patients instead of ~1,200 patients in each arm of the trial. Let's presume, for argument's sake, that the APPROVe trial had three arms -- two placebo arms (one placebo arm consisting of 1,200 patients and the other placebo arm consisting of 12,000 patients) and one rofecoxib arm (1,200 patients). In that situation, I would use the "average" APTC event rate value from the larger placebo arm to make any quantitative calculations. Let's presume, for argument's sake, that the 12,000 patient placebo arm had an average APTC event rate overall of 0.8 events/100 patient years. How would I then estimate the magnitude of the noise (relative to the signal) in the APPROVe trial?

For the first 18 months of the APPROVe trial

If the expected control event rate for placebo patients should be 0.8 APTC events/ 100 patient years on average, and the placebo group's event rate in the APPROVE trial was 0.68 events/100 patient years, then I would conclude that there was an absolute 0.12 events/100 patient years magnitude noise factor which would artefactually inflate the "apparent" portion of rofecoxib's contribution to the measured signal (ARR).

What about the magnitude of the noise factor in the rofecoxib group? I believe that it is unknowable! If one presumes that the measured value of 0.84 APTC events/100 patient years in the rofecoxib group of the APPROVe trial was entirely due to the baseline prognostic factors, which actually predict an expected "average" value of 0.8 events/100 patient years, then one would conclude that rofecoxib had no significant intrinsic harmful cardiovascular effect in the first 18 months of the trial. However, we do not really know whether random chance events also affected the rofecoxib group (as they did the placebo group) and to what degree (the same degree or a different degree). If random chance events caused the rofecoxib group's expected event rate level to be lower than average and to exactly the same extent as the placebo group's (eg. 0.68 events/100 patient years instead of an expected value of 0.8 events/100 patient years), then the magnitude of the noise factor would be 0.8-0.68 = 0.12 events/100 patient years, and that amount of noise would be masking rofecoxib's intrinsic harmful cardiovascular effect to that order of magnitude (0.12 events/100 patient years). However, if random chance events actually affected the rofecoxib group to an even greater degree (in the same direction) than they affected the placebo group, then we have underestimated rofecoxib's intrinsic harmful cardiovascular effect by an even larger order of magnitude. By contrast, if random chance events actually caused the rofecoxib patients' expected event rate to be higher than average (eg. event rate value of 1.0 event/100 patient years instead of an expected average value of 0.8 events/100 patient years), then the magnitude of the noise factor is 0.20 events/100 patient years in the opposite direction (the noise factor disguises the fact that rofecoxib actually decreased the adverse cardiovascular event rate by that order of magnitude). However, all these theoretical possibilities are not knowable; they are simply guesstimations!

I think that the only noise factor that is accurately knowable in this hypothetical situation is the 0.12 events/100 patient years noise factor in the placebo group, which is due to the fact that random chance events caused the placebo group to have a lower event rate than the expected average value of 0.8 events/100 patient years, and this noise factor inflates rofecoxib's "apparent" harmful effect by that order of magnitude.

For the last 18 months of the APPROVe trial

If the control event rate for placebo patients should be 0.8 APTC events/ 100 patient years on average, and the placebo group's event rate in the APPROVE trial was 0.38 events/100 patient years, then I would conclude that there was an absolute 0.42 events/100 patient years magnitude noise factor which would artefactually inflate the "apparent" portion of rofecoxib's contribution to the measured signal.

What about the magnitude of the noise factor in the rofecoxib group? I believe that it is unknowable! If one presumes that the rofecoxib group was not affected by random chance events in the last 18 months of the APPROVe trial and that the expected average value for untreated patients of 0.8 events/100 patient years is applicable, then rofecoxib's intrinsic harmful cardiovascular effect is 1.42 - 0.8 = 0.62 events/100 patient years. However, we do not really know whether random chance events also affected the rofecoxib group (as they did the placebo group) and to what degree (the same degree or a different degree). If random chance events also caused the rofecoxib group's expected event rate level to be lower than average and to exactly the same extent as the placebo group's (eg. 0.38 events/100 patient years instead of an expected value of 0.8 events/100 patient years), then the magnitude of the noise factor would be 0.8-0.38 = 0.42 events/100 patient years, and that amount of noise would be masking rofecoxib's intrinsic harmful cardiovascular effect to that order of magnitude (0.42 events/100 patient years). In other words, one would be underestimating the extent of rofecoxib's true intrinsic harmful cardiovascular effect by that order of magnitude. However, if random chance events actually affected the rofecoxib group to an even greater degree (in the same direction) than they affected the placebo group, then we have underestimated rofecoxib's intrinsic harmful cardiovascular effect by an even larger order of magnitude. By contrast, if random chance events actually caused the rofecoxib patients' expected event rate to be higher than an expected average value (eg. event rate value of 1.0 event/100 patient years instead of an expected average value of 0.8 events/100 patient years), then the magnitude of the noise factor is 0.20 events/100 patient years in the opposite direction, and one would be overestimating rofecoxib's intrinsic harmful cardiovascular effect by that order of magnitude. However, as I stated previously, all these theoretical possibilities are not knowable; they are simply guesstimations!

I think that the only noise factor that is accurately knowable in this hypothetical situation is the 0.42 events/100 patient years noise factor in the placebo group, which is due to the fact that random chance events caused the placebo group to have a lower event rate than the expected average value of 0.8 events/100 patient years, and this noise factor inflates rofecoxib's "apparent" harmful effect by that order of magnitude.
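The arithmetic for the two halves of the trial can be laid out in a few lines. This is only a sketch of my hypothetical calculation: the expected control rate of 0.8 events/100 patient years is the arbitrary figure used above, and the function and variable names are my own illustrative choices.

```python
# Assumption: the arbitrary "average" control event rate used in the text.
EXPECTED_RATE = 0.8  # APTC events per 100 patient-years

# Observed APTC event rates per 100 patient-years, as listed in the footnote.
observed = {
    "first 18 months": {"placebo": 0.68, "rofecoxib": 0.84},
    "last 18 months": {"placebo": 0.38, "rofecoxib": 1.42},
}


def decompose(placebo, rofecoxib, expected=EXPECTED_RATE):
    """Split the apparent rate difference into a placebo noise part and a remainder."""
    return {
        # What the trial reports: rofecoxib rate minus observed placebo rate.
        "apparent_difference": round(rofecoxib - placebo, 2),
        # How far chance pushed the placebo group below the expected rate.
        "placebo_noise": round(expected - placebo, 2),
        # Remainder, valid only if the rofecoxib group itself was unaffected
        # by chance (which, as argued above, is unknowable).
        "excess_over_expected": round(rofecoxib - expected, 2),
    }


results = {period: decompose(**rates) for period, rates in observed.items()}

for period, parts in results.items():
    print(period, parts)
```

In each period the apparent difference equals the placebo noise plus the excess over the expected rate, so the lower the placebo rate falls by chance, the larger the share of the apparent signal that is noise.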

Note that I have arbitrarily utilised an "average" event rate of 0.8 events/100 patient years as representing the "average" control event rate. My basic conclusion would be unchanged if the "average" control event rate was actually 0.5, 0.6, or 1.0 events/100 patient years -- that it is impossible to determine the true intrinsic harmful cardiovascular effect of a COX-2 inhibitor if one cannot accurately measure the noise level due to random chance events, which are an unavoidable consequence of performing short duration, small sample-sized RCTs in low risk patients who have a low, unpredictable, control event rate. The fundamental presumption underpinning a fair comparison between the control group and the treatment group in a randomised controlled trial is that chance events should affect the control and treatment groups to the same degree. However, what is the likelihood of that happening in a COX-2 inhibitor RCT that has a small sample size and a low, unpredictable, control event rate?


What the APPROVe trial mainly demonstrates is that chance plays a large confounding role in COX-2 inhibitor trials when the control event rate is low (control event rate of 0.4-1.0 events/100 patient years) and that the year-by-year event rate can vary markedly (3x greater number of events in the first 18 months compared to the second 18 months of the trial). If the APPROVE trial demonstrated that chance could affect the placebo group to that degree, then there is no intrinsic reason why chance could not affect the rofecoxib group to the same (or a different) degree. Awareness of that fact allows one to view the APPROVe trial's Kaplan Meier curves in a totally different light.

I have previously argued that if one takes a group of healthy middle-aged/elderly patients who have a low risk of cardiovascular disease (expected "average" yearly adverse cardiovascular event rate in the range of 0.8-1.0% per year), chance alone could cause a large variation in yearly adverse cardiovascular event rates from year-to-year. What actually happened to the placebo patients in "real life" in the APPROVe trial seems to prove my point! Therefore, there is no reason to believe that the rofecoxib patients could not be affected by chance to the same (or a different) degree. How can one estimate what percentage of the rofecoxib patients' yearly adverse cardiovascular event rate is due to noise (random chance events) and how much is due to the drug's intrinsic harmful effect?

Here is a detailed table demonstrating the adverse cardiovascular event rate for the APPROVe trial as a function of time (from reference [6]).



 

First of all, note how the adverse cardiovascular event rate in the placebo group decreases dramatically from 0.57 events/100 patient years to 0.39 events/100 patient years to 0.19 events/100 patient years in the 18-24 month, 24-30 month and >30 month time periods respectively. During those same respective time periods, the adverse cardiovascular event rate increases less dramatically in the rofecoxib group from 1.46 events/100 patient years to 1.32 events/100 patient years to 2.36 events/100 patient years. We know that the changing event rate values in the placebo group are due to chance, and that the marked decrease in placebo event rates drives the RR value to much higher levels. How much of the increase in event rates in the rofecoxib group (that occurs during the same time period) is due to chance and how much is due to rofecoxib's intrinsic harmful cardiovascular effect?

I personally suspect that it is not possible to separate-out the part of rofecoxib's increased event rate that is due to chance, and I therefore think that the part that remains, and which is due to rofecoxib's intrinsic harmful cardiovascular effects, is therefore unknowable. I think that the burden of proof lies with the trialists. They have to separate-out the part of rofecoxib's signal that is due to chance from the part that is due to rofecoxib's intrinsic harmful cardiovascular effects if they want to accurately quantify rofecoxib's true harmful cardiovascular effects.

What is particularly interesting is that the adverse cardiovascular event rate increases most dramatically in the rofecoxib group when the drop-out rate is undergoing its greatest change. Note that the APPROVe trial's RR value increases abruptly to a RR value of 12.30 after 30 months (last 6 months of the trial) and that the marked increase in RR is due to the fact that the event rate in rofecoxib patients increased to 2.36 events/100 patient years in the last 6 months of the trial, while the event rate in placebo patients plummeted to 0.19 events/100 patient years during the same time period. How does one explain such a radical change? Could this radical change be related to a differential drop-out phenomenon? Note that the number of patients remaining in the trial plummets during the last 6 months of the trial.

The following table represents the number of patients at risk (percentage of patients who remain in the trial) as a percentage of the original sample that existed at the time of trial inception. I derived these values from the Kaplan Meier graph above.

 

Group              0 months   6 months   12 months   18 months   24 months   30 months   36 months
Placebo group      100%       92%        88%         83%         80%         77%         36%
Rofecoxib group    100%       87%        82%         77%         73%         70%         32%

Note that the number of patients at risk decreases markedly during the last 6 months of the trial. Could the marked change in the RR value in the last 6 months of the trial be due to a differential drop-out phenomenon that left the rofecoxib group with disproportionately more high risk patients than the placebo group? Don't you think that the legitimacy of the RR value of 12.30 for the last 6 months of the trial is dependent on the trial investigators demonstrating that the two groups were still reasonably balanced in terms of prognostic variables during the last 6 months of the trial? The trial's harm signal for rofecoxib only became apparent during the last 18 months of the trial, but approximately 50% of the total number of events in the rofecoxib group occurred in the last 6 months of that 18 month period (11 events occurred in the last 6 months out of a total of 24 events for the last 18 months). During that same time period of the last 6 months of the trial, only 1 placebo patient had an event (out of a total of 6 events for the placebo group for the last 18 months of the trial).

If the effects of a differential drop-out phenomenon could be radically biasing the results of the APPROVe trial, then excluding the results from the last 6 months of the trial may result in an interpretative conclusion that more accurately reflects reality. Don't you think that the burden of proof lies with the trial investigators? Shouldn't they be obliged to prove that the results from the last 6 months of the trial are consonant with reality -- reality being defined as trial results that truly reflect rofecoxib's intrinsic harmful cardiovascular effects, free of any noise effects?

There is another important lesson that can be learned from the APPROVE trial

Consider the adverse cardiovascular event rate that occurred in the APPROVe trial's placebo patients as a function of time.

This table plots the confirmed thrombotic cardiovascular event rate/100 patient years for placebo patients from the APPROVe trial for each 6 month period.

 

Group                      0-6 months   6-12 months   12-18 months   18-24 months   24-30 months   30-36 months
Placebo group              0.80         1.2           1.43           0.57           0.39           0.19
Hypothetical COX-2 group   1.2          1.2           1.2            1.2            1.2            1.2

What this table demonstrates is that if one takes a group of low cardiovascular risk patients and simply follows them for a time period of 36 months, the cardiovascular event rate will vary markedly during that 36 month time period if considered in 6-monthly intervals. Note that the 6-monthly event rate varied from 0.19 to 1.43 events/100 patient years due to the natural chance variation of APTC events in low risk patients. This large natural variation (due to random chance event effects) has huge implications for trialists who run short duration trials.

Consider a hypothetical situation where these placebo patients are enrolled into a COX-2 inhibitor RCT that lasts 6 months. Presume that the COX-2 inhibitor has an adverse cardiovascular event rate of 1.2 events/100 patient years for that 6 month time period. What would the RR be for that COX-2 inhibitor for that 6 month trial? If you look at the above table, you can see that the calculated RR will vary markedly depending on when the trial is actually run. If the trial was run in the 0-6 months time period, then the RR would be >1.0 (1.2 versus 0.80) suggesting that the COX-2 inhibitor is moderately harmful. If the trial had to be delayed 6 months, then the RR result would be 1.0 (1.2 versus 1.2) suggesting that the COX-2 inhibitor would not be associated with any increased cardiovascular risk. If the trial had to be delayed 12 months, then the RR would be <1.0 (1.2 versus 1.43) suggesting that the COX-2 inhibitor is associated with a slightly less than average cardiovascular risk. If the trial had to be delayed 30 months, then the trial would show that the COX-2 inhibitor is extremely harmful because the RR would be >>1.0 (1.2 versus 0.19).
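The RR for each possible 6-month window in this hypothetical example can be calculated directly from the table. The following sketch simply divides the assumed COX-2 rate of 1.2 events/100 patient years by the placebo rate for each window; the variable names are my own, and the constant COX-2 rate is the hypothetical figure from the example, not a measured value.

```python
# Placebo event rates per 100 patient-years for each 6-month window,
# taken from the table above.
placebo_rates = {
    "0-6 months": 0.80,
    "6-12 months": 1.2,
    "12-18 months": 1.43,
    "18-24 months": 0.57,
    "24-30 months": 0.39,
    "30-36 months": 0.19,
}

# Hypothetical assumption: the COX-2 group has the same rate in every window.
cox2_rate = 1.2

# Relative risk if a 6-month trial happened to be run in each window.
relative_risks = {
    window: cox2_rate / rate for window, rate in placebo_rates.items()
}

for window, rr in relative_risks.items():
    print(f"{window:>12}: RR = {rr:.2f}")
```

With an unchanging drug effect, the calculated RR ranges from below 1.0 to above 6.0, purely as a function of which placebo window is used as the comparator.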

This hypothetical example demonstrates that short duration COX-2 inhibitor RCTs may have zero scientific validity when it comes to assessing whether a COX-2 inhibitor is associated with increased cardiovascular risk -- because the signal/noise ratio of the trial would likely be very low if the *noise level is very high (if the placebo group's control event rate is either disproportionately low or disproportionately high due to chance effects). 

(* the noise level would be even higher if the treated patient group also had a disproportionately low or disproportionately high event rate due to chance, and not due to the drug's intrinsic harmful cardiovascular effects, and the random chance events occurred at a different time and to a different degree compared to the control group)

Yet, meta-analysts still meta-analyse short duration RCTs in low risk patients hoping to find a scientifically valid signal that a COX-2 inhibitor drug is associated with an increased cardiovascular risk. I think that the results of their pooled meta-analysis of short duration RCTs are scientifically invalid if they have not taken natural chance variations into full account (by proving that a pooling of random chance events from multiple short duration studies will result in an "average" value that is truly representative of reality).

Consider this table from Konstam's meta-analytical review of rofecoxib trials [7].



 

Note that many of the reviewed trials lasted 4-13 weeks. How can those short duration trials individually have any scientific validity if they have not taken natural chance event variations into full account (as described above)?  How can a meta-analyst prove that pooling results from short duration RCTs corrects for natural chance event variations of a potentially high order of magnitude?

I personally think that one has to run a continuous long duration COX-2 inhibitor RCT lasting approximately 5 years if one hopes to obtain scientifically valid results. Also, the sample size of the RCT would have to be much larger than those currently performed if one wanted to ensure that the trial's signal/noise ratio is sufficiently high.


How to test my noise hypothesis


Critics could argue that my noise theory, which is based on the chance variability of random events in low control event trials, is not scientifically proven, and that the previously provided evidence is not sufficiently solid. I partially agree! I think that the clinical research community, community clinicians, and the public would be well served if my hypothesis is formally tested. I don't think that it would be difficult to test my hypothesis. Here is one easy method of testing my hypothesis.

Study design to test a hypothesis of significant noise due to variability in random chance events in low risk cardiovascular patients:

My hypothesis is that there is a large random chance variability in adverse cardiovascular event rates in low risk patient groups from year-to-year, and between similar groups during the same time period of 1, 2, 3, 4 or 5 years, and that the random degree of variability could be a major source of noise in RCTs that have a low control event rate and a small signal.  

To test my hypothesis, one should randomise 20,000 patients into twenty subgroups of 1,000 patients each, using the APPROVe trial's inclusion/exclusion criteria and randomisation allocation scheme to ensure that each of the twenty subgroups has the same baseline risk characteristics as existed in the APPROVe trial's placebo and rofecoxib groups. Then one should follow the twenty subgroups, all of whom receive placebo, for 5 years and measure the adverse cardiovascular event rate (APTC events/100 patient years) for the total 5-year time period and for each 1-year time period within it. The final 120 measurements (one yearly measurement for each of the twenty subgroups in each of the five years gives 100 measurements, plus twenty measurements for the entire 5-year time period) will allow one to determine i) the yearly adverse cardiovascular event rate for each subgroup and how consistent the results are from year-to-year over a time period of 5 years; ii) the range of random variations from the lowest to the highest values for each year and for the entire 5-year time period; and iii) whether the overall adverse cardiovascular event rate over the 5-year time period is similar for each of the twenty subgroups, and to what degree it can vary due to chance alone.
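A rough sense of how much spread chance alone would produce in such a study can be had from a simulation. The sketch below is only an illustration under stated assumptions: every patient is given the same fixed yearly risk of 0.6% (an illustrative figure, not the APPROVe rate), drop-outs are ignored, each patient at risk at the start of a year is counted as contributing one full patient year, and all function and parameter names are hypothetical. It cannot test the hypothesis itself, which requires real patients, but it shows the minimum variability one would expect from simple chance.

```python
import random


def simulate_subgroups(n_groups=20, n_patients=1000, n_years=5,
                       yearly_risk=0.006, seed=0):
    """Simulate event rates for identical placebo subgroups.

    yearly_risk is an assumed value shared by all patients, so any
    difference between subgroups or years arises from chance alone.
    Rates are expressed as events/100 patient years.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_groups):
        at_risk = n_patients
        yearly_rates = []
        total_events = 0
        total_patient_years = 0
        for _ in range(n_years):
            # Each patient still at risk has an independent chance of an event.
            events = sum(rng.random() < yearly_risk for _ in range(at_risk))
            yearly_rates.append(100.0 * events / at_risk)
            total_events += events
            total_patient_years += at_risk
            at_risk -= events  # patients with an event leave follow-up
        results.append({
            "yearly_rates": yearly_rates,
            "overall_rate": 100.0 * total_events / total_patient_years,
        })
    return results


results = simulate_subgroups()

# Range from lowest to highest subgroup rate, for each year and overall.
for year in range(5):
    rates = [group["yearly_rates"][year] for group in results]
    print(f"year {year + 1}: lowest {min(rates):.2f}, highest {max(rates):.2f}")
overall = [group["overall_rate"] for group in results]
print(f"5 years: lowest {min(overall):.2f}, highest {max(overall):.2f}")
```

Changing the seed gives a different set of 120 values, which is itself a reminder that a single run, like a single trial, is only one draw from the range of possible outcomes.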

If my hypothesis is confirmed, then it will call into question the scientific validity of performing RCTs in low risk patients when the control event rate is low, the signal small, and the potential noise level high.

If the 20,000 patient study is performed, then additional sub-studies should also be performed.

First of all, the investigators should study the pattern-distribution of the baseline prognostic variables in each of the 20 subgroups to see whether they can correlate the pattern-distribution of prognostic variables with the outcome APTC event rate.

Secondly, the investigators should study the baseline prognostic variables to determine whether they can produce a risk prediction tool that will allow a low risk patient to estimate his yearly risk of an adverse cardiovascular event. If an individual patient can roughly estimate his yearly risk of an adverse cardiovascular event, then he will be able to estimate the absolute increase in his risk if taking a COX-2 inhibitor daily for the entire year is associated with an increased cardiovascular risk of a magnitude potentially up to RR 2.0. For example, the risk prediction tool may predict that the individual patient's yearly risk of an APTC event is 0.6% and it will also predict that taking a particular COX-2 inhibitor may possibly increase his yearly risk to 1.2% (positive framing). The computer-based risk prediction tool could also supply the same absolute predicted values in terms of negative framing, by stating that the individual patient's yearly chance of not having an APTC event will change from 99.4% to 98.8% if he takes a COX-2 inhibitor daily for one year. By providing the same information in terms of positive and negative framing, the tool will enable the individual patient to make independent health care decisions that are evidence-based and not tainted by an individual health care provider's personal biases.
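The arithmetic behind the two framings is simple enough to sketch. The function below is a hypothetical illustration, not a validated risk prediction tool: it takes a baseline yearly risk and an assumed relative risk as inputs and reports the same absolute values both ways.

```python
def framed_risk(baseline_yearly_risk, relative_risk):
    """Return yearly risks as percentages, framed both ways.

    baseline_yearly_risk is a proportion (0.006 means 0.6%);
    relative_risk is the assumed RR associated with the drug.
    """
    treated_risk = baseline_yearly_risk * relative_risk
    return {
        "event_without_drug": 100 * baseline_yearly_risk,
        "event_with_drug": 100 * treated_risk,
        "no_event_without_drug": 100 * (1 - baseline_yearly_risk),
        "no_event_with_drug": 100 * (1 - treated_risk),
        "absolute_increase": 100 * (treated_risk - baseline_yearly_risk),
    }


example = framed_risk(0.006, 2.0)
print(f"Risk of an APTC event: {example['event_without_drug']:.1f}% "
      f"-> {example['event_with_drug']:.1f}%")
print(f"Chance of no APTC event: {example['no_event_without_drug']:.1f}% "
      f"-> {example['no_event_with_drug']:.1f}%")
```

With the inputs used in the example above (a 0.6% baseline yearly risk and an RR of 2.0), this reproduces the figures in the text: 0.6% to 1.2% in one framing, and 99.4% to 98.8% in the other.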

Thirdly, the investigators should study the drop-out rate phenomenon to determine whether there is any causal connection between a differential drop-out rate and the outcome event rates. For example, in the APPROVe trial, the drop-out rate was at its highest level in the last 6 months of the trial -- and that is the 6 month time period when the APTC event rate decreased most dramatically in the placebo group (to 0.19 events/100 patient years) and increased most dramatically in the rofecoxib group (to 2.36 events/100 patient years). It is critically important to know to what degree the drop-out phenomenon affects the signal/noise ratio of RCTs if one hopes to legitimise the use of RCTs in low risk patients who have low control event rates. 


Conclusion:
 

James Penston ended his book [8], which analyses large-scale randomised trials, by making the following bold prediction-:

"Large-scale randomised trials are deeply flawed. They deliver nothing but a deep shadow of therapeutic benefit, a pretense of efficacy to fool the unwary and the means by which those with little interest in either the integrity of medical research or genuine improvements in health care inflate their reputations and maximise their profits. Time is likely to show that what is being built on the flimsy foundations of mega-trials is nothing more than a house of cards."

I highly recommend that all clinical researchers read Penston's book, but I personally think that it is still an open question whether Penston's bold predictions apply to all RCTs irrespective of the magnitude of the signal/noise ratio. However, his statements certainly seem to apply to RCTs that are performed in low risk patients who have a low control event rate, a small signal and a high noise level. I think that there are substantial reasons to regard those RCTs as having a low signal/noise ratio, which implies that they cannot provide scientifically valid evidence with a high degree of confidence. I think that the research world and the FDA need to consider this issue very seriously, so that they can think of practical solutions to remedy the situation. They may decide to abandon the use of RCTs as a clinical research tool in low risk patients who have low control event rates if the anticipated signal level is likely to be small. Or, they may decide to institute measures that can potentially strengthen the ability of RCTs to generate high signal/noise ratios (e.g. use higher risk patients who can generate a higher control event rate and a larger signal; ensure that randomisation does in fact result in relatively homogeneous groups at baseline and that the control/treatment groups remain balanced for the entire duration of the trial; ensure that the drop-out rate is minimised; perform appropriate post hoc corrections to adjust for differential drop-out rates that produce significant noise; ensure that the sample size is sufficiently large and the study period sufficiently long in order to minimise the noise due to random chance effects).

I think that health policy decision-makers need scientifically valid evidence in order to make rational health policy decisions regarding the use of COX-2 inhibitors. At present, they are deprived of scientifically valid evidence from RCTs (evidence from RCTs that have a high signal/noise ratio), and I think that this predicament is problematic. I think that a new approach is needed to solve this predicament, and I think that public health officials should start off by recognising the limitations of COX-2 inhibitor RCTs in low risk patients, who have low control event rates, when the anticipated signal is likely to be small.

Jeff Mann, MD

Retired physician.

jmannemg@earthlink.net

Date of first version: March 2005.


References:


1. Bresalier RS, Sandler RS, Quan H, Bolognese JA, Oxenius B, Horgan K, Lines C, Riddell R, Morton D, Lanas A, Konstam MA, Baron JA. Cardiovascular Events Associated with Rofecoxib in a Colorectal Adenoma Chemoprevention Trial. NEJM March 17th 2005. Vol 352. p1092-1102.

2. Scott D. Solomon, M.D., John J.V. McMurray, M.D., Marc A. Pfeffer, M.D., Ph.D., Janet Wittes, Ph.D., Robert Fowler, M.S., Peter Finn, M.D., William F. Anderson, M.D., M.P.H., Ann Zauber, Ph.D., Ernest Hawk, M.D., M.P.H., Monica Bertagnolli, M.D., for the Adenoma Prevention with Celecoxib (APC) Study Investigators. Cardiovascular Risk Associated with Celecoxib in a Clinical Trial for Colorectal Adenoma Prevention. NEJM March 17th 2005. Vol. 352. p1071-1080.

3. Drazen JM. COX-2 Inhibitors. A Lesson in Unexpected Problems. NEJM March 17th 2005. Vol. 352. p1131-1132.

4. Sackett, David L. Why randomized controlled trials fail but needn't: 2. Failure to employ physiological statistics, or the only formula a clinician-trialist is ever likely to need (or understand!) CMAJ 165(9):1226-1237, October 30, 2001.

Available online at http://www.cmaj.ca/cgi/content/full/165/9/1226

5. Michael E Farkouh, Howard Kirshner, Robert A Harrington, Sean Ruland, Freek W A Verheugt, Thomas J Schnitzer, Gerd R Burmester, Eduardo Mysler, Marc C Hochberg, Michael Doherty, Elena Ehrsam, Xavier Gitton, Gerhard Krammer, Bernhard Mellein, Alberto Gimona, Patrice Matchaba, Christopher J Hawkey, James H Chesebro, on behalf of the TARGET Study Group. Comparison of lumiracoxib with naproxen and ibuprofen in the Therapeutic Arthritis Research and Gastrointestinal Event Trial (TARGET), cardiovascular outcomes: randomised controlled trial. Lancet August 21, 2004. Vol 364. p675-84.

6. FDA public website. CDER Meeting Documents. Arthritis Drug Advisory Committee. February 16-18, 2005 Joint Meeting with the Drug Safety and Risk Management Advisory Committee. Available at http://www.fda.gov/ohrms/dockets/ac/cder05.html

7. Marvin A. Konstam, MD; Matthew R. Weir, MD; Alise Reicin, MD; Deborah Shapiro, DrPh; Rhoda S. Sperling, MD; Eliav Barr, MD; Barry J. Gertz, MD, PhD. Cardiovascular thrombotic events in controlled, clinical trials of rofecoxib. Circulation 2001 Nov 6;104 (19) 2280-8.

8. Penston J. Fiction and Fantasy in Medical Research: The Large-scale Randomised Trial.

Available in the USA at http://www.amazon.com/exec/obidos/tg/detail/-/0954463617/qid=1109011767/sr=1-1/ref=sr_1_1/002-0572628-8824038?v=glance&s=books