A Critique of the Reanalysis of the NINDS Trial

 

---------------------------------------------

 

This critical essay is an analysis of the Special Report relating to the NINDS trial that was published in the October 2004 issue of the Stroke journal.

The Special Report is titled "Findings From the Reanalysis of the NINDS Tissue Plasminogen Activator for Acute Ischemic Stroke Treatment Trial" and it was authored by Ingall et al [1]. One of the main purposes of the Special Report was to determine whether the subgroup imbalance (in baseline stroke severity) invalidated the entire NINDS trial. In this critical essay, I am only going to be addressing the specific issue of whether an imbalance in baseline stroke severity invalidated the NINDS trial's results as they were originally interpreted, because I am the person who specifically implied that an imbalance in stroke severity in the NINDS trial invalidated the trial's results [2].  The NINDS Reanalysis Committee, after reanalysing the NINDS trial in depth, concluded that "there was no evidence that the imbalance in the distribution of baseline NIHSS (stroke severity) between the treatment groups had either a statistically or clinically significant effect on the trial's results". I think that the NINDS Reanalysis Committee's conclusion is wrong, and this essay represents my concerted attempt to prove my point.

Because this issue is enormously complex (especially to the uninitiated) I am going to start off by providing some background information on how this contentious issue first arose.

 

Background information:

 

The NINDS trial was a randomised controlled trial (RCT) that tested whether tissue plasminogen activator (tPA) is effective in acute ischemic stroke. The results of the NINDS trial were published in the NEJM in 1995 [3]. The NINDS trial demonstrated that tPA was effective in acute ischemic stroke, and it established that tPA significantly increased the likelihood of a favorable stroke outcome (compared to placebo). I personally became interested in the NINDS trial after reading a discussant's commentary in an online discussion group that questioned the validity of tPA's efficacy in acute ischemic stroke. The discussant claimed that one should not readily accept that tPA is effective in acute ischemic stroke because the NINDS trial was the only tPA-for-stroke RCT that had a positive result. I then decided to personally review the original NEJM report of the NINDS trial and compare it to the other tPA-for-stroke RCTs. I immediately noticed that there was something "different" about the NINDS trial -- its placebo group had a lower rate of favorable stroke outcome compared to all the other tPA-for-stroke RCTs (27% versus 32-34%). I then wondered whether the definitively positive result of the NINDS trial could be due to the fact that its comparison placebo group had a less favorable stroke outcome result compared to the "average" tPA-for-stroke RCT's placebo group, thereby inflating tPA's "apparent" efficacy. However, I couldn't fathom why this would be the case. I discovered a major clue when I read the Marler article published in December 2000 [4]. That article had a table which demonstrated that there was a significant stroke severity imbalance in the 91-180 minute arm of the NINDS trial, and I wondered whether that imbalance could account for the fact that the NINDS trial was the only tPA-for-stroke RCT that had a positive result.
After researching the stroke literature, and becoming slightly better informed about the design and interpretation of stroke trials, I wrote a paper implying that the stroke severity imbalance in the 91-180 minute arm of the NINDS trial could invalidate the entire trial's results because of its effect on the "apparent" efficacy of tPA [2]. My audacious claim was wildly over-the-top, because at that point in time, I was simply guesstimating. I had no access to the trial's raw data, so I could not make any precise numerical estimations.

I made my first serious attempt to estimate to what degree the stroke severity imbalance could have affected the NINDS trial's results when I read Grotta's rapid response letter, which was published in the online version of the BMJ on July 2nd, 2002 [5]. It was the publication of the following table that gave me my first taste of "real" stroke severity subgroup data from the NINDS trial.

Grotta's table.

Patients treated 91 to 180 minutes. Patients with Rankin good outcome (0,1) at three months.

Baseline NIHSS            TPA              Placebo          Relative risk (95% CI)
1-5                       24/29 (83%)      6/7 (86%)        1.0 (0.7,1.4)
6-10                      23/37 (62%)      23/46 (50%)      1.2 (0.8,1.8)
11-15                     10/26 (38%)      5/35 (14%)       2.7 (1.0,6.9)
16-20                     9/33 (27%)       6/33 (18%)       1.5 (0.6,3.7)
>20                       4/28 (14%)       2/46 (4%)        3.3 (0.6,16.8)
All patients              70/153 (46%)     42/167 (25%)     1.8 (1.3,2.5)
>5 (all, excluding 1-5)   46/124 (37%)     36/160 (23%)     1.6 (1.1,2.4)


Using numerical data from that table, I was able to make a more refined guesstimation of the degree to which stroke severity imbalances in the 91-180 minute arm of the NINDS trial could have influenced the NINDS trial's results.

I subsequently made a more precise guesstimation in a rapid response letter to the BMJ on July 14th 2002 [6]. First of all, note that in the above table the absolute risk difference in the rate of favorable stroke outcome (mRS 0,1) for all patients was 21% (46%-25%). I specifically claimed, in my rapid response letter, that this 21% risk difference figure represented the "apparent" efficacy of tPA and that the "true" efficacy of tPA was probably only ~8%. In other words, I was now suggesting that 11% of the "apparent" risk difference was due to the stroke severity imbalance in that particular arm of the NINDS trial and that 2% was due to an anomaly in the placebo NIHSS 11-15 subgroup's results (more about that minor "anomaly" later in this critique). From that point onwards, I never claimed that the stroke severity imbalance issue affected the entire trial's results, and I subsequently only made claims about its effect on the 91-180 minute arm of the NINDS trial.
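The figures taken from Grotta's table can be rechecked directly from its counts. The following is a minimal Python sketch (the data layout and function names are illustrative assumptions, not taken from any of the cited papers) that recomputes the relative risk for each baseline NIHSS stratum and the crude absolute risk difference for all patients.

```python
# Counts from Grotta's table (91-180 minute arm).
# Each entry: ((tPA favorable, tPA total), (placebo favorable, placebo total)).
# The layout and names below are illustrative assumptions.
grotta = {
    "1-5":   ((24, 29), (6, 7)),
    "6-10":  ((23, 37), (23, 46)),
    "11-15": ((10, 26), (5, 35)),
    "16-20": ((9, 33), (6, 33)),
    ">20":   ((4, 28), (2, 46)),
}

def risk(favorable, total):
    """Proportion of patients with a favorable outcome."""
    return favorable / total

def relative_risk(tpa, placebo):
    """Ratio of the favorable-outcome rate on tPA to that on placebo."""
    return risk(*tpa) / risk(*placebo)

# Relative risk within each severity stratum.
stratum_rr = {name: relative_risk(t, p) for name, (t, p) in grotta.items()}

# Crude (all-patient) totals, obtained by summing over the strata.
tpa_all = (sum(t[0] for t, _ in grotta.values()),
           sum(t[1] for t, _ in grotta.values()))
placebo_all = (sum(p[0] for _, p in grotta.values()),
               sum(p[1] for _, p in grotta.values()))

crude_rr = relative_risk(tpa_all, placebo_all)
crude_rd = risk(*tpa_all) - risk(*placebo_all)

for name, rr in stratum_rr.items():
    print(f"NIHSS {name}: RR = {rr:.1f}")
print(f"All patients: RR = {crude_rr:.1f}, risk difference = {crude_rd:.1%}")
```

The unrounded risk difference is 20.6%; the 21% quoted above comes from subtracting the rounded percentages (46% and 25%).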

In October 2003, I received a significant amount of raw data pertaining to the NINDS trial. The provision of the raw data enabled me to analyse the NINDS trial's results in much greater depth, and with much greater precision. I published an in-depth analysis of the NINDS trial in the soapbox section of my personal website two weeks later [7]. My in-depth analysis confirmed my tentative guesstimation -- that the "true" efficacy of tPA (absolute risk difference) for patients treated between 91-180 minutes in the NINDS trial was only ~8-11%, and not 21%. That assertion represented my final opinion on this stroke severity imbalance issue as of late 2003/early 2004.

 

Back to the NINDS Reanalysis Committee's reanalysis of the NINDS trial.



My criticism of the NINDS trial obviously had an "effect", because the NINDS administrator decided to appoint an independent panel (chaired by Prof. O'Fallon), that was given specific instructions to independently reanalyse the NINDS trial's results and report back to the NIH's NINDS division. Part of the NINDS Reanalysis Committee's mission was to determine whether the stroke severity imbalance in the 91-180 minute arm of the NINDS trial significantly invalidated the entire trial. 

Consider how the independent NINDS Reanalysis Committee came to its final conclusion.

The NINDS Reanalysis Committee first confirmed that there was a significant stroke severity imbalance in the 91-180 minute arm of the NINDS trial, as compared to the 0-90 minute arm.

This is table 2 from their paper.

I have highlighted the most important figures. Note that in the 91-180 minute arm of the trial there were 4x as many very mild stroke patients (baseline NIHSS stroke severity score of 0-5) in the tPA group as in the placebo group (81% versus 19%). Note, also, that in the 91-180 minute arm of the trial there were significantly more very severe stroke patients (baseline NIHSS stroke severity score of >20) in the placebo group than in the tPA group (62% versus 38%). These values are very important -- see later discussion.

The NINDS Reanalysis Committee then calculated the OR for each of the five stroke severity subgroups for each stroke outcome scoring system. I have only highlighted the results for the modified Rankin Scale scoring system because i) that is the most commonly used stroke outcome scoring system, and ii) I only had the raw data for that particular stroke outcome scoring system for my personal analysis.

This is table 3 from the Ingall paper.

It is important to note that the OR favors tPA for all the quintile subgroups -- except the baseline NIHSS 0-5 subgroup where the OR is <1.0. Also, keep in mind that there were 4x more of those very mild stroke patients in the tPA group compared to the placebo group in the 91-180 minute arm of the trial, and this imbalance markedly favors the tPA group in a pooled OR analysis. This fact has to be taken into account if one performs a pooled (non-stratified) OR analysis. (see later discussion)

Note that there is a spread of OR values for the four quintile groups (Q2-Q5) and that the OR results vary from 1.5-2.3. The NINDS Reanalysis Committee claims that they performed statistical tests for equal ORs (last two columns) and that "these analyses demonstrate that for each of the 4 outcome measures and the global analysis there was insufficient evidence to declare a difference in treatment effects (ORs) across the 5 quintiles." I don't understand the statistics underlying these calculations, although I am perfectly willing to accept at face value their statistical validity. It seems to me that the NINDS Reanalysis Committee is implying that one can presume that there is no evidence of a significant difference in OR values across the five quintiles, and that tPA is likely to be equally effective across the entire stroke severity range. I certainly do not accept that fact with respect to the Q1 quintile stroke severity subgroup, but I am willing (more-or-less) to accept that it applies to the other four quintile (Q2-5) subgroups.

The NINDS Reanalysis Committee also performed other statistical analyses that, they claim, do not demonstrate that there is any reason to believe that the effect of tPA is clinically different for acute stroke patients with different levels of stroke severity.

OK! So, what does the NINDS Reanalysis Committee finally conclude after performing all these complex statistical analyses?

This is what they stated in their Special Report (see right-hand side of page 2422):

"Thus, this study does not support the presence of a clinically important interaction between baseline NIHSS and t-PA, and the baseline imbalance in NIHSS plays a very minor role in the estimated (my italics) benefit of t-PA. The Committee concludes that there was no evidence that the imbalance in the distribution of baseline NIHSS between the treatment groups had either a statistically or clinically significant effect on the trial results."

I think that the NINDS Reanalysis Committee's conclusion is invalid. Why?

I think that the NINDS Reanalysis Committee is looking at this baseline stroke severity problem from only one perspective (using a stratified OR analysis), which doesn't apply to the quantitative method (pooled OR analysis) used by the NINDS Study Group Investigators when they originally interpreted the trial's results.

What is the difference between a stratified OR analysis and a pooled OR analysis?

The NINDS Reanalysis Committee apparently used a stratified OR analysis to compute their final adjusted OR value. It is my understanding that a stratified OR analysis involves a calculation of the OR for each stroke severity subgroup (five subgroups), followed by an estimation of the "average" OR for the entire trial by dividing the cumulative value of the five individual ORs by 5. The advantage of this technique of estimating the "average" OR value is that it eliminates the problem of an imbalance in patient numbers between the treated and placebo stroke severity subgroups as a potential confounder (the source of my criticism of the pooled OR analysis originally performed by the NINDS trialists). However, a potential problem with this stratified OR technique is that it can be quantitatively inaccurate if tPA is not similarly effective throughout the stroke severity range.
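To make the distinction concrete, the sketch below uses the counts in Grotta's table (not the Committee's quintile data, which are not reproduced in this essay). It computes the pooled OR, the OR within each severity stratum, the simple arithmetic mean of those ORs as described above, and the Mantel-Haenszel OR. The Mantel-Haenszel estimator is included only as one standard way of combining strata; this essay does not establish which estimator the Committee actually used, so that choice is an assumption, as are all function and variable names.

```python
# Each stratum from Grotta's table (91-180 minute arm):
# (tPA favorable, tPA total, placebo favorable, placebo total)
strata = [
    (24, 29, 6, 7),     # NIHSS 1-5
    (23, 37, 23, 46),   # NIHSS 6-10
    (10, 26, 5, 35),    # NIHSS 11-15
    (9, 33, 6, 33),     # NIHSS 16-20
    (4, 28, 2, 46),     # NIHSS >20
]

def cells(t_fav, t_total, p_fav, p_total):
    """Return the 2x2 cells: a, b = tPA favorable/unfavorable; c, d = placebo."""
    return t_fav, t_total - t_fav, p_fav, p_total - p_fav

def odds_ratio(a, b, c, d):
    """Odds of a favorable outcome on tPA divided by the odds on placebo."""
    return (a * d) / (b * c)

# OR within each severity stratum.
stratum_ors = [odds_ratio(*cells(*s)) for s in strata]

# Simple "average" of the stratum ORs (the approach described in the text).
mean_or = sum(stratum_ors) / len(stratum_ors)

# Pooled OR: add up all patients first, ignoring severity.
totals = [sum(s[i] for s in strata) for i in range(4)]
pooled_or = odds_ratio(*cells(*totals))

# Mantel-Haenszel OR: each stratum contributes a*d/n and b*c/n.
mh_num = sum(a * d / (a + b + c + d) for a, b, c, d in (cells(*s) for s in strata))
mh_den = sum(b * c / (a + b + c + d) for a, b, c, d in (cells(*s) for s in strata))
mh_or = mh_num / mh_den

print("stratum ORs:", [round(x, 2) for x in stratum_ors])
print(f"mean of stratum ORs: {mean_or:.2f}")
print(f"pooled OR: {pooled_or:.2f}")
print(f"Mantel-Haenszel OR: {mh_or:.2f}")
```

With these counts the pooled OR is about 2.5, the simple mean of the stratum ORs about 2.3, and the Mantel-Haenszel OR about 2.0. The pooled figure is close to, but not identical with, the 2.4 quoted later in this essay for the same arm; the table alone does not explain that small difference.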

Can one "a priori" assume that tPA should be similarly effective throughout the stroke severity range? I think not -- for sound pathophysiological reasons.  

There are two contrary approaches to interpreting a tPA-for-stroke trial's results. The one approach is to simply interpret a tPA-for-stroke trial's results without harboring an "a priori" belief that a thrombolytic drug should not be similarly effective throughout the stroke severity range. That seems to be the approach adopted by the NINDS Study Group Investigators, who seemed to believe that there is no "a priori" reason to expect tPA not to have been similarly effective throughout the stroke severity range in the NINDS trial. That particular "a priori" expectation allowed the NINDS Study Group Investigators to originally conclude that although tPA-treated patients with a very mild stroke (Q1 subgroup patients) did not do better than untreated Q1 subgroup placebo patients in the NINDS trial (the unadjusted OR was less than 1.0), it is still reasonable to conclude that tPA was similarly effective throughout the stroke severity range, because the unadjusted OR was definitely positive in all the other stroke severity subgroups (Q2-5 subgroups). In other words, the NINDS Study Group Investigators arbitrarily decided to ignore the negative Q1 subgroup results by stating that the NINDS trial was only powered to look at the overall trend of tPA's efficacy for the entire trial, and that it is scientifically invalid to attempt to come to any alternative conclusion by looking at any of the individual stroke severity subgroup results, because the sample size of each individual stroke severity subgroup is too small. I don't dispute their "sample size" argument, but I think that there is a scientifically valid reason not to "a priori" assume that tPA will be uniformly effective throughout the stroke severity range -- prior to examining any tPA-for-stroke trial's results.

We presumably all agree that tPA works by dissolving a clot in an occluded cerebral artery, and that the beneficial effect of tPA therapy is secondary to vessel recanalisation. However, the NINDS trialists did not measure the efficacy of tPA by directly measuring the rate and degree of vessel recanalisation. They used a surrogate endpoint marker -- the rate of favorable stroke outcome at 3 months. Their underlying "a priori" pathophysiological assumption is that there should be, more-or-less, a direct linear relationship between the degree of vessel recanalisation and the likelihood of a favorable stroke outcome at 3 months. However, that arbitrary "a priori" belief has no sound pathophysiological basis with respect to very mild stroke patients. A knowledgeable Bayesian trial-interpreter could reasonably argue that one's prior expectation should be based on "known" stroke pathophysiology, and that there is no reason to expect that there should be a direct linear relationship between the degree of vessel recanalisation and the rate of favorable stroke outcome at 3 months. Other pathophysiological phenomena greatly affect the likelihood of favorable stroke outcome at 3 months (other than the rate/degree of vessel recanalisation) -- the ability of the collateral circulation to maintain viability of the ischemic penumbral tissue while the patient is awaiting spontaneous, or thrombolytic-induced, recanalisation; the rate/degree of vessel re-occlusion following initially successful recanalisation; and the rate/degree of reperfusion injury [9]. In reference number 9, I provided substantial evidence to demonstrate that it is very likely that very mild stroke patients (Q1 subgroup patients) manifest a number of pathophysiological phenomena that limit the size of their stroke at baseline and/or increase their likelihood of a favorable stroke outcome at 3 months.
Compared to moderate-severe stroke patients, very mild stroke patients are more likely to i) have a small clot located distally in a branch vessel rather than a large clot located proximally in a large vessel (e.g. internal carotid artery); ii) have a good collateral circulation that can sustain viability of the ischemic cerebral tissue while the patient is awaiting spontaneous, or thrombolytic-induced, recanalisation; iii) have sufficient collateral circulation to limit the final size of the core infarct zone even if recanalisation is incomplete; iv) have a surprisingly high rate of spontaneous vessel recanalisation even if they are not treated with a thrombolytic agent. The combined effect of all those pathophysiological phenomena explains why so many very mild stroke patients have a favorable stroke outcome at 3 months -- even in the absence of active thrombolytic treatment (>80% of untreated Q1 subgroup patients have a favorable stroke outcome at 3 months). Therefore, I would argue that, from a pathophysiological "expectancy" perspective, it is rational to "a priori" assume that tPA's perceived efficacy (its effect on the clinical outcome measured at 3 months) could be different in very mild stroke patients (Q1 subgroup patients) compared to moderate-severe stroke patients (Q2-4 subgroup patients). I actually think that the NINDS Study Group Investigators already knew this fact, because they pre-specified in their trial's protocol that very mild stroke patients should preferentially not be enrolled in the NINDS trial. Yet, 19% of the stroke patients enrolled in the treatment group of the 91-180 minute arm of the trial had very mild strokes (compared to 4% of placebo patients). Those patients had a roughly 80% probability of having a favorable stroke outcome at 3 months even if they did not get tPA therapy.
Knowing that fact, how can one rationally "a priori" assume that tPA should be as effective in that subgroup of very mild stroke patients (compared to intermediate severity stroke patients)? Also, if very mild stroke patients, who receive tPA therapy, have a favorable stroke outcome at 3 months, how can one know whether the clinical improvement is due to tPA-induced thrombolysis rather than spontaneous thrombolysis (which can occur relatively frequently in stroke patients who have a small clot burden in peripheral branch vessels), and/or whether the clinical improvement is due to an excellent collateral circulation that prevented growth of the core infarct zone into the surrounding ischemic penumbral tissue zone despite incomplete recanalisation? 

In other words, I think that it is quantitatively inaccurate to use the "averaging" technique of estimating the adjusted OR, because it is based on an incorrect belief that tPA should be similarly effective throughout the stroke severity range. However, I concede that the calculation error will be quite small because the OR value for the Q1 subgroup results represents only one-fifth of the total value of the entire trial's results. If the NINDS Reanalysis Committee is implying that this error (due to incorrectly assuming that tPA is as effective for the Q1 subgroup patients as it is for the Q2-5 subgroup patients) is insignificant from an overall perspective, then I wouldn't quibble with that assertion. However, if they imply that "there was no evidence that the imbalance in the distribution of baseline NIHSS between the treatment groups had either a statistically or clinically significant effect on the trial results" as originally interpreted by the NINDS trialists (using a pooled OR analysis technique), then I strongly disagree.

I believe that the "imbalance in the distribution of baseline NIHSS between the treatment groups" had a significant confounding effect on the accurate quantitative assessment of the NINDS trial's 91-180 minutes results as originally interpreted by the NINDS trialists, because they used a pooled analysis technique to determine the final estimated OR. 

What is a pooled analysis technique? In a pooled analysis, the trialists do not separate the results based on baseline stroke severity. They simply compute the rate of favorable stroke outcome results for all the placebo patients and all the treated patients, as performed by Grotta in this table. I think that a pooled analysis OR is only valid if a tPA-for-stroke trial enrolls a homogeneous group of patients from a stroke severity perspective, and I believe that the OR results become inaccurate if the enrolled patient population is very heterogeneous from a stroke severity perspective and there is a marked numerical imbalance in Q1 and Q5 subgroup patients between the treated and placebo groups.

I think that it is important to minimise marked heterogeneity in baseline stroke severity in tPA-for-stroke RCTs that are analysed using a pooled technique, because patients at the extreme ends of the stroke severity spectrum may markedly influence the RCT's final estimated OR results due to an effect that is not due to tPA -- especially if there is a marked distribution imbalance of those patients between the tPA and placebo groups. For example, if there are many more very mild stroke patients (baseline NIHSS stroke severity 0-5) in the tPA group compared to the placebo group, then their disproportional presence increases the final estimated OR of the trial in a way that has nothing to do with any "degree of responsiveness" to tPA. In fact, very mild stroke patients have a ~80% spontaneous cure rate due to the natural course of disease, and their high "cure" rates get chalked up to the tPA side of the final estimated OR equation in a pooled analysis, even though their high favorable stroke outcome rate has nothing to do with tPA's effect. The opposite problem occurs at the other end of the stroke severity spectrum. If there are many more very severe stroke patients (baseline NIHSS stroke severity >20) in the placebo group compared to the tPA group, then their disproportional presence disfavors the placebo side of the equation in a pooled analysis, because they have a very low likelihood of having a favorable stroke outcome (<5% likelihood of a favorable stroke outcome). Their effect on the final estimated OR value of any tPA-for-stroke trial is to make the final estimated OR less than it would otherwise be, and this "passive" dilutional phenomenon has nothing to do with variations in tPA responsiveness across the stroke severity range.
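One way to gauge the size of this passive effect is direct standardisation: take the favorable-outcome rate observed in each severity stratum, then weight both arms by the same severity distribution. The sketch below applies that textbook technique to the counts in Grotta's table. It is not the analysis performed by the NINDS trialists, by the Reanalysis Committee, or in reference [7]; the choice of weights (the combined tPA plus placebo numbers in each stratum) and all names are assumptions made for illustration.

```python
# Grotta's table, 91-180 minute arm:
# (tPA favorable, tPA total, placebo favorable, placebo total) per NIHSS stratum
strata = [
    (24, 29, 6, 7),     # 1-5
    (23, 37, 23, 46),   # 6-10
    (10, 26, 5, 35),    # 11-15
    (9, 33, 6, 33),     # 16-20
    (4, 28, 2, 46),     # >20
]

# Crude rates: every patient counted once, severity ignored.
crude_tpa = sum(s[0] for s in strata) / sum(s[1] for s in strata)
crude_placebo = sum(s[2] for s in strata) / sum(s[3] for s in strata)
crude_rd = crude_tpa - crude_placebo

# Standard population (assumption): both arms combined, stratum by stratum.
weights = [s[1] + s[3] for s in strata]
total_weight = sum(weights)

# Standardised rates: the stratum-specific rates of each arm applied to the
# same severity distribution, so a different case mix can no longer contribute.
std_tpa = sum(w * s[0] / s[1] for w, s in zip(weights, strata)) / total_weight
std_placebo = sum(w * s[2] / s[3] for w, s in zip(weights, strata)) / total_weight
standardised_rd = std_tpa - std_placebo

print(f"crude risk difference:        {crude_rd:.1%}")
print(f"standardised risk difference: {standardised_rd:.1%}")
```

With these weights the risk difference falls from about 21 to about 12 percentage points. A different choice of standard population would give somewhat different numbers, and the small stratum sizes make any such estimate imprecise.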

An important clue that suggests that the NINDS Reanalysis Committee didn't think of the influence of these "non-tPA" effects on a pooled OR analysis estimation is their statement on the left-hand side of page 2421. The NINDS Reanalysis Committee first mention that 72% of Q1 patients were randomised to tPA treatment versus only 28% randomised to placebo treatment. That situation favors the tPA side of the equation because Q1 patients have a >80% likelihood of having a favorable stroke outcome that is due to the natural course of disease, and not due to tPA, and this bias-effect will increase the final estimated OR in favor of tPA. They then state: "This imbalance was compensated (my italics) in Q2 and Q5 patients where the percentage of patients in the tPA and placebo groups were 45% and 55% respectively". Think of that statement very carefully. If there are proportionately more Q5 patients in the placebo group, how does it compensate for the imbalance in the Q1 group? By using the word "compensate" one presumes that the NINDS Reanalysis Committee means that its effect on the final estimated OR value will be in the opposite direction to the Q1 imbalance effect, and it should thus disfavor the tPA group. However, this is not true! A disproportionately greater number of Q5 patients in the placebo group favors the tPA group, because those patients have a very low favorable stroke outcome rate (<5%), and their disproportional presence in placebo patients favors the tPA side of the equation in the final estimated OR equation by decreasing the number of patients in the placebo group who could possibly have a favorable stroke outcome. In other words, by "passively" favoring the tPA side in the final estimated OR equation, its effect is a compounding (additive) effect, and not a compensatory (cancelling) effect in a pooled analysis.
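The direction of the two imbalances can be checked with a deliberately artificial calculation. The sketch below assumes a drug with no effect at all, so that treated and placebo patients share the same favorable-outcome rate within each severity stratum (the placebo rates from Grotta's table are used for both arms), and then varies only the allocation of patients. The baseline of 30 patients per stratum per arm is a hypothetical choice; the imbalanced numbers are those of the 91-180 minute arm in Grotta's table. Expected (fractional) counts are used rather than whole patients.

```python
# Favorable-outcome rate per NIHSS stratum (1-5, 6-10, 11-15, 16-20, >20),
# taken from the placebo column of Grotta's table and, by assumption,
# applied to BOTH arms (i.e. a drug with zero effect).
rates = [6 / 7, 23 / 46, 5 / 35, 6 / 33, 2 / 46]

def pooled_or(n_treated, n_placebo):
    """Pooled OR from expected counts when both arms share the same rates."""
    good_t = sum(n * r for n, r in zip(n_treated, rates))
    good_p = sum(n * r for n, r in zip(n_placebo, rates))
    bad_t = sum(n_treated) - good_t
    bad_p = sum(n_placebo) - good_p
    return (good_t * bad_p) / (bad_t * good_p)

balanced = [30, 30, 30, 30, 30]  # hypothetical balanced allocation

or_balanced = pooled_or(balanced, balanced)
# Only the mildest stratum imbalanced (29 treated versus 7 placebo).
or_q1_only = pooled_or([29, 30, 30, 30, 30], [7, 30, 30, 30, 30])
# Only the most severe stratum imbalanced (28 treated versus 46 placebo).
or_q5_only = pooled_or([30, 30, 30, 30, 28], [30, 30, 30, 30, 46])
# Both imbalances together.
or_both = pooled_or([29, 30, 30, 30, 28], [7, 30, 30, 30, 46])
# The full allocation of the 91-180 minute arm.
or_ninds_allocation = pooled_or([29, 37, 26, 33, 28], [7, 46, 35, 33, 46])

for label, value in [
    ("balanced", or_balanced),
    ("mild stratum imbalanced", or_q1_only),
    ("severe stratum imbalanced", or_q5_only),
    ("both imbalanced", or_both),
    ("91-180 minute allocation", or_ninds_allocation),
]:
    print(f"{label}: pooled OR = {value:.2f}")
```

Under these assumptions each imbalance on its own pushes the pooled OR above 1.0, and the two together push it further, which is the compounding behaviour described above. None of this says anything about the real effect of tPA; it only isolates what allocation alone can do to a pooled OR.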

This is a critical key issue that clinicians have to understand when interpreting RCTs using a pooled OR analysis!

I am now going to go to extraordinary lengths to make this critical key issue understandable for the "average" clinician, who, like me, may be statistically challenged. I will use two techniques. I will start off with a highly instructive parable, that will dramatically, and clearly, illustrate the important fact that the final estimated OR value for an entire tPA-for-stroke RCT (using a pooled analysis technique) is very dependent on the degree of stroke severity imbalance between treated and placebo patients (degree of heterogeneity in baseline stroke severity). I will then provide a detailed example-presentation that demonstrates that calculated OR values are only truly reflective of a drug's "true" efficacy when a trial recruits a homogeneous group of patients, and that the OR value becomes far less accurately reflective of the "true" efficacy of a drug as the trial's patient population becomes more-and-more heterogeneous from a disease severity perspective.

I will start off with this imaginary parable.


Once upon a time, in the latter half of the first decade of the 21st century, a drug manufacturer developed a new thrombolytic drug to treat acute ischemic stroke patients, and he was confident that his thrombolytic drug would be better than the standard-of-care thrombolytic agent (tPA). He called the new thrombolytic drug "agent X" and he was confident, based on the positive results of phase I and II trials, that the drug would cure 50% of highly responsive acute ischemic stroke patients if the patients were treated between 91-180 minutes. What did he mean by the term "highly responsive"? He regarded "highly responsive" stroke patients as being stroke patients with a baseline NIHSS score between 6-20. He knew that very mild stroke patients (baseline NIHSS stroke severity score 0-5) would not likely be responsive to his drug, because i) they had a ~80% chance of a spontaneous favorable stroke outcome result due to the natural course of the disease, and ii) the NINDS trial had already demonstrated that tPA has no therapeutic effect in that stroke severity subgroup (see the Q1 results in table 3 which demonstrates that tPA patients did not have a higher favorable stroke outcome rate than placebo patients). He also couldn't understand why a clinician would want to use a thrombolytic drug in very mild stroke patients if there was a small risk of a symptomatic thrombolytic-induced intracranial hemorrhage, but no prospect of a therapeutic benefit (harm:benefit ratio >1.0). He would therefore have preferred not to recruit very mild stroke patients in his RCT testing of "agent X". He would also have preferred not to recruit very severe stroke patients (baseline NIHSS stroke severity score >20) into his RCT because he didn't expect them to manifest a significant absolute benefit in response to agent X. He expected agent X to have the same efficacy as tPA for very severe stroke patients -- an OR of approximately 2.0, but an absolute benefit of only ~4%.
He couldn't quite understand why clinicians would want to use a thrombolytic agent in a very severe stroke patient if the predicted harm:benefit ratio was likely to be greater than 1.0. It is important to remember that although the "average" ICH rate is roughly 6% for an entire group of stroke patients, the risk of an ICH may often be significantly greater than 6% for very severe stroke patients, and the potential harm could possibly exceed the potential therapeutic value of tPA therapy (absolute anticipated benefit of ~4%) thus resulting in a harm:benefit ratio >1.0.

Despite having those personal preferences/biases, the drug manufacturer had no choice but to enroll both very mild and very severe stroke patients into his RCT. Why? The underlying reason relates to the fact that the self-appointed "tPA-for-stroke expert opinion leaders" had convinced the stroke interventionalist community that tPA was equally effective for all stroke severity subgroups. If the drug manufacturer only enrolled moderate and moderately severe stroke patients into his RCT and demonstrated that agent X was more efficacious than tPA, community clinicians would probably not preferentially use agent X, because they would perceive that agent X was only efficacious for a certain subgroup of stroke patients, and not all acute ischemic stroke patients. So, he resigned himself to enrolling patients from all five stroke severity subgroups into his RCT. He also decided to enroll an equal number of patients into each stroke severity subgroup and ensure that there was a perfect numerical balance between treated and placebo patients, in order to offset the possibility that an amateur trial interpreter would accuse him of running an unbalanced trial.

These were the results of his RCT of agent X for acute ischemic stroke.     

 
Patients treated 91 to 180 minutes. Patients with favorable stroke outcome (mRS 0,1) at three months.

Baseline NIHSS    Agent X                 Placebo           Odds ratio
0-5               83/100 (83%)            83/100 (83%)
6-10              60/100 (60%) [62%]      50/100 (50%)
11-15             50/100 (50%) [38%]      30/100 (30%)
16-20             40/100 (40%) [27%]      18/100 (18%)
>20               14/100 (14%)            4/100 (4%)
All patients      247/500 (49%)           185/500 (37%)     1.66 [2.4]


The drug manufacturer was very happy with the results of his RCT of agent X, because he believed that it demonstrated that agent X was more efficacious than tPA. On what basis did he make that judgement? First of all, the RCT's results were exactly as he predicted -- he predicted that agent X would produce an "average" 50% favorable response rate in highly responsive stroke patients (baseline NIHSS stroke severity score of 6-20). Secondly, when he compared agent X's results in those highly responsive stroke subgroup patients to tPA's results in similar highly responsive subgroup patients in the NINDS trial, agent X outperformed tPA (the NINDS trial's tPA values are shown in square brackets and they come from Grotta's table). He also noted that the results for the baseline NIHSS 0-5 stroke severity subgroup and the NIHSS >20 stroke severity subgroup were nearly identical to the results of the NINDS trial (both treated and placebo patients). He therefore predicted that clinicians would switch over to using agent X, because he perceived that it was the more efficacious thrombolytic drug.

However, no community clinicians used agent X for the treatment of acute ischemic stroke. Why? There are three major reasons.

1) The community clinicians stated that the OR of agent X was only 1.66 for patients treated between 91-180 minutes, while the OR of tPA was 2.4 for patients treated between 91-180 minutes in the NINDS trial (using the same mRS 0,1 stroke outcome scoring system), thereby "supposedly" proving that tPA was more efficacious than agent X. Another "supposed" indicator of this "truth of efficacy" is that the absolute risk difference in the NINDS trial for tPA patients treated between 91-180 minutes was 21% (46%-25%) compared to 12% (49%-37%) for agent X in the above trial.

2) This trial of agent X was only one positive RCT result, and the community clinicians insisted that the drug manufacturer perform another confirmatory positive RCT before they would consider using his drug.

3) The community clinicians ignored the drug manufacturer's subgroup comparison, because they stated that it was not scientifically valid to compare the subgroup results from one trial to that of another trial.

The drug manufacturer knew that he had no choice. He would have to perform another RCT to prove that agent X was better than tPA. He was also much more savvy now from a "street-smarts" perspective -- he realised that he had "stacked the deck" against his own drug by performing a balanced RCT, and he wasn't going to make the same mistake again. The next time he performed an RCT, he would deliberately run an unbalanced trial. In fact, he decided to exactly mimic the NINDS trial in terms of the number of enrolled treated and placebo patients in each stroke severity subgroup.

This is the result of his second RCT of agent X for acute ischemic stroke.

 
Patients with favorable stroke outcome (mRS 0,1) at three months (patients treated 91 to 180 minutes)

Baseline NIHSS    Agent X                Placebo         Odds Ratio
0-5               24/29 (83%)            6/7 (86%)
6-10              22/37 (60%) [60%]      23/46 (50%)
11-15             13/26 (50%) [50%]      5/35 (14%)
16-20             13/33 (40%) [40%]      6/33 (18%)
>20               4/28 (14%)             2/46 (4%)
All Patients      76/153 (50%)           42/167 (25%)    2.9 [2.4]


The drug manufacturer was very happy with the results of his second RCT of agent X for acute ischemic stroke. First of all, it confirmed his strongly held belief that agent X would cure 50% of highly responsive stroke patients. In fact, in the subgroup of highly responsive stroke patients (stroke patients with a baseline NIHSS stroke severity score of 6-20) agent X performed exactly as well as it did in his first RCT (the results in red are from the first agent X RCT). That confirmatory result definitively proved that agent X was better than tPA from the drug manufacturer's personal perspective. But the drug manufacturer knew that he was really playing a marketing game, and he knew that he had won the game by deliberately stacking the deck in favor of his thrombolytic drug. By deliberately running an unbalanced RCT, he had ended up with an OR value of 2.9, which was much better than the OR value of 1.66 obtained in his first trial, and significantly better than tPA's OR from the NINDS trial (OR value in red). He was therefore not surprised when the majority of community clinicians started to use agent X, instead of tPA, for the treatment of acute ischemic stroke.


Let's carefully review the results of the two RCTs of agent X for acute ischemic stroke.  

Why did the final estimated OR for the entire RCT's results (a pooled OR analysis of ALL the patients) change so dramatically if agent X's "true" efficacy in highly responsive stroke patients remained the same in the two trials, and if the overall favorable stroke outcome rate figure for treated patients remained more-or-less the same (50% versus 49%)?

That is the key issue that I was referring to previously!

It is critically important to realise that a trial's final estimated OR measurement in a pooled OR analysis is a comparative estimation, and that its absolute value can be inflated by any phenomenon that increases the favorable response rate in the treated patients and/or decreases the favorable response rate in the placebo patients (even if it has nothing to do with the drug's therapeutic effect).

In the second trial, there were proportionately far more very mild stroke patients (baseline NIHSS stroke severity score of 0-5) in the treated patient group relative to the placebo patient group. This provided an "artefactual" boost in favor of the treated group of patients relative to the placebo patients. In my personal analysis of the NINDS trial [7], I estimated that the "artefactual" boost due to this phenomenon would be about 7% in favor of tPA. The size of the "artefactual" boost is dependent on i) the relative number of stroke patients in the Q1 quintile stroke severity subgroup compared to the other four quintile stroke severity subgroups (for both the treated and placebo groups); and/or ii) the degree of difference in favorable stroke outcome rate between the Q1 stroke severity subgroup and the Q2/3/4 stroke severity subgroups (for both the treated and placebo groups).

The major difference between the first agent X trial and the second agent X trial is the fact that the overall placebo response rate for the entire group of placebo patients decreased from 37% to 25%. That is a huge difference! Why did it occur? First of all, there were far fewer very mild stroke patients (baseline NIHSS stroke severity score of 0-5) recruited into the placebo group in the second trial compared to the first trial (only 7 patients in the second trial compared to 83 patients in the first trial, which works out to 4% of total placebo patients for the second trial versus 20% for the first trial). The lack of recruitment of a large number of Q1 stroke patients, who have a very high likelihood of spontaneously improving, into the placebo arm of the trial decreases the overall favorable response rate for the entire placebo group in the second trial (by a disproportionate "absence" effect). Secondly, there were proportionately far more very severe stroke patients (baseline NIHSS stroke severity score >20) in the placebo group in the second trial compared to the first trial (28% versus 20%). That phenomenon decreases the overall favorable response rate in the placebo group in the second trial, because the disproportionately "extra" number of very severe stroke patients have very little chance of having a favorable stroke outcome, and their added presence decreases the overall favorable response rate for the entire group of placebo patients by a disproportionate "dilution of potential favorable response" phenomenon.

There is one more minor factor that disfavored the placebo group in the second trial. Can you identify it?

Look at the favorable response rate for the placebo subgroup patients, who have a baseline NIHSS stroke severity score of 11-15 in the first and second agent X trials. In the first trial it was 30% and in the second trial it was 14%. What caused the difference? The correct answer -- the drug manufacturer "cooked the books". The drug manufacturer noticed that in the 91-180 minute arm of the NINDS trial, placebo patients with a baseline NIHSS stroke severity score of 11-15 only had a 14% favorable stroke outcome response rate (presumably due to a multiplicity of "chance" events) [7]. The stroke research community didn't question the validity of that "14% value" in the NINDS trial, so the drug manufacturer of agent X decided to use the same 14% value in his formal trial report, even though the "true" value was 28%. He knew that a lower value would help deflate the placebo group's overall favorable response rate and thereby help to boost the final estimated OR value for the entire trial. He didn't expect that people would notice this minor point, because a trial's raw data is not routinely in the public domain. Who routinely checks the raw data of clinical trials other than the trialists?

 I hope that each reader is getting a much clearer idea of where I am coming from, and where I am heading. To make this complex "phenomenon" even more understandable, consider the following example-presentation.

Example-presentation of how to calculate, and interpret, the OR for a clinical trial.

Presume that a trialist is testing a drug A to cure disease Z.

How would the trialist determine the "true" efficacy of drug A? Obviously, he would need to measure how many patients with disease Z are cured following treatment with drug A. However, he also needs to know how many patients with disease Z would get better without treatment (spontaneously cured due to the natural course of the disease), so that he can determine the drug's "true" therapeutic effect by subtracting that spontaneous cure rate amount from the total measured cure rate amount.

Therefore, the drug's therapeutic effect = percentage of patients with disease Z who get better when taking the drug - percentage of patients who get better in the absence of the drug.

For example, if 30% of disease Z patients get better without treatment, and 60% get better with treatment (using drug A), then the absolute benefit is 30%. That difference is often called the risk difference.

Statisticians often use the mathematic expression Odds Ratio (OR) to quantify the therapeutic benefit of a drug. 

This is how it is calculated:

Presume that there are 100 treated patients and 100 untreated patients with disease Z, and that 30% of untreated patients get better without treatment, and 60% of drug A treated patients get better with treatment.
 

                             Number of patients cured    Number of patients not cured
Drug A present (treated)     60 (a)                      40 (b)
Drug A absent (untreated)    30 (c)                      70 (d)


OR = (a/b) divided by (c/d) = ad/cb

The final estimated OR = (60 x 70) / (30 x 40) = 3.5
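The 2x2 calculation above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names odds_ratio and risk_difference are hypothetical labels chosen for this example, and the inputs are the four cell counts a, b, c and d from the table above.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a = treated and cured,   b = treated and not cured,
    c = untreated and cured, d = untreated and not cured.
    """
    return (a * d) / (c * b)


def risk_difference(a, b, c, d):
    """Absolute difference in cure rate (treated minus untreated)."""
    return a / (a + b) - c / (c + d)


# Drug A example: 60/100 treated patients cured, 30/100 untreated patients cured.
print(odds_ratio(60, 40, 30, 70))                  # 3.5
print(round(risk_difference(60, 40, 30, 70), 2))   # 0.3
```

Note that the OR depends only on the four totals in the table, so anything that shifts those totals (not just the drug's effect) will shift the OR.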

Now, if a drug manufacturer uses an OR of 3.5 to claim that the OR measurement estimates the "true" efficacy of drug A in patients with disease Z, I would have no problem with that claim if the patient population was totally homogeneous from the perspective of disease severity and if every patient had a 30% spontaneous cure rate if not treated.

What would happen if 33% of the untreated patient population get better without treatment 20% of the time, 33% get better without treatment 30% of the time, and 33% get better without treatment 40% of the time? How would that disease severity heterogeneity phenomenon affect the OR, presuming that treated patients still always get better 60% of the time?

Trial 1: Presume that there are 300 treated patients and 300 untreated patients, and that there are equal numbers of patients of the three degrees of disease severity (Q1 = slightly less sick than moderately sick patients, Q2 = moderately sick patients, Q3 = slightly more sick than moderately sick patients). Presume that if any Q1/Q2/Q3 patients get treated, they all have a 60% cure rate.
 

Disease Z severity    Number of treated patients cured    Number of untreated patients cured
Q1 patients           60/100 (60%)                        40/100 (40%)
Q2 patients           60/100 (60%)                        30/100 (30%)
Q3 patients           60/100 (60%)                        20/100 (20%)
All patients          180/300 (60%)                       90/300 (30%)

 

                             Number of patients cured    Number of patients not cured
Drug A present (treated)     180 (a)                     120 (b)
Drug A absent (untreated)    90 (c)                      210 (d)


The final estimated OR = (180 x 210) / (90 x 120) = 3.5

One can therefore conclude that slight variations in disease severity do not affect the OR value, and therefore do not affect the OR's utility as a statistical measuring tool when estimating drug A's therapeutic effect in the treatment of disease Z if the patients are moderately sick and potentially highly responsive to drug A.

Now consider performing a trial of drug A when the disease Z patient population is much more heterogeneous in terms of disease severity, and presume that the patient population includes disease Z patients with five degrees of disease severity, and therefore with five degrees of variability in spontaneous cure rates in the absence of treatment. Q1 patients are minimally sick and they get better 90% of the time without treatment; Q2 patients are slightly less than moderately sick and they get better without treatment 40% of the time; Q3 patients are moderately sick and they get better without treatment 30% of the time; Q4 patients are slightly more than moderately sick and they get better without treatment 20% of the time, and Q5 patients are very sick and they get better only 5% of the time whether treated or untreated.

Trial 2: Presume that there are 500 treated and 500 untreated disease Z patients, that there are unequal numbers of Q1 and Q5 patients enrolled in the trial, that those patients are unequally distributed between the treated and placebo groups, and that drug A cures 60% of Q2, Q3 and Q4 patients.
 

Disease Z severity    Number of treated patients cured    Number of untreated patients cured
Q1 patients           157/175 (90%)                       22/25 (90%)
Q2 patients           60/100 (60%)                        40/100 (40%)
Q3 patients           60/100 (60%)                        30/100 (30%)
Q4 patients           60/100 (60%)                        20/100 (20%)
Q5 patients           1/25 (5%)                           9/175 (5%)
All patients          338/500 (67%)                       121/500 (24%)

 

                             Number of patients cured    Number of patients not cured
Drug A present (treated)     338 (a)                     162 (b)
Drug A absent (untreated)    121 (c)                     379 (d)


The final estimated OR = (338 x 379) / (121 x 162) = 6.5

Why is the final estimated OR so much higher if the drug's efficacy hasn't changed?

There are two major factors at play.

i) Note that the overall proportion of treated patients who were "apparently" cured by treatment went up from 60% to 67% -- not because the drug was more efficacious in trial 2, but because 175 Q1 patients, who were so minimally sick that 90% recovered spontaneously, were added to the treatment arm of the trial.

ii) Note that the overall proportion of placebo patients who spontaneously improved went down from 30% to 24% because a disproportionately large number of very sick Q5 patients, who were too sick to respond to drug A, were added to the placebo arm of the trial.

Both factors have nothing to do with drug A's therapeutic effect, but both factors have a compounding (additive) effect on the OR causing the final estimated OR to be significantly higher.

These results suggest that the utility of a final estimated OR (using a pooled OR analysis technique) to accurately determine the "true" efficacy of a drug in a clinical trial may be significantly compromised if there is a marked degree of disease severity heterogeneity and a marked degree of disease severity imbalance between treated and placebo patients.
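The comparison between Trial 1 and Trial 2 can be reproduced with a short Python sketch. It is illustrative only: the function name pooled_odds_ratio and the tuple layout are assumptions made for this example, and the counts are copied from the two tables above.

```python
def pooled_odds_ratio(strata):
    """Pooled (unstratified) OR: add up all strata first, then form one 2x2 table.

    Each stratum is (treated_cured, treated_total, untreated_cured, untreated_total).
    """
    a = sum(s[0] for s in strata)          # treated and cured
    b = sum(s[1] for s in strata) - a      # treated and not cured
    c = sum(s[2] for s in strata)          # untreated and cured
    d = sum(s[3] for s in strata) - c      # untreated and not cured
    return (a * d) / (c * b)


# Trial 1: three balanced severity subgroups (Q1, Q2, Q3).
trial_1 = [(60, 100, 40, 100), (60, 100, 30, 100), (60, 100, 20, 100)]

# Trial 2: five severity subgroups, with Q1 and Q5 unevenly distributed.
trial_2 = [
    (157, 175, 22, 25),    # Q1
    (60, 100, 40, 100),    # Q2
    (60, 100, 30, 100),    # Q3
    (60, 100, 20, 100),    # Q4
    (1, 25, 9, 175),       # Q5
]

print(round(pooled_odds_ratio(trial_1), 2))        # 3.5
print(round(pooled_odds_ratio(trial_2), 2))        # 6.54
print(round(pooled_odds_ratio(trial_2[1:4]), 2))   # 3.5 (Q2, Q3 and Q4 only)
```

Restricting Trial 2 to its Q2, Q3 and Q4 rows returns the pooled OR to 3.5, which shows that the jump to about 6.5 comes entirely from the unevenly distributed Q1 and Q5 patients.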


Concluding remarks

 

I hope that my preceding presentation was sufficiently lucid so as to allow clinician readers to understand my personal viewpoint. My basic viewpoint is that an estimated OR value is only truly reflective of the "true" efficacy of a drug if the trial patient population is relatively homogeneous with respect to disease severity. I believe that the estimated OR becomes less accurately reflective of the "true" efficacy of a drug if the enrolled patient population is very heterogeneous from a disease severity perspective, especially if there is a gross maldistribution of mild/severe disease severity cases between the treated and placebo groups.

Although there was no randomisation bias in the NINDS trial, chance events resulted in a gross imbalance in baseline stroke severity in the 91-180 minute arm of the trial. The repercussion of that imbalance is that the final estimated OR (using a pooled OR analysis technique) for tPA patients treated between 91-180 minutes became significantly over-inflated.

Consider the following simple mind-experiment, which will help you to appreciate to what degree the stroke severity imbalance could have affected the accuracy of the final estimated OR value in the 91-180 minute arm of the NINDS trial. 

These are the results from the 91-180 minute arm of the NINDS trial (copy of Grotta's table). The original numbers are in black. The changed numbers are in red.

 
Patients with Rankin Good Outcome (0,1) at three months (patients treated 91 to 180 minutes)

Baseline NIHSS        TPA              Placebo          Relative Risk (95%CI)
1-5                   24/29 (83%)      6/7 (86%)        1.0 (0.7,1.4)
  (changed value)     [6/7] (83%)
6-10                  23/37 (62%)      23/46 (50%)      1.2 (0.8,1.8)
11-15                 10/26 (38%)      5/35 (14%)       2.7 (1.0,6.9)
16-20                 9/33 (27%)       6/33 (18%)       1.5 (0.6,3.7)
>20                   4/28 (14%)       2/46 (4%)        3.3 (0.6,16.8)
  (changed value)                      [1/28] (4%)
All Patients          70/153 (46%)     42/167 (25%)     1.8 (1.3,2.5)
  (changed values)    52/131 (40%)     41/149 (28%)

Final estimated OR: 2.4 (original numbers); 1.7 (changed numbers)
 
 


I have yellow-highlighted the important patient number values that demonstrate that there was a marked imbalance in patient numbers between the treated and placebo Q1 and Q5 subgroups.

Let's correct those patient numbers (but not change the favorable stroke outcome rate percentages), so that there is no imbalance in the Q1 and Q5 subgroups, and then see what effect it has on the final estimated OR value if those "corrected" values are used instead -- see the numbers in red.

You can see that if one applies a correction that eliminates the baseline stroke severity bias in the Q1 and Q5 subgroups, that the final estimated risk difference decreases from 21% (46%-25%) to 12% (40%-28%) and that the final estimated OR value decreases from 2.4 to 1.7.

(* I am not implying that this means that the "true" efficacy of tPA is less than the "apparent" efficacy of tPA -- which I have previously asserted many times in the past. In fact, I now agree with the NINDS Reanalysis Committee's assertion that tPA may even be perceived to be more efficacious if one eliminates interpretative errors that occur when one relies on a pooled OR analysis technique -- see my reasoning in the How to design and accurately interpret a thrombolytic-for-stroke trial subsection of appendix section)

Is the difference between 2.4 and 1.7 important?

I think so. If the NINDS Study Group investigators had thought along these lines, then they would have noted that the efficacy of tPA was very similar for patients treated between 0-90 minutes and 91-180 minutes, and that there was no evidence to suggest that tPA is significantly more effective if given earlier (<90 minutes) rather than later (91-180 minutes). [* Note that the estimated absolute risk difference for patients treated between 0-90 minutes was 12% and the estimated OR was 1.7 -- using the same mRS 0,1 stroke outcome scoring system. See the appendix for complete details].

If the NINDS Study Group investigators had realised that tPA was similarly effective for stroke patients treated between 91-180 minutes as it was for stroke patients treated between 0-90 minutes, then they would probably not have indulged in an over-adventurous mind-game experiment. Instead, the NINDS Study Group investigators performed an over-adventurous mind-game experiment and they hypothesised that a "re-analysis" of the NINDS study's raw data justified a belief that tPA was much more effective if given earlier rather than later. How much more effective?

This is the graph from the Marler paper [4]. 

Figure 2. Graph of model estimating OR for favorable outcome at 3 months in recombinant tissue-type plasminogen activator (rt-PA) treated patients compared to placebo treated patients by time from stroke onset to treatment (onset-to-treatment time [OTT]) with 95% confidence intervals, adjusting for the baseline NIH Stroke Scale. OR >1 indicates greater odds that rt-PA treated patients will have a favorable outcome at 3 months compared to the placebo treated patients. Range of OTT was 58 to 180 minutes with mean (m) of 119.7 minutes.



Note that the NINDS Study Group investigators hypothesised that tPA is much more effective if given earlier rather than later. Did the NINDS study's raw data support such a hypothesis? The NINDS Reanalysis Committee studied this issue in great depth, although they did not report their findings in their Stroke article. Their detailed analyses and opinions regarding this issue are available in their final report [8] (see pages 63-67).

This is what the NINDS Reanalysis Committee stated with respect to the above graph-: 

"In the aforementioned article19, the NINDS investigators presented a figure suggesting a range of ORs from 4.0 to 1.0 between OTT values of 60 and 180 minutes (see Figure 3). However, almost no patients had an OTT ~60 minutes. Indeed < 10% had OTT values as large as 82 mins, with a similar percent having OTT values between 176 and 180 minutes. According to the figure, the OR corresponding to 82 is <3 whereas the OR corresponding to 180 is > 1. Therefore, their own best estimate of OR differences suggests a less than 3-fold change over a reasonable OTT range, a change that the study has little power to detect."

The NINDS Reanalysis Committee made a number of other important observations, and they finally concluded-:

"In light of these results, the substantially nonlinear nature of the distribution of OTT when considered as a continuous variable, and the idiosyncratic distribution of favorable response rates among the placebo patients, we conclude that the data provided by this study failed to support a conclusion that the effect of t-PA therapy diminishes with increasing values of OTT within the protocol specified 3 hour time limit."

If you agree with my personal analysis of the NINDS trial, are you surprised to discover that the NINDS Reanalysis Committee didn't find evidence to support a conclusion that the effect of tPA diminished with increasing time-to-treatment within the specified 3 hour time limit?

Another issue that is contentious is the NINDS Study Group investigators' claim that tPA is effective throughout the stroke severity range, and that there is no stroke severity subgroup that will not benefit from tPA therapy. This belief is also echoed by the NINDS Reanalysis Committee. On the right-hand side of page 2423 of their Special Report, the NINDS Reanalysis Committee states-: "The Committee was charged with addressing whether eligible stroke patients may not benefit from t-PA given according to the protocol used in the trials. Multiple exploratory analyses performed to address this question did not identify any subgroup of acute ischemic stroke patients who would be more likely either to benefit from or be harmed by receiving t-PA." I cannot understand how they can justify this claim based on the data-analyses that they present. There is no evidence that Q1 subgroup (baseline NIHSS stroke severity 0-5 subgroup) patients benefit from tPA, and the rate of favorable stroke outcome is not greater for tPA patients than placebo patients (in fact, the OR is <1.0 for four-out-of-five stroke outcome measures -- see table 3). If Q1 subgroup patients have no prospect of benefit, but a small risk of harm due to the small risk of an iatrogenic symptomatic ICH, then the harm:benefit ratio must be >1.0. That situation does not apply to the Q2,3,4 subgroups where the harm:benefit ratio is definitely <1.0. The situation for Q5 subgroup patients is more equivocal, because although tPA produces a similar relative benefit in Q5 subgroup patients as compared to Q2-4 subgroup patients, the absolute benefit is small. If tPA increases the absolute rate of favorable stroke outcome by <5%, and the risk of a major iatrogenic symptomatic ICH is >6%, then the harm:benefit ratio could theoretically be deemed to be >1.0.

The central theme of this critical essay is that the final estimated OR (or the final estimated RR) of a tPA-for-stroke trial (using a pooled OR analysis) may not reflect the "true" efficacy of tPA in acute ischemic stroke, because the OR value is a comparative judgement, and the final estimated OR value may be overinflated if the placebo arm of a trial has a disproportionately low overall favorable response rate, because it enrolled a disproportionately large number of very severe stroke patients and/or a disproportionately small number of very mild stroke patients compared to the treatment arm. The 91-180 minute arm of the NINDS trial is a prime example of this type of error, because the placebo group had too few very mild stroke patients and too many very severe stroke patients compared to the treatment group. In the 91-180 minute arm of the NINDS trial, the placebo group's final estimated rate of favorable stroke outcome value was only 25%, and that low absolute value artefactually inflates the final estimated OR value of the 91-180 minute arm of the NINDS trial. Using an "artefactually" low placebo group rate of favorable stroke outcome value of 25% makes any comparative judgement with a tPA-treatment group (that has proportionately fewer very severe stroke patients and proportionately more very mild stroke patients) scientifically invalid, because it overinflates the "apparent" efficacy of tPA when using a pooled OR analysis technique. The stroke research community seems to be obliviously unaware of this important fact! Consider the following prime example of such a comparison error.

In a review paper on tPA-for-stroke [10], the paper's authors (who were prominent stroke researchers) were comparing the results from a number of tPA-for-stroke trials in order to demonstrate that other tPA-for-stroke trials confirmed the fact that tPA is highly effective in acute ischemic stroke. They used the following graphical display to emphasise their point.

Figure 2:  Hacke et al. [10] compared the results of three tPA-for-stroke studies to the NINDS placebo group (* the Cologne study was not a RCT).
 


Note that the authors of the review paper arbitrarily used the NINDS placebo group to make their comparison. Recall that the definition of a favorable stroke outcome is a modified Rankin Scale Score of 0-1. In the above example, the rate of favorable stroke outcome for the placebo group was reported as 20%, while the rate of favorable stroke outcome for the treatment groups was ~40%. That comparison makes it "appear" that tPA is highly efficacious in acute ischemic stroke, because the absolute risk difference "appears" to be 20%. First of all, the 20% figure for the NINDS placebo group is actually a typographical error, and the "correct" value for the NINDS placebo group is 27% for the entire 0-180 minute arm of the NINDS trial, and 25% for the 91-180 minute arm of the NINDS trial. Secondly, the placebo group of the 91-180 minute arm of the NINDS trial was plagued by the fact that it had disproportionately too few very mild stroke patients and disproportionately too many very severe stroke patients (see Grotta's table), thereby decreasing its overall rate of favorable stroke outcome. Consider the overall rate of favorable stroke outcome for the placebo groups from the ECASS ITT and ECASS II tPA-for-stroke trials -- 29% and 36% respectively. The "apparent" efficacy of tPA would "appear" to be far less if those placebo groups were used for comparison purposes in the above example. In fact, if Hacke had used the ECASS II trial's placebo group for comparison purposes, it would have made tPA "appear" to be insignificantly effective, because the "apparent" absolute risk difference would only be 4% (40%-36%)!

I believe that all comparative judgements (like the Hacke example above) are scientifically invalid, because there is no proof that the placebo and tPA groups are equivalent from a stroke severity perspective, and therefore there is no proof that the final estimated OR value (or final estimated absolute risk difference), which is a measure of the "apparent" efficacy of tPA, is actually reflective of the "true" efficacy of tPA. Don't you think that the stroke research community needs to pay much more attention to the issue of "imbalances in baseline stroke severity" in tPA-for-stroke trials if it wants to determine the "true" efficacy of tPA therapy in acute ischemic stroke? I think that stroke researchers should also use a stratified OR analysis when analysing their trial's results to ensure that there is compatibility between their pooled OR analysis and a stratified OR analysis.

(See the module "How to design and accurately interpret a thrombolytic-for-stroke trial" in the appendix section for further details on how I would estimate the "true" efficacy of tPA in acute ischemic stroke)

Finally, I would like to comment on the issue of confounding variables in tPA-for-stroke trials. It is my opinion that stroke researchers have not fully grasped the importance of designing, and interpreting, stroke trials so as to minimise "estimation-of-efficacy" errors. Stroke severity heterogeneity (discussed in this essay) is only the tip of the iceberg when it comes to confounding variables that can hinder the "accurate" interpretation of a stroke trial's results. I think that there are more important confounding variables that the stroke research community needs to consider, and I would encourage interested readers to read my essay "Determining the efficacy of thrombolytic therapy in acute ischemic stroke: An analysis of the recent stroke literature" [9] to learn much more about that important topic.

 

Jeffrey Mann, MD.

Retired physician.

jmannemg@earthlink.net

First draft: early October 2004.

Radically revised draft: late November 2004.
 


Appendix:



1. Table 5.4 from page 13 of the tables pdf file from reference number 8.

 

How to design and accurately interpret a thrombolytic-for-stroke trial

 

It is my belief that the optimum method of designing a thrombolytic-for-stroke trial is to ensure that a minimum number of very mild stroke (bNIHSS0-5) patients and very severe stroke (bNIHSS>20) patients are enrolled in the trial, and to specifically ensure that there is no imbalance in the distribution of those patients between the treated and placebo groups if one uses a pooled (unstratified) OR analysis technique. Also, the sample size of each stroke severity subgroup must be sufficiently large.

In the following hypothetical trial scenarios, I will even presume that agent X has similar efficacy throughout the stroke severity range, and that the unadjusted OR for each stroke severity subgroup is 2.0 -- in order to demonstrate that the "effect" of a baseline stroke severity imbalance in a pooled OR analysis is not dependent on variations in tPA-responsiveness between stroke severity subgroups.

Consider this hypothetical trial of a thrombolytic agent -- agent X -- in patients with acute ischemic stroke.

Hypothetical trial number 1:
 

Patients with favorable stroke outcome (mRS 0,1) at three months (patients treated 91 to 180 minutes)

Baseline NIHSS         Agent X                 Placebo                Odds Ratio
Q1 patients 0-5        907/1000 (91%)          830/1000 (83%)         2.0
Q2 patients 6-10       6666/10000 (67%)        5000/10000 (50%)       2.0
Q3 patients 11-15      4615/10000 (46%)        3000/10000 (30%)       2.0
Q4 patients 16-20      3050/10000 (31%)        1800/10000 (18%)       2.0
Q5 patients >20        77/1000 (7.7%)          40/1000 (4%)           2.0
All Patients           15315/32000 (47.8%)     10670/32000 (33%)      1.83


Note that the total number of Q1 and Q5 subgroup patients is only 10% as large as the total number of Q2,Q3,and Q4 subgroup patients. Note that there are equal numbers of Q1 and Q5 subgroup patients in the treated and placebo groups.

Note that all the stroke severity subgroups have an unadjusted OR of 2.0.

The absolute RD for all the patients is 14.8% and the unadjusted OR is 1.83 using a pooled OR analysis technique. These values apparently represent the "true" efficacy of agent X in acute ischemic stroke because they are similar to the unadjusted RD and OR values for the "highly responsive" patients (Q2,3,4 subgroup patients) -- absolute RD of 14.6% and unadjusted OR of 1.88; and because the unadjusted OR for the entire trial is close to the "theoretical" OR value of 2.0 (which is the OR value obtained when one uses a stratified OR analysis technique and averages the OR values from the five stroke severity subgroups).

Now, consider what happens if one recruits disproportionately too many Q1 and Q5 stroke patients, and also allows there to be an imbalance in baseline stroke severity between the treated and placebo groups to the same degree as existed in the NINDS trial's 91-180 minute cohort. See hypothetical trial number 2.

Hypothetical trial number 2:
 

Patients with favorable stroke outcome (mRS 0,1) at three months (patients treated 91 to 180 minutes)

Baseline NIHSS         Agent X                 Placebo                Odds Ratio
Q1 patients 0-5        2630/2900 (91%)         581/700 (83%)          2.0
Q2 patients 6-10       2466/3700 (67%)         2300/4600 (50%)        2.0
Q3 patients 11-15      1199/2600 (46%)         1050/3500 (30%)        2.0
Q4 patients 16-20      1006/3300 (31%)         600/3300 (18%)         2.0
Q5 patients >20        215/2800 (7.7%)         200/4600 (4%)          2.0
All Patients           7516/15300 (49%)        4731/16700 (28%)       2.44


Note that agent X was equally efficacious in this trial (as compared to trial number 1), and that the percentage of patients who had a favorable stroke outcome in each stroke severity subgroup is identical to trial number 1.

Note that the unadjusted OR is 2.0 for each stroke severity subgroup.

Note that the absolute RD for all the patients is 21% and the unadjusted OR is 2.44 (using a pooled OR analysis), and that the "apparent" efficacy of agent X is greater than the "true" efficacy of agent X as measured in trial number 1 -- an absolute RD of ~14% and an unadjusted OR of 1.83 for all the patients.

The exaggerated "apparent" efficacy of agent X in this trial is due to the imbalance in absolute patient numbers in the Q1 and Q5 subgroups -- specifically note that the placebo group's overall rate of favorable stroke outcome is only 28% (compared to 33% in trial number 1).

In other words, the "apparent" efficacy of agent X changes even though agent X has remained identically efficacious in these two hypothetical trials -- note that the stratified OR value remains unchanged at 2.0. This interpretative distortion results from using a pooled OR analysis technique when there is a marked imbalance in the number of Q1 and Q5 subgroup patients between the treated and placebo groups.
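The contrast between the two analysis techniques can be reproduced with a short Python sketch that uses the counts from hypothetical trial number 2. The simple average of the subgroup ORs follows the "stratified" approach as I have described it in this essay. The Mantel-Haenszel estimate is a standard alternative that I have added purely for comparison, and it is not part of my original argument. All function and variable names are illustrative assumptions. Note that with the counts as tabulated, the subgroup ORs are only approximately 2.0 (the Q5 value comes out somewhat lower, at roughly 1.8), whereas the pooled OR is approximately 2.44.

```python
# Hypothetical trial number 2: (favorable, total) for agent X and placebo
TRIAL_2 = {
    "Q1 0-5":   ((2630, 2900), (581, 700)),
    "Q2 6-10":  ((2466, 3700), (2300, 4600)),
    "Q3 11-15": ((1199, 2600), (1050, 3500)),
    "Q4 16-20": ((1006, 3300), (600, 3300)),
    "Q5 >20":   ((215, 2800), (200, 4600)),
}


def odds(favorable, total):
    """Odds of a favorable outcome: favorable / unfavorable."""
    return favorable / (total - favorable)


def stratum_odds_ratios(strata):
    """Unadjusted OR computed separately within each severity subgroup."""
    return {name: odds(*treated) / odds(*placebo)
            for name, (treated, placebo) in strata.items()}


def pooled_odds_ratio(strata):
    """OR computed after adding all subgroups together (pooled analysis)."""
    fav_t = sum(t[0] for t, _ in strata.values())
    n_t = sum(t[1] for t, _ in strata.values())
    fav_p = sum(p[0] for _, p in strata.values())
    n_p = sum(p[1] for _, p in strata.values())
    return odds(fav_t, n_t) / odds(fav_p, n_p)


def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel OR (added for comparison; not used in the essay)."""
    numerator = denominator = 0.0
    for (fav_t, n_t), (fav_p, n_p) in strata.values():
        n = n_t + n_p
        numerator += fav_t * (n_p - fav_p) / n
        denominator += (n_t - fav_t) * fav_p / n
    return numerator / denominator


per_stratum = stratum_odds_ratios(TRIAL_2)
mean_stratified = sum(per_stratum.values()) / len(per_stratum)
pooled = pooled_odds_ratio(TRIAL_2)
mh = mantel_haenszel_odds_ratio(TRIAL_2)

for name, value in per_stratum.items():
    print(f"{name}: OR = {value:.2f}")
print(f"Average of subgroup ORs: {mean_stratified:.2f}")
print(f"Mantel-Haenszel OR:      {mh:.2f}")
print(f"Pooled (unstratified) OR: {pooled:.2f}")
```

Both stratified summaries stay close to 2.0 while the pooled value rises to about 2.44, which is the distortion described above.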

Note that the placebo response rate is approximately 32% for many tPA-for-stroke RCTs -- ECASS trial 29%, ECASS II trial 36%, ATLANTIS trial 32%.

Here is figure 4 from the pooled analysis of the NINDS, ECASS, ECASS II and ATLANTIS trials [11]

Note the range of favorable stroke outcome (mRS 0,1) at 3 months for placebo patients from the four different time-to-treatment groupings -- 29%, 30%, 33%, 36%. Note that the average rate of a favorable stroke outcome for placebo patients is 32% (if one simply averages the four results for the four different time-to-treatment groupings) and 32.7% (if one averages the results for ALL the patients added together).

Note that for patients treated between 91-180 minutes, there were 315 placebo patients who had an "average" rate of favorable stroke outcome of 30%. However, I know that 167 of those patients came from the NINDS trial and that the "average" rate of favorable stroke outcome for the NINDS placebo patients was 25%. That means that 148 patients must have come from the ECASS, ECASS II and ATLANTIS trials, and their "average" rate of favorable stroke outcome must have been approximately 35.4%. That fact alone could possibly account for the failure of those three tPA-for-stroke trials to demonstrate that tPA was significantly efficacious for patients treated between 91-180 minutes when using a pooled OR analysis technique.
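The back-calculation for the non-NINDS placebo patients can be sketched in a few lines of Python. The inputs are the rounded percentages quoted above, so the result is only approximate: with these rounded inputs it comes out at roughly 35-36%, in line with the figure of about 35.4% that I quote, and any small difference presumably reflects rounding in the published percentages. The variable names are illustrative.

```python
# Placebo patients treated between 91-180 minutes (pooled analysis)
pooled_n, pooled_rate = 315, 0.30   # all four trials combined
ninds_n, ninds_rate = 167, 0.25     # NINDS trial only

# Patients contributed by the ECASS, ECASS II and ATLANTIS trials
other_n = pooled_n - ninds_n

# Expected number of favorable outcomes among those remaining patients
other_favorable = pooled_n * pooled_rate - ninds_n * ninds_rate
other_rate = other_favorable / other_n

print(f"Non-NINDS placebo patients: {other_n}")
print(f"Implied favorable outcome rate: {other_rate:.1%}")
```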

I suspect that it may be possible to demonstrate that those RCTs could have had a positive result if one only examines the results of "highly responsive" stroke patients (Q2, Q3 and Q4 subgroup patients).

Consider the NINDS trial's results from a different perspective.

Look at this graphical display derived from the NINDS Reanalysis Committee's Special Report [1].

To assess whether tPA is effective in acute ischemic stroke, and to quantify that effect, one should look at the results for the Q2, 3, 4 subgroups.

Note that the unadjusted OR is 2.6 for the Q2 subgroup, 2.4 for the Q3 subgroup and 1.6 for the Q4 subgroup. The "average" OR is 2.2 for those three subgroups, and that value quantitatively reflects the efficacy of tPA in a manner that is free of the distortion produced by the presence of Q1 and Q5 subgroup patients. This "average" OR value of 2.2 is actually higher than an "average" OR value of 2.0, which is the average OR for the five quintile subgroups; and I think that the OR value of 2.2 may better reflect the "true" efficacy of tPA in "highly responsive" stroke patients in the NINDS trial.

I think that it would be very fruitful to know what the comparable "average" OR values would be for the ECASS/ECASSII/ATLANTIS trials using the same stratified OR analysis technique.

In conclusion, I think that a pooled analysis technique of determining the OR of tPA therapy in tPA-for-stroke trials produces an unadjusted OR value that may be distorted by the presence of a Q1 and Q5 subgroup patient imbalance, and I believe that trialists should also use a stratified OR analysis technique to analyse their trial's results (especially for the "highly responsive" Q2-4 stroke severity subgroups) so that they can better quantify the likely "true" efficacy of tPA therapy in acute ischemic stroke.

 

Addendum (added during early January 2005):

 

I have decided to add the following commentary because of an interesting paper published in the January 2005 issue of Stroke [12].

The paper was authored by Kent et al. [12] and the authors analysed the pooled data from the NINDS/ECASS II/ATLANTIS trials from a gender perspective. The specific finding that particularly interested me was the fact that the researchers could not demonstrate a clinically significant benefit in tPA-treated male patients who were treated between 0-6 hours (1190 male patients). The tPA-treated male patients had a 38.5% favorable stroke outcome rate at 3 months, while the placebo male patients had a 36.7% favorable stroke outcome rate. Those figures suggest that tPA is not clinically effective in male patients treated between 0-6 hours.
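To put those two percentages on the same scale as the odds ratios discussed earlier in this essay, the following Python sketch converts them into an absolute risk difference and an unadjusted odds ratio. It uses only the two rates quoted above. Because I do not have the exact number of male patients in each arm, no confidence interval is attempted, and the result (an RD of about 1.8 percentage points and an OR of roughly 1.08) should be read as an approximation.

```python
def odds_ratio_from_rates(rate_treated, rate_placebo):
    """Unadjusted odds ratio computed from two outcome proportions."""
    odds_treated = rate_treated / (1 - rate_treated)
    odds_placebo = rate_placebo / (1 - rate_placebo)
    return odds_treated / odds_placebo


# Favorable outcome rates at 3 months for male patients (0-6 hours)
male_treated_rate = 0.385
male_placebo_rate = 0.367

male_rd = male_treated_rate - male_placebo_rate
male_or = odds_ratio_from_rates(male_treated_rate, male_placebo_rate)

print(f"Absolute RD = {male_rd:.1%}, unadjusted OR = {male_or:.2f}")
```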

However, the authors argue that tPA is still effective if given earlier and they produce the following graphical display to present their case.

This graph was produced using logistic regression to correct for confounding variables, and it was presumably based on the same questionable statistical inferences used to generate Marler's time-to-treatment graph.

For males, the graph implies that tPA is only beneficial if given earlier than 270 minutes and that it is potentially harmful if given after 270 minutes. One can also infer from the graph that tPA is marginally effective if given after 150 minutes (adjusted OR <1.5) and that its potential benefit increases to a maximum OR of 2.0 if given much earlier (<90 minutes).

First of all, I regard this graph as being entirely "hypothetical" and not based on a substantial amount of hard data. For patients treated in <180 minutes, nearly all the data comes from the NINDS trial (two thirds of the pooled raw data apparently came from patients treated between 3-6 hours -- personal communication with Kent). Therefore, all the criticisms that the NINDS Reanalysis Committee made with respect to the Marler graph apply to this graph (see pages 63-67 of reference number 8). I personally think that this graph is simply a product of conjectural hypothesising based on questionable statistical techniques!

However, there is a much more important issue -- the fact that the favorable stroke outcome rate in male placebo patients was 36.7% (compared to 27% for the NINDS trial's overall placebo group, and 25% for the 91-180 minute placebo cohort). Why was the placebo male patients' favorable stroke outcome rate so high? Could it be due to the fact that there was a disproportionately large number of Q1 subgroup patients and/or a disproportionately small number of Q5 subgroup patients in the placebo male patient group compared to the treated male patient group? I don't know the answer because the Kent paper did not offer a stratified analysis, and their presented analysis is apparently an unstratified pooled analysis. Therefore, one cannot determine whether tPA was clinically ineffective in male patients, or whether the "apparent" lack of efficacy is due to a distortion secondary to a baseline stroke severity imbalance between the treated and placebo male patients.

This paper, with its unexpected and controversial finding in male patients, demonstrates why it is so important to always present a stratified analysis, as well as an unstratified pooled analysis, when interpreting the raw data from tPA-for-stroke trials.

 

Addendum (added in May 2005):

 

I will be discussing two important issues in this section.

 

Issue number 1:

 

I wrote the following letter to Stroke regarding Kent's paper, and the letter was published in the May 2005 issue of Stroke.

 

To the Editor:

In their pooled analysis article on gender differences in response to tissue plasminogen activator therapy, the authors conclude that tissue plasminogen activator was not significantly effective in male stroke patients (3-month favorable stroke outcome rate of 38.5% for treated patients versus 36.7% for placebo patients; P=0.52).

However, that particular conclusion is only scientifically valid if the male placebo patients had the same baseline likelihood of a spontaneous stroke recovery as the male treated patients. Considering that the male placebo group’s favorable response rate was 36.7%, compared with 25% for the placebo patients in the 91- to 180-minute arm of the NINDS trial, it is questionable whether the treated and placebo male pooled analysis groups were balanced at baseline.

To substantiate the scientific validity of their conclusion, the authors should provide a stratified analysis in addition to a pooled (unstratified) analysis to verify that there was no significant imbalance in baseline stroke severity between the male treated and male placebo patients in their pooled analysis.

 

 

Note that I specifically queried whether the treated and placebo male stroke patients were balanced at baseline in terms of stroke severity.

This is the authors' response letter that appeared in the same issue of the Stroke journal.

 

David M. Kent, MD, MS; Lori Lyn Price, MS; Harry P. Selker, MD, MSPH

Tufts-New England Medical Center, Boston, Mass

Peter Ringleb, MD

University of Heidelberg, Heidelberg, Germany

Michael D Hill, MD

University of Calgary, Calgary, Alberta, Canada

We thank Dr Mann for writing to give us the opportunity to correct a misperception some might have about our study results. Dr. Mann states that we "conclude that tPA was not significantly effective in male stroke patients." This is not correct. While we state that among men "the trend toward benefit in the overall group did not reach statistical significance," when there is a significant treatment-effect interaction (such as the one with symptom onset to treatment time), the absence of an overall effect is not terribly informative. The data are clear that some male patients benefit from thrombolytic therapy (eg, those treated early), but this effect is diluted by those who do not benefit (eg, those treated later in the 6-hour window). This is a very important point, as we would be quite appalled if our results were used to suggest that men should not be given thrombolytic therapy, for example, even within the currently approved 3 hour time window.

From Dr. Mann’s other comments, it appears that he is concerned that there may have been an imbalance in the baseline characteristics of the male patients in the treatment versus the placebo group that biases toward the null (and presumably that this imbalance is present only in males, thus explaining the interaction). Our results are not consistent with this hypothesis since the gender interaction was found both in the unadjusted analysis (which would not control for any imbalance in baseline characteristics) and in the logistic regression adjusted analysis (which does control for potential imbalances, including those in NIHSS score). In any event, results stratified by NIHSS (Table) are consistent with the overall result.

 

Note three very interesting facts from Kent's table.

i) Note that there was no significant imbalance in patient numbers between the treated and placebo patients in each stroke severity subgroup (although I cannot understand why the authors only divided the patients into three stroke severity subgroups rather than the traditional five stroke severity subgroups). By contrast, in the 91-180 minute arm of the NINDS trial, the placebo group had disproportionately fewer very mild stroke patients and disproportiona