Analysis

© 2014 Nature America, Inc. All rights reserved.

Crowdsourced analysis of clinical trial data to predict amyotrophic lateral sclerosis progression

Robert Küffner1,2,19, Neta Zach3,19, Raquel Norel4, Johann Hawe2, David Schoenfeld5,6, Liuxia Wang7, Guang Li7, Lilly Fang8, Lester Mackey9, Orla Hardiman10, Merit Cudkowicz11, Alexander Sherman11, Gokhan Ertaylan12, Moritz Grosse-Wentrup13, Torsten Hothorn14, Jules van Ligtenberg15, Jakob H Macke16, Timm Meyer13, Bernhard Schölkopf13, Linh Tran17, Rubio Vaughan15, Gustavo Stolovitzky4 & Melanie L Leitner3,18

Amyotrophic lateral sclerosis (ALS) is a fatal neurodegenerative disease with substantial heterogeneity in its clinical presentation. This makes diagnosis and effective treatment difficult, so better tools for estimating disease progression are needed. Here, we report results from the DREAM-Phil Bowen ALS Prediction Prize4Life challenge. In this crowdsourcing competition, competitors developed algorithms for the prediction of disease progression of 1,822 ALS patients from standardized, anonymized phase 2/3 clinical trials. The two best algorithms outperformed a method designed by the challenge organizers as well as predictions by ALS clinicians. We estimate that using both winning algorithms in future trial designs could reduce the required number of patients by at least 20%. The DREAM-Phil Bowen ALS Prediction Prize4Life challenge also identified several potential nonstandard predictors of disease progression including uric acid, creatinine and, surprisingly, blood pressure, shedding light on ALS pathobiology. This analysis reveals the potential of a crowdsourcing competition that uses clinical trial data for accelerating ALS research and development.

1Institute of Bioinformatics and Systems Biology, German Research Center for Environmental Health, Munich, Germany. 2Department of Informatics, Ludwig-Maximilians-University, Munich, Germany. 3Prize4Life, Tel Aviv, Israel and Cambridge, Massachusetts, USA. 4IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. 5MGH Biostatistics Center, Massachusetts General Hospital, Boston, Massachusetts, USA. 6Harvard Medical School, Charlestown, Massachusetts, USA. 7Sentrana Inc., Washington, DC, USA. 8Latham & Watkins LLP, Silicon Valley, California, USA. 9Department of Statistics, Stanford University, Stanford, California, USA. 10Department of Neuroscience, Beaumont Hospital and Trinity College Dublin, Dublin, Ireland. 11Neurological Clinical Research Institute, Massachusetts General Hospital, Charlestown, Massachusetts, USA. 12Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch/Alzette, Luxembourg. 13Max Planck Institute for Intelligent Systems, Tübingen, Germany. 14Institute of Social- and Preventive Medicine, University of Zürich, Zürich, Switzerland. 15Orca XL Problem Solvers, Amsterdam, the Netherlands. 16Max Planck Institute for Biological Cybernetics and Bernstein Center for Computational Neuroscience, Tübingen, Germany. 17Berkeley School of Public Health, University of California, Berkeley, California, USA. 18ALS Innovation Hub, Biogen Idec, Cambridge, Massachusetts, USA. 19These authors contributed equally to this work. Correspondence should be addressed to R.K. (robert.kueffner@helmholtz-muenchen.de) or N.Z. (nzach@).

Received 17 January; accepted 23 September; published online 2 November 2014; doi:10.1038/nbt.3051

ALS, also known as Lou Gehrig's disease, is a progressive neurodegenerative disorder affecting upper and lower motor neurons. Symptoms include muscle weakness, paralysis and eventually death, usually within 3 to 5 years from disease onset. Approximately 1 out of 400 people will be diagnosed with, and die of, ALS1,2, and modern medicine is faced with a major challenge in finding an effective treatment2,3. Riluzole (Rilutek) is the only approved medication for ALS, and has a limited effect on survival4.

One substantial obstacle to understanding and developing an effective treatment for ALS is the heterogeneity of the disease course, ranging from under a year to over 10 years. The more heterogeneous the disease, the more difficult it is to predict how a given patient's disease will progress and thereby to demonstrate the effect of a potential therapy, making clinical trials especially challenging. In addition, the uncertainty surrounding prognosis is an enormous burden for patients and their families. A more accurate way to anticipate disease progression, as measured by a clinical scale (ALS Functional Rating Scale: ALSFRS5, or the revised version, ALSFRS-R6), can therefore lead to meaningful improvements in clinical practice and clinical trial management, and increase the likelihood of seeing a future treatment brought to market7,8.

In an effort to address the important issue of variability of ALS disease progression, we took advantage of two tools: a large data set of clinical, longitudinal, patient information and the vast knowledge and new computational approaches obtainable through crowdsourcing.

Pooled clinical trial data sets have proven invaluable for researchers seeking to unravel complex diseases such as multiple sclerosis, Alzheimer's and others9–12. With that in mind, Prize4Life and the Neurological Clinical Research Institute (NCRI) at Massachusetts General Hospital created the Pooled Resource Open-Access ALS Clinical Trials (PRO-ACT, ) platform with funding from the ALS Therapy Alliance and in partnership with the Northeast ALS Consortium. The vision of the PRO-ACT project was to accelerate and enhance translational ALS research by designing and building a data set that would contain the merged data from as many completed ALS clinical trials as possible. Containing >8,600 patients, PRO-ACT was launched as an open access platform for researchers in December 2012.

We turned to crowdsourcing13 to facilitate an unbiased assessment of the performance of diverse methods for prediction and to raise awareness of the research potential of this new data resource. To address the question of the variability in the progression of ALS, a subset of the PRO-ACT data was used before its public launch for an international crowdsourcing initiative, The DREAM-Phil Bowen ALS Prediction Prize4Life. The prize for the challenge was $50,000 to be awarded for the most accurate methods to predict ALS progression. The challenge was developed and run through a collaboration between the Dialogue for Reverse Engineering Assessments and Methods (DREAM) initiative and Prize4Life using the InnoCentive Platform. In this challenge, solvers were asked to use 3 months of individual patient level clinical trial information to predict that patient's disease progression over the subsequent 9 months.

Figure 1 panel label: n = 1,822 patients, shown in subsets of 279, 918 and 625 patients.

The challenge resulted in the submission of 37 unique algorithms from which two winning entries were identified. Overall, the best-performing algorithms predicted disease progression better than both a baseline model and clinicians using the same data. Clinical trial modeling indicates that using the algorithms should enable a substantial reduction in the size of a clinical trial required to demonstrate a drug effect. Finally, the challenge uncovered several clinical measurements formerly unknown to be predictive of disease progression, which may shed new light on the biology of ALS.

RESULTS

Challenge design and participation statistics

As part of the 7th DREAM initiative, the DREAM-Phil Bowen ALS Prediction Prize4Life (referred to henceforth as the ALS Prediction Prize) solicited computational approaches for the assessment of the progression of ALS patients using clinical trial data from the PRO-ACT data set (Online Methods and Supplementary Note 1). The challenge offered a $50,000 award for the most reliable and predictive solutions.

Solvers were asked to use 3 months of ALS clinical trial information (months 0–3) to predict the future progression of the disease (months 3–12). The progression of the disease was assessed by the slope of change in ALSFRS (a functional scale that ranges between 0 and 40). The solvers were given 12 months of longitudinal data for the development and training of algorithms, and were asked to submit their algorithm to be evaluated on a separate data set not available for the development or training of the algorithms.
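
As a minimal sketch of the prediction target described above, the slope of change in ALSFRS can be estimated by an ordinary least-squares line through a patient's scores between months 3 and 12. The function name `alsfrs_slope`, the `(month, score)` visit format and the example scores are assumptions made for illustration; the slope definition actually used in the challenge may differ.

```python
def alsfrs_slope(visits, start_month=3.0, end_month=12.0):
    """Least-squares slope (points/month) of ALSFRS within a time window.

    `visits` is a list of (month, alsfrs_score) pairs. This format is an
    assumption for illustration, not the PRO-ACT schema.
    """
    window = [(t, y) for t, y in visits if start_month <= t <= end_month]
    if len(window) < 2:
        raise ValueError("need at least two visits in the window")
    n = len(window)
    mean_t = sum(t for t, _ in window) / n
    mean_y = sum(y for _, y in window) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in window)
    if sxx == 0:
        raise ValueError("visits must fall on at least two distinct dates")
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in window)
    return sxy / sxx


# Hypothetical patient whose score declines by 0.5 points per month.
example_visits = [(0, 38.0), (3, 36.5), (6, 35.0), (9, 33.5), (12, 32.0)]
example_slope = alsfrs_slope(example_visits)
```

Only visits inside the window contribute, so the month 0 visit in the example is ignored, mirroring the separation between the 3 months of input data and the 9 months being predicted.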

For evaluation, algorithms were run by InnoCentive on the InnoCentive servers. Algorithms were fed data from the first 3 months of a given patient's participation in a clinical trial. Data from the subsequent 9 months were not supplied. The performance results against a test data set were presented on a leaderboard. Finally, participants were assessed on a third, fully blinded and previously unseen validation set to prevent overfitting. The prize was awarded according to performance on this validation set (Fig. 1).

The challenge lasted 3 months, from July 13 to October 15, 2012. It drew 1,073 registrants from >60 countries. Following the challenge, a survey of registrants was conducted. The survey revealed a diverse audience comprising academic (58%) and industry (30%) professionals as well as others (12%). Notably, 80% of the solvers had almost no familiarity with ALS. No fewer than 93% expressed interest in participating in a future challenge (comprehensive survey results appear in the Supplementary Note 2 and Supplementary Tables 1 and 2). However, as is typical for crowdsourcing challenges, only a small fraction of the registrants submitted an algorithm for testing. During the challenge a total of 37 teams submitted an algorithm to be tested through the leaderboard and 10 teams made valid final submissions. In order to be valid, the submitted R14 code was required to be executable within InnoCentive's system and to predict ALSFRS slopes for all given patients.

Method performance and assessment

We evaluated and compared the ten final submissions provided by the solvers, as well as an eleventh method designed by the challenge organizers. The latter method is referred to as the baseline method, as it was used to establish the baseline performance that best-performing teams would need to outperform. The solvers' methods and the baseline algorithm are described in Supplementary Note 3, Supplementary Figure 1 and Supplementary Tables 2–8; the full sets of predictions are provided in Supplementary Predictions. As a separate archive (Supplementary Software), we provide the source code of six teams that may be used in compliance with the algorithms' own copyright statements. Method performance was assessed by root mean square deviation (r.m.s. deviation) and Pearson's correlation (PC) to compare predictions of the ALSFRS slope against the actual slope derived from the data. Although the r.m.s. deviation can be directly interpreted as estimation error in units of ALSFRS, PC is useful to assess the correct prediction of trends. For visual inspection of the performance see Supplementary Results 1 and Supplementary Figures 2 and 3.
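
The two performance measures named above can be sketched as follows. The function names and the example slopes are hypothetical, and returning 0 for a constant input is a convention chosen for this sketch, because Pearson's correlation is formally undefined when one input has zero variance.

```python
import math


def rms_deviation(predicted, actual):
    """Root mean square deviation between predicted and actual slopes."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_correlation(predicted, actual):
    """Pearson's correlation; returns 0.0 when either input is constant."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    if min(predicted) == max(predicted) or min(actual) == max(actual):
        return 0.0  # convention of this sketch; formally undefined
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


# Hypothetical slopes in ALSFRS points per month.
actual_slopes = [-0.2, -0.6, -1.0, -1.4]
predicted_slopes = [-0.4, -0.6, -0.8, -1.2]
example_rmsd = rms_deviation(predicted_slopes, actual_slopes)
example_pc = pearson_correlation(predicted_slopes, actual_slopes)
```

The r.m.s. deviation is in the same units as the slope, whereas PC only rewards getting the ordering and trend right, which is why the two measures can rank a method differently.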

Figure 1 Challenge outline. (a) The data for the challenge comprised 1,822 patients from ALS clinical trials from the PRO-ACT data set. Data types included demographics, clinical and family history information, laboratory tests and vital signs. (b) We divided the data into three subsets: training data provided to solvers in full, leaderboard and validation data reserved for the scoring of the challenge. Leaderboard and validation data were only available to the challenge managers for the testing of the algorithms submitted by the solvers. Algorithms were fed with data from the first 3 months to perform predictions, and evaluated based on the subsequent 9 months of data. (c) At the end of the challenge, solvers submitted their algorithms to be tested by the challenge organizers on the validation data set. (d) The predictions obtained in c were then assessed by the judges for accuracy.

Figure 1 panel labels: patient records (demographic, vital signs, lab tests, family history, clinical scales); training (months 0–12); leaderboard (months 0–3); validation (months 0–3); solvers' algorithms; slope months 3–12; true slope; blind assessment.

Figure 2 Performance of methods. (a,b) We compared the approaches of the ten teams that submitted executable R code in the final phase of the challenge and a baseline approach designed by the challenge organizers. All the solvers' algorithms had to be compatible with R version 2.13.1. The teams are numbered according to their ranking in r.m.s. deviation performance (a) or Pearson Correlation (b). They are colored blue if they performed better than, and gray if they performed worse than, the baseline. In addition, an aggregate of the predictions of teams 1 and 2 is shown. Whiskers indicate bootstrapped s.d. (inset). The frequency with which methods were ranked first is estimated across different bootstrap samples of patients. Teams 1 and 2 were ranked first in 71% and 26%, respectively, of the bootstrap samples (percentage rounded to the nearest integer). (d) By simulation, we estimated to what extent clinical trials can be reduced in size by each of the participating approaches corresponding to their improved prediction of disease progression.

Figure 2 panel labels: aggregate of teams 1 + 2; team 1, Bayesian trees; teams 2 to 5, random forest; team 6, nonparametric regression; baseline, support vector regression; team 7, prediction of mean; team 8, linear regression; team 9, multivariate regression; team 10, linear regression. Axes: r.m.s. deviation (1/month); correlation; trial size reduction (%).

Solver teams and their methods were assessed based on their r.m.s. deviation scores on the separate validation data set (Fig. 2), which was crucial to guarantee robustness (for differences between validation and leaderboard performance, see Supplementary Results 2, Supplementary Figs. 4–8 and Supplementary Table 9). The top six teams (ranked at positions 1–6) exhibited an r.m.s. deviation lower than the baseline method. The ten solvers used a variety of methodological approaches, but it is interesting to observe that four out of the six top-ranking teams employed variants of the random forest machine learning approach. Two other approaches that ranked at position 1 and 6 were based on Bayesian trees and nonparametric regression, respectively. Simple regression methods (ranks 8 to 10) performed substantially worse than the baseline method. Method 7 is a naive predictor that calculates the average slope of the training set patients and predicts the value of this slope for any further patients, thus achieving a PC of 0. We will refer to the resulting deviation as base r.m.s. deviation because it provides a good estimate for the difficulty of prediction of a given patient set. Except for this method, the performance rankings determined using r.m.s. deviation and those determined using PC were quite consistent.

In addition, we employed bootstrapping to assess the robustness of performance (Online Methods). Here, we evaluated the probability that a given method would achieve the best overall performance on different subsets of patients. Teams 1 and 2 achieved the best r.m.s. deviation in 71% and 26%, respectively, of the patient samples generated by bootstrapping (Fig. 2b). We thus concluded that the algorithms of these two teams provide both the most robust and the most reliable predictions. Therefore, teams 1 and 2 were declared the best performers of the challenge and received an award of $20,000 each. Team 3 (ref. 15) won a third place prize of $10,000. As in previous installments of the DREAM challenges16, the aggregation of predictions across teams 1 and 2 further reduced the prediction error. Bootstrapping further allowed us to estimate that a statistically significant improvement over the baseline algorithm would correspond to an r.m.s. deviation of 0.5, which was not achieved by any method.
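
The bootstrap ranking described above can be sketched as follows: patients are resampled with replacement, every method is rescored on each resample, and the fraction of resamples in which a method achieves the lowest r.m.s. deviation is recorded. The function names, the synthetic data and the number of resamples are assumptions for illustration; the procedure actually used is described in the Online Methods and may differ in detail.

```python
import math
import random


def rmsd(predicted, actual):
    """Root mean square deviation between two equally long lists."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def first_place_frequency(predictions, actual, n_boot=1000, seed=0):
    """Fraction of bootstrap samples in which each method has the lowest RMSD.

    `predictions` maps a method name to its predicted slopes, aligned with
    `actual`. Ties go to the method listed first.
    """
    rng = random.Random(seed)
    n = len(actual)
    wins = {name: 0 for name in predictions}
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample patients with replacement
        sample_actual = [actual[i] for i in idx]
        scores = {
            name: rmsd([pred[i] for i in idx], sample_actual)
            for name, pred in predictions.items()
        }
        wins[min(scores, key=scores.get)] += 1
    return {name: count / n_boot for name, count in wins.items()}


# Synthetic data: method_a has smaller prediction noise than method_b.
data_rng = random.Random(1)
actual = [data_rng.gauss(-0.7, 0.5) for _ in range(200)]
predictions = {
    "method_a": [a + data_rng.gauss(0, 0.3) for a in actual],
    "method_b": [a + data_rng.gauss(0, 0.6) for a in actual],
}
frequencies = first_place_frequency(predictions, actual, n_boot=200)
```

Resampling patients rather than predictions keeps each method's errors paired on the same patients, so the comparison reflects how often one method beats another on plausible alternative patient sets.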

The two top-performing methods as well as the baseline algorithm were then applied to predict the disease progression of patients in the full PRO-ACT data set. The algorithms maintained their ability to predict disease progression reliably (r.m.s. deviations: team 1, 0.544; team 2, 0.559; baseline, 0.559; Supplementary Results 3 and Supplementary Figs. 9 and 10). That performance was slightly lower than during the challenge can be explained by the greater variability in data across a larger number of trials, which also increased the base r.m.s. deviation from 0.566 to 0.610.

Figure 3 Prediction and classification by algorithms and clinicians. (a) ALSFRS slopes were partitioned into 14 clusters of commonly occurring disease progression profiles via k-means. Clusters predominantly contain slow (black), average (gray) or fast patients (red). Patients closest to the center of each cluster were selected to yield 14 representative patients. (b) Performance of 12 clinicians (red, "A" indicates the performance of their aggregated predictions) and two algorithms (blue) is assessed based on r.m.s. deviation (ordinate) and Pearson's correlation (abscissa). Here, r.m.s. deviation and correlation are calculated based on the 14 representative patients. (c,d) The slope predictions (abscissa) as generated by algorithm 1 (c) and the median clinician regarding r.m.s. deviation (d). Individual patients were classified as slow (-), medium (o) or fast (+) according to the true progression (ordinate). The predicted classifications were assessed relative to a threshold (dashed line). Patients left or right of the threshold are assumed to be predicted fast or slow, respectively. Circles highlight incorrectly classified patients (d).

Figure 3 axis labels: time (months); r.m.s. deviation (1/month); ALSFRS (1/month); predicted ALSFRS: method 1 (1/month); predicted ALSFRS: median clinician (1/month).

Figure 4 Analysis of predictive features. (a) The heat map depicts the features that were identified (ranked from 1–30, illustrated by colors from blue to yellow) by at least two of the algorithms among their 30 most predictive. The column termed mean is the average across the top six solvers. ALSFRS_Q denotes the usage of at least one of the individual ALSFRS questions in contrast to the usage of just their sum, ALSFRS. Three features not previously reported in the literature, phosphorus, creatinine and pulse, are analyzed in the remaining panels. (b) The probability of correlation between a feature and ALSFRS distributed across patients. (c,d) For two example patients, the time progression of these predictive features. Note: to show different measures in one diagram, we normalized them to relative values based on quartiles (Online Methods).

Figure 4 labels recoverable from the panels: time from onset; time from diagnosis; blood pressure; uric acid; site of onset; creatine kinase; phosphorus; pulse; patient 136474; patient 134770. Axes: probability; relative value; time after onset (days).

Besides the selection of an appropriate predictive approach, performance was also influenced by the processing of the clinical measurements. Static features (those with one value per patient, e.g., gender or age) could be exploited as is. In contrast, the remaining features were 'time-resolved' and could not, therefore, be incorporated directly into standard machine learning frameworks because time points and number of measurements varied between patients. Generally, teams converted each type of time-resolved data per patient into a constant number of static features by applying various statistics. For instance, linear regression was applied to represent a set of measurements by the slope and intercept (e.g., baseline method). Another approach was to select designated measurements as features, such as the minimum and maximum of the values. The latter approach was successfully applied by the best-performing team 1. Notably, this min/max approach represented the time-resolved data in a more robust way than the linear regression approach, which apparently suffered from the relatively few data points available. The treatment of features was also what distinguished the four methods based on random forest variants, that is, they used specific approaches for feature selection, missing value imputation or feature summary statistics.
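
The conversion of time-resolved measurements into a fixed number of static features can be sketched as below, covering both summaries mentioned above (minimum and maximum, and the slope and intercept of a regression line). The function name, the `(day, value)` format and the pulse readings are hypothetical; this is not the code of any team.

```python
def summarize_series(measurements):
    """Turn one patient's time-resolved measurements into static features.

    `measurements` is a non-empty list of (day, value) pairs (an assumed
    format). Slope and intercept stay None when fewer than two distinct
    time points are available, since no line can be fitted then.
    """
    days = [d for d, _ in measurements]
    values = [v for _, v in measurements]
    features = {"min": min(values), "max": max(values), "slope": None, "intercept": None}
    if len(set(days)) >= 2:
        n = len(days)
        mean_d = sum(days) / n
        mean_v = sum(values) / n
        sxx = sum((d - mean_d) ** 2 for d in days)
        sxy = sum((d - mean_d) * (v - mean_v) for d, v in measurements)
        features["slope"] = sxy / sxx
        features["intercept"] = mean_v - features["slope"] * mean_d
    return features


# Hypothetical pulse readings (beats per minute) on four visit days.
pulse_readings = [(0, 70.0), (30, 74.0), (60, 72.0), (90, 80.0)]
pulse_features = summarize_series(pulse_readings)
single_visit_features = summarize_series([(0, 70.0)])
```

The minimum and maximum are defined even for a single measurement, whereas the regression summary needs at least two time points and is sensitive to each of them when only a few are available, which is consistent with the robustness difference noted above.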

To be useful clinically, algorithms should be able to maintain predictability with limited data that either (i) include fewer features or (ii) cover less than 3 months. Therefore, we tested the effect of (i) by using only the five most-predictive features and of (ii) by limiting the time period of information available to the algorithms to the first month of data. For both the best-performing algorithm and the baseline method, performance did not substantially deteriorate (Supplementary Results 3 and Supplementary Figs. 9 and 10).

Predictions facilitate reduction in clinical trial size

ALS clinical trials serve to evaluate the effect of a given drug treatment on disease progression, with ALSFRS and ALSFRS-R scores serving as common outcome measures. The great variability in ALS disease progression hinders the ability to detect effects of a given treatment, necessitating larger and more costly clinical trials. The ability to more accurately predict the expected disease progression for a given patient (without treatment intervention) can therefore reduce the number of patients needed by increasing the ability to detect a drug's effect on disease (less obscured by the inherent disease variability). To quantify the potential trial size reduction engendered by use of the algorithms, we simulated trials (Online Methods). We estimated that trial size could be reduced by up to 20.4% using the aggregated predictions of teams 1 and 2 (Fig. 2d). As the average cost per patient in an ALS clinical trial is $30,000 (L.A. White and D. Kerr, personal communication), for a phase 3, 1,000-patient trial, this would translate into a $6-million reduction in cost.
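
The cost figure above follows from simple arithmetic, reproduced in the sketch below. The second function is a textbook covariate-adjustment approximation added here as an assumption to illustrate why better predictions shrink a trial; it is not the simulation procedure of the Online Methods, and both function names are hypothetical.

```python
def trial_savings(n_patients, cost_per_patient, size_reduction):
    """Patients and dollars saved when a trial shrinks by `size_reduction`."""
    patients_saved = round(n_patients * size_reduction)
    return patients_saved, patients_saved * cost_per_patient


def reduction_from_correlation(r):
    """Fractional sample-size reduction under covariate adjustment.

    Textbook approximation (an assumption of this sketch): adjusting the
    outcome for a predictor with correlation r leaves a fraction 1 - r**2
    of the outcome variance, and required sample size scales with variance.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("correlation must lie in [-1, 1]")
    return r ** 2


# Figures quoted in the text: 1,000 patients, $30,000 each, 20.4% reduction.
patients_saved, dollars_saved = trial_savings(1000, 30_000, 0.204)
```

With the quoted figures this gives 204 fewer patients and $6.12 million, in line with the roughly $6-million reduction stated above.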

Comparison between algorithms and clinicians

ALS prognostic prediction is challenging. Clinicians often feel they lack the necessary tools to provide their patients with accurate prognostic information. Thus, we aimed to evaluate whether the ALS prediction algorithms could help clinicians by comparing their predictive performance.

Therefore, we selected a relatively small but representative subset of patients. Using k-means clustering, we divided the 1,822 patients into 14 clusters, reflecting commonly occurring ALSFRS time courses. Clusters were distinguished by both their intercept (ALSFRS score at time 0) and the shape of their progression curves (Fig. 3a). Based on these 14 distinct disease-progression patterns, we selected 14 representative patients, that is, the patient closest to the centroid of each cluster.
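
The selection of representative patients can be sketched with a plain implementation of Lloyd's k-means algorithm followed by a nearest-to-centroid lookup. The sketch assumes that each patient's ALSFRS time course has already been resampled to a common set of time points, and the six toy trajectories, `k = 2` and all function names are illustrative; the challenge analysis used 14 clusters, and its preprocessing is not specified here.

```python
import random


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign(points, centroids)
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster empties
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign(points, centroids)


def representatives(points, centroids, labels):
    """Index of the patient closest to each non-empty cluster's centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            reps[c] = min(members, key=lambda i: squared_distance(points[i], centroid))
    return reps


# Toy ALSFRS scores at four common time points: three slow, three fast patients.
points = [
    [38, 37, 36, 35], [37, 36, 35, 34], [39, 38, 37, 36],
    [38, 34, 30, 26], [37, 33, 29, 25], [39, 35, 31, 27],
]
centroids, labels = kmeans(points, k=2)
reps = representatives(points, centroids, labels)
```

Choosing the patient nearest to each centroid, rather than the centroid itself, yields real patient records that can be shown to clinicians.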

Subsequently, 12 clinicians from top ALS clinics from seven countries were asked to estimate the future disease progression of these representative patients (Supplementary Results 4, Supplementary Fig. 11, Supplementary Table 10 and Supplementary Data 1), using the exact same data provided to the algorithms. The two best-performing algorithms substantially outperformed all clinician predictions, indicated by both higher PC and lower r.m.s. deviation (Fig. 3b). In addition, the algorithms also outperformed the aggregate of all the clinicians' predictions. The rate of progression predicted by the best-performing algorithm (team 1's), in contrast to the rate predicted by the clinicians, correlated well with the actual rate of disease progression across the 14 cases (Fig. 3c). These results suggest that the prediction algorithms might prove useful in helping clinicians to assess patient disease progression.

In addition to an algorithm's predictive accuracy, it is crucial that it provides a broad assessment of patient progression, that is, that it can correctly classify a given patient as having an average disease progression, or having an unusually slow or fast disease course. Therefore, we analyzed both the algorithm- and clinician-based predictions to determine to what extent slow and fast progressors were correctly classified. Three patients showing an ALSFRS slope of less than -1.1 points/month were considered fast, seven patients with a slope greater than -0.5 points/month were considered slow, and the remaining four patients were considered average.
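
The cutoffs stated above translate directly into a small classifier. Applying the same two cutoffs to predicted slopes in order to compute a misclassification rate is an assumption of this sketch (the figure instead assesses predictions against a single threshold), and the function names and example slopes are hypothetical.

```python
def classify_progression(slope, fast_below=-1.1, slow_above=-0.5):
    """Classify an ALSFRS slope (points/month) as fast, slow or average."""
    if slope < fast_below:
        return "fast"
    if slope > slow_above:
        return "slow"
    return "average"


def misclassification_rate(predicted_slopes, true_slopes):
    """Fraction of patients whose predicted class differs from the true class."""
    pairs = list(zip(predicted_slopes, true_slopes))
    if not pairs:
        raise ValueError("need at least one patient")
    wrong = sum(classify_progression(p) != classify_progression(t) for p, t in pairs)
    return wrong / len(pairs)


# Hypothetical slopes: the last patient is truly slow but predicted average.
true_slopes = [-1.3, -0.3, -0.8, -0.2]
predicted_slopes = [-1.2, -0.4, -0.9, -0.7]
example_rate = misclassification_rate(predicted_slopes, true_slopes)
```

Slopes exactly equal to a cutoff fall into the average class here, because the text defines fast and slow with strict inequalities.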

On this limited subset of patients, the best algorithm (team 1's) discriminated perfectly between slow and fast-progressing patients (Fig. 3c,d). In contrast, the rate of progression predicted by a typical clinician showed substantially less correlation to the true rate and many patients were misclassified. On average, clinicians misclassified 35% of the patient cases (Supplementary Results 4, Supplementary Fig. 11 and Supplementary Table 10).

Predictive features

To assess the importance of each feature for the different algorithms, we looked at the predictive features as they were ranked by the top six best-performing algorithms. We focused on the features that at least two solvers included within the top 30 predictive features. Sixteen such features were identified (Fig. 4a). The list includes several features previously reported to predict ALS progression, including time from onset, age, forced vital capacity (FVC), site of onset, gender, weight17–20, as well as uric acid concentration in blood, a feature that has only been suggested recently as a predictor20. In addition, the challenge was successful in identifying nonstandard predictive features, opening the door to new insights into ALS disease mechanisms. These were pulse, blood pressure as well as the concentration of creatine kinase, creatinine and phosphorus.

We further assessed these features by determining the correlation, over time, between the relevant feature and the ALSFRS score, for each subject (Fig. 4b–d). Notably, for creatinine the distribution of correlations across the patients was skewed toward higher correlations, indicating a subset of patients that exhibit an unusually high correlation between changes in creatinine and ALSFRS score. This was also found, to a lesser extent, for creatine kinase, which is correlated with creatinine. This suggests that these features may be especially predictive for specific subgroups of patients and therefore might be useful biomarkers of the disease. A similar trend was not found for pulse, phosphorus or blood pressure (Supplementary Results 5 and Supplementary Figs. 12 and 13), for which further detailed analysis, beyond the scope of this study, is needed to explore their potential predictive properties.

DISCUSSION


The current lack of robust approaches for estimating the future disease progression of ALS patients represents a major obstacle for the testing of novel therapeutic approaches in clinical trials and the understanding of disease mechanisms. As ALS is a rapidly progressing disease, the accurate estimation of progression is very important for patient care and making decisions regarding clinical interventions and assistive technology.

The unique global challenge presented here brought together the efforts of 37 participating teams to develop tools to predict disease progression in a way useful to ALS clinical trials and clinicians, and to identify new predictive features that can provide new insights into disease processes and could provide important biomarkers.

The PRO-ACT platform, the largest existing ALS clinical trial data set, has provided an unprecedented opportunity to increase our understanding of the ALS patient population and the natural history of the disease21.

The crowdsourcing approach had several advantages. First, the challenge attracted new minds and new perspectives to a problem largely unknown outside the ALS research community22. Second, the format of the challenge allowed blinded side-by-side comparisons of different prediction methods, tested on a data set the solvers never saw and to which the algorithms had never been exposed. This allowed a better assessment of the robustness of the solutions.

Notably, the algorithms could be divided into two groups based on their performance, with teams using tree-based ensemble regression techniques, such as random forest or Bayesian additive regression, almost always outperforming teams using simple regression. These results suggest that tree-based ensemble regression techniques are likely suitable for clinical data in general, beyond the scope of ALS, and are therefore of broad importance for the analysis of data from clinical trials as well as clinical health records. In addition, robust processing of the time-resolved clinical measurements seemed to be the key to achieving the most predictive results overall, with simple summary statistics performing best. This may be due to the limited number of time points available and the intrinsic noise, issues prevalent in medical data.

By simulation, we estimated that predictions by the winning algorithms could lead to a 20% reduction in population size for an ALS clinical trial. This reduction stems from changes in trial design. When planning clinical trials, variability between the patients is estimated to plan for a sufficient trial size to capture the effect of the drug beyond that variability. An algorithm that gives more information about the patients, and thereby reduces the interpatient variability, can facilitate a reduction in trial size. Furthermore, a reduction of the estimated magnitude could also affect the number of medical sites required by a trial, leading to further cost savings. Prognostic methods could also lead to improvements in trial effectiveness, so assessing the financial implications of incorporating predictive algorithms into clinical trial design is not straightforward. If we limit the effect of a more quantitative understanding of disease heterogeneity only to calculating the number of patients needed, we estimate that predictions enable a 20% reduction in population size of a phase 3 trial, resulting in a $6-million reduction in costs. These financial benefits need to be weighed against the potential costs of providing a lead-in period where patients are tracked before the start of the trial to determine their expected progression and other factors that affect the trial, such as patient drop-out or limited survival. The finding that the algorithms maintain their predictability using just 1 month of information (Supplementary Results 3), and the fact that the performance of the algorithms remained robust when tested on the larger and more diverse full PRO-ACT data set, demonstrate the power of crowdsourcing, where a challenge with a monetary award of $50,000 can potentially reduce the costs of multiple future clinical trials by millions of dollars. The algorithms are currently being further tested and validated on proprietary ALS trial data. Furthermore, efforts are underway to transform the algorithms into a ready-to-use software tool for evaluation and future application in clinics and in clinical trials.

To further assess the ability of the winning algorithms to help clinicians in accurately determining the prognosis of their patients, we directly compared the algorithms' predictions with the estimates of 12 leading ALS clinicians. For all 14 patient cases examined, the algorithms outperformed all of the clinicians and the aggregate of the clinicians by a substantial margin. Clearly, predicting disease prognosis based solely on an anonymized data set in the absence of a clinical encounter cannot reflect the wealth of information that an experienced clinician can glean through clinical observation. However, these results demonstrate how a predictive algorithm could prove helpful to clinicians when advising patients. As the patient population participating in clinical trials is not fully representative of the patient population seen in the clinic23, steps are being taken to directly test the utility of these algorithms in a clinical setting.

Another important goal of the ALS Prediction Prize challenge was to validate features that had previously been suggested to be predictive in small studies and to identify novel predictive features. Overall, 15 different features were identified by more than one solver. Several were features reported in the literature17,18,24–33, including age, site of disease onset, gender, the slope of disease progression so far, past ALSFRS slope, and past FVC slope, thus serving as validation of both the features and the algorithms. Unfortunately, for many of the patients in PRO-ACT, FVC information was not available or other key features required to calculate FVC were missing. Weight, which has been disputed as a predictor in the ALS literature19,34,35, was found to be predictive. Specific ALSFRS questions were found to be predictive by the different teams, but with no consensus over certain questions being more predictive of the total score than others.
