Key Drivers for Successful Patient Event Prediction: Empirical Findings on What Matters and to What Extent
Srinivas Chilukuri, Principal Data Scientist, ZS Associates and Sagar Madgi, Senior Data Scientist, ZS Associates
Abstract:
Observational healthcare databases, such as administrative claims and electronic health records, present rich data sources for knowledge discovery from patient longitudinal histories. One such use case is the prediction of various events across the patient treatment journey, such as diagnosis and therapy initiation, progression or discontinuation.

If implemented well, patient event prediction models enable several applications in the commercial (predictive customer targeting, patient services design) and research (target patient universe determination, trial site selection) domains. However, owing to the richness, complexity and nuances in the data, there are several things to get right when it comes to model design. For instance, selection of right data set and sample size, length of medical history, prediction time window, modeling parameters, type of features (recency, frequency, sequence); and mechanism of feature generation (knowledge-driven vs. automatically generated).

In this paper, we present empirical findings on how these considerations weigh on model performance and downstream utility, drawing upon results from a diverse set of use cases spanning multiple therapy areas.

Keywords: Patient Event Prediction, Machine Learning, Knowledge Features, Data-Driven Features, Prediction Window

1. Background
1.1 Introduction and Motivation
The pharma industry is increasingly focused on specialty products for treating niche conditions. For brands to succeed in this environment, it is imperative to identify the right patient at the right time. This necessitates predicting patient events ahead of time, which can then inform several downstream applications such as clinical trial planning, predictive customer targeting and personalized patient assistance. Traditionally, these events have been approached from a clinical scoring perspective (e.g., CHADS2, MELD), but such scores are not available across the spectrum of events, so there is a need to build prediction models.

However, it is non-trivial to set up, operationalize and derive business value from patient event prediction models. This is because several distinct aspects must come together into a coherent machine learning pipeline for such efforts to be successful. This paper aims to discuss the critical success factors and provide empirical guidance for brand/analytics leaders and data scientists who would be undertaking such endeavors.
1.2 Prediction Modeling Components and Scope of the Paper
A typical patient event prediction model involves the following key ingredients:
  • Right patient data set
  • Representative patient sample
  • Optimal length of medical history
  • Optimal prediction time window
  • Exhaustive model features (hypotheses underlying events)
  • Suitable machine learning model and corresponding hyperparameters
Each of these steps involves making choices across several potential options (see Figure 1 for examples), which makes the whole process complex and time-consuming.

Figure 1: Key Steps in Setting Up a Patient Event Prediction Model


Domain experts can hypothesize parameters for each step in the process that will lead to the best model performance; however, in most cases, determining these parameters is a matter of conjecture and presents combinatorially complex possibilities. Typically, multiple iterations are run to understand the effect of varied permutations and combinations of choices made in each step on model accuracy before finalization of the model. This is a time-consuming and laborious process, even for experts.3

There is, then, a trade-off between model performance and the resources (time and cost) that can be expended on improving model accuracy. The cost of exploration is steep and can increase rapidly with the number of combinations, without a corresponding increase in performance. It is critical to find the right balance between exploration and performance.

To this end, this paper presents empirical findings that can serve as a reference for finding the right balance between the two. These findings are based on observations from a set of event prediction experiments implemented for different use cases and therapy areas. Our focus is not on understanding causal relationships between input variables and outcomes, but on understanding how choices for a specific set of parameters (observation windows, variables, etc.) affect model performance. Specifically, we seek to understand how the following considerations weigh on performance in patient event prediction:
  1. Length of medical history – What is the right length of time window over which a patient’s prior medical events should be considered?
  2. Prediction window – What is the optimal time window for making predictions so that they are actionable for end users and address the business need at hand?
  3. Feature classes and types – What kind of features matter across use cases and therapy areas? What is the relative importance of features across concept domains (diagnosis, medication, procedure, etc.) and type (recency, frequency, sequence, etc.) of features?
  4. Mechanism of feature generation – How does the mechanism of feature generation (clinical knowledge-driven vs. data-driven) weigh on prediction accuracy?
These experiments will be conducted using a claims database, the de facto standard choice for event prediction algorithms given its ubiquity across applications and higher accuracy.4,5 While recent literature suggests that EHR data does add predictive power6, exploring the impact of adding EHR data is out of scope for this study and will be assessed in a subsequent paper.

In the rest of this paper, we describe the experimental setup and results, concluding with a discussion and assessment of findings that could be used to guide the setup of event prediction models in general.
2. Experimental Setup
2.1 Prediction Use Cases and Disease Areas
We focus on four major events occurring through a patient’s journey (see Figure 2) viz. diagnosis of a condition, treatment adoption, treatment progression (change in line) and treatment drop-off.

Figure 2: Use Cases Across Patient Journey


From a disease area perspective, the experiments will focus on disease areas which a) are representative of low prevalence to high prevalence scenarios, and b) where event prediction models are relatively important for healthcare stakeholders such as payers, providers and pharma, given the burden imposed by these diseases. To this end, we believe the following diseases can serve as a good representative set (see Figure 3).
  • Oncology – Non-small cell lung cancer (NSCLC)
  • Immunology – Rheumatoid arthritis (RA)
  • Primary care – Chronic heart failure (CHF)
Figure 3: Disease Areas by Prevalence7-13 and Economic Burden14-21


Building on the above choice of use cases and disease areas, experiments will be conducted across the set of event prediction models listed in Table 1.

Table 1: List of Disease Areas and Use Cases for Experimentation
Prediction Use Case | Oncology (NSCLC) | Immunology (RA) | Primary Care (CHF)
Identify patients who are likely to...
Disease Diagnosis | be diagnosed with metastatic NSCLC | be diagnosed with RA | be diagnosed with CHF
Treatment Adoption | adopt an EGFR drug | adopt an anti-TNF drug | adopt an ACE-inhibitor drug
Treatment Change (Line Switch) | switch to 2nd line | switch to 2nd line biologic | switch to 2nd line
Treatment Drop-off | drop off from an EGFR drug therapy | drop off from an anti-TNF therapy | drop off from an ACE-inhibitor therapy

The effect of the length of medical history, varied from 6 to 36 months, will be tested for all the event prediction models listed above. The impact of varying the prediction window (from 1 month to 6 months) will be tested for models built with a 12-month medical history.

The impact of knowledge-driven vs. data-driven features on model performance will be assessed for diagnosis use cases across all the above disease areas; these are the cases where existing research gives us a good baseline for building knowledge-driven features. This will be done for models built using 12-month medical history.

Figure 4 shows the typical machine learning pipeline for building patient event prediction models. The same steps have been followed in our experimental setup.

Figure 4: Typical Patient Event Prediction Model Development Pipeline


2.2 Data
The experimental results in this paper are based on Optum’s de-identified Clinformatics™ Data Mart US healthcare claims database for NSCLC, RA and CHF over the period 2012-2018.
2.3 Specifying the Prediction Problem
The prediction problem is generalized as follows: given a target cohort of patients with a certain medical history, what is the probability of a patient experiencing an event of interest within a given prediction window? The target population and outcomes are defined based on a set of inclusion/exclusion rules (such as the occurrence of certain diagnoses, medications, labs, etc.).
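For illustration, the following minimal sketch (in Python, with hypothetical column names patient_id, anchor_date and event_date) shows how such an outcome label could be derived from a claims extract for a chosen prediction window; it is not the study's implementation.

```python
import pandas as pd

def label_cohort(cohort: pd.DataFrame, events: pd.DataFrame,
                 window_days: int = 30) -> pd.DataFrame:
    """Label = 1 if the patient's first event of interest falls within
    `window_days` after the anchor date, else 0."""
    # First occurrence of the event of interest per patient.
    first_event = events.groupby("patient_id", as_index=False)["event_date"].min()
    df = cohort.merge(first_event, on="patient_id", how="left")
    # Days from anchor date to the event; NaN (no event) yields label 0.
    gap_days = (df["event_date"] - df["anchor_date"]).dt.days
    df["label"] = ((gap_days >= 0) & (gap_days <= window_days)).astype(int)
    return df
```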

Additional details around inclusion/exclusion criteria and sample size for each model are available in the appendix.
2.4 Feature Definition and Selection
Variables derived from demographics, symptoms, comorbidities, drugs, procedures, visits and other observations recorded prior to the anchor date (see Figure 5) will be used for feature creation across experiments. To enable rapid experimentation, an automated, intelligent feature generation and selection framework will be utilized for agile discovery of relevant features. Additional details around this framework are available in the appendix.
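As a concrete, hypothetical example of such feature creation, the sketch below computes simple recency and frequency features per concept over a fixed lookback window before the anchor date. The column names (claim_date, concept) and the lookback length are assumptions for illustration only and do not reflect the proprietary framework described in the appendix.

```python
import pandas as pd

def recency_frequency_features(labeled: pd.DataFrame, claims: pd.DataFrame,
                               lookback_days: int = 365) -> pd.DataFrame:
    """Per (patient, concept): event count (frequency) and days since the most
    recent occurrence (recency) within the lookback window before the anchor date."""
    df = claims.merge(labeled[["patient_id", "anchor_date"]], on="patient_id")
    df["days_before"] = (df["anchor_date"] - df["claim_date"]).dt.days
    # Keep only events strictly before the anchor date and within the lookback window.
    df = df[(df["days_before"] > 0) & (df["days_before"] <= lookback_days)]
    agg = (df.groupby(["patient_id", "concept"])["days_before"]
             .agg(frequency="count", recency="min")
             .unstack("concept"))
    # Flatten the (statistic, concept) column index into names like frequency_<concept>.
    agg.columns = [f"{stat}_{concept}" for stat, concept in agg.columns]
    return agg.reset_index()  # missing combinations stay NaN; XGBoost handles these natively
```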

Figure 5: Specifying the Prediction Problem


For experiments involving knowledge-driven features, features will be defined based on existing research and expert opinion.22-25 Details around these knowledge-driven features are available in the supplementary data file.
2.5 Model Building
Labeled data, along with features, will be split into train and test sets in a 70:30 ratio. XGBoost, a state-of-the-art machine learning model, will be utilized for model training. We prefer XGBoost over more complex models such as Artificial Neural Networks (ANN) given our focus on explainability in addition to prediction performance. Appropriate hyperparameter tuning will be performed to ensure optimal learning. Additional details around the models are available in the appendix.
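A minimal sketch of this step, assuming a prepared feature matrix X and label vector y, is shown below; the split ratio and objective follow the text, while the remaining settings are illustrative defaults rather than the study's tuned configurations (see Table 4 in the appendix for those).

```python
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

def train_event_model(X, y):
    """70:30 stratified split, then fit an XGBoost binary classifier."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, stratify=y, random_state=42)
    model = XGBClassifier(
        objective="binary:logistic",  # binary event / no-event outcome
        n_estimators=100,
        max_depth=3,
        learning_rate=0.1,
        subsample=0.7,
        colsample_bytree=0.7,
    )
    model.fit(X_train, y_train)
    return model, X_test, y_test
```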
2.6 Model Evaluation
The models will be validated on the test dataset defined in the previous phase. The area under the receiver operating characteristic curve (AUC) will be used to assess the performance of models across the different scenarios of medical history, prediction window and mechanism of feature generation. See Figure 6.

Figure 6: Area Under the Receiver Operating Characteristic Curve (AUC)


In general, the higher the AUC, the better a model performs. A random model without any predictive power results in an AUC of about 50%, whereas a perfect model would result in a 100% AUC. While it varies by use case, an AUC of 70% is typically considered good and 85% or above excellent.

Feature importance will be assessed using the adjusted F-score. Additional details around AUC, and evaluation for various scenarios, are available in the appendix.
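The following sketch illustrates this evaluation step for a fitted model; note that the study reports feature importance as an adjusted F-score, whereas the sketch uses XGBoost's built-in gain-based importance as a readily available stand-in.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def evaluate_event_model(model, X_test, y_test):
    """Return the test AUC and a ranked table of feature importances."""
    # AUC on held-out patients using the predicted event probability.
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
    # Gain-based importance per feature, highest first.
    importance = (pd.Series(model.get_booster().get_score(importance_type="gain"))
                    .sort_values(ascending=False)
                    .rename("importance"))
    return auc, importance
```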
3. Results
3.1 Length of Medical History
Except for treatment drop-off and mNSCLC diagnosis, we note a monotonic lift in AUC across all use cases as medical history is varied from 6 months to 36 months. The gain, however, is marginal after a certain length of history: 6 months in the case of mNSCLC diagnosis and treatment drop-off vs. 18 months in the other use cases. See Chart 1.

Chart 1: Prediction Performance (AUC) vs. Medical History (in Months)


3.2 Prediction Window
AUC peaks at a one-month prediction window across therapy areas and use cases and declines gradually as the window extends to six months. The decline is pronounced in all cases except CHF diagnosis and treatment adoption. See Chart 2.

Chart 2: AUC for Different Prediction Windows


3.3 Relative Importance of Features
We note that comorbid conditions invariably carry the highest feature importance, followed by medications, across all use cases. The trend differs slightly for treatment change: comorbid conditions still carry the highest feature importance across all diseases but are followed by financial burden features (for NSCLC, CHF) and symptoms (for RA). See Charts 3, 4, and 5.

Frequency (i.e., the occurrence count of events) of comorbidities, medications, symptoms and labs carries the highest feature importance, followed by recency. Financial burden metrics are invariably associated with metrics that capture change over time, such as trends and averages.

Chart 3: Relative Importance of Concept Domain Features Across Different Use Cases


Chart 4: Time-from-Anchor Distribution of Recency Concept Domain Features


Chart 5: Time-from-Anchor Distribution of Frequency Features


3.4 Mechanism of Feature Generation
In terms of model performance, we note that data-driven features significantly outperform knowledge-driven features across all disease areas. Interestingly, however, the features generated by data-driven approaches capture knowledge-driven features in a different form, and these tend to be among the top predictors. See Chart 6.

Chart 6: Mechanism of Feature Generation


4. Discussion
From our investigations on event prediction models based on patient claims data, we have the following observations.
4.1 Length of Medical History
For most use cases, we see no significant gain in model performance beyond a certain length of medical history (even for the few exceptions, the incremental gains are diminishing and arguably don’t justify the increased cost of implementation). In fact, models built using 6 to 18 months of medical history yield close to the best performance in almost all use cases. The likely explanation is that the recent history of events has the highest impact on the patient outcome event, and this impact decays as the window expands, leaving negligible impact for events past a certain date.

Prior research in provider settings corroborates our finding that recent data contributes more to predictive accuracy than additional historical data. Chen et al26 indicated a half-life of four months for clinical data relevance, and Min et al27 observed no difference in prediction performance between records with a one-year observation window and those with a full history.

Based on this, we recommend a 12-month medical history for patient event prediction modeling, as that appears to be the sweet spot that achieves good prediction performance while easing data preparation and reducing computational complexity.
4.2 Prediction Window
We note that prediction performance decreases as the prediction window increases: we can predict well for the next 1 month, less well for the next 3 months (70-80% of the 1-month performance), and at 6 months the prediction is close to a coin flip in most cases. This is likely because, as the prediction window expands, fewer predictor events are available, since these most often occur close to the predicted event.

Prior research around diagnostic prediction models corroborate this. Kleiman et al28 noted an inverse relationship between the length of the prediction window and the quality of the model. They attribute this observation to a decrease in the number of patients available, smaller amounts of data and the importance of patients’ recent health state on their immediate future.

The above findings indicate that it is important to set up models for shorter prediction windows rather than longer ones (6 months or more). This requires modelers to educate business stakeholders and set the right expectations, as we have often seen a desire to predict as far ahead as possible. Once the trade-offs are clear, however, a mutually workable solution can be developed.
4.3 Relative Importance of Features
Features derived from comorbidity and medication variables play a prominent role in driving model performance across all use cases, except for treatment line change for NSCLC and CHF. These are closely followed by symptoms, labs/visits and financial burden metrics. The preponderance of comorbidities and medications in driving model performance may reflect the fact that these are more indicative of underlying disease conditions and thereby capture latent information more readily than other sets of features.

In terms of temporal distribution, a high proportion of features with a time component, such as frequency and recency, are concentrated within a window of 1 to 3 months prior to the anchor date, with only a minority of features extending to the 6-month and, in some cases, the 12-month window. This is consistent with the medical history findings, where history beyond a 6- to 12-month window did not add substantial incremental lift.
4.4 Mechanism of Feature Generation
We note that auto-extracted, data-driven features drive higher prediction performance (AUC: 0.75, 0.70, 0.73) than knowledge-driven features (AUC: 0.521, 0.57, 0.64). Conclusions drawn by prior research, however, are mixed on this topic. Min et al27 report improved model accuracy when data-driven features are added alongside handcrafted features, suggesting that while knowledge-driven features are powerful, data-driven features help improve model accuracy. Tran et al29 go further, suggesting that auto-extracted, disease-agnostic features from medical data can achieve better discriminative power than carefully crafted comorbidity lists.

Typically, in the feature generation phase, analysts tend to rely on prior clinical and disease knowledge to craft features and test them iteratively, retaining features with the highest predictive power. These handcrafted features aid in getting to a certain baseline model accuracy; however, incremental lift in model accuracy may necessitate additional features, the discovery of which is non-trivial given the span of feature space. Advances in machine learning are making this computationally feasible, allowing for search across the entire feature space (which might span hundreds of dimensions) and identifying the most relevant features, which can then be used for driving incremental model accuracy. Therefore, we recommend using a combination of knowledge-driven, along with auto-extracted, data driven features for good predictive performance.

Another key advantage of data-driven features is their ability to handle concept drift.30,31 Models built using knowledge-driven features are more susceptible to performance degradation, because these features are usually static and do not capture changes to underlying patterns in healthcare databases. In contrast, auto-extracted data-driven features can, by construction, capture these changes with every refresh of the database: periodic runs of the automated feature generation algorithms produce a newer set of data-driven features that reflect the updated patterns.
5. Conclusion and Future Work
In this paper, we have presented results from experiments testing the impact of medical history, prediction window, different features and mechanism of feature generation on prediction performance. We believe these provide valuable benchmarks that can be utilized by data scientists and analysts while building patient event prediction models.

In terms of future work, we would like to make these conclusions more generalizable by expanding the experiments to cover a much wider range of use cases and therapy areas. Secondly, we would like to incorporate additional structured and unstructured data available in sources such as EHR, to assess the lift in predictive performance.

APPENDIX: Glossary
CHADS2 – Score for Atrial Fibrillation Stroke Risk
MELD – Model for End Stage Liver Disease
NSCLC – Non-small cell lung cancer
CHF – Chronic Heart Failure
RA – Rheumatoid Arthritis
EGFR – Epidermal growth factor receptor
TNF – Tumor Necrosis Factor
ROC – Receiver Operating Characteristic curve
AUC – Area under the ROC curve
Specifying the Prediction Problem
Table 2: Inclusion/Exclusion Criteria for Cohort
Use Case: Diagnosis

Metastatic NSCLC
  Patient Universe: Patients with at least one NSCLC diagnosis
  Outcome Label 1: Patients with at least one metastasis diagnosis
  Outcome Label 0: All other patients with only an NSCLC diagnosis
  Anchor Date: The first secondary diagnosis for Outcome Label 1; the last event (Rx/Px/Dx) for Outcome Label 0
  Medical History: 1, 2, 3 years (1080 days)
  Additional inclusion/exclusion criteria: 1) Excluded patients with an NSCLC diagnosis before 2014; 2) Excluded patients with a secondary diagnosis before the NSCLC diagnosis

RA
  Patient Universe: Patients with at least one RA diagnosis
  Outcome Label 1: Patients with the first RA diagnosis in 2017
  Outcome Label 0: Patients with no RA diagnosis until 2015
  Anchor Date: The first RA diagnosis for Outcome Label 1; the last event (Rx/Px/Dx) for Outcome Label 0 (until the end of 2015)
  Medical History: 1, 2, 3 years (1080 days)
  Additional inclusion/exclusion criteria: 1) Excluded patients with an RA diagnosis before 2014

CHF
  Patient Universe: Patients with at least one CHF diagnosis
  Outcome Label 1: Patients with the first CHF diagnosis in 2017
  Outcome Label 0: Patients with no CHF diagnosis until 2015
  Anchor Date: The first CHF diagnosis for Outcome Label 1; the last event (Rx/Px/Dx) for Outcome Label 0 (until the end of 2015)
  Medical History: 1, 2, 3 years (1080 days)
  Additional inclusion/exclusion criteria: 1) Excluded patients with a CHF diagnosis before 2014

Details for other use cases are available in the supplementary data file.
Table 3: Patient Pool
Use Case | Metastatic NSCLC (Train / Test) | RA (Train / Test) | CHF (Train / Test)
Diagnosis | 42,747 / 10,687 | 56,775 / 24,332 | 154,952 / 66,408
Treatment Adoption | 5,956 / 2,553 | 6,481 / 2,778 | 210,234 / 90,100
Treatment Drop-off | 1,493 / 641 | 11,673 / 5,004 | 90,830 / 38,927
Treatment Change (Line Progression) | 3,078 / 770 | 16,681 / 7,150 | 156,006 / 66,860

Feature Discovery and Extraction
As outlined earlier, varied feature permutations can be created by combining different concept domains (diagnosis, procedure, medication, demographics, etc.), time windows and aggregators (recency, frequency, change, sequence, etc.). To allow for agile feature discovery across concept domains and time, an intelligent feature generation and selection framework will be utilized for experiments involving prediction window, medical history and mechanism of feature generation. The framework utilizes a unique feature construction and selection architecture, enabled by evolutionary algorithms, that allows creation and testing of an exhaustive feature set across combinations of different domains, time windows and operators. Overall, this framework achieves feature selection in two broad steps (a simplified sketch of the selection step follows the list below):
  1. Aggregator functions (recency, frequency, change in frequency, slope, change in slope) are applied on raw features available in the database, allowing the creation of multiple permutations and combinations across time.
  2. Iterative selection and testing of these features via genetic algorithms across multitudes of generations, ensuring only the “fittest” features survive at the end of the iterations.
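The sketch below illustrates the second step in a highly simplified form: feature subsets are scored by cross-validated AUC, the fittest subsets survive, and new candidates are produced by crossover and mutation. It conveys the idea only; the subset size, population and generation counts are arbitrary, and the actual framework is considerably more sophisticated.

```python
import random
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def ga_select_features(X, y, n_keep=50, population=20, generations=10, seed=0):
    """Genetic-algorithm-style search for a well-performing feature subset."""
    rng = random.Random(seed)
    cols = list(X.columns)
    n_keep = min(n_keep, len(cols))

    def fitness(subset):
        # Score a candidate feature subset by 3-fold cross-validated AUC.
        model = XGBClassifier(max_depth=3, n_estimators=100, learning_rate=0.1)
        return cross_val_score(model, X[subset], y, cv=3, scoring="roc_auc").mean()

    # Start from a population of random feature subsets.
    pop = [rng.sample(cols, n_keep) for _ in range(population)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        survivors = ranked[: population // 2]          # keep the "fittest" subsets
        children = []
        while len(survivors) + len(children) < population:
            a, b = rng.sample(survivors, 2)            # crossover of two parents
            pool = list(set(a) | set(b))
            child = rng.sample(pool, min(n_keep, len(pool)))
            if rng.random() < 0.3:                     # occasional mutation
                outside = [c for c in cols if c not in child]
                if outside:
                    child[rng.randrange(len(child))] = rng.choice(outside)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)
```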
Model Building
XGBoost, an optimized distributed gradient-boosted decision tree (GBDT) Python package, is used for modeling. The model is tuned by optimizing parameters such as the number of trees (n_estimators), the maximum depth of each tree (max_depth), the regularization parameters (lambda and alpha) and others. Training and test AUCs are compared for any evidence of overfitting to ensure a stable model is built.

Our choice of XGBoost is driven by experience: for event prediction using real-world data, recurrent neural networks (RNNs) show potential, but XGBoost provides nearly equivalent performance with a higher degree of interpretability.

This is corroborated by prior research such as Wang et al3, who, after investigating different machine learning models for readmission prediction, concluded that convolutional neural networks (CNNs) and recurrent neural networks (RNNs) barely help; the additional performance uplift provided by such models is not commensurate with the complexity they add.
Table 4: Model Hyperparameters
Parameters common to all 12 models (four use cases across the three disease areas):
max_depth = 3, learning_rate = 0.1, verbosity = 1, objective = binary:logistic, booster = gbtree, tree_method = auto, n_jobs = 1, gamma = 0, min_child_weight = 1, max_delta_step = 0, subsample = 0.7, colsample_bytree = 0.7, colsample_bynode = 1, base_score = 0.5

Parameters that vary by model (values in the order Diagnosis, Treatment Adoption, Treatment Drop-off, Treatment Change within each disease area):

Parameter | Metastatic NSCLC | RA | CHF
n_estimators | 100, 100, 100, 100 | 500, 250, 100, 100 | 800, 100, 100, 100
colsample_bylevel | 0.7, 0.7, 0.7, 0.7 | 0.7, 1, 0.7, 0.7 | 0.7, 0.7, 0.7, 0.7
reg_alpha | 0, 5, 10, 0 | 0, 2, 10, 0 | 0, 0, 0, 0
reg_lambda | 1, 20, 30, 1 | 1, 1, 10, 1 | 1, 1, 1, 1
scale_pos_weight | 1, 9, 2, 1 | 7.5, 1, 13, 6 | 1, 5, 6.5, 1
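For reference, any column of Table 4 maps directly onto the XGBoost scikit-learn wrapper; the sketch below instantiates the RA diagnosis configuration as an example.

```python
from xgboost import XGBClassifier

# RA diagnosis configuration from Table 4; other models differ only in the
# parameters listed in the "varying" portion of the table.
ra_diagnosis_model = XGBClassifier(
    max_depth=3,
    learning_rate=0.1,
    n_estimators=500,
    verbosity=1,
    objective="binary:logistic",
    booster="gbtree",
    tree_method="auto",
    n_jobs=1,
    gamma=0,
    min_child_weight=1,
    max_delta_step=0,
    subsample=0.7,
    colsample_bytree=0.7,
    colsample_bylevel=0.7,
    colsample_bynode=1,
    reg_alpha=0,
    reg_lambda=1,
    scale_pos_weight=7.5,  # offsets class imbalance in the RA diagnosis cohort
    base_score=0.5,
)
```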

Model Evaluation
Area Under the Curve (AUC)
The area under the curve (AUC) is a widely used metric for assessing the performance of a machine learning model on a classification problem. Essentially, it is the probability that the model assigns a higher outcome risk to a randomly chosen patient with the outcome than to a randomly chosen patient without it. It is typically generated by plotting the model’s true positive rate (TPR) against 1 - specificity. AUC is a widely reported benchmark and does not depend on probability thresholds, which makes comparisons unbiased. Other threshold-independent benchmarks such as AUPRC exist but are not as widely reported across publications.

Prediction Window
To assess the impact of the prediction window on model accuracy, AUC will be calculated while the window is varied from one month to six months in one-month increments.

Length of Medical History
To understand the effect of the length of medical history on model performance, AUC will be calculated for each model iteration while the medical history is varied in six-month increments from six months to three years.
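A simple sketch of these two evaluation loops is shown below; build_dataset is a hypothetical helper that assembles train/test matrices for a given (medical history, prediction window) configuration, and the fixed one-month window used in the history sweep is an assumption for illustration.

```python
from sklearn.metrics import roc_auc_score
from xgboost import XGBClassifier

def auc_by_configuration(build_dataset, configurations):
    """Return {(history_months, window_months): test AUC} for each configuration."""
    results = {}
    for history, window in configurations:
        X_train, X_test, y_train, y_test = build_dataset(history, window)
        model = XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100)
        model.fit(X_train, y_train)
        results[(history, window)] = roc_auc_score(
            y_test, model.predict_proba(X_test)[:, 1])
    return results

# Example sweeps: vary the prediction window with a fixed 12-month history,
# then vary the history length with a fixed (assumed) 1-month window.
window_sweep = [(12, w) for w in range(1, 7)]
history_sweep = [(h, 1) for h in (6, 12, 18, 24, 30, 36)]
```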

Concept Domain and Types of Feature Classes
The contribution of feature classes by concept domain (diagnosis, medication, comorbidities, etc.) and aggregator type (recency, frequency, sequence, etc.) will be evaluated by disease area and use case to assess whether a certain class of features is dominant. Feature importance will be assessed via the adjusted F-score metric available from model outputs, using data from a specific window.

Mechanism of Feature Generation
AUC will be compared for models using features from the knowledge-driven and the data-driven approaches, calculated with a 12-month medical history window.
Table 5: Model Performance (AUC) by Length of Medical History

Use Case | Medical History (Months) | Metastatic NSCLC | RA | CHF
Diagnosis | 6 | 75% | 73% | 78%
Diagnosis | 12 | 76% | 78% | 83%
Diagnosis | 18 | 76% | 81% | 86%
Diagnosis | 24 | 76% | 82% | 87%
Diagnosis | 30 | 76% | 82% | 87%
Diagnosis | 36 | 77% | 84% | 88%
Treatment Adoption | 6 | 83% | 83% | 72%
Treatment Adoption | 12 | 84% | 83% | 74%
Treatment Adoption | 18 | 84% | 83% | 75%
Treatment Adoption | 24 | 85% | 84% | 75%
Treatment Adoption | 30 | 85% | 83% | 79%
Treatment Adoption | 36 | 85% | 84% | 79%
Treatment Change (Line Progression) | 6 | 75% | 73% | 72%
Treatment Change (Line Progression) | 12 | 77% | 74% | 73%
Treatment Change (Line Progression) | 18 | 78% | 74% | 73%
Treatment Change (Line Progression) | 24 | 78% | 75% | 74%
Treatment Change (Line Progression) | 30 | 79% | 75% | 75%
Treatment Change (Line Progression) | 36 | 80% | 75% | 77%
Treatment Drop-off | 6 | 76% | 69% | 73%
Treatment Drop-off | 12 | 77% | 69% | 73%
Treatment Drop-off | 18 | 75% | 70% | 73%
Treatment Drop-off | 24 | 76% | 69% | 73%
Treatment Drop-off | 30 | 76% | 69% | 73%
Treatment Drop-off | 36 | 77% | 69% | 73%

About the Authors
Srinivas Chilukuri is a Principal data scientist with ZS Associates where he leads the Artificial Intelligence Center of Excellence in New York. He has 15+ years of experience in applying machine learning solutions across various industries, primarily healthcare.

Sagar Madgi is a Senior data scientist with ZS Associates where he leads the Real World Data Artificial Intelligence Lab in Bengaluru. He has nearly 10 years of experience in applying machine learning solutions in healthcare.
References
1 Van Niel TG, McVicar TR, Datt B (2005) On the relationship between training sample size and data dimensionality: Monte Carlo analysis of broadband multi-temporal classification. Remote Sens Environ 98: 468–480

2 Dobbin KK, Zhao Y, Simon RM. How large a training set is needed to develop a classifier for microarray data? Clin Cancer Res. 2008 Jan 1; 14(1):108-14.

3 Thornton C, Hutter F, Hoos HH, Leyton-Brown K. Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms. In: Proc KDD. 2013. p. 847–55.

4 Kharrazi H, Chi W, Chang HY, Richards TM, Gallagher JM, Knudson SM, Weiner JP. Comparing Population-based Risk-stratification Model Performance Using Demographic, Diagnosis and Medication Data Extracted From Outpatient Electronic Health Records Versus Administrative Claims. Med Care. 2017 Aug;55(8):789-796. doi: 10.1097/MLR.000000000000075

5 J. M. Franklin, C. Gopalakrishnan, A. A. Krumme, K. Singh, J.R. Rogers, C. McKay, N. McEllwee, and N. K. Choudhry. 3/2018. The relative benefits of claims and electronic health record data for predicting medication adherence trajectory. American Heart Journal, 197:153-162. doi: 10.1016/j.ahj.2017.09.019

6 Weissman, G. E., & Harhay, M. (2018). Incomplete Comparisons Between the Predictive Power of Data From Administrative Claims and Electronic Health Records. Medical care, 56(2), 202. doi:10.1097/MLR.0000000000000848

7 Howlader N, Noone AM, Krapcho M, Miller D, Brest A, Yu M, Ruhl J, Tatalovich Z, Mariotto A, Lewis DR, Chen HS, Feuer EJ, Cronin KA (eds). SEER Cancer Statistics Review, 1975-2016, National Cancer Institute. Bethesda, MD, https://seer.cancer.gov/csr/1975_2016/, based on November 2018 SEER data submission, posted to the SEER web site, April 2019.

8 Hunter TM, Boytsov NN, Zhang X, Schroeder K, Michaud K, Araujo AB. Prevalence of rheumatoid arthritis in the United States adult population in healthcare claims databases, 2004-2014.Rheumatol Int. 2017 Sep;37(9):1551-1557. doi: 10.1007/s00296-017-3726-1

9 Komanduri, S., Jadhao, Y., Guduru, S. S., Cheriyath, P., & Wert, Y. (2017). Prevalence and risk factors of heart failure in the USA: NHANES 2013 - 2014 epidemiological follow-up study. Journal of community hospital internal medicine perspectives, 7(1), 15–20. doi:10.1080/20009666.2016.1264696

10 Hofmeister MG, Rosenthal EM, Barker LK, Rosenberg ES, Barranco MA, Hall EW, Edlin BR, Mermin J, Ward JW, Ryerson AB. Estimating Prevalence of Hepatitis C Virus Infection in the United States, 2013-2016.Hepatology. 2019 Mar;69(3):1020-1031. doi: 10.1002/hep.30297

11 Centers for Disease Control and Prevention. Nov 2015. Available from: https://www.cdc.gov/ibd/data-statistics.htm

12 Reveille JD, Witter JP, Weisman MH. Prevalence of axial spondylarthritis in the United States: estimates from a cross-sectional survey. Arthritis Care Res (Hoboken). 2012 Jun;64(6):905-10. doi: 10.1002/acr.21621

13 Anne G. Wheaton, Timothy J. Cunningham, Earl S. Ford, MD, and Janet B. Croft., “Employment and activity limitations among adults with chronic obstructive pulmonary disease — United States, 2013,” Morbidity and Mortality Weekly Report (MMWR), 64 (11), pp. 289-295 (March 7, 2015), Centers for Disease Control and Prevention (CDC)

14 Hugh Waters and Marlon Graf. The Cost of Chronic Diseases in the U.S, May 2019.

15 Medical Expenditures Panel Survey(MEPS), Agency for Healthcare Research and Quality, US Department of Health and Human Services, 2008-2012

16 Bui, A. L., Horwich, T. B., & Fonarow, G. C. (2011). Epidemiology and risk profile of heart failure. Nature reviews. Cardiology, 8(1), 30–41. doi:10.1038/nrcardio.2010.165

17 Razavi, H., Elkhoury, A. C., Elbasha, E., Estes, C., Pasini, K., Poynard, T., & Kumar, R. (2013). Chronic hepatitis C virus (HCV) disease burden and cost in the United States. Hepatology (Baltimore, Md.), 57(6), 2164–2170. doi:10.1002/hep.26218

18 The facts about inflammatory bowel diseases. Crohn’s & Colitis Foundation of America website. http://www.ccfa.org/assets/pdfs/updatedibdfactbook.pdf. Published November 2014. Accessed September 15, 2015

19 Guarascio, A. J., Ray, S. M., Finch, C. K., & Self, T. H. (2013). The clinical and economic burden of chronic obstructive pulmonary disease in the USA. ClinicoEconomics and outcomes research : CEOR, 5, 235–245. doi:10.2147/CEOR.S34321

20 Chen, Q., Jain, N., Ayer, T., Wierda, W. G., Flowers, C. R., O’Brien, S. M., Chhatwal, J. (2017). Economic Burden of Chronic Lymphocytic Leukemia in the Era of Oral Targeted Therapies in the United States. Journal of clinical oncology : official journal of the American Society of Clinical Oncology, 35(2), 166–174. doi:10.1200/JCO.2016.68.2856

21 S Gala, A Shah, M Mwamburi. Economic Burden Associated with Chronic Myeloid Leukemia (CML) Treatments in The United States: A Systematic Literature Review. Value in Health. November 2016 Volume 19, Issue 7, Page A727. doi: https://doi.org/10.1016/j.jval.2016.09.2179

22 Beth L. Nordstrom, Jason C. Simeone, Karen G. Malley, Kathy H. Fraeman, Zandra Klippel, Mark Durst, John H. Page, and Hairong Xu. Validation of Claims Algorithms for Progression to Metastatic Cancer in Patients with Breast, Non-small Cell Lung, and Colorectal Cancer. Front Oncol. 2016; 6: Published online 2016 Feb 1. doi: 10.3389/fonc.2016.00018

23 Savannah L. Bergquist, Gabriel A. Brooks, Nancy L. Keating, Mary Beth Landrum, and Sherri Rose. Classifying Lung Cancer Severity with Ensemble Machine Learning in Health Care Claims Data. Proc Mach Learn Res. 2017 Aug; 68: 25–38.

24 Tripoliti, E. E., Papadopoulos, T. G., Karanasiou, G. S., Naka, K. K., & Fotiadis, D. I. (2016). Heart Failure: Diagnosis, Severity Estimation and Prediction of Adverse Events Through Machine Learning Techniques. Computational and structural biotechnology journal, 15, 26–47. doi:10.1016/j.csbj.2016.11.001

25 Awan, S. E., Bennamoun, M., Sohel, F., Sanfilippo, F. M., & Dwivedi, G. (2019). Machine learning-based prediction of heart failure readmission or death: implications of choosing the right model and the right metrics. ESC heart failure, 6(2), 428–435. doi:10.1002/ehf2.12419

26 Chen JH, Alagappan M, Goldstein MK, Asch SM, and Altman RB. Decaying relevance of clinical data towards future decisions in data-driven inpatient clinical order sets. Int J Med Inform. 2017 Jun;102:71-79. doi: 10.1016/j.ijmedinf.2017.03.006

27 Min X, Yu B, Wang F. Predictive Modeling of the Hospital Readmission Risk from Patients’ Claims Data Using Machine Learning: A Case Study on COPD. Sci Rep. 2019 Feb 20;9(1):2362. doi: 10.1038/s41598-019-39071-y.

28 Ross S. Kleiman, Paul S. Bennett, Peggy L. Peissig, Richard L. Berg, Zhaobin Kuag, Scott J. Hebbring, Michael D. Caldwell and David Page. High throughput machine learning from electronic health records. Quantitative Methods (q-bio.QM); Machine Learning (cs.LG); Machine Learning (stat.ML). arXiv:1907.01901

29 Tran, T., Luo, W., Phung, D. et al. A framework for feature extraction from hospital medical data with applications in risk prediction. BMC Bioinformatics 15, 425 (2014) doi:10.1186/s12859-014-0425-8

30 Oded Maimon and Lior Rokach. 2010. Data Mining and Knowledge Discovery Handbook (2nd ed.). Springer Publishing Company, Incorporated

31 Black M., Hickey R. (2004) Detecting and Adapting to Concept Drift in Bioinformatics. In: López J.A., Benfenati E., Dubitzky W. (eds) Knowledge Exploration in Life Science Informatics. KELSI 2004. Lecture Notes in Computer Science, vol 3303. Springer, Berlin, Heidelberg