---
title: "Stranded Model Tutorial"
author: "Gary Hutson"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Stranded Model Tutorial}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette explains why the `stranded_data` dataset was created, shows how to load it, and gives an example of its use with the caret machine learning package.

The dataset contains:

- __stranded.label:__ Character - label indicating whether the patient is stranded or not
- __age:__ Integer - the age of the patient on admission to hospital
- __care.home.referral:__ Integer - flag to indicate the patient was referred from a care home
- __medicallysafe:__ Integer - flag to indicate whether the patient is medically safe (for example, fit to be discharged but still in hospital)
- __hcop:__ Integer - flag to indicate whether the patient is in a Health Care for Older People area
- __mental_health_care:__ Integer - flag to indicate mental health care provision
- __period_of_previous_care:__ Integer - flag to indicate previous periods of care
- __admit_date:__ Date - admit date
- __frailty_index:__ Character - specifying frailty type, if frail

## First, load the data and inspect it

```{r load, warning=FALSE, message=FALSE}
library(NHSRdatasets)
library(dplyr)
library(ggplot2)
library(caret)
library(rsample)
library(varhandle)

data("stranded_data")
glimpse(stranded_data)
prop.table(table(stranded_data$stranded.label))
```

This is encouraging: the output shows a relatively even split between the not stranded and stranded labels. For data where the classes are imbalanced, refer to the webinar on [Advanced Modelling](https://www.youtube.com/watch?v=rO40vvKXU-4&t=1360s), which covers techniques such as SMOTE (Synthetic Minority Over-sampling Technique) and ROSE (Random Over-Sampling Examples), to name a few.

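As a rough illustration of the idea behind these rebalancing techniques, the chunk below performs plain random oversampling in base R. This is neither SMOTE nor ROSE, which generate synthetic rows rather than duplicating existing ones. The `toy` data frame and its 80/20 class split are invented for the example and are not the stranded data.

```{r oversample_sketch}
# Toy data, invented purely for illustration (not the stranded data)
set.seed(42)
toy <- data.frame(
  label = factor(c(rep("Not Stranded", 80), rep("Stranded", 20))),
  age   = sample(18:100, 100, replace = TRUE)
)

# Every class is resampled up to the size of the largest class
target_n <- max(table(toy$label))

# Row indices belonging to each class
rows_by_class <- split(seq_len(nrow(toy)), toy$label)

# Keep all original rows and top up smaller classes by sampling with replacement
balanced_rows <- unlist(lapply(rows_by_class, function(idx) {
  extra <- target_n - length(idx)
  c(idx, idx[sample.int(length(idx), extra, replace = TRUE)])
}))

toy_balanced <- toy[balanced_rows, ]
table(toy_balanced$label)
```

Any rebalancing of this kind should be applied to the training data only, after the train/test split, so that duplicated rows cannot appear in both sets.
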
## Feature engineering

The next step is to decide which features need to be engineered for our machine learning model. We will drop `admit_date` and recode the frailty index as dummy variables. Age could also be grouped into age bands, but it is left as an integer here.

```{r feature_engineering}
stranded_data <- stranded_data %>%
  dplyr::mutate(stranded.label = factor(stranded.label)) %>%
  dplyr::select(-admit_date)
```

Next, I will select the categorical variables and convert them into dummy variables, that is, a numerical encoding of a categorical variable:

```{r dummy_encode}
cats <- select_if(stranded_data, is.character)
# Dummy encode the frailty index; the new columns are prefixed with "frail_ind"
cat_dummy <- varhandle::to.dummy(cats$frailty_index, "frail_ind")
cat_dummy <- cat_dummy %>%
  as.data.frame() %>%
  dplyr::select(-frail_ind.No_index_item) # Drop one level to act as the reference category

# Drop the frailty index from the stranded data frame and bind on the new dummy variables
stranded_data <- stranded_data %>%
  dplyr::select(-frailty_index) %>%
  bind_cols(cat_dummy) %>%
  na.omit()
```

The data is now ready to be split into simple train and test sets for the machine learning step.

## Splitting the data

The next step is to create a simple hold out train/test split:

```{r train_test_split}
set.seed(123) # The split is random, so set the seed first to make it reproducible
split <- rsample::initial_split(stranded_data, prop = 3 / 4)
train <- rsample::training(split)
test <- rsample::testing(split)
```

## Create a simple logistic regression model to classify stranded patients

The next step is to create a stranded classification model with caret:

```{r class_model}
set.seed(123)
# stranded.label is already a factor, so it goes on the left-hand side as is;
# wrapping it in factor() would stop `.` from excluding it from the predictors
glm_class_mod <- caret::train(stranded.label ~ .,
  data = train,
  method = "glm"
)
print(glm_class_mod)
```

This is a very basic model and could be improved through model choice, hyperparameter selection, different resampling strategies, etc.

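To make "different resampling strategies" a little more concrete, here is a minimal hand-rolled sketch of k-fold cross-validation for a logistic regression, using base R only. The `toy` data, its `stranded` and `age` columns, the choice of five folds and the 0.5 cut-off are all assumptions made for the illustration. In practice, caret can run cross-validation for you through the `trControl` argument of `caret::train()`.

```{r cv_sketch}
# Toy binary-outcome data, invented for illustration only
set.seed(123)
n <- 200
toy <- data.frame(age = round(runif(n, 18, 100)))
toy$stranded <- rbinom(n, size = 1, prob = plogis(-4 + 0.06 * toy$age))

# Randomly assign each row to one of k folds of equal size
k <- 5
fold_id <- sample(rep(seq_len(k), length.out = n))

# Fit on k - 1 folds, then measure accuracy on the held-out fold
fold_accuracy <- sapply(seq_len(k), function(fold) {
  fit <- glm(stranded ~ age, data = toy[fold_id != fold, ], family = binomial)
  held_out <- toy[fold_id == fold, ]
  prob <- predict(fit, newdata = held_out, type = "response")
  mean(as.integer(prob > 0.5) == held_out$stranded)
})

fold_accuracy
mean(fold_accuracy)
```

Averaging over folds gives a steadier performance estimate than a single hold out split, at the cost of fitting the model k times.
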
## Predicting the test set to validate model

Next, we will use the test dataset to estimate how the model will perform on unseen data:

```{r predicting}
preds <- predict(glm_class_mod, newdata = test) # Predict class
pred_prob <- predict(glm_class_mod, newdata = test, type = "prob") # Predict class probabilities

# Join the predictions on to the test data frame, ready for the confusion matrix
predicted <- data.frame(preds, pred_prob)
test <- test %>%
  bind_cols(predicted) %>%
  dplyr::rename(pred_class = preds)

glimpse(test)
```

## Evaluating with confusion matrix

The final step is to evaluate the model:

```{r evaluation}
# confusionMatrix() takes the predictions first (data) and the true labels second (reference)
caret::confusionMatrix(
  data = test$pred_class,
  reference = test$stranded.label,
  positive = "Stranded"
)
```

The model performs relatively well and could be improved with better predictors, a bigger sample and class imbalance techniques.

## Conclusion

This dataset can be used for a number of classification problems and could serve as the NHS's equivalent of the iris dataset, although it is only suited to binary classification problems.