---
title: "LOS_model: A simulated hospital length-of-stay dataset"
author: "Chris Mainey"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{LOS_model: A simulated hospital length-of-stay dataset}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
This vignette details why the `LOS_model` dataset was created, how to load it, and gives examples of use for learning/teaching regression modelling using Generalized Linear Models (GLMs), and related techniques.
The data were created specifically for regression tutorials, simulating a small set of data on hospital in-patient spells. The data are 10 sets of 30 simulated patient records, representing 10 different hospitals ("Trusts"). The dataset contains:
- __ID:__ an integer value patient number
- __Organisation:__ A factor, containing hospital name, for example "Trust1"
- __Age:__ an integer representing patient age in years
- __LOS:__ 'Length of Stay,' an integer representing the number of days a patient was in hospital
- __Death:__ an integer flag (0 or 1) representing whether a patient died
## First, load the data and inspect it
```{r load, warning=FALSE, message=FALSE}
library(NHSRdatasets)
library(dplyr)
library(ggplot2)
data("LOS_model")
head(LOS_model)
summary(LOS_model)
# 82.3% survived
prop.table(table(LOS_model$Death))
```
Now lets look at the distributions of Age and LOS. We might expect Age to be normally distributed, but this would be unrealistic, as elderly patient represent higher proportions of the total in-patients. It could be uniform, it could be bi-modal (or have more peaks), but nothing is particularly clear. LOS is highly right-skewed. This means that the high values will affect the mean, dragging it out, and the median would be a better estimation of the centre of the distribution. This shows that most patients don't stay very long, but a few patients stay notably longer.
```{R VisageLOS}
ggplot(LOS_model, aes(x = Age)) +
geom_histogram(alpha = 0.5, col = 1, fill = 12, bins = 20) +
ggtitle("Distribution of Age")
ggplot(LOS_model, aes(x = LOS)) +
geom_histogram(alpha = 0.5, col = 1, fill = 13, bins = 20) +
ggtitle("Distribution of Length-of-Stay")
```
## Modelling LOS and Death
We will attempt to model LOS, using Age and Death, and also to model Death using LOS and Age. This vignette does not describe linear models per se, but assumes you are familiar with linear regression. If not, pause here and do a little reading before you continue.
Our data are not continuous, linear, or normally distributed. Death is binary, LOS is a count. We can extend the linear model idea for this as a Generalized Linear Model (GLM)[1] , by fitting our model via a 'link function.' This function transforms data and allows our response, and it's variance, to relate to a linear model through the link function:
$$\large{g(\mu)= \alpha + \beta x}$$
Where $\mu$ is the expectation of Y, and $g$ is the link function
The Ordinary Least Squares (OLS) fitting technique used in linear regression, is not a good fit here as our data are not necessarily distributed like that. Here we can use 'maximum-likelihood estimation' (MLE). MLE is related to probability and can be used to identify the parameters that were most likely to have produced our data (the largest likelihood value, or 'maximum likelihood').
MLE is an iterative process, and is actually performed on the log-likelihood. The model is iteratively refitted to maximise the log-likelihood. Our model is said to 'converge' when we can't improve it any more (usually changes in the log-likelihood < 1e-8). __If you get convergence-related error messages, it means the modelling function can't settle on, or reach the MLE. This often means that the scale of you model predictors is mismatched (for example one is binary (0,1), and another is on log scale), or there correlations in your data.__
Although many of the methods for `lm` are common to `glm`, inspecting plots of residuals is trickier. We also can't use R2 either, as the assumptions reflect OLS. Various methods other methods exist, depend on what you want to compare, including:
- AUC: For binary (or other multiple categorical models), the 'area under the receiver operator characteristic curve' ('AUC', 'ROC' or 'C-statistic') can be used and interpreted like R2, as the proportion of variation in $y$ explained by our model.
- Likelihood Ratio test (LRT): Two `glm`s can be compared using likelihood ratio tests, that compare the log-likelihoods between two models. We test a reduced model against the larger model, with the null hypothesis that the two likelihoods are not significantly different. If our test is positive (usually p<0.05, but beware of p-values...), our models are significantly different, with the one with higher log-likelihood the better model.
- Akaike Information Criterion (AIC) [2]: is a measure of relative information loss. The value cannot be directly interpreted, but can be compared between models, with lower AICs suggesting less information loss, and being 'better' models.
- Prediction error: If your model is primarily about prediction, and less about explanation[3], measures of fit such as Mean-Square Error of Mean Absolute error can be helpful measures of predictive accuracy.
## Generalized Linear Models
Let's fit a generalized linear model, firstly for `Death`, and secondly for `LOS`.
### Modelling death:
`Death` is a binary variable recorded in our `LOS_model` dataset. This cannot be linear and is usually modelled using the `binomial` family argument, and the `logit` link function. This link function is the default in `R` for `binomial` regressions, and is commonly referred to as __Logistic Regression__. We'll fit this using the `glm` function. Rather than fitting `Death` as such, the link function makes this the log-odds of death.
```{r glm1, collapse=TRUE}
glm_binomial <- glm(Death ~ Age + LOS, data = LOS_model, family = "binomial")
summary(glm_binomial)
ModelMetrics::auc(glm_binomial)
```
Using the `summary` function, we can see our model output, including:
- The 'call', of model formula
- A description of the distribution of the deviance residuals (don't worry about this for now).
- The 'coefficients', the estimates of the effects of our predictors, including their standard errors, and z-tests for their 'significance.' There is a legend, using `*`, to indicate the significance levels.
- Information on fit statistics
- We also used the `ModelMetrics` package to calculate the `auc`. This is suggests that ~66% of the variation in our data can be explained by our model.
If we consider the logic of the model above, it is telling us that LOS is a significant predictor for the log-odds of death, but that age is not. This seems unlikely, unless Age and LOS are not independent. If this is the case, for example elderly patients stay longer or that shorter staying elderly patients are more likely to have died, the effects of LOS may be obscuring the effects of Age. This is referred to as confounding. We can examine this using an __interaction__ term.
### Interactions
'Interactions' are where predictor variables affect each other. For example, the Hospital Standardised Mortality Ratio (HSMR) [4] uses an interaction term between co-morbidity and age. So co-morbidities score may have different effects related to age, as well as the independent effects of age and co-morbidity score.
We can add these to our models with `*` to include the interaction and the independent terms, or `:` if you just want the interaction without the independent terms:
```{r glm2}
glm_binomial2 <- glm(Death ~ Age + LOS + Age * LOS, data = LOS_model, family = "binomial")
summary(glm_binomial2)
ModelMetrics::auc(glm_binomial2)
```
Now our model is suggesting that Age, LOS and their interaction are significant. This suggests that, as well as their own effects, their combination is important, and their effects confound each other.
## Prediction
Now we have built a model, we can use it to predict our expected $y$, the probability of death per patient in this case. We can supply a new dataset to fit it to, using `newdata`, or omitting this fits our model to the original dataset. Using a `glm`, we have built a model via a link function, so we need to decide what scale to make our predictions on: the `link` scale (the log-odds of death in this case) or the `response` (the probability of death in this case).
```{r glm5}
LOS_model$preds <- predict(glm_binomial2, type = "response")
head(LOS_model, 5)
```
We have added our predictions back into the original `data.frame` using the column name `preds`, and patient ID 1 has a risk of death of 0.136 or 13.6% probability of death. These predictions can be used in many different applications, but common uses of them are identifying the highest risk patients for clinical audit, comparing ratios of the observed deaths and expected deaths (sum of the probability), or risk-adjusted control chart monitoring.
## Length of Stay (LOS)
LOS, the time a patient is in hospital, is another variable to examine in the dataset. This is a count, and is therefore bounded at zero (can't have counts < 0), can only take integer values (for example 2 or 3, but not 2.5) and is heavily right-skewed, as most patient are only admitted for one or two days, but rarely patients are in for much longer. We'll apply a similar modelling approach to the last example, but the `family` argument and link function must be different. We will use the `Poisson` distribution and the natural logarithm link function, the standard choices for count data[5].
```{r glm3}
glm_poisson <- glm(LOS ~ Age * Death, data = LOS_model, family = "poisson")
summary(glm_poisson)
```
We can't use the `AUC` value here, as that is related to categories. We can only make comparisons between models or in terms of prediction error. Lets fit the same model without the interaction and see if it is different using the `AIC` or the likelihood ratio test (that tell you the same thing):
```{r glm4}
glm_poisson2 <- glm(LOS ~ Age + Death, data = LOS_model, family = "poisson")
AIC(glm_poisson)
AIC(glm_poisson2)
# anova(glm_poisson2, glm_poisson, test="Chisq") will do the same thing without using the lmtest package
lmtest::lrtest(glm_poisson, glm_poisson2)
```
Our `AIC` is lower for the interaction model, and the likelihood ratio test suggest the reduced model is significantly worse.
### Overdispersion
One of the downfalls of a Poisson models that it's variance is fixed. This is always proportional to the conditional mean. I will spare you the maths at this point, but what it means is that the variance doesn't scale to the data. This would be fine if our data were perfectly Poisson distributed, but that is rare in practice. Data commonly display more variance than we would expect, and this is referred to as ___overdispersion___ (OD). OD causes us to underestimate the variance in our model, and our assessments of the significance of parameters are too optimistic because our error is too small. It also affects overall assessment of model fit.
We can test for this, by comparing the ratio of the sum of the squared Pearson residuals over the degrees of freedom. If this value is 1, then the variance is as expected ('equidispersion'), but if it is >1, then we have OD.
```{r od}
sum(residuals(glm_poisson, type = "pearson")^2) / df.residual(glm_poisson)
```
__Our model is overdispersed, with residual variance around 1.4 times what the Poisson distribution expects.__
There are various different ways to deal with this, and this vignette demonstrates three of the following list. Our options include:
- __Ignoring it:__ the Poisson is an `unbiased` model, so if we only care about our estimates, or predicted values (and not the error, or 'significance') we might do this.
- __Manually scaling our variance:__ we can do this using our dispersion ratio
- __Quasi-likelihood models:__ these do not use a full MLE process, using estimating equations instead. These models allow for scaling of the error, but with no MLE we can't use AIC or likelihood ratio tests.
- Variance structures (this is a huge topic, but we'll pick one). Regression assumes all data points are independent and uncorrelated. If this is not the case in our data, we need to consider this in our models. We have repeated measurements from 10 Trusts in this dataset. It likely that LOS, and demographics like age, are at least partly related to which Trust the data come from. This is referred to as ___clustering___, and it introduces correlations in the data. There are ways to explicit model this correlation, including:
+ __Mixture models:__ using more than one distribution to model the data. Negative Binomial family uses two distributions, for between and within unit variances. The NB2[5] parametrisation assumes a Poisson distribution within units and a gamma distribution between.
+ __Random-Intercept Model__: GLMs can be further updated to allow for variance structures, using a Generalised Linear Mixed Model (GLMM), or Multi-level model[6]. The variance structures are referred to as 'random-effects' and fix the parameter estimate at 0, but allow the variance to be estimated. Our model coefficients we reported in the GLMs earlier are referred to as 'fixed-effects' to distinguish between the two. In our case, a random-intercept at Trust-level is appropriate.
So lets try some of them out. To fit a quasi-Poisson model, we simply change the family. Note the dispersion parameter matches the calculation above:
```{r quasi}
quasi <- glm(LOS ~ Age * Death, data = LOS_model, family = "quasipoisson")
summary(quasi)
```
For a negative binomial model, two options are `glm.nb` from the `MASS` package, or `glmmTMB` function from the package of the same name. Note that the function from `MASS` does not take a `family` argument. The dispersion factor here is quadratic to the mean, so don't compare it directly to the scale factor above.
```{r nb, warning=FALSE, message=FALSE}
library(MASS)
nb <- glm.nb(LOS ~ Age * Death, data = LOS_model, )
summary(nb)
```
Finally, using the random-intercept model, fitted with `glmer` from the `lme4` package. The extra formula argument specifies the random-intercept, where 1 means the intercept, and the `|` reads as 'intercept varying by organisation.'
```{r glmm, warning=FALSE}
library(lme4)
glmm <- glmer(LOS ~ scale(Age) * Death + (1 | Organisation), data = LOS_model, family = "poisson")
summary(glmm)
```
```{r confint}
confint(glmm)
```
These models are significantly more complicated. `Confint` can be use to calculate the confidence intervals above by profiling the likelihood (provided you have the `MASS` package installed). This allow you to assess the significance of the random-effects. Our CI butts up to zero here, and the AIC is slightly worse than the `glm`. This suggests it is not well-modelled with the normally distributed random-intercept. The NB model appears to perform better.
Predictions from GLMMS also requires more thought. You can predict with ("conditional") or without ("marginal") the random effects. Conditional predictions will have the adjustment for their cluster (or whatever random effect you have fitted), but marginal could be considered the model average, subject to the fixed effects, removing the effects of cluster.
## Summary
The `LOS_model` dataset is a fabricated patient dataset with Age, Length of Stay and Death status for 300 patients across 10 hospitals ('Trusts'). It can be used to learn GLM, GLMM and other related modelling techniques, with examples above.
---
## References
1. NELDER, J. A. & WEDDERBURN, R. W. M. 1972. Generalized Linear Models. _Journal of the Royal Statistical Society. Series A (General)_, __135__, 370-384
1. AKAIKE, H. 1998. Information Theory and an Extension of the Maximum Likelihood Principle. In: PARZEN, E., TANABE, K. & KITAGAWA, G. (eds.) _Selected Papers of Hirotugu Akaike_. New York, NY: Springer New York
1. SHMUELI, G. 2010. To Explain or to Predict? _Statist. Sci._ , __25__, 289-310.
1. JARMAN, B., GAULT, S., ALVES, B., HIDER, A., DOLAN, S., COOK, A., HURWITZ, B. & IEZZONI, L. I. 1999. _Explaining differences in English hospital death rates using routinely collected data_. BMJ, __318__, 1515-1520
1. CAMERON, A. C. & TRIVEDI, P. K. 2013. _Regression Analysis of Count Data_, Cambridge University Press
1. GOLDSTEIN, H. 2010. _Multilevel Statistical Models_, John Wiley & Sons Inc.