Title: | NHS and Healthcare-Related Data for Education and Training |
Description: | Free United Kingdom National Health Service (NHS) and other healthcare, or population health-related data for education and training purposes. This package contains synthetic data based on real healthcare datasets, or cuts of open-licenced official data. This package exists to support skills development in the NHS-R community: <https://nhsrcommunity.com/>. |
Authors: | Gary Hutson [aut] |
Maintainer: | Zoë Turner |
License: | CC0 |
Version: | 0.3.3 |
Built: | 2025-02-03 06:12:20 UTC |
Source: | https://github.com/nhs-r-community/NHSRdatasets |
Reported attendances, 4 hour breaches and admissions for all A&E departments in England for the years 2016/17 through 2018/19 (Apr-Mar). The data has been tidied to be easily usable within the tidyverse of packages.
Tibble with six columns
The month that this data relates to
The ODS code for this provider
The department type: either 1, 2 or other
The number of patients who attended this department in this month
The number of patients who breached the 4 hour target in this month
The number of patients admitted from A&E to the hospital in this month
Data sourced from NHS England Statistical Work Areas which is available under the Open Government Licence v3.0
NHS England Statistical Work Areas
data(ae_attendances)
library(dplyr)
library(ggplot2)
library(scales)

# Create a plot of the performance for England over time
ae_attendances %>%
  group_by(period) %>%
  summarise_at(vars(attendances, breaches), sum) %>%
  mutate(performance = 1 - breaches / attendances) %>%
  ggplot(aes(period, performance)) +
  geom_hline(yintercept = 0.95, linetype = "dashed") +
  geom_line() +
  geom_point() +
  scale_y_continuous(labels = percent) +
  labs(title = "4 Hour performance over time")

# Now produce a plot showing the performance of each trust
ae_attendances %>%
  group_by(org_code) %>%
  # select organisations that have a type 1 department
  filter(any(type == "1")) %>%
  summarise_at(vars(attendances, breaches), sum) %>%
  arrange(desc(attendances)) %>%
  mutate(
    performance = 1 - breaches / attendances,
    overall_performance = 1 - sum(breaches) / sum(attendances),
    rank = rank(-performance, ties.method = "first") / n()
  ) %>%
  ggplot(aes(rank, performance)) +
  geom_vline(xintercept = c(0.25, 0.5, 0.75), linetype = "dotted") +
  geom_hline(yintercept = 0.95, colour = "red") +
  geom_hline(aes(yintercept = overall_performance), linetype = "dotted") +
  geom_point() +
  scale_y_continuous(labels = percent) +
  theme_minimal() +
  theme(
    panel.grid = element_blank(),
    axis.text.x = element_blank()
  ) +
  labs(
    title = "4 Hour performance by trust",
    subtitle = "Apr-16 through Mar-19",
    x = "",
    y = ""
  )
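The performance measure used in the plots above is one minus the share of attendances that breached the 4 hour target. The following is a minimal base R sketch of that calculation; `toy_ae` and its values are made up for illustration and are not taken from the package data.

```r
# Hypothetical toy data mimicking three of the dataset's columns
toy_ae <- data.frame(
  period      = as.Date(c("2016-04-01", "2016-04-01", "2016-05-01", "2016-05-01")),
  attendances = c(1000, 500, 1200, 300),
  breaches    = c(100, 20, 180, 15)
)

# Sum attendances and breaches per month, then derive performance
monthly <- aggregate(cbind(attendances, breaches) ~ period, data = toy_ae, FUN = sum)
monthly$performance <- 1 - monthly$breaches / monthly$attendances

# Flag months at or above the 0.95 reference line drawn in the plots
monthly$meets_target <- monthly$performance >= 0.95
print(monthly)
```

Summing before dividing matters here: averaging per-department percentages would weight small departments the same as large ones.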
Full raw data from the AphA CPD Survey
This tidied raw data is available here as a tibble with 38 columns (blank or superfluous columns from the raw data were removed) and 237 rows (1 per respondent ID).
Variables have been named using a "controlled language" approach informed by Emily Riederer's "Column Names as Contracts" https://emilyriederer.netlify.app/post/column-name-contracts/.
Columns ending in "_id" are numeric and represent a unique ID for that response.
Columns ending in "_dttm" are in datetime format.
Columns ending in "_cat" contain categorical data, though in some cases this is mixed with free text responses and may require tidying if you need it to be strictly categorical/factor data.
Columns ending in "_n" are theoretically counts, but in this tibble they may be mixed with non-numeric values and so the columns are in character format.
Columns ending in "_ind" are theoretically indicator values with 2 main value options (Yes/No). These are in character format, but should be convertible to 1/0 or TRUE/FALSE values, if desired, with minimal wrangling.
Columns ending in "_txt" contain free text responses and are in character format.
Multi-part questions have column name stubs with sequential letters. For example, "q20a_", "q20b_" and so on.
For formatting consistency, questions with a single part still have a column name stub with the letter a, for example "q01a_".
Original survey questions (lightly edited) are provided as variable labels using the {labelled} package.
These labels provide more descriptive context for the "clean" column names.
Variable labels can be viewed using labelled::get_variable_labels().
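The suffix conventions described above make it possible to convert columns by name pattern. The sketch below uses base R only; `toy_survey`, its column names and its values are hypothetical stand-ins that follow the naming scheme, not real survey responses.

```r
# Hypothetical toy survey rows; column names follow the suffix conventions
toy_survey <- data.frame(
  respondent_id = 1:3,
  q01a_ind      = c("Yes", "No", "Yes"),
  q02a_n        = c("3", "about 5", "10"),
  stringsAsFactors = FALSE
)

# "_ind" columns: map Yes/No to TRUE/FALSE (anything else becomes NA)
ind_cols <- grep("_ind$", names(toy_survey), value = TRUE)
toy_survey[ind_cols] <- lapply(toy_survey[ind_cols], function(x) {
  ifelse(x == "Yes", TRUE, ifelse(x == "No", FALSE, NA))
})

# "_n" columns: coerce to numeric; non-numeric entries become NA
n_cols <- grep("_n$", names(toy_survey), value = TRUE)
toy_survey[n_cols] <- lapply(toy_survey[n_cols], function(x) {
  suppressWarnings(as.numeric(x))
})

str(toy_survey)
```

Whether turning mixed entries such as "about 5" into NA is acceptable depends on the analysis; inspecting the non-numeric values first is usually worthwhile.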
Survey press release web page: https://www.aphanalysts.org/ltnws/nhs-at-risk-of-losing-a-generation-of-data-analysts/
The survey of NHS and other healthcare data analysts was conducted in July 2022. The results data is made available in this package with the permission of AphA.
Reported COVID-19 infections and deaths, collected and collated by the European Centre for Disease Prevention and Control (ECDC), provided by day and country. Data were collated and published up to 14th December 2020, and have been tidied so they are easily usable within the 'tidyverse' of packages.
Tibble with seven columns
The date cases were reported
A 'factor' for the geographical continent in which the reporting country is located.
A 'factor' for the country or territory reporting the data.
A 'factor' for the three-letter country or territory code.
The reported population of the country for 2019, taken from Eurostat for Europe and the World Bank for the rest of the world.
The reported number of positive cases.
The reported number of deaths.
Data sourced from European Centre for Disease Prevention and Control which is available under the open licence, compatible with the CC BY 4.0 license, further details available at ECDC.
European Centre for Disease Prevention and Control
data(covid19)
library(dplyr)
library(ggplot2)
library(scales)

# Create a plot of reported cases over time for selected countries
covid19 |>
  # use %in% (not ==) so every row is tested against the whole set of countries
  filter(countries_and_territories %in%
    c("United_Kingdom", "Italy", "France", "Germany", "Spain")) |>
  ggplot(aes(
    x = date_reported,
    y = cases,
    col = countries_and_territories
  )) +
  geom_line() +
  scale_color_discrete("Country") +
  scale_y_continuous(labels = comma) +
  labs(
    y = "Cases",
    x = "Date",
    title = "Covid-19 cases for selected countries",
    alt = "A plot of covid-19 cases in France, Germany, Italy, Spain & the UK"
  ) +
  theme_minimal()
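Because the dataset carries a population figure alongside the case counts, raw counts can be turned into rates that are comparable between countries. The following is a small base R sketch of that idea; `toy_covid`, the column names `country` and `population`, and all values are invented for illustration.

```r
# Hypothetical toy data: two countries, two reporting days each
toy_covid <- data.frame(
  country    = c("A", "A", "B", "B"),
  cases      = c(500, 700, 300, 100),
  population = c(1e6, 1e6, 2e5, 2e5)
)

# Total cases per country, keeping the (constant) population alongside
totals <- aggregate(cases ~ country + population, data = toy_covid, FUN = sum)

# Cumulative cases per 100,000 population
totals$cases_per_100k <- totals$cases / totals$population * 1e5
print(totals)
```

In this toy example country B has fewer cases in total but a higher rate, which is the kind of difference a raw-count plot can hide.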
Artificially generated hospital data. Fictional patients at 10 fictional hospitals, with LOS, Age and Death status data. Data were generated to learn Generalized Linear Models (GLM) concepts, modelling either Death or LOS.
Data frame with five columns
A fictional patient ID number
A factor representing one of ten fictional hospital trusts, for example Trust1
Age in years of each fictional patient
In-hospital length of stay in days. The difference between admission and discharge date in days
Binary for death status: 0 = survived, 1 = died in hospital
Generated by Chris Mainey, Feb-2019
data(LOS_model)

model1 <- glm(Death ~ Age + LOS, data = LOS_model, family = "binomial")
summary(model1)

# Now with an Age, LOS, and Age*LOS interaction.
model2 <- glm(Death ~ Age * LOS, data = LOS_model, family = "binomial")
summary(model2)
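Coefficients from a binomial GLM are on the log-odds scale, so a common next step is to exponentiate them into odds ratios and to predict a probability for a given patient. The sketch below shows this on simulated data so that it runs without the package; `toy_los`, the simulation parameters and `new_patient` are assumptions made for illustration only.

```r
set.seed(42)
n <- 500

# Simulated stand-in data with the same variable names as the model above
toy_los <- data.frame(
  Age = round(runif(n, 18, 95)),
  LOS = rpois(n, 4)
)

# Simulate deaths with log-odds increasing with age and length of stay
log_odds <- -6 + 0.05 * toy_los$Age + 0.1 * toy_los$LOS
toy_los$Death <- rbinom(n, 1, plogis(log_odds))

fit <- glm(Death ~ Age + LOS, data = toy_los, family = "binomial")

# Exponentiate coefficients to read them as odds ratios
odds_ratios <- exp(coef(fit))
print(round(odds_ratios, 3))

# Predicted probability of death for a hypothetical 80-year-old with a 10-day stay
new_patient <- data.frame(Age = 80, LOS = 10)
p_death <- predict(fit, newdata = new_patient, type = "response")
print(p_death)
```

An odds ratio above 1 for `Age` would indicate that the odds of death rise with each additional year of age, holding `LOS` fixed.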
Provisional counts of the number of deaths registered in England and Wales, by age, sex and region, from week commencing 8th January 2010 to 3rd April 2020.
Data frame with five columns
character, containing the names of the groups for counts, for example "Total deaths", "all ages".
character, subcategory of names of groups where necessary, for example details of region: "East", details of age bands "15-44".
numeric, numbers of deaths in whole numbers and average numbers with decimal points. To retain the integrity of the format this column data is left as character.
date, format is yyyy-mm-dd; all dates are a Friday.
integer, each week in a year is numbered sequentially.
Source and licence acknowledgement
This data has been made available through the Office for National Statistics under the Open Government Licence http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/
Collected by Zoë Turner, Apr-2020 from https://www.ons.gov.uk/peoplepopulationandcommunity/birthsdeathsandmarriages/deaths/datasets/weeklyprovisionalfiguresondeathsregisteredinenglandandwales
data(ons_mortality)
library(dplyr)
library(tidyr)

# create a dataset that is "wide" with each date as a column
ons_mortality |>
  select(-week_no) |>
  pivot_wider(
    names_from = date,
    values_from = counts
  )
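Since the counts column is described above as being left in character format, it needs converting before any arithmetic. The following base R sketch shows one way to do that on toy data; `toy_mortality`, the `region` column name and the values are hypothetical.

```r
# Hypothetical toy rows with counts stored as character
toy_mortality <- data.frame(
  region = c("East", "East", "London"),
  counts = c("1234", "1100.5", "987"),
  date   = as.Date(c("2020-03-27", "2020-04-03", "2020-04-03")),
  stringsAsFactors = FALSE
)

# Convert to numeric in a new column so the original text is kept for checking
toy_mortality$counts_num <- as.numeric(toy_mortality$counts)

# Sum the numeric counts per region
by_region <- aggregate(counts_num ~ region, data = toy_mortality, FUN = sum)
print(by_region)
```

Keeping the original character column makes it easy to spot any values that failed to convert and became NA.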
ONS Population Estimates for Mid-year 2023. National and subnational mid-year population estimates for the UK and its constituent countries by administrative area, age and sex (including components of population change, median age and population density).
Tibble with six columns
male or female
country/geography code
country of the UK
year of age
the number of people in this group
ONS Estimates of the population for the UK, England, Wales, Scotland, and Northern Ireland
data(ons_uk_population_2023)
library(dplyr)
library(tidyr)

# create a dataset that has total population by age groups for England
ons_uk_population_2023 |>
  filter(Name == "ENGLAND") |>
  mutate(age_group = case_when(
    as.numeric(age) <= 17 ~ "0-17",
    as.numeric(age) >= 18 & as.numeric(age) <= 64 ~ "18-64",
    as.numeric(age) >= 65 ~ "65+",
    age == "90+" ~ "65+"
  )) |>
  group_by(age_group) |>
  summarise(count = sum(count))
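The example above relies on `as.numeric()` returning NA (with a warning) for the open-ended "90+" value. An alternative, sketched here in base R on made-up data, strips the "+" first and then bands ages with `cut()`; `toy_pop` and its counts are hypothetical.

```r
# Hypothetical toy data with age stored as character, including "90+"
toy_pop <- data.frame(
  age   = c("0", "17", "18", "64", "65", "90+"),
  count = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Strip the trailing "+" so "90+" parses as 90 without coercion warnings
age_num <- as.numeric(sub("\\+$", "", toy_pop$age))

# Band ages: up to 17, 18 to 64, 65 and over (intervals are closed on the right)
toy_pop$age_group <- cut(
  age_num,
  breaks = c(-Inf, 17, 64, Inf),
  labels = c("0-17", "18-64", "65+")
)

band_totals <- aggregate(count ~ age_group, data = toy_pop, FUN = sum)
print(band_totals)
```

Using `cut()` keeps the band boundaries in one place, which makes them easier to change than a chain of comparisons.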
This dataset is intended for building a supervised machine learning classification model. The binary outcome is stranded vs not stranded patients.
Tibble with nine columns (1 x outcome and 8 predictors)
Outcome variable - whether the patient is stranded or not
Patient age on admission
Whether they have been referred from a care home
Medically safe for discharge - means the patient is assessed as safe, but has not been discharged yet
Indicates whether they have been triaged from a Health Care for Older People specialty
Flag to indicate whether they need mental health support and care
Count of the number of previous spells of care
Date they were admitted to hospital
An initial index assessment to say if the patient is frail or not. This is needed for alignment of service provision.
Synthetically generated by Gary Hutson, Mar-2021.
library(dplyr)
data(stranded_data)

stranded_data |>
  glimpse()
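As a rough illustration of the supervised classification use case, the sketch below fits a logistic regression on a training split and checks it on held-out rows. It uses simulated data so that it runs on its own; `toy_stranded`, its column names and the simulation parameters are assumptions, and logistic regression is just one possible choice of classifier.

```r
set.seed(1)
n <- 400

# Simulated stand-in data: one binary outcome and two predictors
toy_stranded <- data.frame(
  age             = round(runif(n, 20, 95)),
  previous_spells = rpois(n, 2)
)
toy_stranded$stranded <- rbinom(
  n, 1,
  plogis(-5 + 0.05 * toy_stranded$age + 0.4 * toy_stranded$previous_spells)
)

# Hold out 25% of rows for evaluation
train_idx <- sample(seq_len(n), size = 0.75 * n)
train   <- toy_stranded[train_idx, ]
holdout <- toy_stranded[-train_idx, ]

fit <- glm(stranded ~ age + previous_spells, data = train, family = binomial)

# Classify held-out patients using a 0.5 probability threshold
pred_prob  <- predict(fit, newdata = holdout, type = "response")
pred_class <- as.integer(pred_prob >= 0.5)

confusion <- table(predicted = pred_class, actual = holdout$stranded)
accuracy  <- mean(pred_class == holdout$stranded)
print(confusion)
print(accuracy)
```

The 0.5 threshold is arbitrary; with an imbalanced outcome a different threshold or a metric other than accuracy may be more informative.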
Synthetic NEWS data illustrating the output of the NHSR_synpop package. These datasets have been synthetically generated by that package for use in the NHSRdatasets package.
Tibble with twelve columns
character string containing gender code
age of patient
National Early Warning Score (NEWS)
Systolic BP - Systolic BP result
Diastolic Blood Pressure - result on NEWS scale
Temperature of patient
Pulse of the patient
Level of response from the patient
SATS (Oxygen Saturation Levels) of the patient
Suppressed Oxygen score
Level of alertness of patient
Indicator to monitor patient death
Generated by Dr. Muhammed Faisal and created by Gary Hutson, Mar-2021
library(dplyr)
data("synthetic_news_data")

synthetic_news_data |>
  glimpse()