Project: Dr. Semmelweis and the Importance of Handwashing

Hungarian physician Dr. Ignaz Semmelweis worked at the Vienna General Hospital with childbed fever patients. Childbed fever is a deadly disease affecting women who have just given birth, and in the early 1840s as many as 10% of the women giving birth at the Vienna General Hospital died from it. Dr. Semmelweis discovered that the cause was the contaminated hands of the doctors delivering the babies, and on June 1st, 1847, he decreed that everyone should wash their hands. This was an unorthodox and controversial request, since nobody in Vienna knew about bacteria at the time.

You will reanalyze the data that led Semmelweis to discover the importance of handwashing, and examine the impact handwashing had on the hospital.

The data is stored as two CSV files within the datasets folder.

yearly_deaths_by_clinic.csv contains the yearly number of births and deaths at the two clinics of the Vienna General Hospital for the years 1841 to 1846.

Column	Description
`year`	Years (1841-1846)
`births`	Number of births
`deaths`	Number of deaths
`clinic`	Clinic 1 or clinic 2

monthly_deaths.csv contains data from ‘Clinic 1’ of the hospital where most deaths occurred.

Column	Description
`date`	Date (YYYY-MM-DD)
`births`	Number of births
`deaths`	Number of deaths

# Imported libraries
library(tidyverse)

Warning: package 'ggplot2' was built under R version 4.2.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.2     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# Start coding here..

Load the CSV files

Load the CSV files into yearly and monthly data frames and check the data.

Rows: 12 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): clinic
dbl (3): year, births, deaths

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 12 × 4
    year births deaths clinic  
   <dbl>  <dbl>  <dbl> <chr>   
 1  1841   3036    237 clinic 1
 2  1842   3287    518 clinic 1
 3  1843   3060    274 clinic 1
 4  1844   3157    260 clinic 1
 5  1845   3492    241 clinic 1
 6  1846   4010    459 clinic 1
 7  1841   2442     86 clinic 2
 8  1842   2659    202 clinic 2
 9  1843   2739    164 clinic 2
10  1844   2956     68 clinic 2
11  1845   3241     66 clinic 2
12  1846   3754    105 clinic 2

Rows: 98 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (2): births, deaths
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# A tibble: 98 × 3
   date       births deaths
   <date>      <dbl>  <dbl>
 1 1841-01-01    254     37
 2 1841-02-01    239     18
 3 1841-03-01    277     12
 4 1841-04-01    255      4
 5 1841-05-01    255      2
 6 1841-06-01    200     10
 7 1841-07-01    190     16
 8 1841-08-01    222      3
 9 1841-09-01    213      4
10 1841-10-01    236     26
# ℹ 88 more rows

Answer

# Load and inspect the data
# Base R alternative: yearly <- read.csv('datasets/yearly_deaths_by_clinic.csv')
yearly <- read_csv('datasets/yearly_deaths_by_clinic.csv') # read_csv() comes from readr, attached with the tidyverse

Rows: 12 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): clinic
dbl (3): year, births, deaths

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

yearly

# A tibble: 12 × 4
    year births deaths clinic  
   <dbl>  <dbl>  <dbl> <chr>   
 1  1841   3036    237 clinic 1
 2  1842   3287    518 clinic 1
 3  1843   3060    274 clinic 1
 4  1844   3157    260 clinic 1
 5  1845   3492    241 clinic 1
 6  1846   4010    459 clinic 1
 7  1841   2442     86 clinic 2
 8  1842   2659    202 clinic 2
 9  1843   2739    164 clinic 2
10  1844   2956     68 clinic 2
11  1845   3241     66 clinic 2
12  1846   3754    105 clinic 2

monthly <- read_csv("datasets/monthly_deaths.csv")

Rows: 98 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl  (2): births, deaths
date (1): date

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

monthly

# A tibble: 98 × 3
   date       births deaths
   <date>      <dbl>  <dbl>
 1 1841-01-01    254     37
 2 1841-02-01    239     18
 3 1841-03-01    277     12
 4 1841-04-01    255      4
 5 1841-05-01    255      2
 6 1841-06-01    200     10
 7 1841-07-01    190     16
 8 1841-08-01    222      3
 9 1841-09-01    213      4
10 1841-10-01    236     26
# ℹ 88 more rows
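
The readr message above suggests specifying the column types. The following is a minimal sketch of what that could look like, not part of the original solution: it parses a few rows copied from the printed `yearly` tibble as literal text via `I()`, so it runs without the datasets folder. The names `yearly_csv` and `yearly_sample` are made up for this illustration, and reading the counts as integers is an assumption about what is most convenient here.

```r
library(readr)

# Literal CSV text copied from four rows of the printed `yearly` tibble.
# I() tells read_csv() to treat the string as data rather than a file path.
yearly_csv <- I("year,births,deaths,clinic
1841,3036,237,clinic 1
1842,3287,518,clinic 1
1841,2442,86,clinic 2
1842,2659,202,clinic 2")

# Declaring col_types up front silences the column specification message
# and makes the parse fail loudly if a column is not what we expect.
yearly_sample <- read_csv(
  yearly_csv,
  col_types = cols(
    year   = col_integer(),
    births = col_integer(),
    deaths = col_integer(),
    clinic = col_character()
  )
)

yearly_sample
```

With the real files, the same `col_types` argument could be passed alongside the file path instead of the literal text.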

Add a `proportion_deaths` column

Add a proportion_deaths column to each data frame, calculating the proportion of deaths per number of births for each year in yearly and each month in monthly: proportion_deaths = deaths / births.

# A tibble: 12 × 5
    year births deaths clinic   proportion_deaths
   <dbl>  <dbl>  <dbl> <chr>                <dbl>
 1  1841   3036    237 clinic 1            0.0781
 2  1842   3287    518 clinic 1            0.158 
 3  1843   3060    274 clinic 1            0.0895
 4  1844   3157    260 clinic 1            0.0824
 5  1845   3492    241 clinic 1            0.0690
 6  1846   4010    459 clinic 1            0.114 
 7  1841   2442     86 clinic 2            0.0352
 8  1842   2659    202 clinic 2            0.0760
 9  1843   2739    164 clinic 2            0.0599
10  1844   2956     68 clinic 2            0.0230
11  1845   3241     66 clinic 2            0.0204
12  1846   3754    105 clinic 2            0.0280

# A tibble: 98 × 4
   date       births deaths proportion_deaths
   <date>      <dbl>  <dbl>             <dbl>
 1 1841-01-01    254     37           0.146  
 2 1841-02-01    239     18           0.0753 
 3 1841-03-01    277     12           0.0433 
 4 1841-04-01    255      4           0.0157 
 5 1841-05-01    255      2           0.00784
 6 1841-06-01    200     10           0.05   
 7 1841-07-01    190     16           0.0842 
 8 1841-08-01    222      3           0.0135 
 9 1841-09-01    213      4           0.0188 
10 1841-10-01    236     26           0.110  
# ℹ 88 more rows

Answer

# Add proportion_deaths to both data frames
yearly <- yearly %>% 
  mutate(proportion_deaths = deaths / births)

monthly <- monthly %>% 
  mutate(proportion_deaths = deaths / births)
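
As a complement to the per-year proportions, the sketch below pools all six years per clinic to give one overall figure for each. It rebuilds the yearly table inline from the values printed above so that it runs on its own; `yearly_demo`, `clinic_totals`, `total_births`, and `total_deaths` are names introduced only for this illustration.

```r
library(dplyr)
library(tibble)

# The 12 rows of the yearly table, typed in from the printed tibble.
yearly_demo <- tibble(
  year   = rep(1841:1846, times = 2),
  births = c(3036, 3287, 3060, 3157, 3492, 4010,
             2442, 2659, 2739, 2956, 3241, 3754),
  deaths = c(237, 518, 274, 260, 241, 459,
             86, 202, 164, 68, 66, 105),
  clinic = rep(c("clinic 1", "clinic 2"), each = 6)
)

# Sum births and deaths per clinic, then apply the same
# deaths / births formula to the totals.
clinic_totals <- yearly_demo %>%
  group_by(clinic) %>%
  summarize(
    total_births = sum(births),
    total_deaths = sum(deaths),
    proportion_deaths = total_deaths / total_births
  )

clinic_totals
```

Over 1841 to 1846 this gives roughly 9.9% for clinic 1 (1989 deaths in 20042 births) against roughly 3.9% for clinic 2 (691 deaths in 17791 births).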

Create two ggplot line plots

Create two ggplot line plots. The first shows the yearly proportion of deaths, with a different colored line for each clinic.

The second shows the monthly proportion of deaths.

Answer

# Plot the data
ggplot(yearly, aes(x = year, y = proportion_deaths, color = clinic)) +
  geom_line()

ggplot(monthly, aes(date, proportion_deaths)) +
  geom_line() +
  labs(x = "Year", y = "Proportion Deaths")

Add a `handwashing_started` boolean column and plot again

Add a handwashing_started boolean column to monthly using June 1st, 1847 as the threshold; TRUE should mean that handwashing has started at the clinic.

# Add the threshold and flag and plot again
handwashing_start <- as.Date('1847-06-01')

monthly <- monthly %>%
  mutate(handwashing_started = date >= handwashing_start)

head(monthly)

# A tibble: 6 × 5
  date       births deaths proportion_deaths handwashing_started
  <date>      <dbl>  <dbl>             <dbl> <lgl>              
1 1841-01-01    254     37           0.146   FALSE              
2 1841-02-01    239     18           0.0753  FALSE              
3 1841-03-01    277     12           0.0433  FALSE              
4 1841-04-01    255      4           0.0157  FALSE              
5 1841-05-01    255      2           0.00784 FALSE              
6 1841-06-01    200     10           0.05    FALSE

tail(monthly)

# A tibble: 6 × 5
  date       births deaths proportion_deaths handwashing_started
  <date>      <dbl>  <dbl>             <dbl> <lgl>              
1 1848-10-01    299      7            0.0234 TRUE               
2 1848-11-01    310      9            0.0290 TRUE               
3 1848-12-01    373      5            0.0134 TRUE               
4 1849-01-01    403      9            0.0223 TRUE               
5 1849-02-01    389     12            0.0308 TRUE               
6 1849-03-01    406     20            0.0493 TRUE

Plot the updated monthly data frame with different colored lines depending on handwashing_started.

Answer

ggplot(monthly, aes(x = date, y = proportion_deaths, color = handwashing_started)) +
  geom_line()
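
One optional refinement, sketched here as a suggestion rather than as part of the original answer, is to mark the threshold date itself with a dashed vertical line. To stay self-contained the example uses only the 12 rows visible in the `head()` and `tail()` output above, so the resulting figure is not the full 98-month series; `monthly_demo` and `p` are names made up for this illustration.

```r
library(ggplot2)
library(tibble)

handwashing_start <- as.Date("1847-06-01")

# Only the rows shown by head(monthly) and tail(monthly), not the full data.
monthly_demo <- tibble(
  date = as.Date(c(
    "1841-01-01", "1841-02-01", "1841-03-01",
    "1841-04-01", "1841-05-01", "1841-06-01",
    "1848-10-01", "1848-11-01", "1848-12-01",
    "1849-01-01", "1849-02-01", "1849-03-01"
  )),
  births = c(254, 239, 277, 255, 255, 200, 299, 310, 373, 403, 389, 406),
  deaths = c(37, 18, 12, 4, 2, 10, 7, 9, 5, 9, 12, 20)
)
monthly_demo$proportion_deaths <- monthly_demo$deaths / monthly_demo$births
monthly_demo$handwashing_started <- monthly_demo$date >= handwashing_start

# Same plot as in the answer, plus a dashed line at the threshold date.
# as.numeric() converts the Date to its position on the date axis.
p <- ggplot(monthly_demo,
            aes(x = date, y = proportion_deaths, color = handwashing_started)) +
  geom_line() +
  geom_vline(xintercept = as.numeric(handwashing_start), linetype = "dashed") +
  labs(x = "Date", y = "Proportion of deaths", color = "Handwashing started")

p
```

Swapping `monthly_demo` for the full `monthly` data frame should give the same annotation on the complete series.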

Calculate the mean proportion of deaths

Calculate the mean proportion of deaths before and after handwashing started from the monthly data, and store the result as a 2x2 data frame named monthly_summary. The first column should contain the handwashing_started groups and the second column the mean proportion of deaths.

Answer

# Find the mean
monthly_summary <- monthly %>% 
  group_by(handwashing_started) %>%
  summarize(mean_proportion_deaths = mean(proportion_deaths))

monthly_summary

# A tibble: 2 × 2
  handwashing_started mean_proportion_deaths
  <lgl>                                <dbl>
1 FALSE                               0.105 
2 TRUE                                0.0211
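
Note that `mean(proportion_deaths)` averages the monthly proportions, giving every month equal weight regardless of how many births it had. A pooled figure, total deaths divided by total births, weights months by their size instead. The sketch below computes both side by side as an illustration of the difference, not as a correction to the answer. It uses only the 12 rows visible in the `head()` and `tail()` output, so its numbers will not match `monthly_summary`; `monthly_demo`, `rate_comparison`, and the two summary column names are invented for this example.

```r
library(dplyr)
library(tibble)

# Only the rows shown by head(monthly) and tail(monthly), not the full data.
monthly_demo <- tibble(
  date = as.Date(c(
    "1841-01-01", "1841-02-01", "1841-03-01",
    "1841-04-01", "1841-05-01", "1841-06-01",
    "1848-10-01", "1848-11-01", "1848-12-01",
    "1849-01-01", "1849-02-01", "1849-03-01"
  )),
  births = c(254, 239, 277, 255, 255, 200, 299, 310, 373, 403, 389, 406),
  deaths = c(37, 18, 12, 4, 2, 10, 7, 9, 5, 9, 12, 20)
) %>%
  mutate(
    proportion_deaths = deaths / births,
    handwashing_started = date >= as.Date("1847-06-01")
  )

rate_comparison <- monthly_demo %>%
  group_by(handwashing_started) %>%
  summarize(
    # Unweighted: every month counts equally.
    mean_of_monthly_proportions = mean(proportion_deaths),
    # Pooled: months with more births carry more weight.
    pooled_proportion = sum(deaths) / sum(births)
  )

rate_comparison
```

The two columns will generally differ whenever the number of births varies from month to month, which is why it helps to state which of the two a reported "mean proportion" refers to.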
