Part 1

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

# install.packages("naniar")
library(naniar)

Read in the CalEnviroScreen data using read_csv and the URL https://daseh.org/data/CalEnviroScreen_data.csv

Assign this dataset to an object called “ces”

ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")

## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.1

Use the is.na() and any() functions to check if the Lead variable in ces has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.

Lead: an estimate of the risk for lead exposure in children living in low-income communities with older housing. A higher number indicates a greater risk.

# General format
TIBBLE %>%
  pull(COLUMN) %>%
  is.na() %>%
  any()

ces %>%
  pull(Lead) %>%
  is.na() %>%
  any()

## [1] TRUE
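To see what each step of the pipe contributes, here is a minimal sketch on a small made-up tibble. The object `toy` and its values are hypothetical and are not taken from ces.

```r
library(dplyr)

# Hypothetical toy data standing in for ces; the values are made up
toy <- tibble(Lead = c(12.5, NA, 40.1, NA, 7.3))

na_flags <- toy %>%
  pull(Lead) %>% # extract the column as a plain numeric vector
  is.na()        # one TRUE/FALSE per element

na_flags      # logical vector marking the missing entries
any(na_flags) # TRUE if at least one value is missing
sum(na_flags) # number of missing values (TRUE counts as 1)
```

any() only answers "is there at least one?", while sum() on the same logical vector tells you how many.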

1.2

Clean rows of ces, so that only rows remain that do NOT have missing values for the Education variable, using drop_na. Assign this to the object have_ed_data.

Education: the percentage of the population over 25 with less than a high school education.

have_ed_data <- ces %>% drop_na(Education)
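It can help to confirm that drop_na() only looked at the column you named. The sketch below uses a hypothetical tibble `toy` (made-up values, not ces) in which both columns contain an NA.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data; both columns contain one NA
toy <- tibble(
  Education = c(10.2, NA, 35.0, 8.8),
  Lead      = c(NA, 50.1, 20.3, 61.0)
)

# Drop rows only where Education is missing; the NA in Lead is kept
toy_ed <- toy %>% drop_na(Education)

nrow(toy)    # rows before
nrow(toy_ed) # rows after

# Check that no missing Education values remain
toy_ed %>%
  pull(Education) %>%
  is.na() %>%
  any()
```

Calling drop_na() with no column names would instead drop every row that has an NA in any column, which is usually more than you want.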

1.3

Use naniar to make a visual of the amount of data missing for each variable of ces (use gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

gg_miss_var(ces)

Practice on Your Own!

P.1

What percentage of the LinguisticIsol variable is complete in ces? Hint: use another naniar function.

LinguisticIsol: the percentage of limited English speaking households within each census tract.

pull(ces, LinguisticIsol) %>% pct_complete() # this

## [1] 96.01742

miss_var_summary(ces) # or this

## # A tibble: 67 × 3
##    variable           n_miss pct_miss
##    <chr>               <int>    <num>
##  1 Unemployment          335     4.17
##  2 UnemploymentPctl      335     4.17
##  3 LinguisticIsol        320     3.98
##  4 LinguisticIsolPctl    320     3.98
##  5 LowBirthWeight        227     2.83
##  6 LowBirthWeightPctl    227     2.83
##  7 HousingBurden         145     1.80
##  8 HousingBurdenPctl     145     1.80
##  9 CES4.0Score           103     1.28
## 10 CES4.0Percentile      103     1.28
## # ℹ 57 more rows
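The two answers above are consistent: the percent complete (96.01742) and the pct_miss value for LinguisticIsol (3.98) add up to 100. As a rough sketch of the arithmetic behind these numbers, here is the same calculation in base R on a made-up vector `x`. This is a hand calculation under the assumption that "percent complete" means the share of non-missing values; it is not naniar's own code.

```r
# Hypothetical vector with 1 missing value out of 4
x <- c(3.1, NA, 0.5, 12.0)

# mean() of a logical vector is the proportion of TRUE values
pct_complete_manual <- 100 * mean(!is.na(x)) # share of non-missing values
pct_missing_manual  <- 100 * mean(is.na(x))  # share of missing values

pct_complete_manual
pct_missing_manual
pct_complete_manual + pct_missing_manual # the two always sum to 100
```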

Part 2

New Data set

Let’s imagine we work in a clinic and we are trying to understand more about blood types of patients.

Run the following code to create a dataset that we might collect.

BloodType <- tibble(
  exposure =
    c(
      "Y", "No", "Yes", "y", "no",
      "n", "No", "N", "yes", "Yes",
      "No", "N", NA, "N", "Other"
    ),
  type = c(
    "A.-", "AB.+", "O.-", "O.+", "AB.-",
    "B.+", "B.-", "o.-", "O.+", "A.-",
    "A.+", "O.-", "B.-", "o.+", "AB.-"
  ),
  infection = c(
    "Yes", "No", "Yes", "No", "No",
    "No", "Yes", "No", "Yes", "No",
    "No", "Yes", "Yes", "Yes", "NotSure"
  )
)

BloodType

## # A tibble: 15 × 3
##    exposure type  infection
##    <chr>    <chr> <chr>    
##  1 Y        A.-   Yes      
##  2 No       AB.+  No       
##  3 Yes      O.-   Yes      
##  4 y        O.+   No       
##  5 no       AB.-  No       
##  6 n        B.+   No       
##  7 No       B.-   Yes      
##  8 N        o.-   No       
##  9 yes      O.+   Yes      
## 10 Yes      A.-   No       
## 11 No       A.+   No       
## 12 N        O.-   Yes      
## 13 <NA>     B.-   Yes      
## 14 N        o.+   Yes      
## 15 Other    AB.-  NotSure

There are some issues with this data that we need to figure out!

2.1

Determine how many NA values there are for exposure (assume you know that N and n stand for no).

count(BloodType, exposure) # the simple way

## # A tibble: 10 × 2
##    exposure     n
##    <chr>    <int>
##  1 N            3
##  2 No           3
##  3 Other        1
##  4 Y            1
##  5 Yes          2
##  6 n            1
##  7 no           1
##  8 y            1
##  9 yes          1
## 10 <NA>         1

sum(is.na(pull(BloodType, exposure))) # another way

## [1] 1

BloodType %>% # another way
  pull(exposure) %>%
  is.na() %>%
  sum()

## [1] 1

2.2

Recode the exposure variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  mutate(NEW_COLUMN = case_when(
    OLD_COLUMN %in% c( ... ) ~ ... ,
    OLD_COLUMN %in% c( ... ) ~ ... ,
    TRUE ~ OLD_COLUMN
  ))

BloodType <- BloodType %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    TRUE ~ exposure # keeps unmatched values ("Other" and NA) as they are; without this line "Other" would become NA
  ))

count(BloodType, exposure)

## # A tibble: 4 × 2
##   exposure     n
##   <chr>    <int>
## 1 No           8
## 2 Other        1
## 3 Yes          5
## 4 <NA>         1
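As an alternative sketch (not the approach the lab asks for), you could lower-case the values before matching so that each condition needs fewer spellings. This assumes the stringr function str_to_lower(), which is loaded with the tidyverse; the tibble `toy` is hypothetical.

```r
library(dplyr)
library(stringr)

# Hypothetical toy data with mixed spellings, an NA, and "Other"
toy <- tibble(exposure = c("Y", "no", "Yes", "n", NA, "Other"))

toy_clean <- toy %>%
  mutate(exposure = case_when(
    str_to_lower(exposure) %in% c("n", "no") ~ "No",
    str_to_lower(exposure) %in% c("y", "yes") ~ "Yes",
    TRUE ~ exposure # "Other" and NA pass through unchanged
  ))

count(toy_clean, exposure)
```

This also catches spellings such as "NO" or "YES" that the explicit lists would miss, at the cost of being a little less obvious to read.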

2.3

Check to see how many values exposure has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.

BloodType %>% count(exposure)

## # A tibble: 4 × 2
##   exposure     n
##   <chr>    <int>
## 1 No           8
## 2 Other        1
## 3 Yes          5
## 4 <NA>         1

2.4

Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?

BloodType <- BloodType %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    type == "o.+" ~ "O.+",
    TRUE ~ type))
BloodType

## # A tibble: 15 × 3
##    exposure type  infection
##    <chr>    <chr> <chr>    
##  1 Yes      A.-   Yes      
##  2 No       AB.+  No       
##  3 Yes      O.-   Yes      
##  4 Yes      O.+   No       
##  5 No       AB.-  No       
##  6 No       B.+   No       
##  7 No       B.-   Yes      
##  8 No       O.-   No       
##  9 Yes      O.+   Yes      
## 10 Yes      A.-   No       
## 11 No       A.+   No       
## 12 No       O.-   Yes      
## 13 <NA>     B.-   Yes      
## 14 No       O.+   Yes      
## 15 Other    AB.-  NotSure
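To see why the `TRUE ~ type` step matters here, the sketch below runs the same recode with and without it on a hypothetical four-row tibble `toy`. Any value that matches none of the conditions becomes NA when there is no fallback.

```r
library(dplyr)

# Hypothetical toy data: two values need recoding, two do not
toy <- tibble(type = c("o.-", "A.+", "o.+", "B.-"))

result <- toy %>%
  mutate(
    # No fallback: values matching no condition become NA
    no_fallback = case_when(
      type == "o.-" ~ "O.-",
      type == "o.+" ~ "O.+"
    ),
    # Fallback: every unmatched value keeps its original content
    with_fallback = case_when(
      type == "o.-" ~ "O.-",
      type == "o.+" ~ "O.+",
      TRUE ~ type
    )
  )

result
```

The fallback makes no difference only when the conditions already cover every value in the column. In 2.4 they cover just two of the eight blood types, so leaving it out would wipe out most of the column.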

2.5

Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”

BloodType %>% count(type)

## # A tibble: 8 × 2
##   type      n
##   <chr> <int>
## 1 A.+       1
## 2 A.-       2
## 3 AB.+      1
## 4 AB.-      2
## 5 B.+       1
## 6 B.-       2
## 7 O.+       3
## 8 O.-       3
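Reading the count() output by eye works for eight categories. For longer lists, one possible sketch is to compare the column against the set of allowed values with %in%. The vector `allowed` and the tibble `toy` below are hypothetical names introduced for illustration.

```r
library(dplyr)

# The values we expect to see after cleaning
allowed <- c("A.-", "A.+", "AB.-", "AB.+", "B.-", "B.+", "O.-", "O.+")

# Hypothetical toy data with one value that was not cleaned
toy <- tibble(type = c("A.-", "o.+", "B.-", "O.+"))

types <- toy %>% pull(type)

# TRUE only if every value is in the allowed set
all(types %in% allowed)

# Show the rows that still need attention
unexpected <- toy %>% filter(!type %in% allowed)
unexpected
```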

2.6

Make a new tibble of BloodType called BloodType_split that splits the type variable into two columns called blood_type and Rhfactor. Note: in a regular expression, a period is a special character that matches any single character, so we need “\\.” instead of simply “.” as the separating character to tell R to treat it as a literal period. Make sure you use quotes around “\\.” and the column names as shown below (not backticks).

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  separate(OLD_COLUMN,
           into = c("NEW_COLUMN1", "NEW_COLUMN2"),
           sep = "SEPARATING_CHARACTER")

BloodType_split <- BloodType %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
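The sketch below illustrates the difference between "." and "\\." on a hypothetical tibble `toy`. It also shows separate_wider_delim(), which treats its delimiter as a literal string by default; this assumes tidyr 1.3.0 or later and is offered as an alternative, not as the method the lab asks for.

```r
library(dplyr)
library(tidyr)

# As a regular expression, "." matches any character
grepl(".", "AB+")   # TRUE, even though the string contains no period
grepl("\\.", "AB+") # FALSE, "\\." only matches a literal period

# Hypothetical toy data in the same format as BloodType
toy <- tibble(type = c("A.-", "AB.+", "O.-"))

# separate() interprets sep as a regular expression, so escape the period
toy_split <- toy %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")

# separate_wider_delim() takes a fixed delimiter, so no escaping is needed
toy_split2 <- toy %>%
  separate_wider_delim(type, delim = ".", names = c("blood_type", "Rhfactor"))

toy_split
```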

Practice on Your Own!

P.2

How many observations are there for each Rhfactor in the data object you just made?

count(BloodType_split, Rhfactor)

## # A tibble: 2 × 2
##   Rhfactor     n
##   <chr>    <int>
## 1 +            6
## 2 -            9

P.3

Filtering for patients with type O, how many had the infection?

BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)

## # A tibble: 2 × 2
##   infection     n
##   <chr>     <int>
## 1 No            2
## 2 Yes           4

Data Cleaning Lab - Key