Part 1

library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(dasehr)
library(broom)
# install.packages("naniar")
library(naniar)

Read in the CalEnviroScreen data. You can read it directly from the URL or download the file first.

CalEnviroScreen Dataset: CalEnviroScreen is a project that ranks census tracts in California based on potential exposures to pollutants, adverse environmental conditions, socioeconomic factors and the prevalence of certain health conditions. Data used in the CalEnviroScreen model come from national and state sources.

The data is from https://calenviroscreen-oehha.hub.arcgis.com/#Data

You can download the data as a CSV file into your current working directory. Note that it is also available at: https://daseh.org/data/CalEnviroScreen_data.csv

ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
## New names:
## • `` -> `...1`
## Rows: 8035 Columns: 68
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (65): ...1, CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Pe...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.1

The Lead variable in this dataset is an estimate of the risk for lead exposure in children living in low-income communities with older housing. A higher number indicates a greater risk.

Use the is.na() and any() functions to check if the ces Lead variable has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.

# General format
TIBBLE %>%
  pull(COLUMN) %>%
  is.na() %>%
  any()
ces %>%
  pull(Lead) %>%
  is.na() %>%
  any()
## [1] TRUE
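
To see why the pull() step matters, here is a minimal sketch on a made-up tibble. The object `toy` and its values are hypothetical and are not taken from ces. The sketch also shows base R's anyNA(), which should give the same answer as any(is.na(x)) in one call.

```r
library(dplyr)

# Hypothetical toy data standing in for ces (values are made up)
toy <- tibble(Lead = c(10.5, NA, 42.1))

# pull() returns a plain vector, so is.na() gives one TRUE/FALSE per value
has_na <- toy %>%
  pull(Lead) %>%
  is.na() %>%
  any()

# anyNA() is a base R shortcut for any(is.na(x))
has_na_short <- toy %>%
  pull(Lead) %>%
  anyNA()

has_na
has_na_short
```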

1.2

The Education variable reports the percent of population over 25 with less than a high school education.

Filter the rows of ces so that only rows WITHOUT missing values for the Education variable remain, using drop_na(). Assign the result to the object have_ed_data.

have_ed_data <- ces %>% drop_na(Education)
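
As a quick illustration of what drop_na() does when it is given a column name, here is a sketch on a hypothetical two-column tibble (`toy` and its values are assumptions, not ces data). Only rows with a missing Education value are removed; missing values in other columns are left alone.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data (values are made up)
toy <- tibble(
  Education = c(12.3, NA, 8.0),
  Lead      = c(NA, 55.0, 20.1)
)

# Only the row where Education is NA is dropped; the NA in Lead stays
have_ed_toy <- toy %>% drop_na(Education)

nrow(toy)          # 3
nrow(have_ed_toy)  # 2
```

Comparing nrow() before and after is a simple way to confirm how many rows were removed.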

1.3

Use naniar to make a plot of the amount of missing data for each variable of ces (use gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/

gg_miss_var(ces)

Practice on Your Own!

P.1

The LinguisticIsol variable reports the percent of limited-English-speaking households in each census tract.

What percentage of the LinguisticIsol variable is complete in ces? Hint: use another naniar function.

pull(ces, LinguisticIsol) %>% pct_complete() # this
## [1] 96.01742
miss_var_summary(ces) # or this
## # A tibble: 68 × 3
##    variable           n_miss pct_miss
##    <chr>               <int>    <num>
##  1 Unemployment          335     4.17
##  2 UnemploymentPctl      335     4.17
##  3 LinguisticIsol        320     3.98
##  4 LinguisticIsolPctl    320     3.98
##  5 LowBirthWeight        227     2.83
##  6 LowBirthWeightPctl    227     2.83
##  7 HousingBurden         145     1.80
##  8 HousingBurdenPctl     145     1.80
##  9 CES4.0Score           103     1.28
## 10 CES4.0Percentile      103     1.28
## # ℹ 58 more rows
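
The two answers above agree with each other, which can be checked by hand. This sketch uses only numbers copied from the output above (8035 rows, 320 missing LinguisticIsol values); the variable names are illustrative.

```r
# Numbers copied from the output above
n_rows <- 8035  # rows in ces
n_miss <- 320   # missing LinguisticIsol values

pct_miss <- 100 * n_miss / n_rows
pct_complete_by_hand <- 100 - pct_miss

round(pct_miss, 2)              # 3.98
round(pct_complete_by_hand, 5)  # 96.01742
```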

Part 2

New Data set

Now imagine we work in a clinic and we are trying to understand more about blood types of patients.

Let’s say we have the data like so:

BloodType <- tibble(
  weight_loss =
    c(
      "Y", "No", "Yes", "y", "no",
      "n", "No", "N", "yes", "Yes",
      "No", "N", NA, "N", "Other"
    ),
  type = c(
    "A.-", "AB.+", "O.-", "O.+", "AB.-",
    "B.+", "B.-", "o.-", "O.+", "A.-",
    "A.+", "O.-", "B.-", "o.+", "AB.-"
  ),
  infection = c(
    "Yes", "No", "Yes", "No", "No",
    "No", "Yes", "No", "Yes", "No",
    "No", "Yes", "Yes", "Yes", "NotSure"
  )
)

BloodType
## # A tibble: 15 × 3
##    weight_loss type  infection
##    <chr>       <chr> <chr>    
##  1 Y           A.-   Yes      
##  2 No          AB.+  No       
##  3 Yes         O.-   Yes      
##  4 y           O.+   No       
##  5 no          AB.-  No       
##  6 n           B.+   No       
##  7 No          B.-   Yes      
##  8 N           o.-   No       
##  9 yes         O.+   Yes      
## 10 Yes         A.-   No       
## 11 No          A.+   No       
## 12 N           O.-   Yes      
## 13 <NA>        B.-   Yes      
## 14 N           o.+   Yes      
## 15 Other       AB.-  NotSure

There are some issues with this data that we need to figure out!

2.1

Determine how many NA values there are for weight_loss (assume you know that N and n stand for no).

count(BloodType, weight_loss) # the simple way
## # A tibble: 10 × 2
##    weight_loss     n
##    <chr>       <int>
##  1 N               3
##  2 No              3
##  3 Other           1
##  4 Y               1
##  5 Yes             2
##  6 n               1
##  7 no              1
##  8 y               1
##  9 yes             1
## 10 <NA>            1
sum(is.na(pull(BloodType, weight_loss))) # another way
## [1] 1
BloodType %>% # another way
  pull(weight_loss) %>%
  is.na() %>%
  sum()
## [1] 1

2.2

Recode the weight_loss variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  mutate(NEW_COLUMN = case_when(
    OLD_COLUMN %in% c( ... ) ~ ... ,
    OLD_COLUMN %in% c( ... ) ~ ... ,
    TRUE ~ OLD_COLUMN
  ))
BloodType <- BloodType %>%
  mutate(weight_loss = case_when(
    weight_loss %in% c("N", "n", "No", "no") ~ "No",
    weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    # Everything else ("Other" and NA) is kept as-is; without this line, "Other" would become NA
    TRUE ~ weight_loss
  ))

count(BloodType, weight_loss)
## # A tibble: 4 × 2
##   weight_loss     n
##   <chr>       <int>
## 1 No              8
## 2 Other           1
## 3 Yes             5
## 4 <NA>            1
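
To see what the final TRUE line does, here is a small sketch on a hypothetical four-value tibble (`toy` is made up for illustration). It recodes the same column twice, once with the fallback and once without, so the two results can be compared side by side.

```r
library(dplyr)

# Hypothetical toy data with one value of each kind
toy <- tibble(weight_loss = c("y", "N", "Other", NA))

recoded <- toy %>%
  mutate(
    # With the fallback, unmatched values keep their original value
    with_fallback = case_when(
      weight_loss %in% c("N", "n", "No", "no") ~ "No",
      weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
      TRUE ~ weight_loss
    ),
    # Without the fallback, unmatched values become NA
    no_fallback = case_when(
      weight_loss %in% c("N", "n", "No", "no") ~ "No",
      weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes"
    )
  )

recoded
```

In the no_fallback column, "Other" is turned into NA, so it can no longer be told apart from the value that was missing to begin with.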

2.3

Check to see how many values weight_loss has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.

BloodType %>% count(weight_loss)
## # A tibble: 4 × 2
##   weight_loss     n
##   <chr>       <int>
## 1 No              8
## 2 Other           1
## 3 Yes             5
## 4 <NA>            1

2.4

Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?

BloodType <- BloodType %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    type == "o.+" ~ "O.+",
    TRUE ~ type))
BloodType
## # A tibble: 15 × 3
##    weight_loss type  infection
##    <chr>       <chr> <chr>    
##  1 Yes         A.-   Yes      
##  2 No          AB.+  No       
##  3 Yes         O.-   Yes      
##  4 Yes         O.+   No       
##  5 No          AB.-  No       
##  6 No          B.+   No       
##  7 No          B.-   Yes      
##  8 No          O.-   No       
##  9 Yes         O.+   Yes      
## 10 Yes         A.-   No       
## 11 No          A.+   No       
## 12 No          O.-   Yes      
## 13 <NA>        B.-   Yes      
## 14 No          O.+   Yes      
## 15 Other       AB.-  NotSure

2.5

Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”

BloodType %>% count(type)
## # A tibble: 8 × 2
##   type      n
##   <chr> <int>
## 1 A.+       1
## 2 A.-       2
## 3 AB.+      1
## 4 AB.-      2
## 5 B.+       1
## 6 B.-       2
## 7 O.+       3
## 8 O.-       3
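
Besides reading the count() output, the same check can be done programmatically. This is a sketch in base R: the `type` vector is copied from the cleaned BloodType output above, and `allowed` lists the eight expected values written with the period used in the data. Both object names are illustrative.

```r
# The eight values we expect to see
allowed <- c("A.-", "A.+", "AB.-", "AB.+", "B.-", "B.+", "O.-", "O.+")

# Cleaned type values, copied from the BloodType output above
type <- c("A.-", "AB.+", "O.-", "O.+", "AB.-",
          "B.+", "B.-", "O.-", "O.+", "A.-",
          "A.+", "O.-", "B.-", "O.+", "AB.-")

# TRUE only if every value is one of the allowed values
all_valid <- all(type %in% allowed)

# Any values that are not allowed (empty if the data are clean)
unexpected <- setdiff(type, allowed)

all_valid
unexpected
```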

2.6

Make a new tibble from BloodType called BloodType_split that splits the type variable into two columns called blood_type and Rhfactor. Note: a period is a special character in regular expressions that matches any character, so we need “\\.” instead of simply “.” as the separating character to tell R that we mean a literal period. Make sure you use quotes (not backticks) around “\\.” and the column names, as shown below.

# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  separate(OLD_COLUMN,
           into = c("NEW_COLUMN1", "NEW_COLUMN2"),
           sep = "SEPARATING_CHARACTER")
BloodType_split <- BloodType %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
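
The difference between an escaped and an unescaped period can be seen with base R alone. This sketch uses grepl() and strsplit() on two example strings (the vector `x` is illustrative); it also shows `fixed = TRUE`, a base R option that treats the pattern as plain text so no escaping is needed.

```r
x <- c("AB.-", "O.+")

grepl(".", "AB")      # TRUE: an unescaped "." matches any character
grepl("\\.", "AB")    # FALSE: "\\." matches only a literal period
grepl("\\.", "AB.-")  # TRUE: this string contains a literal period

# Splitting on a literal period, two equivalent ways
parts <- strsplit(x, "\\.")
parts_fixed <- strsplit(x, ".", fixed = TRUE)

parts
```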

Practice on Your Own!

P.2

How many observations are there for each Rhfactor in the data object you just made?

count(BloodType_split, Rhfactor)
## # A tibble: 2 × 2
##   Rhfactor     n
##   <chr>    <int>
## 1 +            6
## 2 -            9

P.3

Filtering for patients with type O, how many had the infection?

BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)
## # A tibble: 2 × 2
##   infection     n
##   <chr>     <int>
## 1 No            2
## 2 Yes           4
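
If only the single number is wanted, both conditions can go into one filter() call. This is a self-contained sketch: the tibble `cleaned` is rebuilt from the cleaned BloodType output shown earlier, and the object names are illustrative.

```r
library(dplyr)
library(tidyr)

# type and infection columns copied from the cleaned BloodType output above
cleaned <- tibble(
  type = c("A.-", "AB.+", "O.-", "O.+", "AB.-",
           "B.+", "B.-", "O.-", "O.+", "A.-",
           "A.+", "O.-", "B.-", "O.+", "AB.-"),
  infection = c("Yes", "No", "Yes", "No", "No",
                "No", "Yes", "No", "Yes", "No",
                "No", "Yes", "Yes", "Yes", "NotSure")
)

n_type_o_infected <- cleaned %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.") %>%
  # Conditions separated by a comma must both be true
  filter(blood_type == "O", infection == "Yes") %>%
  nrow()

n_type_o_infected  # 4
```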