library(readr)
library(tidyverse)
library(dplyr)
library(lubridate)
library(dasehr)
library(broom)
# install.packages("naniar")
library(naniar)
Read in the CalEnviroScreen data. You can either read it directly from the URL or download the file first.
CalEnviroScreen Dataset: CalEnviroScreen is a project that ranks census tracts in California based on potential exposures to pollutants, adverse environmental conditions, socioeconomic factors and the prevalence of certain health conditions. Data used in the CalEnviroScreen model come from national and state sources.
The data is from https://calenviroscreen-oehha.hub.arcgis.com/#Data
You can download it as a CSV into your current working directory. Note that it's also available at: https://daseh.org/data/CalEnviroScreen_data.csv
ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
## New names:
## • `` -> `...1`
## Rows: 8035 Columns: 68
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (65): ...1, CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Pe...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The Lead variable in this dataset is an estimate of the risk for lead exposure in children living in low-income communities with older housing. A higher number indicates a greater risk.
Use the is.na() and any() functions to check if the ces Lead variable has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.
# General format
TIBBLE %>%
pull(COLUMN) %>%
is.na() %>%
any()
ces %>%
pull(Lead) %>%
is.na() %>%
any()
## [1] TRUE
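To see what each step of this pipeline contributes, here is a minimal sketch on a made-up tibble. The object toy and its values are assumptions for illustration only and are not part of the CalEnviroScreen data. is.na() returns one TRUE or FALSE per element, any() collapses those into a single answer, and sum() counts the TRUE values.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not from the lab dataset
toy <- tibble(Lead = c(10.5, NA, 42.0, NA))

# One logical value per element: FALSE TRUE FALSE TRUE
lead_missing <- toy %>%
  pull(Lead) %>%
  is.na()

any(lead_missing) # is at least one value missing?
sum(lead_missing) # how many values are missing?
```

Swapping any() for sum() at the end of the pipe is a handy variation when you need the number of missing values rather than a yes/no answer.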
The Education variable reports the percent of population over 25 with less than a high school education.
Clean rows of ces, so that only rows remain that do NOT have missing values for the Education variable, using drop_na. Assign this to the object have_ed_data.
have_ed_data <- ces %>% drop_na(Education)
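Naming a column inside drop_na() matters. The sketch below uses a made-up tibble (toy and its two columns are assumptions for illustration) to contrast dropping rows based on one column with dropping rows that have an NA anywhere.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data, not from the lab dataset
toy <- tibble(
  Education = c(12.1, NA, 30.4),
  Poverty   = c(NA, 55.0, 20.2)
)

# Drop rows only where Education is missing;
# the row with an NA in Poverty is kept
ed_only <- drop_na(toy, Education)
ed_only

# With no columns named, rows with an NA in any column are dropped
all_complete <- drop_na(toy)
all_complete
```

In this toy example the first call keeps two rows and the second keeps only one, which is why the lab names Education explicitly.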
Use naniar to make a visual of the amount of data missing for each variable of the CalEnviroScreen data (ces), using gg_miss_var(). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
gg_miss_var(ces)
The LinguisticIsol variable reports the percent of limited-English-speaking households in each census tract.
What percentage of the LinguisticIsol variable is complete in ces? Hint: use another naniar function.
pull(ces, LinguisticIsol) %>% pct_complete() # this
## [1] 96.01742
miss_var_summary(ces) # or this
## # A tibble: 68 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 Unemployment 335 4.17
## 2 UnemploymentPctl 335 4.17
## 3 LinguisticIsol 320 3.98
## 4 LinguisticIsolPctl 320 3.98
## 5 LowBirthWeight 227 2.83
## 6 LowBirthWeightPctl 227 2.83
## 7 HousingBurden 145 1.80
## 8 HousingBurdenPctl 145 1.80
## 9 CES4.0Score 103 1.28
## 10 CES4.0Percentile 103 1.28
## # ℹ 58 more rows
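The two answers above agree because percent complete and percent missing add up to 100 (96.02 + 3.98). As a rough base R sketch of that arithmetic, using a made-up vector x (an assumption for illustration), the same quantities can be computed by hand; on the same input they should match what naniar's pct_complete() and pct_miss() report.

```r
# Hypothetical toy vector, not from the lab dataset
x <- c(3.2, NA, 7.5, 1.1, NA)

# mean() of a logical vector is the proportion of TRUE values
pct_done <- mean(!is.na(x)) * 100 # percent of values present
pct_gone <- mean(is.na(x)) * 100  # percent of values missing

pct_done
pct_gone
```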
New Data set
Now imagine we work in a clinic and we are trying to understand more about blood types of patients.
Let’s say we have the data like so:
BloodType <- tibble(
weight_loss =
c(
"Y", "No", "Yes", "y", "no",
"n", "No", "N", "yes", "Yes",
"No", "N", NA, "N", "Other"
),
type = c(
"A.-", "AB.+", "O.-", "O.+", "AB.-",
"B.+", "B.-", "o.-", "O.+", "A.-",
"A.+", "O.-", "B.-", "o.+", "AB.-"
),
infection = c(
"Yes", "No", "Yes", "No", "No",
"No", "Yes", "No", "Yes", "No",
"No", "Yes", "Yes", "Yes", "NotSure"
)
)
BloodType
## # A tibble: 15 × 3
## weight_loss type infection
## <chr> <chr> <chr>
## 1 Y A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 y O.+ No
## 5 no AB.- No
## 6 n B.+ No
## 7 No B.- Yes
## 8 N o.- No
## 9 yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 N O.- Yes
## 13 <NA> B.- Yes
## 14 N o.+ Yes
## 15 Other AB.- NotSure
There are some issues with this data that we need to figure out!
Determine how many NA values there are for weight_loss (assume you know that N and n both stand for no).
count(BloodType, weight_loss) # the simple way
## # A tibble: 10 × 2
## weight_loss n
## <chr> <int>
## 1 N 3
## 2 No 3
## 3 Other 1
## 4 Y 1
## 5 Yes 2
## 6 n 1
## 7 no 1
## 8 y 1
## 9 yes 1
## 10 <NA> 1
sum(is.na(pull(BloodType, weight_loss))) # another way
## [1] 1
BloodType %>% # another way
pull(weight_loss) %>%
is.na() %>%
sum()
## [1] 1
Recode the weight_loss variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
mutate(NEW_COLUMN = case_when(
OLD_COLUMN %in% c( ... ) ~ ... ,
OLD_COLUMN %in% c( ... ) ~ ... ,
TRUE ~ OLD_COLUMN
))
BloodType <- BloodType %>%
mutate(weight_loss = case_when(
weight_loss %in% c("N", "n", "No", "no") ~ "No",
weight_loss %in% c("Y", "y", "Yes", "yes") ~ "Yes",
# Fallback: values matching neither condition ("Other" and NA here) stay as they are.
# Without this line, "Other" would be turned into NA.
TRUE ~ weight_loss
))
count(BloodType, weight_loss)
## # A tibble: 4 × 2
## weight_loss n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
Check to see how many values weight_loss has for each category (hint: use count). It’s good practice to regularly check your data throughout the data wrangling process.
BloodType %>% count(weight_loss)
## # A tibble: 4 × 2
## weight_loss n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?
BloodType <- BloodType %>%
mutate(type = case_when(
type == "o.-" ~ "O.-",
type == "o.+" ~ "O.+",
TRUE ~ type))
BloodType
## # A tibble: 15 × 3
## weight_loss type infection
## <chr> <chr> <chr>
## 1 Yes A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 Yes O.+ No
## 5 No AB.- No
## 6 No B.+ No
## 7 No B.- Yes
## 8 No O.- No
## 9 Yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 No O.- Yes
## 13 <NA> B.- Yes
## 14 No O.+ Yes
## 15 Other AB.- NotSure
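The "extra step" is the TRUE ~ fallback. It matters whenever some values match none of the conditions, because case_when() returns NA for those. The sketch below shows this on a made-up tibble (toy and the column names no_fallback and with_fallback are assumptions for illustration).

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not from the lab dataset
toy <- tibble(type = c("o.-", "A.+", "O.+"))

toy_recoded <- toy %>%
  mutate(
    # No fallback: values that match no condition become NA
    no_fallback = case_when(type == "o.-" ~ "O.-"),
    # With fallback: unmatched values keep their original value
    with_fallback = case_when(
      type == "o.-" ~ "O.-",
      TRUE ~ type
    )
  )
toy_recoded
```

If the conditions already cover every value in the column, the fallback changes nothing, which is why it sometimes seems not to matter. Recent dplyr versions (1.1.0 and later) also accept .default = type in place of TRUE ~ type.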
Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”
BloodType %>% count(type)
## # A tibble: 8 × 2
## type n
## <chr> <int>
## 1 A.+ 1
## 2 A.- 2
## 3 AB.+ 1
## 4 AB.- 2
## 5 B.+ 1
## 6 B.- 2
## 7 O.+ 3
## 8 O.- 3
Make a new tibble of BloodType called BloodType_split that splits the type variable into two called blood_type and Rhfactor. Note: periods are special characters that are generally interpreted as wildcards, so we need “\\.” instead of simply “.” as the separating character to tell R that we want it to be interpreted as a literal period. Make sure you use quotes around “\\.” and the column names as shown below (we don’t want backticks).
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
separate(OLD_COLUMN,
into = c("NEW_COLUMN1", "NEW_COLUMN2"),
sep = "SEPARATING_CHARACTER")
BloodType_split <- BloodType %>%
separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
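To make the wildcard point concrete, here is a small sketch. The object toy and the test string "AB" are assumptions for illustration. In a regular expression an unescaped "." matches any single character, while "\\." (written with two backslashes inside an R string) matches only a literal period.

```r
library(dplyr)
library(tibble)
library(tidyr)

# "." is a wildcard: it matches "A" even though "AB" has no period
dot_unescaped <- grepl(".", "AB")
# "\\." matches only a literal period, so there is no match here
dot_escaped <- grepl("\\.", "AB")

dot_unescaped
dot_escaped

# Hypothetical toy data, not from the lab dataset
toy <- tibble(type = c("A.-", "AB.+"))

# Split on the literal period into two new columns
toy_split <- toy %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
toy_split
```

Another common way to write a literal period in a regular expression is "[.]", which avoids the double backslash.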
How many observations are there for each Rhfactor in the data object you just made?
count(BloodType_split, Rhfactor)
## # A tibble: 2 × 2
## Rhfactor n
## <chr> <int>
## 1 + 6
## 2 - 9
Filtering for patients with type O, how many had the infection?
BloodType_split %>%
filter(blood_type == "O") %>%
count(infection)
## # A tibble: 2 × 2
## infection n
## <chr> <int>
## 1 No 2
## 2 Yes 4