library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# install.packages("naniar")
library(naniar)
Read in the CalEnviroScreen data using read_csv() and the URL https://daseh.org/data/CalEnviroScreen_data.csv. Assign this dataset to an object called “ces”.
ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the is.na() and any() functions to check if the Lead variable in ces has any NA values. Use the pipe between each step. Hint: You first need to pull out the vector version of this variable to use the is.na() function.
Lead: an estimate of the risk for lead exposure in children living in low-income communities with older housing. A higher number indicates a greater risk.
# General format
TIBBLE %>%
  pull(COLUMN) %>%
  is.na() %>%
  any()
ces %>%
  pull(Lead) %>%
  is.na() %>%
  any()
## [1] TRUE
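As a quick cross-check, base R's anyNA() should give the same answer as is.na() followed by any(). The sketch below uses a small made-up tibble (toy and its values are illustrative, not taken from ces) so it runs without downloading the data.

```r
library(dplyr)

# Toy data standing in for ces; the values are made up for illustration
toy <- tibble(Lead = c(12.5, NA, 40.1, 7.3))

# Same pattern as above: pull the vector, test each element, then collapse
has_na_piped <- toy %>%
  pull(Lead) %>%
  is.na() %>%
  any()

# anyNA() does the test and the collapse in one step
has_na_short <- toy %>%
  pull(Lead) %>%
  anyNA()

has_na_piped
has_na_short
```

Either form works; the longer pipeline makes each step visible, which is why the lab asks for it.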
Clean the rows of ces so that only rows that do NOT have missing values for the Education variable remain, using drop_na(). Assign this to the object have_ed_data.
Education: the percentage of the population over 25 with less than a high school education.
have_ed_data <- ces %>% drop_na(Education)
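To see what drop_na() does when given a column name, here is a minimal sketch on a made-up tibble (toy, tract, and the values are hypothetical). It compares drop_na(Education) with the equivalent filter() call, and shows that missing values in other columns are left alone.

```r
library(dplyr)
library(tidyr)

# Made-up data: Education is missing in rows 2 and 4, Lead is missing in row 1
toy <- tibble(
  tract = c(1, 2, 3, 4),
  Education = c(10.2, NA, 33.0, NA),
  Lead = c(NA, 5, 6, 7)
)

# Drop rows only where Education is missing
dropped <- toy %>% drop_na(Education)

# The same rows, selected with filter()
filtered <- toy %>% filter(!is.na(Education))

nrow(toy)      # rows before
nrow(dropped)  # rows after
dropped        # row 1 is kept even though Lead is NA there
all.equal(dropped, filtered)
```

Calling drop_na() with no column names would instead remove every row that has an NA in any column, which is usually more aggressive than intended.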
Use naniar to make a visual of the amount of data missing for each variable of ces (use gg_miss_var()). Check out more about this package here: https://www.njtierney.com/post/2018/06/12/naniar-on-cran/
gg_miss_var(ces)
What percentage of the LinguisticIsol variable is complete in ces? Hint: use another naniar function.
LinguisticIsol: the percentage of limited English speaking households within each census tract.
pull(ces, LinguisticIsol) %>% pct_complete() # this
## [1] 96.01742
miss_var_summary(ces) # or this
## # A tibble: 67 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 Unemployment 335 4.17
## 2 UnemploymentPctl 335 4.17
## 3 LinguisticIsol 320 3.98
## 4 LinguisticIsolPctl 320 3.98
## 5 LowBirthWeight 227 2.83
## 6 LowBirthWeightPctl 227 2.83
## 7 HousingBurden 145 1.80
## 8 HousingBurdenPctl 145 1.80
## 9 CES4.0Score 103 1.28
## 10 CES4.0Percentile 103 1.28
## # ℹ 57 more rows
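The two answers agree: percent complete is 100 minus percent missing. A short sketch of the arithmetic, using the row count and n_miss printed above (the vector x at the end is made up for illustration):

```r
# Counts taken from the output above
n_rows <- 8035   # rows in ces
n_miss <- 320    # missing values in LinguisticIsol

pct_miss <- 100 * n_miss / n_rows
pct_complete_by_hand <- 100 - pct_miss

round(pct_miss, 2)
round(pct_complete_by_hand, 5)

# The same idea on any vector, using only base R
x <- c(1, NA, 3, 4)  # made-up values
pct_complete_x <- 100 * mean(!is.na(x))
pct_complete_x
```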
New Data set
Let’s imagine we work in a clinic and we are trying to understand more about blood types of patients.
Run the following code to create a dataset that we might collect.
BloodType <- tibble(
  exposure = c(
    "Y", "No", "Yes", "y", "no",
    "n", "No", "N", "yes", "Yes",
    "No", "N", NA, "N", "Other"
  ),
  type = c(
    "A.-", "AB.+", "O.-", "O.+", "AB.-",
    "B.+", "B.-", "o.-", "O.+", "A.-",
    "A.+", "O.-", "B.-", "o.+", "AB.-"
  ),
  infection = c(
    "Yes", "No", "Yes", "No", "No",
    "No", "Yes", "No", "Yes", "No",
    "No", "Yes", "Yes", "Yes", "NotSure"
  )
)
BloodType
## # A tibble: 15 × 3
## exposure type infection
## <chr> <chr> <chr>
## 1 Y A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 y O.+ No
## 5 no AB.- No
## 6 n B.+ No
## 7 No B.- Yes
## 8 N o.- No
## 9 yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 N O.- Yes
## 13 <NA> B.- Yes
## 14 N o.+ Yes
## 15 Other AB.- NotSure
There are some issues with this data that we need to figure out!
Determine how many NA values there are for exposure (assume you know that N and n stand for no).
count(BloodType, exposure) # the simple way
## # A tibble: 10 × 2
## exposure n
## <chr> <int>
## 1 N 3
## 2 No 3
## 3 Other 1
## 4 Y 1
## 5 Yes 2
## 6 n 1
## 7 no 1
## 8 y 1
## 9 yes 1
## 10 <NA> 1
sum(is.na(pull(BloodType, exposure))) # another way
## [1] 1
BloodType %>% # another way
  pull(exposure) %>%
  is.na() %>%
  sum()
## [1] 1
Recode the exposure variable of the BloodType data so that it is consistent. Use case_when(). Keep “Other” as “Other”. Don’t forget to use quotes!
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  mutate(NEW_COLUMN = case_when(
    OLD_COLUMN %in% c( ... ) ~ ... ,
    OLD_COLUMN %in% c( ... ) ~ ... ,
    TRUE ~ OLD_COLUMN
  ))
BloodType <- BloodType %>%
  mutate(exposure = case_when(
    exposure %in% c("N", "n", "No", "no") ~ "No",
    exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
    # The remaining values are "Other" and NA. This fallback keeps them as they are;
    # without it, "Other" would be turned into NA.
    TRUE ~ exposure
  ))
count(BloodType, exposure)
## # A tibble: 4 × 2
## exposure n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
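To see why the TRUE ~ exposure line matters here, the sketch below runs the same recode with and without the fallback on a small made-up tibble (toy and the column names with_fallback and no_fallback are hypothetical, chosen for illustration).

```r
library(dplyr)

# Made-up values covering each situation: a "yes", a "no", "Other", and NA
toy <- tibble(exposure = c("y", "No", "Other", NA))

recoded <- toy %>%
  mutate(
    # With the fallback, unmatched values pass through unchanged
    with_fallback = case_when(
      exposure %in% c("N", "n", "No", "no") ~ "No",
      exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes",
      TRUE ~ exposure
    ),
    # Without the fallback, unmatched values become NA
    no_fallback = case_when(
      exposure %in% c("N", "n", "No", "no") ~ "No",
      exposure %in% c("Y", "y", "Yes", "yes") ~ "Yes"
    )
  )

recoded
```

In the no_fallback column "Other" is lost, which is what the exercise asks us to avoid. In dplyr 1.1.0 and later, the same fallback can also be written as .default = exposure.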
Check to see how many values exposure has for each category (hint: use count()). It’s good practice to regularly check your data throughout the data wrangling process.
BloodType %>% count(exposure)
## # A tibble: 4 × 2
## exposure n
## <chr> <int>
## 1 No 8
## 2 Other 1
## 3 Yes 5
## 4 <NA> 1
Recode the type variable of the BloodType data to be consistent. Use case_when(). Hint: the inconsistency has to do with lower case o and capital O. Don’t forget to use quotes! Remember that important extra step that we often do for case_when(). Sometimes it matters and sometimes it doesn’t. Why is that?
BloodType <- BloodType %>%
  mutate(type = case_when(
    type == "o.-" ~ "O.-",
    type == "o.+" ~ "O.+",
    TRUE ~ type
  ))
BloodType
## # A tibble: 15 × 3
## exposure type infection
## <chr> <chr> <chr>
## 1 Yes A.- Yes
## 2 No AB.+ No
## 3 Yes O.- Yes
## 4 Yes O.+ No
## 5 No AB.- No
## 6 No B.+ No
## 7 No B.- Yes
## 8 No O.- No
## 9 Yes O.+ Yes
## 10 Yes A.- No
## 11 No A.+ No
## 12 No O.- Yes
## 13 <NA> B.- Yes
## 14 No O.+ Yes
## 15 Other AB.- NotSure
Check to see that type only has these possible values: “A.-”, “A.+”, “AB.-”, “AB.+”, “B.-”, “B.+”, “O.-”, “O.+”
BloodType %>% count(type)
## # A tibble: 8 × 2
## type n
## <chr> <int>
## 1 A.+ 1
## 2 A.- 2
## 3 AB.+ 1
## 4 AB.- 2
## 5 B.+ 1
## 6 B.- 2
## 7 O.+ 3
## 8 O.- 3
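Because the only inconsistency in type is letter case, another option is to upper-case the whole column and then test it against the allowed values. This is a sketch of an alternative, not the approach the exercise asks for; toy, valid_types, and the "ab.+" value are made up for illustration.

```r
library(dplyr)
library(stringr)

# Made-up values; "ab.+" is an extra case that is not in BloodType
toy <- tibble(type = c("o.-", "O.+", "ab.+", "A.-"))

# The allowed values listed in the question above
valid_types <- c("A.-", "A.+", "AB.-", "AB.+", "B.-", "B.+", "O.-", "O.+")

# Upper-case every letter; "." and the sign are unaffected
cleaned <- toy %>%
  mutate(type = str_to_upper(type))

# TRUE only if every value is one of the allowed types
all_valid <- all(pull(cleaned, type) %in% valid_types)

cleaned
all_valid
```

The %in% check gives a single TRUE or FALSE, which is easier to scan than a count table when there are many categories.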
Make a new tibble of BloodType called BloodType_split that splits the type variable into two variables called blood_type and Rhfactor. Note: the separator is interpreted as a regular expression, where a period is a special character that matches any character. We therefore need “\\.” instead of simply “.” to tell R that we want a literal period. Make sure you use quotes around “\\.” and around the new column names, as shown below (not backticks).
# General format
NEW_TIBBLE <- OLD_TIBBLE %>%
  separate(OLD_COLUMN,
    into = c("NEW_COLUMN1", "NEW_COLUMN2"),
    sep = "SEPARATING_CHARACTER"
  )
BloodType_split <- BloodType %>%
  separate(type, into = c("blood_type", "Rhfactor"), sep = "\\.")
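The effect of escaping the period can be seen with base R's strsplit(), which also treats its separator as a regular expression by default. The sketch below also shows separate_wider_delim() from tidyr 1.3.0 and later, which treats the delimiter as a fixed string by default, as one possible alternative; toy and the object names are made up for illustration.

```r
library(dplyr)
library(tidyr)

x <- "AB.+"

# Unescaped: "." matches every character, so nothing useful is left
unescaped <- strsplit(x, ".")[[1]]

# Escaped: only the literal period is used as the separator
escaped <- strsplit(x, "\\.")[[1]]

# fixed = TRUE turns off regular expressions entirely
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]

unescaped
escaped
fixed_split

# Alternative: a fixed-string delimiter, so no escaping is needed
toy <- tibble(type = c("A.-", "AB.+"))
toy_split <- toy %>%
  separate_wider_delim(type, delim = ".", names = c("blood_type", "Rhfactor"))

toy_split
```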
How many observations are there for each Rhfactor in the data object you just made?
count(BloodType_split, Rhfactor)
## # A tibble: 2 × 2
## Rhfactor n
## <chr> <int>
## 1 + 6
## 2 - 9
Filtering for patients with type O, how many had the infection?
BloodType_split %>%
  filter(blood_type == "O") %>%
  count(infection)
## # A tibble: 2 × 2
## infection n
## <chr> <int>
## 1 No 2
## 2 Yes 4
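If a proportion is wanted alongside the counts, summarize() can compute both in one step. The sketch below rebuilds only the six type O rows from the cleaned table printed earlier (type_o and the summary column names are made up for illustration) so that it runs on its own.

```r
library(dplyr)

# The six type O patients, copied from the cleaned BloodType output above
type_o <- tibble(
  blood_type = rep("O", 6),
  Rhfactor = c("-", "+", "-", "+", "-", "+"),
  infection = c("Yes", "No", "No", "Yes", "Yes", "Yes")
)

o_summary <- type_o %>%
  summarize(
    n_patients = n(),
    n_infected = sum(infection == "Yes"),
    prop_infected = mean(infection == "Yes")
  )

o_summary
```

With only six patients, the proportion is a description of this sample rather than evidence about type O patients in general.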