Load all the libraries we will use in this lab.
library(tidyverse)
Create a function that takes one argument, a vector, and returns the
square of the sum of the vector. Call it “sum_squared”.
Test your function on the vector c(2,7,21,30,90); you should get the
answer 22500.
# General format
NEW_FUNCTION <- function(x, y) x + y
or
# General format
NEW_FUNCTION <- function(x, y) {
  result <- x + y
  return(result)
}
nums <- c(2, 7, 21, 30, 90)
sum_squared <- function(x) sum(x)^2
sum_squared(x = nums)
## [1] 22500
sum_squared <- function(x) {
  out <- sum(x)^2
  return(out)
}
sum_squared(x = nums)
## [1] 22500
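The order of operations matters here: sum_squared() adds first and squares once, which is not the same as squaring each element and then adding. The sketch below contrasts the two; sum_of_squares is a hypothetical helper name introduced only for comparison and is not part of the lab.

```r
nums <- c(2, 7, 21, 30, 90)

# Square of the sum: add the elements first, then square once
sum_squared <- function(x) sum(x)^2

# Sum of squares (hypothetical helper): square each element, then add
sum_of_squares <- function(x) sum(x^2)

sum_squared(nums)     # 150^2 = 22500
sum_of_squares(nums)  # 4 + 49 + 441 + 900 + 8100 = 9494
```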
Create a function that takes two arguments, (1) a vector and (2) a
numeric value. This function tests whether the number (2) is contained
within the vector (1). Hint: use %in%. Call it has_n. Test your function
on the vector c(2,7,21,30,90) and number 21; you should get the answer
TRUE.
nums <- c(2, 7, 21, 30, 90)
a_num <- 21
has_n <- function(x, n) n %in% x
has_n(x = nums, n = a_num)
## [1] TRUE
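Because %in% is vectorized over its left-hand side, has_n() returns one result per element when n has several values. A small sketch of that behavior, assuming you may want to collapse the results with any() or all():

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n) n %in% x

# One logical result per element of n
has_n(x = nums, n = c(21, 11))        # TRUE FALSE

# Collapse to a single answer when needed
any(has_n(x = nums, n = c(21, 11)))   # TRUE: at least one value is present
all(has_n(x = nums, n = c(21, 11)))   # FALSE: 11 is not in nums
```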
Amend the function has_n from question 1.2 so that it takes a default
value of 21 for the numeric argument.
nums <- c(2, 7, 21, 30, 90)
a_num <- 21
has_n <- function(x, n = 21) n %in% x
has_n(x = nums)
## [1] TRUE
Create a new number b_num that is not contained within nums. Use your
updated has_n function with the default value and add b_num as the n
argument when calling the function. What is the outcome?
b_num <- 11
has_n(x = nums, n = b_num)
## [1] FALSE
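A default only applies when the caller leaves that argument out; any supplied value overrides it. The sketch below shows this, using base R's formals() to inspect the stored default.

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n = 21) n %in% x

formals(has_n)$n         # the stored default: 21
has_n(nums)              # default used: TRUE
has_n(nums, 11)          # positional override: FALSE
has_n(n = 11, x = nums)  # named arguments can come in any order: FALSE
```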
Read in the CalEnviroScreen data from https://daseh.org/data/CalEnviroScreen_data.csv. Assign the data the name “ces”.
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
# If downloaded
# ces <- read_csv("CalEnviroScreen_data.csv")
We want to get some summary statistics on water contamination. Use
across inside summarize to get the sum total of each variable whose name
contains the string “water” AND ends with “Pctl”. Hint: use contains()
AND ends_with() to select the right columns inside across. Remember that
NA values can influence calculations.
# General format
data %>%
  summarize(across(
    {vector or tidyselect},
    {some function}
  ))
ces %>%
  summarize(across(
    contains("Water") & ends_with("Pctl"),
    sum
  ))
## # A tibble: 1 × 3
## DrinkingWaterPctl GroundwaterThreatsPctl ImpWaterBodiesPctl
## <dbl> <dbl> <dbl>
## 1 NA 304029. 256802.
ces %>%
  summarize(across(
    contains("Water") & ends_with("Pctl"),
    function(x) sum(x, na.rm = TRUE)  # TRUE is safer than T, which can be reassigned
  ))
## # A tibble: 1 × 3
## DrinkingWaterPctl GroundwaterThreatsPctl ImpWaterBodiesPctl
## <dbl> <dbl> <dbl>
## 1 403640. 304029. 256802.
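The same selection logic can be checked on a small table where the expected totals are easy to verify by hand. The sketch below uses a hypothetical toy tibble (not the real CalEnviroScreen data), a formula-style lambda in place of function(x), and the .names argument of across() to label the output columns.

```r
library(dplyr)

# Hypothetical toy data; column names only mimic the CalEnviroScreen style
toy <- tibble(
  DrinkingWaterPctl      = c(10, NA, 30),
  GroundwaterThreatsPctl = c(5, 15, 25),
  DrinkingWater          = c(1, 2, 3),    # contains "water" but does not end in "Pctl"
  OzonePctl              = c(50, 60, 70)  # ends in "Pctl" but does not contain "water"
)

totals <- toy %>%
  summarize(across(
    contains("water") & ends_with("Pctl"),  # both conditions must hold
    ~ sum(.x, na.rm = TRUE),                # formula-style anonymous function
    .names = "{.col}_total"                 # rename outputs to make the summary explicit
  ))

totals
```

Only the two columns that satisfy both conditions are summarized, and the NA in DrinkingWaterPctl is dropped before summing.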
Use across and mutate to convert all columns containing the word “water”
into proportions (i.e., divide that value by 100). Hint: use contains()
to select the right columns within across(). Use an anonymous function
(“function on the fly”) to divide by 100 (function(x) x / 100). It will
also be easier to check your work if you select() columns that match
“Pctl”.
# General format
data %>%
  mutate(across(
    {vector or tidyselect},
    {some function}
  ))
ces %>%
  mutate(across(
    contains("water"),
    function(x) x / 100
  )) %>%
  select(contains("Pctl"))
## # A tibble: 8,035 × 23
## OzonePctl PM2.5.Pctl DieselPMPctl DrinkingWaterPctl LeadPctl PesticidesPctl
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 3.12 36.3 34.8 0.0421 7.74 0
## 2 3.12 42.0 92.7 0.0421 68.2 0
## 3 3.12 43.9 89.8 0.0421 64.2 0
## 4 3.12 42.8 79.1 0.0421 67.1 0
## 5 3.12 42.8 67.6 0.0421 68.0 0
## 6 3.12 42.8 83.8 0.0421 69.7 0
## 7 3.12 43.3 81.3 0.0421 76.9 0
## 8 3.12 44.0 68.7 0.0421 73.2 0
## 9 3.12 44.0 81.1 0.0421 86.7 0
## 10 3.12 45.8 99.4 0.0421 88.5 0
## # ℹ 8,025 more rows
## # ℹ 17 more variables: ToxReleasePctl <dbl>, TrafficPctl <dbl>,
## # CleanupSitesPctl <dbl>, GroundwaterThreatsPctl <dbl>, HazWastePctl <dbl>,
## # ImpWaterBodiesPctl <dbl>, SolidWastePctl <dbl>, PollutionBurdenPctl <dbl>,
## # AsthmaPctl <dbl>, LowBirthWeightPctl <dbl>,
## # CardiovascularDiseasePctl <dbl>, PopCharPctl <dbl>, EducationPctl <dbl>,
## # LinguisticIsolPctl <dbl>, PovertyPctl <dbl>, UnemploymentPctl <dbl>, …
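Note that mutate(across(...)) overwrites the selected columns in place. If you would rather keep the original percentiles alongside the proportions, one option is the .names argument of across(). The sketch below uses a hypothetical toy tibble, and the "_prop" suffix is an arbitrary choice.

```r
library(dplyr)

# Hypothetical toy data standing in for a few ces columns
toy <- tibble(
  DrinkingWaterPctl  = c(4.21, 50, 100),
  ImpWaterBodiesPctl = c(0, 25, 75),
  OzonePctl          = c(3.12, 40, 90)
)

props <- toy %>%
  mutate(across(
    contains("water"),
    function(x) x / 100,
    .names = "{.col}_prop"  # write results to new columns instead of overwriting
  ))

names(props)
```

The result keeps the three original columns and adds DrinkingWaterPctl_prop and ImpWaterBodiesPctl_prop; OzonePctl is untouched because its name does not contain "water".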
Use across and mutate to convert all columns starting with the string
“PM” into a binary variable: TRUE if the value is greater than 10 and
FALSE if less than or equal to 10. Hint: use starts_with() to select the
columns that start with “PM”. Use an anonymous function (“function on
the fly”) to do a logical test if the value is greater than 10.
ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  ))
## # A tibble: 8,035 × 68
## ...1 CensusTract CaliforniaCounty ZIP Longitude Latitude ApproxLocation
## <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 1 6001400100 Alameda 94704 -122. 37.9 Oakland
## 2 2 6001400200 Alameda 94618 -122. 37.8 Oakland
## 3 3 6001400300 Alameda 94618 -122. 37.8 Oakland
## 4 4 6001400400 Alameda 94609 -122. 37.8 Oakland
## 5 5 6001400500 Alameda 94609 -122. 37.8 Oakland
## 6 6 6001400600 Alameda 94609 -122. 37.8 Oakland
## 7 7 6001400700 Alameda 94608 -122. 37.8 Oakland
## 8 8 6001400800 Alameda 94608 -122. 37.8 Oakland
## 9 9 6001400900 Alameda 94608 -122. 37.8 Oakland
## 10 10 6001401000 Alameda 94608 -122. 37.8 Oakland
## # ℹ 8,025 more rows
## # ℹ 61 more variables: CES4.0Score <dbl>, CES4.0Percentile <dbl>,
## # CES4.0PercRange <chr>, Ozone <dbl>, OzonePctl <dbl>, PM2.5 <lgl>,
## # PM2.5.Pctl <lgl>, DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## # DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## # PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## # Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>, …
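Two details of the logical test are worth checking on a small example: a value of exactly 10 becomes FALSE, and a missing value stays NA rather than becoming FALSE. The sketch below uses a hypothetical toy tibble and also shows one way to summarize the resulting flags.

```r
library(dplyr)

# Hypothetical toy data; only the column names follow the ces data
toy <- tibble(
  PM2.5      = c(8.7, 10, 12.3, NA),
  PM2.5.Pctl = c(36.3, 42.0, 5.5, 44.0),
  Ozone      = c(0.03, 0.04, 0.05, 0.06)
)

flags <- toy %>%
  mutate(across(starts_with("PM"), function(x) x > 10))

flags$PM2.5  # FALSE FALSE TRUE NA

# Share of non-missing rows above the threshold in each flagged column
shares <- flags %>%
  summarize(across(starts_with("PM"), function(x) mean(x, na.rm = TRUE)))

shares
```

Taking the mean of a logical column gives the proportion of TRUE values, so shares holds 1/3 for PM2.5 and 0.75 for PM2.5.Pctl in this toy data.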
Take your code from question 2.4 and assign it to the variable ces_dat.
Use filter() to drop any rows where “Oakland” appears in ApproxLocation. Make sure to reassign this to ces_dat.
Create a boxplot (geom_boxplot()) where (1) the x-axis is Asthma and (2) the y-axis is PM2.5.
Add a labs() layer so that the x-axis is labeled “ER Visits for Asthma: PM2.5 greater than 10”.
ces_dat <-
  ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  )) %>%
  filter(ApproxLocation != "Oakland")
ces_boxplot <- function(df) {
  ggplot(df) +
    geom_boxplot(aes(
      x = `Asthma`,
      y = `PM2.5`
    )) +
    labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
}
ces_boxplot(ces_dat)
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
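The warning above reports rows that geom_boxplot() dropped because the plotted value was missing or otherwise non-finite. One way to make that explicit is to count and filter those rows before plotting. The sketch below uses a hypothetical toy tibble, and ces_boxplot_clean is a hypothetical variant of the lab's function, not a replacement for it.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data standing in for ces_dat
toy_dat <- tibble(
  Asthma = c(20.5, 35.1, NA, 48.0, 61.2, 15.3),
  PM2.5  = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
)

# Number of rows the boxplot layer would drop with a warning
n_missing <- sum(!is.finite(toy_dat$Asthma))

# Hypothetical variant: remove non-finite Asthma values up front
ces_boxplot_clean <- function(df) {
  df %>%
    filter(is.finite(Asthma)) %>%
    ggplot() +
    geom_boxplot(aes(x = Asthma, y = PM2.5)) +
    labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
}

p <- ces_boxplot_clean(toy_dat)
```

Filtering first does not change the plot, since the same rows would be dropped anyway, but it makes the exclusion a deliberate step and gives you n_missing to report.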