Part 1

Load all the libraries we will use in this lab.

library(tidyverse)

1.1

Create a function that takes one argument, a vector, and returns the square of the sum of its elements. Call it “sum_squared”. Test your function on the vector c(2,7,21,30,90) - you should get the answer 22500.

# General format
NEW_FUNCTION <- function(x, y) x + y 

or

# General format
NEW_FUNCTION <- function(x, y) {
  result <- x + y
  return(result)
}

nums <- c(2, 7, 21, 30, 90)

sum_squared <- function(x) sum(x)^2
sum_squared(x = nums)
## [1] 22500

sum_squared <- function(x) {
  out <- sum(x)^2
  return(out)
}
sum_squared(x = nums)
## [1] 22500
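
Because "the square of the sum" is easy to confuse with "the sum of the squares", the sketch below contrasts the two on the same vector. The helper sum_of_squares is a hypothetical name introduced here only for comparison; it is not part of the lab.

```r
nums <- c(2, 7, 21, 30, 90)

# The lab's function: add the elements first, then square the total.
sum_squared <- function(x) sum(x)^2

# Hypothetical helper for contrast: square each element, then add.
sum_of_squares <- function(x) sum(x^2)

sum_squared(nums)     # 150^2 = 22500
sum_of_squares(nums)  # 4 + 49 + 441 + 900 + 8100 = 9494
```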

1.2

Create a function that takes two arguments, (1) a vector and (2) a numeric value. This function tests whether the number (2) is contained within the vector (1). Hint: use %in%. Call it has_n. Test your function on the vector c(2,7,21,30,90) and number 21 - you should get the answer TRUE.

nums <- c(2, 7, 21, 30, 90)
a_num <- 21

has_n <- function(x, n) n %in% x
has_n(x = nums, n = a_num)
## [1] TRUE
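
As a small aside, %in% is vectorized over its left-hand side, so the same has_n definition also accepts several candidate numbers at once and returns one TRUE/FALSE per candidate. The vector of candidates below is made up for illustration.

```r
nums <- c(2, 7, 21, 30, 90)

has_n <- function(x, n) n %in% x

# One result per element of n: 21 is in nums, 11 is not.
several <- has_n(x = nums, n = c(21, 11))
several
## [1]  TRUE FALSE
```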

1.3

Amend the function has_n from question 1.2 so that it takes a default value of 21 for the numeric argument.

nums <- c(2, 7, 21, 30, 90)
a_num <- 21

has_n <- function(x, n = 21) n %in% x
has_n(x = nums)
## [1] TRUE

P.1

Create a new number b_num that is not contained within nums. Use your updated has_n function with the default value and add b_num as the n argument when calling the function. What is the outcome?

b_num <- 11
has_n(x = nums, n = b_num)
## [1] FALSE
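
The result is FALSE because an explicitly supplied n overrides the default. A minimal sketch of the ways the function can be called, assuming the has_n definition with n = 21 as its default:

```r
nums <- c(2, 7, 21, 30, 90)

has_n <- function(x, n = 21) n %in% x

default_call  <- has_n(nums)              # n falls back to 21
override_call <- has_n(nums, 11)          # positional n replaces the default
named_call    <- has_n(n = 90, x = nums)  # named arguments can come in any order

c(default_call, override_call, named_call)
## [1]  TRUE FALSE  TRUE
```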

Part 2

2.1

Read in the CalEnviroScreen data from https://daseh.org/data/CalEnviroScreen_data.csv. Assign the data the name “ces”.

ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
# If downloaded
# ces <- read_csv("CalEnviroScreen_data.csv")

2.2

We want to get some summary statistics on water contamination. Use across inside summarize to get the sum total of each variable whose name contains the string “water” AND ends with “Pctl”. Hint: use contains() AND ends_with() to select the right columns inside across. Remember that NA values can influence calculations.

# General format
data %>%
  summarize(across(
    {vector or tidyselect},
    {some function}
  ))

ces %>%
  summarize(across(
    contains("Water") & ends_with("Pctl"),
    sum
  ))
## # A tibble: 1 × 3
##   DrinkingWaterPctl GroundwaterThreatsPctl ImpWaterBodiesPctl
##               <dbl>                  <dbl>              <dbl>
## 1                NA                304029.            256802.

ces %>%
  summarize(across(
    contains("Water") & ends_with("Pctl"),
    function(x) sum(x, na.rm = TRUE)
  ))
## # A tibble: 1 × 3
##   DrinkingWaterPctl GroundwaterThreatsPctl ImpWaterBodiesPctl
##               <dbl>                  <dbl>              <dbl>
## 1           403640.                304029.            256802.
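
To see why the first attempt returns NA for one column, the sketch below repeats both calls on a tiny made-up tibble. The column names mimic the lab data, but the values are invented. It also shows the \(x) shorthand for an anonymous function, which assumes R 4.1 or later.

```r
library(dplyr)

# Toy stand-in for `ces`; values are invented for illustration.
toy <- tibble(
  DrinkingWaterPctl      = c(10, NA, 30),
  GroundwaterThreatsPctl = c(5, 15, 25),
  OzonePctl              = c(1, 2, 3)
)

# A single NA makes the whole sum NA.
with_na <- toy %>%
  summarize(across(contains("Water") & ends_with("Pctl"), sum))

# Dropping NA values first gives a usable total.
no_na <- toy %>%
  summarize(across(
    contains("Water") & ends_with("Pctl"),
    function(x) sum(x, na.rm = TRUE)
  ))

# Same calculation with the shorter lambda syntax (R >= 4.1).
no_na_short <- toy %>%
  summarize(across(
    contains("Water") & ends_with("Pctl"),
    \(x) sum(x, na.rm = TRUE)
  ))
```

Here with_na holds NA and 45, while no_na and no_na_short both hold 40 and 45. OzonePctl is left out because its name does not contain "water"; contains() ignores case by default, which is why GroundwaterThreatsPctl is still selected.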

2.3

Use across and mutate to convert all columns containing the word “water” into proportions (i.e., divide that value by 100). Hint: use contains() to select the right columns within across(). Use an anonymous function (“function on the fly”) to divide by 100 (function(x) x / 100). It will also be easier to check your work if you select() columns that match “Pctl”.

# General format
data %>%
  mutate(across(
    {vector or tidyselect},
    {some function}
  ))

ces %>%
  mutate(across(
    contains("water"),
    function(x) x / 100
  )) %>%
  select(contains("Pctl"))
## # A tibble: 8,035 × 23
##    OzonePctl PM2.5.Pctl DieselPMPctl DrinkingWaterPctl LeadPctl PesticidesPctl
##        <dbl>      <dbl>        <dbl>             <dbl>    <dbl>          <dbl>
##  1      3.12       36.3         34.8            0.0421     7.74              0
##  2      3.12       42.0         92.7            0.0421    68.2               0
##  3      3.12       43.9         89.8            0.0421    64.2               0
##  4      3.12       42.8         79.1            0.0421    67.1               0
##  5      3.12       42.8         67.6            0.0421    68.0               0
##  6      3.12       42.8         83.8            0.0421    69.7               0
##  7      3.12       43.3         81.3            0.0421    76.9               0
##  8      3.12       44.0         68.7            0.0421    73.2               0
##  9      3.12       44.0         81.1            0.0421    86.7               0
## 10      3.12       45.8         99.4            0.0421    88.5               0
## # ℹ 8,025 more rows
## # ℹ 17 more variables: ToxReleasePctl <dbl>, TrafficPctl <dbl>,
## #   CleanupSitesPctl <dbl>, GroundwaterThreatsPctl <dbl>, HazWastePctl <dbl>,
## #   ImpWaterBodiesPctl <dbl>, SolidWastePctl <dbl>, PollutionBurdenPctl <dbl>,
## #   AsthmaPctl <dbl>, LowBirthWeightPctl <dbl>,
## #   CardiovascularDiseasePctl <dbl>, PopCharPctl <dbl>, EducationPctl <dbl>,
## #   LinguisticIsolPctl <dbl>, PovertyPctl <dbl>, UnemploymentPctl <dbl>, …
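
Note that contains("water") on its own also matches the raw columns (for example DrinkingWater), so those get divided by 100 as well. If that is not wanted, one possible variation is sketched below on a made-up tibble: it restricts the selection to percentile columns and uses the .names argument of across() to write the proportions to new columns instead of overwriting the originals. The "_prop" suffix is an arbitrary choice for this example.

```r
library(dplyr)

# Toy stand-in for `ces`; values are invented for illustration.
toy <- tibble(
  DrinkingWater     = c(400, 250),
  DrinkingWaterPctl = c(4.21, 50),
  OzonePctl         = c(3.12, 60)
)

props <- toy %>%
  mutate(across(
    contains("water") & ends_with("Pctl"),
    function(x) x / 100,
    .names = "{.col}_prop"  # keep the original column, add a new one
  ))

names(props)
## [1] "DrinkingWater"          "DrinkingWaterPctl"      "OzonePctl"
## [4] "DrinkingWaterPctl_prop"
```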

Practice on Your Own!

P.2

Use across and mutate to convert all columns starting with the string “PM” into a binary variable: TRUE if the value is greater than 10 and FALSE if less than or equal to 10. Hint: use starts_with() to select the columns that start with “PM”. Use an anonymous function (“function on the fly”) to do a logical test if the value is greater than 10.

ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  ))
## # A tibble: 8,035 × 68
##     ...1 CensusTract CaliforniaCounty   ZIP Longitude Latitude ApproxLocation
##    <dbl>       <dbl> <chr>            <dbl>     <dbl>    <dbl> <chr>         
##  1     1  6001400100 Alameda          94704     -122.     37.9 Oakland       
##  2     2  6001400200 Alameda          94618     -122.     37.8 Oakland       
##  3     3  6001400300 Alameda          94618     -122.     37.8 Oakland       
##  4     4  6001400400 Alameda          94609     -122.     37.8 Oakland       
##  5     5  6001400500 Alameda          94609     -122.     37.8 Oakland       
##  6     6  6001400600 Alameda          94609     -122.     37.8 Oakland       
##  7     7  6001400700 Alameda          94608     -122.     37.8 Oakland       
##  8     8  6001400800 Alameda          94608     -122.     37.8 Oakland       
##  9     9  6001400900 Alameda          94608     -122.     37.8 Oakland       
## 10    10  6001401000 Alameda          94608     -122.     37.8 Oakland       
## # ℹ 8,025 more rows
## # ℹ 61 more variables: CES4.0Score <dbl>, CES4.0Percentile <dbl>,
## #   CES4.0PercRange <chr>, Ozone <dbl>, OzonePctl <dbl>, PM2.5 <lgl>,
## #   PM2.5.Pctl <lgl>, DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## #   DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## #   PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## #   Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>, …
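
The matched columns now have type <lgl>. A small sketch on invented values shows two details of the comparison: a value of exactly 10 becomes FALSE, and a missing value stays NA rather than becoming FALSE.

```r
library(dplyr)

# Toy stand-in for `ces`; values are invented for illustration.
toy <- tibble(
  PM2.5      = c(8, 12, NA, 10),
  PM2.5.Pctl = c(36.3, 5, 42.8, NA),
  Ozone      = c(0.03, 0.04, 0.05, 0.06)
)

flagged <- toy %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  ))

flagged$PM2.5
## [1] FALSE  TRUE    NA FALSE

# Count the rows above the threshold, ignoring missing values.
n_above <- sum(flagged$PM2.5, na.rm = TRUE)
n_above
## [1] 1
```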

P.3

Take your code from question P.2 and assign it to the variable ces_dat.

  • Use filter() to drop any rows where “Oakland” appears in ApproxLocation. Make sure to reassign this to ces_dat.
  • Create a ggplot boxplot (geom_boxplot()) where (1) the x-axis is PM2.5 and (2) the y-axis is Asthma.
  • Change the labs() layer so that the x-axis label is “ER Visits for Asthma: PM2.5 greater than 10”.

ces_dat <-
  ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  )) %>%
  filter(ApproxLocation != "Oakland")

ces_boxplot <- function(df) {
  ggplot(df) +
    # x is the TRUE/FALSE PM2.5 flag, y is the asthma measure, as the question asks
    geom_boxplot(aes(
      x = `PM2.5`,
      y = `Asthma`
    )) +
    labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
}
ces_boxplot(ces_dat)
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
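
Two details of this answer can be checked on a toy example. First, filter() keeps only rows where the condition is TRUE, so a != comparison also drops rows where ApproxLocation is NA. Second, the warning most likely counts rows with a missing Asthma value, which the boxplot cannot draw. The data below are invented, and whether NA locations should be kept is a judgment call for the analyst.

```r
library(dplyr)

# Toy stand-in for `ces_dat`; values are invented for illustration.
toy <- tibble(
  ApproxLocation = c("Oakland", "Berkeley", NA),
  Asthma         = c(50, NA, 30)
)

# The NA location is dropped along with "Oakland".
dropped_na <- toy %>%
  filter(ApproxLocation != "Oakland")

# Keeping rows with an unknown location requires saying so explicitly.
kept_na <- toy %>%
  filter(ApproxLocation != "Oakland" | is.na(ApproxLocation))

# Rows a boxplot of Asthma would have to skip.
n_missing_asthma <- sum(is.na(toy$Asthma))

c(nrow(dropped_na), nrow(kept_na), n_missing_asthma)
## [1] 1 2 1
```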