Part 1

Load the tidyverse package.

library(tidyverse)

1.1

Create a function that:

  • Takes one argument, a vector.
  • Sums the vector, squares the result, and returns it.
  • Call it “sum_squared”.
  • Test your function on the vector c(2,7,21,30,90) - you should get the answer 22500.
  • Format is NEW_FUNCTION <- function(x, y) x + y
nums <- c(2, 7, 21, 30, 90)

sum_squared <- function(x) sum(x)^2
sum_squared(x = nums)
## [1] 22500
# Equivalent version with an explicit body and return()
sum_squared <- function(x) {
  out <- sum(x)^2
  return(out)
}
sum_squared(x = nums)
## [1] 22500
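
The square of the sum is easy to confuse with the sum of the squares. The sketch below contrasts the two on the same vector. `sum_of_squares` is a hypothetical helper added only for comparison; it is not part of the lab.

```r
nums <- c(2, 7, 21, 30, 90)

# Square of the sum: add everything up first, then square once.
sum_squared <- function(x) sum(x)^2

# Hypothetical comparison helper: square each element first, then add.
sum_of_squares <- function(x) sum(x^2)

sum_squared(nums)     # (2 + 7 + 21 + 30 + 90)^2 = 150^2 = 22500
sum_of_squares(nums)  # 4 + 49 + 441 + 900 + 8100 = 9494
```

Only the first matches the answer the question asks for.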

1.2

Create a function that:

  • Takes two arguments: (1) a vector and (2) a numeric value.
  • This function tests whether the number (2) is contained within the vector (1). Hint: use %in%.
  • Call it has_n.
  • Test your function on the vector c(2,7,21,30,90) and number 21 - you should get the answer TRUE.
nums <- c(2, 7, 21, 30, 90)
a_num <- 21

has_n <- function(x, n) n %in% x
has_n(x = nums, n = a_num)
## [1] TRUE
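
As a small illustration of how `%in%` behaves, the sketch below calls `has_n` with a value that is absent and with several candidate values at once. The extra test values (5 and 90) and the name `several` are chosen here for illustration only.

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n) n %in% x

has_n(x = nums, n = 21)  # TRUE
has_n(x = nums, n = 5)   # FALSE

# %in% is vectorized over its left-hand side, so a vector of candidates
# returns one TRUE/FALSE per candidate.
several <- has_n(x = nums, n = c(21, 5, 90))
several                  # TRUE FALSE TRUE
any(several)             # TRUE: at least one candidate is in nums
```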

1.3

Amend the function has_n from question 1.2 so that it takes a default value of 21 for the numeric argument.

nums <- c(2, 7, 21, 30, 90)
a_num <- 21

has_n <- function(x, n = 21) n %in% x
has_n(x = nums)
## [1] TRUE
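
A default value only applies when the argument is left out. The following sketch, using test values picked for illustration, shows the default being used, being overridden by name, and being overridden by position.

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n = 21) n %in% x

uses_default <- has_n(x = nums)         # n falls back to 21
overridden   <- has_n(x = nums, n = 5)  # an explicit n replaces the default
positional   <- has_n(nums, 90)         # arguments matched by position

uses_default  # TRUE
overridden    # FALSE
positional    # TRUE
```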

P.1

Create a function for the CalEnviroScreen Data.

  • Read in the data from https://daseh.org/data/CalEnviroScreen_data.csv
  • The function takes an argument for a column name. (use {{col_name}})
  • The function creates a ggplot with {{col_name}} on the x-axis and Poverty on the y-axis.
  • Use geom_point()
  • Test the function using the Lead and HousingBurden columns, or other columns of your choice.
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
plot_ces <- function(col_name){
  ggplot(data = ces, aes(x = {{col_name}}, y = Poverty)) +
    geom_point()
}

plot_ces(Lead)
## Warning: Removed 99 rows containing missing values or values outside the scale range
## (`geom_point()`).

plot_ces(HousingBurden)
## Warning: Removed 147 rows containing missing values or values outside the scale range
## (`geom_point()`).
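
The same `{{ }}` pattern can be extended so the data frame and the y-axis column are arguments too, rather than being fixed inside the function. This is one possible sketch: `plot_xy`, `x_col`, `y_col`, and the `toy` data frame (with made-up values) are hypothetical names introduced here, not part of the lab.

```r
library(ggplot2)

# Toy data standing in for the real dataset (values are made up).
toy <- data.frame(
  Lead    = c(15, 62, 59, 61),
  Poverty = c(10, 11, 10, 21)
)

# Hypothetical variant of plot_ces() that also takes the data and the y column.
plot_xy <- function(data, x_col, y_col) {
  ggplot(data = data, aes(x = {{ x_col }}, y = {{ y_col }})) +
    geom_point()
}

# Column names are passed unquoted, just as with plot_ces(Lead).
p <- plot_xy(toy, Lead, Poverty)
p
```

Passing the data as an argument avoids relying on an object named `ces` existing in the global environment.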

Part 2

2.1

Read in the CalEnviroScreen data from https://daseh.org/data/CalEnviroScreen_data.csv. Assign the data the name “ces”.

ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
# If downloaded
# ces <- read_csv("CalEnviroScreen_data.csv")

2.2

We want to get some summary statistics on water contamination.

  • Use across inside summarize.
  • Choose columns about “water”. Hint: use contains("water") inside across.
  • Use mean as the function inside of across.
  • Remember that NA values can influence calculations.
# General format
data %>%
  summarize(across(
    {vector or tidyselect},
    {some function}
  ))
ces %>%
  summarize(across(
    contains("water"),
    mean
  ))
## # A tibble: 1 × 6
##   DrinkingWater DrinkingWaterPctl GroundwaterThreats GroundwaterThreatsPctl
##           <dbl>             <dbl>              <dbl>                  <dbl>
## 1            NA                NA               16.7                   37.8
## # ℹ 2 more variables: ImpWaterBodies <dbl>, ImpWaterBodiesPctl <dbl>
# Accounting for NA
ces %>%
  summarize(across(
    contains("water"),
    function(x) mean(x, na.rm = TRUE)
  ))
## # A tibble: 1 × 6
##   DrinkingWater DrinkingWaterPctl GroundwaterThreats GroundwaterThreatsPctl
##           <dbl>             <dbl>              <dbl>                  <dbl>
## 1          477.              50.4               16.7                   37.8
## # ℹ 2 more variables: ImpWaterBodies <dbl>, ImpWaterBodiesPctl <dbl>
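
The output above includes both DrinkingWater and GroundwaterThreats because `contains()` ignores case by default. The sketch below reproduces that behavior, and the effect of `na.rm`, on a tiny made-up tibble so the numbers can be checked by hand. `toy`, `with_na`, and `no_na` are hypothetical names.

```r
library(dplyr)

# Toy data with one missing value (made-up numbers, not the real dataset).
toy <- tibble(
  DrinkingWater      = c(100, NA, 300),
  GroundwaterThreats = c(10, 20, 30),
  Lead               = c(1, 2, 3)
)

# Without na.rm, a single NA makes that column's mean NA.
with_na <- toy %>%
  summarize(across(contains("water"), mean))

# With na.rm = TRUE, the NA is dropped before averaging: (100 + 300) / 2 = 200.
no_na <- toy %>%
  summarize(across(contains("water"), function(x) mean(x, na.rm = TRUE)))

with_na  # DrinkingWater is NA, GroundwaterThreats is 20
no_na    # DrinkingWater is 200, GroundwaterThreats is 20
```

Lead is left out of both results because its name does not contain "water".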

2.3

Convert all columns that are percentiles into proportions.

  • Use across and mutate
  • Choose columns that contain “Pctl” in the name. Hint: use contains("Pctl") inside across.
  • Use an anonymous function (“function on the fly”) to divide by 100 (function(x) x / 100).
  • Check your work. The result is easier to inspect if you add select(contains("Pctl")).
# General format
data %>%
  mutate(across(
    {vector or tidyselect},
    {some function}
  ))
ces %>%
  mutate(across(
    contains("Pctl"),
    function(x) x / 100
  )) %>%
  select(contains("Pctl"))
## # A tibble: 8,035 × 23
##    OzonePctl PM2.5.Pctl DieselPMPctl DrinkingWaterPctl LeadPctl PesticidesPctl
##        <dbl>      <dbl>        <dbl>             <dbl>    <dbl>          <dbl>
##  1    0.0312      0.363        0.348            0.0421   0.0774              0
##  2    0.0312      0.420        0.927            0.0421   0.682               0
##  3    0.0312      0.439        0.898            0.0421   0.642               0
##  4    0.0312      0.428        0.791            0.0421   0.671               0
##  5    0.0312      0.428        0.676            0.0421   0.680               0
##  6    0.0312      0.428        0.838            0.0421   0.697               0
##  7    0.0312      0.433        0.813            0.0421   0.769               0
##  8    0.0312      0.440        0.687            0.0421   0.732               0
##  9    0.0312      0.440        0.811            0.0421   0.867               0
## 10    0.0312      0.458        0.994            0.0421   0.885               0
## # ℹ 8,025 more rows
## # ℹ 17 more variables: ToxReleasePctl <dbl>, TrafficPctl <dbl>,
## #   CleanupSitesPctl <dbl>, GroundwaterThreatsPctl <dbl>, HazWastePctl <dbl>,
## #   ImpWaterBodiesPctl <dbl>, SolidWastePctl <dbl>, PollutionBurdenPctl <dbl>,
## #   AsthmaPctl <dbl>, LowBirthWeightPctl <dbl>,
## #   CardiovascularDiseasePctl <dbl>, PopCharPctl <dbl>, EducationPctl <dbl>,
## #   LinguisticIsolPctl <dbl>, PovertyPctl <dbl>, UnemploymentPctl <dbl>, …
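
The solution above overwrites the percentile columns in place. If the original values should be kept alongside the proportions, one option is the `.names` argument of `across()`. This is a sketch on made-up data; the `_prop` suffix and the names `toy` and `converted` are arbitrary choices for illustration.

```r
library(dplyr)

# Toy data (made-up values, not the real dataset).
toy <- tibble(
  OzonePctl = c(3.12, 50, 100),
  LeadPctl  = c(7.74, 68.2, 0),
  Lead      = c(15.25, 61.71, 58.82)
)

converted <- toy %>%
  mutate(across(
    contains("Pctl"),
    function(x) x / 100,
    .names = "{.col}_prop"  # write results to new columns, keep the originals
  ))

names(converted)
# "OzonePctl" "LeadPctl" "Lead" "OzonePctl_prop" "LeadPctl_prop"
```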

Practice on Your Own!

P.2

Use across and mutate to convert all columns starting with the string “PM” into a binary variable: TRUE if the value is greater than 10 and FALSE if less than or equal to 10.

  • Hint: use starts_with() to select the columns that start with “PM”.
  • Use an anonymous function (“function on the fly”) to do a logical test if the value is greater than 10.
  • A logical test with mutate (x > 10) will automatically fill a column with TRUE/FALSE.
ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  )) %>% 
  glimpse() # add glimpse to view the changes
## Rows: 8,035
## Columns: 67
## $ CensusTract               <dbl> 6001400100, 6001400200, 6001400300, 60014004…
## $ CaliforniaCounty          <chr> "Alameda", "Alameda", "Alameda", "Alameda", …
## $ ZIP                       <dbl> 94704, 94618, 94618, 94609, 94609, 94609, 94…
## $ Longitude                 <dbl> -122.2319, -122.2496, -122.2544, -122.2575, …
## $ Latitude                  <dbl> 37.86759, 37.84817, 37.84060, 37.84821, 37.8…
## $ ApproxLocation            <chr> "Oakland", "Oakland", "Oakland", "Oakland", …
## $ CES4.0Score               <dbl> 4.85, 4.88, 11.20, 12.39, 16.73, 20.02, 36.7…
## $ CES4.0Percentile          <dbl> 2.80, 2.87, 15.94, 18.97, 29.74, 37.59, 70.1…
## $ CES4.0PercRange           <chr> "1-5% (lowest scores)", "1-5% (lowest scores…
## $ Ozone                     <dbl> 0.029, 0.029, 0.029, 0.029, 0.029, 0.029, 0.…
## $ OzonePctl                 <dbl> 3.12, 3.12, 3.12, 3.12, 3.12, 3.12, 3.12, 3.…
## $ PM2.5                     <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ PM2.5.Pctl                <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ DieselPM                  <dbl> 0.093, 0.591, 0.497, 0.327, 0.233, 0.383, 0.…
## $ DieselPMPctl              <dbl> 34.76, 92.71, 89.77, 79.10, 67.58, 83.76, 81…
## $ DrinkingWater             <dbl> 110.41, 110.41, 110.41, 110.41, 110.41, 110.…
## $ DrinkingWaterPctl         <dbl> 4.21, 4.21, 4.21, 4.21, 4.21, 4.21, 4.21, 4.…
## $ Lead                      <dbl> 15.25, 61.71, 58.82, 60.88, 61.56, 62.83, 68…
## $ LeadPctl                  <dbl> 7.74, 68.20, 64.18, 67.08, 67.95, 69.70, 76.…
## $ Pesticides                <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PesticidesPctl            <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ToxRelease                <dbl> 626.7743, 609.9600, 598.8777, 624.6293, 641.…
## $ ToxReleasePctl            <dbl> 56.03, 55.43, 55.04, 55.90, 56.48, 55.65, 55…
## $ Traffic                   <dbl> 965.9462, 716.8225, 776.4908, 721.5428, 862.…
## $ TrafficPctl               <dbl> 55.94, 37.49, 42.48, 38.00, 48.68, 67.06, 52…
## $ CleanupSites              <dbl> 9.00, 0.00, 0.90, 0.00, 3.50, 1.75, 9.15, 47…
## $ CleanupSitesPctl          <dbl> 58.17, 0.00, 11.83, 0.00, 33.87, 22.62, 58.5…
## $ GroundwaterThreats        <dbl> 12.25, 44.25, 38.55, 60.50, 37.00, 29.80, 56…
## $ GroundwaterThreatsPctl    <dbl> 52.42, 87.93, 85.29, 92.56, 84.34, 79.06, 91…
## $ HazWaste                  <dbl> 2.000, 0.165, 0.595, 0.315, 0.350, 0.370, 0.…
## $ HazWastePctl              <dbl> 92.52, 28.51, 74.07, 51.89, 56.40, 58.27, 72…
## $ ImpWaterBodies            <dbl> 2, 0, 0, 0, 0, 0, 10, 10, 10, 10, 0, 0, 11, …
## $ ImpWaterBodiesPctl        <dbl> 23.88, 0.00, 0.00, 0.00, 0.00, 0.00, 82.97, …
## $ SolidWaste                <dbl> 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
## $ SolidWastePctl            <dbl> 35.72, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ PollutionBurden           <dbl> 34.02, 33.02, 36.64, 33.82, 35.88, 37.86, 46…
## $ PollutionBurdenScore      <dbl> 4.15, 4.03, 4.47, 4.13, 4.38, 4.62, 5.72, 5.…
## $ PollutionBurdenPctl       <dbl> 26.62, 24.18, 33.37, 26.24, 31.40, 36.94, 61…
## $ Asthma                    <dbl> 15.65, 20.47, 30.88, 49.61, 86.57, 101.53, 1…
## $ AsthmaPctl                <dbl> 4.44, 9.80, 26.57, 55.98, 88.38, 93.07, 93.8…
## $ LowBirthWeight            <dbl> 3.85, 4.05, 3.78, 4.44, 3.64, 2.56, 5.60, 7.…
## $ LowBirthWeightPctl        <dbl> 23.06, 27.92, 21.62, 37.02, 19.00, 5.03, 66.…
## $ CardiovascularDisease     <dbl> 5.24, 8.14, 8.88, 8.08, 11.13, 12.80, 10.63,…
## $ CardiovascularDiseasePctl <dbl> 1.42, 14.53, 20.11, 14.28, 38.87, 52.78, 35.…
## $ TotalPop                  <dbl> 3120, 2007, 5051, 4007, 4124, 1745, 5128, 40…
## $ ChildrenPercLess10        <dbl> 7.82, 10.46, 11.42, 9.38, 9.12, 9.97, 7.62, …
## $ PopPerc10to64             <dbl> 66.12, 66.32, 73.04, 78.79, 81.96, 81.43, 82…
## $ ElderlyMore64             <dbl> 26.06, 23.22, 15.54, 11.83, 8.92, 8.60, 9.93…
## $ HispanicPerc              <dbl> 3.78, 8.67, 6.95, 12.10, 9.46, 7.51, 19.38, …
## $ WhitePerc                 <dbl> 74.26, 73.49, 67.99, 63.74, 45.44, 49.28, 38…
## $ AfAmericanPerc            <dbl> 3.43, 2.59, 9.09, 6.64, 21.39, 20.52, 28.24,…
## $ NativeAmericanPerc        <dbl> 0.00, 0.20, 0.00, 0.87, 0.00, 0.17, 0.00, 0.…
## $ AsianAmericanPerc         <dbl> 12.53, 8.52, 12.14, 10.48, 11.34, 10.54, 6.4…
## $ OtherMultiplePerc         <dbl> 5.99, 6.53, 3.84, 6.16, 12.37, 11.98, 7.68, …
## $ PopChar                   <dbl> 11.25, 11.67, 24.14, 28.93, 36.83, 41.76, 61…
## $ PopCharScore              <dbl> 1.17, 1.21, 2.50, 3.00, 3.82, 4.33, 6.42, 6.…
## $ PopCharPctl               <dbl> 1.53, 1.65, 12.27, 18.43, 30.16, 37.70, 68.4…
## $ Education                 <dbl> 3.3, 0.4, 5.6, 4.8, 2.3, 2.8, NA, 5.5, 8.9, …
## $ EducationPctl             <dbl> 12.55, 0.42, 24.12, 20.29, 7.40, 9.73, NA, 2…
## $ LinguisticIsol            <dbl> 1.2, 0.0, 8.0, 0.9, 1.7, 1.0, 4.8, 2.5, 10.3…
## $ LinguisticIsolPctl        <dbl> 8.49, 0.00, 53.36, 5.64, 13.30, 6.27, 36.46,…
## $ Poverty                   <dbl> 10.4, 10.6, 10.3, 21.1, 21.9, 16.0, 33.2, 36…
## $ PovertyPctl               <dbl> 11.03, 11.44, 10.90, 36.42, 38.13, 24.42, 58…
## $ Unemployment              <dbl> NA, 3.0, 3.9, 2.5, 3.8, 7.5, 7.0, 5.7, 5.5, …
## $ UnemploymentPctl          <dbl> NA, 17.11, 29.41, 10.66, 28.20, 71.67, 67.48…
## $ HousingBurden             <dbl> 11.2, 4.0, 8.9, 14.8, 14.8, 18.0, 22.1, 20.7…
## $ HousingBurdenPctl         <dbl> 19.39, 0.67, 9.81, 37.48, 37.48, 54.07, 70.8…
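
Note that in the glimpse() output both PM2.5 and PM2.5.Pctl became logical, since both names start with “PM”, while DieselPM was untouched. Whether the percentile column should be converted is a judgment call the question does not address. If only the raw measurement should be converted, tidyselect helpers can be combined with `&` and `!`, as in this sketch on made-up data (`toy` and `flagged` are hypothetical names).

```r
library(dplyr)

# Toy data (made-up values, not the real dataset).
toy <- tibble(
  PM2.5      = c(8.5, 12.1, 10),
  PM2.5.Pctl = c(36.3, 91.0, 55.2),
  DieselPM   = c(0.093, 0.591, 0.497)
)

flagged <- toy %>%
  mutate(across(
    starts_with("PM") & !ends_with("Pctl"),  # PM2.5 only
    function(x) x > 10
  ))

flagged$PM2.5  # FALSE TRUE FALSE (exactly 10 is not greater than 10)
```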

P.3

Take your code from the previous question and assign it to the variable ces_dat.

  • Create a ggplot where the x-axis is Asthma and the y-axis is PM2.5.
  • Add a boxplot (geom_boxplot())
  • Change the labs() layer so that the x-axis label is “ER Visits for Asthma: PM2.5 greater than 10”
ces_dat <-
  ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  ))

ggplot(data = ces_dat, aes(x = `Asthma`, y = `PM2.5`)) +
  geom_boxplot() +
  labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).

# Make everything a function if you like!
ces_boxplot <- function() {
  ces %>%
    mutate(across(
      starts_with("PM"), 
      function(x) x > 10
    )) %>% 
    ggplot(aes(x = `Asthma`, y = `PM2.5`)) +
    geom_boxplot() +
    labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
}

ces_boxplot()
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).