Load all the packages we will use in this lab.

library(tidyverse)

1.0

Load the CalEnviroScreen dataset and use select to choose the CaliforniaCounty, ImpWaterBodies, and ZIP variables. Then subset this data using filter to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data “ces”.

ImpWaterBodies: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.

ces <- 
  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>% 
  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.1

Create a boxplot showing the difference in groundwater contamination threats (ImpWaterBodies) among Amador, Napa, San Francisco, and Ventura counties (CaliforniaCounty). Hint: Use aes(x = CaliforniaCounty, y = ImpWaterBodies) and geom_boxplot().

ces %>%
  ggplot(aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
  geom_boxplot()

1.2

Use count to count up the number of observations of data for each CaliforniaCounty group.

ces %>%
  count(CaliforniaCounty)
## # A tibble: 4 × 2
##   CaliforniaCounty     n
##   <chr>            <int>
## 1 Amador               9
## 2 Napa                40
## 3 San Francisco      195
## 4 Ventura            173

1.3

Make CaliforniaCounty a factor using the mutate and factor functions. Use the levels argument inside factor to reorder CaliforniaCounty. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name “ces_fct”.

ces_fct <-
  ces %>% mutate(CaliforniaCounty = factor(CaliforniaCounty,
    levels = c("San Francisco", "Ventura", "Napa", "Amador")
  ))

1.4

Repeat question 1.1 and 1.2 using the “ces_fct” data. You should see different ordering in the plot and count table.

ces_fct %>%
  ggplot(aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
  geom_boxplot()

ces_fct %>%
  count(CaliforniaCounty)
## # A tibble: 4 × 2
##   CaliforniaCounty     n
##   <fct>            <int>
## 1 San Francisco      195
## 2 Ventura            173
## 3 Napa                40
## 4 Amador               9

Practice on Your Own!

P.1

Subset ces_fct so that it only includes data from Ventura county. Then convert ZIP (zip code) into a factor using the mutate and factor functions. Do not add a levels = argument.

ces_Ventura <- ces_fct %>% 
  filter(CaliforniaCounty == "Ventura") %>%
  mutate(ZIP = factor(ZIP))

P.2

We want to create a new column that contains the group-level median values for ImpWaterBodies.

  • Using the “ces_Ventura” data, group the data by ZIP using group_by
  • Then, use mutate to create a new column med_ImpWaterBodies that is the median of ImpWaterBodies.
  • Hint: Since you have already done group_by, a median ImpWaterBodies will automatically be created for each unique level in ZIP. Use the median function with na.rm = TRUE.
ces_Ventura <- ces_Ventura %>%
  group_by(ZIP) %>%
  mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))

P.3

We want to make a plot of the med_ImpWaterBodies column we created above in the ces_Ventura, separated by ZIP. Using the forcats package, create a plot that:

  • Has ZIP on the x-axis
  • Uses the mapping argument and the fct_reorder function to order the x-axis by med_ImpWaterBodies
  • Has med_ImpWaterBodies on the y-axis
  • Is a boxplot (geom_boxplot)
  • Has the x axis label of “Zipcode” (Don’t worry if you get a warning about not being able to plot NA values.)

Save your plot using ggsave() with a width of 10 and height of 3.

Which zipcode has the largest median measure of water pollution?

library(forcats)

ces_Ventura_plot <- ces_Ventura %>%
  drop_na() %>%
  ggplot(aes(
    x = fct_reorder(
      ZIP, med_ImpWaterBodies
    ),
    y = med_ImpWaterBodies
  )) +
  geom_boxplot() +
  labs(x = "Zipcode")

ggsave(
  filename = "ces_Ventura.png", # will save in working directory
  plot = ces_Ventura_plot,
  width = 10, height = 3
)