Load all the packages we will use in this lab.
library(tidyverse)
Load the CalEnviroScreen dataset and use select to choose the CaliforniaCounty, ImpWaterBodies, and ZIP variables. Then subset this data using filter to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data “ces”.
ImpWaterBodies: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.
ces <-
  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
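To see what the select and filter steps do without downloading the full file, here is a minimal sketch on a small made-up table. The tibble `toy`, its values, and the `Extra` column are hypothetical stand-ins, not rows from the CalEnviroScreen data.

```r
library(dplyr)

# Hypothetical toy data standing in for a few CalEnviroScreen columns
toy <- tibble(
  CaliforniaCounty = c("Napa", "Amador", "Fresno", "Napa"),
  ImpWaterBodies   = c(3, 0, 7, NA),
  ZIP              = c(94558, 95640, 93650, 94559),
  Extra            = c("a", "b", "c", "d")
)

toy_subset <- toy %>%
  # select() keeps only the named columns, so Extra is dropped
  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
  # %in% keeps rows whose county matches any value in the vector
  filter(CaliforniaCounty %in% c("Amador", "Napa"))

toy_subset
```

Using `%in%` with a vector of names avoids chaining several `==` comparisons with `|`.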
Create a boxplot showing the difference in impaired water body pollution (ImpWaterBodies) among Amador, Napa, San Francisco, and Ventura counties (CaliforniaCounty). Hint: Use aes(x = CaliforniaCounty, y = ImpWaterBodies) and geom_boxplot().
ces %>%
  ggplot(aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
  geom_boxplot()
Use count to count up the number of observations of data for each CaliforniaCounty group.
ces %>%
  count(CaliforniaCounty)
## # A tibble: 4 × 2
## CaliforniaCounty n
## <chr> <int>
## 1 Amador 9
## 2 Napa 40
## 3 San Francisco 195
## 4 Ventura 173
Make CaliforniaCounty a factor using the mutate and factor functions. Use the levels argument inside factor to reorder CaliforniaCounty. Reorder this variable so the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name “ces_fct”.
ces_fct <-
  ces %>%
  mutate(CaliforniaCounty = factor(CaliforniaCounty,
    levels = c("San Francisco", "Ventura", "Napa", "Amador")
  ))
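The effect of the levels argument can be checked on a small example. This is a sketch on a hypothetical tibble `counties`; the column name `county` and the rows are made up for illustration.

```r
library(dplyr)

# Hypothetical toy data
counties <- tibble(county = c("Napa", "Amador", "Napa", "Ventura"))

# Character column: count() lists groups in alphabetical order
counties %>% count(county)

# Factor column: count() follows the order given in levels
counties_fct <- counties %>%
  mutate(county = factor(county, levels = c("Ventura", "Napa", "Amador")))

levels(counties_fct$county)
counties_fct %>% count(county)
```

Plots follow the same rule: ggplot2 draws a factor on a discrete axis in level order, which is why reordering the levels changes the boxplot.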
Repeat questions 1.1 and 1.2 (the boxplot and the count table) using the “ces_fct” data. You should see a different ordering in both the plot and the count table.
ces_fct %>%
  ggplot(aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
  geom_boxplot()
ces_fct %>%
  count(CaliforniaCounty)
## # A tibble: 4 × 2
## CaliforniaCounty n
## <fct> <int>
## 1 San Francisco 195
## 2 Ventura 173
## 3 Napa 40
## 4 Amador 9
Subset ces_fct so that it only includes data from Ventura County. Then convert ZIP (ZIP code) into a factor using the mutate and factor functions. Do not add a levels = argument.
ces_Ventura <- ces_fct %>%
  filter(CaliforniaCounty == "Ventura") %>%
  mutate(ZIP = factor(ZIP))
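When no levels argument is given, factor() uses the sorted unique values as levels. A short base R sketch with made-up ZIP codes (the vector `zips` is hypothetical) shows this, along with a common pitfall when converting back to numbers.

```r
# Hypothetical ZIP codes stored as numbers
zips <- c(93001, 91320, 93001, 93033)

zip_fct <- factor(zips)

# Levels are the sorted unique values, stored as character labels
levels(zip_fct)

# as.numeric() on a factor returns the internal level codes, not the ZIPs
as.numeric(zip_fct)

# Convert through character to recover the original values
as.numeric(as.character(zip_fct))
```

Treating ZIP as a factor tells ggplot2 and dplyr that each ZIP code is a category rather than a point on a continuous scale.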
We want to create a new column that contains the group-level median values for ImpWaterBodies.
- Group the data by ZIP using group_by.
- Then use mutate to create a new column med_ImpWaterBodies that is the median of ImpWaterBodies.
- Hint: because the data are already grouped with group_by, a median ImpWaterBodies will automatically be created for each unique level in ZIP. Use the median function with na.rm = TRUE.
ces_Ventura <- ces_Ventura %>%
  group_by(ZIP) %>%
  mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))
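The grouped mutate keeps every row and repeats the group median on each one, whereas summarize would collapse the data to one row per group. The following sketch uses a hypothetical tibble `toy` with made-up values to show the difference.

```r
library(dplyr)

# Hypothetical toy data: two ZIP groups, one missing value
toy <- tibble(
  ZIP   = factor(c("93001", "93001", "93033", "93033", "93033")),
  value = c(2, 4, 1, NA, 9)
)

# group_by() + mutate(): same number of rows, median repeated within each ZIP
toy_med <- toy %>%
  group_by(ZIP) %>%
  mutate(med_value = median(value, na.rm = TRUE)) %>%
  ungroup()

toy_med

# group_by() + summarize(): one row per ZIP
toy_sum <- toy %>%
  group_by(ZIP) %>%
  summarize(med_value = median(value, na.rm = TRUE))

toy_sum
```

Without na.rm = TRUE, the median for any ZIP containing a missing value would itself be NA.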
We want to make a plot of the med_ImpWaterBodies column we created above in the ces_Ventura data, separated by ZIP. Using the forcats package, create a plot that:
- has ZIP on the x-axis
- uses the mapping argument and the fct_reorder function to order the x-axis by med_ImpWaterBodies
- has med_ImpWaterBodies on the y-axis
- is a boxplot (geom_boxplot)
- labels the x-axis "Zipcode" and excludes rows with NA values
Save your plot using ggsave() with a width of 10 and height of 3.
Which zipcode has the largest median measure of water pollution?
library(forcats)
ces_Ventura_plot <- ces_Ventura %>%
  drop_na() %>%
  ggplot(aes(
    x = fct_reorder(
      ZIP, med_ImpWaterBodies
    ),
    y = med_ImpWaterBodies
  )) +
  geom_boxplot() +
  labs(x = "Zipcode")
ggsave(
  filename = "ces_Ventura.png", # will save in working directory
  plot = ces_Ventura_plot,
  width = 10, height = 3
)
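Because fct_reorder sorts levels in ascending order of the second variable by default, the ZIP code with the largest median appears at the far right of the plot. The same answer can be read from the data directly. This is a sketch on a hypothetical tibble `toy` with made-up medians, not the actual Ventura values.

```r
library(dplyr)
library(forcats)

# Hypothetical toy data: one row per ZIP with its median
toy <- tibble(
  ZIP       = factor(c("93001", "93033", "91320")),
  med_value = c(5, 2, 9)
)

# Default level order is sorted by label
levels(toy$ZIP)

# fct_reorder() reorders levels by med_value, smallest first
reordered <- fct_reorder(toy$ZIP, toy$med_value)
levels(reordered)

# slice_max() returns the row with the largest median
top_zip <- toy %>% slice_max(med_value, n = 1)
top_zip
```

On the lab data, one option would be to apply the same idea to ces_Ventura after ungroup() and distinct(ZIP, med_ImpWaterBodies), so that each ZIP code appears only once.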