Load all the packages we will use in this lab.
library(tidyverse)
Load the CalEnviroScreen dataset and use select to choose the CaliforniaCounty, ImpWaterBodies, and ZIP variables. Then subset this data using filter to include only the California counties Amador, Napa, Ventura, and San Francisco. Name this data “ces”.
ImpWaterBodies: measure of the number of pollutants across all impaired water bodies within a given distance of populated areas.
ces <-
  read_csv("https://daseh.org/data/CalEnviroScreen_data.csv") %>%
  select(CaliforniaCounty, ImpWaterBodies, ZIP) %>%
  filter(CaliforniaCounty %in% c("Amador", "Napa", "Ventura", "San Francisco"))
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Create a boxplot showing the difference in pollutant counts for impaired water bodies (ImpWaterBodies) among Amador, Napa, San Francisco, and Ventura counties (CaliforniaCounty). Hint: Use aes(x = CaliforniaCounty, y = ImpWaterBodies) and geom_boxplot().
ces %>%
  ggplot(aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
  geom_boxplot()
Use count to count the number of observations in each CaliforniaCounty group.
ces %>%
  count(CaliforniaCounty)
## # A tibble: 4 × 2
## CaliforniaCounty n
## <chr> <int>
## 1 Amador 9
## 2 Napa 40
## 3 San Francisco 195
## 4 Ventura 173
Make CaliforniaCounty a factor using the mutate and factor functions. Use the levels argument inside factor to reorder CaliforniaCounty so that the order is now San Francisco, Ventura, Napa, and Amador. Assign the output the name “ces_fct”.
ces_fct <-
  ces %>% mutate(CaliforniaCounty = factor(CaliforniaCounty,
    levels = c("San Francisco", "Ventura", "Napa", "Amador")
  ))
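To see what the levels argument changes, here is a minimal sketch on a small made-up vector (the county vector below is illustrative, not taken from the dataset). Without levels, factor orders the levels alphabetically; with levels, it uses the order you supply.

```r
# Hypothetical toy vector of county names (not the real data)
county <- c("Napa", "Amador", "Ventura", "San Francisco", "Napa")

# Default: levels are the sorted unique values (alphabetical)
default_fct <- factor(county)
levels(default_fct)

# Custom: levels follow the order given in the levels argument
custom_fct <- factor(county,
  levels = c("San Francisco", "Ventura", "Napa", "Amador")
)
levels(custom_fct)
```

The underlying values are unchanged in both cases; only the level order differs, and that order is what plots and count tables follow.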
Repeat questions 1.1 and 1.2 (the boxplot and the count table) using the “ces_fct” data. You should see different ordering in the plot and count table.
ces_fct %>%
  ggplot(aes(x = CaliforniaCounty, y = ImpWaterBodies)) +
  geom_boxplot()
ces_fct %>%
  count(CaliforniaCounty)
## # A tibble: 4 × 2
## CaliforniaCounty n
## <fct> <int>
## 1 San Francisco 195
## 2 Ventura 173
## 3 Napa 40
## 4 Amador 9
Subset ces_fct so that it only includes data from Ventura county. Then convert ZIP (zip code) into a factor using the mutate and factor functions. Do not add a levels = argument.
ces_Ventura <- ces_fct %>%
  filter(CaliforniaCounty == "Ventura") %>%
  mutate(ZIP = factor(ZIP))
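As a small illustration of what happens when a numeric column is turned into a factor without a levels argument, the sketch below uses a made-up vector of zip codes (the values are illustrative only). The levels come out in ascending numeric order, and converting back to numbers needs to go through as.character to avoid getting the internal level codes.

```r
# Hypothetical numeric zip codes (illustrative values only)
zip <- c(93001, 91320, 93001, 93030)

# With no levels argument, levels are the sorted unique values
zip_fct <- factor(zip)
levels(zip_fct)

# Pitfall: as.numeric() on a factor returns the level codes, not the zip codes
codes <- as.numeric(zip_fct)

# Going through as.character() recovers the original values
zip_back <- as.numeric(as.character(zip_fct))
```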
We want to create a new column that contains the group-level median values for ImpWaterBodies. Group the data by ZIP using group_by, then use mutate to create a new column med_ImpWaterBodies that is the median of ImpWaterBodies. Hint: because the data are grouped with group_by, a median ImpWaterBodies will automatically be created for each unique level in ZIP. Use the median function with na.rm = TRUE.
ces_Ventura <- ces_Ventura %>%
  group_by(ZIP) %>%
  mutate(med_ImpWaterBodies = median(ImpWaterBodies, na.rm = TRUE))
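The grouped mutate above can be checked on a tiny example. This is a sketch on a made-up tibble (the toy data and the column names value and med_value are assumptions for illustration): every row keeps its original value and gains the median of its own group, with NA values ignored because of na.rm = TRUE.

```r
library(dplyr)

# Hypothetical toy data: two groups, one missing value
toy <- tibble(
  ZIP = factor(c("A", "A", "B", "B", "B")),
  value = c(1, 3, 10, NA, 20)
)

toy_med <- toy %>%
  group_by(ZIP) %>%
  # median() is computed separately within each ZIP group
  mutate(med_value = median(value, na.rm = TRUE)) %>%
  # ungroup() so later steps are not silently applied per group
  ungroup()

toy_med
```

Unlike summarize, mutate keeps all rows, so the group median is repeated on every row of the group.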
We want to make a plot of the med_ImpWaterBodies column we created above in the ces_Ventura data, separated by ZIP. Using the forcats package, create a plot that:
- has ZIP on the x-axis
- uses the mapping argument and the fct_reorder function to order the x-axis by med_ImpWaterBodies
- has med_ImpWaterBodies on the y-axis
- is a boxplot (geom_boxplot)
- has the x-axis label “Zipcode”
(The solution below removes NA values with drop_na() before plotting.)
Save your plot using ggsave() with a width of 10 and height of 3.
Which zipcode has the largest median measure of water pollution?
library(forcats)
ces_Ventura_plot <- ces_Ventura %>%
  drop_na() %>%
  ggplot(aes(
    x = fct_reorder(
      ZIP, med_ImpWaterBodies
    ),
    y = med_ImpWaterBodies
  )) +
  geom_boxplot() +
  labs(x = "Zipcode")
ggsave(
  filename = "ces_Ventura.png", # will save in working directory
  plot = ces_Ventura_plot,
  width = 10, height = 3
)
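To see how fct_reorder determines the x-axis order, and how that order can answer the question about the largest median, here is a minimal sketch on made-up values (the zip and med vectors are hypothetical). By default fct_reorder sorts the levels in ascending order of the second argument, so the last level is the group with the largest value.

```r
library(forcats)

# Hypothetical groups and their (already computed) medians
zip <- factor(c("A", "B", "C"))
med <- c(5, 1, 3)

# Levels are reordered by ascending med: smallest first, largest last
reordered <- fct_reorder(zip, med)
levels(reordered)

# The last level corresponds to the largest median
largest <- tail(levels(reordered), 1)
largest
```

In the saved plot, the same logic means the zipcode furthest to the right on the x-axis has the largest median.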