Homework is optional, but we recommend it so you can get the most out of this course.
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
1. Bring the following dataset into R. Read it in with read_csv() and assign it the name water.
2. Check out and clean the columns of the water object. Use the colnames() function to take a look at the dataset column names. Rename the appropriate column in water to “GPS_North” using the rename() function (tidyverse). Reassign to water.
3. Clean up the “datetime” column. Use the mdy_hm() function from the lubridate package. Call the new column “datetime_fixed”. Then add “Year” and “Month” columns derived from “datetime_fixed”. (Hint: use mutate() and year(datetime_fixed) to create your new variables; month() works the same way.) Reassign to water.
4. Separate “datetime_fixed” into two separate columns using the separate() function. Use the into = c("date", "time") and sep = " " arguments. Replace the original water object by reassigning the new dataset to water.
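The pattern in steps 1 through 4 can be sketched on a tiny made-up dataset. This is not the homework data: the inline CSV text, the column name "gps north", and every value are hypothetical stand-ins chosen only to show how the functions fit together.

```r
library(readr)
library(dplyr)
library(tidyr)
library(lubridate)

# Hypothetical stand-in for the real file: three rows of inline CSV text.
# I() tells read_csv() to treat the string as literal data, not a path.
toy_csv <- I('Station,gps north,datetime,Result
A,39.3,3/15/2021 13:45,7.2
A,39.3,11/2/2012 09:30,<0.5
B,39.4,7/4/2015 16:05,8.1')

toy <- read_csv(toy_csv, show_col_types = FALSE)

# Look at the column names, then rename one (new name = old name).
colnames(toy)
toy <- toy %>% rename(GPS_North = `gps north`)

# Parse month/day/year hour:minute text into a date-time,
# then pull out the year and month as numbers.
toy <- toy %>%
  mutate(
    datetime_fixed = mdy_hm(datetime),
    Year = year(datetime_fixed),
    Month = month(datetime_fixed)
  )

# Split the date-time into a date piece and a time piece at the space.
toy <- toy %>%
  separate(datetime_fixed, into = c("date", "time"), sep = " ")

toy
```

Note that separate() returns “date” and “time” as character columns, so the numeric “Year” and “Month” columns are worth creating before the split.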
5. What is the class of the “Result” column? Use pull(). Can you take the mean value? Why or why not?
6. Some of the values of the “Result” column contain the symbols “<”, “>”, and/or “=”. Remove these symbols using mutate() and str_remove(). Use pattern = "<", pattern = ">", then pattern = "=". Reassign to water each time.
7. Use mutate() and as.numeric() to convert the “Result” column to numeric class. Then convert “Parameter” to a factor with as.factor(). Reassign to water both times.
8. How many different measurements (levels) are stored in the “Parameter” column? Use pull() and levels().
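Steps 5 through 8 follow one chain of reasoning: a column that holds symbols is read as character, so it has to be cleaned before it can be converted. A minimal sketch on a made-up tibble (all values are hypothetical, not taken from the homework data):

```r
library(dplyr)
library(stringr)

# Hypothetical toy data: Result is character because of "<" and ">=".
toy <- tibble(
  Parameter = c("pH", "pH", "Dissolved Oxygen", "Water Temperature"),
  Result = c("7.2", "<0.5", ">=8.1", "12")
)

# pull() extracts the column as a plain vector so class() can inspect it.
result_class <- class(pull(toy, Result))

# Remove one symbol per step, reassigning each time.
toy <- toy %>% mutate(Result = str_remove(Result, pattern = "<"))
toy <- toy %>% mutate(Result = str_remove(Result, pattern = ">"))
toy <- toy %>% mutate(Result = str_remove(Result, pattern = "="))

# With the symbols gone, the text converts cleanly to numbers.
toy <- toy %>% mutate(Result = as.numeric(Result))

# Store Parameter as a factor so its distinct values become levels.
toy <- toy %>% mutate(Parameter = as.factor(Parameter))

param_levels <- levels(pull(toy, Parameter))
length(param_levels)
```

str_remove() deletes only the first match in each string, which is enough here because each symbol appears at most once per value; str_remove_all() would be the choice if a symbol could repeat.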
9. Use the pct_complete() function in the naniar package to determine the percent missing data in “Result”. You might need to install and load naniar!
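Because pct_complete() reports the percent of values that are present, the percent missing is its complement. The sketch below shows the arithmetic in base R on a made-up vector, and calls naniar only if it is installed; the vector values are hypothetical.

```r
# Hypothetical toy vector with two missing values out of five.
result <- c(7.2, NA, 8.1, NA, 6.9)

# Base R version of the same idea: share of non-missing values, as a percent.
pct_present <- mean(!is.na(result)) * 100
pct_missing <- 100 - pct_present

# If naniar is available, its helpers should agree with the numbers above.
if (requireNamespace("naniar", quietly = TRUE)) {
  print(naniar::pct_complete(result))
  print(naniar::pct_miss(result))
}
```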
10. Are there any parameter levels that have an incomplete record in water across all years? Store the result in an object called na_counts. Use group_by() on “Parameter”, then summarize() on Result, checking for NA values with sum(is.na(Result)). Then use filter() to keep only rows with > 0 NA values.
11. Subset water so that it only contains results for the years 2010 - 2024, using & and the filter() function. Make sure to include both the years 2010 and 2024. Confirm your filtering worked by looking at the range() of “Year”. Assign this subsetted tibble to water_subset.
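The NA count in step 10 and the inclusive year filter in step 11 can be sketched as follows. The tibble and the summary column name n_missing are hypothetical; only the function pattern mirrors the steps above.

```r
library(dplyr)

# Hypothetical toy data with two missing results.
toy <- tibble(
  Parameter = c("pH", "pH", "Dissolved Oxygen",
                "Dissolved Oxygen", "Water Temperature"),
  Year = c(2009, 2010, 2015, 2024, 2025),
  Result = c(7.1, NA, 8.3, 9.0, NA)
)

# Count missing results per parameter, then keep parameters with any.
na_counts <- toy %>%
  group_by(Parameter) %>%
  summarize(n_missing = sum(is.na(Result))) %>%
  filter(n_missing > 0)

# >= and <= keep both endpoint years; & requires both conditions to hold.
toy_subset <- toy %>%
  filter(Year >= 2010 & Year <= 2024)

# The smallest and largest remaining years confirm the filter.
range(pull(toy_subset, Year))
```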
12. Subset water_subset so that the “Parameter” column only contains results from the tests for “Dissolved Oxygen”, “pH”, and “Water Temperature”. Use the %in% operator and the filter() function. Make sure to reassign to water_subset.
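As a small illustration of %in% inside filter(), the sketch below keeps rows whose value appears in a vector of allowed names. The toy rows, including the “Turbidity” and “E. coli” entries, are hypothetical.

```r
library(dplyr)

# The set of values to keep.
keep <- c("Dissolved Oxygen", "pH", "Water Temperature")

# Hypothetical toy data containing two parameters outside the set.
toy <- tibble(
  Parameter = c("pH", "Turbidity", "Dissolved Oxygen", "E. coli"),
  Result = c(7.1, 3.2, 8.3, 120)
)

# %in% returns TRUE for each row whose Parameter is found in keep.
toy_subset <- toy %>%
  filter(Parameter %in% keep)

toy_subset
```

Compared with chaining several == tests with |, %in% stays readable as the list of allowed values grows.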
13. Load the new dataset “Baltimore_rainfall_HamiltonAve.csv” into R. Read it in with read_csv() and assign it the name rainfall. This dataset contains measured precipitation by month, as collected near the Hamilton Ave testing location in Baltimore.
14. Reshape rainfall into long format using pivot_longer(). The reshaped dataset should contain three columns: year (“Year”), month (“Month”), and amount of rainfall (“Precip”). Hint: !COLUMN or -COLUMN means everything except COLUMN. The names_to argument is “Year” and values_to is “Precip”. Assign the reshaped dataset to rainfall_long.
15. How many possible “Year” and “Month” combinations are in rainfall_long? Use count(). How does this compare to nrow()? Can you use this information to determine the number of observations for each Year and Month combination?
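The reshape in step 14 and the check in step 15 can be sketched on a made-up wide table. The layout assumed here (one “Month” column plus one column per year) and all values are hypothetical; the real file may differ.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per month, one column per year.
toy_rain <- tibble(
  Month = c("Jan", "Feb"),
  `2010` = c(3.1, 2.4),
  `2011` = c(4.0, 1.8)
)

# Pivot every column except Month: old column names go to "Year",
# the cell values go to "Precip".
toy_long <- toy_rain %>%
  pivot_longer(!Month, names_to = "Year", values_to = "Precip")

# One row per Year/Month combination, with n = rows in that combination.
combos <- toy_long %>% count(Year, Month)

nrow(combos)
nrow(toy_long)
```

When nrow() of the counted table equals nrow() of the long table, every combination occurs exactly once. Also note that “Year” comes out of pivot_longer() as character, because it was built from column names.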
16. We would like to join the rainfall measurements dataset with the Baltimore surface water quality dataset, but we need to do a bit more wrangling first. Because the rainfall measures were collected near the Hamilton Ave water testing site, let’s keep only the Hamilton Ave data from the water_subset data, using filter(). Assign this to water_Ham.
17. Right-join water_Ham and rainfall_long by “Month” and “Year” using right_join(). Assign the joined dataset to the name water_rain. Did this join attempt work? Why or why not?
18. Check the class of the “Month” column in each dataset. Reformat “Month” in rainfall_long so that it matches the format in water_Ham. Use case_when() and mutate(), then reassign to rainfall_long.
19. NOW try to right-join water_Ham and rainfall_long by “Month” and “Year” and assign the joined dataset to the name water_rain. Did this join attempt work?
20. Fix the differences between the classes of “Year” by changing rainfall_long. Then right-join the two datasets to make “water_rain”.
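Steps 17 through 20 hinge on the join keys having the same class in both tables. The sketch below reproduces that situation on made-up data: the month abbreviations, the character “Year”, and all values are assumptions for illustration, not a description of the real files.

```r
library(dplyr)

# Hypothetical water data: Month and Year are numeric.
toy_water <- tibble(
  Year = c(2010, 2010, 2011),
  Month = c(1, 2, 1),
  Result = c(7.1, 7.4, 6.9)
)

# Hypothetical rainfall data: Month and Year are character.
toy_rain_long <- tibble(
  Month = c("Jan", "Feb", "Jan", "Feb"),
  Year = c("2010", "2010", "2011", "2011"),
  Precip = c(3.1, 2.4, 4.0, 1.8)
)

# First attempt: the key classes differ, so the join stops with an error.
# tryCatch() captures the error object instead of halting the script.
first_try <- tryCatch(
  right_join(toy_water, toy_rain_long, by = c("Month", "Year")),
  error = function(e) e
)
inherits(first_try, "error")

# Recode month text to numbers and convert Year to numeric.
toy_rain_long <- toy_rain_long %>%
  mutate(
    Month = case_when(
      Month == "Jan" ~ 1,
      Month == "Feb" ~ 2
    ),
    Year = as.numeric(Year)
  )

# Second attempt: classes match, so the join runs.
# right_join() keeps every rainfall row, filling Result with NA if unmatched.
toy_joined <- right_join(toy_water, toy_rain_long, by = c("Month", "Year"))
toy_joined
```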
21. Subset water_rain so that “Parameter” is “pH”. Plot points with “Precip” on the x axis, “Result” on the y axis, and “Year” as color. Facet the plot by Month. You can use esquisse or ggplot (whichever you prefer).
22. Create a new plot for a different measurement. This time, subset water_rain so that “Parameter” is “Water Temperature”. Plot points with “Precip” on the x axis, “Result” on the y axis, and “Year” as color. Facet the plot by Month. You can use esquisse or ggplot (whichever you prefer). Do the temperatures follow the expected seasonal pattern across months (highs and lows)?
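For the ggplot route, the plot in steps 21 and 22 can be sketched like this. The tibble is made up, so the picture itself means nothing; only the mapping of columns to x, y, color, and facets follows the steps above.

```r
library(dplyr)
library(ggplot2)

# Hypothetical joined data with two parameters and two months.
toy_joined <- tibble(
  Parameter = c("pH", "pH", "pH", "Water Temperature"),
  Year = c(2010, 2011, 2010, 2010),
  Month = c(1, 1, 2, 1),
  Precip = c(3.1, 4.0, 2.4, 3.1),
  Result = c(7.1, 6.9, 7.4, 4.5)
)

# Keep one parameter, then map columns to plot aesthetics.
p <- toy_joined %>%
  filter(Parameter == "pH") %>%
  ggplot(aes(x = Precip, y = Result, color = Year)) +
  geom_point() +
  facet_wrap(~ Month)  # one panel per month

p
```

Swapping "pH" for "Water Temperature" in filter() gives the second plot. A numeric “Year” produces a continuous color scale; wrapping it in factor() inside aes() would give one distinct color per year instead.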