Introduction to R: Homework 3

Problem Set

1. Bring the following dataset into R.

The dataset is located at: https://daseh.org/data/Baltimore_surface_water_quality_data.csv.
You can use the link, download it, or use whatever method you like for getting the file.
Once you get the file, read the dataset in using read_csv() and assign it the name water.

water <- read_csv("https://daseh.org/data/Baltimore_surface_water_quality_data.csv")

## Rows: 461308 Columns: 9
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (6): Station, Parameter, Lab, Result, Unit, datetime
## dbl (3): OBJECTID, GPS Coordinate North, GPS Coordinate West
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

2. Check out and clean the columns of the water object.

Run the colnames() function to take a look at the dataset column names.
Rename the third column of water to “GPS_North” using the rename() function (tidyverse).
Then rename the fourth column to “GPS_West”.
Reassign the changed data to water.

colnames(water)

## [1] "OBJECTID"             "Station"              "GPS Coordinate North"
## [4] "GPS Coordinate West"  "Parameter"            "Lab"                 
## [7] "Result"               "Unit"                 "datetime"

water <- water |> rename(GPS_North = `GPS Coordinate North`)
water <- water |> rename(GPS_West = `GPS Coordinate West`)

# Alternatives
# water <- rename(water, GPS_North = "GPS Coordinate North", GPS_West = "GPS Coordinate West")
# Or pipe
# water <- water |> rename(GPS_North = "GPS Coordinate North", GPS_West = "GPS Coordinate West")

3. Clean up the “datetime” column.

Convert the “datetime” column to POSIXct format using the mdy_hm() functions from the lubridate package. Call the new column “datetime_fixed”.
Then create the variables “Month” and “Year” by extracting the month and year from the “datetime” column. (HINT: use mutate() and year(datetime_fixed) to create your new variables.)
Reassign to water.

#converting the date format to POSIXct
water <- water |> mutate(datetime_fixed = mdy_hm(datetime))

#creating new variables
water <- water |> mutate(Year = year(datetime_fixed))
water <- water |> mutate(Month = month(datetime_fixed))

# Alternative
# water <- water |> mutate( datetime_fixed = mdy_hm(datetime), Month = month(datetime_fixed), Year = year(datetime_fixed))

4. Separate “datetime_fixed” into two separate columns using the separate() function. Use into = c("date", "time") and sep = " " arguments. Replace the original water object by reassigning the new dataset to water.

water <- water |> separate(`datetime_fixed`, into = c("date", "time"), sep = " ")

5. What is the class for “Result” column? Use pull(). Can you take the mean value? Why or why not?

water |> pull(Result) |> class()

## [1] "character"

# "Result" is of type "character". For some tests, the result is listed as below the detectable level, using a "<". R can only interpret this symbol as a character.

6. Some of the values of the “Result” column contain symbols “<”, “>”, and/or “=”. Replace these:

Using mutate() and str_remove().
Repeat 3 times for patterns: pattern = "<" , pattern = ">", then pattern = "=".
Reassign the object to water each time.

water <- water |> mutate(Result = str_remove(Result, pattern = "<"))
water <- water |> mutate(Result = str_remove(Result, pattern = ">"))
water <- water |> mutate(Result = str_remove(Result, pattern = "="))

# Alternative syntax
# water <- water |> mutate(
#   Result = str_remove(Result, pattern = "<"),
#   Result = str_remove(Result, pattern = ">"),
#   Result = str_remove(Result, pattern = "=")
# )

7. Use mutate() and as.numeric() to convert the “Result” column to numeric class. Then convert “Parameter” to a factor with as.factor(). Reassign to water both times.

water <- water |>
  mutate(Result = as.numeric(Result)) 
water <- water |> 
  mutate(Parameter = as.factor(Parameter))

8. How many different measurements (levels) are stored in the “Parameter” column? Use pull() and levels().

water |> pull(Parameter) |> levels()

##  [1] "Ammonia-Nitrogen"             "Antimony_Dissolved"          
##  [3] "Antimony_Total"               "Arsenic_Dissolved"           
##  [5] "Arsenic_Total"                "BOD5"                        
##  [7] "Cadmium_Dissolved"            "Cadmium_Total"               
##  [9] "Chloride"                     "chlorine"                    
## [11] "Chlorine"                     "Chlorine_Spec"               
## [13] "Chromium_Dissolved"           "Chromium_Total"              
## [15] "COD"                          "Conductivity"                
## [17] "Copper_Dissolved"             "Copper_Total"                
## [19] "Copper, Total"                "Dissolved Oxygen"            
## [21] "E. Coli"                      "Enterococcus"                
## [23] "Fecal Coliform"               "Fluoride"                    
## [25] "Hach Ammonia-Nitrogen"        "Hardness"                    
## [27] "Iron, Total"                  "Lead_Dissolved"              
## [29] "Lead_Total"                   "Lead, Total"                 
## [31] "Maximum Depth"                "Nitrate+Nitrite-Nitrogen"    
## [33] "Oil & Grease"                 "Ortho-Phosphate"             
## [35] "pH"                           "Secchi Disk"                 
## [37] "SGT-HEM"                      "Sodium"                      
## [39] "Suspended Solids"             "TKN"                         
## [41] "Total Coliform"               "Total Petroleum Hydrocarbons"
## [43] "Total Phenolics"              "Total Phosphorus"            
## [45] "Turbidity"                    "Water Temperature"           
## [47] "Zinc_Dissolved"               "Zinc_Total"                  
## [49] "Zinc, Total"

#There are 49 different types of measurements in this dataset.

9. Use the pct_complete() function in the naniar package to determine the percent missing data in “Results”. You might need to load and install naniar!

library(naniar)
pct_complete(water)

## [1] 99.99922

10. Are there any parameter levels that have an incomplete record in water across all years?

Create a new dataframe called na_counts.
Use group_by() on “Parameter”
Use summarize() on Result checking for NA values with sum(is.na(Result)).
Use filter() to keep only rows with > 0 NA values.

na_counts <- water |>
  group_by(Parameter) |>
  summarize(na_count = sum(is.na(Result)))

na_counts |> filter(na_count > 0)

## # A tibble: 3 × 2
##   Parameter na_count
##   <fct>        <int>
## 1 E. Coli         27
## 2 Hardness        11
## 3 Turbidity        3

11. Subset water so that it only contains results for the years 2010 - 2024, using & and the filter() function. Make sure to include both the years 2010 and 2024. Confirm your filtering worked by looking at the range() of “Year”. Assign this subsetted tibble to water_subset.

water_subset <- water |> filter(Year >= 2010 & Year <= 2024)
water_subset |>
  pull(Year) |>
  range() # confirm it worked

## [1] 2010 2024

12. Subset water_subset so that the “Parameter” column only contains results from the tests for “Dissolved Oxygen”, “pH”, and “Water Temperature”. Use the %in% operator and the filter() function. Make sure to reassign to water_subset.

water_subset <- water_subset |>
  filter(Parameter %in% c("Dissolved Oxygen", "pH", "Water Temperature"))

13. Load the new dataset “Baltimore_rainfall_HamiltonAve.csv” into R.

The dataset is located at: https://daseh.org/data/Baltimore_rainfall_HamiltonAve.csv.
You can use the link, download it, or use whatever method you like for getting the file.
Once you get the file, read the dataset in using read_csv() and assign it the name rainfall.

This dataset contains measured precipitation by month, as collected near the Hamilton Ave testing location in Baltimore.

rainfall <- read_csv("https://daseh.org/data/Baltimore_rainfall_HamiltonAve.csv")

## Rows: 13 Columns: 17
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (1): Month
## dbl (16): 2009, 2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

14. Reshape rainfall into long format using pivot_longer(). The reshaped dataset should contain three columns: year (“Year”), month (“Month”), and amount of rainfall (“Precip”).

Pivot all columns except “Month”. Hint: listing !COLUMN or -COLUMN means everything except COLUMN.
Make sure names_to is “Year” and values_to is “Precip”.
Assign the reshaped dataset to rainfall_long.

rainfall_long <- rainfall |> pivot_longer(!Month, names_to = "Year", values_to = "Precip")

15. How many possible “Year” and “Month” combinations are in rainfall_long? Use count(). How does this compare to nrow()? Can you use this information to determine the number of observations per each Year and Month combination?

rainfall_long |> count(Year, Month)

## # A tibble: 208 × 3
##    Year  Month      n
##    <chr> <chr>  <int>
##  1 2009  Annual     1
##  2 2009  Apr        1
##  3 2009  Aug        1
##  4 2009  Dec        1
##  5 2009  Feb        1
##  6 2009  Jan        1
##  7 2009  Jul        1
##  8 2009  Jun        1
##  9 2009  Mar        1
## 10 2009  May        1
## # ℹ 198 more rows

nrow(rainfall_long)

## [1] 208

# They are the same; the dataset contains one observation per Year and Month combination.

16. We would like to join the rainfall measurements dataset with the Baltimore surface water quality dataset, but we need to do a bit more wrangling first. Because the rainfall measures were collected near the Hamilton Ave water testing site, let’s keep only the Hamilton Ave data from the water_subset data, using filter(). Assign this to water_Ham.

The “Station” column should be “HAMILTON AVE.”

water_Ham <- water_subset |> filter(Station == "HAMILTON AVE.")

17. Right-join water_Ham and rainfall_long by “Month” and “Year” using right_join(). Assign the joined dataset to the name water_rain. Did this join attempt work? Why or why not?

water_rain <- water_Ham |> right_join(rainfall_long, by = c("Month", "Year"))

## Error in `right_join()`:
## ! Can't join `x$Month` with `y$Month` due to incompatible types.
## ℹ `x$Month` is a <double>.
## ℹ `y$Month` is a <character>.

#This attempted join did not work, because the class of Month is not the same between the two datasets!

18. Check the class of the “Month” column in each dataset. Reformat “Month” in rainfall_long so that it matches the format in water_Ham. Use case_when() and mutate(), then reassign to rainfall_long.

rainfall_long <- rainfall_long |> 
  mutate(Month = case_when(
    Month == "Jan" ~ 1,
    Month == "Feb" ~ 2,
    Month == "Mar" ~ 3,
    Month == "Apr" ~ 4,
    Month == "May" ~ 5,
    Month == "Jun" ~ 6,
    Month == "Jul" ~ 7,
    Month == "Aug" ~ 8,
    Month == "Sep" ~ 9,
    Month == "Oct" ~ 10,
    Month == "Nov" ~ 11,
    Month == "Dec" ~ 12
    ))

19. NOW try to right-join water_Ham and rainfall_long by “Month” and “Year” and assign the joined dataset to the name water_rain. Did this join attempt work?

water_rain <- water_Ham |> right_join(rainfall_long, by = c("Month", "Year"))

## Error in `right_join()`:
## ! Can't join `x$Year` with `y$Year` due to incompatible types.
## ℹ `x$Year` is a <double>.
## ℹ `y$Year` is a <character>.

#This attempted join did not work either, because "Year" is a different format between both datasets too!

20. Fix the differences between the classes of “Year” by changing rainfall_long. Then right-join the two datasets to make “water_rain”.

rainfall_long <- rainfall_long |> 
  mutate(Year = as.double(Year))

water_rain <- water_Ham |> right_join(rainfall_long, by = c("Month", "Year"))

21. Subset water_rain so that “Parameter” is “pH”. Plot points with “Precip” on the x axis, “Result” on the y axis, and “Year” as color. Facet the plot by Month. You can use esquisse or ggplot (whichever you prefer).

plot_subset <- water_rain |> filter(Parameter == "pH")

#If using esquisse:
#library(esquisse)
#esquisser(plot_subset)

#If using ggplot:
ggplot(data = plot_subset, aes(x = Precip, y = Result, color = Year)) +
  geom_point() +
  facet_wrap(~Month)

## Warning: Removed 111 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Alternative syntax
# water_rain |>
#   filter(Parameter == "pH") |>
#   ggplot(aes(x = Precip, y = Result, color = Year)) +
#   geom_point() +
#   facet_wrap(~ Month)

22. Create a new plot changing the measurement units. This time, subset water_rain so that “Parameter” is “Water Temperature”. Plot points with “Precip” on the x axis, “Result” on the y axis, and “Year” as color. Facet the plot by Month. You can use esquisse or ggplot (whichever you prefer). Do the temperatures follow the expected seasonal climate across months (highs and lows)?

plot_subset <- water_rain |> filter(Parameter == "Water Temperature")

#If using esquisse:
#library(esquisse)
#esquisser(plot_subset)

#If using ggplot:
ggplot(data = plot_subset, aes(x = Precip, y = Result, color = Year)) +
  geom_point() +
  facet_wrap(~Month)

## Warning: Removed 117 rows containing missing values or values outside the scale range
## (`geom_point()`).

# Yes, temperatures are higher in summer and lower in winter months.

Introduction to R: Homework 3 - Key

Instructions

Problem Set