Load the package we will use in this lab.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Create some data to work with by running the following code chunk.
set.seed(1234)
int_vect <- rep(seq(from = 1, to = 10), times = 3)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)
Determine the class of each of these new objects.
class(int_vect) # [1] "integer"
## [1] "integer"
class(rand_vect) # [1] "integer"
## [1] "integer"
class(TF_vect) # [1] "logical"
## [1] "logical"
class(TF_vect2) # [1] "character"
## [1] "character"
Are TF_vect and TF_vect2 different classes? Why or why not?
# Yes!
# TF_vect is logical: its `TRUE` and `FALSE` values are written without quotes.
# TF_vect2 is character: the quotes around "TRUE" and "FALSE" make them text strings.
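The difference between the two classes matters as soon as you compute with them. The sketch below rebuilds the two vectors from this lab and compares how sum() treats each one. The tryCatch() wrapper and the object names n_true, char_sum_result, and recovered are illustrative additions, not part of the lab.

```r
# Rebuild the two vectors from the lab
TF_vect  <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

# Logical values behave like 1 (TRUE) and 0 (FALSE), so sum() counts the TRUEs
n_true <- sum(TF_vect)
n_true # 20

# sum() is not defined for character vectors; catch the error instead of stopping
char_sum_result <- tryCatch(sum(TF_vect2), error = function(e) "error")
char_sum_result # "error"

# as.logical() converts the strings "TRUE"/"FALSE" back to logical values
recovered <- as.logical(TF_vect2)
identical(recovered, TF_vect) # TRUE
```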
Create a tibble called vect_data that combines these vectors, using the following code.
vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)
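A tibble keeps the class of each vector it is built from. As a small check, assuming a toy tibble with hypothetical column names (num, flag, flag_chr) rather than the lab's vect_data, one way to list every column's class at once is:

```r
library(tibble)

# Toy tibble with one integer, one logical, and one character column
small <- tibble(
  num      = 1:3,
  flag     = c(TRUE, TRUE, FALSE),
  flag_chr = c("TRUE", "TRUE", "FALSE")
)

# Apply class() to each column; the result is a named character vector
col_classes <- sapply(small, class)
col_classes
```

The same call, sapply(vect_data, class), should work on the tibble created above.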
Coerce rand_vect to character class using as.character(). Save this vector as rand_char_vect. How is the output for rand_vect and rand_char_vect different?
rand_char_vect <- as.character(rand_vect)
rand_char_vect # Numbers now have quotation marks
## [1] "28" "16" "26" "22" "5" "12" "15" "9" "5" "6" "16" "4" "2" "7" "22"
## [16] "26" "6" "15" "14" "20" "14" "30" "24" "30" "4" "4" "21" "8" "20" "24"
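The quotation marks are more than cosmetic: once the numbers are stored as text, R stops treating them as numbers. The following sketch, which reuses the lab's seed and vector names and adds a hypothetical round_trip object, shows three consequences.

```r
set.seed(1234)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
rand_char_vect <- as.character(rand_vect)

# mean() works on integers...
mean(rand_vect)
# ...but returns NA (with a warning) on a character vector
suppressWarnings(mean(rand_char_vect)) # NA

# Sorting also changes: text is compared character by character
sort(c(5L, 12L, 9L))    # 5 9 12
sort(c("5", "12", "9")) # "12" "5" "9"

# Coercing back with as.integer() recovers the original values
round_trip <- as.integer(rand_char_vect)
identical(round_trip, rand_vect) # TRUE
```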
Read in the National Wastewater Surveillance System (NWSS) SARS-CoV-2 Wastewater data using the URL and the code provided.
The NWSS uses water from different sewage treatment plants to test for the SARS-CoV-2 virus, as a way to estimate how many COVID infections a community is experiencing.
sars_ww <-
read_csv(file = "https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv")
## Rows: 2813 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): reporting_jurisdiction, sample_location, key_plot_id, town_name, co...
## dbl (4): population_served, rna_pct_change_15d, pos_PCR_prop_15d, percentile
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Use the mutate() function to create a new column named date_formatted, based on the date_end column. Hint: use the mdy() function. Reassign to sars_ww.
date_end: This is the last date of the sampling window. A sampling window is used to measure change in viral concentration.
# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
sars_ww <- sars_ww %>% mutate(date_formatted = mdy(date_end))
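To see what mdy() does without downloading the data, here is a minimal sketch on a toy character vector. The first two strings match values shown in date_end; the third, and the names date_strings and parsed, are made up for illustration.

```r
library(lubridate)

# Month/day/year dates stored as text
date_strings <- c("7/5/2020", "7/6/2020", "12/31/2021")
class(date_strings) # "character"

# mdy() parses month-day-year text into Date values
parsed <- mdy(date_strings)
class(parsed) # "Date"
parsed        # "2020-07-05" "2020-07-06" "2021-12-31"

# Dates support arithmetic, which text does not
parsed[2] - parsed[1] # Time difference of 1 days
```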
Move the date_formatted variable to be before date_end using the relocate() function. Take a look at the data using glimpse(). Note the difference between the date_end and date_formatted columns.
# General format
NEWDATA <- OLD_DATA %>% relocate(COLUMN1, .before = COLUMN2)
sars_ww <- sars_ww %>% relocate(date_formatted, .before = date_end)
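# alternative: select() moves both columns to the front of the data, with date_formatted first
# (do not pipe into head() when reassigning, or only the first 6 rows are kept)
# sars_ww <- sars_ww %>% select(date_formatted, date_end, everything())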
glimpse(sars_ww)
## Rows: 2,813
## Columns: 14
## $ reporting_jurisdiction <chr> "Missouri", "Missouri", "Missouri", "Missouri",…
## $ sample_location <chr> "Treatment plant", "Treatment plant", "Treatmen…
## $ key_plot_id <chr> "NWSS_mo_259_Treatment plant_raw wastewater", "…
## $ town_name <chr> "Barry", "Barry", "Barry", "Barry", "Barry", "B…
## $ county_names <chr> "Lawrence", "Lawrence", "Lawrence", "Lawrence",…
## $ population_served <dbl> 9100, 9100, 9100, 9100, 9100, 9100, 9100, 9100,…
## $ date_start <chr> "6/21/2020", "6/22/2020", "6/23/2020", "6/24/20…
## $ date_formatted <date> 2020-07-05, 2020-07-06, 2020-07-07, 2020-07-08…
## $ date_end <chr> "7/5/2020", "7/6/2020", "7/7/2020", "7/8/2020",…
## $ rna_pct_change_15d <dbl> 0, 0, 0, 0, 0, 0, 0, 3683, 3683, 3683, 3683, 36…
## $ pos_PCR_prop_15d <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ percentile <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ sampling_prior <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes"…
## $ first_sample_date <chr> "7/5/2020", "7/5/2020", "7/5/2020", "7/5/2020",…
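The glimpse() output shows date_formatted as <date> sitting directly before date_end, which is still <chr>. As a quick illustration of how relocate() positions columns, the sketch below uses a hypothetical three-column tibble (x, y, z) rather than the wastewater data.

```r
library(dplyr)
library(tibble)

toy <- tibble(x = 1:2, y = c("a", "b"), z = c(TRUE, FALSE))

# .before places the column immediately ahead of the target column
moved_before <- toy %>% relocate(z, .before = y)
names(moved_before) # "x" "z" "y"

# .after places it immediately behind the target column
moved_after <- toy %>% relocate(x, .after = z)
names(moved_after) # "y" "z" "x"

# With neither argument, the column moves to the front
moved_front <- toy %>% relocate(z)
names(moved_front) # "z" "x" "y"
```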
Use the range() function on the date_formatted variable to display the range of dates in the data set. How does this compare to that of date_end? Why? (Hint: use the pull function first to pull the values.)
pull(sars_ww, date_formatted) %>% range()
## [1] "2020-07-05" "2024-05-11"
pull(sars_ww, date_end) %>% range()
## [1] "1/1/2021" "9/9/2023"
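# date_formatted is a Date, so range() returns the true earliest and latest dates.
# date_end is a character column, so range() compares its values as text (alphabetically), not as dates:
# "9/9/2023" comes last because it starts with "9", not because it is the latest date.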
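The same behavior can be reproduced on a four-value toy vector. The values below are the minimum and maximum strings from the two range() outputs above; the names date_chr, chr_range, and date_range are illustrative.

```r
library(lubridate)

date_chr <- c("9/9/2023", "1/1/2021", "5/11/2024", "7/5/2020")

# As text, values are ordered by their first character: "1" < "5" < "7" < "9"
chr_range <- range(date_chr)
chr_range # "1/1/2021" "9/9/2023"

# After parsing with mdy(), values are ordered chronologically
date_range <- range(mdy(date_chr))
date_range # "2020-07-05" "2024-05-11"
```

Parsing dates before summarizing or sorting them avoids this kind of silent error.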