Part 1

1.1

Load the package we will use in this lab.

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

Create some data to work with by running the following code chunk.

set.seed(1234)

int_vect <- rep(seq(from = 1, to = 10), times = 3)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

1.2

Determine the class of each of these new objects.

class(int_vect) # [1] "integer"
## [1] "integer"
class(rand_vect) # [1] "integer"
## [1] "integer"
class(TF_vect) # [1] "logical"
## [1] "logical"
class(TF_vect2) # [1] "character"
## [1] "character"

1.3

Are TF_vect and TF_vect2 different classes? Why or why not?

# Yes!
# TF_vect is logical: its `TRUE` and `FALSE` values have no quotes around them.
# TF_vect2 is character: the quotes turn "TRUE" and "FALSE" into text strings.
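To see why the class matters in practice, here is a minimal sketch using shortened stand-ins for the two vectors. The names tf_logical, tf_character, and tf_converted are made up for this illustration and are not part of the lab.

```r
# Shortened stand-ins for TF_vect and TF_vect2 (hypothetical names)
tf_logical <- c(TRUE, TRUE, FALSE)
tf_character <- c("TRUE", "TRUE", "FALSE")

# In arithmetic, TRUE counts as 1 and FALSE as 0, so sum() counts the TRUEs
sum(tf_logical)

# sum(tf_character) would stop with an error, because text cannot be added.
# as.logical() converts the strings back into logical values.
tf_converted <- as.logical(tf_character)
class(tf_converted)
identical(tf_converted, tf_logical)
```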

1.4

Create a tibble combining these vectors together called vect_data using the following code.

vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)

1.5

Coerce rand_vect to character class using as.character(). Save this vector as rand_char_vect. How is the output for rand_vect and rand_char_vect different?

rand_char_vect <- as.character(rand_vect)
rand_char_vect # Numbers now have quotation marks
##  [1] "28" "16" "26" "22" "5"  "12" "15" "9"  "5"  "6"  "16" "4"  "2"  "7"  "22"
## [16] "26" "6"  "15" "14" "20" "14" "30" "24" "30" "4"  "4"  "21" "8"  "20" "24"
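The quotation marks are more than cosmetic. The sketch below uses a short stand-in for rand_vect (the names num_vect and char_vect and their values are assumptions for illustration) to show how the two classes behave differently.

```r
# Short stand-in for rand_vect (hypothetical values)
num_vect <- c(28L, 5L, 16L)
char_vect <- as.character(num_vect)

# Arithmetic works on the integer vector
num_vect + 1L

# char_vect + 1 would stop with an error: text cannot be used in arithmetic

# Sorting differs: integers sort by value, strings sort character by character
sort(num_vect)
sort(char_vect)

# Coercing back with as.integer() recovers the original numbers
as.integer(char_vect)
```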

1.6

Read in the National Wastewater Surveillance System (NWSS) SARS-CoV-2 Wastewater data using the URL and the code provided.

The NWSS uses water from different sewage treatment plants to test for the SARS-CoV-2 virus, as a way to estimate how many COVID infections a community is experiencing.

sars_ww <- 
  read_csv(file = "https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv")
## Rows: 2813 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): reporting_jurisdiction, sample_location, key_plot_id, town_name, co...
## dbl (4): population_served, rna_pct_change_15d, pos_PCR_prop_15d, percentile
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.7

Use the mutate() function to create a new column named date_formatted, based on the date_end column. Hint: use the mdy() function. Reassign the result to sars_ww.

date_end: This is the last date of the sampling window. A sampling window is used to measure change in viral concentration.

# General format (a template only; it is commented out so the chunk runs)
# NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
sars_ww <- sars_ww %>% mutate(date_formatted = mdy(date_end))
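As a rough illustration of what mdy() does on its own, the sketch below parses two made-up date strings written in month/day/year order, like the values in date_end. The names date_strings and parsed_dates are hypothetical.

```r
library(lubridate)

# Made-up strings in month/day/year order (not taken from the data set)
date_strings <- c("7/5/2020", "12/31/2021")

# mdy() reads the text as month, day, year and returns Date objects
parsed_dates <- mdy(date_strings)
class(parsed_dates)
parsed_dates

# Date objects support date arithmetic, such as the time between two dates
parsed_dates[2] - parsed_dates[1]
```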

Practice on Your Own!

P.1

Move the date_formatted variable to be before date_end using the relocate() function. Take a look at the data using glimpse(). Note the difference between the date_end and date_formatted columns.

# General format (a template only; it is commented out so the chunk runs)
# NEWDATA <- OLD_DATA %>% relocate(COLUMN1, .before = COLUMN2)
sars_ww <- sars_ww %>% relocate(date_formatted, .before = date_end)

# alternative (note: this moves both columns to the front of the data set)
# sars_ww <- sars_ww %>% select(date_formatted, date_end, everything())

glimpse(sars_ww)
## Rows: 2,813
## Columns: 14
## $ reporting_jurisdiction <chr> "Missouri", "Missouri", "Missouri", "Missouri",…
## $ sample_location        <chr> "Treatment plant", "Treatment plant", "Treatmen…
## $ key_plot_id            <chr> "NWSS_mo_259_Treatment plant_raw wastewater", "…
## $ town_name              <chr> "Barry", "Barry", "Barry", "Barry", "Barry", "B…
## $ county_names           <chr> "Lawrence", "Lawrence", "Lawrence", "Lawrence",…
## $ population_served      <dbl> 9100, 9100, 9100, 9100, 9100, 9100, 9100, 9100,…
## $ date_start             <chr> "6/21/2020", "6/22/2020", "6/23/2020", "6/24/20…
## $ date_formatted         <date> 2020-07-05, 2020-07-06, 2020-07-07, 2020-07-08…
## $ date_end               <chr> "7/5/2020", "7/6/2020", "7/7/2020", "7/8/2020",…
## $ rna_pct_change_15d     <dbl> 0, 0, 0, 0, 0, 0, 0, 3683, 3683, 3683, 3683, 36…
## $ pos_PCR_prop_15d       <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ percentile             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ sampling_prior         <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes"…
## $ first_sample_date      <chr> "7/5/2020", "7/5/2020", "7/5/2020", "7/5/2020",…
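The difference between relocate() and a select() with everything() is easier to see on a tiny example. This sketch uses a made-up data frame named toy with columns a through d. None of these names come from the lab.

```r
library(dplyr)

# Made-up data frame with four columns (hypothetical names)
toy <- data.frame(a = 1, b = 2, c = 3, d = 4)

# relocate() moves d to sit just before c; the other columns stay in place
relocated_names <- toy %>% relocate(d, .before = c) %>% names()
relocated_names

# select() with everything() puts the named columns first, then the rest
selected_names <- toy %>% select(d, c, everything()) %>% names()
selected_names
```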

P.2

Use the range() function on the date_formatted variable to display the range of dates in the data set. How does this compare to that of date_end? Why? (Hint: use the pull() function first to pull the values.)

pull(sars_ww, date_formatted) %>% range()
## [1] "2020-07-05" "2024-05-11"
pull(sars_ww, date_end) %>% range()
## [1] "1/1/2021" "9/9/2023"
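# date_end is a character column, so range() compares the values as text
# (character by character), not as dates. "9/9/2023" comes last because it
# starts with "9", even though later dates exist in the data.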
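The same effect can be reproduced on a handful of made-up date strings. In this sketch the vector name date_strings and its values are assumptions for illustration and are not drawn from the data set.

```r
library(lubridate)

# Made-up month/day/year strings (not taken from the data set)
date_strings <- c("9/9/2020", "2/1/2021", "1/15/2022")

# On strings, range() compares text: "9/9/2020" is the "largest" value
# even though it is the earliest date
char_range <- range(date_strings)
char_range

# After parsing with mdy(), range() compares the actual dates
date_range <- range(mdy(date_strings))
date_range
```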