Part 1

1.1

Load all the packages we will use in this lab.

library(tidyverse)
library(dasehr)

Create some data to work with by running the following code chunk.

set.seed(1234)

int_vect <- rep(seq(from = 1, to = 10), times = 3)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

1.2

Determine the class of each of these new objects.

class(int_vect) # [1] "integer"
## [1] "integer"
class(rand_vect) # [1] "integer"
## [1] "integer"
class(TF_vect) # [1] "logical"
## [1] "logical"
class(TF_vect2) # [1] "character"
## [1] "character"

1.3

Are TF_vect and TF_vect2 different classes? Why or why not?

# Yes! TF_vect is logical and TF_vect2 is character.
# Logical vectors do not have quotes around `TRUE` and `FALSE` values; quoted values such as "TRUE" are character strings.
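To see why the class matters in practice, here is a small sketch using the same two vectors from 1.1. The names n_true and TF_converted are made up for this illustration and are not part of the lab.

```r
# Same vectors as in 1.1
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

# Logical values behave like 1 (TRUE) and 0 (FALSE) in arithmetic,
# so sum() counts how many values are TRUE
n_true <- sum(TF_vect)
n_true

# Character strings cannot be summed directly; coerce them to logical first
TF_converted <- as.logical(TF_vect2)

# After coercion the two vectors hold the same values
identical(TF_converted, TF_vect)
```

Calling sum(TF_vect2) without converting first would stop with an error, because sum() does not accept character input.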

1.4

Create a tibble called vect_data that combines these vectors, using the following code.

vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)
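One possible way to confirm that each column kept the class of its source vector is to apply class() to every column. This is only a sketch; col_classes is a hypothetical name, and the vectors are recreated here so the chunk runs on its own.

```r
library(tibble)

# Recreate the vectors from 1.1 so this chunk is self-contained
set.seed(1234)
int_vect <- rep(seq(from = 1, to = 10), times = 3)
rand_vect <- sample(x = 1:30, size = 30, replace = TRUE)
TF_vect <- rep(c(TRUE, TRUE, FALSE), times = 10)
TF_vect2 <- rep(c("TRUE", "TRUE", "FALSE"), times = 10)

vect_data <- tibble(int_vect, rand_vect, TF_vect, TF_vect2)

# sapply() loops over the columns and returns a named character vector
col_classes <- sapply(vect_data, class)
col_classes
```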

1.5

Coerce rand_vect to character class using as.character(). Save this vector as rand_char_vect. How is the output for rand_vect and rand_char_vect different?

rand_char_vect <- as.character(rand_vect)
rand_char_vect # Numbers now have quotation marks
##  [1] "28" "16" "26" "22" "5"  "12" "15" "9"  "5"  "6"  "16" "4"  "2"  "7"  "22"
## [16] "26" "6"  "15" "14" "20" "14" "30" "24" "30" "4"  "4"  "21" "8"  "20" "24"
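The quotation marks are more than cosmetic. As a minimal sketch with a short made-up vector (nums, nums_chr, and nums_back are hypothetical names), character "numbers" sort as text rather than by value, and they need to be converted back before doing any math.

```r
# A short made-up integer vector
nums <- c(28L, 16L, 5L)
nums_chr <- as.character(nums)

# Integers sort by value
sort(nums)      # 5 16 28

# Character strings sort by their first character, then the next, and so on
sort(nums_chr)  # "16" "28" "5"

# Coercing back to integer recovers the original values
nums_back <- as.integer(nums_chr)
identical(nums_back, nums)
```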

1.6

Read in the National Wastewater Surveillance System (NWSS) SARS-CoV-2 Wastewater data from the dasehr package using the code supplied in the chunk. Alternatively, read it in from the URL.

The NWSS tests water from different sewage treatment plants for SARS-CoV-2 as a way to estimate how many COVID-19 infections a community is experiencing.

covidww <- covid_wastewater
# covidww <- read_csv(file = "https://daseh.org/data/SARS-CoV-2_Wastewater_Data.csv")

1.7

Use the mutate() function to create a new column named date_formatted that is of Date class. The new variable is created from the first_sample_date column. Hint: use the mdy() function. Reassign to covidww.

# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
covidww <- covidww %>% mutate(date_formatted = mdy(first_sample_date))
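To get a feel for what mdy() does before applying it to the whole column, here is a small sketch on two made-up date strings written in month/day/year order. The names date_chr and date_parsed are hypothetical. The lubridate package is loaded explicitly so the chunk runs on its own.

```r
library(lubridate)

# Made-up strings in month/day/year order, like first_sample_date
date_chr <- c("7/5/2020", "12/31/2021")

# mdy() parses the strings into Date values
date_parsed <- mdy(date_chr)

class(date_chr)     # "character"
class(date_parsed)  # "Date"
date_parsed

# Date values support date arithmetic, which character strings do not
diff(date_parsed)
```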

Practice on Your Own!

P.1

Move the date_formatted variable to be before first_sample_date using the relocate() function. Take a look at the data using glimpse(). Note the difference between the first_sample_date and date_formatted columns.

# General format
NEWDATA <- OLD_DATA %>% relocate(COLUMN1, .before = COLUMN2)
covidww <- covidww %>% relocate(date_formatted, .before = first_sample_date)

# alternative (moves both columns to the front instead of keeping them in place)
# covidww <- covidww %>% select(date_formatted, first_sample_date, everything())

glimpse(covidww)
## Rows: 776,059
## Columns: 13
## $ reporting_jurisdiction <chr> "Missouri", "Missouri", "Missouri", "Missouri",…
## $ sample_location        <chr> "Treatment plant", "Treatment plant", "Treatmen…
## $ key_plot_id            <chr> "NWSS_mo_259_Treatment plant_raw wastewater", "…
## $ county_names           <chr> "Barry,Lawrence", "Barry,Lawrence", "Barry,Lawr…
## $ population_served      <dbl> 9100, 9100, 9100, 9100, 9100, 9100, 9100, 9100,…
## $ date_start             <chr> "6/21/2020", "6/22/2020", "6/23/2020", "6/24/20…
## $ date_end               <chr> "7/5/2020", "7/6/2020", "7/7/2020", "7/8/2020",…
## $ rna_pct_change_15d     <dbl> NA, NA, NA, NA, NA, NA, NA, 3683, 3683, 3683, 3…
## $ pos_PCR_prop_15d       <dbl> 100, 100, 100, 100, 100, 100, 100, 100, 100, 10…
## $ percentile             <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ sampling_prior         <chr> "yes", "yes", "yes", "yes", "yes", "yes", "yes"…
## $ date_formatted         <date> 2020-07-05, 2020-07-05, 2020-07-05, 2020-07-05…
## $ first_sample_date      <chr> "7/5/2020", "7/5/2020", "7/5/2020", "7/5/2020",…
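The difference between relocate() and select() with everything() is easier to see on a tiny example. This is a sketch with a made-up tibble; toy, moved, and fronted are hypothetical names.

```r
library(dplyr)
library(tibble)

# A made-up one-row tibble with four columns
toy <- tibble(a = 1, b = 2, c = 3, d = 4)

# relocate() moves only the named column; the rest keep their order
moved <- toy %>% relocate(d, .before = b)
names(moved)    # "a" "d" "b" "c"

# select() with everything() puts the listed columns first,
# followed by all remaining columns
fronted <- toy %>% select(d, b, everything())
names(fronted)  # "d" "b" "a" "c"
```

Both approaches put d directly before b, but only relocate() leaves the other columns where they were.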

P.2

Use the range() function on the date_formatted variable to display the range of dates in the data set. How does this compare to the range of first_sample_date? Why? (Hint: use the pull() function first to pull the values.)

pull(covidww, date_formatted) %>% range()
## [1] "2020-07-05" "2024-05-06"
pull(covidww, first_sample_date) %>% range()
## [1] "1/1/2023" "9/9/2022"
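# `first_sample_date` is a character vector, so `range()` compares the strings alphabetically, not chronologically.
# That is why "9/9/2022" is returned as the max even though later dates exist.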
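The same behavior can be reproduced on three made-up date strings. This is only a sketch; dates_chr, chr_range, and date_range are hypothetical names.

```r
library(lubridate)

# Made-up strings in month/day/year order
dates_chr <- c("1/1/2023", "9/9/2022", "7/5/2020")

# As character strings, the values are ordered by their first character:
# "1" sorts first and "9" sorts last
chr_range <- range(dates_chr)
chr_range   # "1/1/2023" "9/9/2022"

# As Date values, the earliest and latest dates are found
date_range <- range(mdy(dates_chr))
date_range  # "2020-07-05" "2023-01-01"
```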