In this lab you can use the interactive console to explore but please record your commands here. Remember anything you type here can be “sent” to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But only inside the code chunks designated with the {r} areas).

Part 1

First let’s get load our packages.

# don't forget to load the packages that you will need!
library(dplyr)
library(tidyverse)
library(dasehr)

Now let’s load the ER dataset by running one of these chunks. We can either load the data from the website or from the dasehr package.

library(dasehr)
ER <- CO_heat_ER_bygender

Or

ER <- 
  read_csv("https://daseh.org/data/Colorado_ER_heat_visits_by_county_gender.csv")
## Rows: 240 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): county, gender
## dbl (5): rate, lower95cl, upper95cl, visits, year
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check that it worked by seeing if you have the ER data.

ER
## # A tibble: 240 × 7
##    county  rate lower95cl upper95cl visits  year gender
##    <chr>  <dbl>     <dbl>     <dbl>  <dbl> <dbl> <chr> 
##  1 Adams   7.60      4.38     11.7      17  2011 Female
##  2 Adams  NA        NA        NA        NA  2012 Female
##  3 Adams   6.22      3.37      9.93     14  2013 Female
##  4 Adams  NA        NA        NA        NA  2014 Female
##  5 Adams  NA        NA        NA        NA  2015 Female
##  6 Adams   6.16      3.35      9.82     14  2016 Female
##  7 Adams   4.69      2.31      7.88     11  2017 Female
##  8 Adams   6.39      3.62      9.93     16  2018 Female
##  9 Adams   6.64      3.85     10.2      17  2019 Female
## 10 Adams  NA        NA        NA        NA  2020 Female
## # ℹ 230 more rows

1.1

What class is ER?

class(ER)
## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

1.2

How many observations (rows) and variables (columns) are in the dataset - try the dim() function?

dim(ER)
## [1] 240   7
nrow(ER)
## [1] 240
ncol(ER)
## [1] 7

1.3

Next, rename the column lower95cl to be lower_limit (hint - use rename() and watch out for the order of the new and old names!).

ER<- rename(ER, lower_limit = lower95cl)
head(ER)
## # A tibble: 6 × 7
##   county  rate lower_limit upper95cl visits  year gender
##   <chr>  <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
## 1 Adams   7.60        4.38     11.7      17  2011 Female
## 2 Adams  NA          NA        NA        NA  2012 Female
## 3 Adams   6.22        3.37      9.93     14  2013 Female
## 4 Adams  NA          NA        NA        NA  2014 Female
## 5 Adams  NA          NA        NA        NA  2015 Female
## 6 Adams   6.16        3.35      9.82     14  2016 Female

1.4

Convert the column names of ER to be all upper case. Use rename_with(), and the toupper command. Save this as a new dataset called ER_upper.

ER_upper <- rename_with(ER, toupper)
head(ER_upper)
## # A tibble: 6 × 7
##   COUNTY  RATE LOWER_LIMIT UPPER95CL VISITS  YEAR GENDER
##   <chr>  <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
## 1 Adams   7.60        4.38     11.7      17  2011 Female
## 2 Adams  NA          NA        NA        NA  2012 Female
## 3 Adams   6.22        3.37      9.93     14  2013 Female
## 4 Adams  NA          NA        NA        NA  2014 Female
## 5 Adams  NA          NA        NA        NA  2015 Female
## 6 Adams   6.16        3.35      9.82     14  2016 Female
dim(ER_upper)
## [1] 240   7

Practice on Your Own!

P.1

How can you print the first 3 rows and the last 3 rows of ER (in two lines of code)?

head(ER, 3)
## # A tibble: 3 × 7
##   county  rate lower_limit upper95cl visits  year gender
##   <chr>  <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
## 1 Adams   7.60        4.38     11.7      17  2011 Female
## 2 Adams  NA          NA        NA        NA  2012 Female
## 3 Adams   6.22        3.37      9.93     14  2013 Female
tail(ER, 3)
## # A tibble: 3 × 7
##   county  rate lower_limit upper95cl visits  year gender
##   <chr>  <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
## 1 Weld    10.2        5.82      15.8     16  2020 Male  
## 2 Weld    11.7        6.53      16.9     20  2021 Male  
## 3 Weld    14.5        8.76      20.3     25  2022 Male

Part 2

2.1

Create a subset of the ER that only contains the columns: county, year, and rate and assign this object to ER_sub - what are the dimensions of this dataset?

ER_sub <- select(ER, county, year, rate)
dim(ER_sub)
## [1] 240   3

2.2

Start with ER again instead of the dataset you just made. Subset the data to only include the rate column and the columns that end with “cl”. Hint: use select() and ends_with(). Assign this subset of the data to be ER2. Again take a look at the data and check the dimensions.

ER2 <- select(ER, rate, ends_with("cl"))
ER2
## # A tibble: 240 × 2
##     rate upper95cl
##    <dbl>     <dbl>
##  1  7.60     11.7 
##  2 NA        NA   
##  3  6.22      9.93
##  4 NA        NA   
##  5 NA        NA   
##  6  6.16      9.82
##  7  4.69      7.88
##  8  6.39      9.93
##  9  6.64     10.2 
## 10 NA        NA   
## # ℹ 230 more rows
dim(ER2)
## [1] 240   2

2.3

Pull the variable rate from ER2. How does this differ form selecting it? Use head() to take a look at both options.

head(pull(ER2, rate))
## [1] 7.596958       NA 6.218501       NA       NA 6.161907
head(select(ER2, rate))
## # A tibble: 6 × 1
##    rate
##   <dbl>
## 1  7.60
## 2 NA   
## 3  6.22
## 4 NA   
## 5 NA   
## 6  6.16

2.4

Subset the rows of ER2 that have more than 10 for rate - how many rows are there? Use filter().

ER_sub <- filter(ER2, rate > 10)
nrow(ER_sub)
## [1] 40

2.5

Subset the rows of ER that have a year value less than 2012 and more than 10 rate - how many are there?

filter(ER, year < 2012 & rate > 10) # all of these options work
## # A tibble: 3 × 7
##   county  rate lower_limit upper95cl visits  year gender
##   <chr>  <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
## 1 Pueblo  16.7        8.74      27.2     13  2011 Male  
## 2 Weld    10.3        5.43      16.6     13  2011 Female
## 3 Weld    10.1        5.37      16.3     13  2011 Male
nrow(filter(ER, year < 2012 & rate > 10))
## [1] 3
nrow(filter(ER, year < 2012, rate > 10))
## [1] 3

2.6

Subset the rows of ER that have a year value of less than or equal to 2012 and more than 10 rate - how many are there?

filter(ER, year <= 2012 & rate > 10) # all of these options work
## # A tibble: 6 × 7
##   county  rate lower_limit upper95cl visits  year gender
##   <chr>  <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
## 1 Pueblo  16.7        8.74      27.2     13  2011 Male  
## 2 Pueblo  20.9       11.8       32.5     16  2012 Male  
## 3 Weld    10.3        5.43      16.6     13  2011 Female
## 4 Weld    13.2        7.64      20.3     17  2012 Female
## 5 Weld    10.1        5.37      16.3     13  2011 Male  
## 6 Weld    12.8        6.96      20.3     15  2012 Male
nrow(filter(ER, year <= 2012 & rate > 10))
## [1] 6
nrow(filter(ER, year <= 2012, rate > 10))
## [1] 6

2.7

Why do the answers for 2.5 and 2.6 differ?

# They differ because there are rows with values that are exactly 2010.

Practice on Your Own!

P.2

Subset the rows of ER for rows that have county of Denver, or less than 4 `rate``. How many rows have both?

filter(ER, county == "Denver" | rate < 4)
## # A tibble: 56 × 7
##    county    rate lower_limit upper95cl visits  year gender
##    <chr>    <dbl>       <dbl>     <dbl>  <dbl> <dbl> <chr> 
##  1 Arapahoe  3.93        1.95      6.58     11  2013 Female
##  2 Arapahoe  3.66        1.93      5.95     13  2018 Female
##  3 Arapahoe  3.38        1.71      5.61     12  2021 Female
##  4 Arapahoe  3.27        1.60      5.54     11  2020 Male  
##  5 Cheyenne  0           0         0         0  2011 Female
##  6 Cheyenne  0           0         0         0  2012 Female
##  7 Cheyenne  0           0         0         0  2013 Female
##  8 Cheyenne  0           0         0         0  2014 Female
##  9 Cheyenne  0           0         0         0  2015 Female
## 10 Cheyenne  0           0         0         0  2016 Female
## # ℹ 46 more rows
nrow(filter(ER, county == "Denver" & rate < 4))
## [1] 5

P.3

Select the variables that contain the letter “a” from ER.

select(ER, contains("a"))
## # A tibble: 240 × 2
##     rate  year
##    <dbl> <dbl>
##  1  7.60  2011
##  2 NA     2012
##  3  6.22  2013
##  4 NA     2014
##  5 NA     2015
##  6  6.16  2016
##  7  4.69  2017
##  8  6.39  2018
##  9  6.64  2019
## 10 NA     2020
## # ℹ 230 more rows

Part 3

3.1

Create a subset called ER_2012 from ER that only contains the rows for the year 2012 and only the columns: county and rate. year should not be included in ER_sub. What are the dimensions of this dataset? Don’t use pipes (%>%) and instead do this in two steps creating the ER_2012 object with filter and updating it with select.

ER_2012 <- filter(ER, year == 2012)
ER_2012  <- select(ER_2012, county, rate)
ER_2012 # can get dimensions just from viewing the tibble
## # A tibble: 20 × 2
##    county     rate
##    <chr>     <dbl>
##  1 Adams     NA   
##  2 Adams      5.48
##  3 Arapahoe   4.20
##  4 Arapahoe   5.17
##  5 Cheyenne   0   
##  6 Cheyenne   0   
##  7 Denver     5.99
##  8 Denver     7.89
##  9 El Paso   NA   
## 10 El Paso    7.06
## 11 Jefferson  5.39
## 12 Jefferson  7.26
## 13 Larimer    6.48
## 14 Larimer    9.13
## 15 Pueblo    NA   
## 16 Pueblo    20.9 
## 17 Statewide  5.64
## 18 Statewide  7.56
## 19 Weld      13.2 
## 20 Weld      12.8
dim(ER_2012) # alternatively can use dim() function
## [1] 20  2

3.2

Try the same thing again with pipes %>%.

ER_2012 <- ER %>%
  filter(year == 2012) %>%
  select(county, rate)
ER_2012
## # A tibble: 20 × 2
##    county     rate
##    <chr>     <dbl>
##  1 Adams     NA   
##  2 Adams      5.48
##  3 Arapahoe   4.20
##  4 Arapahoe   5.17
##  5 Cheyenne   0   
##  6 Cheyenne   0   
##  7 Denver     5.99
##  8 Denver     7.89
##  9 El Paso   NA   
## 10 El Paso    7.06
## 11 Jefferson  5.39
## 12 Jefferson  7.26
## 13 Larimer    6.48
## 14 Larimer    9.13
## 15 Pueblo    NA   
## 16 Pueblo    20.9 
## 17 Statewide  5.64
## 18 Statewide  7.56
## 19 Weld      13.2 
## 20 Weld      12.8
dim(ER_2012)
## [1] 20  2

What happens if you do the steps in a different order? Why does this not work?

#ER_2012 <- ER %>%
#  select(county, rate) %>%
#  filter(year == 2012)
# you get an error because there is now year variable to filter from

1.3

Re-order the rows of ER_2012 by population in increasing order. Use arrange(). What is county with the smallest rate?

ER_2012 <- arrange(ER_2012, rate)
ER_2012
## # A tibble: 20 × 2
##    county     rate
##    <chr>     <dbl>
##  1 Cheyenne   0   
##  2 Cheyenne   0   
##  3 Arapahoe   4.20
##  4 Arapahoe   5.17
##  5 Jefferson  5.39
##  6 Adams      5.48
##  7 Statewide  5.64
##  8 Denver     5.99
##  9 Larimer    6.48
## 10 El Paso    7.06
## 11 Jefferson  7.26
## 12 Statewide  7.56
## 13 Denver     7.89
## 14 Larimer    9.13
## 15 Weld      12.8 
## 16 Weld      13.2 
## 17 Pueblo    20.9 
## 18 Adams     NA   
## 19 El Paso   NA   
## 20 Pueblo    NA
arrange(ER_2012, rate) %>%
  head(n = 1) %>%
  pull(county)
## [1] "Cheyenne"

1.4

Create a new variable in ER_2012 called rate1000, which is equal to rate divided by 1000, using mutate()(don’t forget to reassign ER_2012). Use pipes %>%.

# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
ER_2012 <- ER_2012 %>% mutate(rate1000 = rate/1000)

Practice on Your Own!

P.4

Move the rate1000 column to be before county in the ER_2012 dataset. Use relocate().

ER_2012 <- ER_2012 %>% relocate(rate1000, .before = county)
ER_2012
## # A tibble: 20 × 3
##    rate1000 county     rate
##       <dbl> <chr>     <dbl>
##  1  0       Cheyenne   0   
##  2  0       Cheyenne   0   
##  3  0.00420 Arapahoe   4.20
##  4  0.00517 Arapahoe   5.17
##  5  0.00539 Jefferson  5.39
##  6  0.00548 Adams      5.48
##  7  0.00564 Statewide  5.64
##  8  0.00599 Denver     5.99
##  9  0.00648 Larimer    6.48
## 10  0.00706 El Paso    7.06
## 11  0.00726 Jefferson  7.26
## 12  0.00756 Statewide  7.56
## 13  0.00789 Denver     7.89
## 14  0.00913 Larimer    9.13
## 15  0.0128  Weld      12.8 
## 16  0.0132  Weld      13.2 
## 17  0.0209  Pueblo    20.9 
## 18 NA       Adams     NA   
## 19 NA       El Paso   NA   
## 20 NA       Pueblo    NA

P.5

How can you find the value of rate in 2020 for Statewide for Females - using the initial ER data - without just looking at the data manually and instead use functions we learned today?

Note that gender was recorded as binary, which we know isn’t really accurate. This is something you might encounter. Please see this article about ways to measure gender in a more inclusive way: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6526522/.

ER %>%
  filter(county == "Statewide" & year == "2020" & gender == "Female") %>%
  select(rate)
## # A tibble: 1 × 1
##    rate
##   <dbl>
## 1  3.51