In this lab you can use the interactive console to explore but please
record your commands here. Remember anything you type here can be “sent”
to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But
only inside the code chunks designated with the {r}
areas).
First let’s get load our packages.
# don't forget to load the packages that you will need!
library(dplyr)
library(tidyverse)
library(dasehr)
Now let’s load the ER dataset by running one of these chunks. We can
either load the data from the website or from the dasehr
package.
library(dasehr)
ER <- CO_heat_ER_bygender
Or
ER <-
read_csv("https://daseh.org/data/Colorado_ER_heat_visits_by_county_gender.csv")
## Rows: 240 Columns: 7
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): county, gender
## dbl (5): rate, lower95cl, upper95cl, visits, year
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Check that it worked by seeing if you have the ER
data.
ER
## # A tibble: 240 × 7
## county rate lower95cl upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adams 7.60 4.38 11.7 17 2011 Female
## 2 Adams NA NA NA NA 2012 Female
## 3 Adams 6.22 3.37 9.93 14 2013 Female
## 4 Adams NA NA NA NA 2014 Female
## 5 Adams NA NA NA NA 2015 Female
## 6 Adams 6.16 3.35 9.82 14 2016 Female
## 7 Adams 4.69 2.31 7.88 11 2017 Female
## 8 Adams 6.39 3.62 9.93 16 2018 Female
## 9 Adams 6.64 3.85 10.2 17 2019 Female
## 10 Adams NA NA NA NA 2020 Female
## # ℹ 230 more rows
What class is ER
?
class(ER)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
How many observations (rows) and variables (columns) are in the
dataset - try the dim()
function?
dim(ER)
## [1] 240 7
nrow(ER)
## [1] 240
ncol(ER)
## [1] 7
Next, rename the column lower95cl
to be
lower_limit
(hint - use rename()
and watch out
for the order of the new and old names!).
ER<- rename(ER, lower_limit = lower95cl)
head(ER)
## # A tibble: 6 × 7
## county rate lower_limit upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adams 7.60 4.38 11.7 17 2011 Female
## 2 Adams NA NA NA NA 2012 Female
## 3 Adams 6.22 3.37 9.93 14 2013 Female
## 4 Adams NA NA NA NA 2014 Female
## 5 Adams NA NA NA NA 2015 Female
## 6 Adams 6.16 3.35 9.82 14 2016 Female
Convert the column names of ER
to be all upper case. Use
rename_with()
, and the toupper
command. Save
this as a new dataset called ER_upper
.
ER_upper <- rename_with(ER, toupper)
head(ER_upper)
## # A tibble: 6 × 7
## COUNTY RATE LOWER_LIMIT UPPER95CL VISITS YEAR GENDER
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adams 7.60 4.38 11.7 17 2011 Female
## 2 Adams NA NA NA NA 2012 Female
## 3 Adams 6.22 3.37 9.93 14 2013 Female
## 4 Adams NA NA NA NA 2014 Female
## 5 Adams NA NA NA NA 2015 Female
## 6 Adams 6.16 3.35 9.82 14 2016 Female
dim(ER_upper)
## [1] 240 7
How can you print the first 3 rows and the last 3 rows of
ER
(in two lines of code)?
head(ER, 3)
## # A tibble: 3 × 7
## county rate lower_limit upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Adams 7.60 4.38 11.7 17 2011 Female
## 2 Adams NA NA NA NA 2012 Female
## 3 Adams 6.22 3.37 9.93 14 2013 Female
tail(ER, 3)
## # A tibble: 3 × 7
## county rate lower_limit upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Weld 10.2 5.82 15.8 16 2020 Male
## 2 Weld 11.7 6.53 16.9 20 2021 Male
## 3 Weld 14.5 8.76 20.3 25 2022 Male
Create a subset of the ER
that only contains the
columns: county
, year
, and rate
and assign this object to ER_sub
- what are the dimensions
of this dataset?
ER_sub <- select(ER, county, year, rate)
dim(ER_sub)
## [1] 240 3
Start with ER
again instead of the dataset you just
made. Subset the data to only include the rate
column and
the columns that end with “cl”. Hint: use select()
and
ends_with()
. Assign this subset of the data to be
ER2
. Again take a look at the data and check the
dimensions.
ER2 <- select(ER, rate, ends_with("cl"))
ER2
## # A tibble: 240 × 2
## rate upper95cl
## <dbl> <dbl>
## 1 7.60 11.7
## 2 NA NA
## 3 6.22 9.93
## 4 NA NA
## 5 NA NA
## 6 6.16 9.82
## 7 4.69 7.88
## 8 6.39 9.93
## 9 6.64 10.2
## 10 NA NA
## # ℹ 230 more rows
dim(ER2)
## [1] 240 2
Pull the variable rate
from ER2
. How does
this differ form selecting it? Use head() to take a look at both
options.
head(pull(ER2, rate))
## [1] 7.596958 NA 6.218501 NA NA 6.161907
head(select(ER2, rate))
## # A tibble: 6 × 1
## rate
## <dbl>
## 1 7.60
## 2 NA
## 3 6.22
## 4 NA
## 5 NA
## 6 6.16
Subset the rows of ER2
that have more
than 10 for rate - how many rows are there? Use
filter()
.
ER_sub <- filter(ER2, rate > 10)
nrow(ER_sub)
## [1] 40
Subset the rows of ER
that have a year value
less than 2012 and more than 10 rate -
how many are there?
filter(ER, year < 2012 & rate > 10) # all of these options work
## # A tibble: 3 × 7
## county rate lower_limit upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Pueblo 16.7 8.74 27.2 13 2011 Male
## 2 Weld 10.3 5.43 16.6 13 2011 Female
## 3 Weld 10.1 5.37 16.3 13 2011 Male
nrow(filter(ER, year < 2012 & rate > 10))
## [1] 3
nrow(filter(ER, year < 2012, rate > 10))
## [1] 3
Subset the rows of ER
that have a year value of
less than or equal to 2012 and more
than 10 rate - how many are there?
filter(ER, year <= 2012 & rate > 10) # all of these options work
## # A tibble: 6 × 7
## county rate lower_limit upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Pueblo 16.7 8.74 27.2 13 2011 Male
## 2 Pueblo 20.9 11.8 32.5 16 2012 Male
## 3 Weld 10.3 5.43 16.6 13 2011 Female
## 4 Weld 13.2 7.64 20.3 17 2012 Female
## 5 Weld 10.1 5.37 16.3 13 2011 Male
## 6 Weld 12.8 6.96 20.3 15 2012 Male
nrow(filter(ER, year <= 2012 & rate > 10))
## [1] 6
nrow(filter(ER, year <= 2012, rate > 10))
## [1] 6
Why do the answers for 2.5 and 2.6 differ?
# They differ because there are rows with values that are exactly 2010.
Subset the rows of ER
for rows that have
county
of Denver
, or
less than 4 `rate``. How many rows have both?
filter(ER, county == "Denver" | rate < 4)
## # A tibble: 56 × 7
## county rate lower_limit upper95cl visits year gender
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Arapahoe 3.93 1.95 6.58 11 2013 Female
## 2 Arapahoe 3.66 1.93 5.95 13 2018 Female
## 3 Arapahoe 3.38 1.71 5.61 12 2021 Female
## 4 Arapahoe 3.27 1.60 5.54 11 2020 Male
## 5 Cheyenne 0 0 0 0 2011 Female
## 6 Cheyenne 0 0 0 0 2012 Female
## 7 Cheyenne 0 0 0 0 2013 Female
## 8 Cheyenne 0 0 0 0 2014 Female
## 9 Cheyenne 0 0 0 0 2015 Female
## 10 Cheyenne 0 0 0 0 2016 Female
## # ℹ 46 more rows
nrow(filter(ER, county == "Denver" & rate < 4))
## [1] 5
Select the variables that contain the letter “a” from
ER
.
select(ER, contains("a"))
## # A tibble: 240 × 2
## rate year
## <dbl> <dbl>
## 1 7.60 2011
## 2 NA 2012
## 3 6.22 2013
## 4 NA 2014
## 5 NA 2015
## 6 6.16 2016
## 7 4.69 2017
## 8 6.39 2018
## 9 6.64 2019
## 10 NA 2020
## # ℹ 230 more rows
Create a subset called ER_2012
from ER
that
only contains the rows for the year 2012 and only the columns:
county
and rate
. year
should not
be included in ER_sub
. What are the dimensions of this
dataset? Don’t use pipes (%>%
) and instead do this in
two steps creating the ER_2012
object with
filter
and updating it with select
.
ER_2012 <- filter(ER, year == 2012)
ER_2012 <- select(ER_2012, county, rate)
ER_2012 # can get dimensions just from viewing the tibble
## # A tibble: 20 × 2
## county rate
## <chr> <dbl>
## 1 Adams NA
## 2 Adams 5.48
## 3 Arapahoe 4.20
## 4 Arapahoe 5.17
## 5 Cheyenne 0
## 6 Cheyenne 0
## 7 Denver 5.99
## 8 Denver 7.89
## 9 El Paso NA
## 10 El Paso 7.06
## 11 Jefferson 5.39
## 12 Jefferson 7.26
## 13 Larimer 6.48
## 14 Larimer 9.13
## 15 Pueblo NA
## 16 Pueblo 20.9
## 17 Statewide 5.64
## 18 Statewide 7.56
## 19 Weld 13.2
## 20 Weld 12.8
dim(ER_2012) # alternatively can use dim() function
## [1] 20 2
Try the same thing again with pipes %>%
.
ER_2012 <- ER %>%
filter(year == 2012) %>%
select(county, rate)
ER_2012
## # A tibble: 20 × 2
## county rate
## <chr> <dbl>
## 1 Adams NA
## 2 Adams 5.48
## 3 Arapahoe 4.20
## 4 Arapahoe 5.17
## 5 Cheyenne 0
## 6 Cheyenne 0
## 7 Denver 5.99
## 8 Denver 7.89
## 9 El Paso NA
## 10 El Paso 7.06
## 11 Jefferson 5.39
## 12 Jefferson 7.26
## 13 Larimer 6.48
## 14 Larimer 9.13
## 15 Pueblo NA
## 16 Pueblo 20.9
## 17 Statewide 5.64
## 18 Statewide 7.56
## 19 Weld 13.2
## 20 Weld 12.8
dim(ER_2012)
## [1] 20 2
What happens if you do the steps in a different order? Why does this not work?
#ER_2012 <- ER %>%
# select(county, rate) %>%
# filter(year == 2012)
# you get an error because there is now year variable to filter from
Re-order the rows of ER_2012
by population in increasing
order. Use arrange()
. What is county with the smallest
rate?
ER_2012 <- arrange(ER_2012, rate)
ER_2012
## # A tibble: 20 × 2
## county rate
## <chr> <dbl>
## 1 Cheyenne 0
## 2 Cheyenne 0
## 3 Arapahoe 4.20
## 4 Arapahoe 5.17
## 5 Jefferson 5.39
## 6 Adams 5.48
## 7 Statewide 5.64
## 8 Denver 5.99
## 9 Larimer 6.48
## 10 El Paso 7.06
## 11 Jefferson 7.26
## 12 Statewide 7.56
## 13 Denver 7.89
## 14 Larimer 9.13
## 15 Weld 12.8
## 16 Weld 13.2
## 17 Pueblo 20.9
## 18 Adams NA
## 19 El Paso NA
## 20 Pueblo NA
arrange(ER_2012, rate) %>%
head(n = 1) %>%
pull(county)
## [1] "Cheyenne"
Create a new variable in ER_2012
called
rate1000
, which is equal to rate
divided by
1000, using mutate()
(don’t forget to reassign
ER_2012
). Use pipes %>%
.
# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
ER_2012 <- ER_2012 %>% mutate(rate1000 = rate/1000)
Move the rate1000
column to be before
county
in the ER_2012
dataset. Use
relocate()
.
ER_2012 <- ER_2012 %>% relocate(rate1000, .before = county)
ER_2012
## # A tibble: 20 × 3
## rate1000 county rate
## <dbl> <chr> <dbl>
## 1 0 Cheyenne 0
## 2 0 Cheyenne 0
## 3 0.00420 Arapahoe 4.20
## 4 0.00517 Arapahoe 5.17
## 5 0.00539 Jefferson 5.39
## 6 0.00548 Adams 5.48
## 7 0.00564 Statewide 5.64
## 8 0.00599 Denver 5.99
## 9 0.00648 Larimer 6.48
## 10 0.00706 El Paso 7.06
## 11 0.00726 Jefferson 7.26
## 12 0.00756 Statewide 7.56
## 13 0.00789 Denver 7.89
## 14 0.00913 Larimer 9.13
## 15 0.0128 Weld 12.8
## 16 0.0132 Weld 13.2
## 17 0.0209 Pueblo 20.9
## 18 NA Adams NA
## 19 NA El Paso NA
## 20 NA Pueblo NA
How can you find the value of rate
in 2020 for Statewide
for Females - using the initial ER data - without just looking at the
data manually and instead use functions we learned today?
Note that gender was recorded as binary, which we know isn’t really accurate. This is something you might encounter. Please see this article about ways to measure gender in a more inclusive way: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6526522/.
ER %>%
filter(county == "Statewide" & year == "2020" & gender == "Female") %>%
select(rate)
## # A tibble: 1 × 1
## rate
## <dbl>
## 1 3.51