Refresh the website and get the latest version of the labs and slides! We are constantly making improvements.
Refresh the website and get the latest version of the labs and slides! We are constantly making improvements.
<-
to save (assign) values to objectsc()
to combine vectorslength()
, class()
, and str()
tell you information about an objectseq()
function helps you create numeric vectors (from
,to
, by
, and length.out
arguments)rep()
function helps you create vectors with the each
and times
argumentsreadr
has helpful functions like read_csv()
that can help you import data into RIn this module, we will show you how to:
We will largely focus on the dplyr
package which is part of the tidyverse
.
Some resources on how to use dplyr
:
The dplyr
package is one of the most helpful packages for altering your data to get it into a form that is useful for creating visualizations, summarizing, or more deeply analyzing.
So you can imagine using pliers on your data.
See this website for a list of the packages included in the tidyverse
: https://www.tidyverse.org/packages/
library(tidyverse) # loads dplyr and other packages!
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ── ✔ forcats 1.0.0 ✔ stringr 1.5.1 ✔ lubridate 1.9.3 ✔ tibble 3.2.1 ✔ purrr 1.0.2 ✔ tidyr 1.3.1 ✔ readr 2.1.5 ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ── ✖ dplyr::filter() masks stats::filter() ✖ dplyr::lag() masks stats::lag() ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
We will work with data called ER
about heat-related ER visits between 2011 and 2022, as reported by the state of Colorado, specifically made available by the Colorado Environmental Public Health Tracking program website. Full dataset available at https://coepht.colorado.gov/heat-related-illness.
library(dasehr) ER <- CO_heat_ER_bygender #or we could do this: ER <- read_csv("https://daseh.org/data/Colorado_ER_heat_visits_by_county_gender.csv")
Rows: 240 Columns: 7 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (2): county, gender dbl (5): rate, lower95cl, upper95cl, visits, year ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
dim()
The dim()
, nrow()
, and ncol()
functions are good options to check the dimensions of your data before moving forward.
dim(ER) # rows, columns
[1] 240 7
nrow(ER) # number of rows
[1] 240
ncol(ER) # number of columns
[1] 7
glimpse()
In addition to head()
and tail()
, the glimpse()
function of the dplyr
package is another great function to look at your data.
glimpse(ER)
Rows: 240 Columns: 7 $ county <chr> "Adams", "Adams", "Adams", "Adams", "Adams", "Adams", "Adams… $ rate <dbl> 7.596958, NA, 6.218501, NA, NA, 6.161907, 4.685828, 6.386427… $ lower95cl <dbl> 4.383039, NA, 3.367822, NA, NA, 3.350234, 2.314647, 3.619401… $ upper95cl <dbl> 11.680261, NA, 9.928823, NA, NA, 9.816068, 7.879132, 9.92620… $ visits <dbl> 17, NA, 14, NA, NA, 14, 11, 16, 17, NA, 19, 18, 12, 13, 17, … $ year <dbl> 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020, … $ gender <chr> "Female", "Female", "Female", "Female", "Female", "Female", …
slice_sample()
What if you want to see the middle of your data? You can use the slice_sample()
function of the dplyr
package to see a random set of rows. You can specify the number of rows with the n
argument.
slice_sample(ER, n = 2)
# A tibble: 2 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe 7.18 4.04 10.3 21 2016 Male 2 Weld NA NA NA NA 2014 Female
slice_sample(ER, n = 2)
# A tibble: 2 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Cheyenne 0 0 0 0 2012 Male 2 Jefferson 4.15 2.01 7.04 11 2011 Female
An older version of data in tables is called a data frame. The mtcars dataset is an example of this.
class(mtcars)
[1] "data.frame"
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Tibbles are a fancier version of data frames:
ER
# A tibble: 240 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female 3 Adams 6.22 3.37 9.93 14 2013 Female 4 Adams NA NA NA NA 2014 Female 5 Adams NA NA NA NA 2015 Female 6 Adams 6.16 3.35 9.82 14 2016 Female 7 Adams 4.69 2.31 7.88 11 2017 Female 8 Adams 6.39 3.62 9.93 16 2018 Female 9 Adams 6.64 3.85 10.2 17 2019 Female 10 Adams NA NA NA NA 2020 Female # ℹ 230 more rows
tibble
If we wanted to create a tibble
(“fancy” data frame), we can using the tibble()
function on a data frame.
tbl_mtcars <- tibble(mtcars) tbl_mtcars
# A tibble: 32 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4 3 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1 4 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1 5 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2 6 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1 7 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4 8 24.4 4 147. 62 3.69 3.19 20 1 0 4 2 9 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2 10 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4 # ℹ 22 more rows
Note don’t necessarily need to use head()
with tibbles, as they conveniently print a portion of the data.
We generally recommend using tibbles, but you are likely to run into lots of data frames with your work.
Most functions work for both so you don’t need to worry about it much!
It can be helpful to convert data frames to tibbles though just to see more about the data more easily. The tibble()
function helps us do that.
Note that this conversion can remove row names - which some data frames have. For example, mtcars
(part of R) has row names. In this case we would want to make the rownames a new column first before making into a tibble.
head(mtcars, n = 2)
mpg cyl disp hp drat wt qsec vs am gear carb Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
head(tibble(mtcars), n = 2)
# A tibble: 2 × 11 mpg cyl disp hp drat wt qsec vs am gear carb <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 21 6 160 110 3.9 2.88 17.0 0 1 4 4
There is a function that specifically helps you do that.
head(rownames_to_column(mtcars), n = 2)
rowname mpg cyl disp hp drat wt qsec vs am gear carb 1 Mazda RX4 21 6 160 110 3.9 2.620 16.46 0 1 4 4 2 Mazda RX4 Wag 21 6 160 110 3.9 2.875 17.02 0 1 4 4
head(tibble(rownames_to_column(mtcars)), n = 2)
# A tibble: 2 × 12 rowname mpg cyl disp hp drat wt qsec vs am gear carb <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4 2 Mazda RX4 W… 21 6 160 110 3.9 2.88 17.0 0 1 4 4
Let’s stick with the tibble ER data for our next lesson
head(ER)
# A tibble: 6 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female 3 Adams 6.22 3.37 9.93 14 2013 Female 4 Adams NA NA NA NA 2014 Female 5 Adams NA NA NA NA 2015 Female 6 Adams 6.16 3.35 9.82 14 2016 Female
rename
function“Artwork by @allison_horst”. https://allisonhorst.com/
colnames()
function (or names()
)colnames(ER)
[1] "county" "rate" "lower95cl" "upper95cl" "visits" "year" [7] "gender"
To rename columns in dplyr
, you can use the rename
function.
For example, let’s rename lower95CI to lower_95_CI_limit. Notice the new name is listed first, similar to how a new object is assigned on the left!
# general format! not code! {data you are creating or changing} <- rename({data you are using}, {New Name} = {Old name})
renamed_ER<- rename(ER, lower_95_CI_limit = lower95cl) head(renamed_ER)
# A tibble: 6 × 7 county rate lower_95_CI_limit upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female 3 Adams 6.22 3.37 9.93 14 2013 Female 4 Adams NA NA NA NA 2014 Female 5 Adams NA NA NA NA 2015 Female 6 Adams 6.16 3.35 9.82 14 2016 Female
When you can, avoid spaces, special punctuation, or numbers in column names, as these require special treatment to refer to them.
See https://daseh.org/resources/quotes_vs_backticks.html for more guidance.
# this will cause an error renamed_ER <- rename(ER, lower_95%_CI_limit = lower95cl)
# this will work renamed_ER <- rename(ER, `lower_95%_CI_limit` = lower95cl) head(renamed_ER, 2)
# A tibble: 2 × 7 county rate `lower_95%_CI_limit` upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female
It’s best to avoid unusual column names where possible, as things get tricky later.
We just showed the use of ` backticks` . You may see people use quotes as well.
Other atypical column names are those with:
Rename tricky column names so that you don’t have to deal with them later!
Curly quotes will not work!
# this will cause an error! renamed_ER <- rename(ER, ‘lower_95%_CI_limit’ = lower95cl)
# this will work! renamed_ER <- rename(ER, 'lower_95%_CI_limit' = lower95cl)
Also true for double quotes
# this will cause an error! renamed_ER <- rename(ER, “lower_95%_CI_limit” = lower95cl)
# this will work! renamed_ER <- rename(ER, "lower_95%_CI_limit" = lower95cl)
A comma can separate different column names to change.
renamed_ER <- rename(ER, lower_95perc_CI_limit = lower95cl, upper_95perc_CI_limit = upper95cl) head(renamed_ER, 3)
# A tibble: 3 × 7 county rate lower_95perc_CI_limit upper_95perc_CI_limit visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female 3 Adams 6.22 3.37 9.93 14 2013 Female
To rename all columns you use the rename_with()
. In this case we will use toupper()
to make all letters upper case. Could also use tolower()
function.
ER_upper <- rename_with(ER, toupper) head(ER_upper, 3)
# A tibble: 3 × 7 COUNTY RATE LOWER95CL UPPER95CL VISITS YEAR GENDER <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female 3 Adams 6.22 3.37 9.93 14 2013 Female
rename_with(ER_upper, tolower)
# A tibble: 240 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Adams 7.60 4.38 11.7 17 2011 Female 2 Adams NA NA NA NA 2012 Female 3 Adams 6.22 3.37 9.93 14 2013 Female 4 Adams NA NA NA NA 2014 Female 5 Adams NA NA NA NA 2015 Female 6 Adams 6.16 3.35 9.82 14 2016 Female 7 Adams 4.69 2.31 7.88 11 2017 Female 8 Adams 6.39 3.62 9.93 16 2018 Female 9 Adams 6.64 3.85 10.2 17 2019 Female 10 Adams NA NA NA NA 2020 Female # ℹ 230 more rows
If you need to do lots of naming fixes - look into the janitor package!
#install.packages("janitor") library(janitor)
clean_names
The clean_names
function can intuit what fixes you might need. Here it make sure year names aren’t just a number, so that the colnames don’t need ticks or quotes to be used.
#library(dasehr) CO2 <- dasehr::yearly_co2_emissions # or this: CO2 <- read_csv("https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
Rows: 192 Columns: 265 ── Column specification ──────────────────────────────────────────────────────── Delimiter: "," chr (1): country dbl (264): 1751, 1752, 1753, 1754, 1755, 1756, 1757, 1758, 1759, 1760, 1761,... ℹ Use `spec()` to retrieve the full column specification for this data. ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(CO2, n = 2)
# A tibble: 2 × 265 country `1751` `1752` `1753` `1754` `1755` `1756` `1757` `1758` `1759` `1760` <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Afghani… NA NA NA NA NA NA NA NA NA NA 2 Albania NA NA NA NA NA NA NA NA NA NA # ℹ 254 more variables: `1761` <dbl>, `1762` <dbl>, `1763` <dbl>, `1764` <dbl>, # `1765` <dbl>, `1766` <dbl>, `1767` <dbl>, `1768` <dbl>, `1769` <dbl>, # `1770` <dbl>, `1771` <dbl>, `1772` <dbl>, `1773` <dbl>, `1774` <dbl>, # `1775` <dbl>, `1776` <dbl>, `1777` <dbl>, `1778` <dbl>, `1779` <dbl>, # `1780` <dbl>, `1781` <dbl>, `1782` <dbl>, `1783` <dbl>, `1784` <dbl>, # `1785` <dbl>, `1786` <dbl>, `1787` <dbl>, `1788` <dbl>, `1789` <dbl>, # `1790` <dbl>, `1791` <dbl>, `1792` <dbl>, `1793` <dbl>, `1794` <dbl>, …
clean_names(CO2)
# A tibble: 192 × 265 country x1751 x1752 x1753 x1754 x1755 x1756 x1757 x1758 x1759 x1760 x1761 <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Afghanistan NA NA NA NA NA NA NA NA NA NA NA 2 Albania NA NA NA NA NA NA NA NA NA NA NA 3 Algeria NA NA NA NA NA NA NA NA NA NA NA 4 Andorra NA NA NA NA NA NA NA NA NA NA NA 5 Angola NA NA NA NA NA NA NA NA NA NA NA 6 Antigua an… NA NA NA NA NA NA NA NA NA NA NA 7 Argentina NA NA NA NA NA NA NA NA NA NA NA 8 Armenia NA NA NA NA NA NA NA NA NA NA NA 9 Australia NA NA NA NA NA NA NA NA NA NA NA 10 Austria NA NA NA NA NA NA NA NA NA NA NA # ℹ 182 more rows # ℹ 253 more variables: x1762 <dbl>, x1763 <dbl>, x1764 <dbl>, x1765 <dbl>, # x1766 <dbl>, x1767 <dbl>, x1768 <dbl>, x1769 <dbl>, x1770 <dbl>, # x1771 <dbl>, x1772 <dbl>, x1773 <dbl>, x1774 <dbl>, x1775 <dbl>, # x1776 <dbl>, x1777 <dbl>, x1778 <dbl>, x1779 <dbl>, x1780 <dbl>, # x1781 <dbl>, x1782 <dbl>, x1783 <dbl>, x1784 <dbl>, x1785 <dbl>, # x1786 <dbl>, x1787 <dbl>, x1788 <dbl>, x1789 <dbl>, x1790 <dbl>, …
clean_names
can also get rid of spaces and replace them with _
.
test <- tibble(`col 1` = c(1,2,3), `col 2` = c(2,3,4)) test
# A tibble: 3 × 2 `col 1` `col 2` <dbl> <dbl> 1 1 2 2 2 3 3 3 4
clean_names(test)
# A tibble: 3 × 2 col_1 col_2 <dbl> <dbl> 1 1 2 2 2 3 3 3 4
tidyverse
versiontibble()
rownames_to_column
before converting to tibblerename()
function of dplyr
can help you rename columnsjanitor
if you need to make lots of column name changes🏠 Class Website
💻 Lab
This time lets also make it a smaller subset so it is easier for us to see the full dataset as we work through examples.
#library(dasehr) #ER <- CO_heat_ER_bygender #or this: #read_csv("https://daseh.org/data/Colorado_ER_heat_visits_by_county_gender.csv") set.seed(1234) ER_30 <-slice_sample(ER, n = 30)
tidyverse
way:To grab (or “pull” out) the year
column the tidyverse
way we can use the pull
function:
pull(ER_30, year)
[1] 2014 2018 2016 2015 2018 2013 2015 2011 2020 2022 2022 2012 2017 2020 2016 [16] 2020 2017 2012 2018 2012 2016 2014 2012 2014 2012 2015 2014 2018 2013 2013
The select
command from dplyr
allows you to subset (still a tibble
!)
select(ER_30, year)
# A tibble: 30 × 1 year <dbl> 1 2014 2 2018 3 2016 4 2015 5 2018 6 2013 7 2015 8 2011 9 2020 10 2022 # ℹ 20 more rows
We can use select
to select for multiple columns.
select(ER_30, year, rate, county)
# A tibble: 30 × 3 year rate county <dbl> <dbl> <chr> 1 2014 NA Arapahoe 2 2018 3.76 Denver 3 2016 8.13 Larimer 4 2015 3.94 El Paso 5 2018 11.1 Weld 6 2013 5.99 El Paso 7 2015 4.12 Jefferson 8 2011 5.76 Jefferson 9 2020 NA Larimer 10 2022 5.01 Jefferson # ℹ 20 more rows
The select
command from dplyr
allows you to subset columns matching patterns:
head(ER_30, 2)
# A tibble: 2 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female
select(ER_30, ends_with("cl"))
# A tibble: 30 × 2 lower95cl upper95cl <dbl> <dbl> 1 NA NA 2 1.97 6.12 3 4.32 13.1 4 2.01 6.52 5 6.59 16.9 6 3.32 8.66 7 2.02 6.94 8 3.12 9.20 9 NA NA 10 2.71 7.99 # ℹ 20 more rows
Here are a few:
last_col() starts_with() ends_with() contains()
Type tidyselect::
in the console and see what RStudio suggests:
head(ER_30, 2)
# A tibble: 2 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female
select(ER_30, ends_with("cl"), year)
# A tibble: 30 × 3 lower95cl upper95cl year <dbl> <dbl> <dbl> 1 NA NA 2014 2 1.97 6.12 2018 3 4.32 13.1 2016 4 2.01 6.52 2015 5 6.59 16.9 2018 6 3.32 8.66 2013 7 2.02 6.94 2015 8 3.12 9.20 2011 9 NA NA 2020 10 2.71 7.99 2022 # ℹ 20 more rows
Follows OR logic.
select(ER_30, ends_with("cl"), starts_with("r"))
# A tibble: 30 × 3 lower95cl upper95cl rate <dbl> <dbl> <dbl> 1 NA NA NA 2 1.97 6.12 3.76 3 4.32 13.1 8.13 4 2.01 6.52 3.94 5 6.59 16.9 11.1 6 3.32 8.66 5.99 7 2.02 6.94 4.12 8 3.12 9.20 5.76 9 NA NA NA 10 2.71 7.99 5.01 # ℹ 20 more rows
Need to combine the patterns with the c()
function.
select(ER_30, starts_with(c("r", "l")))
# A tibble: 30 × 2 rate lower95cl <dbl> <dbl> 1 NA NA 2 3.76 1.97 3 8.13 4.32 4 3.94 2.01 5 11.1 6.59 6 5.99 3.32 7 4.12 2.02 8 5.76 3.12 9 NA NA 10 5.01 2.71 # ℹ 20 more rows
where()
function can help select columns of a specific classis.character()
and is.numeric()
are often the most helpful
head(ER_30, 2)
# A tibble: 2 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female
select(ER_30, where(is.numeric))
# A tibble: 30 × 5 rate lower95cl upper95cl visits year <dbl> <dbl> <dbl> <dbl> <dbl> 1 NA NA NA NA 2014 2 3.76 1.97 6.12 13 2018 3 8.13 4.32 13.1 14 2016 4 3.94 2.01 6.52 12 2015 5 11.1 6.59 16.9 18 2018 6 5.99 3.32 8.66 20 2013 7 4.12 2.02 6.94 11 2015 8 5.76 3.12 9.20 14 2011 9 NA NA NA NA 2020 10 5.01 2.71 7.99 14 2022 # ℹ 20 more rows
filter
function“Artwork by @allison_horst”. https://allisonhorst.com/
The command in dplyr
for subsetting rows is filter
.
filter(ER_30, year > 2020)
# A tibble: 2 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Jefferson 5.01 2.71 7.99 14 2022 Male 2 Jefferson 4.56 2.43 7.35 14 2022 Female
You can have multiple logical conditions using the following:
==
: equals to!=
: not equal to (!
: not/negation)>
/ <
: greater than / less than>=
or <=
: greater than or equal to / less than or equal to&
: AND|
: ORIf you try to filter for a column that does not exist it will not work:
You can filter by two conditions using &
or commas (must meet both conditions):
filter(ER_30, rate > 9, year == 2012)
filter(ER_30, rate > 9 & year == 2012) # same result
# A tibble: 1 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Weld 12.8 6.96 20.3 15 2012 Male
If you want OR statements (meaning the data can meet either condition does not need to meet both), you need to use |
between conditions:
filter(ER_30, rate > 9 | year == 2012)
# A tibble: 9 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Weld 11.1 6.59 16.9 18 2018 Male 2 El Paso NA NA NA NA 2012 Female 3 Statewide 7.56 6.48 8.65 193 2012 Male 4 El Paso 9.07 5.74 12.4 31 2018 Male 5 Adams 5.48 2.90 8.88 13 2012 Male 6 Pueblo 18.5 9.91 29.7 14 2014 Male 7 Cheyenne 0 0 0 0 2012 Male 8 Weld 12.8 6.96 20.3 15 2012 Male 9 Statewide 10.1 8.95 11.3 293 2018 Male
The %in%
operator can be used find values from a pre-made list (using c()
) for a single column at a time.
filter(ER_30, county %in% c("Denver","Weld","Arapahoe"))
# A tibble: 8 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female 3 Weld 11.1 6.59 16.9 18 2018 Male 4 Denver 8.51 5.30 11.7 30 2016 Male 5 Denver 3.56 1.82 5.88 13 2017 Female 6 Weld 12.8 6.96 20.3 15 2012 Male 7 Arapahoe NA NA NA NA 2014 Male 8 Weld NA NA NA NA 2013 Female
filter(ER_30, county == "Denver"| county == "Weld"| county == "Arapahoe") #equivalent
# A tibble: 8 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female 3 Weld 11.1 6.59 16.9 18 2018 Male 4 Denver 8.51 5.30 11.7 30 2016 Male 5 Denver 3.56 1.82 5.88 13 2017 Female 6 Weld 12.8 6.96 20.3 15 2012 Male 7 Arapahoe NA NA NA NA 2014 Male 8 Weld NA NA NA NA 2013 Female
The %in%
operator can be used find values from a pre-made list (using c()
) for a single column at a time with different columns.
filter(ER_30, year %in% c(2013,2020), county %in% c("Denver","Weld"))
# A tibble: 1 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Weld NA NA NA NA 2013 Female
filter
This will not work the way you might expect! Best to stick with nothing but the column name if it is a typical name.
filter(ER_30, "year" > 2014)
# A tibble: 30 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female 3 Larimer 8.13 4.32 13.1 14 2016 Female 4 El Paso 3.94 2.01 6.52 12 2015 Female 5 Weld 11.1 6.59 16.9 18 2018 Male 6 El Paso 5.99 3.32 8.66 20 2013 Male 7 Jefferson 4.12 2.02 6.94 11 2015 Male 8 Jefferson 5.76 3.12 9.20 14 2011 Male 9 Larimer NA NA NA NA 2020 Male 10 Jefferson 5.01 2.71 7.99 14 2022 Male # ℹ 20 more rows
Atypical names are those with punctuation, spaces, start with a number, or are just a number.
ER_30_rename <- rename(ER_30, `year!` = year) filter(ER_30_rename, "year!" > 2013) # will not work correctly
# A tibble: 30 × 7 county rate lower95cl upper95cl visits `year!` gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female 3 Larimer 8.13 4.32 13.1 14 2016 Female 4 El Paso 3.94 2.01 6.52 12 2015 Female 5 Weld 11.1 6.59 16.9 18 2018 Male 6 El Paso 5.99 3.32 8.66 20 2013 Male 7 Jefferson 4.12 2.02 6.94 11 2015 Male 8 Jefferson 5.76 3.12 9.20 14 2011 Male 9 Larimer NA NA NA NA 2020 Male 10 Jefferson 5.01 2.71 7.99 14 2022 Male # ℹ 20 more rows
filter
Using backticks works!
filter(ER_30_rename, `year!` > 2013)
# A tibble: 21 × 7 county rate lower95cl upper95cl visits `year!` gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female 3 Larimer 8.13 4.32 13.1 14 2016 Female 4 El Paso 3.94 2.01 6.52 12 2015 Female 5 Weld 11.1 6.59 16.9 18 2018 Male 6 Jefferson 4.12 2.02 6.94 11 2015 Male 7 Larimer NA NA NA NA 2020 Male 8 Jefferson 5.01 2.71 7.99 14 2022 Male 9 Jefferson 4.56 2.43 7.35 14 2022 Female 10 El Paso NA NA NA NA 2017 Female # ℹ 11 more rows
filter
filter(ER_30, "county" == "Denver") # this will not work
# A tibble: 0 × 7 # ℹ 7 variables: county <chr>, rate <dbl>, lower95cl <dbl>, upper95cl <dbl>, # visits <dbl>, year <dbl>, gender <chr>
filter
filter(ER_30, county == "Denver")# this works!
# A tibble: 3 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Denver 3.76 1.97 6.12 13 2018 Female 2 Denver 8.51 5.30 11.7 30 2016 Male 3 Denver 3.56 1.82 5.88 13 2017 Female
filter()
is trickyTry not use anything special for the column names in filter()
. This is why it is good to not use atypical column names. Then you can just use the column name!
Did the filter work the way you expected? Did the dimensions change?
pull()
to get values out of a data frame/tibbleselect()
is the tidyverse
way to get a tibble with only certain columnsselect()
based on patterns in the column namesselect()
based on column class with the where()
functionselect(starts_with("C"), ends_with("state"))
c()
function like select(starts_with(c("A", "C")))
filter()
can be used to filter out rows based on logical conditionsfilter()
==
is the same as equivalent to&
means both conditions must be met to remain after filter()
|
means either conditions needs to be met to remain after filter()
🏠 Class Website
💻 Lab
#library(dasehr) #ER <- CO_heat_ER_bygender # or this: #read_csv("https://daseh.org/data/Colorado_ER_heat_visits_by_county_gender.csv") set.seed(1234) ER_30 <-slice_sample(ER, n = 30)
filter
and select
You can combine filter
and select
to subset the rows and columns, respectively, of a data frame:
select(filter(ER_30, year > 2012), county)
# A tibble: 24 × 1 county <chr> 1 Arapahoe 2 Denver 3 Larimer 4 El Paso 5 Weld 6 El Paso 7 Jefferson 8 Larimer 9 Jefferson 10 Jefferson # ℹ 14 more rows
In R
, the common way to perform multiple operations is to wrap functions around each other in a “nested” form.
head(select(ER_30, year, county), 2)
# A tibble: 2 × 2 year county <dbl> <chr> 1 2014 Arapahoe 2 2018 Denver
select(filter(ER_30, year > 2000 & county == "Denver"), year, rate)
# A tibble: 3 × 2 year rate <dbl> <dbl> 1 2018 3.76 2 2016 8.51 3 2017 3.56
One can also create temporary objects and reassign them:
ER_30_CO <- filter(ER_30, year > 2000 & county == "Denver") ER_30_CO <- select(ER_30_CO, year, rate) head(ER_30_CO)
# A tibble: 3 × 2 year rate <dbl> <dbl> 1 2018 3.76 2 2016 8.51 3 2017 3.56
pipe
(comes with dplyr
):The pipe %>%
makes this much more readable. It reads left side “pipes” into right side. RStudio CMD/Ctrl + Shift + M
shortcut. Pipe tb
into filter
, then pipe that into select
:
ER_30 %>% filter(year > 2000 & county == "Denver") %>% select(year, rate)
# A tibble: 3 × 2 year rate <dbl> <dbl> 1 2018 3.76 2 2016 8.51 3 2017 3.56
tidyverse
way)The mutate
function in dplyr
allows you to add or modify columns of a data frame.
# General format - Not the code! {data object to update} <- mutate({data to use}, {new variable name} = {new variable source})
ER_30 <- mutate(ER_30, newcol = rate * 2) head(ER_30, 4)
# A tibble: 4 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Denver 3.76 1.97 6.12 13 2018 Female 7.53 3 Larimer 8.13 4.32 13.1 14 2016 Female 16.3 4 El Paso 3.94 2.01 6.52 12 2015 Female 7.89
The mutate
function in dplyr
allows you to add or modify columns of a data frame.
# General format - Not the code! {data object to update} <- mutate({data to use}, {variable name to change} = {variable modification})
ER_30 <- mutate(ER_30, newcol = newcol / 2) head(ER_30, 4)
# A tibble: 4 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Denver 3.76 1.97 6.12 13 2018 Female 3.76 3 Larimer 8.13 4.32 13.1 14 2016 Female 8.13 4 El Paso 3.94 2.01 6.52 12 2015 Female 3.94
ER_30 <- ER_30 %>% mutate(newcol = newcol / 2) head(ER_30,4)
# A tibble: 4 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Denver 3.76 1.97 6.12 13 2018 Female 1.88 3 Larimer 8.13 4.32 13.1 14 2016 Female 4.07 4 El Paso 3.94 2.01 6.52 12 2015 Female 1.97
mutate
function“Artwork by @allison_horst”. https://allisonhorst.com/
The NULL
method is still very common.
The select
function can remove a column with exclamation mark (!
) our using the minus sign (-
):
select(ER_30, !newcol)
# A tibble: 6 × 7 county rate lower95cl upper95cl visits year gender <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA 2014 Female 2 Denver 3.76 1.97 6.12 13 2018 Female 3 Larimer 8.13 4.32 13.1 14 2016 Female 4 El Paso 3.94 2.01 6.52 12 2015 Female 5 Weld 11.1 6.59 16.9 18 2018 Male 6 El Paso 5.99 3.32 8.66 20 2013 Male
Or, you can simply select the columns you want to keep, ignoring the ones you want to remove.
You can use c()
to list the columns to remove.
Remove newcol
and year
:
select(ER_30, !c(newcol, year))
# A tibble: 30 × 6 county rate lower95cl upper95cl visits gender <chr> <dbl> <dbl> <dbl> <dbl> <chr> 1 Arapahoe NA NA NA NA Female 2 Denver 3.76 1.97 6.12 13 Female 3 Larimer 8.13 4.32 13.1 14 Female 4 El Paso 3.94 2.01 6.52 12 Female 5 Weld 11.1 6.59 16.9 18 Male 6 El Paso 5.99 3.32 8.66 20 Male 7 Jefferson 4.12 2.02 6.94 11 Male 8 Jefferson 5.76 3.12 9.20 14 Male 9 Larimer NA NA NA NA Male 10 Jefferson 5.01 2.71 7.99 14 Male # ℹ 20 more rows
The select
function can reorder columns.
head(ER_30, 2)
# A tibble: 2 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Denver 3.76 1.97 6.12 13 2018 Female 1.88
ER_30 %>% select(year, rate, county) %>% head(2)
# A tibble: 2 × 3 year rate county <dbl> <dbl> <chr> 1 2014 NA Arapahoe 2 2018 3.76 Denver
The select
function can reorder columns. Put newcol
first, then select the rest of columns:
select(ER_30, newcol, everything())
# A tibble: 3 × 8 newcol county rate lower95cl upper95cl visits year gender <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> 1 NA Arapahoe NA NA NA NA 2014 Female 2 1.88 Denver 3.76 1.97 6.12 13 2018 Female 3 4.07 Larimer 8.13 4.32 13.1 14 2016 Female
Put year
at the end (“remove, everything, then add back in”):
select(ER_30, !year, everything(), year)
# A tibble: 3 × 8 county rate lower95cl upper95cl visits gender newcol year <chr> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> <dbl> 1 Arapahoe NA NA NA NA Female NA 2014 2 Denver 3.76 1.97 6.12 13 Female 1.88 2018 3 Larimer 8.13 4.32 13.1 14 Female 4.07 2016
Using the base R order()
function.
order(colnames(ER_30))
[1] 1 7 3 8 2 4 5 6
ER_30 %>% select(order(colnames(ER_30)))
# A tibble: 30 × 8 county gender lower95cl newcol rate upper95cl visits year <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> 1 Arapahoe Female NA NA NA NA NA 2014 2 Denver Female 1.97 1.88 3.76 6.12 13 2018 3 Larimer Female 4.32 4.07 8.13 13.1 14 2016 4 El Paso Female 2.01 1.97 3.94 6.52 12 2015 5 Weld Male 6.59 5.57 11.1 16.9 18 2018 6 El Paso Male 3.32 3.00 5.99 8.66 20 2013 7 Jefferson Male 2.02 2.06 4.12 6.94 11 2015 8 Jefferson Male 3.12 2.88 5.76 9.20 14 2011 9 Larimer Male NA NA NA NA NA 2020 10 Jefferson Male 2.71 2.50 5.01 7.99 14 2022 # ℹ 20 more rows
In addition to select
we can also use the relocate()
function of dplyr to rearrange the columns for more complicated moves.
For example, let say we just wanted year
to be before BUYER_STATE
.
head(ER_30, 1)
# A tibble: 1 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA
tb_carb <- relocate(ER_30, year, .before = rate) head(tb_carb, 1)
# A tibble: 1 × 8 county year rate lower95cl upper95cl visits gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe 2014 NA NA NA NA Female NA
The arrange
function can reorder rows By default, arrange
orders in increasing order:
arrange(ER_30, year)
# A tibble: 30 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Jefferson 5.76 3.12 9.20 14 2011 Male 2.88 2 El Paso NA NA NA NA 2012 Female NA 3 Statewide 7.56 6.48 8.65 193 2012 Male 3.78 4 Adams 5.48 2.90 8.88 13 2012 Male 2.74 5 Cheyenne 0 0 0 0 2012 Male 0 6 Weld 12.8 6.96 20.3 15 2012 Male 6.38 7 El Paso 5.99 3.32 8.66 20 2013 Male 3.00 8 Weld NA NA NA NA 2013 Female NA 9 Statewide 4.94 4.06 5.82 124 2013 Female 2.47 10 Arapahoe NA NA NA NA 2014 Female NA # ℹ 20 more rows
Use the desc
to arrange the rows in descending order:
arrange(ER_30, desc(year))
# A tibble: 30 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Jefferson 5.01 2.71 7.99 14 2022 Male 2.50 2 Jefferson 4.56 2.43 7.35 14 2022 Female 2.28 3 Larimer NA NA NA NA 2020 Male NA 4 Statewide 6.60 5.66 7.53 197 2020 Male 3.30 5 Cheyenne 0 0 0 0 2020 Male 0 6 Denver 3.76 1.97 6.12 13 2018 Female 1.88 7 Weld 11.1 6.59 16.9 18 2018 Male 5.57 8 El Paso 9.07 5.74 12.4 31 2018 Male 4.54 9 Statewide 10.1 8.95 11.3 293 2018 Male 5.06 10 El Paso NA NA NA NA 2017 Female NA # ℹ 20 more rows
You can combine increasing and decreasing orderings:
arrange(ER_30, rate, desc(year)) %>% head(n = 2)
# A tibble: 2 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Cheyenne 0 0 0 0 2020 Male 0 2 Cheyenne 0 0 0 0 2012 Male 0
arrange(ER_30, desc(year), rate) %>% head(n = 2)
# A tibble: 2 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Jefferson 4.56 2.43 7.35 14 2022 Female 2.28 2 Jefferson 5.01 2.71 7.99 14 2022 Male 2.50
select()
and filter()
can be combined together()
%>%
(like “then” statements)select()
and relocate()
can be used to reorder columnsarrange()
can be used to reorder rowsfilter()
select()
with exclamation mark in front of column name(s)mutate()
can be used to create new variables or modify them# General format - Not the code! {data object to update} <- mutate({data to use}, {new variable name} = {new variable source})
ER_30 <- mutate(ER_30, newcol = count/2.2)
The $
operator is similar to pull()
. This is the base R way to do this:
ER_30$year
[1] 2014 2018 2016 2015 2018 2013 2015 2011 2020 2022 2022 2012 2017 2020 2016 [16] 2020 2017 2012 2018 2012 2016 2014 2012 2014 2012 2015 2014 2018 2013 2013
Although it is easier (for this one task), mixing and matching the $
operator with tidyverse functions usually doesn’t work. Therefore, we want to let you know about it in case you see it, but we suggest that you try working with the tidyverse way.
You can add a new column (or modify an existing one) using the $
operator instead of mutate
.
Just want you to be aware of this as it is very common.
ER_30$newcol <- ER_30$rate/2.2 head(ER_30,3)
# A tibble: 3 × 8 county rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Denver 3.76 1.97 6.12 13 2018 Female 1.71 3 Larimer 8.13 4.32 13.1 14 2016 Female 3.70
Even though $
is easier for creating new columns, mutate
is really powerful, so it’s worth getting used to.
💻 Lab
Image by Gerd Altmann from Pixabay
which()
functionInstead of removing rows like filter, which()
simply shows where the values occur if they pass a specific condition. We will see that this can be helpful later when we want to select and filter in more complicated ways.
which(select(ER_30, year) == 2014)
[1] 1 22 24 27
select(ER_30, year) == 2014 %>% head(10)
year [1,] TRUE [2,] FALSE [3,] FALSE [4,] FALSE [5,] FALSE [6,] FALSE [7,] FALSE [8,] FALSE [9,] FALSE [10,] FALSE [11,] FALSE [12,] FALSE [13,] FALSE [14,] FALSE [15,] FALSE [16,] FALSE [17,] FALSE [18,] FALSE [19,] FALSE [20,] FALSE [21,] FALSE [22,] TRUE [23,] FALSE [24,] TRUE [25,] FALSE [26,] FALSE [27,] TRUE [28,] FALSE [29,] FALSE [30,] FALSE
ER_30$year <- NULL
We can use the colnames
function to extract and/or directly reassign column names of df
:
colnames(ER_30) # just prints
[1] "county" "rate" "lower95cl" "upper95cl" "visits" "year" [7] "gender" "newcol"
colnames(ER_30)[1:2] <- c("County", "Rate") # reassigns head(ER_30)
# A tibble: 6 × 8 County Rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Denver 3.76 1.97 6.12 13 2018 Female 1.71 3 Larimer 8.13 4.32 13.1 14 2016 Female 3.70 4 El Paso 3.94 2.01 6.52 12 2015 Female 1.79 5 Weld 11.1 6.59 16.9 18 2018 Male 5.07 6 El Paso 5.99 3.32 8.66 20 2013 Male 2.72
Let’s select rows 1 and 3 from df
using brackets:
ER_30[ c(1, 3), ]
# A tibble: 2 × 8 County Rate lower95cl upper95cl visits year gender newcol <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <dbl> 1 Arapahoe NA NA NA NA 2014 Female NA 2 Larimer 8.13 4.32 13.1 14 2016 Female 3.70
We can also subset a data frame using the bracket [, ]
subsetting.
For data frames and matrices (2-dimensional objects), the brackets are [rows, columns]
subsetting. We can grab the x
column using the index of the column or the column name (“year
”)
ER_30[, 6]
# A tibble: 30 × 1 year <dbl> 1 2014 2 2018 3 2016 4 2015 5 2018 6 2013 7 2015 8 2011 9 2020 10 2022 # ℹ 20 more rows
ER_30[, "year"]
# A tibble: 30 × 1 year <dbl> 1 2014 2 2018 3 2016 4 2015 5 2018 6 2013 7 2015 8 2011 9 2020 10 2022 # ℹ 20 more rows
We can select multiple columns using multiple column names:
ER_30[, c("County", "Rate")]
# A tibble: 30 × 2 County Rate <chr> <dbl> 1 Arapahoe NA 2 Denver 3.76 3 Larimer 8.13 4 El Paso 3.94 5 Weld 11.1 6 El Paso 5.99 7 Jefferson 4.12 8 Jefferson 5.76 9 Larimer NA 10 Jefferson 5.01 # ℹ 20 more rows