```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
```
## Recap
- `select()`: subset and/or reorder columns
- `filter()`: keep or remove rows based on a condition
- `arrange()`: reorder rows
- `mutate()`: create new columns or modify them
- `select()` and `filter()` can be combined
- remove a column: `select()` with `!` (`!col_name`)
- you can chain sequential steps, especially using the pipe `%>%`
📃[Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf)
## Another Cheatsheet
https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf
```{r, fig.alt="A preview of the Data transformation cheatsheet produced by RStudio.", out.width = "80%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("images/Manip_cheatsheet.png")
```
## Data Summarization
* Basic statistical summarization
* `mean(x)`: takes the mean of x
* `sd(x)`: takes the standard deviation of x
* `median(x)`: takes the median of x
* `quantile(x)`: displays sample quantiles of x. Default is min, 25th percentile, median, 75th percentile, max
* `range(x)`: displays the range. Same as `c(min(x), max(x))`
* `sum(x)`: sum of x
* `max(x)`: maximum value in x
* `min(x)`: minimum value in x
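
As a minimal sketch of the functions listed above, here they are applied to a small made-up vector `x` (an assumption for illustration, not part of the course data):

```{r}
# Hypothetical toy vector, not taken from the course data
x <- c(2, 4, 4, 6, 9)

mean(x)      # arithmetic mean
median(x)    # middle value
sd(x)        # sample standard deviation
range(x)     # same as c(min(x), max(x))
quantile(x)  # min, 25%, 50%, 75%, max
```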
## Some examples
We can use the `CO_heat_ER` object from the `dasehr` package to explore different ways of summarizing data. (This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.) The `head` command displays the first rows of an object:
```{r}
library(dasehr)
head(CO_heat_ER)
```
## Behavior of `pull()` function
`pull()` converts a single data column into a vector. This allows you to run summary functions.
```{r, eval=FALSE}
CO_heat_ER %>% pull(visits)
```
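
To see why `pull()` matters, the sketch below compares it with `select()` on a made-up tibble (`toy` and its values are hypothetical, not the real `CO_heat_ER` data). `select()` keeps a one-column tibble, while `pull()` returns a plain vector that summary functions can use directly:

```{r}
library(dplyr)

# Hypothetical toy data, not the real CO_heat_ER table
toy <- tibble(county = c("A", "B", "C"), visits = c(10, 25, 40))

toy_col <- toy %>% select(visits)  # still a tibble with one column
toy_vec <- toy %>% pull(visits)    # a plain numeric vector

class(toy_col)
class(toy_vec)
mean(toy_vec)  # summary functions work directly on the vector
```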
## Statistical summarization the "tidy" way
**Add the** `na.rm =` **argument for missing data**
```{r}
CO_heat_ER %>% pull(visits) %>% mean()
CO_heat_ER %>% pull(visits) %>% mean(na.rm = TRUE)
```
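
The effect of `na.rm` is easiest to see on a tiny vector. This sketch assumes a made-up vector `visits_toy` with one missing value:

```{r}
# Hypothetical vector with one missing value
visits_toy <- c(3, NA, 9)

mean(visits_toy)                # NA: one missing value makes the mean missing
mean(visits_toy, na.rm = TRUE)  # 6: the NA is dropped before averaging
```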
# Summarization on tibbles (data frames)
## Summarize the data: `dplyr` `summarize()` function
`summarize` creates a summary table.
Multiple summary statistics can be calculated at once (unlike `pull()` which can only do a single calculation on one column).
```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
  summarize({summary column name} = {function(source column)},
            {summary column name} = {function(source column)})
```
## Summarize the data: `dplyr` `summarize()` function
```{r}
CO_heat_ER %>%
  summarize(mean_visits = mean(visits))

CO_heat_ER %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE))
```
## Summarize the data: `dplyr` `summarize()` function
`summarize()` can do multiple operations at once. Just separate by a comma.
```{r}
CO_heat_ER %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE),
            median_visits = median(visits, na.rm = TRUE),
            mean_rate = mean(rate, na.rm = TRUE))
```
## Summarize the data: `dplyr` `summarize()` function
Note that `summarize()` creates a separate tibble from the original data.
If you want to save a summary statistic in the original data, use `mutate()` instead to create a new column for the summary statistic.
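
The difference between the two can be sketched on a made-up tibble (`toy` and its values are hypothetical, used only for illustration):

```{r}
library(dplyr)

# Hypothetical toy data, not the real CO_heat_ER table
toy <- tibble(gender = c("F", "F", "M", "M"), visits = c(10, 20, 30, NA))

# summarize() collapses the data to a one-row tibble
toy_summary <- toy %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE))

# mutate() keeps every row and repeats the statistic in a new column
toy_mutated <- toy %>%
  mutate(mean_visits = mean(visits, na.rm = TRUE))

toy_summary
toy_mutated
```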
## `summary()` Function
`summary()` gives a rough snapshot of each column: quantiles and the mean for numeric columns, and only the length and class for character columns:
```{r}
summary(CO_heat_ER)
```
## Summary & Lab Part 1
- summary stats (`mean()`) work with `pull()`
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest
🏠 [Class Website](https://daseh.org/)
💻 [Lab](https://daseh.org/modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd)
## `distinct()` values
`distinct(x)` will return the unique elements of column `x`.
```{r, message = FALSE}
CO_heat_ER %>%
  distinct(gender)
```
## How many `distinct()` values?
`n_distinct()` tells you the number of unique elements. _It needs a vector, so pull the column first!_
```{r}
CO_heat_ER %>%
  pull(gender) %>%
  n_distinct()
```
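
`n_distinct()` can also be given a column inside `summarize()`, which supplies the column as a vector. A minimal sketch, assuming a made-up tibble `toy`:

```{r}
library(dplyr)

# Hypothetical toy data with two distinct gender values
toy <- tibble(gender = c("F", "M", "F", "M", "F"))

# Option 1: pull the column, then count distinct values
n_from_pull <- toy %>%
  pull(gender) %>%
  n_distinct()

# Option 2: call n_distinct() on the column inside summarize()
n_from_summarize <- toy %>%
  summarize(n_gender = n_distinct(gender))

n_from_pull
n_from_summarize
```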
```{r echo=FALSE}
options(max.print = 1000)
```
## `dplyr`: `count`
Use `count` to return row count by category.
```{r, message = FALSE}
CO_heat_ER %>% count(gender)
```
## `dplyr`: `count`
Listing multiple columns further subdivides the count.
```{r, message = FALSE}
CO_heat_ER %>% count(county, gender)
```
# Grouping
## Perform Operations By Groups: dplyr
`group_by` allows you to group the data set by the variables/columns you specify:
```{r}
CO_heat_ER_grouped <- CO_heat_ER %>% group_by(gender)
CO_heat_ER_grouped
```
## Summarize the grouped data
It's grouped! Grouping doesn't change the data itself, only how **functions operate on it**. Now we can summarize `visits` by group:
```{r}
CO_heat_ER_grouped %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
```
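
To check that grouping leaves the rows untouched, the sketch below uses a made-up tibble `toy` (hypothetical values) and the dplyr helpers `is_grouped_df()` and `group_vars()`:

```{r}
library(dplyr)

# Hypothetical toy data, not the real CO_heat_ER table
toy <- tibble(gender = c("F", "F", "M"), visits = c(10, 20, 30))
toy_grouped <- toy %>% group_by(gender)

nrow(toy_grouped) == nrow(toy)  # grouping keeps every row
is_grouped_df(toy_grouped)      # the tibble now carries grouping information
group_vars(toy_grouped)         # which columns define the groups

# Only when a function such as summarize() runs does the grouping matter
toy_avg <- toy_grouped %>%
  summarize(avg_visits = mean(visits))
toy_avg
```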
## Use the `pipe` to string these together!
Pipe `CO_heat_ER` into `group_by`, then pipe that into `summarize`:
```{r}
CO_heat_ER %>%
  group_by(gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
```
## Group by as many variables as you want
`group_by` gender and year:
```{r, warning = FALSE}
CO_heat_ER %>%
  group_by(year, gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
```
## Counting
Other functions are useful here too, such as `n()`, which counts the number of observations (NAs included).
```{r}
CO_heat_ER %>%
  group_by(gender) %>%
  summarize(n = n(),
            mean = mean(visits, na.rm = TRUE))
```
## Counting{.codesmall}
`count()` and `n()` can give very similar information.
```{r}
CO_heat_ER %>% count(gender)
CO_heat_ER %>% group_by(gender) %>% summarize(n()) # n() typically used with summarize
```
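
A quick way to convince yourself that the two approaches agree is to compare them on a made-up tibble (`toy` is hypothetical). Note that the row with a missing `visits` value is still counted:

```{r}
library(dplyr)

# Hypothetical toy data; one visits value is missing on purpose
toy <- tibble(gender = c("F", "M", "F", "F"), visits = c(1, NA, 3, 4))

by_count <- toy %>% count(gender)
by_n <- toy %>%
  group_by(gender) %>%
  summarize(n = n())

by_count
by_n
```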
# A few miscellaneous topics ..
## Base R functions you might see: `length` and `unique`
These functions require a column as a vector using `pull()`.
```{r, message = FALSE}
CO_heat_ER_gen <- CO_heat_ER %>% pull(gender) # pull() to make a vector
CO_heat_ER_gen %>% unique() # similar to distinct()
```
## Base R functions you might see: `length` and `unique`
These functions require a column as a vector using `pull()`.
```{r, message = FALSE}
CO_heat_ER_gen %>% unique() %>% length() # similar to n_distinct()
```
## *New!* Many dplyr functions now have a `.by=` argument
Pipe `CO_heat_ER` into `group_by`, then pipe that into `summarize`:
```{r eval = FALSE}
CO_heat_ER %>%
  group_by(gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE),
            max_visits = max(visits, na.rm = TRUE))
```
is the same as:
```{r eval = FALSE}
CO_heat_ER %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE),
            max_visits = max(visits, na.rm = TRUE),
            .by = gender)
```
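
A small sketch of the equivalence, assuming dplyr 1.1.0 or later (the release that introduced `.by`) and a made-up tibble `toy`. One difference worth knowing: `group_by()` sorts the groups, while `.by` keeps them in order of first appearance, so row order can differ on other data.

```{r}
library(dplyr)  # assumes dplyr >= 1.1.0 for the .by argument

# Hypothetical toy data, not the real CO_heat_ER table
toy <- tibble(gender = c("F", "M", "F", "M"), visits = c(10, 30, 20, NA))

with_group_by <- toy %>%
  group_by(gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))

with_by <- toy %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE),
            .by = gender)

with_group_by
with_by
```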
## `summary()` vs. `summarize()`
* `summary()` (base R) gives statistics table on a dataset.
* `summarize()` (dplyr) creates a more customized summary tibble/dataframe.
## Summary & Lab Part 2
- `count(x)`: how many rows are there for each unique value?
- `distinct()`: what are the distinct values?
- `n_distinct()` with `pull()`: how many distinct values?
- `group_by()`: changes how subsequent functions operate
- combine with `summarize()` to get statistics per group
- combine with `mutate()` to add a column
- `summarize()` with `n()` gives the count (NAs included)
🏠 [Class Website](https://daseh.org/)
💻 [Lab](https://daseh.org/modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd)
```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```
Image by Gerd Altmann from Pixabay
# Extra Slides: More advanced summarization
## Data Summarization on data frames
* Statistical summarization across the data frame
* `rowMeans(x)`: takes the means of each row of x
* `colMeans(x)`: takes the means of each column of x
* `rowSums(x)`: takes the sum of each row of x
* `colSums(x)`: takes the sum of each column of x
```{r}
yearly_co2 <- yearly_co2_emissions
```
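
As a minimal sketch of these four functions, here they are on a small made-up data frame (`toy_wide` and its column names are hypothetical, not the real emissions data):

```{r}
# Hypothetical toy data frame: two rows, two year columns, one missing value
toy_wide <- data.frame(y2010 = c(1, 3), y2011 = c(5, NA))

rowMeans(toy_wide, na.rm = TRUE)  # one mean per row
colMeans(toy_wide, na.rm = TRUE)  # one mean per column
rowSums(toy_wide, na.rm = TRUE)   # one sum per row
colSums(toy_wide, na.rm = TRUE)   # one sum per column
```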
## `rowMeans()` example
Get means for each row.
Let's see what the mean CO2 emissions are across years for each row (country):
```{r}
yearly_co2 %>%
  select(starts_with("201")) %>%
  rowMeans(na.rm = TRUE) %>%
  head(n = 5)

yearly_co2 %>%
  group_by(country) %>%
  summarize(mean = rowMeans(across(starts_with("201")), na.rm = TRUE)) %>%
  head(n = 5)
```
## `colMeans()` example
Get means for each column.
Let's see what the mean is across each column (year):
```{r}
yearly_co2 %>%
  select(starts_with("201")) %>%
  colMeans(na.rm = TRUE) %>%
  head(n = 5)

yearly_co2 %>%
  summarize(across(starts_with("201"), ~mean(.x, na.rm = TRUE)))
```