```{r, echo = FALSE, message=FALSE, error = FALSE}
library(knitr)
opts_chunk$set(comment = "", message = FALSE)
suppressWarnings({library(dplyr)})
library(readr)
library(tidyverse)
```

## Recap

- `select()`: subset and/or reorder columns
- `filter()`: remove rows that do not meet a condition
- `arrange()`: reorder rows
- `mutate()`: create new columns or modify existing ones
- `select()` and `filter()` can be combined
- remove a column: `select()` with a `!` mark (`!col_name`)
- you can do sequential steps, especially using pipes `%>%`

📃[Cheatsheet](https://daseh.org/modules/cheatsheets/Day-3.pdf)

## Another Cheatsheet

https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-transformation.pdf

```{r, fig.alt="A preview of the Data transformation cheatsheet produced by RStudio.", out.width = "80%", echo = FALSE, fig.align = "center"}
knitr::include_graphics("images/Manip_cheatsheet.png")
```

## Data Summarization

* Basic statistical summarization
    * `mean(x)`: takes the mean of x
    * `sd(x)`: takes the standard deviation of x
    * `median(x)`: takes the median of x
    * `quantile(x)`: displays sample quantiles of x. Default is min, quartiles (25%, median, 75%), max
    * `range(x)`: displays the range. Same as `c(min(x), max(x))`
    * `sum(x)`: sum of x
    * `max(x)`: maximum value in x
    * `min(x)`: minimum value in x

## Some examples

We can use the `CO_heat_ER` object from the `dasehr` package to explore different ways of summarizing data. (This dataset contains information about the number and rate of visits for heat-related illness to ERs in Colorado from 2011-2022, adjusted for age.)

The `head` command displays the first rows of an object (six by default):

```{r}
library(dasehr)
head(CO_heat_ER)
```

## Behavior of `pull()` function

`pull()` converts a single data column into a vector. This allows you to run summary functions on it.

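
To make the difference between a column kept as a tibble and a column pulled out as a vector concrete, here is a minimal sketch on a made-up tibble. The `toy` object and its values are hypothetical and are not part of `dasehr`.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(county = c("A", "B", "C", "D"),
              visits = c(10, 20, NA, 40))

# select() keeps a one-column tibble; pull() returns a plain vector
toy %>% select(visits) %>% class()
toy %>% pull(visits) %>% class()

# Summary functions operate on the pulled vector
toy_visits <- toy %>% pull(visits)
mean(toy_visits)                    # NA, because one value is missing
mean(toy_visits, na.rm = TRUE)      # mean of the three non-missing values
range(toy_visits, na.rm = TRUE)     # same as c(min, max)
quantile(toy_visits, na.rm = TRUE)  # min, quartiles, max
```

The same pattern applies to the course data:
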
```{r, eval=FALSE}
CO_heat_ER %>% pull(visits)
```

## Statistical summarization the "tidy" way

**Add the** `na.rm =` **argument for missing data**

```{r}
CO_heat_ER %>% pull(visits) %>% mean()
CO_heat_ER %>% pull(visits) %>% mean(na.rm = TRUE)
```

# Summarization on tibbles (data frames)

## Summarize the data: `dplyr` `summarize()` function

`summarize()` creates a summary table. It can calculate multiple summary statistics at once (unlike the `pull()` approach, which can only do a single calculation on one column).

```{r, eval = FALSE}
# General format - Not the code!
{data to use} %>%
  summarize({summary column name} = {function(source column)},
            {summary column name} = {function(source column)})
```

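
As a runnable instance of the general format, here is a minimal sketch on a made-up tibble. The `toy` object, its values, and the summary column names are hypothetical choices for illustration.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(visits = c(10, 20, NA, 40),
              rate   = c(1.5, 2.5, 3.5, NA))

# {data to use} %>% summarize({summary column name} = {function(source column)}, ...)
toy_summary <- toy %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE),
            max_rate    = max(rate, na.rm = TRUE))

# The result is a new one-row tibble with one column per summary statistic
toy_summary
```
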
## Summarize the data: `dplyr` `summarize()` function

```{r}
CO_heat_ER %>%
  summarize(mean_visits = mean(visits))
CO_heat_ER %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE))
```

## Summarize the data: `dplyr` `summarize()` function

`summarize()` can do multiple operations at once. Just separate them with commas.

```{r}
CO_heat_ER %>%
  summarize(mean_visits = mean(visits, na.rm = TRUE),
            median_visits = median(visits, na.rm = TRUE),
            mean_rate = mean(rate, na.rm = TRUE))
```

## Summarize the data: `dplyr` `summarize()` function

Note that `summarize()` creates a separate tibble from the original data.

If you want to save a summary statistic in the original data, use `mutate()` instead to create a new column for the summary statistic.

## `summary()` Function

Using `summary()` can give you a rough snapshot of each numeric column (for character columns it reports only the length, class, and mode):

```{r}
summary(CO_heat_ER)
```

## Summary & Lab Part 1

- summary stats (`mean()`) work with `pull()`
- don't forget the `na.rm = TRUE` argument!
- `summary(x)`: quantile information
- `summarize`: creates a summary table of columns of interest

🏠 [Class Website](https://daseh.org/)

💻 [Lab](https://daseh.org/modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd)

## `distinct()` values

`distinct(x)` will return the unique elements of column `x`.

```{r, message = FALSE}
CO_heat_ER %>%
  distinct(gender)
```

## How many `distinct()` values?

`n_distinct()` tells you the number of unique elements. _Must pull the column first!_

```{r}
CO_heat_ER %>%
  pull(gender) %>%
  n_distinct()
```

```{r echo=FALSE}
options(max.print = 1000)
```

## `dplyr`: `count`

Use `count()` to return the row count for each category.

```{r, message = FALSE}
CO_heat_ER %>% count(gender)
```

## `dplyr`: `count`

Listing multiple columns further subdivides the count.

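
To show how adding a second column subdivides the count, here is a minimal sketch on a made-up tibble. The `toy` object and its values are hypothetical, not taken from `dasehr`.

```r
library(dplyr)

# Hypothetical toy data, made up for illustration only
toy <- tibble(county = c("A", "A", "A", "B", "B"),
              gender = c("Female", "Male", "Male", "Female", "Female"))

# One row per county
toy %>% count(county)

# One row per county-gender combination that appears in the data
toy_counts <- toy %>% count(county, gender)
toy_counts
```

The counts in `n` still add up to the total number of rows. The course data works the same way:
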
```{r, message = FALSE}
CO_heat_ER %>% count(county, gender)
```

# Grouping

## Perform Operations By Groups: dplyr

`group_by` allows you to group the data set by the variables/columns you specify:

```{r}
CO_heat_ER_grouped <- CO_heat_ER %>% group_by(gender)
CO_heat_ER_grouped
```

## Summarize the grouped data

It's grouped! Grouping doesn't change the data itself, only how **functions operate on it**. Now we can summarize `visits` by group:

```{r}
CO_heat_ER_grouped %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
```

## Use the `pipe` to string these together!

Pipe `CO_heat_ER` into `group_by`, then pipe that into `summarize`:

```{r}
CO_heat_ER %>%
  group_by(gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
```

## Group by as many variables as you want

`group_by` year and gender:

```{r, warning = FALSE}
CO_heat_ER %>%
  group_by(year, gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE))
```

## Counting

There are other functions, such as `n()`, which counts the number of observations (NAs included).

```{r}
CO_heat_ER %>%
  group_by(gender) %>%
  summarize(n = n(),
            mean = mean(visits, na.rm = TRUE))
```

## Counting{.codesmall}

`count()` and `n()` can give very similar information.

```{r}
CO_heat_ER %>% count(gender)
CO_heat_ER %>% group_by(gender) %>% summarize(n()) # n() typically used with summarize
```

# A few miscellaneous topics ..

## Base R functions you might see: `length` and `unique`

These functions require a column as a vector using `pull()`.

```{r, message = FALSE}
CO_heat_ER_gen <- CO_heat_ER %>% pull(gender) # pull() to make a vector
CO_heat_ER_gen %>% unique() # similar to distinct()
```

## Base R functions you might see: `length` and `unique`

These functions require a column as a vector using `pull()`.

```{r, message = FALSE}
CO_heat_ER_gen %>% unique() %>% length() # similar to n_distinct()
```

## * New!
* Many dplyr functions now have a `.by=` argument

Pipe `CO_heat_ER` into `group_by`, then pipe that into `summarize`:

```{r eval = FALSE}
CO_heat_ER %>%
  group_by(gender) %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE),
            max_visits = max(visits, na.rm = TRUE))
```

is the same as..

```{r eval = FALSE}
CO_heat_ER %>%
  summarize(avg_visits = mean(visits, na.rm = TRUE),
            max_visits = max(visits, na.rm = TRUE),
            .by = gender)
```

## `summary()` vs. `summarize()`

* `summary()` (base R) gives a table of statistics for a dataset.
* `summarize()` (dplyr) creates a more customized summary tibble/dataframe.

## Summary & Lab Part 2

- `count(x)`: how many rows does each unique value have?
- `distinct()`: what are the distinct values?
- `n_distinct()` with `pull()`: how many distinct values?
- `group_by()`: changes how all subsequent functions operate
    - combine with `summarize()` to get statistics per group
    - combine with `mutate()` to add a column
- `summarize()` with `n()` gives the count (NAs included)

🏠 [Class Website](https://daseh.org/)

💻 [Lab](https://daseh.org/modules/Data_Summarization/lab/Data_Summarization_Lab.Rmd)

```{r, fig.alt="The End", out.width = "50%", echo = FALSE, fig.align='center'}
knitr::include_graphics(here::here("images/the-end-g23b994289_1280.jpg"))
```

Image by Gerd Altmann from Pixabay

# Extra Slides: More advanced summarization

## Data Summarization on data frames

* Statistical summarization across the data frame
    * `rowMeans(x)`: takes the means of each row of x
    * `colMeans(x)`: takes the means of each column of x
    * `rowSums(x)`: takes the sum of each row of x
    * `colSums(x)`: takes the sum of each column of x

```{r}
yearly_co2 <- yearly_co2_emissions
```

## `rowMeans()` example

Get means for each row.

Let's see what the mean CO2 emissions are across years for each row (country):

```{r}
yearly_co2 %>%
  select(starts_with("201")) %>%
  rowMeans(na.rm = TRUE) %>%
  head(n = 5)

yearly_co2 %>%
  group_by(country) %>%
  summarize(mean = rowMeans(across(starts_with("201")), na.rm = TRUE)) %>%
  head(n = 5)
```

## `colMeans()` example

Get means for each column.

Let's see what the mean is across each column (year):

```{r}
yearly_co2 %>%
  select(starts_with("201")) %>%
  colMeans(na.rm = TRUE) %>%
  head(n = 5)

yearly_co2 %>%
  summarize(across(starts_with("201"), ~mean(.x, na.rm = TRUE)))
```

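
To check that the `colMeans()` and `summarize(across())` approaches agree, here is a minimal sketch on a made-up wide tibble. The `toy_wide` object, its year columns, and its values are hypothetical stand-ins for `yearly_co2`.

```r
library(dplyr)

# Hypothetical toy data shaped like yearly_co2: one row per country, one column per year
toy_wide <- tibble(country = c("X", "Y", "Z"),
                   `2010`  = c(1, 2, NA),
                   `2011`  = c(4, 5, 6))

# Base R: returns a named numeric vector, one mean per selected column
col_means <- toy_wide %>%
  select(starts_with("201")) %>%
  colMeans(na.rm = TRUE)

# dplyr: returns a one-row tibble, one mean per selected column
across_means <- toy_wide %>%
  summarize(across(starts_with("201"), ~ mean(.x, na.rm = TRUE)))

col_means
across_means
```

The values match; the two approaches differ only in the shape of what they return (a named vector versus a tibble).
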