Part 1

We’ll again use the CalEnviroScreen dataset for the lab. Load the tidyverse package and the dataset, which can be found at https://daseh.org/data/CalEnviroScreen_data.csv. Name the dataset ces.

library(tidyverse)

ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")

1.1

How many observations/rows are in the ces data set? You can use dim() or nrow() or examine the Environment.

nrow(ces)

## [1] 8035

dim(ces)

## [1] 8035   67

ces %>% nrow()

## [1] 8035

ces %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  8035

1.2

The TotalPop column includes information about the population for each census tract as of the 2010 census.

NOTE: A census tract a small, relatively permanent area within a county used to present data from the census. Each row in the ces dataset corresponds to a single census tract. See https://www2.census.gov/geo/pdfs/education/CensusTracts.pdf

What was the total population in the dataset based on the 2010 census? (use sum() and the TotalPop column)

ces %>% pull(TotalPop) %>% sum()

## [1] 39283497

1.3

What was the largest population, according to the 2010 census? Use summarize and max.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))

ces %>% 
  summarize(max = max(TotalPop))

## # A tibble: 1 × 1
##     max
##   <dbl>
## 1 38754

1.4

Modify your code from 1.3 to add the smallest population among census tracts. Use min in your summarize function.

HINT: Make sure you make your SUMMARY_COLUMN_NAME different for the max and the min summaries! What do you think would happen if you tried to make SUMMARY_COLUMN_NAME the same name for both statistics?

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
              SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
    )

ces %>% 
  summarize(max = max(TotalPop),
            min = min(TotalPop))

## # A tibble: 1 × 2
##     max   min
##   <dbl> <dbl>
## 1 38754     0

#If your code uses the same name for both summary columns, the output only has one column. The min() summary replaces the max() summary
test <- ces %>% 
  summarize(summary = max(TotalPop),
            summary = min(TotalPop))
  
test

## # A tibble: 1 × 1
##   summary
##     <dbl>
## 1       0

Practice on Your Own!

P.1

Summarize the ces data to get the mean of both the TotalPop and Pesticides columns. Make sure to remove NAs.

Pesticides: Total pounds of selected active pesticide ingredients used in production-agriculture per square mile. The higher the number, the greater the amount of pesticides have been used on agricultural sites

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE),
              SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE)
    )

ces %>% summarize(
  mean_pop = mean(TotalPop, na.rm = TRUE),
  mean_pesticide = mean(Pesticides, na.rm = TRUE)
)

## # A tibble: 1 × 2
##   mean_pop mean_pesticide
##      <dbl>          <dbl>
## 1    4889.           268.

ces %>%
  select(TotalPop, Pesticides) %>%
  colMeans()

##   TotalPop Pesticides 
##  4889.0475   268.4894

P.2

Given that parts of California are heavily agricultural, and the max value for the Pesticides variable is 80811, why might the average value be so low??

# There are probably some zeros or other incorrect low values in the data from the census tracts that are in urban areas, or areas with limited agriculture.

P.3

Filter any zeros from the Pesticides column out of ces. Use filter(). Assign this “cleaned” dataset object the name ces_pest.

# General format 
DATA_TIBBLE %>% filter(LOGICAL_COMPARISON)

ces_pest <- ces %>% filter(Pesticides != 0)

How many census tracts have pesticide values greater than 0?

nrow(ces_pest)

## [1] 2602

dim(ces_pest)

## [1] 2602   67

ces_pest %>% nrow()

## [1] 2602

ces_pest %>% count()

## # A tibble: 1 × 1
##       n
##   <int>
## 1  2602

Part 2

2.1

How many census tracts are present in each California county? Use count() on the column named CaliforniaCounty. Use ces as your input data.

ces %>% count(CaliforniaCounty)

## # A tibble: 58 × 2
##    CaliforniaCounty     n
##    <chr>            <int>
##  1 Alameda            360
##  2 Alpine               1
##  3 Amador               9
##  4 Butte               51
##  5 Calaveras           10
##  6 Colusa               5
##  7 Contra Costa       207
##  8 Del Norte            7
##  9 El Dorado           42
## 10 Fresno             199
## # ℹ 48 more rows

2.2

Let’s break down the count further. Modify your code from question 2.1 to count census tracts by County AND ZIP code. Use count() on the columns named CaliforniaCounty and ZIP.

ces %>% count(CaliforniaCounty, ZIP)

## # A tibble: 1,377 × 3
##    CaliforniaCounty   ZIP     n
##    <chr>            <dbl> <int>
##  1 Alameda          94501    14
##  2 Alameda          94502     2
##  3 Alameda          94536    12
##  4 Alameda          94538    13
##  5 Alameda          94539    11
##  6 Alameda          94541    12
##  7 Alameda          94542     2
##  8 Alameda          94544    16
##  9 Alameda          94545     6
## 10 Alameda          94546    10
## # ℹ 1,367 more rows

This isn’t the only way we can create this table in R. Let’s look at another way to build it.

2.3

How many census tracts are there in each county? Use group_by(), summarize(), and n() on the column named CaliforniaCounty.

# General format 
DATA_TIBBLE %>% 
    group_by(GROUPING_COLUMN_NAME) %>% 
    summarize(SUMMARY_COLUMN_NAME = n())

ces %>%
  group_by(CaliforniaCounty) %>%
  summarize(count = n())

## # A tibble: 58 × 2
##    CaliforniaCounty count
##    <chr>            <int>
##  1 Alameda            360
##  2 Alpine               1
##  3 Amador               9
##  4 Butte               51
##  5 Calaveras           10
##  6 Colusa               5
##  7 Contra Costa       207
##  8 Del Norte            7
##  9 El Dorado           42
## 10 Fresno             199
## # ℹ 48 more rows

2.4

Modify your code from 2.3 to also group by ZIP.

ces %>%
  group_by(CaliforniaCounty, ZIP) %>%
  summarize(count = n())

## `summarise()` has grouped output by 'CaliforniaCounty'. You can override using
## the `.groups` argument.

## # A tibble: 1,377 × 3
## # Groups:   CaliforniaCounty [58]
##    CaliforniaCounty   ZIP count
##    <chr>            <dbl> <int>
##  1 Alameda          94501    14
##  2 Alameda          94502     2
##  3 Alameda          94536    12
##  4 Alameda          94538    13
##  5 Alameda          94539    11
##  6 Alameda          94541    12
##  7 Alameda          94542     2
##  8 Alameda          94544    16
##  9 Alameda          94545     6
## 10 Alameda          94546    10
## # ℹ 1,367 more rows

Practice on Your Own!

P.4

Modify code from 2.3 (the code that only groups by county) to also summarize by total population (TotalPop) per group. In your summarized output, make sure you call the new summarized average total population variable (column name) “mean”. In other words, the head of your output should look like:

# A tibble: 58 × 3
   CaliforniaCounty count  mean
   <chr>            <int> <dbl>
 1 "Alameda "         360 4602.
...

(In the above table, remember that the “count” column is counting the number of census tracts.)

ces %>%
  group_by(CaliforniaCounty) %>%
  summarize(
    count = n(),
    mean = mean(TotalPop)
  )

## # A tibble: 58 × 3
##    CaliforniaCounty count  mean
##    <chr>            <int> <dbl>
##  1 Alameda            360 4602.
##  2 Alpine               1 1039 
##  3 Amador               9 4270.
##  4 Butte               51 4428.
##  5 Calaveras           10 4551.
##  6 Colusa               5 4291.
##  7 Contra Costa       207 5518.
##  8 Del Norte            7 3928.
##  9 El Dorado           42 4490.
## 10 Fresno             199 4947.
## # ℹ 48 more rows

P.5

Take your code from the above question P.4 and do the following:

Add another pipe (%>%)
Add the arrange() to sort the output by the summarized column “mean”.

ces %>%
  group_by(CaliforniaCounty) %>%
  summarize(
    count = n(),
    mean = mean(TotalPop)
  ) %>%
  arrange(mean)

## # A tibble: 58 × 3
##    CaliforniaCounty count  mean
##    <chr>            <int> <dbl>
##  1 Alpine               1 1039 
##  2 Modoc                4 2227.
##  3 Trinity              5 2540 
##  4 Plumas               7 2666.
##  5 Mariposa             6 2903.
##  6 Inyo                 6 2996.
##  7 Sierra               1 3040 
##  8 Siskiyou            14 3105.
##  9 Lassen               9 3424.
## 10 Napa                40 3491.
## # ℹ 48 more rows

Data Summarization Lab - Key

Part 1

1.1

1.2

1.3

1.4

Practice on Your Own!

P.1

P.2

P.3

Part 2

2.1

2.2

2.3

2.4

Practice on Your Own!

P.4

P.5