Part 1

Data used

CalEnviroScreen Dataset: CalEnviroScreen is a project that ranks census tracts in California based on potential exposures to pollutants, adverse environmental conditions, socioeconomic factors and the prevalence of certain health conditions. Data used in the CalEnviroScreen model come from national and state sources.

The data is from https://calenviroscreen-oehha.hub.arcgis.com/#Data

You can Download as a CSV in your current working directory. Note its also available at: https://daseh.org/data/CalEnviroScreen_data.csv

library(tidyverse)
library(dasehr)
ces <- calenviroscreen
# Or use
# ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")

1.1

How observations/rows are in the ces data set? You can use dim() or nrow() or examine the Environment.

nrow(ces)
## [1] 8035
dim(ces)
## [1] 8035   67
ces %>% nrow()
## [1] 8035
ces %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  8035

1.2

What was the population of California in the 2010 census, based on the TotalPop column? (use sum())

ces %>% pull(TotalPop) %>% sum()
## [1] 39283497

1.3

What is the largest (max) total population (TotalPop) among all census tracts (rows)? Use summarize.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))
ces %>% 
  summarize(max = max(TotalPop))
## # A tibble: 1 × 1
##     max
##   <int>
## 1 38754

1.4

Modify your code from 1.3 to add the min of TotalPop using the summarize function.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
              SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
    )
ces %>% 
  summarize(max = max(TotalPop),
            min = min(TotalPop))
## # A tibble: 1 × 2
##     max   min
##   <int> <int>
## 1 38754     0

Practice on Your Own!

P.1

Summarize the ces data to get the mean of TotalPop and Pesticides. Make sure to remove NAs.

# General format 
DATA_TIBBLE %>% 
    summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE),
              SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE)
    )
ces %>% summarize(
  mean_pop = mean(TotalPop, na.rm = TRUE),
  mean_pesticide = mean(Pesticides, na.rm = TRUE)
)
## # A tibble: 1 × 2
##   mean_pop mean_pesticide
##      <dbl>          <dbl>
## 1    4889.           268.
ces %>%
  select(TotalPop, Pesticides) %>%
  colMeans()
##   TotalPop Pesticides 
##  4889.0475   268.4894

P.2

Given that parts of California are heavily agricultural, and the max value for the Pesticides variable is 80811, why might the average value be so low??

# There are probably some zeros or other incorrect low values in the data from the census tracts that are in urban areas.

P.3

Filter any zeros out of ces Pesticides. Use filter(). Assign this “cleaned” dataset object the name `exurban_ces``.

(We are making the admittedly shaky assumption that places with no reported pesticide use are within cities.)

# General format 
DATA_TIBBLE %>% filter(LOGICAL_COMPARISON)
exurban_ces <- ces %>% filter(Pesticides != 0)

How many census tracts have pesticide values greater than 0?

nrow(exurban_ces)
## [1] 2602
dim(exurban_ces)
## [1] 2602   67
exurban_ces %>% nrow()
## [1] 2602
exurban_ces %>% count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  2602

Part 2

2.1

The variable CES4.0PercRange categorizes the calculated CES4.0 value (a measure of the pollution burden in a particular region) into percentile ranges, grouped by 5% increments.

How many census tracts are there in each percentile range? Use count() on the column named CES4.0PercRange. Use ces as your input data.

ces %>% count(CES4.0PercRange)
## # A tibble: 21 × 2
##    CES4.0PercRange          n
##    <chr>                <int>
##  1 1-5% (lowest scores)   396
##  2 10-15%                 396
##  3 15-20%                 397
##  4 20-25%                 396
##  5 25-30%                 397
##  6 30-35%                 397
##  7 35-40%                 396
##  8 40-45%                 397
##  9 45-50%                 396
## 10 5-10%                  397
## # ℹ 11 more rows

2.2

Modify your code from question 2.1 to break down each percentile range by California county. Use count() on the columns named CES4.0PercRange and CaliforniaCounty.

ces %>% count(CES4.0PercRange, CaliforniaCounty)
## # A tibble: 730 × 3
##    CES4.0PercRange      CaliforniaCounty     n
##    <chr>                <chr>            <int>
##  1 1-5% (lowest scores) "Alameda "          25
##  2 1-5% (lowest scores) "Contra Costa"      43
##  3 1-5% (lowest scores) "El Dorado"         16
##  4 1-5% (lowest scores) "Fresno "            2
##  5 1-5% (lowest scores) "Los Angeles"       28
##  6 1-5% (lowest scores) "Marin "            22
##  7 1-5% (lowest scores) "Monterey "         13
##  8 1-5% (lowest scores) "Napa "              1
##  9 1-5% (lowest scores) "Nevada "            5
## 10 1-5% (lowest scores) "Orange "           29
## # ℹ 720 more rows

Hmm. This isn’t the easiest table to read. Let’s try a different approach.

2.3

How many census tracts are there in each county? Use group_by(), summarize(), and n() on the column named county.

# General format 
DATA_TIBBLE %>% 
    group_by(GROUPING_COLUMN_NAME) %>% 
    summarize(SUMMARY_COLUMN_NAME = n())
ces %>%
  group_by(CaliforniaCounty) %>%
  summarize(count = n())
## # A tibble: 58 × 2
##    CaliforniaCounty count
##    <chr>            <int>
##  1 "Alameda "         360
##  2 "Alpine "            1
##  3 "Amador "            9
##  4 "Butte "            51
##  5 "Calaveras "        10
##  6 "Colusa "            5
##  7 "Contra Costa"     207
##  8 "Del Norte"          7
##  9 "El Dorado"         42
## 10 "Fresno "          199
## # ℹ 48 more rows

2.4

Modify your code from 2.3 to also group by CES4.0PercRange.

ces %>%
  group_by(CaliforniaCounty, CES4.0PercRange) %>%
  summarize(count = n())
## `summarise()` has grouped output by 'CaliforniaCounty'. You can override using
## the `.groups` argument.
## # A tibble: 730 × 3
## # Groups:   CaliforniaCounty [58]
##    CaliforniaCounty CES4.0PercRange      count
##    <chr>            <chr>                <int>
##  1 "Alameda "       1-5% (lowest scores)    25
##  2 "Alameda "       10-15%                  14
##  3 "Alameda "       15-20%                  21
##  4 "Alameda "       20-25%                  30
##  5 "Alameda "       25-30%                  27
##  6 "Alameda "       30-35%                  21
##  7 "Alameda "       35-40%                  22
##  8 "Alameda "       40-45%                  20
##  9 "Alameda "       45-50%                  16
## 10 "Alameda "       5-10%                   21
## # ℹ 720 more rows

Practice on Your Own!

P.4

Modify code from 2.3 to also summarize by total population per group. In your summarized output, make sure you call the new summarized average total population variable (column name) “mean”. In other words, the head of your output should look like:

# A tibble: 58 × 3
   CaliforniaCounty count  mean
   <chr>            <int> <dbl>
 1 "Alameda "         360 4602.
...
ces %>%
  group_by(CaliforniaCounty) %>%
  summarize(
    count = n(),
    mean = mean(TotalPop)
  )
## # A tibble: 58 × 3
##    CaliforniaCounty count  mean
##    <chr>            <int> <dbl>
##  1 "Alameda "         360 4602.
##  2 "Alpine "            1 1039 
##  3 "Amador "            9 4270.
##  4 "Butte "            51 4428.
##  5 "Calaveras "        10 4551.
##  6 "Colusa "            5 4291.
##  7 "Contra Costa"     207 5518.
##  8 "Del Norte"          7 3928.
##  9 "El Dorado"         42 4490.
## 10 "Fresno "          199 4947.
## # ℹ 48 more rows

P.5

Take your code from the above question P.4 and do the following:

  • Add another pipe (%>%)
  • Add the arrange() to sort the output by the summarized column “mean”.
ces %>%
  group_by(CaliforniaCounty) %>%
  summarize(
    count = n(),
    mean = mean(TotalPop)
  ) %>%
  arrange(mean)
## # A tibble: 58 × 3
##    CaliforniaCounty count  mean
##    <chr>            <int> <dbl>
##  1 "Alpine "            1 1039 
##  2 "Modoc "             4 2227.
##  3 "Trinity "           5 2540 
##  4 "Plumas "            7 2666.
##  5 "Mariposa "          6 2903.
##  6 "Inyo "              6 2996.
##  7 "Sierra "            1 3040 
##  8 "Siskiyou "         14 3105.
##  9 "Lassen "            9 3424.
## 10 "Napa "             40 3491.
## # ℹ 48 more rows