Data used
CalEnviroScreen Dataset: CalEnviroScreen is a project that ranks census tracts in California based on potential exposures to pollutants, adverse environmental conditions, socioeconomic factors and the prevalence of certain health conditions. Data used in the CalEnviroScreen model come from national and state sources.
The data is from https://calenviroscreen-oehha.hub.arcgis.com/#Data
You can Download as a CSV in your current working directory. Note its also available at: https://daseh.org/data/CalEnviroScreen_data.csv
library(tidyverse)
library(dasehr)
ces <- calenviroscreen
# Or use
# ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
How observations/rows are in the ces
data set? You can
use dim()
or nrow()
or examine the
Environment.
nrow(ces)
## [1] 8035
dim(ces)
## [1] 8035 67
ces %>% nrow()
## [1] 8035
ces %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8035
What was the population of California in the 2010 census, based on
the TotalPop
column? (use sum()
)
ces %>% pull(TotalPop) %>% sum()
## [1] 39283497
What is the largest (max
) total population
(TotalPop
) among all census tracts (rows)? Use
summarize
.
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))
ces %>%
summarize(max = max(TotalPop))
## # A tibble: 1 × 1
## max
## <int>
## 1 38754
Modify your code from 1.3 to add the min
of
TotalPop
using the summarize
function.
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
)
ces %>%
summarize(max = max(TotalPop),
min = min(TotalPop))
## # A tibble: 1 × 2
## max min
## <int> <int>
## 1 38754 0
Summarize the ces
data to get the mean of
TotalPop
and Pesticides
. Make sure to remove
NA
s.
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE),
SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE)
)
ces %>% summarize(
mean_pop = mean(TotalPop, na.rm = TRUE),
mean_pesticide = mean(Pesticides, na.rm = TRUE)
)
## # A tibble: 1 × 2
## mean_pop mean_pesticide
## <dbl> <dbl>
## 1 4889. 268.
ces %>%
select(TotalPop, Pesticides) %>%
colMeans()
## TotalPop Pesticides
## 4889.0475 268.4894
Given that parts of California are heavily agricultural, and the max
value for the Pesticides
variable is 80811, why might the
average value be so low??
# There are probably some zeros or other incorrect low values in the data from the census tracts that are in urban areas.
Filter any zeros out of ces
Pesticides
. Use
filter()
. Assign this “cleaned” dataset object the name
`exurban_ces``.
(We are making the admittedly shaky assumption that places with no reported pesticide use are within cities.)
# General format
DATA_TIBBLE %>% filter(LOGICAL_COMPARISON)
exurban_ces <- ces %>% filter(Pesticides != 0)
How many census tracts have pesticide values greater than 0?
nrow(exurban_ces)
## [1] 2602
dim(exurban_ces)
## [1] 2602 67
exurban_ces %>% nrow()
## [1] 2602
exurban_ces %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 2602
The variable CES4.0PercRange
categorizes the calculated
CES4.0 value (a measure of the pollution burden in a particular region)
into percentile ranges, grouped by 5% increments.
How many census tracts are there in each percentile range? Use
count()
on the column named CES4.0PercRange
.
Use ces
as your input data.
ces %>% count(CES4.0PercRange)
## # A tibble: 21 × 2
## CES4.0PercRange n
## <chr> <int>
## 1 1-5% (lowest scores) 396
## 2 10-15% 396
## 3 15-20% 397
## 4 20-25% 396
## 5 25-30% 397
## 6 30-35% 397
## 7 35-40% 396
## 8 40-45% 397
## 9 45-50% 396
## 10 5-10% 397
## # ℹ 11 more rows
Modify your code from question 2.1 to break down each percentile
range by California county. Use count()
on the columns
named CES4.0PercRange
and
CaliforniaCounty
.
ces %>% count(CES4.0PercRange, CaliforniaCounty)
## # A tibble: 730 × 3
## CES4.0PercRange CaliforniaCounty n
## <chr> <chr> <int>
## 1 1-5% (lowest scores) "Alameda " 25
## 2 1-5% (lowest scores) "Contra Costa" 43
## 3 1-5% (lowest scores) "El Dorado" 16
## 4 1-5% (lowest scores) "Fresno " 2
## 5 1-5% (lowest scores) "Los Angeles" 28
## 6 1-5% (lowest scores) "Marin " 22
## 7 1-5% (lowest scores) "Monterey " 13
## 8 1-5% (lowest scores) "Napa " 1
## 9 1-5% (lowest scores) "Nevada " 5
## 10 1-5% (lowest scores) "Orange " 29
## # ℹ 720 more rows
Hmm. This isn’t the easiest table to read. Let’s try a different approach.
How many census tracts are there in each county? Use
group_by()
, summarize()
, and n()
on the column named county
.
# General format
DATA_TIBBLE %>%
group_by(GROUPING_COLUMN_NAME) %>%
summarize(SUMMARY_COLUMN_NAME = n())
ces %>%
group_by(CaliforniaCounty) %>%
summarize(count = n())
## # A tibble: 58 × 2
## CaliforniaCounty count
## <chr> <int>
## 1 "Alameda " 360
## 2 "Alpine " 1
## 3 "Amador " 9
## 4 "Butte " 51
## 5 "Calaveras " 10
## 6 "Colusa " 5
## 7 "Contra Costa" 207
## 8 "Del Norte" 7
## 9 "El Dorado" 42
## 10 "Fresno " 199
## # ℹ 48 more rows
Modify your code from 2.3 to also group by
CES4.0PercRange
.
ces %>%
group_by(CaliforniaCounty, CES4.0PercRange) %>%
summarize(count = n())
## `summarise()` has grouped output by 'CaliforniaCounty'. You can override using
## the `.groups` argument.
## # A tibble: 730 × 3
## # Groups: CaliforniaCounty [58]
## CaliforniaCounty CES4.0PercRange count
## <chr> <chr> <int>
## 1 "Alameda " 1-5% (lowest scores) 25
## 2 "Alameda " 10-15% 14
## 3 "Alameda " 15-20% 21
## 4 "Alameda " 20-25% 30
## 5 "Alameda " 25-30% 27
## 6 "Alameda " 30-35% 21
## 7 "Alameda " 35-40% 22
## 8 "Alameda " 40-45% 20
## 9 "Alameda " 45-50% 16
## 10 "Alameda " 5-10% 21
## # ℹ 720 more rows
Modify code from 2.3 to also summarize by total population per group. In your summarized output, make sure you call the new summarized average total population variable (column name) “mean”. In other words, the head of your output should look like:
# A tibble: 58 × 3
CaliforniaCounty count mean
<chr> <int> <dbl>
1 "Alameda " 360 4602.
...
ces %>%
group_by(CaliforniaCounty) %>%
summarize(
count = n(),
mean = mean(TotalPop)
)
## # A tibble: 58 × 3
## CaliforniaCounty count mean
## <chr> <int> <dbl>
## 1 "Alameda " 360 4602.
## 2 "Alpine " 1 1039
## 3 "Amador " 9 4270.
## 4 "Butte " 51 4428.
## 5 "Calaveras " 10 4551.
## 6 "Colusa " 5 4291.
## 7 "Contra Costa" 207 5518.
## 8 "Del Norte" 7 3928.
## 9 "El Dorado" 42 4490.
## 10 "Fresno " 199 4947.
## # ℹ 48 more rows
Take your code from the above question P.4 and do the following:
%>%
)arrange()
to sort the output by the summarized
column “mean”.ces %>%
group_by(CaliforniaCounty) %>%
summarize(
count = n(),
mean = mean(TotalPop)
) %>%
arrange(mean)
## # A tibble: 58 × 3
## CaliforniaCounty count mean
## <chr> <int> <dbl>
## 1 "Alpine " 1 1039
## 2 "Modoc " 4 2227.
## 3 "Trinity " 5 2540
## 4 "Plumas " 7 2666.
## 5 "Mariposa " 6 2903.
## 6 "Inyo " 6 2996.
## 7 "Sierra " 1 3040
## 8 "Siskiyou " 14 3105.
## 9 "Lassen " 9 3424.
## 10 "Napa " 40 3491.
## # ℹ 48 more rows