We’ll again use the CalEnviroScreen dataset for the lab. Load the tidyverse
package and the dataset, which can be found at https://daseh.org/data/CalEnviroScreen_data.csv. Name the dataset ces
.
library(tidyverse)
ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
How many observations/rows are in the ces
data set? You can use dim()
or nrow()
or examine the Environment.
nrow(ces)
## [1] 8035
dim(ces)
## [1] 8035 67
ces %>% nrow()
## [1] 8035
ces %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 8035
The TotalPop
column includes information about the population for each census tract as of the 2010 census.
NOTE: A census tract a small, relatively permanent area within a county used to present data from the census. Each row in the ces
dataset corresponds to a single census tract. See https://www2.census.gov/geo/pdfs/education/CensusTracts.pdf
What was the total population in the dataset based on the 2010 census? (use sum()
and the TotalPop
column)
ces %>% pull(TotalPop) %>% sum()
## [1] 39283497
What was the largest population, according to the 2010 census, for a single census tract (row)? Use summarize
and max
.
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN))
ces %>%
summarize(max = max(TotalPop))
## # A tibble: 1 × 1
## max
## <dbl>
## 1 38754
Modify your code from 1.3 to add the smallest population among census tracts. Use min
in your summarize
function.
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN),
SUMMARY_COLUMN_NAME = FUNCTION(SOURCE_COLUMN)
)
ces %>%
summarize(max = max(TotalPop),
min = min(TotalPop))
## # A tibble: 1 × 2
## max min
## <dbl> <dbl>
## 1 38754 0
Summarize the ces
data to get the mean of both the TotalPop
and Pesticides
columns. Make sure to remove NA
s.
Pesticides
: Total pounds of selected active pesticide ingredients used in production-agriculture per square mile. The higher the number, the greater the amount of pesticides have been used on agricultural sites
# General format
DATA_TIBBLE %>%
summarize(SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE),
SUMMARY_COLUMN_NAME = OPERATOR(SOURCE_COLUMN, na.rm = TRUE)
)
ces %>% summarize(
mean_pop = mean(TotalPop, na.rm = TRUE),
mean_pesticide = mean(Pesticides, na.rm = TRUE)
)
## # A tibble: 1 × 2
## mean_pop mean_pesticide
## <dbl> <dbl>
## 1 4889. 268.
ces %>%
select(TotalPop, Pesticides) %>%
colMeans()
## TotalPop Pesticides
## 4889.0475 268.4894
Given that parts of California are heavily agricultural, and the max value for the Pesticides
variable is 80811, why might the average value be so low??
# There are probably some zeros or other incorrect low values in the data from the census tracts that are in urban areas, or areas with limited agriculture.
Filter any zeros from the Pesticides
column out of ces
. Use filter()
. Assign this “cleaned” dataset object the name ces_pest
.
# General format
DATA_TIBBLE %>% filter(LOGICAL_COMPARISON)
ces_pest <- ces %>% filter(Pesticides != 0)
How many census tracts have pesticide values greater than 0?
nrow(ces_pest)
## [1] 2602
dim(ces_pest)
## [1] 2602 67
ces_pest %>% nrow()
## [1] 2602
ces_pest %>% count()
## # A tibble: 1 × 1
## n
## <int>
## 1 2602
How many census tracts are present in each California county? Use count()
on the column named CaliforniaCounty
. Use ces
as your input data.
ces %>% count(CaliforniaCounty)
## # A tibble: 58 × 2
## CaliforniaCounty n
## <chr> <int>
## 1 Alameda 360
## 2 Alpine 1
## 3 Amador 9
## 4 Butte 51
## 5 Calaveras 10
## 6 Colusa 5
## 7 Contra Costa 207
## 8 Del Norte 7
## 9 El Dorado 42
## 10 Fresno 199
## # ℹ 48 more rows
Let’s break down the count further. Modify your code from question 2.1 to count census tracts by County AND ZIP code. Use count()
on the columns named CaliforniaCounty
and ZIP
.
ces %>% count(CaliforniaCounty, ZIP)
## # A tibble: 1,377 × 3
## CaliforniaCounty ZIP n
## <chr> <dbl> <int>
## 1 Alameda 94501 14
## 2 Alameda 94502 2
## 3 Alameda 94536 12
## 4 Alameda 94538 13
## 5 Alameda 94539 11
## 6 Alameda 94541 12
## 7 Alameda 94542 2
## 8 Alameda 94544 16
## 9 Alameda 94545 6
## 10 Alameda 94546 10
## # ℹ 1,367 more rows
This isn’t the only way we can create this table in R. Let’s look at another way to build it.
How many census tracts are there in each county? Use group_by()
, summarize()
, and n()
on the column named county
.
# General format
DATA_TIBBLE %>%
group_by(GROUPING_COLUMN_NAME) %>%
summarize(SUMMARY_COLUMN_NAME = n())
ces %>%
group_by(CaliforniaCounty) %>%
summarize(count = n())
## # A tibble: 58 × 2
## CaliforniaCounty count
## <chr> <int>
## 1 Alameda 360
## 2 Alpine 1
## 3 Amador 9
## 4 Butte 51
## 5 Calaveras 10
## 6 Colusa 5
## 7 Contra Costa 207
## 8 Del Norte 7
## 9 El Dorado 42
## 10 Fresno 199
## # ℹ 48 more rows
Modify your code from 2.3 to also group by ZIP
.
ces %>%
group_by(CaliforniaCounty, ZIP) %>%
summarize(count = n())
## `summarise()` has grouped output by 'CaliforniaCounty'. You can override using
## the `.groups` argument.
## # A tibble: 1,377 × 3
## # Groups: CaliforniaCounty [58]
## CaliforniaCounty ZIP count
## <chr> <dbl> <int>
## 1 Alameda 94501 14
## 2 Alameda 94502 2
## 3 Alameda 94536 12
## 4 Alameda 94538 13
## 5 Alameda 94539 11
## 6 Alameda 94541 12
## 7 Alameda 94542 2
## 8 Alameda 94544 16
## 9 Alameda 94545 6
## 10 Alameda 94546 10
## # ℹ 1,367 more rows
Modify code from 2.3 (the code that only groups by county) to also summarize by total population (TotalPop
) per group. In your summarized output, make sure you call the new summarized average total population variable (column name) “mean”. In other words, the head of your output should look like:
# A tibble: 58 × 3
CaliforniaCounty count mean
<chr> <int> <dbl>
1 "Alameda " 360 4602.
...
(In the above table, remember that the “count” column is counting the number of census tracts.)
ces %>%
group_by(CaliforniaCounty) %>%
summarize(
count = n(),
mean = mean(TotalPop)
)
## # A tibble: 58 × 3
## CaliforniaCounty count mean
## <chr> <int> <dbl>
## 1 Alameda 360 4602.
## 2 Alpine 1 1039
## 3 Amador 9 4270.
## 4 Butte 51 4428.
## 5 Calaveras 10 4551.
## 6 Colusa 5 4291.
## 7 Contra Costa 207 5518.
## 8 Del Norte 7 3928.
## 9 El Dorado 42 4490.
## 10 Fresno 199 4947.
## # ℹ 48 more rows
Take your code from the above question P.4 and do the following:
%>%
)arrange()
to sort the output by the summarized column “mean”.ces %>%
group_by(CaliforniaCounty) %>%
summarize(
count = n(),
mean = mean(TotalPop)
) %>%
arrange(mean)
## # A tibble: 58 × 3
## CaliforniaCounty count mean
## <chr> <int> <dbl>
## 1 Alpine 1 1039
## 2 Modoc 4 2227.
## 3 Trinity 5 2540
## 4 Plumas 7 2666.
## 5 Mariposa 6 2903.
## 6 Inyo 6 2996.
## 7 Sierra 1 3040
## 8 Siskiyou 14 3105.
## 9 Lassen 9 3424.
## 10 Napa 40 3491.
## # ℹ 48 more rows