In this lab you can use the interactive console to explore but please record your commands here. Remember anything you type here can be “sent” to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But only inside the code chunks designated with the {r}
areas).
First let’s load our packages. We’re going to be using the dplyr
package, which you can load as part of the tidyverse
package.
# don't forget to load the packages that you will need!
library(tidyverse)
We’ll again work with the CalEnviroScreen dataset, which contains information about environmental factors associated with human health in California.
First, load the data from the website, either manually or by using the Data Import menu (find it by clicking on File).
ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Check that it worked by seeing if you have the ces
data.
ces
## # A tibble: 8,035 × 67
## CensusTract CaliforniaCounty ZIP Longitude Latitude ApproxLocation
## <dbl> <chr> <dbl> <dbl> <dbl> <chr>
## 1 6001400100 Alameda 94704 -122. 37.9 Oakland
## 2 6001400200 Alameda 94618 -122. 37.8 Oakland
## 3 6001400300 Alameda 94618 -122. 37.8 Oakland
## 4 6001400400 Alameda 94609 -122. 37.8 Oakland
## 5 6001400500 Alameda 94609 -122. 37.8 Oakland
## 6 6001400600 Alameda 94609 -122. 37.8 Oakland
## 7 6001400700 Alameda 94608 -122. 37.8 Oakland
## 8 6001400800 Alameda 94608 -122. 37.8 Oakland
## 9 6001400900 Alameda 94608 -122. 37.8 Oakland
## 10 6001401000 Alameda 94608 -122. 37.8 Oakland
## # ℹ 8,025 more rows
## # ℹ 61 more variables: CES4.0Score <dbl>, CES4.0Percentile <dbl>,
## # CES4.0PercRange <chr>, Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>,
## # PM2.5.Pctl <dbl>, DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## # DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## # PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## # Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>, …
What class is ces
?
class(ces)
## [1] "spec_tbl_df" "tbl_df" "tbl" "data.frame"
How many observations (rows) and variables (columns) are in the dataset - try the dim()
function?
dim(ces)
## [1] 8035 67
nrow(ces)
## [1] 8035
ncol(ces)
## [1] 67
Next, rename the column CaliforniaCounty
to CA_county
(hint - use rename()
and watch out for the order of the new and old names!).
ces<- rename(ces, CA_county = CaliforniaCounty)
head(ces)
## # A tibble: 6 × 67
## CensusTract CA_county ZIP Longitude Latitude ApproxLocation CES4.0Score
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6001400100 Alameda 94704 -122. 37.9 Oakland 4.85
## 2 6001400200 Alameda 94618 -122. 37.8 Oakland 4.88
## 3 6001400300 Alameda 94618 -122. 37.8 Oakland 11.2
## 4 6001400400 Alameda 94609 -122. 37.8 Oakland 12.4
## 5 6001400500 Alameda 94609 -122. 37.8 Oakland 16.7
## 6 6001400600 Alameda 94609 -122. 37.8 Oakland 20.0
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## # Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## # DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## # DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## # PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## # Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>,
## # CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>, …
Convert the column names of ces
to be all upper case. Use rename_with()
, and the toupper
command. Save this as a new dataset called ces_upper
.
ces_upper <- rename_with(ces, toupper)
head(ces_upper)
## # A tibble: 6 × 67
## CENSUSTRACT CA_COUNTY ZIP LONGITUDE LATITUDE APPROXLOCATION CES4.0SCORE
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6001400100 Alameda 94704 -122. 37.9 Oakland 4.85
## 2 6001400200 Alameda 94618 -122. 37.8 Oakland 4.88
## 3 6001400300 Alameda 94618 -122. 37.8 Oakland 11.2
## 4 6001400400 Alameda 94609 -122. 37.8 Oakland 12.4
## 5 6001400500 Alameda 94609 -122. 37.8 Oakland 16.7
## 6 6001400600 Alameda 94609 -122. 37.8 Oakland 20.0
## # ℹ 60 more variables: CES4.0PERCENTILE <dbl>, CES4.0PERCRANGE <chr>,
## # OZONE <dbl>, OZONEPCTL <dbl>, PM2.5 <dbl>, PM2.5.PCTL <dbl>,
## # DIESELPM <dbl>, DIESELPMPCTL <dbl>, DRINKINGWATER <dbl>,
## # DRINKINGWATERPCTL <dbl>, LEAD <dbl>, LEADPCTL <dbl>, PESTICIDES <dbl>,
## # PESTICIDESPCTL <dbl>, TOXRELEASE <dbl>, TOXRELEASEPCTL <dbl>,
## # TRAFFIC <dbl>, TRAFFICPCTL <dbl>, CLEANUPSITES <dbl>,
## # CLEANUPSITESPCTL <dbl>, GROUNDWATERTHREATS <dbl>, …
dim(ces_upper)
## [1] 8035 67
How can you show the first 3 rows and the last 3 rows of ces
(in two lines of code)?
head(ces, 3)
## # A tibble: 3 × 67
## CensusTract CA_county ZIP Longitude Latitude ApproxLocation CES4.0Score
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6001400100 Alameda 94704 -122. 37.9 Oakland 4.85
## 2 6001400200 Alameda 94618 -122. 37.8 Oakland 4.88
## 3 6001400300 Alameda 94618 -122. 37.8 Oakland 11.2
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## # Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## # DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## # DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## # PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## # Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>,
## # CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>, …
tail(ces, 3)
## # A tibble: 3 × 67
## CensusTract CA_county ZIP Longitude Latitude ApproxLocation CES4.0Score
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6115040902 Yuba 95901 -121. 39.1 Unincorporated Yub… 18.7
## 2 6115041000 Yuba 95901 -121. 39.3 Unincorporated Yub… 22.0
## 3 6115041100 Yuba 95925 -121. 39.5 Unincorporated Yub… 11.7
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## # Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## # DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## # DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## # PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## # Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>,
## # CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>, …
Create a subset of the ces
that only contains the columns: CensusTract
, Traffic
, and Asthma
and assign this object to ces_sub
- what are the dimensions of this dataset?
CensusTract
: this is a small, relatively permanent area within a county used to present data from the census and other statistical programs
Traffic
: A measure of traffic density in vehicle-kilometers per hour per road length, within 150 meters of the census tract boundary. A higher Traffic
value indicates the presence of more traffic
Asthma
: Age-adjusted rate of emergency department visits for asthma
ces_sub <- select(ces, CensusTract, Traffic, Asthma)
dim(ces_sub)
## [1] 8035 3
Start with ces
again instead of the dataset you just made. Subset the data to only include the CensusTract
column and the columns that end with “Pctl”. Hint: use select()
and ends_with()
. Assign this subset of the data to be ces2
. Again take a look at the data and check the dimensions.
“Pctl” stands for “percentile”.
ces2 <- select(ces, CensusTract, ends_with("Pctl"))
ces2
## # A tibble: 8,035 × 24
## CensusTract OzonePctl PM2.5.Pctl DieselPMPctl DrinkingWaterPctl LeadPctl
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6001400100 3.12 36.3 34.8 4.21 7.74
## 2 6001400200 3.12 42.0 92.7 4.21 68.2
## 3 6001400300 3.12 43.9 89.8 4.21 64.2
## 4 6001400400 3.12 42.8 79.1 4.21 67.1
## 5 6001400500 3.12 42.8 67.6 4.21 68.0
## 6 6001400600 3.12 42.8 83.8 4.21 69.7
## 7 6001400700 3.12 43.3 81.3 4.21 76.9
## 8 6001400800 3.12 44.0 68.7 4.21 73.2
## 9 6001400900 3.12 44.0 81.1 4.21 86.7
## 10 6001401000 3.12 45.8 99.4 4.21 88.5
## # ℹ 8,025 more rows
## # ℹ 18 more variables: PesticidesPctl <dbl>, ToxReleasePctl <dbl>,
## # TrafficPctl <dbl>, CleanupSitesPctl <dbl>, GroundwaterThreatsPctl <dbl>,
## # HazWastePctl <dbl>, ImpWaterBodiesPctl <dbl>, SolidWastePctl <dbl>,
## # PollutionBurdenPctl <dbl>, AsthmaPctl <dbl>, LowBirthWeightPctl <dbl>,
## # CardiovascularDiseasePctl <dbl>, PopCharPctl <dbl>, EducationPctl <dbl>,
## # LinguisticIsolPctl <dbl>, PovertyPctl <dbl>, UnemploymentPctl <dbl>, …
dim(ces2)
## [1] 8035 24
Pull the variable Asthma
from ces_sub
. How does this differ from selecting it? Use head() to take a look at both options.
head(pull(ces_sub, Asthma))
## [1] 15.65 20.47 30.88 49.61 86.57 101.53
head(select(ces_sub, Asthma))
## # A tibble: 6 × 1
## Asthma
## <dbl>
## 1 15.6
## 2 20.5
## 3 30.9
## 4 49.6
## 5 86.6
## 6 102.
Subset the rows of ces_sub
that have more than 100 for Asthma
- how many rows are there? Use filter()
.
nrow(filter(ces_sub, Asthma > 100))
## [1] 592
Subset the rows of ces_sub
that have an Asthma
value more than 100 and a Traffic
value less than 500 and — how many are there?
filter(ces_sub, Asthma > 100 & Traffic < 500) # all of these options work
## # A tibble: 130 × 3
## CensusTract Traffic Asthma
## <dbl> <dbl> <dbl>
## 1 6001401800 73.7 152.
## 2 6001402400 246. 152.
## 3 6001402900 473. 123.
## 4 6001406201 324. 123.
## 5 6001406300 402. 129.
## 6 6001407101 349. 129.
## 7 6001407102 325. 116.
## 8 6001407400 366. 129.
## 9 6001407600 311. 115.
## 10 6001408200 437. 187.
## # ℹ 120 more rows
nrow(filter(ces_sub, Asthma > 100 & Traffic < 500))
## [1] 130
nrow(filter(ces_sub, Asthma > 100, Traffic < 500))
## [1] 130
Subset the rows of ces_sub
that have an Asthma
value more than 100 and a Traffic
value less than or equal to (<=
) 500 — how many are there?
filter(ces_sub, Traffic <= 500 & Asthma > 100) # all of these options work
## # A tibble: 130 × 3
## CensusTract Traffic Asthma
## <dbl> <dbl> <dbl>
## 1 6001401800 73.7 152.
## 2 6001402400 246. 152.
## 3 6001402900 473. 123.
## 4 6001406201 324. 123.
## 5 6001406300 402. 129.
## 6 6001407101 349. 129.
## 7 6001407102 325. 116.
## 8 6001407400 366. 129.
## 9 6001407600 311. 115.
## 10 6001408200 437. 187.
## # ℹ 120 more rows
nrow(filter(ces_sub, Asthma > 100 & Traffic <= 500))
## [1] 130
nrow(filter(ces_sub, Asthma > 100, Traffic <= 500))
## [1] 130
We used two different criteria for subsetting in 2.5 and 2.6. Why are the number of rows the same for 2.5 and 2.6?
# There are no rows with a Traffic value exactly equal to 500
Subset the rows of ces
for rows that have CA_county
of “Los Angeles”, or a Traffic
value less than 300.
How many rows have both?
filter(ces, CA_county == "Los Angeles" | Traffic < 300)
## # A tibble: 2,943 × 67
## CensusTract CA_county ZIP Longitude Latitude ApproxLocation CES4.0Score
## <dbl> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
## 1 6001401800 Alameda 94607 -122. 37.8 Oakland 45.1
## 2 6001402400 Alameda 94607 -122. 37.8 Oakland 37.3
## 3 6001404800 Alameda 94602 -122. 37.8 Oakland 14.3
## 4 6001405500 Alameda 94606 -122. 37.8 Oakland 28.4
## 5 6001405800 Alameda 94606 -122. 37.8 Oakland 28.8
## 6 6001407700 Alameda 94619 -122. 37.8 Oakland 15.7
## 7 6001409600 Alameda 94621 -122. 37.8 Oakland 32.3
## 8 6001409700 Alameda 94605 -122. 37.8 Oakland 21.8
## 9 6001420100 Alameda 94706 -122. 37.9 Albany 6.52
## 10 6001421100 Alameda 94708 -122. 37.9 Berkeley 3.28
## # ℹ 2,933 more rows
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## # Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## # DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## # DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## # PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## # Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>, …
nrow(filter(ces, CA_county == "Los Angeles" & Traffic < 300))
## [1] 61
Select the variables that contain the letter “a” from ces
. Remember, the variables are the column names.
select(ces, contains("a"))
## # A tibble: 8,035 × 37
## CensusTract CA_county Latitude ApproxLocation CES4.0PercRange DrinkingWater
## <dbl> <chr> <dbl> <chr> <chr> <dbl>
## 1 6001400100 Alameda 37.9 Oakland 1-5% (lowest sco… 110.
## 2 6001400200 Alameda 37.8 Oakland 1-5% (lowest sco… 110.
## 3 6001400300 Alameda 37.8 Oakland 15-20% 110.
## 4 6001400400 Alameda 37.8 Oakland 15-20% 110.
## 5 6001400500 Alameda 37.8 Oakland 25-30% 110.
## 6 6001400600 Alameda 37.8 Oakland 35-40% 110.
## 7 6001400700 Alameda 37.8 Oakland 70-75% 110.
## 8 6001400800 Alameda 37.8 Oakland 70-75% 110.
## 9 6001400900 Alameda 37.8 Oakland 75-80% 110.
## 10 6001401000 Alameda 37.8 Oakland 80-85% 110.
## # ℹ 8,025 more rows
## # ℹ 31 more variables: DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>,
## # ToxRelease <dbl>, ToxReleasePctl <dbl>, Traffic <dbl>, TrafficPctl <dbl>,
## # CleanupSites <dbl>, CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>,
## # GroundwaterThreatsPctl <dbl>, HazWaste <dbl>, HazWastePctl <dbl>,
## # ImpWaterBodies <dbl>, ImpWaterBodiesPctl <dbl>, SolidWaste <dbl>,
## # SolidWastePctl <dbl>, Asthma <dbl>, AsthmaPctl <dbl>, …
Create a subset called ces_Alameda
from ces
that only contains the rows for Alameda and only the columns: Traffic
and Asthma
. CA_county
should not be included in ces_Alameda
.
What are the dimensions of this dataset? Don’t use pipes (%>%
) and instead do this in two steps creating the ces_Alameda
object with filter
and updating it with select
.
ces_Alameda <- filter(ces, CA_county == "Alameda")
ces_Alameda <- select(ces_Alameda, Traffic, Asthma)
ces_Alameda # can get dimensions just from viewing the tibble
## # A tibble: 360 × 2
## Traffic Asthma
## <dbl> <dbl>
## 1 966. 15.6
## 2 717. 20.5
## 3 776. 30.9
## 4 722. 49.6
## 5 862. 86.6
## 6 1164. 102.
## 7 912. 104.
## 8 676. 117.
## 9 590. 119.
## 10 1455. 92.5
## # ℹ 350 more rows
dim(ces_Alameda) # alternatively can use dim() function
## [1] 360 2
Try the same thing again with pipes %>%
.
ces_Alameda <- ces %>%
filter(CA_county == "Alameda") %>%
select(Traffic, Asthma)
ces_Alameda
## # A tibble: 360 × 2
## Traffic Asthma
## <dbl> <dbl>
## 1 966. 15.6
## 2 717. 20.5
## 3 776. 30.9
## 4 722. 49.6
## 5 862. 86.6
## 6 1164. 102.
## 7 912. 104.
## 8 676. 117.
## 9 590. 119.
## 10 1455. 92.5
## # ℹ 350 more rows
dim(ces_Alameda)
## [1] 360 2
What happens if you do the steps in a different order? Why does this not work?
ces_Alameda <- ces %>%
select(Traffic, Asthma) %>%
filter(CA_county == "Alameda")
## Error in `filter()`:
## ℹ In argument: `CA_county == "Alameda"`.
## Caused by error:
## ! object 'CA_county' not found
# you get an error because there is no CA_county variable to filter from
Re-order the rows of ces_Alameda
by Traffic
value in increasing order. Use arrange()
. What’s the smallest value?
ces_Alameda <- arrange(ces_Alameda, Traffic)
ces_Alameda
## # A tibble: 360 × 2
## Traffic Asthma
## <dbl> <dbl>
## 1 20.9 12.0
## 2 73.7 152.
## 3 107. 89.8
## 4 123. 14.6
## 5 127. 17.9
## 6 132. 16.7
## 7 144. 11.8
## 8 158. 14.8
## 9 169. 18.9
## 10 189. 22.9
## # ℹ 350 more rows
arrange(ces_Alameda, Traffic) %>%
head(n = 1) %>%
pull(Traffic)
## [1] 20.92008
Create a new variable in ces_Alameda
called Asthma100
, which is equal to Asthma
divided by 100, using mutate()
(don’t forget to reassign ces_Alameda
). Use pipes %>%
. Take a look at the data now!
# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)
ces_Alameda <- ces_Alameda %>% mutate(Asthma100 = Asthma/100)
ces_Alameda
## # A tibble: 360 × 3
## Traffic Asthma Asthma100
## <dbl> <dbl> <dbl>
## 1 20.9 12.0 0.120
## 2 73.7 152. 1.52
## 3 107. 89.8 0.898
## 4 123. 14.6 0.146
## 5 127. 17.9 0.179
## 6 132. 16.7 0.167
## 7 144. 11.8 0.118
## 8 158. 14.8 0.148
## 9 169. 18.9 0.189
## 10 189. 22.9 0.229
## # ℹ 350 more rows
Move the Asthma100
column to be before Traffic
in the ces_Alameda
dataset. Use relocate()
.
ces_Alameda <- ces_Alameda %>% relocate(Asthma100, .before = Traffic)
ces_Alameda
## # A tibble: 360 × 3
## Asthma100 Traffic Asthma
## <dbl> <dbl> <dbl>
## 1 0.120 20.9 12.0
## 2 1.52 73.7 152.
## 3 0.898 107. 89.8
## 4 0.146 123. 14.6
## 5 0.179 127. 17.9
## 6 0.167 132. 16.7
## 7 0.118 144. 11.8
## 8 0.148 158. 14.8
## 9 0.189 169. 18.9
## 10 0.229 189. 22.9
## # ℹ 350 more rows
Using the original ces
data, how can you find the values of ApproxLocation
for areas within zip code 90745 (in Los Angeles county) that also have a CES4.0 score in the range 90-95% - without just looking at the data manually and instead use functions we learned today? (Hint: It can be helpful to look at your data first)
CES4.0PercRange
: Percentile of the CalEnviroScreen score, grouped by 5% increments. The CalEnviroScreen score is a measure of the negative environmental effects seen in a given region. Those zip codes that have a percentile range of 90-95% are those regions that experience the highest effects of pollution in California.
ces %>%
filter(ZIP == 90745 & CES4.0PercRange == "90-95%") %>%
select(ApproxLocation)
## # A tibble: 4 × 1
## ApproxLocation
## <chr>
## 1 Carson
## 2 Carson
## 3 Carson
## 4 Carson