In this lab you can use the interactive console to explore but please record your commands here. Remember anything you type here can be “sent” to the console with Cmd-Enter (OS-X) or Ctrl-Enter (Windows/Linux) (But only inside the code chunks designated with the {r} areas).

Part 1

First let’s load our packages. We’re going to be using the dplyr package, which you can load as part of the tidyverse package.

# don't forget to load the packages that you will need!
library(tidyverse)

We’ll again work with the CalEnviroScreen dataset, which contains information about environmental factors associated with human health in California.

First, load the data from the website, either manually or by using the Data Import menu (find it by clicking on File).

ces <- read_csv(file = "https://daseh.org/data/CalEnviroScreen_data.csv")

## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Check that it worked by seeing if you have the ces data.

ces

## # A tibble: 8,035 × 67
##    CensusTract CaliforniaCounty   ZIP Longitude Latitude ApproxLocation
##          <dbl> <chr>            <dbl>     <dbl>    <dbl> <chr>         
##  1  6001400100 Alameda          94704     -122.     37.9 Oakland       
##  2  6001400200 Alameda          94618     -122.     37.8 Oakland       
##  3  6001400300 Alameda          94618     -122.     37.8 Oakland       
##  4  6001400400 Alameda          94609     -122.     37.8 Oakland       
##  5  6001400500 Alameda          94609     -122.     37.8 Oakland       
##  6  6001400600 Alameda          94609     -122.     37.8 Oakland       
##  7  6001400700 Alameda          94608     -122.     37.8 Oakland       
##  8  6001400800 Alameda          94608     -122.     37.8 Oakland       
##  9  6001400900 Alameda          94608     -122.     37.8 Oakland       
## 10  6001401000 Alameda          94608     -122.     37.8 Oakland       
## # ℹ 8,025 more rows
## # ℹ 61 more variables: CES4.0Score <dbl>, CES4.0Percentile <dbl>,
## #   CES4.0PercRange <chr>, Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>,
## #   PM2.5.Pctl <dbl>, DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## #   DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## #   PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## #   Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>, …

1.1

What class is ces?

class(ces)

## [1] "spec_tbl_df" "tbl_df"      "tbl"         "data.frame"

1.2

How many observations (rows) and variables (columns) are in the dataset - try the dim() function?

dim(ces)

## [1] 8035   67

nrow(ces)

## [1] 8035

ncol(ces)

## [1] 67

1.3

Next, rename the column CaliforniaCounty to CA_county (hint - use rename() and watch out for the order of the new and old names!).

ces<- rename(ces, CA_county = CaliforniaCounty)
head(ces)

## # A tibble: 6 × 67
##   CensusTract CA_county   ZIP Longitude Latitude ApproxLocation CES4.0Score
##         <dbl> <chr>     <dbl>     <dbl>    <dbl> <chr>                <dbl>
## 1  6001400100 Alameda   94704     -122.     37.9 Oakland               4.85
## 2  6001400200 Alameda   94618     -122.     37.8 Oakland               4.88
## 3  6001400300 Alameda   94618     -122.     37.8 Oakland              11.2 
## 4  6001400400 Alameda   94609     -122.     37.8 Oakland              12.4 
## 5  6001400500 Alameda   94609     -122.     37.8 Oakland              16.7 
## 6  6001400600 Alameda   94609     -122.     37.8 Oakland              20.0 
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## #   Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## #   DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## #   DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## #   PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## #   Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>,
## #   CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>, …

1.4

Convert the column names of ces to be all upper case. Use rename_with(), and the toupper command. Save this as a new dataset called ces_upper.

ces_upper <- rename_with(ces, toupper)
head(ces_upper)

## # A tibble: 6 × 67
##   CENSUSTRACT CA_COUNTY   ZIP LONGITUDE LATITUDE APPROXLOCATION CES4.0SCORE
##         <dbl> <chr>     <dbl>     <dbl>    <dbl> <chr>                <dbl>
## 1  6001400100 Alameda   94704     -122.     37.9 Oakland               4.85
## 2  6001400200 Alameda   94618     -122.     37.8 Oakland               4.88
## 3  6001400300 Alameda   94618     -122.     37.8 Oakland              11.2 
## 4  6001400400 Alameda   94609     -122.     37.8 Oakland              12.4 
## 5  6001400500 Alameda   94609     -122.     37.8 Oakland              16.7 
## 6  6001400600 Alameda   94609     -122.     37.8 Oakland              20.0 
## # ℹ 60 more variables: CES4.0PERCENTILE <dbl>, CES4.0PERCRANGE <chr>,
## #   OZONE <dbl>, OZONEPCTL <dbl>, PM2.5 <dbl>, PM2.5.PCTL <dbl>,
## #   DIESELPM <dbl>, DIESELPMPCTL <dbl>, DRINKINGWATER <dbl>,
## #   DRINKINGWATERPCTL <dbl>, LEAD <dbl>, LEADPCTL <dbl>, PESTICIDES <dbl>,
## #   PESTICIDESPCTL <dbl>, TOXRELEASE <dbl>, TOXRELEASEPCTL <dbl>,
## #   TRAFFIC <dbl>, TRAFFICPCTL <dbl>, CLEANUPSITES <dbl>,
## #   CLEANUPSITESPCTL <dbl>, GROUNDWATERTHREATS <dbl>, …

dim(ces_upper)

## [1] 8035   67

Practice on Your Own!

P.1

How can you show the first 3 rows and the last 3 rows of ces (in two lines of code)?

head(ces, 3)

## # A tibble: 3 × 67
##   CensusTract CA_county   ZIP Longitude Latitude ApproxLocation CES4.0Score
##         <dbl> <chr>     <dbl>     <dbl>    <dbl> <chr>                <dbl>
## 1  6001400100 Alameda   94704     -122.     37.9 Oakland               4.85
## 2  6001400200 Alameda   94618     -122.     37.8 Oakland               4.88
## 3  6001400300 Alameda   94618     -122.     37.8 Oakland              11.2 
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## #   Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## #   DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## #   DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## #   PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## #   Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>,
## #   CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>, …

tail(ces, 3)

## # A tibble: 3 × 67
##   CensusTract CA_county   ZIP Longitude Latitude ApproxLocation      CES4.0Score
##         <dbl> <chr>     <dbl>     <dbl>    <dbl> <chr>                     <dbl>
## 1  6115040902 Yuba      95901     -121.     39.1 Unincorporated Yub…        18.7
## 2  6115041000 Yuba      95901     -121.     39.3 Unincorporated Yub…        22.0
## 3  6115041100 Yuba      95925     -121.     39.5 Unincorporated Yub…        11.7
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## #   Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## #   DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## #   DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## #   PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## #   Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>,
## #   CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>, …

Part 2

2.1

Create a subset of the ces that only contains the columns: CensusTract, Traffic, and Asthma and assign this object to ces_sub - what are the dimensions of this dataset?

CensusTract: this is a small, relatively permanent area within a county used to present data from the census and other statistical programs

Traffic: A measure of traffic density in vehicle-kilometers per hour per road length, within 150 meters of the census tract boundary. A higher Traffic value indicates the presence of more traffic

Asthma: Age-adjusted rate of emergency department visits for asthma

ces_sub <- select(ces, CensusTract, Traffic, Asthma)
dim(ces_sub)

## [1] 8035    3

2.2

Start with ces again instead of the dataset you just made. Subset the data to only include the CensusTract column and the columns that end with “Pctl”. Hint: use select() and ends_with(). Assign this subset of the data to be ces2. Again take a look at the data and check the dimensions.

“Pctl” stands for “percentile”.

ces2 <- select(ces, CensusTract, ends_with("Pctl"))
ces2

## # A tibble: 8,035 × 24
##    CensusTract OzonePctl PM2.5.Pctl DieselPMPctl DrinkingWaterPctl LeadPctl
##          <dbl>     <dbl>      <dbl>        <dbl>             <dbl>    <dbl>
##  1  6001400100      3.12       36.3         34.8              4.21     7.74
##  2  6001400200      3.12       42.0         92.7              4.21    68.2 
##  3  6001400300      3.12       43.9         89.8              4.21    64.2 
##  4  6001400400      3.12       42.8         79.1              4.21    67.1 
##  5  6001400500      3.12       42.8         67.6              4.21    68.0 
##  6  6001400600      3.12       42.8         83.8              4.21    69.7 
##  7  6001400700      3.12       43.3         81.3              4.21    76.9 
##  8  6001400800      3.12       44.0         68.7              4.21    73.2 
##  9  6001400900      3.12       44.0         81.1              4.21    86.7 
## 10  6001401000      3.12       45.8         99.4              4.21    88.5 
## # ℹ 8,025 more rows
## # ℹ 18 more variables: PesticidesPctl <dbl>, ToxReleasePctl <dbl>,
## #   TrafficPctl <dbl>, CleanupSitesPctl <dbl>, GroundwaterThreatsPctl <dbl>,
## #   HazWastePctl <dbl>, ImpWaterBodiesPctl <dbl>, SolidWastePctl <dbl>,
## #   PollutionBurdenPctl <dbl>, AsthmaPctl <dbl>, LowBirthWeightPctl <dbl>,
## #   CardiovascularDiseasePctl <dbl>, PopCharPctl <dbl>, EducationPctl <dbl>,
## #   LinguisticIsolPctl <dbl>, PovertyPctl <dbl>, UnemploymentPctl <dbl>, …

dim(ces2)

## [1] 8035   24

2.3

Pull the variable Asthma from ces_sub. How does this differ from selecting it? Use head() to take a look at both options.

head(pull(ces_sub, Asthma))

## [1]  15.65  20.47  30.88  49.61  86.57 101.53

head(select(ces_sub, Asthma))

## # A tibble: 6 × 1
##   Asthma
##    <dbl>
## 1   15.6
## 2   20.5
## 3   30.9
## 4   49.6
## 5   86.6
## 6  102.

2.4

Subset the rows of ces_sub that have more than 100 for Asthma - how many rows are there? Use filter().

nrow(filter(ces_sub, Asthma > 100))

## [1] 592

2.5

Subset the rows of ces_sub that have an Asthma value more than 100 and a Traffic value less than 500 and — how many are there?

filter(ces_sub, Asthma > 100 & Traffic < 500) # all of these options work

## # A tibble: 130 × 3
##    CensusTract Traffic Asthma
##          <dbl>   <dbl>  <dbl>
##  1  6001401800    73.7   152.
##  2  6001402400   246.    152.
##  3  6001402900   473.    123.
##  4  6001406201   324.    123.
##  5  6001406300   402.    129.
##  6  6001407101   349.    129.
##  7  6001407102   325.    116.
##  8  6001407400   366.    129.
##  9  6001407600   311.    115.
## 10  6001408200   437.    187.
## # ℹ 120 more rows

nrow(filter(ces_sub, Asthma > 100 & Traffic < 500))

## [1] 130

nrow(filter(ces_sub, Asthma > 100, Traffic < 500))

## [1] 130

2.6

Subset the rows of ces_sub that have an Asthma value more than 100 and a Traffic value less than or equal to (<=) 500 — how many are there?

filter(ces_sub, Traffic <= 500 & Asthma > 100) # all of these options work

## # A tibble: 130 × 3
##    CensusTract Traffic Asthma
##          <dbl>   <dbl>  <dbl>
##  1  6001401800    73.7   152.
##  2  6001402400   246.    152.
##  3  6001402900   473.    123.
##  4  6001406201   324.    123.
##  5  6001406300   402.    129.
##  6  6001407101   349.    129.
##  7  6001407102   325.    116.
##  8  6001407400   366.    129.
##  9  6001407600   311.    115.
## 10  6001408200   437.    187.
## # ℹ 120 more rows

nrow(filter(ces_sub, Asthma > 100 & Traffic <= 500))

## [1] 130

nrow(filter(ces_sub, Asthma > 100, Traffic <= 500))

## [1] 130

2.7

We used two different criteria for subsetting in 2.5 and 2.6. Why are the number of rows the same for 2.5 and 2.6?

# There are no rows with a Traffic value exactly equal to 500

Practice on Your Own!

P.2

Subset the rows of ces for rows that have CA_county of “Los Angeles”, or a Traffic value less than 300.

How many rows have both?

filter(ces, CA_county == "Los Angeles" | Traffic < 300)

## # A tibble: 2,943 × 67
##    CensusTract CA_county   ZIP Longitude Latitude ApproxLocation CES4.0Score
##          <dbl> <chr>     <dbl>     <dbl>    <dbl> <chr>                <dbl>
##  1  6001401800 Alameda   94607     -122.     37.8 Oakland              45.1 
##  2  6001402400 Alameda   94607     -122.     37.8 Oakland              37.3 
##  3  6001404800 Alameda   94602     -122.     37.8 Oakland              14.3 
##  4  6001405500 Alameda   94606     -122.     37.8 Oakland              28.4 
##  5  6001405800 Alameda   94606     -122.     37.8 Oakland              28.8 
##  6  6001407700 Alameda   94619     -122.     37.8 Oakland              15.7 
##  7  6001409600 Alameda   94621     -122.     37.8 Oakland              32.3 
##  8  6001409700 Alameda   94605     -122.     37.8 Oakland              21.8 
##  9  6001420100 Alameda   94706     -122.     37.9 Albany                6.52
## 10  6001421100 Alameda   94708     -122.     37.9 Berkeley              3.28
## # ℹ 2,933 more rows
## # ℹ 60 more variables: CES4.0Percentile <dbl>, CES4.0PercRange <chr>,
## #   Ozone <dbl>, OzonePctl <dbl>, PM2.5 <dbl>, PM2.5.Pctl <dbl>,
## #   DieselPM <dbl>, DieselPMPctl <dbl>, DrinkingWater <dbl>,
## #   DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>, Pesticides <dbl>,
## #   PesticidesPctl <dbl>, ToxRelease <dbl>, ToxReleasePctl <dbl>,
## #   Traffic <dbl>, TrafficPctl <dbl>, CleanupSites <dbl>, …

nrow(filter(ces, CA_county == "Los Angeles" & Traffic < 300))

## [1] 61

P.3

Select the variables that contain the letter “a” from ces. Remember, the variables are the column names.

select(ces, contains("a"))

## # A tibble: 8,035 × 37
##    CensusTract CA_county Latitude ApproxLocation CES4.0PercRange   DrinkingWater
##          <dbl> <chr>        <dbl> <chr>          <chr>                     <dbl>
##  1  6001400100 Alameda       37.9 Oakland        1-5% (lowest sco…          110.
##  2  6001400200 Alameda       37.8 Oakland        1-5% (lowest sco…          110.
##  3  6001400300 Alameda       37.8 Oakland        15-20%                     110.
##  4  6001400400 Alameda       37.8 Oakland        15-20%                     110.
##  5  6001400500 Alameda       37.8 Oakland        25-30%                     110.
##  6  6001400600 Alameda       37.8 Oakland        35-40%                     110.
##  7  6001400700 Alameda       37.8 Oakland        70-75%                     110.
##  8  6001400800 Alameda       37.8 Oakland        70-75%                     110.
##  9  6001400900 Alameda       37.8 Oakland        75-80%                     110.
## 10  6001401000 Alameda       37.8 Oakland        80-85%                     110.
## # ℹ 8,025 more rows
## # ℹ 31 more variables: DrinkingWaterPctl <dbl>, Lead <dbl>, LeadPctl <dbl>,
## #   ToxRelease <dbl>, ToxReleasePctl <dbl>, Traffic <dbl>, TrafficPctl <dbl>,
## #   CleanupSites <dbl>, CleanupSitesPctl <dbl>, GroundwaterThreats <dbl>,
## #   GroundwaterThreatsPctl <dbl>, HazWaste <dbl>, HazWastePctl <dbl>,
## #   ImpWaterBodies <dbl>, ImpWaterBodiesPctl <dbl>, SolidWaste <dbl>,
## #   SolidWastePctl <dbl>, Asthma <dbl>, AsthmaPctl <dbl>, …

Part 3

3.1

Create a subset called ces_Alameda from ces that only contains the rows for Alameda and only the columns: Traffic and Asthma. CA_county should not be included in ces_Alameda.

What are the dimensions of this dataset? Don’t use pipes (%>%) and instead do this in two steps creating the ces_Alameda object with filter and updating it with select.

ces_Alameda <- filter(ces, CA_county == "Alameda")
ces_Alameda  <- select(ces_Alameda, Traffic, Asthma)
ces_Alameda # can get dimensions just from viewing the tibble

## # A tibble: 360 × 2
##    Traffic Asthma
##      <dbl>  <dbl>
##  1    966.   15.6
##  2    717.   20.5
##  3    776.   30.9
##  4    722.   49.6
##  5    862.   86.6
##  6   1164.  102. 
##  7    912.  104. 
##  8    676.  117. 
##  9    590.  119. 
## 10   1455.   92.5
## # ℹ 350 more rows

dim(ces_Alameda) # alternatively can use dim() function

## [1] 360   2

3.2

Try the same thing again with pipes %>%.

ces_Alameda <- ces %>%
  filter(CA_county == "Alameda") %>%
  select(Traffic, Asthma)
ces_Alameda

## # A tibble: 360 × 2
##    Traffic Asthma
##      <dbl>  <dbl>
##  1    966.   15.6
##  2    717.   20.5
##  3    776.   30.9
##  4    722.   49.6
##  5    862.   86.6
##  6   1164.  102. 
##  7    912.  104. 
##  8    676.  117. 
##  9    590.  119. 
## 10   1455.   92.5
## # ℹ 350 more rows

dim(ces_Alameda)

## [1] 360   2

What happens if you do the steps in a different order? Why does this not work?

ces_Alameda <- ces %>%
                 select(Traffic, Asthma) %>%
                 filter(CA_county == "Alameda")

## Error in `filter()`:
## ℹ In argument: `CA_county == "Alameda"`.
## Caused by error:
## ! object 'CA_county' not found

# you get an error because there is no CA_county variable to filter from

3.3

Re-order the rows of ces_Alameda by Traffic value in increasing order. Use arrange(). What’s the smallest value?

ces_Alameda <- arrange(ces_Alameda, Traffic)
ces_Alameda

## # A tibble: 360 × 2
##    Traffic Asthma
##      <dbl>  <dbl>
##  1    20.9   12.0
##  2    73.7  152. 
##  3   107.    89.8
##  4   123.    14.6
##  5   127.    17.9
##  6   132.    16.7
##  7   144.    11.8
##  8   158.    14.8
##  9   169.    18.9
## 10   189.    22.9
## # ℹ 350 more rows

arrange(ces_Alameda, Traffic) %>%
  head(n = 1) %>%
  pull(Traffic)

## [1] 20.92008

3.4

Create a new variable in ces_Alameda called Asthma100, which is equal to Asthma divided by 100, using mutate()(don’t forget to reassign ces_Alameda). Use pipes %>%. Take a look at the data now!

# General format
NEWDATA <- OLD_DATA %>% mutate(NEW_COLUMN = OLD_COLUMN)

ces_Alameda <- ces_Alameda %>% mutate(Asthma100 = Asthma/100)
ces_Alameda

## # A tibble: 360 × 3
##    Traffic Asthma Asthma100
##      <dbl>  <dbl>     <dbl>
##  1    20.9   12.0     0.120
##  2    73.7  152.      1.52 
##  3   107.    89.8     0.898
##  4   123.    14.6     0.146
##  5   127.    17.9     0.179
##  6   132.    16.7     0.167
##  7   144.    11.8     0.118
##  8   158.    14.8     0.148
##  9   169.    18.9     0.189
## 10   189.    22.9     0.229
## # ℹ 350 more rows

Practice on Your Own!

P.4

Move the Asthma100 column to be before Traffic in the ces_Alameda dataset. Use relocate().

ces_Alameda <- ces_Alameda %>% relocate(Asthma100, .before = Traffic)
ces_Alameda

## # A tibble: 360 × 3
##    Asthma100 Traffic Asthma
##        <dbl>   <dbl>  <dbl>
##  1     0.120    20.9   12.0
##  2     1.52     73.7  152. 
##  3     0.898   107.    89.8
##  4     0.146   123.    14.6
##  5     0.179   127.    17.9
##  6     0.167   132.    16.7
##  7     0.118   144.    11.8
##  8     0.148   158.    14.8
##  9     0.189   169.    18.9
## 10     0.229   189.    22.9
## # ℹ 350 more rows

P.5

Using the original ces data, how can you find the values of ApproxLocation for areas within zip code 90745 (in Los Angeles county) that also have a CES4.0 score in the range 90-95% - without just looking at the data manually and instead use functions we learned today? (Hint: It can be helpful to look at your data first)

CES4.0PercRange: Percentile of the CalEnviroScreen score, grouped by 5% increments. The CalEnviroScreen score is a measure of the negative environmental effects seen in a given region. Those zip codes that have a percentile range of 90-95% are those regions that experience the highest effects of pollution in California.

ces %>%
  filter(ZIP == 90745 & CES4.0PercRange == "90-95%") %>%
  select(ApproxLocation)

## # A tibble: 4 × 1
##   ApproxLocation
##   <chr>         
## 1 Carson        
## 2 Carson        
## 3 Carson        
## 4 Carson

Subsetting Data in R - Key

Part 1

1.1

1.2

1.3

1.4

Practice on Your Own!

P.1

Part 2

2.1

2.2

2.3

2.4

2.5

2.6

2.7

Practice on Your Own!

P.2

P.3

Part 3

3.1

3.2

3.3

3.4

Practice on Your Own!

P.4

P.5