Part 1

1.1

Load the packages needed in this lab. Then, read in the CalEnviroScreen data and assign it to the “ces” variable. You can find the data at https://daseh.org/data/CalEnviroScreen_data.csv

library(tidyverse)
library(broom)
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.2

Compute the correlation (with cor) between the HousingBurden and LowBirthWeight variables.

  • Use select() to choose these columns and save as an object.
  • Use the cor function on this object.
  • Add the use = “complete.obs” argument.

HousingBurden: Percentage of housing-burdened low-income households. Households are considered cost-burdened when they spend more than 30% of their income on rent, mortgage payments, and other housing costs

LowBirthWeight: Percentage of births that are low birth weight.

ces_sub <-
  ces %>%
  select(HousingBurden, LowBirthWeight)

cor(ces_sub, use = "complete.obs")
##                HousingBurden LowBirthWeight
## HousingBurden      1.0000000      0.2841148
## LowBirthWeight     0.2841148      1.0000000

1.3

Change your code from 1.2 to add a few more variables:

  • DieselPM: Diesel PM emissions from on-road and non-road sources.
  • Traffic: Measure of traffic density within 150 meters of the census tract boundary.
  • Asthma: Age-adjusted rate of emergency department visits for asthma.
ces_sub <-
  ces %>%
  select(HousingBurden, LowBirthWeight, DieselPM, Traffic, Asthma)

cor(ces_sub, use = "complete.obs")
##                HousingBurden LowBirthWeight  DieselPM     Traffic      Asthma
## HousingBurden      1.0000000      0.2842325 0.2311997  0.11809070  0.33628186
## LowBirthWeight     0.2842325      1.0000000 0.1262594  0.05098940  0.37540822
## DieselPM           0.2311997      0.1262594 1.0000000  0.34102206  0.14109556
## Traffic            0.1180907      0.0509894 0.3410221  1.00000000 -0.04329935
## Asthma             0.3362819      0.3754082 0.1410956 -0.04329935  1.00000000

1.4

Perform a t-test to determine if there is evidence of a difference between low birth weight percentage (LowBirthWeight) in Los Angeles census tracts compared to San Diego:

  • Create a subset of the data for CaliforniaCounty == "Los Angeles"
  • Create a subset of the data for CaliforniaCounty == "San Diego"
  • pull the LowBirthWeight column for both subsets
  • Use t.test to compare the two pulled vectors
  • Print the results using the tidy function from the broom package
losangeles <- ces %>% filter(CaliforniaCounty == "Los Angeles")
sandiego <- ces %>% filter(CaliforniaCounty == "San Diego")

x <- pull(losangeles, LowBirthWeight)
y <- pull(sandiego, LowBirthWeight)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 9.1988, df = 1113.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4857923 0.7492197
## sample estimates:
## mean of x mean of y 
##  5.314695  4.697189
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1    0.618      5.31      4.70      9.20 1.73e-19     1114.    0.486     0.749
## # ℹ 2 more variables: method <chr>, alternative <chr>
# Another way to look at it..
ces %>% 
  group_by(CaliforniaCounty) %>% 
  summarize(mean = mean(LowBirthWeight, na.rm = TRUE)) %>% 
  filter(CaliforniaCounty %in% c("Los Angeles", "San Diego"))
## # A tibble: 2 × 2
##   CaliforniaCounty  mean
##   <chr>            <dbl>
## 1 Los Angeles       5.31
## 2 San Diego         4.70

Practice on Your Own!

P.1

Perform a t-test similar to 1.4, but this time examine differences in HousingBurden.

x <- pull(losangeles, HousingBurden)
y <- pull(sandiego, HousingBurden)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 12.378, df = 1104.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.758442 5.174496
## sample estimates:
## mean of x mean of y 
##  22.49781  18.03134
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1     4.47      22.5      18.0      12.4 4.79e-33     1104.     3.76      5.17
## # ℹ 2 more variables: method <chr>, alternative <chr>

P.2

Filter ces so it only contains Los Angeles and San Diego. Launch Esquisse or create a ggplot which plots:

  • Lead exposure on the x axis
  • HousingBurden on the y axis
  • Colors points (geom_point()) by CaliforniaCounty
ces_sub <- 
  ces %>% 
  filter(CaliforniaCounty %in% c("Los Angeles", "San Diego"))

ggplot(ces_sub, aes(x = Lead, y = HousingBurden, color = CaliforniaCounty)) +
  geom_point()
## Warning: Removed 68 rows containing missing values or values outside the scale range
## (`geom_point()`).

Part 2

2.1

Fit a linear regression model with LowBirthWeight as the outcome and HousingBurden as the predictor. Save the model fit in an object called “lmfit_bw” and display the summary table with summary().

HousingBurden: Percentage of housing-burdened low-income households. Households are considered cost-burdened when they spend more than 30% of their income on rent, mortgage payments, and other housing costs

LowBirthWeight: Percentage of births that are low birth weight.

# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_bw <- glm(LowBirthWeight ~ HousingBurden, data = ces)
summary(lmfit_bw)
## 
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden, data = ces)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.98050    0.04277   93.06   <2e-16 ***
## HousingBurden  0.05542    0.00212   26.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.311994)
## 
##     Null deviance: 19567  on 7781  degrees of freedom
## Residual deviance: 17987  on 7780  degrees of freedom
##   (253 observations deleted due to missingness)
## AIC: 28611
## 
## Number of Fisher Scoring iterations: 2

2.2

Take your code from 2.1 and add Traffic as another predictor. Save the model fit in an object called “lmfit_bw2” and display the summary table.

Traffic: Measure of traffic density within 150 meters of the census tract boundary

lmfit_bw2 <- glm(LowBirthWeight ~ HousingBurden + Traffic, data = ces)
summary(lmfit_bw2)
## 
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden + Traffic, data = ces)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.957e+00  4.541e-02  87.149   <2e-16 ***
## HousingBurden 5.503e-02  2.138e-03  25.742   <2e-16 ***
## Traffic       2.828e-05  1.754e-05   1.612    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.310696)
## 
##     Null deviance: 19506  on 7759  degrees of freedom
## Residual deviance: 17924  on 7757  degrees of freedom
##   (275 observations deleted due to missingness)
## AIC: 28526
## 
## Number of Fisher Scoring iterations: 2

Practice on Your Own!

P.3

Take your code from 2.2 and add an interaction term. Save the model fit in an object called “lmfit_bw3” and display the summary table.

  • To include the interaction, use HousingBurden * Traffic in the formula.
  • Save the model fit in an object called “lmfit_bw3” and display the summary table with summary().
lmfit_bw3 <- glm(LowBirthWeight ~ HousingBurden * Traffic, data = ces)
summary(lmfit_bw3)
## 
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden * Traffic, data = ces)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.882e+00  6.474e-02  59.953   <2e-16 ***
## HousingBurden          5.940e-02  3.429e-03  17.324   <2e-16 ***
## Traffic                9.475e-05  4.432e-05   2.138   0.0325 *  
## HousingBurden:Traffic -3.695e-06  2.262e-06  -1.633   0.1025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.310199)
## 
##     Null deviance: 19506  on 7759  degrees of freedom
## Residual deviance: 17918  on 7756  degrees of freedom
##   (275 observations deleted due to missingness)
## AIC: 28526
## 
## Number of Fisher Scoring iterations: 2

P.4

Let’s make LowBirthWeight into a binary variable, where over 5% low birth weight is “TRUE”.

The following code creates a column weight_cat with TRUE/FALSE values.

ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ces_bw <- 
  ces %>% 
  mutate(weight_cat = LowBirthWeight > 5)

Fit a logistic regression model where:

  • The outcome is weight_cat.
  • HousingBurden, Poverty, and Education are predictors.
  • Use glm and family = binomial(link = “logit”).
  • Save the model fit in an object called “logfit_bw” and display the summary table with summary().

Poverty: Percent of population living below two times the federal poverty level

Education: Percent of population over 25 with less than a high school education

logfit_bw <-
  glm(
    weight_cat ~ HousingBurden + Poverty + Education,
    data = ces_bw,
    family = binomial(link = "logit")
  )
summary(logfit_bw)
## 
## Call:
## glm(formula = weight_cat ~ HousingBurden + Poverty + Education, 
##     family = binomial(link = "logit"), data = ces_bw)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.258807   0.061505 -20.467  < 2e-16 ***
## HousingBurden  0.020510   0.004060   5.051 4.39e-07 ***
## Poverty        0.015220   0.002486   6.122 9.27e-10 ***
## Education      0.017751   0.002689   6.601 4.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10726  on 7749  degrees of freedom
## Residual deviance: 10041  on 7746  degrees of freedom
##   (285 observations deleted due to missingness)
## AIC: 10049
## 
## Number of Fisher Scoring iterations: 4