Part 1

1.1

Load the packages needed in this lab. Then, read in the CalEnviroScreen data and assign it to the “ces” variable. You can find the data at https://daseh.org/data/CalEnviroScreen_data.csv

library(tidyverse)
library(broom)

ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")

## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

1.2

Compute the correlation (with cor) between the HousingBurden and LowBirthWeight variables.

Use select() to choose these columns and save as an object.
Use the cor function on this object.
Add the use = “complete.obs” argument.

HousingBurden: Percentage of housing-burdened low-income households. Households are considered cost-burdened when they spend more than 30% of their income on rent, mortgage payments, and other housing costs

LowBirthWeight: Percentage of births that are low birth weight.

ces_sub <-
  ces %>%
  select(HousingBurden, LowBirthWeight)

cor(ces_sub, use = "complete.obs")

##                HousingBurden LowBirthWeight
## HousingBurden      1.0000000      0.2841148
## LowBirthWeight     0.2841148      1.0000000

1.3

Change your code from 1.2 to add a few more variables:

DieselPM: Diesel PM emissions from on-road and non-road sources.
Traffic: Measure of traffic density within 150 meters of the census tract boundary.
Asthma: Age-adjusted rate of emergency department visits for asthma.

ces_sub <-
  ces %>%
  select(HousingBurden, LowBirthWeight, DieselPM, Traffic, Asthma)

cor(ces_sub, use = "complete.obs")

##                HousingBurden LowBirthWeight  DieselPM     Traffic      Asthma
## HousingBurden      1.0000000      0.2842325 0.2311997  0.11809070  0.33628186
## LowBirthWeight     0.2842325      1.0000000 0.1262594  0.05098940  0.37540822
## DieselPM           0.2311997      0.1262594 1.0000000  0.34102206  0.14109556
## Traffic            0.1180907      0.0509894 0.3410221  1.00000000 -0.04329935
## Asthma             0.3362819      0.3754082 0.1410956 -0.04329935  1.00000000

1.4

Perform a t-test to determine if there is evidence of a difference between low birth weight percentage (LowBirthWeight) in Los Angeles census tracts compared to San Diego:

Create a subset of the data for CaliforniaCounty == "Los Angeles"
Create a subset of the data for CaliforniaCounty == "San Diego"
pull the LowBirthWeight column for both subsets
Use t.test to compare the two pulled vectors
Print the results using the tidy function from the broom package

losangeles <- ces %>% filter(CaliforniaCounty == "Los Angeles")
sandiego <- ces %>% filter(CaliforniaCounty == "San Diego")

x <- pull(losangeles, LowBirthWeight)
y <- pull(sandiego, LowBirthWeight)

t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 9.1988, df = 1113.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.4857923 0.7492197
## sample estimates:
## mean of x mean of y 
##  5.314695  4.697189

tidy(t.test(x, y))

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1    0.618      5.31      4.70      9.20 1.73e-19     1114.    0.486     0.749
## # ℹ 2 more variables: method <chr>, alternative <chr>

# Another way to look at it..
ces %>% 
  group_by(CaliforniaCounty) %>% 
  summarize(mean = mean(LowBirthWeight, na.rm = TRUE)) %>% 
  filter(CaliforniaCounty %in% c("Los Angeles", "San Diego"))

## # A tibble: 2 × 2
##   CaliforniaCounty  mean
##   <chr>            <dbl>
## 1 Los Angeles       5.31
## 2 San Diego         4.70

Practice on Your Own!

P.1

Perform a t-test similar to 1.4, but this time examine differences in HousingBurden.

x <- pull(losangeles, HousingBurden)
y <- pull(sandiego, HousingBurden)

t.test(x, y)

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 12.378, df = 1104.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3.758442 5.174496
## sample estimates:
## mean of x mean of y 
##  22.49781  18.03134

tidy(t.test(x, y))

## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic  p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>    <dbl>     <dbl>
## 1     4.47      22.5      18.0      12.4 4.79e-33     1104.     3.76      5.17
## # ℹ 2 more variables: method <chr>, alternative <chr>

P.2

Filter ces so it only contains Los Angeles and San Diego. Launch Esquisse or create a ggplot which plots:

Lead exposure on the x axis
HousingBurden on the y axis
Colors points (geom_point()) by CaliforniaCounty

ces_sub <- 
  ces %>% 
  filter(CaliforniaCounty %in% c("Los Angeles", "San Diego"))

ggplot(ces_sub, aes(x = Lead, y = HousingBurden, color = CaliforniaCounty)) +
  geom_point()

## Warning: Removed 68 rows containing missing values or values outside the scale range
## (`geom_point()`).

Part 2

2.1

Fit a linear regression model with LowBirthWeight as the outcome and HousingBurden as the predictor. Save the model fit in an object called “lmfit_bw” and display the summary table with summary().

LowBirthWeight: Percentage of births that are low birth weight.

# General format
glm(y ~ x, data = DATASET_NAME)

lmfit_bw <- glm(LowBirthWeight ~ HousingBurden, data = ces)
summary(lmfit_bw)

## 
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden, data = ces)
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    3.98050    0.04277   93.06   <2e-16 ***
## HousingBurden  0.05542    0.00212   26.14   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.311994)
## 
##     Null deviance: 19567  on 7781  degrees of freedom
## Residual deviance: 17987  on 7780  degrees of freedom
##   (253 observations deleted due to missingness)
## AIC: 28611
## 
## Number of Fisher Scoring iterations: 2

2.2

Take your code from 2.1 and add Traffic as another predictor. Save the model fit in an object called “lmfit_bw2” and display the summary table.

Traffic: Measure of traffic density within 150 meters of the census tract boundary

lmfit_bw2 <- glm(LowBirthWeight ~ HousingBurden + Traffic, data = ces)
summary(lmfit_bw2)

## 
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden + Traffic, data = ces)
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.957e+00  4.541e-02  87.149   <2e-16 ***
## HousingBurden 5.503e-02  2.138e-03  25.742   <2e-16 ***
## Traffic       2.828e-05  1.754e-05   1.612    0.107    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.310696)
## 
##     Null deviance: 19506  on 7759  degrees of freedom
## Residual deviance: 17924  on 7757  degrees of freedom
##   (275 observations deleted due to missingness)
## AIC: 28526
## 
## Number of Fisher Scoring iterations: 2

Practice on Your Own!

P.3

Take your code from 2.2 and add an interaction term. Save the model fit in an object called “lmfit_bw3” and display the summary table.

To include the interaction, use HousingBurden * Traffic in the formula.
Save the model fit in an object called “lmfit_bw3” and display the summary table with summary().

lmfit_bw3 <- glm(LowBirthWeight ~ HousingBurden * Traffic, data = ces)
summary(lmfit_bw3)

## 
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden * Traffic, data = ces)
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            3.882e+00  6.474e-02  59.953   <2e-16 ***
## HousingBurden          5.940e-02  3.429e-03  17.324   <2e-16 ***
## Traffic                9.475e-05  4.432e-05   2.138   0.0325 *  
## HousingBurden:Traffic -3.695e-06  2.262e-06  -1.633   0.1025    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 2.310199)
## 
##     Null deviance: 19506  on 7759  degrees of freedom
## Residual deviance: 17918  on 7756  degrees of freedom
##   (275 observations deleted due to missingness)
## AIC: 28526
## 
## Number of Fisher Scoring iterations: 2

P.4

Let’s make LowBirthWeight into a binary variable, where over 5% low birth weight is “TRUE”.

The following code creates a column weight_cat with TRUE/FALSE values.

ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")

## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

ces_bw <- 
  ces %>% 
  mutate(weight_cat = LowBirthWeight > 5)

Fit a logistic regression model where:

The outcome is weight_cat.
HousingBurden, Poverty, and Education are predictors.
Use glm and family = binomial(link = “logit”).
Save the model fit in an object called “logfit_bw” and display the summary table with summary().

Poverty: Percent of population living below two times the federal poverty level

Education: Percent of population over 25 with less than a high school education

logfit_bw <-
  glm(
    weight_cat ~ HousingBurden + Poverty + Education,
    data = ces_bw,
    family = binomial(link = "logit")
  )
summary(logfit_bw)

## 
## Call:
## glm(formula = weight_cat ~ HousingBurden + Poverty + Education, 
##     family = binomial(link = "logit"), data = ces_bw)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -1.258807   0.061505 -20.467  < 2e-16 ***
## HousingBurden  0.020510   0.004060   5.051 4.39e-07 ***
## Poverty        0.015220   0.002486   6.122 9.27e-10 ***
## Education      0.017751   0.002689   6.601 4.07e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10726  on 7749  degrees of freedom
## Residual deviance: 10041  on 7746  degrees of freedom
##   (285 observations deleted due to missingness)
## AIC: 10049
## 
## Number of Fisher Scoring iterations: 4

Statistics Lab - Key

Part 1

1.1

1.2

1.3

1.4

Practice on Your Own!

P.1

P.2

Part 2

2.1

2.2

Practice on Your Own!

P.3

P.4