Part 1

1.1

Load the libraries needed in this lab. Then, read in the CalEnviroScreen data from dasehr package. Assign it to the “ces” variable. You can also find the data at https://daseh.org/data/CalEnviroScreen_data.csv

library(dplyr)
library(dasehr)
library(broom)
ces <- calenviroscreen

1.2

Compute the correlation (with cor) between the PM2.5 and Ozone variables. (No need to save this in an object. Just display the result to the screen.) Use the pull() function to first extract these columns. To use a column name in pull() that starts with a number, surround it with backticks. Then, use the cor function.

x <- pull(ces, PM2.5)
y <- pull(ces, Ozone)

cor(x, y)
## [1] 0.4242936

1.3

Compute the correlation (with cor) between the DieselPM, Lead, Traffic, and Asthma variables. (No need to save this in an object. Just display the result to the screen.) Use select() function to first subset the data frame to keep the four columns only. To use a column name in select() that starts with a number, surround it with backticks. Does this change when we use the use = "complete.obs" argument?

ces_sub <-
  ces %>%
  select(DieselPM, Lead, Traffic, Asthma)

cor(ces_sub)
##          DieselPM Lead Traffic Asthma
## DieselPM        1   NA      NA     NA
## Lead           NA    1      NA     NA
## Traffic        NA   NA       1     NA
## Asthma         NA   NA      NA      1
cor(ces_sub, use = "complete.obs")
##           DieselPM       Lead     Traffic      Asthma
## DieselPM 1.0000000 0.25051463  0.34561061  0.14494038
## Lead     0.2505146 1.00000000  0.08835065  0.41469905
## Traffic  0.3456106 0.08835065  1.00000000 -0.03962609
## Asthma   0.1449404 0.41469905 -0.03962609  1.00000000

1.4

Perform a t-test to determine if there is evidence of a difference between lead exposure burden (Lead) and poverty burden (Poverty). Use the pull() function to extract these columns. Print the results using the tidy function from the broom package.

x <- pull(ces, Lead)
y <- pull(ces, Poverty)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 53.389, df = 15084, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  16.98224 18.27674
## sample estimates:
## mean of x mean of y 
##  48.97170  31.34221
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     17.6      49.0      31.3      53.4       0    15084.     17.0      18.3
## # ℹ 2 more variables: method <chr>, alternative <chr>

Practice on Your Own!

P.1

Perform a t-test to determine if there is evidence of a difference between lead exposure burden (Lead) and housing insecurity (HousingBurden). Use the pull() function to extract these columns. Print the results using the tidy function. How do these results compare to those in question 1.4?

x <- pull(ces, Lead)
y <- pull(ces, HousingBurden)

t.test(x, y)
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 110.75, df = 9948.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  29.94165 31.02062
## sample estimates:
## mean of x mean of y 
##  48.97170  18.49057
tidy(t.test(x, y))
## # A tibble: 1 × 10
##   estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
##      <dbl>     <dbl>     <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl>
## 1     30.5      49.0      18.5      111.       0     9948.     29.9      31.0
## # ℹ 2 more variables: method <chr>, alternative <chr>

Part 2

2.1

Fit a linear regression model with low birthweight percentile (“LowBirthWeightPctl”) as the outcome and CES4.0 score (“CES4.0Score”) as the predictor. Save the model fit in an object called “lmfit_bw” and display the summary table with summary().

# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_bw <- glm(LowBirthWeightPctl ~ CES4.0Score, data = ces)
summary(lmfit_bw)
## 
## Call:
## glm(formula = LowBirthWeightPctl ~ CES4.0Score, data = ces)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  26.0480     0.5733   45.44   <2e-16 ***
## CES4.0Score   0.8406     0.0174   48.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 643.8752)
## 
##     Null deviance: 6527805  on 7806  degrees of freedom
## Residual deviance: 5025446  on 7805  degrees of freedom
##   (228 observations deleted due to missingness)
## AIC: 72651
## 
## Number of Fisher Scoring iterations: 2

2.2

Fit a linear regression model with low birthweight percentile (“LowBirthWeightPctl”) as the outcome and CES4.0 score (“CES4.0Score”) and poverty score (“Poverty”) as the predictors. Save the model fit in an object called “lmfit_bw2” and display the summary table.

lmfit_bw2 <- glm(LowBirthWeightPctl ~ CES4.0Score + Poverty, data = ces)
summary(lmfit_bw2)
## 
## Call:
## glm(formula = LowBirthWeightPctl ~ CES4.0Score + Poverty, data = ces)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.71119    0.60185  44.381  < 2e-16 ***
## CES4.0Score  0.90926    0.02649  34.330  < 2e-16 ***
## Poverty     -0.08357    0.02411  -3.466 0.000532 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 642.8077)
## 
##     Null deviance: 6522992  on 7804  degrees of freedom
## Residual deviance: 5015186  on 7802  degrees of freedom
##   (230 observations deleted due to missingness)
## AIC: 72621
## 
## Number of Fisher Scoring iterations: 2

Practice on Your Own!

P.2

Fit a linear regression model with low birthweight percentile (“LowBirthWeightPctl”) as the outcome, CES4.0 score (“CES4.0Score”) and poverty score (“Poverty”) as the predictors, and interaction between “CES4.0Score” and “Poverty”.

  • To include the interaction, use CES4.0Score * Poverty in the formula.
  • Save the model fit in an object called “lmfit_bw3” and display the summary table with summary().
lmfit_bw3 <- glm(LowBirthWeightPctl ~ CES4.0Score + Poverty + CES4.0Score * Poverty, data = ces)
summary(lmfit_bw3)
## 
## Call:
## glm(formula = LowBirthWeightPctl ~ CES4.0Score + Poverty + CES4.0Score * 
##     Poverty, data = ces)
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         22.039092   1.030772  21.381  < 2e-16 ***
## CES4.0Score          1.102149   0.043529  25.320  < 2e-16 ***
## Poverty              0.082456   0.038280   2.154   0.0313 *  
## CES4.0Score:Poverty -0.005388   0.000966  -5.578 2.52e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 640.3365)
## 
##     Null deviance: 6522992  on 7804  degrees of freedom
## Residual deviance: 4995265  on 7801  degrees of freedom
##   (230 observations deleted due to missingness)
## AIC: 72591
## 
## Number of Fisher Scoring iterations: 2

P.3

Fit a logistic regression model where the outcome is whether the CES percentile is above the 50th percentile and predictors are the poverty measure (“Poverty”) and percentage of white residents (“WhitePerc”), as well as the interaction between “Poverty” and “WhitePerc”. Information about the CES4.0 percentile rank is stored in the “CES4.0Percentile” variable. You will need to first create a variable “CESPctlMoreThan50” using the mutate and case_when commands.

  • Save the model fit in an object called “logfit_ces” and display the summary table with summary().
# General format
df <- df %>% mutate(binary_y = case_when(y > 50  ~ 1, y < 50 ~ 0)) 
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
ces <- ces %>% 
  mutate(CESPctlMoreThan50 = 
           case_when(CES4.0Percentile > 50  ~ 1, CES4.0Percentile < 50 ~ 0))

logfit_ces <- glm(CESPctlMoreThan50 ~ Poverty + WhitePerc + Poverty * WhitePerc, data = ces, family = binomial(link = "logit"))
summary(logfit_ces)
## 
## Call:
## glm(formula = CESPctlMoreThan50 ~ Poverty + WhitePerc + Poverty * 
##     WhitePerc, family = binomial(link = "logit"), data = ces)
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -0.854724   0.172138  -4.965 6.86e-07 ***
## Poverty            0.095863   0.005458  17.564  < 2e-16 ***
## WhitePerc         -0.045093   0.003783 -11.918  < 2e-16 ***
## Poverty:WhitePerc -0.000207   0.000119  -1.739   0.0821 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 10991.9  on 7928  degrees of freedom
## Residual deviance:  5809.5  on 7925  degrees of freedom
##   (106 observations deleted due to missingness)
## AIC: 5817.5
## 
## Number of Fisher Scoring iterations: 6