Load the libraries needed in this lab. Then, read in the CalEnviroScreen data from the `dasehr` package and assign it to the `ces` variable. You can also find the data at https://daseh.org/data/CalEnviroScreen_data.csv
library(dplyr)
library(dasehr)
library(broom)
ces <- calenviroscreen
Compute the correlation (with `cor()`) between the `PM2.5` and `Ozone` variables. (No need to save this in an object. Just display the result to the screen.) Use the `pull()` function to first extract these columns. To use a column name in `pull()` that starts with a number, surround it with backticks. Then, use the `cor()` function.
x <- pull(ces, PM2.5)
y <- pull(ces, Ozone)
cor(x, y)
## [1] 0.4242936
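For completeness (not asked in the question), `cor.test()` returns the same correlation estimate along with a t-based significance test and confidence interval, and its result can also be passed to `tidy()` from the already-loaded broom package:

```r
# cor.test() wraps the correlation in an htest object with a p-value and CI;
# tidy() converts that object into a one-row tibble
tidy(cor.test(x, y))
```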
Compute the correlation (with `cor()`) between the `DieselPM`, `Lead`, `Traffic`, and `Asthma` variables. (No need to save this in an object. Just display the result to the screen.) Use the `select()` function to first subset the data frame to keep only these four columns. To use a column name in `select()` that starts with a number, surround it with backticks. Does the result change when we use the `use = "complete.obs"` argument?
ces_sub <-
ces %>%
select(DieselPM, Lead, Traffic, Asthma)
cor(ces_sub)
## DieselPM Lead Traffic Asthma
## DieselPM 1 NA NA NA
## Lead NA 1 NA NA
## Traffic NA NA 1 NA
## Asthma NA NA NA 1
cor(ces_sub, use = "complete.obs")
## DieselPM Lead Traffic Asthma
## DieselPM 1.0000000 0.25051463 0.34561061 0.14494038
## Lead 0.2505146 1.00000000 0.08835065 0.41469905
## Traffic 0.3456106 0.08835065 1.00000000 -0.03962609
## Asthma 0.1449404 0.41469905 -0.03962609 1.00000000
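The NAs in the first matrix appear because `cor()` propagates missing values by default. As an aside (not required for the lab), `use = "pairwise.complete.obs"` is another option, which drops incomplete rows pair by pair rather than listwise:

```r
# "pairwise.complete.obs" computes each correlation from the rows that are
# complete for that particular pair of columns, so each matrix entry may be
# based on a different subset of rows than with "complete.obs"
cor(ces_sub, use = "pairwise.complete.obs")
```

With few missing values the two options usually give similar matrices, but pairwise deletion can yield a matrix that is not positive semi-definite.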
Perform a t-test to determine if there is evidence of a difference between lead exposure burden (`Lead`) and poverty burden (`Poverty`). Use the `pull()` function to extract these columns. Print the results using the `tidy()` function from the broom package.
x <- pull(ces, Lead)
y <- pull(ces, Poverty)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 53.389, df = 15084, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 16.98224 18.27674
## sample estimates:
## mean of x mean of y
## 48.97170 31.34221
tidy(t.test(x, y))
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 17.6 49.0 31.3 53.4 0 15084. 17.0 18.3
## # ℹ 2 more variables: method <chr>, alternative <chr>
Perform a t-test to determine if there is evidence of a difference between lead exposure burden (`Lead`) and housing insecurity (`HousingBurden`). Use the `pull()` function to extract these columns. Print the results using the `tidy()` function. How do these results compare to those in question 1.4?
x <- pull(ces, Lead)
y <- pull(ces, HousingBurden)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 110.75, df = 9948.3, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 29.94165 31.02062
## sample estimates:
## mean of x mean of y
## 48.97170 18.49057
tidy(t.test(x, y))
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 30.5 49.0 18.5 111. 0 9948. 29.9 31.0
## # ℹ 2 more variables: method <chr>, alternative <chr>
Fit a linear regression model with low birthweight percentile (“LowBirthWeightPctl”) as the outcome and CES4.0 score (“CES4.0Score”) as the predictor. Save the model fit in an object called “lmfit_bw” and display the summary table with `summary()`.
# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_bw <- glm(LowBirthWeightPctl ~ CES4.0Score, data = ces)
summary(lmfit_bw)
##
## Call:
## glm(formula = LowBirthWeightPctl ~ CES4.0Score, data = ces)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.0480 0.5733 45.44 <2e-16 ***
## CES4.0Score 0.8406 0.0174 48.30 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 643.8752)
##
## Null deviance: 6527805 on 7806 degrees of freedom
## Residual deviance: 5025446 on 7805 degrees of freedom
## (228 observations deleted due to missingness)
## AIC: 72651
##
## Number of Fisher Scoring iterations: 2
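Since broom is already loaded, the same coefficient table can also be extracted as a tibble (a side note; the question asks only for `summary()`):

```r
# tidy() returns one row per model term, with columns for the estimate,
# standard error, test statistic, and p-value
tidy(lmfit_bw)

# conf.int = TRUE adds confidence interval columns for each coefficient
tidy(lmfit_bw, conf.int = TRUE)
```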
Fit a linear regression model with low birthweight percentile (“LowBirthWeightPctl”) as the outcome and CES4.0 score (“CES4.0Score”) and poverty score (“Poverty”) as the predictors. Save the model fit in an object called “lmfit_bw2” and display the summary table.
lmfit_bw2 <- glm(LowBirthWeightPctl ~ CES4.0Score + Poverty, data = ces)
summary(lmfit_bw2)
##
## Call:
## glm(formula = LowBirthWeightPctl ~ CES4.0Score + Poverty, data = ces)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 26.71119 0.60185 44.381 < 2e-16 ***
## CES4.0Score 0.90926 0.02649 34.330 < 2e-16 ***
## Poverty -0.08357 0.02411 -3.466 0.000532 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 642.8077)
##
## Null deviance: 6522992 on 7804 degrees of freedom
## Residual deviance: 5015186 on 7802 degrees of freedom
## (230 observations deleted due to missingness)
## AIC: 72621
##
## Number of Fisher Scoring iterations: 2
Fit a linear regression model with low birthweight percentile (“LowBirthWeightPctl”) as the outcome, CES4.0 score (“CES4.0Score”) and poverty score (“Poverty”) as the predictors, and interaction between “CES4.0Score” and “Poverty”.
Use `CES4.0Score * Poverty` in the formula. Save the model fit in an object called “lmfit_bw3” and display the summary table with `summary()`.
lmfit_bw3 <- glm(LowBirthWeightPctl ~ CES4.0Score + Poverty + CES4.0Score * Poverty, data = ces)
summary(lmfit_bw3)
##
## Call:
## glm(formula = LowBirthWeightPctl ~ CES4.0Score + Poverty + CES4.0Score *
## Poverty, data = ces)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 22.039092 1.030772 21.381 < 2e-16 ***
## CES4.0Score 1.102149 0.043529 25.320 < 2e-16 ***
## Poverty 0.082456 0.038280 2.154 0.0313 *
## CES4.0Score:Poverty -0.005388 0.000966 -5.578 2.52e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 640.3365)
##
## Null deviance: 6522992 on 7804 degrees of freedom
## Residual deviance: 4995265 on 7801 degrees of freedom
## (230 observations deleted due to missingness)
## AIC: 72591
##
## Number of Fisher Scoring iterations: 2
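As an optional check (not part of the question), the model without the interaction is nested in this one, so `anova()` can test whether the interaction term improves the fit. Assuming `lmfit_bw2` from the previous question is still in the workspace:

```r
# For nested gaussian glm fits on the same rows, an F test compares
# the residual deviances of the two models
anova(lmfit_bw2, lmfit_bw3, test = "F")
```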
Fit a logistic regression model where the outcome is whether the CES
percentile is above the 50th percentile and predictors are the poverty
measure (“Poverty”) and percentage of white residents (“WhitePerc”), as
well as the interaction between “Poverty” and “WhitePerc”. Information
about the CES4.0 percentile rank is stored in the “CES4.0Percentile”
variable. You will need to first create a variable “CESPctlMoreThan50”
using the `mutate` and `case_when` commands. Display the summary table with `summary()`.
# General format
df <- df %>% mutate(binary_y = case_when(y > 50 ~ 1, y < 50 ~ 0))
glm(y ~ x, data = DATASET_NAME, family = binomial(link = "logit"))
ces <- ces %>%
mutate(CESPctlMoreThan50 =
case_when(CES4.0Percentile > 50 ~ 1, CES4.0Percentile < 50 ~ 0))
logfit_ces <- glm(CESPctlMoreThan50 ~ Poverty + WhitePerc + Poverty * WhitePerc, data = ces, family = binomial(link = "logit"))
summary(logfit_ces)
##
## Call:
## glm(formula = CESPctlMoreThan50 ~ Poverty + WhitePerc + Poverty *
## WhitePerc, family = binomial(link = "logit"), data = ces)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.854724 0.172138 -4.965 6.86e-07 ***
## Poverty 0.095863 0.005458 17.564 < 2e-16 ***
## WhitePerc -0.045093 0.003783 -11.918 < 2e-16 ***
## Poverty:WhitePerc -0.000207 0.000119 -1.739 0.0821 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10991.9 on 7928 degrees of freedom
## Residual deviance: 5809.5 on 7925 degrees of freedom
## (106 observations deleted due to missingness)
## AIC: 5817.5
##
## Number of Fisher Scoring iterations: 6
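Because this is a logistic model, the coefficients are on the log-odds scale. A common follow-up (not required by the question) is to exponentiate them into odds ratios:

```r
# exp() of logistic regression coefficients gives odds ratios; for example,
# the exponentiated Poverty coefficient is the multiplicative change in the
# odds of being above the 50th CES percentile per one-unit increase in
# Poverty (evaluated at WhitePerc = 0, because of the interaction term)
exp(coef(logfit_ces))
```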