Load the packages needed in this lab. Then, read in the CalEnviroScreen data and assign it to the “ces” variable. You can find the data at https://daseh.org/data/CalEnviroScreen_data.csv
library(tidyverse)
library(broom)
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Compute the correlation (with cor
) between the HousingBurden
and LowBirthWeight
variables.
select()
to choose these columns and save as an object.cor
function on this object.HousingBurden
: Percentage of housing-burdened low-income households. Households are considered cost-burdened when they spend more than 30% of their income on rent, mortgage payments, and other housing costs
LowBirthWeight
: Percentage of births that are low birth weight.
ces_sub <-
ces %>%
select(HousingBurden, LowBirthWeight)
cor(ces_sub, use = "complete.obs")
## HousingBurden LowBirthWeight
## HousingBurden 1.0000000 0.2841148
## LowBirthWeight 0.2841148 1.0000000
Change your code from 1.2 to add a few more variables:
DieselPM
: Diesel PM emissions from on-road and non-road sources.Traffic
: Measure of traffic density within 150 meters of the census tract boundary.Asthma
: Age-adjusted rate of emergency department visits for asthma.ces_sub <-
ces %>%
select(HousingBurden, LowBirthWeight, DieselPM, Traffic, Asthma)
cor(ces_sub, use = "complete.obs")
## HousingBurden LowBirthWeight DieselPM Traffic Asthma
## HousingBurden 1.0000000 0.2842325 0.2311997 0.11809070 0.33628186
## LowBirthWeight 0.2842325 1.0000000 0.1262594 0.05098940 0.37540822
## DieselPM 0.2311997 0.1262594 1.0000000 0.34102206 0.14109556
## Traffic 0.1180907 0.0509894 0.3410221 1.00000000 -0.04329935
## Asthma 0.3362819 0.3754082 0.1410956 -0.04329935 1.00000000
Perform a t-test to determine if there is evidence of a difference between low birth weight percentage (LowBirthWeight
) in Los Angeles census tracts compared to San Diego:
CaliforniaCounty == "Los Angeles"
CaliforniaCounty == "San Diego"
pull
the LowBirthWeight
column for both subsetst.test
to compare the two pulled vectorstidy
function from the broom
packagelosangeles <- ces %>% filter(CaliforniaCounty == "Los Angeles")
sandiego <- ces %>% filter(CaliforniaCounty == "San Diego")
x <- pull(losangeles, LowBirthWeight)
y <- pull(sandiego, LowBirthWeight)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 9.1988, df = 1113.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.4857923 0.7492197
## sample estimates:
## mean of x mean of y
## 5.314695 4.697189
tidy(t.test(x, y))
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.618 5.31 4.70 9.20 1.73e-19 1114. 0.486 0.749
## # ℹ 2 more variables: method <chr>, alternative <chr>
# Another way to look at it..
ces %>%
group_by(CaliforniaCounty) %>%
summarize(mean = mean(LowBirthWeight, na.rm = TRUE)) %>%
filter(CaliforniaCounty %in% c("Los Angeles", "San Diego"))
## # A tibble: 2 × 2
## CaliforniaCounty mean
## <chr> <dbl>
## 1 Los Angeles 5.31
## 2 San Diego 4.70
Perform a t-test similar to 1.4, but this time examine differences in HousingBurden
.
x <- pull(losangeles, HousingBurden)
y <- pull(sandiego, HousingBurden)
t.test(x, y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 12.378, df = 1104.5, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 3.758442 5.174496
## sample estimates:
## mean of x mean of y
## 22.49781 18.03134
tidy(t.test(x, y))
## # A tibble: 1 × 10
## estimate estimate1 estimate2 statistic p.value parameter conf.low conf.high
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4.47 22.5 18.0 12.4 4.79e-33 1104. 3.76 5.17
## # ℹ 2 more variables: method <chr>, alternative <chr>
Filter ces
so it only contains Los Angeles and San Diego. Launch Esquisse or create a ggplot which plots:
Lead
exposure on the x axisHousingBurden
on the y axisgeom_point()
) by CaliforniaCounty
ces_sub <-
ces %>%
filter(CaliforniaCounty %in% c("Los Angeles", "San Diego"))
ggplot(ces_sub, aes(x = Lead, y = HousingBurden, color = CaliforniaCounty)) +
geom_point()
## Warning: Removed 68 rows containing missing values or values outside the scale range
## (`geom_point()`).
Fit a linear regression model with LowBirthWeight
as the outcome and HousingBurden
as the predictor. Save the model fit in an object called “lmfit_bw” and display the summary table with summary()
.
HousingBurden
: Percentage of housing-burdened low-income households. Households are considered cost-burdened when they spend more than 30% of their income on rent, mortgage payments, and other housing costs
LowBirthWeight
: Percentage of births that are low birth weight.
# General format
glm(y ~ x, data = DATASET_NAME)
lmfit_bw <- glm(LowBirthWeight ~ HousingBurden, data = ces)
summary(lmfit_bw)
##
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden, data = ces)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.98050 0.04277 93.06 <2e-16 ***
## HousingBurden 0.05542 0.00212 26.14 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.311994)
##
## Null deviance: 19567 on 7781 degrees of freedom
## Residual deviance: 17987 on 7780 degrees of freedom
## (253 observations deleted due to missingness)
## AIC: 28611
##
## Number of Fisher Scoring iterations: 2
Take your code from 2.1 and add Traffic
as another predictor. Save the model fit in an object called “lmfit_bw2” and display the summary table.
Traffic
: Measure of traffic density within 150 meters of the census tract boundary
lmfit_bw2 <- glm(LowBirthWeight ~ HousingBurden + Traffic, data = ces)
summary(lmfit_bw2)
##
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden + Traffic, data = ces)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.957e+00 4.541e-02 87.149 <2e-16 ***
## HousingBurden 5.503e-02 2.138e-03 25.742 <2e-16 ***
## Traffic 2.828e-05 1.754e-05 1.612 0.107
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.310696)
##
## Null deviance: 19506 on 7759 degrees of freedom
## Residual deviance: 17924 on 7757 degrees of freedom
## (275 observations deleted due to missingness)
## AIC: 28526
##
## Number of Fisher Scoring iterations: 2
Take your code from 2.2 and add an interaction term. Save the model fit in an object called “lmfit_bw3” and display the summary table.
HousingBurden * Traffic
in the formula.summary()
.lmfit_bw3 <- glm(LowBirthWeight ~ HousingBurden * Traffic, data = ces)
summary(lmfit_bw3)
##
## Call:
## glm(formula = LowBirthWeight ~ HousingBurden * Traffic, data = ces)
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.882e+00 6.474e-02 59.953 <2e-16 ***
## HousingBurden 5.940e-02 3.429e-03 17.324 <2e-16 ***
## Traffic 9.475e-05 4.432e-05 2.138 0.0325 *
## HousingBurden:Traffic -3.695e-06 2.262e-06 -1.633 0.1025
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for gaussian family taken to be 2.310199)
##
## Null deviance: 19506 on 7759 degrees of freedom
## Residual deviance: 17918 on 7756 degrees of freedom
## (275 observations deleted due to missingness)
## AIC: 28526
##
## Number of Fisher Scoring iterations: 2
Let’s make LowBirthWeight
into a binary variable, where over 5% low birth weight is “TRUE”.
The following code creates a column weight_cat
with TRUE/FALSE values.
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
ces_bw <-
ces %>%
mutate(weight_cat = LowBirthWeight > 5)
Fit a logistic regression model where:
weight_cat
.HousingBurden
, Poverty
, and Education
are predictors.glm
and family = binomial(link = “logit”).summary()
.Poverty
: Percent of population living below two times the federal poverty level
Education
: Percent of population over 25 with less than a high school education
logfit_bw <-
glm(
weight_cat ~ HousingBurden + Poverty + Education,
data = ces_bw,
family = binomial(link = "logit")
)
summary(logfit_bw)
##
## Call:
## glm(formula = weight_cat ~ HousingBurden + Poverty + Education,
## family = binomial(link = "logit"), data = ces_bw)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.258807 0.061505 -20.467 < 2e-16 ***
## HousingBurden 0.020510 0.004060 5.051 4.39e-07 ***
## Poverty 0.015220 0.002486 6.122 9.27e-10 ***
## Education 0.017751 0.002689 6.601 4.07e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 10726 on 7749 degrees of freedom
## Residual deviance: 10041 on 7746 degrees of freedom
## (285 observations deleted due to missingness)
## AIC: 10049
##
## Number of Fisher Scoring iterations: 4