Load the tidyverse package.
library(tidyverse)
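Create a function that takes one argument (a vector), sums the vector, and squares the result. Call it sum_squared. Test your function on the vector c(2,7,21,30,90) - you should get the answer 22500. The general format is:
NEW_FUNCTION <- function(x, y) x + y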
nums <- c(2, 7, 21, 30, 90)
sum_squared <- function(x) sum(x)^2
sum_squared(x = nums)
## [1] 22500
sum_squared <- function(x) {
  out <- sum(x)^2
  return(out)
}
sum_squared(x = nums)
## [1] 22500
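Note the order of operations in sum(x)^2: the vector is summed first, and only then is the total squared. As a minimal sketch in base R (sum_of_squares is a hypothetical name introduced here purely for contrast), compare it with squaring each element before summing:

```r
nums <- c(2, 7, 21, 30, 90)

# Square of the sum: add everything up (150), then square the total
sum_squared <- function(x) sum(x)^2

# Hypothetical contrast function: square each element, then add them up
sum_of_squares <- function(x) sum(x^2)

sum_squared(nums)     # 150^2 = 22500
sum_of_squares(nums)  # 4 + 49 + 441 + 900 + 8100 = 9494
```

The two results differ, so where the ^2 sits relative to sum() matters.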
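Create a function that takes two arguments, a vector and a number, and uses %in% to check whether the number is present in the vector, returning TRUE or FALSE. Call it has_n. Test your function on the vector c(2,7,21,30,90) and number 21 - you should get the answer TRUE.
nums <- c(2, 7, 21, 30, 90)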
a_num <- 21
has_n <- function(x, n) n %in% x
has_n(x = nums, n = a_num)
## [1] TRUE
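One thing worth knowing about %in% is that it is vectorized over its left-hand side, so has_n returns one TRUE/FALSE per element of n. The sketch below uses base R only; the second test value, 5, is made up for illustration:

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n) n %in% x

# One TRUE/FALSE per element of n: 21 is present, 5 is not
found <- has_n(x = nums, n = c(21, 5))
found

# Collapse to a single answer if that is what you need
any(found)  # is at least one of the numbers present?
all(found)  # are all of the numbers present?
```

If the function should always return a single value, wrapping the test in any() or all() inside the function body is one option.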
Amend the function has_n from question 1.2 so that it takes a default value of 21 for the numeric argument.
nums <- c(2, 7, 21, 30, 90)
a_num <- 21
has_n <- function(x, n = 21) n %in% x
has_n(x = nums)
## [1] TRUE
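A default only applies when the argument is left out; it can still be overridden in the call. A short sketch (the values 5 and 90 are chosen here just to show both outcomes):

```r
nums <- c(2, 7, 21, 30, 90)
has_n <- function(x, n = 21) n %in% x

has_n(x = nums)         # uses the default, n = 21
has_n(x = nums, n = 5)  # default overridden; 5 is not in nums
has_n(nums, 90)         # arguments can also be matched by position
```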
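Create a function for the CalEnviroScreen Data. Read in the data, then write a function that takes a column name as its argument (use {{col_name}}) and creates a plot with {{col_name}} on the x-axis and Poverty on the y-axis, using geom_point(). Test the function using the Lead and HousingBurden columns, or other columns of your choice.
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")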
## Rows: 8035 Columns: 67
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): CaliforniaCounty, ApproxLocation, CES4.0PercRange
## dbl (64): CensusTract, ZIP, Longitude, Latitude, CES4.0Score, CES4.0Percenti...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
plot_ces <- function(col_name) {
  ggplot(data = ces, aes(x = {{col_name}}, y = Poverty)) +
    geom_point()
}
plot_ces(Lead)
## Warning: Removed 99 rows containing missing values or values outside the scale range
## (`geom_point()`).
plot_ces(HousingBurden)
## Warning: Removed 147 rows containing missing values or values outside the scale range
## (`geom_point()`).
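The same {{ }} pattern extends to more than one argument. The sketch below is an illustration rather than part of the lab: it assumes ggplot2 is installed, uses a small made-up data frame instead of the CalEnviroScreen data, and the helper name plot_xy is hypothetical.

```r
library(ggplot2)

# Toy data, made up for this sketch (not the CalEnviroScreen data)
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(2, 4, 6, 8),
  c = c(8, 6, 4, 2)
)

# Hypothetical helper: the data and both axes are arguments,
# and the columns are passed as bare (unquoted) names
plot_xy <- function(data, x_col, y_col) {
  ggplot(data = data, aes(x = {{ x_col }}, y = {{ y_col }})) +
    geom_point()
}

p <- plot_xy(toy, a, c)
p
```

Passing the data frame as an argument, rather than relying on a global object such as ces, makes the function reusable with other datasets.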
Read in the CalEnviroScreen data from https://daseh.org/data/CalEnviroScreen_data.csv. Assign the data the name “ces”.
ces <- read_csv("https://daseh.org/data/CalEnviroScreen_data.csv")
# If downloaded
# ces <- read_csv("CalEnviroScreen_data.csv")
We want to get some summary statistics on water contamination. Use across inside summarize, and choose the columns about water with contains("water") inside across. Use mean as the function inside of across. Remember that NA values can influence calculations.
# General format
data %>%
  summarize(across(
    {vector or tidyselect},
    {some function}
  ))
ces %>%
  summarize(across(
    contains("water"),
    mean
  ))
## # A tibble: 1 × 6
## DrinkingWater DrinkingWaterPctl GroundwaterThreats GroundwaterThreatsPctl
## <dbl> <dbl> <dbl> <dbl>
## 1 NA NA 16.7 37.8
## # ℹ 2 more variables: ImpWaterBodies <dbl>, ImpWaterBodiesPctl <dbl>
# Accounting for NA
ces %>%
  summarize(across(
    contains("water"),
    function(x) mean(x, na.rm = TRUE)
  ))
## # A tibble: 1 × 6
## DrinkingWater DrinkingWaterPctl GroundwaterThreats GroundwaterThreatsPctl
## <dbl> <dbl> <dbl> <dbl>
## 1 477. 50.4 16.7 37.8
## # ℹ 2 more variables: ImpWaterBodies <dbl>, ImpWaterBodiesPctl <dbl>
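To see why the first attempt returned NA for some columns, it can help to reproduce the behaviour on a tiny dataset. This sketch assumes dplyr is installed and R is version 4.1 or later (for the \(x) shorthand); the tibble and its column names are made up for illustration.

```r
library(dplyr)

# Toy data, made up for this sketch: one column contains an NA
toy <- tibble(
  water_a = c(1, 2, NA),
  water_b = c(10, 20, 30),
  other   = c(5, 5, 5)
)

# A single NA makes mean() return NA for that column
with_na <- toy %>%
  summarize(across(contains("water"), mean))
with_na

# Same idea as function(x) mean(x, na.rm = TRUE), written with the
# \(x) shorthand for an anonymous function
res <- toy %>%
  summarize(across(contains("water"), \(x) mean(x, na.rm = TRUE)))
res
```

Only columns that actually contain missing values are affected, which matches the output above where GroundwaterThreats had a mean even without na.rm.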
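Convert all columns that are percentiles into proportions. Use across and mutate, selecting the columns with contains("Pctl") inside across. Use an anonymous function to divide by 100 (function(x) x / 100). To check your work, it may help to add select(contains("Pctl")).
# General format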
data %>%
  mutate(across(
    {vector or tidyselect},
    {some function}
  ))
ces %>%
  mutate(across(
    contains("Pctl"),
    function(x) x / 100
  )) %>%
  select(contains("Pctl"))
## # A tibble: 8,035 × 23
## OzonePctl PM2.5.Pctl DieselPMPctl DrinkingWaterPctl LeadPctl PesticidesPctl
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.0312 0.363 0.348 0.0421 0.0774 0
## 2 0.0312 0.420 0.927 0.0421 0.682 0
## 3 0.0312 0.439 0.898 0.0421 0.642 0
## 4 0.0312 0.428 0.791 0.0421 0.671 0
## 5 0.0312 0.428 0.676 0.0421 0.680 0
## 6 0.0312 0.428 0.838 0.0421 0.697 0
## 7 0.0312 0.433 0.813 0.0421 0.769 0
## 8 0.0312 0.440 0.687 0.0421 0.732 0
## 9 0.0312 0.440 0.811 0.0421 0.867 0
## 10 0.0312 0.458 0.994 0.0421 0.885 0
## # ℹ 8,025 more rows
## # ℹ 17 more variables: ToxReleasePctl <dbl>, TrafficPctl <dbl>,
## # CleanupSitesPctl <dbl>, GroundwaterThreatsPctl <dbl>, HazWastePctl <dbl>,
## # ImpWaterBodiesPctl <dbl>, SolidWastePctl <dbl>, PollutionBurdenPctl <dbl>,
## # AsthmaPctl <dbl>, LowBirthWeightPctl <dbl>,
## # CardiovascularDiseasePctl <dbl>, PopCharPctl <dbl>, EducationPctl <dbl>,
## # LinguisticIsolPctl <dbl>, PovertyPctl <dbl>, UnemploymentPctl <dbl>, …
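By default, mutate(across(...)) overwrites the selected columns. If you would rather keep the original percentiles alongside the proportions, the .names argument of across() is one option. The sketch below assumes dplyr is installed and uses a made-up tibble; the "_prop" suffix is an arbitrary choice for illustration.

```r
library(dplyr)

# Toy data, made up for this sketch
toy <- tibble(
  APctl = c(10, 50, 100),
  BPctl = c(0, 25, 75),
  score = c(1, 2, 3)
)

res <- toy %>%
  mutate(across(
    contains("Pctl"),
    function(x) x / 100,
    .names = "{.col}_prop"  # new columns: APctl_prop, BPctl_prop
  ))
res
```

The original APctl and BPctl columns are left untouched, and the new columns are appended at the end.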
Use across and mutate to convert all columns starting with the string “PM” into a binary variable: TRUE if the value is greater than 10 and FALSE if less than or equal to 10. Use starts_with() to select the columns that start with “PM”. A logical test inside mutate (x > 10) will automatically fill a column with TRUE/FALSE.
ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  )) %>%
  glimpse() # add glimpse to view the changes
## Rows: 8,035
## Columns: 67
## $ CensusTract <dbl> 6001400100, 6001400200, 6001400300, 60014004…
## $ CaliforniaCounty <chr> "Alameda", "Alameda", "Alameda", "Alameda", …
## $ ZIP <dbl> 94704, 94618, 94618, 94609, 94609, 94609, 94…
## $ Longitude <dbl> -122.2319, -122.2496, -122.2544, -122.2575, …
## $ Latitude <dbl> 37.86759, 37.84817, 37.84060, 37.84821, 37.8…
## $ ApproxLocation <chr> "Oakland", "Oakland", "Oakland", "Oakland", …
## $ CES4.0Score <dbl> 4.85, 4.88, 11.20, 12.39, 16.73, 20.02, 36.7…
## $ CES4.0Percentile <dbl> 2.80, 2.87, 15.94, 18.97, 29.74, 37.59, 70.1…
## $ CES4.0PercRange <chr> "1-5% (lowest scores)", "1-5% (lowest scores…
## $ Ozone <dbl> 0.029, 0.029, 0.029, 0.029, 0.029, 0.029, 0.…
## $ OzonePctl <dbl> 3.12, 3.12, 3.12, 3.12, 3.12, 3.12, 3.12, 3.…
## $ PM2.5 <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FA…
## $ PM2.5.Pctl <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TR…
## $ DieselPM <dbl> 0.093, 0.591, 0.497, 0.327, 0.233, 0.383, 0.…
## $ DieselPMPctl <dbl> 34.76, 92.71, 89.77, 79.10, 67.58, 83.76, 81…
## $ DrinkingWater <dbl> 110.41, 110.41, 110.41, 110.41, 110.41, 110.…
## $ DrinkingWaterPctl <dbl> 4.21, 4.21, 4.21, 4.21, 4.21, 4.21, 4.21, 4.…
## $ Lead <dbl> 15.25, 61.71, 58.82, 60.88, 61.56, 62.83, 68…
## $ LeadPctl <dbl> 7.74, 68.20, 64.18, 67.08, 67.95, 69.70, 76.…
## $ Pesticides <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ PesticidesPctl <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ ToxRelease <dbl> 626.7743, 609.9600, 598.8777, 624.6293, 641.…
## $ ToxReleasePctl <dbl> 56.03, 55.43, 55.04, 55.90, 56.48, 55.65, 55…
## $ Traffic <dbl> 965.9462, 716.8225, 776.4908, 721.5428, 862.…
## $ TrafficPctl <dbl> 55.94, 37.49, 42.48, 38.00, 48.68, 67.06, 52…
## $ CleanupSites <dbl> 9.00, 0.00, 0.90, 0.00, 3.50, 1.75, 9.15, 47…
## $ CleanupSitesPctl <dbl> 58.17, 0.00, 11.83, 0.00, 33.87, 22.62, 58.5…
## $ GroundwaterThreats <dbl> 12.25, 44.25, 38.55, 60.50, 37.00, 29.80, 56…
## $ GroundwaterThreatsPctl <dbl> 52.42, 87.93, 85.29, 92.56, 84.34, 79.06, 91…
## $ HazWaste <dbl> 2.000, 0.165, 0.595, 0.315, 0.350, 0.370, 0.…
## $ HazWastePctl <dbl> 92.52, 28.51, 74.07, 51.89, 56.40, 58.27, 72…
## $ ImpWaterBodies <dbl> 2, 0, 0, 0, 0, 0, 10, 10, 10, 10, 0, 0, 11, …
## $ ImpWaterBodiesPctl <dbl> 23.88, 0.00, 0.00, 0.00, 0.00, 0.00, 82.97, …
## $ SolidWaste <dbl> 1.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.…
## $ SolidWastePctl <dbl> 35.72, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0…
## $ PollutionBurden <dbl> 34.02, 33.02, 36.64, 33.82, 35.88, 37.86, 46…
## $ PollutionBurdenScore <dbl> 4.15, 4.03, 4.47, 4.13, 4.38, 4.62, 5.72, 5.…
## $ PollutionBurdenPctl <dbl> 26.62, 24.18, 33.37, 26.24, 31.40, 36.94, 61…
## $ Asthma <dbl> 15.65, 20.47, 30.88, 49.61, 86.57, 101.53, 1…
## $ AsthmaPctl <dbl> 4.44, 9.80, 26.57, 55.98, 88.38, 93.07, 93.8…
## $ LowBirthWeight <dbl> 3.85, 4.05, 3.78, 4.44, 3.64, 2.56, 5.60, 7.…
## $ LowBirthWeightPctl <dbl> 23.06, 27.92, 21.62, 37.02, 19.00, 5.03, 66.…
## $ CardiovascularDisease <dbl> 5.24, 8.14, 8.88, 8.08, 11.13, 12.80, 10.63,…
## $ CardiovascularDiseasePctl <dbl> 1.42, 14.53, 20.11, 14.28, 38.87, 52.78, 35.…
## $ TotalPop <dbl> 3120, 2007, 5051, 4007, 4124, 1745, 5128, 40…
## $ ChildrenPercLess10 <dbl> 7.82, 10.46, 11.42, 9.38, 9.12, 9.97, 7.62, …
## $ PopPerc10to64 <dbl> 66.12, 66.32, 73.04, 78.79, 81.96, 81.43, 82…
## $ ElderlyMore64 <dbl> 26.06, 23.22, 15.54, 11.83, 8.92, 8.60, 9.93…
## $ HispanicPerc <dbl> 3.78, 8.67, 6.95, 12.10, 9.46, 7.51, 19.38, …
## $ WhitePerc <dbl> 74.26, 73.49, 67.99, 63.74, 45.44, 49.28, 38…
## $ AfAmericanPerc <dbl> 3.43, 2.59, 9.09, 6.64, 21.39, 20.52, 28.24,…
## $ NativeAmericanPerc <dbl> 0.00, 0.20, 0.00, 0.87, 0.00, 0.17, 0.00, 0.…
## $ AsianAmericanPerc <dbl> 12.53, 8.52, 12.14, 10.48, 11.34, 10.54, 6.4…
## $ OtherMultiplePerc <dbl> 5.99, 6.53, 3.84, 6.16, 12.37, 11.98, 7.68, …
## $ PopChar <dbl> 11.25, 11.67, 24.14, 28.93, 36.83, 41.76, 61…
## $ PopCharScore <dbl> 1.17, 1.21, 2.50, 3.00, 3.82, 4.33, 6.42, 6.…
## $ PopCharPctl <dbl> 1.53, 1.65, 12.27, 18.43, 30.16, 37.70, 68.4…
## $ Education <dbl> 3.3, 0.4, 5.6, 4.8, 2.3, 2.8, NA, 5.5, 8.9, …
## $ EducationPctl <dbl> 12.55, 0.42, 24.12, 20.29, 7.40, 9.73, NA, 2…
## $ LinguisticIsol <dbl> 1.2, 0.0, 8.0, 0.9, 1.7, 1.0, 4.8, 2.5, 10.3…
## $ LinguisticIsolPctl <dbl> 8.49, 0.00, 53.36, 5.64, 13.30, 6.27, 36.46,…
## $ Poverty <dbl> 10.4, 10.6, 10.3, 21.1, 21.9, 16.0, 33.2, 36…
## $ PovertyPctl <dbl> 11.03, 11.44, 10.90, 36.42, 38.13, 24.42, 58…
## $ Unemployment <dbl> NA, 3.0, 3.9, 2.5, 3.8, 7.5, 7.0, 5.7, 5.5, …
## $ UnemploymentPctl <dbl> NA, 17.11, 29.41, 10.66, 28.20, 71.67, 67.48…
## $ HousingBurden <dbl> 11.2, 4.0, 8.9, 14.8, 14.8, 18.0, 22.1, 20.7…
## $ HousingBurdenPctl <dbl> 19.39, 0.67, 9.81, 37.48, 37.48, 54.07, 70.8…
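In the glimpse() output, both PM2.5 and PM2.5.Pctl are now logical, because starts_with("PM") matches both names. That is what the question asks for, but if you only wanted the raw measurement converted, selection helpers can be combined with & and !. A minimal sketch, assuming dplyr is installed and using made-up values:

```r
library(dplyr)

# Toy data, made up for this sketch
toy <- tibble(
  PM2.5      = c(8.2, 12.5, 10),
  PM2.5.Pctl = c(36.3, 95.1, 60.0),
  Ozone      = c(0.03, 0.05, 0.04)
)

res <- toy %>%
  mutate(across(
    starts_with("PM") & !ends_with("Pctl"),  # PM2.5 only
    function(x) x > 10
  ))
res
```

Note that a value of exactly 10 becomes FALSE, consistent with the "less than or equal to 10" rule in the question.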
Take your code from the previous question and assign it to the variable ces_dat. Create a ggplot where the x-axis is Asthma and the y-axis is PM2.5. Add a boxplot (geom_boxplot()). Change the labs() layer so that the x-axis is “ER Visits for Asthma: PM2.5 greater than 10”.
ces_dat <-
  ces %>%
  mutate(across(
    starts_with("PM"),
    function(x) x > 10
  ))
ggplot(data = ces_dat, aes(x = `Asthma`, y = `PM2.5`)) +
  geom_boxplot() +
  labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).
# Make everything a function if you like!
ces_boxplot <- function() {
  ces %>%
    mutate(across(
      starts_with("PM"),
      function(x) x > 10
    )) %>%
    ggplot(aes(x = `Asthma`, y = `PM2.5`)) +
    geom_boxplot() +
    labs(x = "ER Visits for Asthma: PM2.5 greater than 10")
}
ces_boxplot()
## Warning: Removed 11 rows containing non-finite outside the scale range
## (`stat_boxplot()`).