Homework is optional, but we recommend it so you can get the most out of this course.
## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
1. Bring the following dataset into R.
read_csv()
and assign it the name “mort”.mort <- read_csv("https://daseh.org/data/mortality.csv")
## New names:
## Rows: 197 Columns: 255
## ── Column specification
## ──────────────────────────────────────────────────────── Delimiter: "," chr
## (1): ...1 dbl (254): 1760, 1761, 1762, 1763, 1764, 1765, 1766, 1767, 1768,
## 1769, 1770,...
## ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
## Specify the column types or set `show_col_types = FALSE` to quiet this message.
## • `` -> `...1`
2. Run the colnames()
function to take a look at the
dataset column names. You should see that there was originally no name
for the first column and that R replaced it with “…1”. Rename the first
column of “mort” to “country” using the rename()
function
in dplyr
.
colnames(mort)
## [1] "...1" "1760" "1761" "1762" "1763" "1764" "1765" "1766" "1767" "1768"
## [11] "1769" "1770" "1771" "1772" "1773" "1774" "1775" "1776" "1777" "1778"
## [21] "1779" "1780" "1781" "1782" "1783" "1784" "1785" "1786" "1787" "1788"
## [31] "1789" "1790" "1791" "1792" "1793" "1794" "1795" "1796" "1797" "1798"
## [41] "1799" "1800" "1801" "1802" "1803" "1804" "1805" "1806" "1807" "1808"
## [51] "1809" "1810" "1811" "1812" "1813" "1814" "1815" "1816" "1817" "1818"
## [61] "1819" "1820" "1821" "1822" "1823" "1824" "1825" "1826" "1827" "1828"
## [71] "1829" "1830" "1831" "1832" "1833" "1834" "1835" "1836" "1837" "1838"
## [81] "1839" "1840" "1841" "1842" "1843" "1844" "1845" "1846" "1847" "1848"
## [91] "1849" "1850" "1851" "1852" "1853" "1854" "1855" "1856" "1857" "1858"
## [101] "1859" "1860" "1861" "1862" "1863" "1864" "1865" "1866" "1867" "1868"
## [111] "1869" "1870" "1871" "1872" "1873" "1874" "1875" "1876" "1877" "1878"
## [121] "1879" "1880" "1881" "1882" "1883" "1884" "1885" "1886" "1887" "1888"
## [131] "1889" "1890" "1891" "1892" "1893" "1894" "1895" "1896" "1897" "1898"
## [141] "1899" "1900" "1901" "1902" "1903" "1904" "1905" "1906" "1907" "1908"
## [151] "1909" "1910" "1911" "1912" "1913" "1914" "1915" "1916" "1917" "1918"
## [161] "1919" "1920" "1921" "1922" "1923" "1924" "1925" "1926" "1927" "1928"
## [171] "1929" "1930" "1931" "1932" "1933" "1934" "1935" "1936" "1937" "1938"
## [181] "1939" "1940" "1941" "1942" "1943" "1944" "1945" "1946" "1947" "1948"
## [191] "1949" "1950" "1951" "1952" "1953" "1954" "1955" "1956" "1957" "1958"
## [201] "1959" "1960" "1961" "1962" "1963" "1964" "1965" "1966" "1967" "1968"
## [211] "1969" "1970" "1971" "1972" "1973" "1974" "1975" "1976" "1977" "1978"
## [221] "1979" "1980" "1981" "1982" "1983" "1984" "1985" "1986" "1987" "1988"
## [231] "1989" "1990" "1991" "1992" "1993" "1994" "1995" "1996" "1997" "1998"
## [241] "1999" "2000" "2001" "2002" "2003" "2004" "2005" "2006" "2007" "2008"
## [251] "2009" "2010" "2030" "2050" "2099"
mort <- mort %>% rename(country = `...1`)
3. Select only the numeric type columns (select()
).
Then, create the variable “year” from column names by using the
colnames()
function to extract them.
year <- mort %>%
select(-country) %>%
colnames()
# OR
year <- mort %>%
select(starts_with(c("1", "2"))) %>%
colnames()
# OR
year <- mort %>%
select(where(is.numeric)) %>%
colnames()
4. What is the typeof()
for “year”? If it’s not an
integer, turn it into integer form with as.integer()
.
typeof(year)
## [1] "character"
year <- as.integer(year)
# "year" is of type "character".
5. Use the pct_complete()
function in the
naniar
package to determine the percent missing data in
“mort”. You might need to load and install naniar
!
library(naniar)
pct_complete(mort)
## [1] 66.95332
# "mort" is 66.95332 percent complete.
6. Are there any countries that have a complete record in “mort”
across all years? Just look at the output here, don’t reassign it.
Hint: look for complete records by dropping all NAs
from the dataset using drop_na()
.
drop_na(mort)
## # A tibble: 2 × 255
## country `1760` `1761` `1762` `1763` `1764` `1765` `1766` `1767` `1768` `1769`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Sweden 2.21 2.30 2.79 2.94 2.44 2.35 2.23 2.34 2.44 2.40
## 2 United … 2.20 2.35 2.32 2.32 2.37 2.39 2.27 2.29 2.28 2.32
## # ℹ 244 more variables: `1770` <dbl>, `1771` <dbl>, `1772` <dbl>, `1773` <dbl>,
## # `1774` <dbl>, `1775` <dbl>, `1776` <dbl>, `1777` <dbl>, `1778` <dbl>,
## # `1779` <dbl>, `1780` <dbl>, `1781` <dbl>, `1782` <dbl>, `1783` <dbl>,
## # `1784` <dbl>, `1785` <dbl>, `1786` <dbl>, `1787` <dbl>, `1788` <dbl>,
## # `1789` <dbl>, `1790` <dbl>, `1791` <dbl>, `1792` <dbl>, `1793` <dbl>,
## # `1794` <dbl>, `1795` <dbl>, `1796` <dbl>, `1797` <dbl>, `1798` <dbl>,
## # `1799` <dbl>, `1800` <dbl>, `1801` <dbl>, `1802` <dbl>, `1803` <dbl>, …
# Yes - but just two!
7. Reshape the “complete” data to long form.
pivot_longer()
.!COLUMN
or
-COLUMN
means everything except COLUMN.long <-
pivot_longer(mort, -country, names_to = "year", values_to = "mortality")
# OR
long <-
pivot_longer(mort, !country, names_to = "year", values_to = "mortality")
# OR
long <-
pivot_longer(mort, starts_with(c("1", "2")), names_to = "year", values_to = "mortality")
8. Bring an additional dataset into R.
read_tsv()
and assign it the name “pop”.pop <- read_tsv("https://daseh.org/data/country_pop.txt")
## Rows: 242 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (4): Country (or dependent territory), Date, % of world population, Source
## dbl (1): Rank
## num (1): Population
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
9. Rename the second column in “pop” to “country” and the column “%
of world population”, to “percent”. Use the rename()
function. Don’t forget to reassign the renamed data to “pop”.
pop <- pop %>%
rename(
country = `Country (or dependent territory)`,
percent = `% of world population`
)
10. Sort the data in “pop” by “Population” from largest to smallest
using arrange()
and desc()
. After sorting,
select()
“country” to create an one-column tibble of
countries ordered by population. Assign this data the name
“country_ordered”.
country_ordered <- pop %>%
arrange(desc(Population)) %>%
select(country)
11. Subset “long” based on years 2000-2010, including 2000 and 2010
and call this “long_sub” using &
or the
between()
function. Confirm your filtering worked by
looking at the range of “year”. If you’re getting a strange error, make
sure you created the “year” column in problem #7.
long_sub <- long %>% filter(year >= 2000 & year <= 2010)
long_sub %>%
pull(year) %>%
range() # confirm it worked
## [1] "2000" "2010"
12. Further subset long_sub
. You will filter for
specific countries using filter()
and the %in%
operator. Only include countries in this list:
c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Canada")
.
Make sure to reassign to “long_sub”.
long_sub <- long_sub %>%
filter(country %in% c("Venezuela", "Bahrain", "Estonia", "Iran", "Thailand", "Canada"))
13. Use pivot_wider()
to turn the “year” column of
“long_sub” into multiple columns, each representing a different year.
Fill values (values_from=
) with “mortality”. Assign this
pivoted dataset the name “mort_sub”.
mort_sub <- long_sub %>%
pivot_wider(id_cols = country, names_from = year, values_from = mortality)
14. Using “country_ordered” and “mort_sub”, right_join()
the two datasets by “country”. Use the pipe %>%
to join
this dataset to “pop”, keeping only the data on the lefthand side of the
join. Call this “joined”.
joined <- country_ordered %>%
right_join(mort_sub, by = "country") %>%
left_join(pop, by = "country")
15. The values in the table are percentages of the total population (not proportion).
select()
only “country”, “Population”, and
“mort_count” and view the data.Justification is just for fun. The main point is that decisions in your analysis should depend on your reasoning not how many lines of code it takes :)
# Use 2010: There appears to be a downward trend in mortality rates, so using 2010 could be the most accurate for future years.
joined %>%
mutate(mort_count = Population * `2010` / 100) %>%
select(country, Population, mort_count)
## # A tibble: 6 × 3
## country Population mort_count
## <chr> <dbl> <dbl>
## 1 Iran 77056000 79995.
## 2 Thailand 65926261 58939.
## 3 Canada 35002447 17119.
## 4 Venezuela 28946101 40455.
## 5 Estonia 1294455 1122.
## 6 Bahrain 1234571 1609.
# OR
# Use an average 2000-2010: Using the average is more robust to fluctuations and uncertainty in the coming years.
avg_mort <- rowMeans(joined %>% select(starts_with("2")))
joined <- joined %>% mutate(avg_pct_2000 = avg_mort)
joined %>%
mutate(mort_count = Population * avg_pct_2000 / 100) %>%
select(country, Population, mort_count)
## # A tibble: 6 × 3
## country Population mort_count
## <chr> <dbl> <dbl>
## 1 Iran 77056000 107019.
## 2 Thailand 65926261 65944.
## 3 Canada 35002447 16886.
## 4 Venezuela 28946101 46433.
## 5 Estonia 1294455 1265.
## 6 Bahrain 1234571 1772.
16. Bring the following dataset into R.
read_excel()
from the readxl
package and
assign it the name “asthma”.library(readxl)
asthma <- read_excel("asthma.xlsx", sheet = "Age Group (Years)")
17. Rename the column
Weighted Number With Current Asthma
to “asthma_count” using
rename()
. Replace the original “asthma” object by calling
the new dataset “asthma”.
asthma <- asthma %>% rename("asthma_count" = `Weighted Number With Current Asthma`)
18. Separate Percent (SE)
into two separate columns:
“percent” and “SE” using the separate()
function. Replace
the original “asthma” object by calling the new dataset “asthma”.
asthma <- asthma %>% separate(`Percent (SE)`, into = c("percent", "SE"), sep = " ")
19. Remove the parentheses around the numbers in the new SE column.
You should use a combination of str_replace()
,
pull()
(because stringr package functions work on vectors
not dataframes!) and mutate()
. Replace the original
“asthma” object by calling the new dataset “asthma”.
pattern =
to find the starting parenthesis is
“[(]”pattern =
to find for the ending parenthesis is
“[)]”replacement =
for both can be empty quotation
marks: “”asthma <- asthma %>% mutate(SE = str_replace(pull(asthma, SE), pattern = "[(]", replacement = ""))
asthma <- asthma %>% mutate(SE = str_replace(pull(asthma, SE), pattern = "[)]", replacement = ""))
20. Determine the class of “percent” and “SE”. Can you take the mean values? Why or why not?
class(pull(asthma, percent))
## [1] "character"
class(pull(asthma, SE))
## [1] "character"
# No. Both are character type and math cannot be performed on characters.
21. Use as.numeric()
to convert “percent” and “SE” to
numeric class. Calculate the mean for both.
asthma %>%
mutate(percent = as.numeric(percent)) %>%
pull(percent) %>%
mean()
## [1] 7.771429
asthma %>%
mutate(SE = as.numeric(SE)) %>%
pull(SE) %>%
mean()
## [1] 0.655