Writing your own functions

So far we’ve seen many functions, such as c(), class(), filter(), and dim().

Why create your own functions?

  • Cut down on repetitive code (easier to fix things!)
  • Organize code into manageable chunks
  • Avoid running code unintentionally
  • Use names that make sense to you

The general syntax for a function is:

function_name <- function(arg1, arg2, ...) {
 <function body>
}

Here we will write a function that divides some number x by 100:

div_100 <- function(x) x / 100

Running the line of code above defines the function and makes it ready to use (no output yet!). Let’s test it!

div_100(x = 600)
[1] 6

Writing your own functions: { }

Adding curly brackets, {}, lets you write a function body that spans multiple lines:

div_100 <- function(x) {
  x / 100
}
div_100(x = 10)
[1] 0.1

Writing your own functions: return

To state exactly what the function should output, we use return():

div_100_plus_4 <- function(x) {
  output_int <- x / 100
  output <- output_int + 4
  return(output)
}
div_100_plus_4(x = 10)
[1] 4.1
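
A small aside on return(): in R, a function without an explicit return() hands back the value of the last expression it evaluates. The sketch below uses two hypothetical functions (plus_4_explicit and plus_4_implicit are illustrative names, not from the slides) to show that both styles give the same result.

```r
# Hypothetical example names; both functions compute x / 100 + 4.
plus_4_explicit <- function(x) {
  output <- x / 100 + 4
  return(output)   # explicit: return() states the output
}

plus_4_implicit <- function(x) {
  x / 100 + 4      # implicit: the last evaluated expression is returned
}

plus_4_explicit(10)
plus_4_implicit(10)
```

Both calls give 4.1. An explicit return() is mostly a readability choice in short functions like these.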

Writing your own functions: multiple inputs

Functions can take multiple inputs:

div_100_plus_y <- function(x, y) x / 100 + y
div_100_plus_y(x = 10, y = 3)
[1] 3.1

Writing your own functions: multiple outputs

Functions can return a vector (or other object) with multiple outputs.

x_and_y_plus_2 <- function(x, y) {
  output1 <- x + 2
  output2 <- y + 2

  return(c(output1, output2))
}
result <- x_and_y_plus_2(x = 10, y = 3)
result
[1] 12  5
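
Because c() combines values into a single vector of one type, a named list can be a better fit when the outputs differ in type or when you want to pull them out by name. A minimal sketch, where describe_numbers is a hypothetical function name:

```r
# Hypothetical helper: returns several outputs in a named list.
describe_numbers <- function(x) {
  list(
    n = length(x),          # how many values
    total = sum(x),         # their sum
    label = "my numbers"    # a character value, kept as-is inside a list
  )
}

result <- describe_numbers(c(2, 4, 6))
result$n      # access one output by name
result$total
```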

Writing your own functions: defaults

Functions can have “default” argument values. A default lets us call the function later without supplying that argument:

div_100_plus_y <- function(x = 10, y = 3) x / 100 + y
div_100_plus_y()
[1] 3.1
div_100_plus_y(x = 11, y = 4)
[1] 4.11
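
Defaults can also be overridden one at a time, and named arguments can be supplied in any order. A short sketch that redefines the same function so it runs on its own:

```r
div_100_plus_y <- function(x = 10, y = 3) x / 100 + y

# Override only y; x keeps its default of 10
div_100_plus_y(y = 5)

# Named arguments can be given in any order
div_100_plus_y(y = 4, x = 11)

# Unnamed arguments are matched by position: x first, then y
div_100_plus_y(11, 4)
```

The last two calls are equivalent; naming the arguments simply makes the intent clearer to a reader.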

Writing another simple function

Let’s write a function, sqdif, that:

  1. takes two numbers x and y with default values of 2 and 3.
  2. takes the difference
  3. squares this difference
  4. then returns the final value

sqdif <- function(x = 2, y = 3) (x - y)^2

sqdif()
[1] 1
sqdif(x = 10, y = 5)
[1] 25
sqdif(10, 5)
[1] 25
sqdif(11, 4)
[1] 49

Writing your own functions: characters

Functions can take any kind of input. Here is a function that takes a character string:

loud <- function(word) {
  output <- rep(toupper(word), 5)
  return(output)
}
loud(word = "hooray!")
[1] "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!" "HOORAY!"

Functions for tibbles - curly braces

# Get the mean and the number of missing values for a specific column
get_summary <- function(dataset, col_name) {
  dataset %>%
    summarize(mean = mean({{col_name}}, na.rm = TRUE),
              na_count = sum(is.na({{col_name}})))
}
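
The double curly braces, {{ }}, tell dplyr to use the column that the caller passed in as col_name, rather than looking for a column literally named col_name. As a quick check that needs no download, this sketch applies the same function to R's built-in airquality data (chosen here only for illustration):

```r
library(dplyr)

get_summary <- function(dataset, col_name) {
  dataset %>%
    summarize(mean = mean({{ col_name }}, na.rm = TRUE),
              na_count = sum(is.na({{ col_name }})))
}

# The column name is passed unquoted, just like in other dplyr functions
ozone_summary <- get_summary(airquality, Ozone)
ozone_summary
```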

Functions for tibbles - example

er <- read_csv(file = "https://daseh.org/data/CO_ER_heat_visits.csv")
get_summary(er, visits)
# A tibble: 1 × 2
   mean na_count
  <dbl>    <int>
1  7.19      303
yearly_co2 <- 
  read_csv(file = "https://daseh.org/data/Yearly_CO2_Emissions_1000_tonnes.csv")
get_summary(yearly_co2, `2014`)
# A tibble: 1 × 2
     mean na_count
    <dbl>    <int>
1 175993.        0

Summary

  • Simple functions take the form:
    • NEW_FUNCTION <- function(x, y){x + y}
    • Can specify defaults like function(x = 1, y = 2){x + y}
    • return will provide a value as output
  • Specify a column (from a tibble) inside a function using {{double curly braces}}

Lab Part 1

Functions on multiple columns

Using your custom functions: sapply(), a base R function

Now that you’ve made a function, you can “apply” it easily with sapply()!

Calls to sapply() take the form:

sapply(<a vector, list, data frame>, some_function)
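
As a minimal base R sketch of that form, here is sapply() applied to a plain vector and to a named list (the input values are made up for illustration):

```r
div_100 <- function(x) x / 100

# Apply to each element of a vector; note div_100 has no parentheses
sapply(c(600, 10, 250), div_100)

# Apply to each element of a named list; names carry over to the result
sapply(list(a = 1:3, b = c(10, 20)), length)
```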

Using your custom functions: sapply()

Let’s apply a function to look at the CO heat-related ER visits dataset.

🚨There are no parentheses after the function name (class, not class())!🚨

You can also pipe the data into sapply().

sapply(er, class) 
     county        rate   lower95cl   upper95cl      visits        year 
"character"   "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
# also: er %>% sapply(class)

Use the div_100 function we created earlier to convert 0-100 percentiles to proportions.

er %>%
  select(ends_with("cl")) %>%
  sapply(div_100) %>%
  head()
      lower95cl  upper95cl
[1,]         NA 0.09236776
[2,] 0.02848937         NA
[3,] 0.04359735 0.09313561
[4,] 0.01711087 0.04846996
[5,] 0.01892912 0.05232461
[6,] 0.06124961 0.11572046

Using your custom functions “on the fly” to iterate

A function defined inline, without a name, is called an “anonymous function”.

er %>%
  select(ends_with("cl")) %>%
  sapply(function(x) x / 100) %>%
  head()
      lower95cl  upper95cl
[1,]         NA 0.09236776
[2,] 0.02848937         NA
[3,] 0.04359735 0.09313561
[4,] 0.01711087 0.04846996
[5,] 0.01892912 0.05232461
[6,] 0.06124961 0.11572046

Anonymous functions: alternative syntax

er %>%
  select(ends_with("cl")) %>%
  sapply(\(x) x / 100) %>%
  head()
      lower95cl  upper95cl
[1,]         NA 0.09236776
[2,] 0.02848937         NA
[3,] 0.04359735 0.09313561
[4,] 0.01711087 0.04846996
[5,] 0.01892912 0.05232461
[6,] 0.06124961 0.11572046
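
The \(x) shorthand was added in R 4.1.0 and means the same thing as function(x); older versions of R need the longer spelling. A quick base R check, using a made-up vector:

```r
values <- c(50, 100, 250)  # made-up numbers for illustration

long_form  <- sapply(values, function(x) x / 100)
short_form <- sapply(values, \(x) x / 100)  # requires R >= 4.1.0

identical(long_form, short_form)
```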

across

Using functions in mutate() and summarize()

We already know how to use functions to modify columns with mutate() or to calculate summary statistics with summarize().

er %>%
  summarize(max_visits = max(visits, na.rm = TRUE),
            max_rate = max(rate, na.rm = TRUE))
# A tibble: 1 × 2
  max_visits max_rate
       <dbl>    <dbl>
1         48     89.3

Applying functions with across from dplyr

across() makes it easy to apply the same transformation to multiple columns. It is usually used inside summarize() or mutate().

summarize(across(<columns>, <function>))

or

mutate(across(<columns>, <function>))
  • List the columns first: .cols =
  • List the function next: .fns =
  • If the function needs extra arguments (e.g., na.rm = TRUE), wrap it in an anonymous function.
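
Spelled out with the argument names, the pattern looks like the sketch below. It uses R's built-in airquality data so that it runs on its own (an assumption for illustration; any data frame with numeric columns would do).

```r
library(dplyr)

ozone_temp_means <- airquality %>%
  summarize(across(
    .cols = c(Ozone, Temp),                    # which columns
    .fns = function(x) mean(x, na.rm = TRUE)   # what to do to each column
  ))
ozone_temp_means
```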

Combining with summarize()

er %>%
  summarize(across(
    c(visits, rate),
    mean # no parentheses
  ))
# A tibble: 1 × 2
  visits  rate
   <dbl> <dbl>
1     NA    NA
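
Both results are NA because mean() returns NA whenever its input contains a missing value, unless na.rm = TRUE is set. A tiny base R illustration with made-up numbers:

```r
x <- c(1, NA, 3)  # made-up vector with one missing value

mean(x)                # NA: the missing value propagates
mean(x, na.rm = TRUE)  # 2: the NA is dropped before averaging
```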

Add an anonymous function to include additional arguments (e.g., na.rm = TRUE).

er %>%
  summarize(across(
    c(visits, rate),
    function(x) mean(x, na.rm = TRUE)
  ))
# A tibble: 1 × 2
  visits  rate
   <dbl> <dbl>
1   7.19  2.43

across() can be combined with other tidyverse functions like group_by()!

er %>%
  group_by(year) %>% 
  summarize(across(
    c(visits, rate),
    function(x) mean(x, na.rm = TRUE)
  ))
# A tibble: 12 × 3
    year visits  rate
   <dbl>  <dbl> <dbl>
 1  2011   5.20  1.49
 2  2012   5.89  1.75
 3  2013   5.63  1.83
 4  2014   4.12  1.41
 5  2015   6.4   1.96
 6  2016  10.1   5.28
 7  2017   7.24  2.13
 8  2018  11.7   3.28
 9  2019   9.12  4.09
10  2020   6.26  1.73
11  2021   8.06  2.08
12  2022   9.29  3.21

Using different tidyselect helpers (e.g., starts_with(), ends_with(), contains())

er %>% 
  group_by(year) %>%
  summarize(across(contains("cl"), function(x) mean(x, na.rm = TRUE)))
# A tibble: 12 × 3
    year lower95cl upper95cl
   <dbl>     <dbl>     <dbl>
 1  2011     0.836      2.12
 2  2012     1.06       2.41
 3  2013     1.07       2.62
 4  2014     0.810      2.11
 5  2015     1.21       2.77
 6  2016     3.05       7.99
 7  2017     1.28       3.08
 8  2018     2.17       4.41
 9  2019     2.32       6.21
10  2020     1.02       2.52
11  2021     1.30       2.92
12  2022     1.93       4.71

Combining with mutate() - the replace_na function

Let’s look at the yearly CO2 emissions dataset we loaded earlier.

replace_na({data frame}, {list of values}) or replace_na({vector}, {single value})

yearly_co2 %>%
  select(country, starts_with("194")) %>%
  mutate(across(
    c(`1943`, `1944`, `1945`),
    function(x) replace_na(x, replace = 0)
  ))
# A tibble: 192 × 11
   country        `1940` `1941` `1942` `1943` `1944` `1945` `1946` `1947` `1948`
   <chr>           <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan        NA     NA     NA      0      0      0     NA     NA     NA
 2 Albania           693    627    744    462    154    121    484    928    704
 3 Algeria           238    312    499    469    499    616    763    744    803
 4 Andorra            NA     NA     NA      0      0      0     NA     NA     NA
 5 Angola             NA     NA     NA      0      0      0     NA     NA     NA
 6 Antigua and B…     NA     NA     NA      0      0      0     NA     NA     NA
 7 Argentina       15900  14000  13500  14100  14000  13700  13700  14500  17400
 8 Armenia           848    745    513    655    613    649    730    878    935
 9 Australia       29100  34600  36500  35000  34200  32700  35500  38000  38500
10 Austria          7350   7980   8560   9620   9400   4570  12800  17600  24500
# ℹ 182 more rows
# ℹ 1 more variable: `1949` <dbl>
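
The two forms of replace_na() (from the tidyr package) can be compared side by side. The small data frame below is made up for illustration:

```r
library(tidyr)

# Vector form: one replacement value for the whole vector
replace_na(c(1, NA, 3), 0)

# Data frame form: a named list with one replacement per column
toy <- data.frame(count = c(5, NA, 2), site = c("A", NA, "C"))
filled <- replace_na(toy, list(count = 0, site = "unknown"))
filled
```

Inside mutate(across(...)), each column arrives as a vector, which is why the example above uses the vector form with a single replacement value.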

GUT CHECK!

Why use across()?

A. Efficiency - faster and less repetitive

B. Calculate the cross product

C. Connect across datasets

purrr package

Like across(), the purrr package lets you apply a function to multiple columns in a data frame or to multiple data objects in a list.

While we won’t get into purrr too much in this class, it’s a handy package to know about should you ever need to handle an irregular list!

Multiple Data Frames

Lists help us work with multiple tibbles / data frames

df_list <- list(AQ = airquality, er = er, yearly_co2 = yearly_co2)


Use select() to keep only the numeric columns of each tibble:

df_list <- 
  df_list %>% 
  sapply(function(x) select(x, where(is.numeric)))
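
One thing to be aware of: sapply() tries to simplify its result, so when every element happens to come back the same size, the output may no longer be a list of data frames. lapply() always returns a list, which makes it a safer choice for this kind of step. A minimal sketch with two made-up data frames (toy_list is a hypothetical name):

```r
library(dplyr)

# Made-up data for illustration
toy_list <- list(
  a = data.frame(id = 1:3, name = c("x", "y", "z")),
  b = data.frame(score = c(2.5, 3.5), passed = c(TRUE, FALSE))
)

# lapply() keeps each result as its own data frame
numeric_only <- lapply(toy_list, function(df) select(df, where(is.numeric)))

sapply(numeric_only, is.data.frame)
```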

Multiple data frames: sapply

df_list %>% sapply(nrow)
        AQ         er yearly_co2 
       153        768        192 
df_list %>% sapply(colMeans, na.rm = TRUE)
$AQ
     Ozone    Solar.R       Wind       Temp      Month        Day 
 42.129310 185.931507   9.957516  77.882353   6.993464  15.803922 

$er
       rate   lower95cl   upper95cl      visits        year 
   2.431466    1.449322    3.526338    7.189247 2016.500000 

$yearly_co2
      1751       1752       1753       1754       1755       1756       1757 
  9360.000   9360.000   9360.000   9370.000   9370.000  10000.000  10000.000 
      1758       1759       1760       1761       1762       1763       1764 
 10000.000  10000.000  10000.000  11000.000  11000.000  11000.000  11000.000 
      1765       1766       1767       1768       1769       1770       1771 
 11000.000  12300.000  12300.000  12300.000  12300.000  12300.000  13600.000 
      1772       1773       1774       1775       1776       1777       1778 
 13600.000  13600.000  13600.000  13600.000  15000.000  15100.000  15100.000 
      1779       1780       1781       1782       1783       1784       1785 
 15100.000  15100.000  16900.000  16900.000  16900.000  16900.000   8451.835 
      1786       1787       1788       1789       1790       1791       1792 
  9601.835   9601.835   9601.835   9601.835   9601.835  10701.835   7290.890 
      1793       1794       1795       1796       1797       1798       1799 
  7294.557   7315.890   7316.890   7646.223   8051.223   8359.890   8810.223 
      1800       1801       1802       1803       1804       1805       1806 
  5631.934   5590.134   5262.667   6299.534   5730.945   6691.334   7019.534 
      1807       1808       1809       1810       1811       1812       1813 
  6153.112   7019.134   7022.134   6231.445   6603.112   6845.945   6874.445 
      1814       1815       1816       1817       1818       1819       1820 
  7023.445   7260.778   7955.945   8251.112   8286.445   7145.524   7251.096 
      1821       1822       1823       1824       1825       1826       1827 
  7351.524   7649.239   8083.381   8364.810   8682.381   8783.096   9427.239 
      1828       1829       1830       1831       1832       1833       1834 
  9523.096   8308.959   3715.917   3524.280   3701.435   3774.812   8051.549 
      1835       1836       1837       1838       1839       1840       1841 
  8227.940   9527.570   9525.797   9834.906   9306.222   9915.889  10224.056 
      1842       1843       1844       1845       1846       1847       1848 
 10794.581  10233.128  10886.590  11944.821  11277.621  12358.836  13349.715 
      1849       1850       1851       1852       1853       1854       1855 
 14114.208   7309.460  14239.479  14812.450  15555.143  18210.786   9659.643 
      1856       1857       1858       1859       1860       1861       1862 
 19788.000  20015.214   9473.593  10048.960   9454.173   9941.708  10140.359 
      1863       1864       1865       1866       1867       1868       1869 
 10795.606  11641.327  12028.593  12403.702  13295.668  13261.986  14101.569 
      1870       1871       1872       1873       1874       1875       1876 
 14035.927  14901.214  16480.252  17536.229  16386.950  17822.450  18549.268 
      1877       1878       1879       1880       1881       1882       1883 
 18891.376  18087.349  18907.849  21371.838  21537.814  23334.287  24191.037 
      1884       1885       1886       1887       1888       1889       1890 
 23891.868  24059.083  24452.959  25096.761  27101.018  27115.464  29521.920 
      1891       1892       1893       1894       1895       1896       1897 
 30240.900  29174.809  28813.677  29819.606  30957.483  31953.229  33496.131 
      1898       1899       1900       1901       1902       1903       1904 
 35332.748  37808.537  40689.042  41192.041  41385.846  44259.451  43865.974 
      1905       1906       1907       1908       1909       1910       1911 
 47644.255  48996.128  55450.910  52345.037  54351.636  56043.765  56078.067 
      1912       1913       1914       1915       1916       1917       1918 
 59721.222  61191.497  55649.835  53950.195  56245.245  57865.373  57115.897 
      1919       1920       1921       1922       1923       1924       1925 
 50238.612  56517.498  48705.654  52677.405  59904.867  59103.740  59397.356 
      1926       1927       1928       1929       1930       1931       1932 
 58343.344  61780.641  56374.938  60483.538  55734.985  47769.001  41984.148 
      1933       1934       1935       1936       1937       1938       1939 
 41418.730  45669.194  47815.292  51868.778  55471.321  52758.914  54467.672 
      1940       1941       1942       1943       1944       1945       1946 
 60953.207  60166.075  63079.252  62584.645  62709.201  49118.750  51108.588 
      1947       1948       1949       1950       1951       1952       1953 
 57390.813  59757.682  56160.854  42926.053  45396.836  46074.155  47000.409 
      1954       1955       1956       1957       1958       1959       1960 
 47613.014  50991.232  54305.109  55256.960  54709.149  54666.378  57561.514 
      1961       1962       1963       1964       1965       1966       1967 
 57157.351  58458.493  61413.249  62221.298  65112.952  68104.840  70162.033 
      1968       1969       1970       1971       1972       1973       1974 
 74441.221  78894.014  84574.641  87462.937  90846.298  95681.885  95405.067 
      1975       1976       1977       1978       1979       1980       1981 
 95535.064 100697.677 103564.557 107347.507 110215.570 109769.029 106307.828 
      1982       1983       1984       1985       1986       1987       1988 
105814.647 106862.546 109961.996 113988.938 115511.701 119589.652 124047.922 
      1989       1990       1991       1992       1993       1994       1995 
126046.091 119785.938 119809.944 113425.511 114526.532 114229.884 116463.998 
      1996       1997       1998       1999       2000       2001       2002 
119634.469 120061.933 119380.988 121199.089 124827.001 125860.776 127903.029 
      2003       2004       2005       2006       2007       2008       2009 
134503.757 140854.821 145648.629 150692.756 154231.607 158692.397 157166.082 
      2010       2011       2012       2013       2014 
165334.092 171764.925 174033.392 174856.175 175992.542 

Summary

  • Apply your functions with sapply(<a vector or list>, some_function)
  • Use across() to apply functions across multiple columns of data
  • across() is used inside summarize() or mutate()
  • Can use sapply (or purrr package) to work with multiple data frames within lists simultaneously

Lab Part 2

Research Survey