Instructions

Homework is optional, but we recommend it so you can get the most out of this course.

## you can add more, or change...these are suggestions
library(tidyverse)
library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

Problem Set

1. Create the following two objects.

  1. Make an object “bday”. Assign it your birthday in day-month format (1-Jan).
  2. Make another object “name”. Assign it your name. Make sure to use quotation marks for anything with text!
bday <- "19-Feb"
name <- "Bruce Wayne"

2. Make an object “me” that is “bday” and “name” combined.

me <- c(bday, name)

3. Determine the data class for “me”.

class(me)
## [1] "character"
# The class for "me" is "character"

4. If I want to do me / 2 I get the following error: Error in me/2 : non-numeric argument to binary operator. Why? Write your answer as a comment inside the R chunk below.

# "me" is a character vector, and R cannot do arithmetic (such as division) on character data.
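To see this behavior in isolation, here is a minimal sketch. It uses tryCatch() to capture the error message instead of stopping the script. The value "10" is a made-up example (not part of the homework) showing that text which looks like a number can be converted with as.numeric() before doing math.

```r
# "me" is a character vector, so arithmetic on it fails.
me <- c("19-Feb", "Bruce Wayne")

# tryCatch() captures the error message instead of halting the script.
msg <- tryCatch(
  me / 2,
  error = function(e) conditionMessage(e)
)
print(msg)

# Hypothetical example: a number stored as text must be converted first.
num_text <- "10"
half <- as.numeric(num_text) / 2
print(half)
```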

The following questions involve an outside dataset.

We will be working with a dataset from the “Kaggle” website, which hosts competitions for prediction and machine learning. This particular dataset contains information about temperature measures from the Rover Environmental Monitoring Station (REMS) on Mars. These data are collected by Spain and Finland. More details on this dataset are here: https://www.kaggle.com/datasets/deepcontractor/mars-rover-environmental-monitoring-station/data.

5. Bring the dataset into R. The dataset is located at: https://daseh.org/data/kaggleMars_Dataset.csv. You can use the link, download it, or use whatever method you like for getting the file. Once you get the file, read the dataset in using read_csv() and assign it the name mars.

mars <- read_csv(file = "https://daseh.org/data/kaggleMars_Dataset.csv")
## Rows: 3197 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): earth_date, mars_date, UV_Radiation, weather
## dbl  (7): earth_year, solar_day, max_ground_temp, min_ground_temp, max_air_t...
## time (2): sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
mars <- read_csv("https://daseh.org/data/kaggleMars_Dataset.csv")
## Rows: 3197 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): earth_date, mars_date, UV_Radiation, weather
## dbl  (7): earth_year, solar_day, max_ground_temp, min_ground_temp, max_air_t...
## time (2): sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
url <- "https://daseh.org/data/kaggleMars_Dataset.csv"
mars <- read_csv(file = url)
## Rows: 3197 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): earth_date, mars_date, UV_Radiation, weather
## dbl  (7): earth_year, solar_day, max_ground_temp, min_ground_temp, max_air_t...
## time (2): sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
# download.file() has no "overwrite" argument; it replaces an existing destfile by default.
download.file(
  url = "https://daseh.org/data/kaggleMars_Dataset.csv",
  destfile = "mars_data.csv",
  mode = "wb"
)
mars <- read_csv(file = "mars_data.csv")
## Rows: 3197 Columns: 13
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (4): earth_date, mars_date, UV_Radiation, weather
## dbl  (7): earth_year, solar_day, max_ground_temp, min_ground_temp, max_air_t...
## time (2): sunrise, sunset
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
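As the message above suggests, the column specification printout can be silenced with show_col_types = FALSE. The sketch below is only an illustration: it reads a tiny made-up CSV string rather than the real file, and it assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Toy CSV text standing in for the real file (values are illustrative only).
csv_text <- "earth_year,weather\n2022,Sunny\n2021,Sunny\n"

# I() tells readr to treat the string as literal data rather than a file path.
# show_col_types = FALSE suppresses the column specification message.
demo <- read_csv(I(csv_text), show_col_types = FALSE)
print(demo)
```

The same argument should work with the URL or a downloaded file path in place of I(csv_text).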

6. Import the data “dictionary” from https://daseh.org/data/kaggleMars_dictionary.txt. Use the read_tsv() function and assign it the name “key”.

key <- read_tsv(file = "https://daseh.org/data/kaggleMars_dictionary.txt")
## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): earth_year, Year on Earth
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# OR
# download.file() has no "overwrite" argument; it replaces an existing destfile by default.
download.file(
  url = "https://daseh.org/data/kaggleMars_dictionary.txt",
  destfile = "dict.txt",
  mode = "wb"
)
key <- read_tsv("dict.txt")
## Rows: 12 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): earth_year, Year on Earth
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
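The column specification above lists "earth_year" and "Year on Earth" as the column names, which suggests the file may have no header row and its first entry was used as the header. If that is the case, one possible adjustment is to supply column names explicitly. The sketch below uses a toy two-line stand-in for the file; the names "variable" and "description" and the second row's text are assumptions for illustration only.

```r
library(readr)

# Toy stand-in for the dictionary file: tab-separated, no header row.
tsv_text <- "earth_year\tYear on Earth\nsolar_day\tExample description\n"

# Supplying col_names keeps the first line as data instead of a header.
key_demo <- read_tsv(
  I(tsv_text),
  col_names = c("variable", "description"),
  show_col_types = FALSE
)
print(key_demo)
```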

7. You should now be ready to work with the “mars” dataset.

  1. Preview the data so that you can see the names of the columns. There are several possible functions to do this.
  2. Determine the class of the columns using str(). Write your answer as a comment inside the R chunk below.
str(mars)
## spc_tbl_ [3,197 × 13] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ earth_year     : num [1:3197] 2022 2022 2022 2022 2022 ...
##  $ earth_date     : chr [1:3197] "01-26 UTC" "01-25 UTC" "01-24 UTC" "01-23 UTC" ...
##  $ mars_date      : chr [1:3197] "Mars, Month 6 - LS 163deg" "Mars, Month 6 - LS 163deg" "Mars, Month 6 - LS 162deg" "Mars, Month 6 - LS 162deg" ...
##  $ solar_day      : num [1:3197] 3368 3367 3366 3365 3364 ...
##  $ max_ground_temp: num [1:3197] -3 -3 -4 -6 -7 -8 -4 -6 -6 -9 ...
##  $ min_ground_temp: num [1:3197] -71 -72 -70 -70 -71 -71 -72 -70 -71 -71 ...
##  $ max_air_temp   : num [1:3197] 10 10 8 9 8 8 5 5 3 5 ...
##  $ min_air_temp   : num [1:3197] -84 -87 -81 -91 -92 -80 -84 -73 -89 -80 ...
##  $ mean_pressure  : num [1:3197] 707 707 708 707 708 707 706 705 707 708 ...
##  $ sunrise        : 'hms' num [1:3197] 05:25:00 05:25:00 05:25:00 05:26:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ sunset         : 'hms' num [1:3197] 17:20:00 17:20:00 17:21:00 17:21:00 ...
##   ..- attr(*, "units")= chr "secs"
##  $ UV_Radiation   : chr [1:3197] "moderate" "moderate" "moderate" "moderate" ...
##  $ weather        : chr [1:3197] "Sunny" "Sunny" "Sunny" "Sunny" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   earth_year = col_double(),
##   ..   earth_date = col_character(),
##   ..   mars_date = col_character(),
##   ..   solar_day = col_double(),
##   ..   max_ground_temp = col_double(),
##   ..   min_ground_temp = col_double(),
##   ..   max_air_temp = col_double(),
##   ..   min_air_temp = col_double(),
##   ..   mean_pressure = col_double(),
##   ..   sunrise = col_time(format = ""),
##   ..   sunset = col_time(format = ""),
##   ..   UV_Radiation = col_character(),
##   ..   weather = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>
# Column classes, based on the str() output above:
# character (chr): earth_date, mars_date, UV_Radiation, weather
# numeric (num): earth_year, solar_day, max_ground_temp, min_ground_temp, max_air_temp, min_air_temp, mean_pressure
# time ('hms' num): sunrise, sunset

8. How many data points (rows) are in the dataset? How many variables (columns) are recorded for each data point?

dim(mars)
## [1] 3197   13
nrow(mars)
## [1] 3197
# There are 3197 data points in the dataset and 13 variables.

9. Filter out (i.e., remove) measurements from earlier than 2015 (according to the Earth year), as well as any rows with missing data (NA). Replace the original “mars” object by reassigning the new filtered dataset to “mars”. How many data points are left after filtering?

Hint: use drop_na() to remove rows with missing values.

mars <- drop_na(mars)
mars <- filter(mars, earth_year > 2014)
nrow(mars)
## [1] 2393
# OR
mars <- mars %>% drop_na() %>% filter(earth_year > 2014)
nrow(mars)
## [1] 2393
# There are 2393 measurements left after removing rows with missing values and filtering by year.
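The effect of the two steps is easier to follow on a small made-up table. This is a sketch with invented values, not the Mars data: drop_na() removes any row containing an NA, and filter() then keeps only rows from after 2014.

```r
library(dplyr)
library(tidyr)

# Toy data (invented values): one early year and two rows with missing values.
toy <- tibble(
  earth_year = c(2013, 2015, 2016, NA),
  max_air_temp = c(1, NA, 3, 4)
)

toy_filtered <- toy %>%
  drop_na() %>%                # drops the rows containing NA
  filter(earth_year > 2014)    # drops the 2013 row

nrow(toy_filtered)
```

Only the 2016 row survives both steps in this toy table.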

10. From this point on, work with the filtered “mars” dataset from the above question. A Martian year is equivalent to 668.6 sols (or solar days). Create a new variable (column) called “years_since_landing” that shows how many Martian years the Curiosity rover had been on Mars for each measurement (divide “solar_day” by 668.6). Check to make sure the new column is there.

Hint: use the mutate() function.

mars <- mars %>% mutate(years_since_landing = solar_day / 668.6)
# OR
mars <- mutate(mars, years_since_landing = solar_day / 668.6)
colnames(mars)
##  [1] "earth_year"          "earth_date"          "mars_date"          
##  [4] "solar_day"           "max_ground_temp"     "min_ground_temp"    
##  [7] "max_air_temp"        "min_air_temp"        "mean_pressure"      
## [10] "sunrise"             "sunset"              "UV_Radiation"       
## [13] "weather"             "years_since_landing"

11. What is the range of the maximum ground temperature (“max_ground_temp”) of the dataset?

range(mars %>% pull(max_ground_temp))
## [1] -67  11
# OR
gtemp_max <- pull(mars, max_ground_temp)
range(gtemp_max)
## [1] -67  11
# OR
range(mars$max_ground_temp)
## [1] -67  11
table(mars$max_ground_temp)
## 
## -67 -54 -53 -37 -35 -34 -33 -32 -31 -30 -29 -28 -27 -26 -25 -24 -23 -22 -21 -20 
##   1   1   1   1   2   2  11  25  33  41  71  78  69  88  71  79  85  84  71  72 
## -19 -18 -17 -16 -15 -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0 
##  56  47  46  59  56  77  70  77  65  71  51  68  89  97  72  81  75  64  68  59 
##   1   2   3   4   5   6   7   8  10  11 
##  36  31  30  24  18   7   7   4   1   1
# The range is -67 degrees Celsius to 11 degrees Celsius. 

12. Create a random sample of atmospheric pressure readings from mars. To determine the column that corresponds to atmospheric pressure, check the “key” corresponding to the data dictionary that you imported above in question 6. Use sample() and pull(). Remember that by default random samples differ each time you run the code.

sample(pull(mars, mean_pressure), size = 20)
##  [1] 814 858 861 846 813 863 732 726 842 874 730 848 902 864 856 860 871 771 889
## [20] 858
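If a repeatable sample is needed, set.seed() can be called before sample(). The sketch below uses a handful of made-up pressure values, and the seed 123 is arbitrary; any fixed number would do.

```r
# Toy pressure values (illustrative only).
pressure <- c(814, 858, 861, 846, 813, 863, 732, 726)

# Setting the same seed before each call makes the draws identical.
set.seed(123)
first_draw <- sample(pressure, size = 3)

set.seed(123)
second_draw <- sample(pressure, size = 3)

identical(first_draw, second_draw)
```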

13. How many data points are from days where the maximum ground temperature got above 0 degrees Celsius? What percent/proportion do these represent? Use:

# How many data points are from days where the maximum ground temperature got above or equal to 0 degrees Celsius?
nrow(mars %>% filter(max_ground_temp >= 0))
## [1] 218
# OR
mars %>%
  group_by(max_ground_temp >= 0) %>%
  summarize(total = n())
## # A tibble: 2 × 2
##   `max_ground_temp >= 0` total
##   <lgl>                  <int>
## 1 FALSE                   2175
## 2 TRUE                     218
# OR
sum(mars$max_ground_temp >= 0)
## [1] 218
# OR
table(mars$max_ground_temp >= 0)
## 
## FALSE  TRUE 
##  2175   218
# what percent/proportion do these represent?
nrow(mars %>% filter(max_ground_temp >= 0)) / nrow(mars)
## [1] 0.09109904
# OR
mean(mars$max_ground_temp >= 0, na.rm = TRUE)
## [1] 0.09109904
# There are 218 data points from days where the maximum ground temperature reached 0 degrees Celsius or higher. These make up about 9.1% of the data.
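The code above uses >= 0, so days with a maximum of exactly 0 are counted. Going by the table() output in question 11, 59 days had a maximum ground temperature of exactly 0, so a strict > 0 comparison would give 159 rather than 218. The toy sketch below (made-up values) shows the difference between the two operators, and why the mean() of a logical vector gives a proportion.

```r
# Toy temperatures (illustrative only).
temps <- c(-5, 0, 3, -2)

# TRUE counts as 1 and FALSE as 0, so sum() counts matches
# and mean() gives the proportion of matches.
at_or_above <- sum(temps >= 0)        # includes the 0
strictly_above <- sum(temps > 0)      # excludes the 0
prop_at_or_above <- mean(temps >= 0)

c(at_or_above, strictly_above, prop_at_or_above)
```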

14. How many different UV radiation levels (“UV_Radiation”) are there?

Hint: use length() with unique() or table(). Remember to pull() the right column.

mars %>%
  pull(UV_Radiation) %>%
  unique() %>%
  length()
## [1] 4
# OR
length(unique(mars %>% pull(UV_Radiation)))
## [1] 4
# OR
length(unique(mars$UV_Radiation))
## [1] 4
# OR
table(unique(mars$UV_Radiation))
## 
##      high       low  moderate very_high 
##         1         1         1         1
# 4 unique levels.

15. How many different weather conditions (“weather”) are reported?

mars %>%
  pull(weather) %>%
  unique() %>%
  length()
## [1] 1
# 1 weather condition.

16. Which UV radiation level had the highest maximum air temperature, and what was it?

Hint: Use group_by() with summarize().

mars %>%
  group_by(UV_Radiation) %>%
  summarize(mean = mean(max_air_temp))
## # A tibble: 4 × 2
##   UV_Radiation   mean
##   <chr>         <dbl>
## 1 high           5.66
## 2 low          -11.3 
## 3 moderate      -1.38
## 4 very_high     12.5
# The "very_high" UV level had the highest average maximum air temperature, about 12.5 degrees Celsius.

17. Extend the code you wrote for question 16. Use the arrange() function to sort the output by maximum air temperature.

mars %>%
  group_by(UV_Radiation) %>%
  summarize(mean = mean(max_air_temp)) %>%
  arrange(desc(mean))
## # A tibble: 4 × 2
##   UV_Radiation   mean
##   <chr>         <dbl>
## 1 very_high     12.5 
## 2 high           5.66
## 3 moderate      -1.38
## 4 low          -11.3
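The code above compares the average of “max_air_temp” within each UV level. If the question is read as asking for the single highest value in each group, max() can be used in summarize() instead. The sketch below computes both on a small made-up table; the column names “mean_temp” and “highest_temp” are arbitrary.

```r
library(dplyr)

# Toy data (invented values) with two UV levels.
toy <- tibble(
  UV_Radiation = c("low", "low", "high", "high"),
  max_air_temp = c(-10, 2, 4, 6)
)

toy_summary <- toy %>%
  group_by(UV_Radiation) %>%
  summarize(
    mean_temp = mean(max_air_temp),     # average within each group
    highest_temp = max(max_air_temp)    # single largest value within each group
  ) %>%
  arrange(desc(highest_temp))

toy_summary
```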

18. How many measurements were taken on days when the UV radiation was “low” and the maximum air temperature was above freezing? Use:

mars %>%
  filter(UV_Radiation == "low" & max_air_temp > 0) %>%
  tally()
## # A tibble: 1 × 1
##       n
##   <int>
## 1     3
# OR
mars %>%
  filter(UV_Radiation == "low" & max_air_temp > 0) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1     3
# OR
sum(mars$UV_Radiation == "low" & mars$max_air_temp > 0)
## [1] 3
# A total of 3 days.

19. On how many days was the UV radiation “high” or “very high”? Use:

mars %>%
  filter(UV_Radiation == "high" | UV_Radiation == "very_high") %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1125
# OR
mars %>%
  filter(UV_Radiation %in% c("high", "very_high")) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1  1125
# OR
sum(mars$UV_Radiation == "high" | mars$UV_Radiation == "very_high")
## [1] 1125
# OR
sum(mars$UV_Radiation %in% c("high", "very_high"))
## [1] 1125
# A total of 1125 days.

20. Select all columns in “mars” whose names start with “min” (using select() and starts_with()). Then, use colMeans() to summarize across these columns.

mars %>%
  select(starts_with("min")) %>%
  colMeans()
## min_ground_temp    min_air_temp 
##       -74.94317       -80.56080

21. Using “mars”, create a new binary (TRUEs and FALSEs) column to indicate if the day’s maximum air temperature was above freezing. Call the new column “above_freezing”.

mars <- mars %>% mutate(above_freezing = (max_air_temp > 0))

22. What is the average atmospheric pressure for days that have an air temperature above freezing and UV radiation level of “moderate”? How does this compare with days that do NOT fit these criteria?

mean_mod_warm <- mars %>%
  filter(above_freezing == TRUE & UV_Radiation == "moderate") %>%
  summarize(mean = mean(mean_pressure)) %>%
  pull()

mean_not_mod_warm <- mars %>%
  filter(above_freezing != TRUE | UV_Radiation != "moderate") %>%
  summarize(mean = mean(mean_pressure)) %>%
  pull()

# Days that are above freezing with a UV level of "moderate" have an average atmospheric pressure of 820.7 Pa, while days not fitting these criteria have an average atmospheric pressure of 826.6 Pa.

23. Among days with a “moderate” UV level that are above freezing, what is the distribution of the earth year in which these days occurred?

mod_warm <- mars %>% filter(UV_Radiation == "moderate" & above_freezing == TRUE)
mod_warm %>%
  group_by(earth_year) %>%
  select(earth_year) %>%
  table()
## earth_year
## 2015 2016 2017 2018 2019 2020 2021 2022 
##   41   31    6   74   72  152  126   17
# OR
mod_warm <- mars %>% filter(UV_Radiation == "moderate" & above_freezing == TRUE)
mod_warm %>%
  group_by(earth_year) %>%
  count()
## # A tibble: 8 × 2
## # Groups:   earth_year [8]
##   earth_year     n
##        <dbl> <int>
## 1       2015    41
## 2       2016    31
## 3       2017     6
## 4       2018    74
## 5       2019    72
## 6       2020   152
## 7       2021   126
## 8       2022    17
# OR
mod_warm <- mars %>% filter(UV_Radiation == "moderate" & above_freezing == TRUE)
mod_warm %>%
  group_by(earth_year) %>%
  tally()
## # A tibble: 8 × 2
##   earth_year     n
##        <dbl> <int>
## 1       2015    41
## 2       2016    31
## 3       2017     6
## 4       2018    74
## 5       2019    72
## 6       2020   152
## 7       2021   126
## 8       2022    17

24. How many days (using filter() or sum()) have a maximum ground or air temperature above zero and have a UV level of “high” or “very_high”?

sum((mars$max_ground_temp > 0 | mars$max_air_temp > 0) & (mars$UV_Radiation == "high" | mars$UV_Radiation == "very_high"))
## [1] 886
# OR
sum((mars$max_ground_temp > 0 | mars$max_air_temp > 0) & mars$UV_Radiation %in% c("high", "very_high"))
## [1] 886
# OR
# Inside filter(), refer to columns by name rather than with mars$.
mars %>%
  filter((max_ground_temp > 0 | max_air_temp > 0) & UV_Radiation %in% c("high", "very_high")) %>%
  count()
## # A tibble: 1 × 1
##       n
##   <int>
## 1   886
# A total of 886 days. 

25. Make a boxplot (boxplot()) that looks at earth year (“earth_year”) on the x-axis and minimum air temperature (“min_air_temp”) on the y-axis.

# The formula interface with a data argument gives the same plot with cleaner axis labels.
boxplot(min_air_temp ~ earth_year, data = mars)

26. Knit your document into a report.

Use the Knit button to do this. Make sure all your code is working first!