Due: 1 hour after class ends
You will work in your pre-assigned teams. Each team should submit ONLY ONE report for this exercise. You must write the names of all team members at the top of the report containing your responses. You all must do the work using one student’s computer and R/RStudio.
Have one team member open R/RStudio on their computer and share their screen with the other team members within the breakout room. At the top of the team report, write “host” in parentheses beside this student’s name. Have another team member be responsible for documenting the responses. At the top of the team report, write “writer” in parentheses beside this student’s name.
NOTE: Generally, you will not be penalized for taking on these roles only a few times during the semester. The aim is simply to ensure that the roles are rotated a “decent number” of times within each team throughout the semester. That said, I will penalize any student who obviously dominates these roles over everyone else, so be sure to give other students an opportunity to take them on.
You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.
Gradescope will let you select your teammates when submitting, so make sure to do so. Only one person needs to submit the report on Gradescope. You can submit your document in most common formats, but PDF files are preferred. Submit on Gradescope here: https://www.gradescope.com/courses/157499/assignments. Be sure to submit under the right assignment entry.
The purpose of this exercise is to give you additional practice working with multiple linear regression. We will continue with the Kaggle beer consumption dataset (https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo/). You will demonstrate the impacts of some variables on beer consumption in a given region and the consumption forecast for certain scenarios. The data (sample) were collected in Sao Paulo, Brazil, in a university area, where there are some parties with groups of students from 18 to 28 years of age (average).
Kaggle is a great online community of data scientists. To learn more about Kaggle, follow this link: https://www.kaggle.com/getting-started/44916.
This is the same data from the last in-class exercise, so you should already have it saved locally. Just in case you do not, follow the instructions below.
Download the data (named consumo_cerveja.csv) from Sakai and save it locally to the same directory as your R Markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) In-Class Analyses. Once you have downloaded the data file into the SAME folder as your R Markdown file, load and clean the data by using the following R code.
It is always a good idea to look at the first few rows of the raw file before loading the data. In the raw ‘consumo_cerveja’ file, you will notice that commas are used both as decimal marks and as column separators. Thus, you need to tell R this by specifying the sep and dec options, as in the code below.
# the file is assumed to sit in the SAME folder as your R Markdown file
beer <- read.csv("consumo_cerveja.csv",
                 stringsAsFactors = FALSE, sep = ",",
                 dec = ",", nrows = 365)
# rename the variables
beer$date <- beer$Data
beer$temp_median_c <- beer$Temperatura.Media..C.
beer$temp_min_c <- beer$Temperatura.Minima..C.
beer$temp_max_c <- beer$Temperatura.Maxima..C.
beer$precip_mm <- beer$Precipitacao..mm.
beer$weekend <- factor(beer$Final.de.Semana)
beer$beer_cons_liters <- as.numeric(beer$Consumo.de.cerveja..litros.)
beer <- beer[ , 8:ncol(beer)]
After renaming the variables using the code above, your data will be saved in the object beer, and the relevant variables plus their meanings are given in the table below:
Variable | Description |
---|---|
date | Date the data for each observation was recorded. |
temp_median_c | Median temperature in \(^\circ\)C. |
temp_min_c | Minimum temperature in \(^\circ\)C. |
temp_max_c | Maximum temperature in \(^\circ\)C. |
precip_mm | Precipitation in mm. |
weekend | Indicator variable for weekend: 1 = weekend, 0 = weekday. |
beer_cons_liters | Beer consumption in liters. |
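As a small illustration of how the sep and dec options interact, the sketch below reads a two-row sample from a string instead of from the real file. The rows are made up for illustration and only mimic the layout described above (quoted values with a decimal comma, plus a consumption column written with a period); the object names raw_text and toy are hypothetical.

```r
# Made-up sample that imitates the layout of the raw file (NOT real data):
# columns separated by commas, temperatures quoted and using a decimal comma.
raw_text <- 'Data,Temperatura Media (C),Consumo de cerveja (litros)
2015-01-01,"27,3",25.461
2015-01-02,"26,85",28.972'

toy <- read.csv(text = raw_text, stringsAsFactors = FALSE,
                sep = ",", dec = ",")

# Inspect the column names and types that R produced
str(toy)

# The consumption values use a period, so with dec = "," they may not be
# parsed as numbers; converting explicitly covers both cases.
toy$beer_cons_liters <- as.numeric(toy$Consumo.de.cerveja..litros.)
toy
```

The quotes around the temperature values are what allow the comma to serve as both the separator and the decimal mark without ambiguity.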
Continue with the rain variable from last time. Again, in R, remember to treat the variable as a factor variable, not a numeric one. Fit the same model as last time. That is, fit a linear model for log(beer_cons_liters) using weekend, rain, and temp_median_c as your predictors.
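For orientation, here is a minimal sketch of what such a fit looks like in R. It runs on simulated data, not on the beer dataset: the data frame toy, the simulated coefficients, and the coding of rain as a 0/1 factor are all assumptions made for illustration, since the definition of rain from the previous exercise is not restated here.

```r
# Simulated stand-in data (NOT the beer data); all numbers are arbitrary.
set.seed(42)
n <- 200
toy <- data.frame(
  weekend = factor(rbinom(n, 1, 2 / 7)),   # assumed 0/1 indicator
  rain = factor(rbinom(n, 1, 0.4)),        # assumed 0/1 indicator
  temp_median_c = runif(n, 13, 29)
)
toy$beer_cons_liters <- exp(
  2.8 + 0.2 * (toy$weekend == "1") - 0.05 * (toy$rain == "1") +
    0.02 * toy$temp_median_c + rnorm(n, sd = 0.1)
)

# Linear model for the log of consumption with three predictors
fit <- lm(log(beer_cons_liters) ~ weekend + rain + temp_median_c, data = toy)
summary(fit)$coefficients

# Because the response is logged, exponentiated coefficients act as
# multiplicative effects on consumption.
exp(coef(fit))
```

Because weekend and rain are stored as factors, R creates the indicator columns itself and labels the coefficients weekend1 and rain1.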
Write code to perform \(k\)-fold cross validation. Refer back to the class notes for details on \(k\)-fold cross validation. Let \(k=10\) and use average RMSE as the metric for quantifying predictive error. What is the average RMSE for your regression model?
If you were able to complete this question last time, just write your answer down again.
Hint: if you are not sure how to begin writing your code for doing the cross validation, you should consider writing a “for loop”. A “for loop” is actually not the most efficient way to get this done but it will work just fine here. If you don’t know how to write a “for loop”, use the skeleton code below as a guide for writing your own “for loop” for this question.
# Suppose your data is stored in the object "Data"
# First set a seed to ensure your results are reproducible
set.seed(...) # use whatever number you want
# Now randomly re-shuffle the data
Data <- Data[sample(nrow(Data)),]
# Define the number of folds you want
K <- ...
# Define a matrix to save your results into
RMSE <- matrix(0,nrow=K,ncol=1)
# Split the row indexes into k equal parts
kth_fold <- cut(seq(1,nrow(Data)),breaks=K,labels=FALSE)
# Now write the for loop for the k-fold cross validation
for(k in 1:K){
  # Split your data into the training and test datasets
  test_index <- which(kth_fold==k)
  train <- Data[-test_index,]
  test <- Data[test_index,]
  # Now that you've split the data,
  RMSE[k,] <- ... # write your code for computing RMSE for each k here
  # You should consider using your code for question 7 above
}
... #Calculate the average of all values in the RMSE matrix here.
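To show the mechanics of such a loop end to end without touching the beer data, the sketch below runs the same shuffle, split, fit, and predict steps on R's built-in mtcars dataset. The function name cv_rmse, the choice of K = 4, and the model formulas are assumptions made for this illustration only; RMSE is computed here on the scale of the model's response.

```r
# Hypothetical helper: average RMSE over K folds for a linear model.
cv_rmse <- function(formula, data, K = 4, seed = 1) {
  set.seed(seed)                                  # reproducible shuffle
  data <- data[sample(nrow(data)), ]              # shuffle the rows
  fold_id <- cut(seq_len(nrow(data)), breaks = K, labels = FALSE)
  rmse_by_fold <- numeric(K)
  for (k in seq_len(K)) {
    test_index <- which(fold_id == k)
    train <- data[-test_index, ]
    test <- data[test_index, ]
    fit <- lm(formula, data = train)              # fit on training folds only
    predicted <- predict(fit, newdata = test)
    # Observed response as the formula defines it (including any transform)
    observed <- model.response(model.frame(formula, data = test))
    rmse_by_fold[k] <- sqrt(mean((observed - predicted)^2))
  }
  mean(rmse_by_fold)
}

avg_small <- cv_rmse(mpg ~ wt + hp, data = mtcars)
avg_large <- cv_rmse(mpg ~ wt * hp, data = mtcars)
c(small = avg_small, large = avg_large)
```

Wrapping the loop in a function is a design choice, not a requirement: it lets two candidate models be scored on identical folds simply by passing a different formula.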
Now, do EDA to explore interaction effects between all your predictors. Summarize your findings in a few sentences.
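One simple, non-graphical way to look for an interaction between a factor and a numeric predictor is to compare the fitted slope within each level of the factor. The sketch below does this on the built-in mtcars dataset as a stand-in; the names cars, am_f, and slopes are hypothetical, and the same idea would apply to weekend and temp_median_c.

```r
# Stand-in data: mtcars, with transmission type as the grouping factor
cars <- mtcars
cars$am_f <- factor(cars$am, labels = c("automatic", "manual"))

# Slope of mpg on wt, estimated separately within each group
slopes <- sapply(
  split(cars, cars$am_f),
  function(d) coef(lm(mpg ~ wt, data = d))[["wt"]]
)
slopes
```

Clearly different slopes across groups suggest that an interaction term may be worth including; the same comparison can be drawn as a scatterplot with one fitted line per group.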
Extend your linear model to include interaction terms between weekend and the other two predictors. Are the interaction terms significant? What are the implications of these findings in the context of the data?
To include an interaction term between two variables x1 and x2 in your linear model, use x1 + x2 + x1:x2 OR x1*x2 on the right-hand side of the model formula.
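In R's formula syntax, x1:x2 denotes the interaction term alone, while x1*x2 expands to x1 + x2 + x1:x2. The sketch below checks on the built-in mtcars dataset that the two ways of writing the model give the same fit; the names cars, am_f, fit_star, and fit_colon are hypothetical.

```r
cars <- mtcars
cars$am_f <- factor(cars$am)   # factor predictor, analogous to weekend

# Two equivalent ways to request main effects plus their interaction
fit_star  <- lm(mpg ~ wt * am_f, data = cars)
fit_colon <- lm(mpg ~ wt + am_f + wt:am_f, data = cars)

# The estimated coefficients agree
all.equal(coef(fit_star), coef(fit_colon))

# The interaction row reports how the wt slope differs between groups
summary(fit_star)$coefficients
```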
Do stepwise model selection using AIC and BIC with the “full model” set to the model from question 3. Summarize your findings in a few sentences.
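A minimal sketch of stepwise selection with base R's step() function follows, again on mtcars instead of the beer data. By default step() penalizes each parameter with k = 2, which corresponds to AIC; setting k = log(n) gives the BIC penalty. The model formula and object names are assumptions for illustration, and backward elimination from the full model is only one possible search direction.

```r
cars <- mtcars
cars$am_f <- factor(cars$am)
n <- nrow(cars)

# "Full" model with interactions between the factor and both numeric predictors
full <- lm(mpg ~ am_f * wt + am_f * hp, data = cars)

# Backward elimination using AIC (penalty k = 2 per parameter)
aic_fit <- step(full, direction = "backward", trace = 0)

# Backward elimination using BIC (penalty k = log(n) per parameter)
bic_fit <- step(full, direction = "backward", k = log(n), trace = 0)

formula(aic_fit)
formula(bic_fit)
```

Setting trace = 0 hides the step-by-step log; leaving it at its default prints each candidate deletion, which can help when summarizing the findings.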
Use your code for the \(k\)-fold cross validation from question 1 to compute the average RMSE for the new model in question 3. Is the new model's RMSE lower or higher than your result from question 1? What can you infer from that?
This exercise is based on ideas proposed by Sam Voisin.