IDS 702 In-class analysis 1

Beer consumption in Sao Paulo I

August 20, 2020

Due: 1 hour after class ends

Housekeeping

Structure and format

You will work in your pre-assigned teams. Each team should submit ONLY ONE report for this exercise. You must write the names of all team members at the top of the report containing your responses. You all must do the work using one student’s computer and R/RStudio.

Have one team member open R/RStudio on their computer and share their screen with the other team members within the breakout room. At the top of the team report, write “host” in parenthesis besides this student’s name. Have another team member be responsible for documenting the responses. At the top of the team report, write “writer” in parenthesis besides this student’s name.

NOTE: Generally, you will not be penalized for not taking on these roles many times during the semester. This is to simply ensure that you do switch the roles around a “decent number” of times within each team throughout the semester. That said, I will penalize any student who obviously dominates these roles over everyone else, so be sure to give other students an opportunity to do them.

R/RStudio

You all should have R and RStudio installed on your computers by now. If you do not, first install the latest version of R here: https://cran.rstudio.com (remember to select the right installer for your operating system). Next, install the latest version of RStudio here: https://www.rstudio.com/products/rstudio/download/. Scroll down to the “Installers for Supported Platforms” section and find the right installer for your operating system.

Gradescope

Gradescope will let you select your team mates when submitting, so make sure to do so. Only one person needs to submit the sheet on Gradescope. You can submit your document in the most common formats, but pdf files are preferred. Submit on Gradescope here: https://www.gradescope.com/courses/157499/assignments. Be sure to submit under the right assignment entry.

Introduction

The purpose of this exercise is to give you additional practice working with multiple linear regression. This analysis is based on the Kaggle beer consumption dataset found here: https://www.kaggle.com/dongeorge/beer-consumption-sao-paulo/. You will demonstrate the impacts of some variables on beer consumption in a given region and the consumption forecast for certain scenarios. The data (sample) were collected in Sao Paulo, Brazil, in a university area, where there are some parties with groups of students from 18 to 28 years of age (average).

Kaggle is a great online community of data scientists. To learn more about Kaggle, follow this link: https://www.kaggle.com/getting-started/44916.

The data

Download the data (named consumo_cerveja.csv) from Sakai and save it locally to the same directory as your R markdown file. To find the data file on Sakai, go to Resources \(\rightarrow\) Datasets \(\rightarrow\) In-Class Analyses. Once you have downloaded the data file into the SAME folder as your R markdown file, load and clean the data by using the following R code.

It is always a good idea to take a look at the first few rows of the raw file to see what the data looks like before loading the data. In this raw ‘consumo_cerveja’ file, you will notice that commas are actually used both as decimals and to separate the columns. Thus, you need to let R know by specifying the sep and dec options as in the code below.

beer <- read.csv("data/consumo_cerveja.csv",
                 stringsAsFactors = FALSE, sep = ",",dec=",",nrows=365)
# rename the variables
beer$date <- beer$Data
beer$temp_median_c <- beer$Temperatura.Media..C.
beer$temp_min_c <- beer$Temperatura.Minima..C.
beer$temp_max_c <- beer$Temperatura.Maxima..C.
beer$precip_mm <- beer$Precipitacao..mm.
beer$weekend <- factor(beer$Final.de.Semana)
beer$beer_cons_liters <- as.numeric(beer$Consumo.de.cerveja..litros.)
beer <- beer[ , 8:ncol(beer)]

After renaming the variables using the code above, your data will be saved in the object beer, and the relevant variables plus their meanings are given in the table below:

Variable Description
date Date the data for each observation was recorded.
temp_median_c Median temperature in \(^0C\).
temp_min_c Minimum temperature in \(^0C\).
temp_max_c Maximum temperature in \(^0C\).
precip_mm Precipitation in \(mm\).
weekend Indicator variable for weekend: 1 = weekend, 0 = weekday.
beer_cons_liters Beer consumption in liters.

Exercises

Treat the variable beer_cons_liters as your response variable and the other variables as potential predictors.

  1. Make a histogram of beer_cons_liters. Describe the distribution. Is the normality assumption a plausible one here? You don’t need to include any plots in your report.

  2. Make exploratory plots of beer_cons_liters versus each potential predictor. Are all the relationships linear? If any one of them is nonlinear, describe the relationship. You don’t need to include any plots in your report.

  3. Intuitively, and without using any statistical test or method, does it make sense to include all three of temp_median_c, temp_min_c and temp_max_c as predictors in a MLR model for predicting beer_cons_liters? Justify your response in one or two sentences.

  4. Fit a linear model for beer_cons_liters using weekend, precip_mm, and temp_median_c as your predictors. Interpret all the significant parameters of the fitted regression model in context of the data.

  5. Based on your results, what percent of the variability in beer_cons_liters is explained by your model? Also, which of the variables appears to be the best covariate for explaining or predicting beer consumption? Why?

Acknowledgement

This exercise is based on ideas proposed by Sam Voisin.