class: center, middle, inverse, title-slide # IDS 702: Module 1.1 ## Motivating example ### Dr. Olanrewaju Michael Akande --- ## Introduction - By now, you should already be familiar with t-tests and simple linear regression (SLR). -- - At the very least, you should know the basics. -- - Specifically, you should know how to fit a SLR model and assess whether or not the model assumptions are violated. -- - We will use those ideas as building blocks for the models we will explore throughout this course. --- ## Motivating example - In the 1970’s, Harris Trust and Savings Bank was sued for discrimination on the basis of sex. -- - As evidence, the defense presented analysis of salaries of employees of one type (skilled, entry level clerical). -- - The data is in the file `wagediscrim.txt` on Sakai. -- - We are interested in answering the question: **did female employees tend to receive lower base/starting salaries than similarly qualified and experienced male employees?** -- <div class="question"> Which statistical tests can we use to probe the question above? </div> --- ## Data 93 employees on data file (61 female, 32 male). Variable | Description :------------- | :------------ bsal | Annual salary at time of hire sal77 | Annual salary in 1977. educ | years of education. exper | months previous work prior to hire at bank. fsex | 1 if female, 0 if male senior | months worked at bank since hired age | months Since we care about inference on .hlight[bsal], as our response variable, we will exclude .hlight[sal77] for all analysis. -- <div class="question"> Is this reasonable? Why or why not? </div> --- ## Data How many rows? How many columns? ```r wages <- read.csv("data/wagediscrim.txt", header= T) dim(wages) ``` ``` ## [1] 93 8 ``` Take a look at the first few rows of the data. ```r head(wages) ``` ``` ## bsal sal77 sex senior age educ exper fsex ## 1 5040 12420 Male 96 329 15 14.0 0 ## 2 6300 12060 Male 82 357 15 72.0 0 ## 3 6000 15120 Male 67 315 15 35.5 0 ## 4 6000 16320 Male 97 354 12 24.0 0 ## 5 6000 12300 Male 66 351 12 56.0 0 ## 6 6840 10380 Male 92 374 15 41.5 0 ``` --- ## Data Check variable types. ```r wages$sex <- factor(wages$sex,levels=c("Male","Female")) wages$fsex <- factor(wages$fsex) str(wages) ``` ``` ## 'data.frame': 93 obs. of 8 variables: ## $ bsal : int 5040 6300 6000 6000 6000 6840 8100 6000 6000 6900 ... ## $ sal77 : int 12420 12060 15120 16320 12300 10380 13980 10140 12360 10920 ... ## $ sex : Factor w/ 2 levels "Male","Female": 1 1 1 1 1 1 1 1 1 1 ... ## $ senior: int 96 82 67 97 66 92 66 82 88 75 ... ## $ age : int 329 357 315 354 351 374 369 363 555 416 ... ## $ educ : int 15 15 15 12 12 15 16 12 12 15 ... ## $ exper : num 14 72 35.5 24 56 41.5 54.5 32 252 132 ... ## $ fsex : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ... ``` --- ## Exploratory data analysis (EDA) Next, quick summaries of each variable. ```r summary(wages) ``` ``` ## bsal sal77 sex senior age ## Min. :3900 Min. : 7860 Male :32 Min. :65.00 Min. :280.0 ## 1st Qu.:4980 1st Qu.: 9000 Female:61 1st Qu.:74.00 1st Qu.:349.0 ## Median :5400 Median :10020 Median :84.00 Median :468.0 ## Mean :5420 Mean :10393 Mean :82.28 Mean :474.4 ## 3rd Qu.:6000 3rd Qu.:11220 3rd Qu.:90.00 3rd Qu.:590.0 ## Max. :8100 Max. :16320 Max. :98.00 Max. :774.0 ## educ exper fsex ## Min. : 8.00 Min. : 0.0 0:32 ## 1st Qu.:12.00 1st Qu.: 35.5 1:61 ## Median :12.00 Median : 70.0 ## Mean :12.51 Mean :100.9 ## 3rd Qu.:15.00 3rd Qu.:144.0 ## Max. :16.00 Max. :381.0 ``` --- ## EDA Since we only care about comparing starting salaries for male and female employees for now, let's look at boxplots of .hlight[bsal] by .hlight[sex]. ```r ggplot(wages,aes(x=sex, y=bsal, fill=sex)) + geom_boxplot() + coord_flip() + scale_fill_brewer(palette="Blues") + labs(title="Baseline Salary vs Sex",y="Base Salary",x="Sex") + theme_classic() + theme(legend.position="none") ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-6-1.png" style="display: block; margin: auto;" /> -- <div class="question"> What do you think? What can you infer from this plot? </div> --- ## T-test? We could go further and try a t-test for the hypotheses. .small[ `$$H_0: \mu_{\textrm{male}} - \mu_{\textrm{female}} \leq 0 \ \ \textrm{vs.} \ \ H_A: \mu_{\textrm{male}} - \mu_{\textrm{female}} > 0$$` ] ```r t.test(bsal~sex,data=wages,alternative="greater") ``` ``` ## ## Welch Two Sample t-test ## ## data: bsal by sex ## t = 5.83, df = 51.329, p-value = 1.855e-07 ## alternative hypothesis: true difference in means is greater than 0 ## 95 percent confidence interval: ## 582.9857 Inf ## sample estimates: ## mean in group Male mean in group Female ## 5956.875 5138.852 ``` -- <div class="question"> Is a t-test sufficient here? Any concerns? </div> --- ## SLR? How about fitting a SLR model to the two variables. .block[ .small[ $$ \textrm{bsal}_i = \beta_0 + \beta_1 \textrm{sex}_i + \epsilon_i; \ \ \epsilon_i \overset{iid}{\sim} \mathcal{N}(0, \sigma^2), \ \ \ i = 1,\ldots, n.$$ ] ] ```r model1 <- lm(bsal~sex,data=wages); summary(model1) ``` ``` ## ## Call: ## lm(formula = bsal ~ sex, data = wages) ## ## Residuals: ## Min 1Q Median 3Q Max ## -1336.88 -338.85 43.12 261.15 2143.12 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 5956.9 105.3 56.580 < 2e-16 ## sexFemale -818.0 130.0 -6.293 1.08e-08 ## ## Residual standard error: 595.6 on 91 degrees of freedom ## Multiple R-squared: 0.3032, Adjusted R-squared: 0.2955 ## F-statistic: 39.6 on 1 and 91 DF, p-value: 1.076e-08 ``` -- <div class="question"> What can we infer from these results? </div> --- ## EDA - .block[ T-test shows men started at higher salaries than women `\((t=5.83, p < .0001)\)`; same conclusion from the regression. ] - But one could argue this is so because both methods **do not** control for other characteristics. Indeed, we have ignored the other variables. - There are other variables that are correlated with .hlight[bsal]. Here's the correlation matrix of all numerical variables using the .hlight[corr] function in R. <table class="table" style="font-size: 20px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> </th> <th style="text-align:right;"> bsal </th> <th style="text-align:right;"> sal77 </th> <th style="text-align:right;"> senior </th> <th style="text-align:right;"> age </th> <th style="text-align:right;"> educ </th> <th style="text-align:right;"> exper </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> bsal </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> -0.29 </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> 0.41 </td> <td style="text-align:right;"> 0.17 </td> </tr> <tr> <td style="text-align:left;"> sal77 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> -0.55 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> -0.37 </td> </tr> <tr> <td style="text-align:left;"> senior </td> <td style="text-align:right;"> -0.29 </td> <td style="text-align:right;"> 0.13 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> -0.18 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> -0.07 </td> </tr> <tr> <td style="text-align:left;"> age </td> <td style="text-align:right;"> 0.03 </td> <td style="text-align:right;"> -0.55 </td> <td style="text-align:right;"> -0.18 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> -0.23 </td> <td style="text-align:right;"> 0.80 </td> </tr> <tr> <td style="text-align:left;"> educ </td> <td style="text-align:right;"> 0.41 </td> <td style="text-align:right;"> 0.42 </td> <td style="text-align:right;"> 0.06 </td> <td style="text-align:right;"> -0.23 </td> <td style="text-align:right;"> 1.00 </td> <td style="text-align:right;"> -0.10 </td> </tr> <tr> <td style="text-align:left;"> exper </td> <td style="text-align:right;"> 0.17 </td> <td style="text-align:right;"> -0.37 </td> <td style="text-align:right;"> -0.07 </td> <td style="text-align:right;"> 0.80 </td> <td style="text-align:right;"> -0.10 </td> <td style="text-align:right;"> 1.00 </td> </tr> </tbody> </table> --- ## EDA Or visually (using the `ggcorrplot` package), ```r wages_corr <- round(cor(wages[,!is.element(colnames(wages),c("sex","fsex"))]),2) ggcorrplot(wages_corr, method = "circle",type = "lower", colors = c("#6D9EC1", "white", "#E46726")) ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-10-1.png" style="display: block; margin: auto;" /> --- ## EDA - So, let's take a look at scatter plots of all variables - First, recall the description of all the variables. Variable | Description :------------- | :------------ bsal | Annual salary at time of hire sal77 | Annual salary in 1977. educ | years of education. exper | months previous work prior to hire at bank. fsex | 1 if female, 0 if male senior | months worked at bank since hired age | months --- ## EDA ```r ggpairs(wages[,!is.element(colnames(wages),c("sal77","sex","fsex"))], mapping=ggplot2::aes(colour = "red4",alpha=0.6)) #GGally package ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-11-1.png" style="display: block; margin: auto;" /> This plot looks very busy! --- ## EDA Let's take a closer look one variable at a time. First, .hlight[bsal] vs. .hlight[senior]. ```r ggplot(wages,aes(x=senior, y=bsal)) + geom_point(alpha = .7,aes(color=sex)) + geom_smooth(method="lm",col="red3") + theme_classic() + labs(title="Baseline Salary vs Seniority",x="Seniority",y="Base Salary") ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-12-1.png" style="display: block; margin: auto;" /> --- ## EDA Next, .hlight[bsal] vs. .hlight[age] ```r ggplot(wages,aes(x=age, y=bsal)) + geom_point(alpha = .7,aes(color=sex)) + geom_smooth(method="lm",col="red3") + theme_classic() + labs(title="Baseline Salary vs Age",x="Age",y="Base Salary") ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-13-1.png" style="display: block; margin: auto;" /> --- ## EDA .hlight[bsal] vs. .hlight[educ] ```r ggplot(wages,aes(x=educ, y=bsal)) + geom_point(alpha = .7,aes(color=sex)) + geom_smooth(method="lm",col="red3") + theme_classic() + labs(title="Baseline Salary vs Education",x="Education",y="Base Salary") ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-14-1.png" style="display: block; margin: auto;" /> --- ## EDA Finally, .hlight[bsal] vs. .hlight[exper] ```r ggplot(wages,aes(x=exper, y=bsal)) + geom_point(alpha = .7,aes(color=sex)) + geom_smooth(method="lm",col="red3") + theme_classic() + labs(title="Baseline Salary vs Experience",x="Experience",y="Base Salary") ``` <img src="1-1-MLR-motivating-example_files/figure-html/unnamed-chunk-15-1.png" style="display: block; margin: auto;" /> --- ## Takeaways - Clearly, they are other variables that may be relevant in explaining baseline salary. -- - We need to explore other statistical methods than the t-test and simple linear regression. -- - We need methods that can explore the relationship between baseline salary and sex while also controlling for the other variables that clearly may be relevant. -- - This brings us to .hlight[multiple linear regression (MLR)]. -- - .block[ Something to keep in mind, the overall conclusions may not change after using a better model for this data. In general, this should never stop you from exploring and reporting the results from better models; you should always be rigorous when doing analyses and be honest when reporting the results! ] --- class: center, middle # What's next? ### Move on to the readings for the next module!