Instructions

This assignment involves logistic regression. You will use the same data from Data Analysis Assignment 3. YOU SHOULD USE THE SAME VERSION OF THE DATA YOU USED FOR THAT ASSIGNMENT. Please type your solutions using R Markdown, LaTeX or any other word processor, but YOU MUST knit or convert the final output file to “.pdf”. Submissions should be made on Gradescope: go to Assignments \(\rightarrow\) Data Analysis Assignment 3.

DO NOT INCLUDE R CODE OR OUTPUT IN YOUR SOLUTIONS/REPORTS. All R code can be included in an appendix, and R outputs should be converted to nicely formatted tables. Feel free to use R tools such as kable (from the knitr package), xtable, stargazer, etc.

Also, you can round ALL numbers/estimates to 2 decimal places (or up to 4 decimal places where needed to avoid reporting exact zeros).

Reminder: You are allowed and even encouraged to talk to each other about general concepts, or to the instructor/TAs. However, the write-ups, solutions, and code MUST be entirely your own work.

Questions

MATERNAL SMOKING AND PRE-TERM BIRTH. This question is a continuation of the question on maternal smoking and birth weights from the last assignment. Remember that the data file contains an indicator variable called Premature (gestational age < 270 days), which is just a recoding of gestational age. For this assignment, use it as your outcome/response variable.

Our questions of interest are very similar to the last assignment as well, except using the new response variable. The questions are as follows:

  • Do mothers who smoke tend to have higher chances of pre-term birth than mothers who do not smoke? What is a likely range for the odds ratio of pre-term birth for smokers and non-smokers?
  • Is there any evidence that the odds ratio of pre-term birth for smokers and non-smokers differs by mother’s race? If so, characterize those differences.
  • Are there other interesting associations with the odds of pre-term birth that are worth mentioning?
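One common reading of "a likely range for the odds ratio" is a confidence interval for the exponentiated smoking coefficient. The following is a minimal base R sketch of that idea on simulated data. The data frame `toy`, its columns `smoke` and `premature`, and the true coefficients are all made up for illustration and are not the course data.

```r
# Simulated toy data, NOT the course data.
set.seed(1)
n <- 500
toy <- data.frame(smoke = rbinom(n, 1, 0.4))
# Arbitrary true log-odds: -1.5 for non-smokers, -1.5 + 0.5 for smokers.
toy$premature <- rbinom(n, 1, plogis(-1.5 + 0.5 * toy$smoke))

# Logistic regression of the binary outcome on the smoking indicator.
fit <- glm(premature ~ smoke, data = toy, family = binomial)

# Estimated odds ratio (smokers vs. non-smokers) and a 95% Wald interval,
# obtained by exponentiating the coefficient and its interval endpoints.
or_hat <- exp(coef(fit)["smoke"])
or_ci  <- exp(confint.default(fit)["smoke", ])
round(c(OR = unname(or_hat), or_ci), 2)
```

With a single binary predictor, the fitted odds ratio matches the cross-product ratio of the 2 x 2 table of `smoke` by `premature`, which is a quick sanity check. In a model with more predictors, the same exponentiation gives an adjusted odds ratio.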

First build your model, then do model assessment and validation. You should only proceed to answer the questions when you are satisfied with your final model; you should answer all the questions using that final model.

Analyze the data and investigate these questions using a logistic regression model. Also, do the following.

  • Write a report (maximum of 5 pages) describing your findings. Code and additional plots can be placed in an appendix. You will be penalized should your report exceed 5 pages!
  • Be sure to also include the following in your report:
    • the model you ultimately decided to use,
    • clear model building, that is, justification for the final model (e.g., why you chose certain transformations and why you decided the final model is reasonable to use based on binned residual diagnostics, ROC curves and other metrics for model comparison, such as change in deviance),
    • the relevant regression output (includes: a table with coefficients and SEs, and p-values or confidence intervals; and somewhere in the text or table the “area under the curve” for your final model),
    • your interpretation of the results in the context of the questions of interest, and
    • any potential limitations of the analysis.
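As a rough illustration of two of the metrics named above, the base R sketch below compares nested logistic models by change in deviance and computes the area under the ROC curve from ranks. Everything here is an assumption for illustration: the data are simulated, and the names `toy`, `race`, `main_fit` and `int_fit` are hypothetical. Binned residual plots and ROC curves are usually drawn with add-on packages (for example `arm::binnedplot` and `pROC::roc`), which are not used here.

```r
# Simulated toy data, NOT the course data.
set.seed(2)
n <- 400
toy <- data.frame(
  smoke = rbinom(n, 1, 0.4),
  race  = factor(sample(c("white", "black", "asian"), n, replace = TRUE))
)
toy$premature <- rbinom(n, 1, plogis(-1.2 + 0.6 * toy$smoke))

# Nested models: main effects only vs. adding the smoke-by-race interaction.
main_fit <- glm(premature ~ smoke + race, data = toy, family = binomial)
int_fit  <- glm(premature ~ smoke * race, data = toy, family = binomial)

# Change-in-deviance (likelihood ratio) test for the interaction terms.
dev_test <- anova(main_fit, int_fit, test = "Chisq")
dev_test

# In-sample AUC via the rank (Mann-Whitney) identity:
# the probability that a random case gets a higher fitted value than a random non-case.
p_hat <- fitted(int_fit)
y  <- toy$premature
n1 <- sum(y == 1)
n0 <- sum(y == 0)
auc <- (sum(rank(p_hat)[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
auc
```

A small change in deviance relative to its degrees of freedom suggests the interaction adds little. The AUC here is computed on the same data used to fit the model, so it is optimistic.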

To help organize your thoughts, you should organize your report into sections as follows.

  • Summary: a few sentences describing the inferential question(s), the method used and the most important results.
  • Introduction: a short but more in-depth introduction to the inferential question(s) of interest. Here, you are basically restating the study and the questions of interest in your own words.
  • Data: your EDA, interesting features of the data, and how you dealt with missing/erroneous values. Try to include one or two plots of your most interesting EDA findings.
  • Model: a detailed description of the model used, how you selected the model, how you selected the variables, model assessment, model validation, and presentation of the model results. What are your overall conclusions in context of the inferential problem(s)? Try to include one or two plots that can help drive your point home.
  • Conclusion: the importance of your findings and potential limitations of the study.



A FEW THINGS TO KEEP IN MIND, AS IN THE LAST ASSIGNMENT:

Because this is the same dataset as in the last assignment, it has the same complexities. Some variables again have missing values. In particular, you will see from the .csv file that the height and weight of the father are missing quite frequently. This is typical in data on births: it is often difficult to get data about the fathers. I recommend that you not consider father’s height and weight when modeling. Some of the other variables have a few missing cases here and there. For this analysis, you can drop those cases from the modeling. This is not the ideal way to handle missing data in an analysis (we will learn better methods later in the course), but for now it will move the analysis forward.

I strongly recommend that you make a data file that has complete observations on every single case for all the variables you are thinking about including in the model, and run the regression using that file. For example, I posted such a file on the Sakai site that excludes all of the variables on the fathers. You are welcome to use this file, or make your own if you want to use fathers’ data. The modified data can be found in the file “smoking.csv” on Sakai.
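If you build your own complete-case file, one possible approach is to turn the "unknown" codes from the code book into NA and then keep only rows with no missing values. The sketch below uses a tiny made-up data frame, not the course data. Whether these codes actually occur in your version of the file is something to check yourself.

```r
# Made-up rows for illustration only; column names follow the code book.
toy <- data.frame(
  parity = c(1, 99, 0, 2, 3, 0),
  med    = c(2, 5, 9, 1, 4, 3),
  income = c(3, 7, 2, 98, 5, 1),
  mht    = c(64, 62, 65, 66, NA, 63)
)

# Recode the code book's "unknown" values to NA.
toy$parity[toy$parity %in% 99] <- NA          # 99 = unknown
toy$med[toy$med %in% 9] <- NA                 # 9 = unknown
toy$income[toy$income %in% c(98, 99)] <- NA   # 98 = unknown, 99 = not asked

# Keep only rows that are complete on every column being considered.
complete <- toy[complete.cases(toy), ]
nrow(complete)
```

Restricting `toy` to the columns you plan to model before calling `complete.cases` avoids dropping rows because of variables you never use.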

The file contains an indicator variable for Premature (gestational age < 270 days), which is just a recoding of gestational age; we use that as our outcome variable. The data files also contain two other outcome variables: gestational age and birth weight. Both of these could be affected by smoking, so both are outcomes rather than predictors. It does not make sense scientifically to include one as a predictor of the other; the two variables happen simultaneously and hence are a bivariate outcome. For this analysis, we exclude birth weight from the modeling. Of course, one could do a separate regression for birth weight to see if smoking has an effect on birth weight, as in the last assignment. Even better, one could treat birth weight and gestational age as a bivariate outcome and fit a regression model that predicts the bivariate outcome. This is a model we won’t have time to learn about in our course, but come find the instructor if you want to learn more.

The main file also includes information on the number of cigarettes smoked and on the timing for mothers who quit smoking. For this analysis you do not have to use those variables, as we just compare smokers and non-smokers. You can also ignore the birth date variable, collapse education categories 6-7 into a single trade school category, and collapse race categories 0-5 into a single category for race = white.
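The two collapsing steps can be sketched in base R as follows. The vectors are made-up examples using the code book's codes, and keeping the collapsed education level under code 6 is an arbitrary choice made here for illustration.

```r
# Made-up example codes, NOT the course data.
med   <- c(0, 2, 6, 7, 3, 5)
mrace <- c(0, 3, 5, 6, 7, 8, 9)

# Education: codes 6 and 7 become one "trade school" level (stored as 6 here).
med_collapsed <- ifelse(med %in% c(6, 7), 6, med)

# Race: codes 0-5 become "white"; the other codes take their code book labels.
race_labels <- c("6" = "mexican", "7" = "black", "8" = "asian", "9" = "mix")
mrace_label <- ifelse(mrace %in% 0:5, "white", race_labels[as.character(mrace)])

# Store as a factor with "white" as the reference level for regression.
mrace_f <- factor(mrace_label,
                  levels = c("white", "mexican", "black", "asian", "mix"))
table(mrace_f)
```

Treating the collapsed variables as factors, rather than numeric codes, keeps the model from imposing an ordering on the categories.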

Finally, regarding the fathers’ data, you might pay attention to correlation among the mothers’ and fathers’ values. For example, the mothers’ and fathers’ races might tend to be similar (use a “table” command to see the contingency table of the two races), in which case you have to be concerned about effects of multicollinearity if you want to include both mother’s and father’s races in the model.
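The contingency table mentioned above can be produced with `table`. The sketch below uses made-up race labels for eight hypothetical couples, not the course data, and adds a simple agreement rate as one possible summary.

```r
# Made-up labels for illustration only.
mrace <- c("white", "white", "black", "black", "asian", "white", "black", "mexican")
drace <- c("white", "white", "black", "black", "asian", "white", "white", "mexican")

# Use the same level set for both so the table is square and aligned.
lv <- c("white", "mexican", "black", "asian", "mix")
race_tab <- table(mother = factor(mrace, levels = lv),
                  father = factor(drace, levels = lv))
race_tab

# Share of couples on the diagonal, i.e. with matching race categories.
agree <- sum(diag(race_tab)) / sum(race_tab)
agree
```

When nearly all counts sit on the diagonal, the two variables carry almost the same information, and including both in the model can inflate standard errors.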

Code Book

Variable   Description
Id         id number
birth      birth date, where 1096 = January 1, 1961
gestation  length of gestation in days. Ignore this for this assignment.
bwt        birth weight in ounces (999 = unknown). Ignore this for this assignment.
parity     total number of previous pregnancies, including fetal deaths and stillbirths (99 = unknown)
mrace      mother’s race or ethnicity
             0-5 = white
             6 = mexican
             7 = black
             8 = asian
             9 = mix
             99 = unknown
mage       mother’s age in years at termination of pregnancy
med        mother’s education
             0 = less than 8th grade
             1 = 8th to 12th grade, did not graduate high school
             2 = high school graduate, no other schooling
             3 = high school graduate + trade school
             4 = high school graduate + some college
             5 = college graduate
             6, 7 = trade school but unclear if graduated from high school
             9 = unknown
mht        mother’s height in inches
mpregwt    mother’s pre-pregnancy weight in pounds
drace      father’s race or ethnicity
             0-5 = white
             6 = mexican
             7 = black
             8 = asian
             9 = mix
dage       father’s age in years at termination of pregnancy
ded        father’s education
             0 = less than 8th grade
             1 = 8th to 12th grade, did not graduate high school
             2 = high school graduate, no other schooling
             3 = high school graduate + trade school
             4 = high school graduate + some college
             5 = college graduate
             6, 7 = trade school but unclear if graduated from high school
             9 = unknown
dht        father’s height
dwt        father’s pre-pregnancy weight in pounds
marital    marital status of mother
             1 = married
             2 = legally separated
             3 = divorced
             4 = widowed
             5 = never married
income     family yearly income in 2500 increments. 0 = under 2500, 1 = 2500-4999, …, 9 = 15000+. 98 = unknown, 99 = not asked
smoke      does mother smoke?
             0 = never
             1 = smokes now
             2 = until pregnancy
             3 = once did, not now
time       if mother quit, how long ago did she quit?
             0 = never smoked
             1 = still smokes
             2 = quit during pregnancy
             3 = up to 1 yr ago
             4 = up to 2 yr ago
             5 = up to 3 yr ago
             6 = up to 4 yr ago
             7 = 5 to 9 yr ago
             8 = 10+ yr ago
             9 = quit and don’t know
             98 = unknown
number     number of cigs smoked a day for past and current smokers
             0 = never smoked
             1 = 1-4
             2 = 5-9
             3 = 10-14
             4 = 15-19
             5 = 20-29
             6 = 30-39
             7 = 40-60
             8 = 60+
             9 = smoke but don’t know
Premature  1 = baby born before gestational age of 270 days, and 0 = otherwise. This is a dichotomized version of the gestational age. We use it as the outcome/response variable.

Grading

20 points.