Overview

This assignment has two goals: to exercise your skills in using R for data analysis, and to recall basic ideas from descriptive statistics, visualization, hypothesis testing, and multiple linear regression. Your job in this assignment is to investigate the connection between maternal smoking and infant health, using data. You will accomplish this by working through a guided analysis, detailed below. This case study is adapted from Chapter 10 of Nolan and Speed (2000), but the presentation here is self-contained. Please read the entire assignment before you begin your work.

Maternal smoking and infant health

Nolan and Speed (2000) present the following quotation from the 1989 Report of the Surgeon General:

    . . . cigarette smoking seems to be a more significant determinant of birth weight than the mother’s pre-pregnancy height, weight, parity, payment status, or history of previous pregnancy outcome, or the infant’s sex. The reduction in birth-weight associated with maternal tobacco use seems to be a direct effect of smoking on fetal growth. Mothers who smoke also have increased rates of premature delivery.

(“Parity” refers to whether or not a pregnant woman has previously given birth. “Payment status” has to do with the type of the mother’s pre-natal health insurance.)

We can isolate two claims:

1. Mothers who smoke deliver premature babies more often than mothers who do not.
2. Cigarette smoking has a stronger relationship to infant birth weight than several other relevant covariates.

At the risk of stating the obvious, premature delivery and small, underweight newborns are bad things. The first step in deciding whether maternal smoking causes these bad outcomes is to figure out whether maternal smoking is associated with them; the latter is the content of these claims. You will study the claims in turn.
The dataset forming the basis of your analysis is (a subset of) the Child Health and Development Studies (CHDS), a large survey on all babies born between 1960 and 1967 at the Kaiser Foundation Hospital in Oakland, California. On the course website is the file babies.data. It contains observations (rows) for 1236 live male births. The variables recorded for each birth are given in the following table:

Name        Description
bwt         Newborn weight (rounded to the nearest ounce)
gestation   Length of the pregnancy (days)
parity      Whether the baby is (1) or is not (0) the first-born
age         Age of the mother at conception (years)
height      Mother’s height (inches)
weight      Mother’s weight (pounds)
smoke       Whether the mother smokes (1) or not (0)

What to submit

Write a report which addresses your findings about the claims. Summarize each claim in your own words, as you understand it. For each claim, outline why your analysis of the data ought to be informative, explain the practical meaning of the possible analysis outcomes, report what outcome you obtained, and describe your conclusions. Some specific guidelines appear in subsequent sections of this document.

Refer to figures and tables obtained from your R session whenever it seems helpful. Please remember to give every figure a title, axis labels with units, and (where appropriate) a legend. I strongly encourage you to install and use the R package ggplot2 to make your figures: once you learn how to use it, many otherwise difficult graphical tasks become simple one-line commands.

The report should be long enough to convey what you understood about the content of the claims, and how strong a case is made for or against them by this data. The report should be no longer than that. The report should be written using LaTeX, and submitted in PDF format.

Your submission should include three files:

1. a file assignment1.pdf containing your report;
2. a file assignment1.R containing all the R commands you used for your analyses;
3. a file assignment1-transcript.Rt containing a transcript of an R session in which assignment1.R has been run without errors.

Please submit these materials through the course website before the due date.

Preparing the data

Download the data file from the website and load it into R, as a data frame named babies.

The variables gestation, age, height, weight, and smoke all have some missing values. The code for a missing value is not exactly the same across the variables. Figure out the missingness code for each variable, then replace all occurrences of the missingness code with R’s missing value code, NA.

Some of the variables in the dataset are actually categorical, but are coded numerically. Convert these variables from numeric vectors to factors in the babies data frame, with appropriately named levels. Confirm the conversion worked by inspecting a summary of the data frame.

Look at a small number of other descriptive statistics or graphics that might be helpful in getting an initial feel for the data.

Analyzing claim 1: guidelines

Claim 1 states: mothers who smoke deliver premature babies more often than mothers who do not. A full-term pregnancy is defined by the medical community as lasting 40 weeks. A premature birth is defined as occurring prior to the 37th week of gestation.

1. Make one or more suitable graphical comparisons of the gestation distribution for smoking mothers to the gestation distribution of non-smoking mothers.
2. Add to the babies data frame a two-level factor variable indicating whether or not each baby was born prematurely. Use this factor and the factor smoke to carry out a relevant tabular comparison of distributions.
3. Make a figure which allows the comparison in the previous item to be carried out visually.
4. Use the same table to carry out one or more hypothesis tests of the null hypothesis that smoking and non-smoking mothers have the same rate of premature delivery.
5. A related question is whether the overall average gestation time is shorter for smoking mothers, compared to non-smoking mothers. Conduct one or more appropriate hypothesis tests.
6. If there are other statistics, tables, figures, tests, or analyses that seem useful or important to you in assessing claim 1, produce them and report on them.

Analyzing claim 2: guidelines

Claim 2 states: Cigarette smoking has a stronger relationship to infant birth weight than several other relevant covariates. The only other covariates available in the data for us to check are parity, age, height, and weight.

1. Compare the difference in the average birth-weight between smoking and non-smoking mothers to the difference in the average birth-weight between first-borns and non-first-borns. Conduct suitable hypothesis tests to accompany the comparison.
2. Divide the mothers into “tall” (above median height in the data) and “short” (below median height in the data). Repeat the comparison of the previous item for babies born to tall versus short women (rather than for first-borns versus non-first-borns).
3. Do the same again, for mothers who are “heavy” (above median weight) and “light” (below median weight).
4. Make a multi-panel figure which allows the comparisons of the previous three items to be carried out visually for whole distributions, rather than averages. Give the y-axes exactly the same range in every panel, to ease visual comparison.
5. Fit a multiple linear regression of birth-weight against height, weight, and parity (but not smoking status). Summarize and check the fit.
6. Fit a second regression like the one in the previous item, but including smoking status. Compare the two regression models informally and formally. Interpret the results of the comparison.
7. What are the pros and cons of the multiple-regression approach, as compared to the univariate comparisons you carried out initially?
8. If there are other statistics, tables, figures, tests, or analyses that seem useful or important to you in assessing claim 2, produce them and report on them.
9. (EXTRA CREDIT) Use the plotting package ggplot2 to produce a single multi-panel figure which does the following: for each bin created in a three-way classification by ⟨tall/short, heavy/light, parity⟩, visually compare the birth-weight distribution of smokers versus nonsmokers. Create the figure using a single R expression that involves only ggplot2 functions. What advantages does this comparison have over the linear regression approach?
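
Illustration for "Preparing the data". The recoding steps described in that section can be sketched on a small made-up data frame. This is only a sketch under stated assumptions, not a solution: the data frame toy, its values, and the sentinel codes 999 and 9 are all invented for illustration, and the real missingness codes still have to be found by inspecting babies.data yourself.

```r
# Toy stand-in for the real data frame; every value here is invented.
# ASSUMPTION: 999 and 9 are hypothetical missingness codes used only in
# this sketch. Find the real codes by inspecting the actual data.
toy <- data.frame(
  gestation = c(284, 999, 270, 255),
  smoke     = c(0, 1, 9, 1)
)

# Replace each variable's sentinel code with R's missing value, NA.
toy$gestation[toy$gestation == 999] <- NA
toy$smoke[toy$smoke == 9] <- NA

# Convert the numerically coded categorical variable to a factor
# with descriptive level names.
toy$smoke <- factor(toy$smoke, levels = c(0, 1),
                    labels = c("nonsmoker", "smoker"))

# A summary shows NA counts and factor level counts, which is a quick
# way to confirm that the conversion did what was intended.
print(summary(toy))
```

Doing the NA replacement before the factor conversion keeps the sentinel code from turning into a spurious third factor level.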
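
Illustration for "Analyzing claim 1". The tabular comparison and the tests in items 2, 4, and 5 might be organized roughly as follows. The data are simulated (not the CHDS data), the names toy and tab are hypothetical, and the 259-day cutoff is one possible reading of "prior to the 37th week" that you should state and justify in your report rather than take from here.

```r
set.seed(1)

# Simulated data, invented for illustration only.
n <- 200
toy <- data.frame(
  gestation = round(rnorm(n, mean = 279, sd = 16)),
  smoke     = factor(sample(c("nonsmoker", "smoker"), n, replace = TRUE))
)

# ASSUMPTION: "premature" is taken to mean fewer than 37 * 7 = 259 days.
toy$premature <- factor(toy$gestation < 37 * 7,
                        levels = c(FALSE, TRUE),
                        labels = c("full-term", "premature"))

# Two-way table of smoking status by prematurity, plus row proportions
# (the premature rate within each smoking group).
tab <- table(smoke = toy$smoke, premature = toy$premature)
print(tab)
print(prop.table(tab, margin = 1))

# Tests of equal premature rates in the two groups.
chi <- chisq.test(tab)    # Pearson chi-squared with continuity correction
fis <- fisher.test(tab)   # Fisher's exact test

# Test of equal mean gestation (Welch two-sample t-test by default).
tt <- t.test(gestation ~ smoke, data = toy)

print(c(chisq = chi$p.value, fisher = fis$p.value, welch_t = tt$p.value))
```

Because smoking status is assigned at random in this simulation, the tests have no real effect to detect; the point is only the shape of the workflow.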
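
Illustration for items 2 and 3 of "Analyzing claim 2". The median split leaves one detail open: what to do with mothers whose height (or weight) equals the median exactly. The sketch below, on invented heights, shows one option among several, namely leaving those observations unclassified. The vector height and the factor stature are hypothetical names.

```r
set.seed(3)

# Invented heights in whole inches, so ties at the median are likely.
height <- round(rnorm(100, mean = 64, sd = 2.5))
med <- median(height)

# ASSUMPTION: values exactly equal to the median are left as NA rather
# than forced into either group. Other conventions are defensible.
stature <- factor(ifelse(height > med, "tall",
                  ifelse(height < med, "short", NA)),
                  levels = c("short", "tall"))

# useNA = "ifany" makes the unclassified ties visible in the table.
print(table(stature, useNA = "ifany"))
```

Whichever convention you adopt, it is worth reporting how many observations it affects.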
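
Illustration for items 5 and 6 of "Analyzing claim 2". One common way to compare two nested regressions formally is a partial F-test, which in R is available through anova() on two lm fits. The sketch below uses simulated data with an invented data-generating process; the names toy, fit0, fit1, and cmp are hypothetical, and nothing here reflects the real CHDS results.

```r
set.seed(2)

# Simulated covariates, invented for illustration only.
n <- 300
toy <- data.frame(
  height = rnorm(n, mean = 64, sd = 2.5),
  weight = rnorm(n, mean = 128, sd = 20),
  parity = factor(sample(c("not first-born", "first-born"), n, replace = TRUE)),
  smoke  = factor(sample(c("nonsmoker", "smoker"), n, replace = TRUE))
)

# ASSUMPTION: an arbitrary made-up relationship, so the models have
# something to fit. The coefficients carry no real-world meaning.
toy$bwt <- 120 + 1.2 * (toy$height - 64) -
  8 * (toy$smoke == "smoker") + rnorm(n, sd = 17)

# Smaller model: no smoking status.
fit0 <- lm(bwt ~ height + weight + parity, data = toy)

# Larger model: the same terms plus smoking status.
fit1 <- update(fit0, . ~ . + smoke)

# Informal comparison: coefficient table and adjusted R-squared.
print(summary(fit1)$coefficients)
print(c(adj_r2_without = summary(fit0)$adj.r.squared,
        adj_r2_with    = summary(fit1)$adj.r.squared))

# Formal comparison: partial F-test of the nested models.
cmp <- anova(fit0, fit1)
print(cmp)
```

With real data that contain missing values, lm() drops incomplete rows separately for each model, and anova() stops with an error if the two fits ended up using different numbers of observations. Restricting both fits to the rows that are complete on every variable involved avoids this.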