Description

Please write a basic analysis for the questions below; the sample report and format are given below (about 1-2 pages per question). You don’t need to write an introduction, only the basic analysis for the questions.

1. Provide an estimate for the fraction of students who played a video game in the week prior to the survey. Provide both a point estimate and an interval estimate for this proportion.

2. Check to see how the amount of time spent playing video games in the week prior to the survey compares to the reported frequency of play (daily, weekly, etc). How might the fact that there was an exam in the week prior to the survey affect your previous estimates and this comparison?


Case Study 2: Who plays video games

Introduction

The data

Background

Investigations

Survey Methodology

– Sample from a larger population

– Basic rule: all individuals must have equal chance of being

selected

– If all members of a population were identical, sampling would

not be necessary

– Aim for a sample that is generalizable to total population of

interest


Description

* Target population: 3,000 to 4,000 students in statistics courses at UC Berkeley.

* The survey’s aim was to determine the extent to which the students play video games and which aspects of video games they find most and least fun.

* Out of 314 students in Statistics 2, Section 1, during Fall 1994, 95 were selected at random to participate in the survey.

* Complete surveys were obtained from 91 out of 95 students.

* The survey asks students to identify how often they play video games and what they like and dislike about the games.

* The data available here are the students’ responses to the questionnaire.


The answers to the questions were coded numerically as follows:

Variable               Coding
Time                   # of hours played in the week prior to survey
Like to play           1=never played, 2=very much, 3=somewhat, 4=not really, 5=not at all
Where play             1=arcade, 2=home system, 3=home computer, 4=arcade and either home computer or system, 5=home computer and system, 6=all three
How often              1=daily, 2=weekly, 3=monthly, 4=semesterly
Play if busy           1=yes, 0=no
Playing educational    1=yes, 0=no
Sex                    1=male, 0=female
Age                    Student’s age in years
Computer at home       1=yes, 0=no
Hate math              1=yes, 0=no
Work                   # of hours worked the week prior to the survey
Own PC                 1=yes, 0=no
PC has CD-ROM          1=yes, 0=no
Have email             1=yes, 0=no
Grade expected         4=A, 3=B, 2=C, 1=D, 0=F

Sample observations

Snapshot of the data


Missing data

If a question was not answered or improperly answered, then

it was coded as 99.

Those respondents who had never played a video game or who

did not at all like playing video games were asked to skip

many of the questions.


Follow up survey

There was a second part of the survey that covers whether the student likes or dislikes playing games and why.

These questions are different from the others in that more than one response may be given.


Follow up survey

Type          Percent
Action        50%
Adventure     28%
Simulation    17%
Sports        39%
Strategy      63%

Table 1: What types of games do you play? (at most three answers)

The student is asked to check all types that he or she plays.

For example, 50% of the students responding to this question

said that they play action games

Not all students responded to this question, in part because

those who said that they have never played a video game or

did not like to play video games were instructed to skip this

question.


Follow up survey (cont.)

Type                     Percent
Graphics/Realism         26%
Relaxation               66%
Eye/hand coordination    5%
Mental Challenge         24%
Feeling of mastery       28%
Bored                    27%

Table 2: Why do you play the games you checked above? (at most three answers)

Students who did answer this question were also asked to

provide reasons why they play the games they do. They were

asked to select up to three such reasons.


Follow up survey (cont.)

Type                   Percent
Too much time          48%
Frustrating            26%
Lonely                 6%
Too many rules         19%
Costs too much         40%
Boring                 17%
Friends don’t play     17%
It is pointless        33%

Table 3: What don’t you like about video game playing? (at most three answers)

All students were asked to answer this question, and again

they were asked to select up to three reasons for not liking

video games.

The third part of the survey collects general information about the student: age, sex, etc.


The survey methodology

All of the population studied were undergraduates enrolled in Introductory Probability and Statistics, Section 1, during Fall 1994.

The list of all students who had taken the second exam of the

semester was used to select the students to be surveyed.

The exam was given a week prior to the survey.

To choose 95 students for the study, each student was

assigned a number from 1 to 314.

A pseudo-random number generator selected 95 numbers between 1 and 314.

To encourage honest responses, the students’ anonymity was preserved.
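The selection step described above can be sketched in a few lines of Python. This is only an illustration of drawing 95 of 314 numbered students without replacement; the function name `draw_sample` and the seed are assumptions, not details of the original survey.

```python
import random


def draw_sample(population_size=314, sample_size=95, seed=None):
    """Draw student numbers without replacement, each equally likely."""
    rng = random.Random(seed)
    # Students are numbered 1..population_size, as in the survey description.
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# The seed is an arbitrary illustrative choice that makes the draw repeatable.
selected = draw_sample(seed=1994)
```

Sampling without replacement guarantees that no student is chosen twice, which matches the rule that every individual has an equal chance of selection.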


The survey methodology(cont.)

The students had taken an exam the week before the survey,

and the graded exam papers were returned to them during the

discussion section in the week of the survey.

On Friday, those students who had not been reached during the discussion section were located during the lecture.

A total of 91 students completed the survey.

To encourage accuracy in reporting, the data collectors were

asked to briefly inform the student of the purpose of the

survey and of the guarantee of anonymity.


The objective of this study is to investigate the responses of the

participants with the intention of providing useful information to

the designers of a new computer lab.

1. Provide an estimate for the fraction of students who played a video game

in the week prior to the survey. Provide both a point estimate and an

interval estimate for this proportion.

2. Check to see how the amount of time spent playing video games in the

week prior to the survey compares to the reported frequency of play (daily,

weekly, etc). How might the fact that there was an exam in the week

prior to the survey affect your previous estimates and this comparison?

3. Provide a point estimate and an interval estimate for the average amount

of time spent playing video games in the week prior to the survey. Keep

in mind the overall shape of the sample distribution. A simulation study

may help determine the appropriateness of an interval estimate.
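For question 1, a point and interval estimate for a proportion might be computed along the following lines. This is a sketch under stated assumptions: it uses a normal-approximation interval with a finite population correction (the sample of 91 comes from 314 students without replacement), and the function name and the count of players are hypothetical placeholders, not results from the data.

```python
import math


def proportion_estimate(successes, n, population=None, z=1.96):
    """Return (point estimate, lower bound, upper bound) for a proportion."""
    p_hat = successes / n
    # Standard error of a sample proportion (n - 1 in the denominator).
    se = math.sqrt(p_hat * (1 - p_hat) / (n - 1))
    if population is not None:
        # Finite population correction for sampling without replacement.
        se *= math.sqrt((population - n) / population)
    return p_hat, max(0.0, p_hat - z * se), min(1.0, p_hat + z * se)


# HYPOTHETICAL count of players; replace with the number of respondents
# whose "time" value is greater than 0.
p_hat, lower, upper = proportion_estimate(30, 91, population=314)
```

The correction shrinks the interval because a sample of 91 covers a sizeable share of the 314 students.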


4. Consider the “attitude” questions. In general, do you think the students

enjoy playing video games? If you had to make a short list of the most

important reasons why students like/dislike video games, what would you

put on the list? Don’t forget that those students who say that they have

never played video games or do not at all like video games are asked to

skip over some of these questions. So, there may be many

nonrespondents to the questions as to whether they think video games

are educational, where they play video games, etc.


5. Look for the differences between those who like to play video games and

those who don’t. To do this, use the questions in the last part of the

survey, and make comparisons between male and female students, those

who work for pay and those who don’t, those who own a computer and

those who don’t. Graphical display and cross-tabulations are particularly

helpful in making these kinds of comparisons. Also, you may want to

collapse the range of responses to a question down to two or three

possibilities before making these comparisons.

6. (Extra credit) Further investigate the grade that students expect in the

course. How does it match the target distribution used in grade

assignment of 20% A’s, 30%B’s, 40% C’s and 10%D’s or lower? If the

nonrespondents were failing students who no longer bothered to come to

the discussion section, would this change the picture?
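For question 6, one possible way to compare expected grades with the target distribution is a chi-squared goodness-of-fit statistic. The sketch below is an assumption about method, not part of the assignment; the function name is illustrative and the example counts are hypothetical placeholders rather than tallies from the survey.

```python
def goodness_of_fit(observed, target_props):
    """Pearson chi-squared statistic comparing counts to target proportions."""
    total = sum(observed)
    expected = [total * p for p in target_props]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return stat, expected


# Target distribution from question 6, in the order A, B, C, D or lower.
TARGET = [0.20, 0.30, 0.40, 0.10]

# HYPOTHETICAL counts for illustration only; replace with tallies of the
# "grade" column (and, for the last part of the question, add the
# nonrespondents to the final category).
example_counts = [25, 35, 25, 6]
stat, expected = goodness_of_fit(example_counts, TARGET)

# Upper 5% point of the chi-squared distribution with 3 degrees of freedom.
CRITICAL_3DF_05 = 7.815
reject = stat > CRITICAL_3DF_05
```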


time like where freq busy educ sex age home math work own cdrom email grade

2 3 3 2 0 1 0 19 1 0 10 1 0 1 4 0 3 3 3 0 0 0 18 1 1 0 1 1 1 2 0 3 1 3 0 0 1 19 1 0

0 1 0 1 3 0.5 3 3 3 0 1 0 19 1 0 0 1 0 1 3 0 3 3 4 0 1 0 19 1 1 0 0 0 1 3 0 3 2 4 0 0

1 19 0 0 12 0 0 0 3 0 4 3 4 0 0 1 20 1 1 10 1 0 1 3 0 3 3 4 0 0 0 19 1 0 13 0 0 1 3

2 3 2 1 1 1 1 19 0 0 0 0 0 0 4 0 3 3 4 0 1 1 19 1 1 0 1 0 1 4 0 3 1 4 0 0 0 20 1 0 0

1 0 0 3 0 3 2 4 0 0 0 19 1 0 0 1 0 1 4 0 2 4 1 0 1 0 19 1 1 0 0 0 1 4 3 3 3 2 1 0 0

18 0 0 0 0 0 1 3 1 3 5 2 0 1 0 18 1 1 14 1 0 1 3 0 5 99 99 99 99 1 19 1 0 0 1 0 1 3

0 3 3 4 0 1 1 21 1 0 2 1 0 1 4 0 3 2 3 0 0 1 20 1 0 0 1 0 1 3 2 2 2 2 1 0 1 18 1 0 0

1 0 1 4 0 3 99 99 99 99 0 19 0 0 9 0 99 1 3 2 3 2 2 0 1 1 20 1 0 15 1 0 0 4 0 3 2 3

0 1 1 24 1 0 10 0 0 0 4 2 3 3 1 1 1 1 19 0 0 0 1 0 1 4 0 5 99 99 99 99 0 19 0 0 0 0

99 0 2 0 5 99 99 99 99 1 21 1 0 0 1 0 0 3 0 3 3 4 0 99 0 20 1 1 0 1 0 1 3 0 2 3 4 0

0 1 22 1 1 0 1 1 1 4 0 3 2 3 0 0 1 18 0 0 10 0 0 0 3 0 4 3 4 0 0 1 19 1 1 0 1 0 1 3

0 4 3 4 0 1 0 20 1 0 0 1 0 1 3 0 4 3 4 0 0 0 19 1 1 0 0 0 1 4 1 3 5 2 0 1 1 19 1 0

99 1 1 1 3 0 4 2 3 0 0 1 19 1 1 0 1 1 1 3 0 2 1 3 0 0 1 19 1 0 10 0 0 1 3 0 3 3 1 0

1 0 19 1 0 12 1 0 1 3 0.1 2 6 2 0 1 1 18 0 0 5 1 1 1 4 0.5 4 3 3 0 0 0 19 1 0 0 1 0 0

3 1 3 4 4 99 1 0 20 1 0 0 1 0 1 3 0 3 1 4 0 0 0 19 0 0 0 0 0 1 3 0 3 3 2 1 1 0 20 1

0 20 1 0 0 3 0 4 99 99 0 0 0 19 1 0 5 1 0 1 4 2 2 4 2 0 0 1 19 1 0 0 1 0 1 3 2 3 4 2

0 1 1 19 0 0 10 1 1 1 3 0.5 3 4 2 1 0 1 19 1 1 99 0 0 1 4 0 3 4 99 0 0 1 19 1 99 99

1 0 1 3 2 3 5 2 1 1 1 19 1 0 15 0 0 1 4 0 3 4 2 0 0 1 19 1 1 0 1 1 0 3 0 3 4 3 1 1 0

19 1 1 0 1 0 1 3 0 99 99 99 99 99 1 20 1 1 15 1 1 1 3 2 3 2 2 0 0 1 19 1 0 0 1 0 1

4 0 4 99 4 0 99 0 18 1 1 0 1 0 1 3 0 5 99 99 99 99 0 20 1 1 0 1 0 1 3 0.5 3 2 2 0 0

1 19 1 0 16 1 0 1 3 3 2 3 1 0 1 1 18 1 0 7 1 0 1 3 0 3 1 3 0 0 1 19 0 0 15 0 0 1 3 0

4 3 3 0 1 0 21 1 0 5 1 0 1 4 0 4 3 4 0 0 0 18 1 0 0 1 0 1 4 4 2 99 1 1 1 1 20 1 0 6 1

0 0 4 30 2 99 2 1 0 1 19 0 1 0 0 0 1 3 14 2 99 1 1 0 0 19 1 0 0 1 0 1 2 0 3 1 3 0 1

1 19 0 0 0 0 0 0 3 0 2 99 3 0 1 0 21 0 0 18 1 0 0 2 0 4 99 99 0 0 0 20 1 0 0 1 1 1 4

0.5 2 3 2 1 1 1 19 1 0 20 1 1 1 4 14 2 4 1 1 1 1 18 1 0 35 1 1 1 3 1 2 4 2 0 1 1 19

1 0 19 1 0 1 4 0 4 2 4 0 0 1 18 1 0 0 1 0 1 4 0 2 5 2 1 1 1 20 1 1 20 0 0 1 4 1.5 3

3 2 0 1 0 19 1 1 8 1 0 0 3 0 4 2 4 0 0 1 19 1 1 0 1 0 0 3 0 3 4 3 0 0 1 19 1 1 0 1 0

1 3 2 2 99 2 1 99 1 20 1 0 10 1 1 1 3 0 5 99 99 99 99 1 19 0 1 16 1 0 1 3 0 3 3 2 0

0 1 23 0 0 0 1 0 1 4 0 5 99 99 99 99 0 19 1 0 40 0 0 1 3 0 2 3 3 0 1 0 20 0 0 0 1 1

1 2 0 5 99 99 99 99 0 19 1 1 15 1 0 1 3 0 3 3 4 0 0 1 19 1 0 16 0 0 1 3 0 2 3 3 0 1

1 25 0 0 55 1 0 1 3 2 2 1 2 0 1 1 19 1 0 10 1 0 1 3 1 2 3 1 0 0 1 20 1 1 0 1 0 1 4 0

1 99 99 99 99 1 19 1 1 10 1 0 0 4 0 3 2 4 0 0 0 19 0 1 15 0 99 1 2 2 2 3 2 0 1 1 21

0 0 15 0 0 1 4 0 3 2 4 0 0 0 18 1 1 15 0 99 0 3 2 2 4 2 1 0 1 19 0 0 0 1 0 1 3 2 3 4

2 1 0 1 19 1 0 0 1 99 1 4 5 3 3 2 0 1 0 20 1 0 14 1 1 1 4 0 2 5 4 0 1 0 33 1 0 40 1

0 0 2 3 3 3 2 0 0 1 19 1 0 5 1 1 1 3 0 3 4 3 0 1 0 19 0 1 5 1 0 1 2
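The listing above can be read as 15 values per respondent, in the order of the column names. A minimal parsing sketch follows, assuming the values are whitespace-separated and that 99 always marks a missing answer; the helper name `parse_rows` is hypothetical, and because the listing was flattened during extraction, the row alignment should be checked before relying on it.

```python
MISSING_CODE = 99
COLUMNS = ("time like where freq busy educ sex age home math work "
           "own cdrom email grade").split()


def parse_rows(text):
    """Split whitespace-separated values into one dict per respondent."""
    tokens = text.split()
    if len(tokens) % len(COLUMNS) != 0:
        raise ValueError("token count is not a multiple of the column count")
    rows = []
    for start in range(0, len(tokens), len(COLUMNS)):
        values = [float(t) for t in tokens[start:start + len(COLUMNS)]]
        # Recode the missing-value marker so it cannot distort averages.
        rows.append({name: (None if value == MISSING_CODE else value)
                     for name, value in zip(COLUMNS, values)})
    return rows


# First two respondents from the listing.
first_two = parse_rows(
    "2 3 3 2 0 1 0 19 1 0 10 1 0 1 4 "
    "0 3 3 3 0 0 0 18 1 1 0 1 1 1 2"
)
```

Recoding 99 to `None` forces later summaries to handle nonresponse explicitly instead of averaging it in as if it were a real value.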


Smoking in Mothers Result in Decreasing Birth Weights

Disclaimer

The report provided here IS NOT the definitive answer key for Homework 1.

This is meant to serve as an example of what we think might be an adequate

submission. There are multiple possible answers for these parts that can also get

full marks.

(This is the header mentioned in the HW guidelines – includes Title, Authors, and Contribution

Statement. This DOES NOT count towards the report 10 page limit.)

0. Contribution Statement

Both Benjamin Pham and Xinran Wang wrote R code according to their written parts in

this work. Both students discussed and implemented the data processing section. Benjamin

Pham wrote the Numerical Analysis section, and Graphical Analysis section. Xinran wrote

the Incidence section and Conclusion. Both students contributed equally to the Introduction,

Advanced Analysis section, and Conclusion. In addition, both students reviewed and added

changes to the whole report.

1. Introduction

(Background abbreviated, should include literature reviews + citations in motivating the

analysis, 1 page)

Smoking has remained a highly addictive and destructive habit among adults over the past 50 years despite many public health advances. Nicotine addiction causes 80% of people who try to quit to fail and return to the habit despite knowing its dangers (1). It is no surprise that expectant mothers, even those who have educated themselves on the hazards of smoking while pregnant, start or continue to smoke. Smoking during pregnancy is known to cause adverse effects on fetal development. Low birth weight and shortened gestation due to smoking during pregnancy usually result in a lower survival rate for babies, owing to problems such as restricted oxygen and nutrient transfer during fetal development (2).

The main goal of this analysis is to investigate the differences between the birth weight distributions of babies born to smoking mothers and to non-smoking mothers. In this analysis, we use numerical summaries and graphical methods to describe the distribution of babies’ birth weight, and we test the sensitivity of our estimates of the low-birth-weight rate. Numerical summaries include the minimum, maximum, mean, median, standard deviation, kurtosis, skewness, and quantiles of the birth weights for babies born to women who smoked and who did not smoke during their pregnancy. Graphical methods, including histograms and Q-Q plots, compare the distributions of the two groups. Incidence experiments, run under different classification standards for low-birth-weight babies, assess the robustness of our estimates. We then use the Chi-Squared Test of Independence, a hypothesis testing method, to determine whether smoking status is associated with low birth weight. Combining all of the evidence above, we determine whether the differences observed between the groups are important.

Data

The data from babies.txt is part of the Child Health and Development Studies database, which details pregnancies occurring between 1960 and 1967 among women enrolled in the Kaiser Foundation Health Plan in the Oakland area. The women in the data are of different races. The dataset consists of 1236 male babies who lived at least 28 days and were all single births (no twins). The two variables of interest are the baby’s birth weight, a numerical, discrete variable measured in ounces, and smoking status, a categorical variable coded as 1 if the mother smoked during her pregnancy and 0 if she did not.


2. Basic Analysis

(For each question, Provide a methods, analysis, and conclusion section as shown in the

guidelines. The method section describes what was conducted to yield the results. The

analysis section shows the results. The conclusion section shows the interpretation of the

results. You are NOT limited to talking purely about a specific subsection in each conclusion.

You can call back to stated results from prior sections. Each section should be 1-2 pages as

needed.)

2.1. Data Processing

Methods

The data was loaded with R. Our basic analysis mainly focused on birth weight (bwt) and

mother smoking status (smoke). The data was cleaned where observations with missing

values in these columns were removed from our analysis.

Analysis

The data originally had 1,236 observations. Removing the observations with missing smoke values reduced the dataset to 1,226 observations. The two groups are unequal in size: there are 484 Smoker mothers and 742 Non-Smoker mothers.

Conclusion

Removing observations in the dataset results in a loss of data. However, there is still an

extremely large number of observations in the dataset which shows that the analysis will not

be majorly affected by the loss of 10 observations. It is possible to trim even more data from

the dataset if missing values in other columns are considered. However, this could result in a

loss of potentially important data points.


2.2. Numerical Analysis for Birth Weight Distribution of Smoker

vs. Non-Smoker Babies

Methods

A five-number summary of the birth weights for babies of both Smoker and Non-Smoker mothers was generated to initially examine the data. A five-number summary consists of the minimum, first quartile, median, third quartile, and maximum of the data. The skewness and kurtosis of birth weights in each smoke category were calculated separately. These values were then compared to determine how similar the distributions are to each other and to the Normal distribution. For reference, a normal distribution has a skewness coefficient of approximately 0 and a kurtosis coefficient of approximately 3. To validate this, the kurtosis and skewness of the birth weights in each smoke category were compared to those of their respective expected normal distribution.

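Skewness is defined as:

$$\text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^3$$

Kurtosis is defined as:

$$\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^4$$

Where $n$ is the number of observations, $X_i$ is the $i$-th observation, $\bar{X}$ is the mean, and $s$ is the standard deviation.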
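The two definitions translate directly into code. The sketch below is illustrative only: the report does not say whether s divides by n or by n - 1, so the divide-by-n form is assumed here, and the function names are not from the original analysis.

```python
import math


def standardized_moment(values, power):
    """Average of ((x - mean) / s) ** power over all observations."""
    n = len(values)
    mean = sum(values) / n
    # ASSUMPTION: s is the divide-by-n standard deviation.
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / s) ** power for x in values) / n


def skewness(values):
    return standardized_moment(values, 3)


def kurtosis(values):
    return standardized_moment(values, 4)
```

With this convention a perfectly symmetric sample has skewness 0, and a normal distribution has kurtosis 3, which is the benchmark used in the comparison.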

Analysis

Table 1: Summary Statistics of Smoker and Non-Smoker Mothers. The five point summary and the number of observations in each smoker status group.

                  Min   1st QRT   Median   Mean       3rd QRT   Max   Number of Observations
Smoker BWT        58    102       115      114.1095   126       163   484
Non-Smoker BWT    55    113       123      123.0472   134       176   742

The Smoker and Non-Smoker birthweight distributions have different five-number summaries. Notably, the mean birthweight of babies of Smoker mothers is smaller than that of babies of Non-Smoker mothers.


Table 2: Kurtosis and Skewness of Smoker and Non-Smoker Mothers Compared to Expected Normal Distribution. The kurtosis and skewness of birthweight measurements of each smoke category are compared to those of the expected normal distribution of their respective sample size.

                                          Kurtosis    Skewness
Smokers BWT                               2.975698    -0.0334909
Non-Smokers BWT                           4.026186    -0.1866062
Expected Normal Distribution Smoke        2.965161    0.0962498
Expected Normal Distribution Non-Smoke    2.958478    0.1042179

From these calculations, the Smoker birthweight distribution has kurtosis and skewness very close to those of a Normal distribution. The Non-Smoker birthweight distribution deviates somewhat from both the Smoker birthweight distribution and the Normal distribution, with a kurtosis of 4.03. Both distributions are roughly symmetric since their skewness values are close to 0.

Conclusion

From the kurtosis and skewness calculations, the Smoker and Non-Smoker birthweight distributions appear to differ. The Smoker distribution seems closer to normal, since its kurtosis is very similar to that of a random normal sample. The Non-Smoker distribution has a more pronounced peak than both the Normal distribution and the Smoker distribution, since its kurtosis is larger. Even though the Non-Smoker distribution departs somewhat from normality, the sample is large (742 observations, as shown in Section 2.1), so by the Law of Large Numbers (LLN) and Central Limit Theorem (CLT) its sample mean can still be treated as approximately normal. The numerical summaries alone are not enough to confirm that the Smoker and Non-Smoker distributions are normally distributed. The five-number summaries of the two groups also show slight differences, which may indicate different distributions.


2.3. Graphical Analysis for Birth Weight Distribution of Smoker

vs. Non-Smoker Babies

To confirm that the Smoking and Non-Smoking distributions are normally distributed,

graphical methods must be used to visualize each respective distribution.

Methods

Histograms of the Smoking and Non-Smoking birthweights were created because birthweight is treated as a continuous numeric variable. For comparison with a normal distribution, an expected normal curve with the mean and standard deviation of the respective group was drawn in red over each histogram. Q-Q plots were then used to check whether the Smoking and Non-Smoking birthweights do indeed come from a Normal distribution with their mean and sd parameters.

Analysis

[Figure 1 panels: "Histogram of smoke$bwt" and "Histogram of non_smoke$bwt" (x-axis: bwt, y-axis: Density); "Normal Q-Q Plot" for each group (x-axis: Theoretical Quantiles, y-axis: Sample Quantiles).]

Figure 1: Histogram of Birthweights by Smoking Status (top) and Q-Q plot of Birthweights by

Smoking Status (bottom)


The data in both Smoking and Non-Smoking birthweights seem to follow the general shape

of their respective expected random Normal density curve, which is depicted in red. The

dashed blue line represents the mean of each respective birthweight distribution.

In each Q-Q plot, the red line represents the theoretical normal distribution. Because the data points lie close to the red line in both Q-Q plots, the Smoking and Non-Smoking birthweights are very close to their expected theoretical normal distributions, with some minor deviations at the tails. These plots highlight potential outliers, which are explored further with boxplots (see Figure 3 in the Appendix).

Conclusion

From the histograms and the Q-Q plots, we conclude that the Smoking and Non-Smoking birthweights are approximately normally distributed. However, they do not share the same normal distribution, since the Non-Smoker birthweight distribution is narrower than the Smoker birthweight distribution. The mean of the Smoker birthweights is smaller than the mean of the Non-Smoker birthweights.


2.4. Incidence of Low Birth Weight Babies

Methods

We propose to use the number of babies classified as low-birth-weight to estimate the incidence. The incidence rate of low birth weight is defined as:

$$\text{Incidence} = \frac{n_{\text{low-birth-weight}}}{n_{\text{total}}}$$

To understand how the incidence of low birth weight changes when the classification threshold is changed, a list of candidate thresholds is generated to examine the robustness of our estimates. The pattern of change in the proportion estimates as the threshold changes is examined in this section through a scatterplot of low-birth-weight proportions against possible classification thresholds. The standard deviation of the proportions is also calculated to assess the reliability of the estimate. We regard our estimate as reliable if it does not vary much when the classification standard is changed slightly.
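The threshold sweep described in this section can be sketched as follows. This is a minimal illustration, not the report's R code: the function names are assumptions and the toy weights are hypothetical values, not observations from babies.txt.

```python
def incidence(weights, threshold):
    """Fraction of birth weights strictly below the threshold (ounces)."""
    return sum(w < threshold for w in weights) / len(weights)


def incidence_curve(weights, thresholds):
    """Pair each candidate threshold with the resulting incidence rate."""
    return [(t, incidence(weights, t)) for t in thresholds]


# HYPOTHETICAL toy weights in ounces, for illustration only.
toy_weights = [70, 85, 88, 90, 100, 110, 115, 120, 125, 140]
curve = incidence_curve(toy_weights, [80, 88.2, 95])
```

Because raising the threshold can only add babies to the low-birth-weight group, the rates in `curve` never decrease as the threshold grows; a robust estimate shows only small steps near the chosen standard.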

Analysis

Using the provided standard (birth weight less than 88.2 oz), there were 40 out of 484 (8.26%) low-birth-weight babies in the smoking mother group, and 23 out of 742 (3.10%) in the non-smoking mother group. Numerically, the incidence rate of low-birth-weight babies is lower in the non-smoking mother group than in the smoking mother group.

In the left panel of Figure 2, each point represents the low-birth-weight rate of a group at one of a sequence of potential classification thresholds. Since more babies are classified as low-weight when the threshold is moved up, we expected the monotonically increasing trend that the plot shows. We visually observed that in the neighborhood of the threshold standard that we use (88.2 ounces), slight movements in the threshold trigger no substantial jump in the incidence estimate.

[Figure 2 panels: "Low Birth Weight Rate vs Standard" (x-axis: Threshold for Low Birth Weight (ounces), y-axis: Proportion of Babies Lower than Threshold; series: Smoking, Non-Smoking) and "SD of Incidence Rate vs Window Size" (x-axis: Window Size, y-axis: Standard Deviation of Incidence Rate; series: Smoking, Non-Smoking).]

Figure 2: Proportion of Babies Classified as Low Birth Weight vs Potential Low Birth Weight Baby

Thresholds (left) and Standard Deviation of Incidence Rate vs Window Size (right).

The right panel of Figure 2 illustrates how the standard deviation of the incidence rate changes as the examination window around the 88.2-ounce standard widens. The estimate is more robust for the non-smoking group, as its standard deviation rises slowly when the window enlarges. The rise in standard deviation is slightly steeper in the smoking group. However, even at a large window size of 20, the standard deviation does not look substantial.

Conclusion

Based on the analysis above, we find that the incidence rate of low-birth-weight babies is higher in the smoking mother group than in the non-smoking group in our sample (8.26% vs 3.10%). From our experiments, we conclude that the estimate of the low-birth-weight incidence is reliable and robust. The estimate does not vary much when a few more or fewer babies are classified as low birth weight.


3. Advanced Analysis

(Use methods to answer an additional question not asked. You can also use additional

methods not covered in class. 1-2 pages.)

We have observed from the previous analysis that there are some group-wise differences in the distributions. We have also suggested that the incidence rate is a reliable estimate. We now want to assess whether the incidence rate differs to a statistically significant degree between the smoking and non-smoking mother groups, that is, whether a mother’s smoking status is associated with babies weighing under 88.2 ounces.

Methods

To assess the importance of the difference between the incidence rates, and whether smoking status is associated with the low-birth-weight classification, we propose to use a Yates-corrected Chi-Squared Test of Independence. The null hypothesis in this test is that there is no association between the mother’s smoking status and the baby’s low-birth-weight classification. The alternative hypothesis is that such an association exists. At a significance level of 0.05, we plan to reject the null hypothesis if our test statistic is above the critical value of 3.84 (chi-squared with 1 degree of freedom).

Analysis

From the result of the Chi-Squared Test of Independence, we observed a p-value less than our

level of significance 0.05 (p = 0.00011). The test statistic 14.99 is also larger than our critical

value of 3.84 (see Figure 4 for a visualization with the probability density function (pdf)).

Therefore, we decided to reject the null hypothesis and conclude that there is a statistically

significant association between mother’s smoking status and the incidence of low-birth-weight

babies, under the significance level of 0.05.
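As a cross-check, the test can be reproduced from the counts reported in Section 2.4. The sketch below is an independent pure-Python illustration, not the report's R code; the function name is an assumption, and the p-value uses the fact that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
import math


def yates_chi_squared(table):
    """Yates-corrected chi-squared test for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            # Continuity correction: shrink each deviation by 0.5.
            deviation = max(0.0, abs(observed - expected) - 0.5)
            stat += deviation ** 2 / expected
    # With 1 degree of freedom, P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: smoking, non-smoking. Columns: low birth weight, not low.
stat, p_value = yates_chi_squared([(40, 484 - 40), (23, 742 - 23)])
```

With these counts the statistic and p-value should come out close to the reported 14.99 and 0.00011, well beyond the 3.84 critical value.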

Conclusion

Using a chi-squared test of independence (α = 0.05), we conclude that a mother’s smoking status is associated with the incidence of low-birth-weight babies, defined as those weighing less than 88.2 ounces.


4. Discussion and Conclusion

(Summarize your main findings here. Compare and contrast the results from your separate

analyses. Compare your overall findings to findings found in other studies – Does it match

what others have found? If not, why? Are there limitations in the data? 1 page.)

The numerical analysis only confirms that the Smoker birthweights are approximately normally distributed; for the Non-Smoker birthweights the evidence is weaker and rests on the large sample size (CLT and LLN). The graphical analysis supports that both the Smoker and Non-Smoker birthweights are approximately normally distributed. The incidence analysis shows that there is a higher proportion of low-birth-weight babies in the smoking group. From the experiments conducted, we also concluded that our estimate of the incidence is reliable and robust. Lastly, the chi-squared test of independence suggests that there is an association between a mother’s smoking status and the incidence of low-birth-weight babies when using the original threshold of 88.2 ounces.

Several confounders should be considered in the investigation process. Because the data used in this research come from a retrospective observational study rather than a controlled experiment, we can only infer association and cannot establish a causal relationship. Another limitation is that the study population is potentially not representative of all mothers: all of these mothers had single male births who survived at least 28 days. Additional confounders could also influence the conclusion. For instance, socio-economic factors could affect the mother’s health and, by extension, the baby’s health. Gestational age may also play a role, since birth weight increases with gestational age (3). A future direction for this analysis is to investigate the effect of smoking on gestational age, to determine whether a shorter gestational age acts as a mediator in the relationship between smoking status and low birth weight.

Although we cannot establish a causal relationship between smoking during pregnancy and

low birth weights, this report found a strong association between these two variables. (more

writing tie back to the scientific question, and assess if this difference being important to the

health of the baby, which is abbreviated here) Furthermore, there are extensive studies that

emphasize birth weight as an indicator of the baby’s health since low birth weight babies

are more likely to develop complications such as cognitive deficits, motor delays, cerebral

palsy, and psychological problems. In fact, low birth weight babies are 20 times more likely to

develop fatal complications and die in comparison to normal birth weight babies(4). Therefore,

although a causal relationship cannot be found, smoking during pregnancy is something that

should not be overlooked.


Work Cited and Appendix DOES NOT count towards 10 page limit

5. Work Cited

(Abbreviated. We recommend storing citations with a citation manager such as Mendeley (https://www.mendeley.com/download-desktop-new/) or Zotero (https://www.zotero.org/) so you can store your citations and easily make a works cited page. The studies cited can be found on popular scientific literature sites such as PubMed (https://www.ncbi.nlm.nih.gov/pmc/). For this report, I used the citation format commonly found in Nature, but you can choose whatever citation format (MLA, APA, etc.) you would like to use.)

1. Benowitz, N. L. Nicotine addiction. N. Engl. J. Med. 362, 2295–2303 (2010).

2. Wickstrom, R. Effects of Nicotine During Pregnancy: Human and Experimental

Evidence. CN 5, 213–222 (2007).

3. Topçu, H. O. et al. Birth weight for gestational age: A reference study in a tertiary

referral hospital in the middle region of Turkey. Journal of the Chinese Medical

Association 77, 578–582 (2014).

4. K. C., A., Basel, P. L. & Singh, S. Low birth weight and its associated risk factors:

Health facility-based case-control study. PLoS ONE 15, e0234907 (2020).


6. Appendix

[Figure: two boxplots, titled "Boxplot of Non−Smoker bwt" and "Boxplot of Smoker bwt"]

Figure 3: Boxplot of birthweights separated by Smoking Status. There are a lot of observations

with low birthweights in the non-smoker data that can potentially skew the analysis.


[Figure: "Chi−Squared Density Plot df = 1"; x-axis: x; y-axis: dchisq(x, df = 1)]

Figure 4: Density of the chi-squared distribution with df = 1. The red line marks the test statistic from

the chi-squared test. The p-value is the area under the curve to the right of the

red line. The blue line marks the critical value at the 0.05 significance level. The p-value

is very small because the area to the right of the red line is extremely small.
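The tail-area calculation described in the Figure 4 caption can be sketched in Python using only the standard library. This is an illustrative sketch, not the code used for the report: the function name `chi2_df1_upper_tail` and the test statistic value of 20.0 are hypothetical, chosen only to show the mechanics. It relies on the fact that a chi-squared variable with 1 degree of freedom is the square of a standard normal variable.

```python
import math

def chi2_df1_upper_tail(x: float) -> float:
    """Return P(X > x) for X ~ chi-squared with 1 degree of freedom."""
    if x <= 0.0:
        return 1.0
    # With df = 1, X = Z**2 for a standard normal Z, so
    # P(X > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(x / 2.0))

# Critical value at the 0.05 level for df = 1 (the square of 1.959964).
CRITICAL_VALUE_05 = 3.841458820694124

# Hypothetical test statistic, chosen only for illustration.
hypothetical_statistic = 20.0

p_value = chi2_df1_upper_tail(hypothetical_statistic)
print(f"p-value: {p_value:.3g}")
print("reject at 0.05:", hypothetical_statistic > CRITICAL_VALUE_05)
```

A statistic to the right of the critical value always gives a p-value below 0.05, which is the relationship between the red and blue lines in the figure.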


Exploratory Data Analysis and Inference

Format

Objective

One of the most important goals of this course is to learn how to write a data analysis report. The HW

is to be submitted in a format similar to a data analysis report. The difference is that the HW will be

more structured so that it can be more easily composed and graded.

Structure

The overall structure of the HW report should be as follows:

0. Header

1. Introduction

2. Analysis

3. Conclusion(s)/Discussion

4. Appendix/Appendices

Now let’s consider the basic outline of the data analysis report in more detail:

0. Header. This includes important general information:

• Title: Choose a succinct but specific title that reflects the goals of the analysis.

• Author contributions: Include a brief description of the respective contribution of each of

the team members.

1. Introduction. Good features for the Introduction include:

• Brief summary of the study and data, as well as any relevant substantive context,

background, or framing issues.

• The “big questions” answered by your data analyses, and summaries of your conclusions

about these questions. These questions should include: 1) the questions posed by the HW

prompts; 2) other questions that you may propose.

• Brief outline of remainder of the report.

2. Basic analysis. In this format, the analysis is organized by research questions. Devote a

subsection for each question raised in the Introduction. These questions should be organized

according to the HW prompts. Within each subsection, describe the statistical methods, analysis, and

conclusions for that question. For example:

2.1 Data processing and summaries

Methods

Analysis

Conclusions

2.2 Comparison between males and females

Methods

Analysis

Conclusions


2.3 Effect of Age

Methods

Analysis

Conclusions

Etc. . .

3. Advanced analysis. This section contains analysis that goes beyond the HW prompts. It will

display your own interest and creativity. It may include:

• An additional analysis question, e.g. estimating another parameter, considering the effect of

another variable in the data, evaluating the validity of statistical assumptions not

considered, etc.

• Using a more advanced analysis method to answer one of the HW questions or a new

question.

4. Conclusion(s)/Discussion. This section closes the report:

• Conclusion summary: It should reprise the questions and goals of the analysis stated in the

introduction. It should also summarize the findings and compare them to the original goals.

• Discussion: If relevant, include additional observations or details gleaned from the analysis

section. If relevant, discuss relevance to the science and other studies. Discuss data

limitations. New questions, future work, etc., can also be raised here.

5. Appendix/Appendices. This section is not mandatory but it may be necessary depending on

what you do. This is the place to put details and ancillary materials, that is, materials that you

want to include but would disturb the reading flow if they were put in the main text. These

might include such items as

• Technical descriptions of (unusual) statistical procedures

• Detailed tables or computer output

• Figures and Tables that were not central to the arguments presented in the body of the

report

6. Computer code. In a general data analysis report, computer code may be included in the

Appendix. In our course, code should be submitted as a separate file. Make sure to

document your code by including appropriate section headers, text sentences, comments and

annotations, to make it easier for the reader to follow what you are doing.

Formatting and length

A good data analysis report should present all the necessary information in a concise fashion. To

practice this and facilitate grading, please abide by the following constraints:

• Use 12-point font for the main text, with full space between lines.

• Start every section on a new page. This will make it easier for you to mark which pages

correspond to each graded item in Gradescope.

• Length guidelines:

– Header + Introduction: 1 page

– Each question: 1 to 2 pages each, including tables and figures

– Advanced analysis: 1 to 2 pages, including tables and figures

– Summary/conclusions/discussion: 1 page

• The total length of the report should not exceed 10 pages (not including Appendix or code).

Any additional material, if it is really necessary, should go in the Appendix.


Presentation style

Points will be given for good presentation style and abiding by the formatting constraints. As a

guideline, a good data analysis report has several important features:

• It is organized in a way that makes it easy for different audiences to skim/fish through it to find

the topics and the level of detail that are of interest to them.

• The writing is as invisible/unremarkable as possible, so that the content of the analysis is what

the reader remembers, not distracting quirks or tics in the writing. Examples of distractions

include:

– Extra sentences, overly formal or flowery prose, or at the other extreme overly casual or

overly brief prose.

– Grammatical and spelling errors.

– Placing the data analysis in too broad or too narrow a context for the questions of

interest to your primary audience.

– Focusing on process rather than reporting procedures and outcomes.

– Getting bogged down in technical details, rather than presenting what is necessary to

properly understand your conclusions on substantive questions of interest to the primary

audience.

• Tables are well organized, with well labeled columns and rows. Keep tables small enough

that they can be easily followed and the reader does not get lost.

• Figures are well composed, with well labeled axes and large enough fonts. If relevant, use

colors and line types to distinguish between different results and include a legend. Keep

figures uncluttered so that they can be easily understood.
