Description
Please write a basic analysis for the questions below; the sample report and format are given below (about 1-2 pages per question). You don't need to write an introduction, only the basic analysis for each question.
1. Provide an estimate for the fraction of students who played a video game
in the week prior to the survey. Provide both a point estimate and an
interval estimate for this proportion.
2. Check to see how the amount of time spent playing video games in the
week prior to the survey compares to the reported frequency of play (daily,
weekly, etc). How might the fact that there was an exam in the week
prior to the survey affect your previous estimates and this comparison?
Case Study 2: Who plays video games
Introduction
The data
Background
Investigations
Survey Methodology
– Sample from a larger population
– Basic rule: all individuals must have equal chance of being
selected
– If all members of a population were identical, sampling would
not be necessary
– Aim for a sample that is generalizable to total population of
interest
Description
* Target population: 3,000 to 4,000 students in statistics
courses at UC Berkeley.
* The survey’s aim was to determine the extent to which the
students play video games and which aspects of video games
they find most and least fun.
* Out of 314 students in Statistics 2, Section 1, during Fall
1994, 95 were selected at random to participate in the survey.
* Complete surveys were obtained from 91 out of 95 students.
* The Survey asks students to identify how often they play
video games and what they like and dislike about the games.
* The data available here are the students' responses to the
questionnaire.
The answers to the questions were coded numerically as follows:
Variable: Coding
Time: # of hours played in the week prior to survey
Like to play: 1=never played, 2=very much, 3=somewhat, 4=not really, 5=not at all
Where play: 1=arcade, 2=home system, 3=home computer, 4=arcade and either home computer or system, 5=home computer and system, 6=all three
How often: 1=daily, 2=weekly, 3=monthly, 4=semesterly
Play if busy: 1=yes, 0=no
Playing educational: 1=yes, 0=no
Sex: 1=male, 0=female
Age: Student's age in years
Computer at home: 1=yes, 0=no
Hate math: 1=yes, 0=no
Work: # of hours worked the week prior to the survey
Own PC: 1=yes, 0=no
PC has CD-ROM: 1=yes, 0=no
Have email: 1=yes, 0=no
Grade expected: 4=A, 3=B, 2=C, 1=D, 0=F
Sample observations
Snapshot of the data
Missing data
If a question was not answered or improperly answered, then
it was coded as 99.
Those respondents who had never played a video game or who
did not at all like playing video games were asked to skip
many of the questions.
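A minimal sketch, in Python, of how the missing-data code might be handled before computing summaries (the sample report later in this document used R). The names `MISSING_CODE`, `decode`, and `mean_ignoring_missing` and the sample values are illustrative assumptions, and the sketch assumes 99 never occurs as a genuine answer.

```python
MISSING_CODE = 99  # per the slide: unanswered or improperly answered questions


def decode(value):
    """Return None for the missing-data code, otherwise the value unchanged."""
    return None if value == MISSING_CODE else value


def mean_ignoring_missing(values):
    """Average the non-missing entries; return None if every entry is missing."""
    kept = [v for v in map(decode, values) if v is not None]
    return sum(kept) / len(kept) if kept else None


# Hypothetical fragment of the 'work' column with one unanswered entry.
work_hours = [10, 0, 99, 12]
print(mean_ignoring_missing(work_hours))
```

Treating 99 as a real value would badly inflate averages of columns such as hours worked, which is why the recoding comes first.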
Follow up survey
There was a second part of the survey that covers whether the
student likes or dislikes playing games and why.
These questions are different from the others in that more than
one response may be given.
Type         Percent
Action       50%
Adventure    28%
Simulation   17%
Sports       39%
Strategy     63%
Table 1: What types of games do you play? (at most three answers)
The student is asked to check all types that he or she plays.
For example, 50% of the students responding to this question
said that they play action games
Not all students responded to this question, in part because
those who said that they have never played a video game or
did not like to play video games were instructed to skip this
question.
Follow up survey (cont.)
Type                    Percent
Graphics/Realism        26%
Relaxation              66%
Eye/hand coordination   5%
Mental Challenge        24%
Feeling of mastery      28%
Bored                   27%
Table 2: Why do you play the games you checked above? (at most three answers)
Students who did answer this question were also asked to
provide reasons why they play the games they do. They were
asked to select up to three such reasons.
Follow up survey (cont.)
Type                 Percent
Too much time        48%
Frustrating          26%
Lonely               6%
Too many rules       19%
Costs too much       40%
Boring               17%
Friends don't play   17%
It is pointless      33%
Table 3: What don't you like about video game playing? (at most three answers)
All students were asked to answer this question, and again
they were asked to select up to three reasons for not liking
video games.
The third part of the survey collects general information about the
student: age, sex, etc.
The survey methodology
All of the population studied were undergraduates enrolled in Introductory Probability and Statistics, Section 1, during Fall 1994.
The list of all students who had taken the second exam of the
semester was used to select the students to be surveyed.
The exam was given a week prior to the survey.
To choose 95 students for the study, each student was
assigned a number from 1 to 314.
A pseudo-random number generator selected 95 distinct numbers
between 1 and 314.
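The selection step described above can be sketched in Python as a simple random sample without replacement. The function name `draw_sample` and the seed value are illustrative assumptions; the slides do not say which generator or seed was actually used.

```python
import random


def draw_sample(population_size=314, sample_size=95, seed=None):
    """Pick `sample_size` distinct student numbers from 1..population_size."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# The seed is arbitrary and only makes the draw reproducible.
chosen = draw_sample(seed=1994)
print(len(chosen), chosen[:5])
```

Sampling without replacement gives every student the same chance of selection, which matches the basic rule stated in the survey methodology slide.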
To encourage honest responses, the students' anonymity was
preserved.
The survey methodology(cont.)
The students had taken an exam the week before the survey,
and the graded exam papers were returned to them during the
discussion section in the week of the survey.
On Friday, those students who had not been reached during the
discussion section were located during the lecture.
A total of 91 students completed the survey.
To encourage accuracy in reporting, the data collectors were
asked to briefly inform the student of the purpose of the
survey and of the guarantee of anonymity.
The objective of this study is to investigate the responses of the
participants with the intention of providing useful information to
the designers of a new computer lab.
1. Provide an estimate for the fraction of students who played a video game
in the week prior to the survey. Provide both a point estimate and an
interval estimate for this proportion.
2. Check to see how the amount of time spent playing video games in the
week prior to the survey compares to the reported frequency of play (daily,
weekly, etc). How might the fact that there was an exam in the week
prior to the survey affect your previous estimates and this comparison?
3. Provide a point estimate and an interval estimate for the average amount
of time spent playing video games in the week prior to the survey. Keep
in mind the overall shape of the sample distribution. A simulation study
may help determine the appropriateness of an interval estimate.
4. Consider the "attitude" questions. In general, do you think the students
enjoy playing video games? If you had to make a short list of the most
important reasons why students like/dislike video games, what would you
put on the list? Don’t forget that those students who say that they have
never played video games or do not at all like video games are asked to
skip over some of these questions. So, there may be many
nonrespondents to the questions as to whether they think video games
are educational, where they play video games, etc.
5. Look for the differences between those who like to play video games and
those who don’t. To do this, use the questions in the last part of the
survey, and make comparisons between male and female students, those
who work for pay and those who don’t, those who own a computer and
those who don’t. Graphical display and cross-tabulations are particularly
helpful in making these kinds of comparisons. Also, you may want to
collapse the range of responses to a question down to two or three
possibilities before making these comparisons.
6. (Extra credit) Further investigate the grade that students expect in the
course. How does it match the target distribution used in grade
assignment of 20% A’s, 30%B’s, 40% C’s and 10%D’s or lower? If the
nonrespondents were failing students who no longer bothered to come to
the discussion section, would this change the picture?
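For the first investigation, a point and interval estimate of a proportion could be sketched in Python as follows. This is one possible approach under stated assumptions, not the required method: it uses a normal-approximation interval with the standard error for simple random sampling without replacement (denominator n - 1, with a finite population correction for N = 314). The count of 34 players is our own tally of nonzero `time` entries in the data listing and should be re-derived from the data file before being relied on.

```python
import math


def proportion_interval(successes, n, population=None, z=1.96):
    """Return (p_hat, (low, high)) using a normal approximation.

    If `population` is given, the standard error is shrunk by the
    finite population correction sqrt((N - n) / N).
    """
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / (n - 1))
    if population is not None:
        se *= math.sqrt((population - n) / population)
    return p_hat, (p_hat - z * se, p_hat + z * se)


# 34 of 91 respondents with time > 0 (our tally, an assumption to re-check).
p_hat, (low, high) = proportion_interval(34, 91, population=314)
print(f"point estimate {p_hat:.3f}, 95% interval ({low:.3f}, {high:.3f})")
```

Because 91 of the 314 students were sampled, the correction narrows the interval noticeably compared with the infinite-population formula.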
time like where freq busy educ sex age home math work own cdrom email grade
2 3 3 2 0 1 0 19 1 0 10 1 0 1 4
0 3 3 3 0 0 0 18 1 1 0 1 1 1 2
0 3 1 3 0 0 1 19 1 0 0 1 0 1 3
0.5 3 3 3 0 1 0 19 1 0 0 1 0 1 3
0 3 3 4 0 1 0 19 1 1 0 0 0 1 3
0 3 2 4 0 0 1 19 0 0 12 0 0 0 3
0 4 3 4 0 0 1 20 1 1 10 1 0 1 3
0 3 3 4 0 0 0 19 1 0 13 0 0 1 3
2 3 2 1 1 1 1 19 0 0 0 0 0 0 4
0 3 3 4 0 1 1 19 1 1 0 1 0 1 4
0 3 1 4 0 0 0 20 1 0 0 1 0 0 3
0 3 2 4 0 0 0 19 1 0 0 1 0 1 4
0 2 4 1 0 1 0 19 1 1 0 0 0 1 4
3 3 3 2 1 0 0 18 0 0 0 0 0 1 3
1 3 5 2 0 1 0 18 1 1 14 1 0 1 3
0 5 99 99 99 99 1 19 1 0 0 1 0 1 3
0 3 3 4 0 1 1 21 1 0 2 1 0 1 4
0 3 2 3 0 0 1 20 1 0 0 1 0 1 3
2 2 2 2 1 0 1 18 1 0 0 1 0 1 4
0 3 99 99 99 99 0 19 0 0 9 0 99 1 3
2 3 2 2 0 1 1 20 1 0 15 1 0 0 4
0 3 2 3 0 1 1 24 1 0 10 0 0 0 4
2 3 3 1 1 1 1 19 0 0 0 1 0 1 4
0 5 99 99 99 99 0 19 0 0 0 0 99 0 2
0 5 99 99 99 99 1 21 1 0 0 1 0 0 3
0 3 3 4 0 99 0 20 1 1 0 1 0 1 3
0 2 3 4 0 0 1 22 1 1 0 1 1 1 4
0 3 2 3 0 0 1 18 0 0 10 0 0 0 3
0 4 3 4 0 0 1 19 1 1 0 1 0 1 3
0 4 3 4 0 1 0 20 1 0 0 1 0 1 3
0 4 3 4 0 0 0 19 1 1 0 0 0 1 4
1 3 5 2 0 1 1 19 1 0 99 1 1 1 3
0 4 2 3 0 0 1 19 1 1 0 1 1 1 3
0 2 1 3 0 0 1 19 1 0 10 0 0 1 3
0 3 3 1 0 1 0 19 1 0 12 1 0 1 3
0.1 2 6 2 0 1 1 18 0 0 5 1 1 1 4
0.5 4 3 3 0 0 0 19 1 0 0 1 0 0 3
1 3 4 4 99 1 0 20 1 0 0 1 0 1 3
0 3 1 4 0 0 0 19 0 0 0 0 0 1 3
0 3 3 2 1 1 0 20 1 0 20 1 0 0 3
0 4 99 99 0 0 0 19 1 0 5 1 0 1 4
2 2 4 2 0 0 1 19 1 0 0 1 0 1 3
2 3 4 2 0 1 1 19 0 0 10 1 1 1 3
0.5 3 4 2 1 0 1 19 1 1 99 0 0 1 4
0 3 4 99 0 0 1 19 1 99 99 1 0 1 3
2 3 5 2 1 1 1 19 1 0 15 0 0 1 4
0 3 4 2 0 0 1 19 1 1 0 1 1 0 3
0 3 4 3 1 1 0 19 1 1 0 1 0 1 3
0 99 99 99 99 99 1 20 1 1 15 1 1 1 3
2 3 2 2 0 0 1 19 1 0 0 1 0 1 4
0 4 99 4 0 99 0 18 1 1 0 1 0 1 3
0 5 99 99 99 99 0 20 1 1 0 1 0 1 3
0.5 3 2 2 0 0 1 19 1 0 16 1 0 1 3
3 2 3 1 0 1 1 18 1 0 7 1 0 1 3
0 3 1 3 0 0 1 19 0 0 15 0 0 1 3
0 4 3 3 0 1 0 21 1 0 5 1 0 1 4
0 4 3 4 0 0 0 18 1 0 0 1 0 1 4
4 2 99 1 1 1 1 20 1 0 6 1 0 0 4
30 2 99 2 1 0 1 19 0 1 0 0 0 1 3
14 2 99 1 1 0 0 19 1 0 0 1 0 1 2
0 3 1 3 0 1 1 19 0 0 0 0 0 0 3
0 2 99 3 0 1 0 21 0 0 18 1 0 0 2
0 4 99 99 0 0 0 20 1 0 0 1 1 1 4
0.5 2 3 2 1 1 1 19 1 0 20 1 1 1 4
14 2 4 1 1 1 1 18 1 0 35 1 1 1 3
1 2 4 2 0 1 1 19 1 0 19 1 0 1 4
0 4 2 4 0 0 1 18 1 0 0 1 0 1 4
0 2 5 2 1 1 1 20 1 1 20 0 0 1 4
1.5 3 3 2 0 1 0 19 1 1 8 1 0 0 3
0 4 2 4 0 0 1 19 1 1 0 1 0 0 3
0 3 4 3 0 0 1 19 1 1 0 1 0 1 3
2 2 99 2 1 99 1 20 1 0 10 1 1 1 3
0 5 99 99 99 99 1 19 0 1 16 1 0 1 3
0 3 3 2 0 0 1 23 0 0 0 1 0 1 4
0 5 99 99 99 99 0 19 1 0 40 0 0 1 3
0 2 3 3 0 1 0 20 0 0 0 1 1 1 2
0 5 99 99 99 99 0 19 1 1 15 1 0 1 3
0 3 3 4 0 0 1 19 1 0 16 0 0 1 3
0 2 3 3 0 1 1 25 0 0 55 1 0 1 3
2 2 1 2 0 1 1 19 1 0 10 1 0 1 3
1 2 3 1 0 0 1 20 1 1 0 1 0 1 4
0 1 99 99 99 99 1 19 1 1 10 1 0 0 4
0 3 2 4 0 0 0 19 0 1 15 0 99 1 2
2 2 3 2 0 1 1 21 0 0 15 0 0 1 4
0 3 2 4 0 0 0 18 1 1 15 0 99 0 3
2 2 4 2 1 0 1 19 0 0 0 1 0 1 3
2 3 4 2 1 0 1 19 1 0 0 1 99 1 4
5 3 3 2 0 1 0 20 1 0 14 1 1 1 4
0 2 5 4 0 1 0 33 1 0 40 1 0 0 2
3 3 3 2 0 0 1 19 1 0 5 1 1 1 3
0 3 4 3 0 1 0 19 0 1 5 1 0 1 2
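A listing like the one above can be loaded by splitting on whitespace and grouping the values into records of 15, one per column name in the header. The Python sketch below illustrates this; `parse_listing` is a hypothetical helper, and the embedded sample holds only the first two records of the listing.

```python
COLUMNS = ["time", "like", "where", "freq", "busy", "educ", "sex", "age",
           "home", "math", "work", "own", "cdrom", "email", "grade"]


def parse_listing(text):
    """Turn a whitespace-separated dump into a list of 15-field records."""
    tokens = text.split()
    if tokens[:len(COLUMNS)] == COLUMNS:  # drop the header row if present
        tokens = tokens[len(COLUMNS):]
    if len(tokens) % len(COLUMNS) != 0:
        raise ValueError("token count is not a multiple of 15")
    rows = []
    for start in range(0, len(tokens), len(COLUMNS)):
        values = [float(t) for t in tokens[start:start + len(COLUMNS)]]
        rows.append(dict(zip(COLUMNS, values)))
    return rows


sample = """time like where freq busy educ sex age home math work own cdrom email grade
2 3 3 2 0 1 0 19 1 0 10 1 0 1 4 0 3 3 3 0 0 0 18 1 1 0 1 1 1 2"""
rows = parse_listing(sample)
played = sum(1 for r in rows if r["time"] > 0)
print(len(rows), played)
```

Grouping by token count rather than by line makes the parser indifferent to where the line breaks fall, so it works whether or not the rows are wrapped.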
Smoking in Mothers Result in Decreasing Birth Weights
Disclaimer
The report provided here IS NOT the definitive answer key for Homework 1.
This is meant to serve as an example of what we think might be an adequate
submission. There are multiple possible answers for these parts that can also get
full marks.
(This is the header mentioned in the HW guidelines – includes Title, Authors, and Contribution
Statement. This DOES NOT count towards the report 10 page limit.)
0. Contribution Statement
Both Benjamin Pham and Xinran Wang wrote R code according to their written parts in
this work. Both students discussed and implemented the data processing section. Benjamin
Pham wrote the Numerical Analysis section, and Graphical Analysis section. Xinran wrote
the Incidence section and Conclusion. Both students contributed equally to the Introduction,
Advanced Analysis section, and Conclusion. In addition, both students reviewed and added
changes to the whole report.
1. Introduction
(Background abbreviated, should include literature reviews + citations in motivating the
analysis, 1 page)
Smoking has remained a highly addictive and destructive habit among adults over the past 50
years despite many public health advances. Nicotine addiction causes
80% of people who try to quit to fail and to return to the habit despite
knowing the dangers of doing so(1). It is no surprise that expectant mothers, who
may have educated themselves on the hazards of smoking while pregnant, start or continue
to smoke. Smoking during pregnancy is known to cause adverse effects on fetal development.
Low birth weights and shortened gestational periods due to smoking during pregnancy usually
result in a lower survival rate for babies, owing to problems such as restricted oxygen
and nutrient transfer during fetal development(2).
The main goal of this analysis is to investigate the differences in distributions of babies’ birth
weight to smoking mothers versus non-smoking mothers. In this analysis, we use numerical
summaries and graphical methods to describe the distribution of babies’ birth weight, and
experiment on our estimates on the low-weight birth weight rate. Numerical summaries
include the minimum, maximum, mean, median, standard deviations, kurtosis, skewness,
and quantiles of the birth weights for babies born to women who smoked and did not smoke
during their pregnancy. Graphical methods, including histograms and Q-Q plots, compare the
distributions of the two groups. Incidence experiments, which are run with different classification
standards for low-birth-weight babies, assess the robustness of our estimates. We then
use the Chi-Squared Test of Independence, a hypothesis testing method, to determine
if smoking status is associated with low birth weight. Combining all the evidence above, we
determine whether the differences observed between groups are important.
Data
The data from babies.txt is part of the Child Health and Development Studies database
which details pregnancies occurring between 1960 and 1967 of women enrolled in the Kaiser
Foundation Health plan in the Oakland area. The data include women of different races.
The dataset consists of 1236 male babies who have lived at least 28 days and were all single
births (no twins). The two variables of interest are the baby’s birth weight which is a
numerical, discrete variable measured in ounces, and smoking status, which is a categorical
variable and is represented by an integer indicator, represented as 1 if the mother smoked
during her pregnancy and 0 if the mother did not smoke during her pregnancy.
2. Basic Analysis
(For each question, Provide a methods, analysis, and conclusion section as shown in the
guidelines. The method section describes what was conducted to yield the results. The
analysis section shows the results. The conclusion section shows the interpretation of the
results. You are NOT limited to talking purely about a specific subsection in each conclusion.
You can call back to stated results from prior sections. Each section should be 1-2 pages as
needed.)
2.1. Data Processing
Methods
The data was loaded with R. Our basic analysis mainly focused on birth weight (bwt) and
mother smoking status (smoke). The data was cleaned where observations with missing
values in these columns were removed from our analysis.
Analysis
The data originally had 1,236 observations. Removing the observations with missing
smoke values reduced the dataset to 1,226 observations. The observations in the dataset are
distributed unevenly: there are 742 Non-Smoker mothers and 484 Smoker mothers, matching the group sizes reported in Table 1.
Conclusion
Removing observations in the dataset results in a loss of data. However, there is still an
extremely large number of observations in the dataset which shows that the analysis will not
be majorly affected by the loss of 10 observations. It is possible to trim even more data from
the dataset if missing values in other columns are considered. However, this could result in a
loss of potentially important data points.
2.2. Numerical Analysis for Birth Weight Distribution of Smoker
vs. Non-Smoker Babies
Methods
A five-number summary of the birth weights for babies of both Smoker and Non-Smoker
mothers was generated to initially examine the data. A five-number summary
consists of the minimum, first quartile, median, third quartile, and maximum of the data.
The skewness and kurtosis of birth weights for each smoking status were calculated
separately. These values were then compared to determine how similar the two
distributions are to each other and to the Normal Distribution. For reference, a normal
distribution has a skewness coefficient of 0 and a kurtosis coefficient of
3. To calibrate the comparison, the kurtosis and skewness of the birth weights in each
smoke category were compared to those of a simulated normal sample of the same size.
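Skewness is defined as:
\[ \text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^{3} \]
Kurtosis is defined as:
\[ \text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^{4} \]
where n is the number of observations, X_i is the i-th observation, \bar{X} is the mean, and s is the
standard deviation.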
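The two definitions above can be sketched in Python as a single standardized-moment helper. The report's own calculations were done in R and its code is not shown, so this is only an illustration; in particular, using the population standard deviation (dividing by n) for s is an assumption, since the text does not say which version was used.

```python
import math


def standardized_moment(values, power):
    """Average of ((x - mean) / s) ** power over all observations."""
    n = len(values)
    mean = sum(values) / n
    # Assumption: s is the population SD (divide by n, not n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / n)
    return sum(((x - mean) / s) ** power for x in values) / n


def skewness(values):
    return standardized_moment(values, 3)


def kurtosis(values):
    return standardized_moment(values, 4)


# Hypothetical birth weights in ounces, for illustration only.
weights = [100, 110, 120, 130, 140]
print(skewness(weights), kurtosis(weights))
```

A symmetric sample gives a skewness of zero, and a kurtosis below 3 indicates lighter tails than a normal distribution.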
Analysis
Table 1: Summary Statistics of Smoker and Non-Smoker Mothers. The five point summary and the
number of observations in each smoker status group.
                         Smoker BWT   Non-Smoker BWT
Min                      58           55
1st QRT                  102          113
Median                   115          123
Mean                     114.1095     123.0472
3rd QRT                  126          134
Max                      163          176
Number of Observations   484          742
The Smoker and Non-Smoker birthweight distributions have different five-number summaries.
Notably, the mean birthweight of babies of Smoker mothers (114.1 ounces) is smaller
than that of babies of Non-Smoker mothers (123.0 ounces).
Table 2: Kurtosis and Skewness of Smoker and Non-Smoker Mothers Compared to Expected Normal
Distribution. The kurtosis and skewness of birthweight measurements of each smoke category are
compared to those of the expected normal distribution of their respective sample size.
                                         Kurtosis   Skewness
Smokers BWT                              2.975698   -0.0334909
Non-Smokers BWT                          4.026186   -0.1866062
Expected Normal Distribution Smoke       2.965161   0.0962498
Expected Normal Distribution Non-Smoke   2.958478   0.1042179
From these calculations, the Smoker birthweight distribution has nearly the same kurtosis and
skewness as a Normal distribution. The Non-Smoker birthweight distribution seems to deviate
somewhat from both the Smoker birthweight distribution and the Normal Distribution, with a kurtosis
of 4.03. Both distributions are roughly symmetric since their skewness values are close to 0.
Conclusion
From the kurtosis and skewness calculations, it can be seen that the Smoker birthweights
differ in shape from the Non-Smoker birthweights. The Smoker birthweight distribution
seems to be closer to normal than the Non-Smoker distribution, since the Smoker birthweights
have a kurtosis very similar to that of a random normal sample. The Non-Smoker birthweight
distribution has a more pronounced peak and heavier tails than the Normal Distribution and the Smoker
birthweights, since its kurtosis is larger. Even though the Non-Smoker birthweight distribution
appears to differ from the Normal Distribution, its sample mean can still be treated as
approximately normal by the Central Limit Theorem (CLT), since
there is a sufficiently large number of observations (742, as shown in Section 2.1).
The numerical summaries alone are not enough to confirm that the Smoking and Non-Smoking distributions are normally
distributed. From the initial look at the five-number summaries of Smoker and
Non-Smoker birth weights, there are some differences in the summary statistics, which
can be indicative of different distributions.
2.3. Graphical Analysis for Birth Weight Distribution of Smoker
vs. Non-Smoker Babies
To check whether the Smoking and Non-Smoking birthweights are normally distributed,
graphical methods are used to visualize each distribution.
Methods
Histograms of both Smoking and Non-Smoking birthweights were created because
birthweight is a numeric variable. To compare each to a normal distribution, an
expected normal curve with the mean and standard deviation of the Smoking or Non-Smoking
birthweights was drawn in red over the respective histogram. Q-Q plots are
then used to check whether the Smoking and Non-Smoking birthweights plausibly come from a
Normal Distribution with their mean and sd parameters.
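As an illustration of what a Q-Q plot compares, the Python sketch below pairs each sorted observation with the matching quantile of a normal distribution fitted to the sample. The report's plots were produced in R; the plotting positions (i + 0.5) / n, the helper name `qq_points`, and the sample values are all assumptions made for this sketch.

```python
from statistics import NormalDist, mean, stdev


def qq_points(values):
    """Return (theoretical quantile, sample quantile) pairs for a normal Q-Q plot."""
    n = len(values)
    fitted = NormalDist(mean(values), stdev(values))
    ordered = sorted(values)
    # Assumed plotting positions: (i + 0.5) / n for i = 0..n-1.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))


# Hypothetical birth weights in ounces, for illustration only.
weights = [100, 110, 120, 130, 140]
for theo, obs in qq_points(weights):
    print(f"{theo:7.2f} {obs:7.2f}")
```

If the sample is close to normal, the pairs fall near the line y = x; systematic departures at either end point to skewness or heavy tails.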
Analysis
[Figure 1 panels: "Histogram of smoke$bwt" and "Histogram of non_smoke$bwt" (x-axes: smoke$bwt and non_smoke$bwt; y-axis: Density), and a "Normal Q-Q Plot" for each group (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles).]
Figure 1: Histogram of Birthweights by Smoking Status (top) and Q-Q plot of Birthweights by
Smoking Status (bottom)
The data in both Smoking and Non-Smoking birthweights seem to follow the general shape
of their respective expected random Normal density curve, which is depicted in red. The
dashed blue line represents the mean of each respective birthweight distribution.
In each Q-Q plot, the red line represents the theoretical normal distribution. Because the
data points lie close to the red line in both Q-Q plots, the Smoking and
Non-Smoking birthweights are very close to their expected theoretical normal distributions,
with some minor deviations at the tails. These plots highlight potential
outliers, which are explored further with boxplots (see Figure 3 in the Appendix).
Conclusion
From the histograms and the Q-Q plots, we conclude that the Smoking and Non-Smoking
birthweights are approximately normally distributed. However, they do not share the same normal distribution: the Non-Smoker birthweight distribution is narrower than the Smoker birthweight
distribution, and the mean of the Smoker birthweights is smaller than the mean of the
Non-Smoker birthweights.
2.4. Incidence of Low Birth Weight Babies
Methods
We propose to use the number of babies classified as low-birth-weight to estimate the incidence.
The incidence rate of low birth weight is defined as:
\[ \text{incidence rate} = \frac{n_{\text{low-birth-weight}}}{n_{\text{total}}} \]
To understand how the incidence of low birth weight changes when the
classification threshold is changed, a list of candidate thresholds
is generated to examine the robustness of our estimates. The pattern of change in the
proportion estimates as the threshold changes is examined in this section through a
scatterplot of low-birth-weight proportions against possible classification thresholds. The
standard deviation of the proportions is also calculated to assess estimate reliability. We
consider our estimate reliable if this value does not vary much when
the classification standard is changed slightly.
Analysis
Using the provided standard (birth weight less than 88.2 oz), there were 40 out of 484 (8.26%)
low-birth-weight babies in the smoking mother group, and 23 out of 742 (3.10%) in
the non-smoking mother group. Numerically, the incidence rate for
low-birth-weight babies is lower in the non-smoking mother group than in the smoking
mother group.
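The incidence calculation and the threshold sweep described in the Methods can be sketched in Python as follows. The helper `incidence` and the short weight list are hypothetical, since the underlying data file is not reproduced here; only the counts 40/484 and 23/742 come from the report.

```python
def incidence(weights, threshold=88.2):
    """Fraction of birth weights (ounces) strictly below the threshold."""
    return sum(1 for w in weights if w < threshold) / len(weights)


# Reported counts from the analysis, converted to percentages.
smoker_rate = round(40 / 484 * 100, 2)
non_smoker_rate = round(23 / 742 * 100, 2)
print(smoker_rate, non_smoker_rate)

# Hypothetical weights, used only to show the sweep over thresholds.
example_weights = [80, 86, 90, 100, 115, 120, 131, 140]
sweep = [(t, incidence(example_weights, t)) for t in range(60, 125, 5)]
for threshold, rate in sweep:
    print(threshold, f"{rate:.3f}")
```

Raising the threshold can only add babies to the low-birth-weight class, so the rates along the sweep never decrease.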
In the left panel of Figure 2, each point represents the low-birth-weight rate of a group at one of
a sequence of potential classification thresholds. Since more babies are classified as
low-weight babies when the threshold is moved up, we expected the monotonically increasing
trend shown in the scatterplot. We visually observed that in the neighborhood
of the threshold standard that we use (88.2 ounces), no substantial jumps in the incidence
estimate are triggered by slight movements in the threshold.
[Figure 2 panels: "Low Birth Weight Rate vs Standard" (x-axis: Threshold for Low Birth Weight (ounces); y-axis: Proportion of Babies Lower than Threshold) and "SD of Incidence Rate vs Window Size" (x-axis: Window Size; y-axis: Standard Deviation of Incidence Rate). Legend in both panels: Smoking, Non-Smoking.]
Figure 2: Proportion of Babies Classified as Low Birth Weight vs Potential Low Birth Weight Baby
Thresholds (left) and Standard Deviation of Incidence Rate vs Window Size (right).
The right panel of Figure 2 illustrates how the standard deviation of the incidence rate changes when
the examining window around the 88.2 ounces standard is widened. We observed that the
estimate is more robust for the non-smoking group, as the standard deviation rises
slowly when the window enlarges. Compared to the non-smoking group, the rise in
standard deviation is slightly steeper in the smoker group. However, this standard deviation
still does not look substantial when the window size is as large as 20.
Conclusion
Based on the analysis above, we find that the incidence rate for low-birth-weight babies is
higher in the smoking mother group than in the non-smoking one in our sample
(8.26% vs 3.10%). From our experiments, we conclude that the estimate for the low-birth-weight babies is reliable and robust. The estimate does not vary much when a few more or
fewer babies are classified as low birth weight.
3. Advanced Analysis
(Use methods to answer an additional question not asked. You can also use additional
methods not covered in class. 1-2 pages.)
We have observed from the previous analysis that there are some group-wise differences in
the distributions. We have also suggested that the incidence rate is a reliable estimate. We
now want to assess whether there is a statistically significant difference in the incidence
rate between the smoking and non-smoking mother groups, that is, whether a
mother's smoking status is associated with babies weighing under 88.2 ounces.
Methods
To assess the importance of the difference between the incidence rates, and further whether smoking
status is associated with the low-birth-weight classification, we propose to use a Yates-Corrected Chi-Squared Test of Independence. The null hypothesis in this test is that there is
no association between the mother's smoking status and low birth weight. The alternative
hypothesis is that there is an association between smoking status and low birth weight.
At a significance level of 0.05, we plan to reject the null hypothesis if our test statistic is
above the critical value of 3.84 (chi-squared distribution with 1 degree of freedom).
Analysis
From the result of the Chi-Squared Test of Independence, we observed a p-value less than our
level of significance 0.05 (p = 0.00011). The test statistic 14.99 is also larger than our critical
value of 3.84 (see Figure 4 for a visualization with the probability density function (pdf)).
Therefore, we decided to reject the null hypothesis and conclude that there is a statistically
significant association between mother’s smoking status and the incidence of low-birth-weight
babies, under the significance level of 0.05.
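The reported test statistic can be checked with a short Python sketch of the Yates-corrected chi-squared test for a 2x2 table. The counts come from Section 2.4 (40 of 484 smoking, 23 of 742 non-smoking); the function name `yates_chi_squared` is hypothetical, and the p-value uses the identity P(chi-squared with 1 df > x) = erfc(sqrt(x / 2)). The report itself ran the test in R.

```python
import math


def yates_chi_squared(a, b, c, d):
    """Yates-corrected chi-squared statistic and p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    cells = ((a, row1, col1), (b, row1, col2), (c, row2, col1), (d, row2, col2))
    statistic = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        statistic += (abs(observed - expected) - 0.5) ** 2 / expected
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Rows: smoking, non-smoking. Columns: low birth weight, not low birth weight.
stat, p = yates_chi_squared(40, 484 - 40, 23, 742 - 23)
print(f"statistic = {stat:.2f}, p-value = {p:.5f}")
```

The continuity correction subtracts 0.5 from each absolute deviation, which makes the test slightly more conservative than the uncorrected version.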
Conclusion
Using a chi-squared test of independence (α = 0.05), we conclude that a mother's smoking
status is associated with the incidence of low-birth-weight babies, defined as babies weighing less than 88.2
ounces.
4. Discussion and Conclusion
(Summarize your main findings here. Compare and contrast the results from your separate
analyses. Compare your overall findings to findings found in other studies – Does it match
what others have found? If not, why? Are there limitations in the data? 1 page.)
The numerical analysis suggests that the Smoker birthweights are close to normally
distributed, while the evidence for the Non-Smoker birthweights is weaker because of their
higher kurtosis. The graphical analysis indicates that both the Smoker and
Non-Smoker birthweights are approximately normally distributed. The incidence analysis shows that there
is a higher proportion of low-birth-weight babies in the smoking group. From the experiments
conducted, we also concluded that our estimate of the incidence is reliable and robust. Lastly,
the chi-squared test of independence suggests that there is an association between a mother's
smoking status and the incidence of low-birth-weight babies when using the original threshold
of 88.2 ounces.
Several confounders should be considered in the investigation process. This study must
account for confounders since the data used in this research came from a retrospective
observational study. Since the data was not produced by a controlled experiment, we can
only infer association and cannot establish a causal relationship. Another limitation is that
the study was conducted on a group of people that is potentially not representative
of all mothers. All of the records are single births of male babies who survived for at
least 28 days. There may be additional confounders that could influence the conclusion.
For instance, there could be socio-economic factors that affect the mother's health and,
by extension, the baby's health. Gestational age could also contribute to a lower birth
weight, since birth weight increases with gestational age(3). A
future direction in expanding this analysis is to investigate the effect of smoking on gestational
age to determine whether a shorter gestational age acts as a mediator in the relationship
between smoking status and low birth weight.
Although we cannot establish a causal relationship between smoking during pregnancy and
low birth weights, this report found a strong association between these two variables. (more
writing tie back to the scientific question, and assess if this difference being important to the
health of the baby, which is abbreviated here) Furthermore, there are extensive studies that
emphasize birth weight as an indicator of the baby's health, since low-birth-weight babies
are more likely to develop complications such as cognitive deficits, motor delays, cerebral
palsy, and psychological problems. In fact, low-birth-weight babies are 20 times more likely to
develop fatal complications and die than normal-birth-weight babies(4). Therefore,
although a causal relationship cannot be established, smoking during pregnancy is something that
should not be overlooked.
Work Cited and Appendix DOES NOT count towards 10 page limit
5. Work Cited
(Abbreviated, would recommend storing citations with a citation manager such as Mendeley
(https://www.mendeley.com/download-desktop-new/) or Zotero (https://www.zotero.org/)
so you can store your citations and easily make a work cited page. The studies cited can be
found in popular scientific literature sites such as pubmed (https://www.ncbi.nlm.nih.gov/pmc/).
For this report, I used the citation format commonly found in Nature, but you can
choose whatever MLA, ALA citation format you would like to use.)
1. Benowitz, N. L. Nicotine addiction. N. Engl. J. Med. 362, 2295–2303 (2010).
2. Wickstrom, R. Effects of Nicotine During Pregnancy: Human and Experimental
Evidence. CN 5, 213–222 (2007).
3. Topçu, H. O. et al. Birth weight for gestational age: A reference study in a tertiary
referral hospital in the middle region of Turkey. Journal of the Chinese Medical
Association 77, 578–582 (2014).
4. K. C., A., Basel, P. L. & Singh, S. Low birth weight and its associated risk factors:
Health facility-based case-control study. PLoS ONE 15, e0234907 (2020).
6. Appendix
[Figure 3 panels: "Boxplot of Non-Smoker bwt" and "Boxplot of Smoker bwt".]
Figure 3: Boxplot of birthweights separated by Smoking Status. There are a lot of observations
with low birthweights in the non-smoker data that can potentially skew the analysis.
[Figure 4 plot: "Chi-Squared Density Plot df = 1" (x-axis: x; y-axis: dchisq(x, df = 1)).]
Figure 4: Density of the chi-squared distribution with df = 1. The red line marks the test statistic
from the chi-squared test. The p-value is the area under the curve to the right of the red line.
The blue line marks the critical value at the 0.05 significance level. The p-value is very small
because the area to the right of the red line is extremely small.
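The two quantities described in the Figure 4 caption can be computed directly in R. The following is a minimal sketch, not the report's actual code: the value of `stat` is a hypothetical placeholder, since the observed test statistic is not given here.

```r
# Observed chi-squared statistic.
# ASSUMPTION: hypothetical placeholder, not the value from the report.
stat <- 10

# p-value: area under the df = 1 density to the right of the statistic
# (the region to the right of the red line in Figure 4).
p_value <- pchisq(stat, df = 1, lower.tail = FALSE)

# Critical value at the 0.05 significance level (the blue line):
# the point with 5% of the area to its right.
crit <- qchisq(0.95, df = 1)

# The null hypothesis is rejected at the 0.05 level when the
# statistic lies to the right of the critical value.
reject <- stat > crit

print(c(p_value = p_value, critical_value = crit))
print(reject)
```

Replacing `stat` with the statistic reported by `chisq.test()` would reproduce the comparison shown in the figure. The critical value for df = 1 at the 0.05 level is about 3.84, so any statistic above that gives a p-value below 0.05.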
Exploratory Data Analysis and Inference
Format
Objective
One of the most important goals of this course is to learn how to write a data analysis report. The HW
is to be submitted in a format similar to a data analysis report. The difference is that the HW will be
more structured so that it can be more easily composed and graded.
Structure
The overall structure of the HW report should be as follows:
0. Header
1. Introduction
2. Basic analysis
3. Advanced analysis
4. Conclusion(s)/Discussion
5. Appendix/Appendices
6. Computer code
Now let’s consider the basic outline of the data analysis report in more detail:
0. Header. This includes important general information:
• Title: Choose a succinct but specific title that reflects the goals of the analysis.
• Author contributions: Include a brief description of the respective contribution of each of
the team members.
1. Introduction. Good features for the Introduction include:
• Brief summary of the study and data, as well as any relevant substantive context,
background, or framing issues.
• The “big questions” answered by your data analyses, and summaries of your conclusions
about these questions. These questions should include: 1) the questions posed by the HW
prompts; 2) other questions that you may propose.
• Brief outline of remainder of the report.
2. Basic analysis. In this format, the analysis is organized by research questions. Devote a
subsection for each question raised in the Introduction. These questions should be organized
according to the HW prompts. Within each subsection, describe the statistical methods,
analysis, and conclusions for that question. For example:
2.1 Data processing and summaries
Methods
Analysis
Conclusions
2.2 Comparison between males and females
Methods
Analysis
Conclusions
2.3 Effect of Age
Methods
Analysis
Conclusions
Etc. . .
3. Advanced analysis. This section contains analysis that goes beyond the HW prompts. It will
display your own interest and creativity. It may include:
• An additional analysis question, e.g. estimating another parameter, considering the effect of
another variable in the data, evaluating the validity of statistical assumptions not
considered, etc.
• Using a more advanced analysis method to answer one of the HW questions or a new
question.
4. Conclusion(s)/Discussion. This section closes the report:
• Conclusion summary: It should reprise the questions and goals of the analysis stated in the
introduction. It should also summarize the findings and compare them to the original goals.
• Discussion: If relevant, include additional observations or details gleaned from the analysis
section. If relevant, discuss relevance to the science and other studies. Discuss data
limitations. New questions, future work, etc., can also be raised here.
5. Appendix/Appendices. This section is not mandatory but it may be necessary depending on
what you do. This is the place to put details and ancillary materials, that is, materials that you
want to include but would disturb the reading flow if they were put in the main text. These
might include such items as
• Technical descriptions of (unusual) statistical procedures
• Detailed tables or computer output
• Figures and Tables that were not central to the arguments presented in the body of the
report
6. Computer code. In a general data analysis report, computer code may be included in the
Appendix. In our course, code should be submitted as a separate file. Make sure to
document your code by including appropriate section headers, text sentences, comments and
annotations, to make it easier for the reader to follow what you are doing.
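As an illustration of the code documentation described in item 6, a submitted R file might be laid out roughly as follows. This is only a sketch of one possible layout: the variable names and the data values are hypothetical placeholders, and the section header simply mirrors the example subsection titles above.

```r
# ------------------------------------------------------------
# HW report code (hypothetical layout)
# Each section below matches a subsection of the report.
# ------------------------------------------------------------

# ---- 2.1 Data processing and summaries ----

# ASSUMPTION: placeholder values standing in for a real data column.
values <- c(4.1, 5.3, 6.0, 5.7, 4.8)

# Summary statistics reported in the Analysis part of subsection 2.1.
n <- length(values)      # sample size
avg <- mean(values)      # sample mean
std_dev <- sd(values)    # sample standard deviation

# Collect the summaries in one named vector so they can be
# copied into a table in the report.
summary_stats <- c(n = n, mean = avg, sd = std_dev)
print(summary_stats)
```

Matching the section headers in the code to the subsection numbers of the report lets a reader jump from a result in the text to the lines that produced it.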
Formatting and length
A good data analysis report should present all the necessary information in a concise fashion. To
practice this and facilitate grading, please abide by the following constraints:
• Use 12-point font for the main text, with full space between lines.
• Start every section on a new page. This will make it easier for you to mark which pages
correspond to each graded item in Gradescope.
• Length guidelines:
o Header + Introduction: 1 page
o Each question: 1 to 2 pages each, including tables and figures
o Advanced analysis: 1 to 2 pages, including tables and figures
o Summary/conclusions/discussion: 1 page
• The total length of the report should not exceed 10 pages (not including Appendix or code).
Any additional material, if it is really necessary, should go in the Appendix.
Presentation style
Points will be given for good presentation style and abiding by the formatting constraints. As a
guideline, a good data analysis report has several important features:
• It is organized in a way that makes it easy for different audiences to skim/fish through it to find
the topics and the level of detail that are of interest to them.
• The writing is as invisible/unremarkable as possible, so that the content of the analysis is what
the reader remembers, not distracting quirks or tics in the writing. Examples of distractions
include:
– Extra sentences, overly formal or flowery prose, or at the other extreme overly casual or
overly brief prose.
– Grammatical and spelling errors.
– Placing the data analysis in too broad or too narrow a context for the questions of
interest to your primary audience.
– Focusing on process rather than reporting procedures and outcomes.
– Getting bogged down in technical details, rather than presenting what is necessary to
properly understand your conclusions on substantive questions of interest to the primary
audience.
• Tables are well organized, with well labeled columns and rows. Do not make a table so large
that the reader gets lost.
• Figures are well composed, with well labeled axes and large enough fonts. If relevant, use
colors and line types to distinguish between different results and include a legend. Do not make
a figure so busy that it is hard to understand.