Description

The HW questions are at the end of the casestudy4 file, and the data you need are provided. Use exactly the method given to solve the questions, and meet all the format requirements given. Check the sample homework report for the HW format (it needs an intro, data, analysis, conclusion, and advanced analysis).

Introduction

The data

Background

Investigations

Math 189: Chapter 4 Data Analysis and Inference

Armin Schwartzman Professor Division of Biostatistics and Halıcıoǧlu Data Science Institute University of California, San Diego

Snow gauge

* The main source of water for Northern California is the Sierra Nevada mountains.
* To help monitor the water supply, the Forest Service of the United States Department of Agriculture (USDA) operates a gamma transmission snow gauge in the Central Sierra Nevada near Soda Springs, CA. The gauge is used to determine a depth profile of snow density.
* Analysis of the snow pack profile helps with monitoring the water supply and flood management. It is also a source of data for the study of climate change.

Snow gauge (cont.)

* The gauge does not directly measure snow density. The density reading is converted from a measurement of gamma ray emissions.
* Due to instrument wear and radioactive source decay, there may be changes over the seasons in the functions used to convert the measured values into density readings.
* To adjust the conversion method, a calibration run is made each year at the beginning of the winter season.
* In this case study we will develop a procedure to calibrate the snow gauge from data.

Description

* The data are from a single calibration run of the snow gauge.
* The run consists of placing polyethylene blocks of known densities between the two poles of the snow gauge and taking readings on the blocks. The polyethylene blocks are used to simulate snow.
* The measurements reported are amplified versions of the gamma photon count made by the detector. We call the gauge measurement the "gain".
* The data available here consist of 10 measurements for each of 9 densities in grams per cubic centimeter of polyethylene.

The Data


The calibration process

To be used in practice, the snow gauge needs to map the measured gamma ray intensity to snow density. However, the experiment is done in reverse. The calibration process goes as follows.

1. The experiment measures gamma ray intensity as a function of the density of the polyethylene blocks.
2. From the data, a function is determined that maps density to gamma ray intensity.
3. The inverse of the above function is used to map gamma ray intensity to density.

A Physical Model

The gamma rays that are emitted from the radioactive source may be scattered or absorbed by the polyethylene molecules between the source and the detector. With denser polyethylene, fewer gamma rays will reach the detector.

A simplified version of the model that may be workable for the calibration problem of interest is described here. A gamma ray en route to the detector passes a number of polyethylene molecules. The number of molecules depends on the density of the polyethylene. A molecule may absorb the gamma photon, bounce it out of the path to the detector, or allow it to pass.

If each molecule acts independently, the chance that a gamma ray successfully arrives at the detector is $p^m$, where $p$ is the chance that a single molecule will neither absorb nor bounce the gamma ray, and $m$ is the number of molecules in a straight-line path from the source to the detector.

Let $d = Cm$ be the density, proportional to the number of molecules $m$ by some unknown constant $C$.

Let $g = Ap^m$ be the instrument gain, proportional to the probability of detection $p^m$ by some unknown constant $A$.

The gamma ray measurement can be expressed as

$$g = A p^m = A e^{(\log p)\, m} = A e^{\frac{\log p}{C} \cdot (Cm)} = A e^{\beta d},$$

where $\beta = (\log p)/C$, and $A > 0$ and $\beta < 0$ are unknown coefficients. In other words, the gamma ray measurement decays exponentially with the density.

The purpose of the calibration is to estimate the unknown coefficients $A$ and $\beta$.
Linearization
The above model can be made linear in the density $d$ by taking a log transformation:
$$\log g = \log A + \beta d$$
If we observe the gain $g$, then $Y = \log g$ can be modeled as a linear function of the density $X = d$ as:
$$Y = \beta_0 + \beta X + \text{error}$$
Once $\beta_0$ and $\beta$ have been estimated, the model can be inverted to estimate a new density $d$ as a function of a new observed gain $g$.
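The fit-then-invert idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the calibration data are not reproduced here, so the gains are simulated from made-up values `A_TRUE` and `BETA_TRUE` on a hypothetical density grid, and the helper names `fit_line` and `gain_to_density` are invented for this sketch.

```python
import math
import random


def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


def gain_to_density(gain, b0, b1):
    """Invert log(gain) = b0 + b1 * density to recover the density."""
    return (math.log(gain) - b0) / b1


# Synthetic calibration run (NOT the real data): hypothetical coefficients.
rng = random.Random(0)
A_TRUE, BETA_TRUE = 430.0, -4.7
DENSITIES = [0.001, 0.08, 0.15, 0.22, 0.32, 0.41, 0.508, 0.60, 0.69]

d_obs, g_obs = [], []
for d in DENSITIES:
    for _ in range(10):  # 10 readings per block, as in the description
        # Multiplicative noise on the gain is additive noise on the log scale.
        g_obs.append(A_TRUE * math.exp(BETA_TRUE * d + rng.gauss(0.0, 0.05)))
        d_obs.append(d)

# Fit on the log scale: intercept estimates log(A), slope estimates the decay rate.
b0_hat, b1_hat = fit_line(d_obs, [math.log(g) for g in g_obs])
print("estimated A:", math.exp(b0_hat), "estimated slope:", b1_hat)
print("density for a gain of 38.6:", gain_to_density(38.6, b0_hat, b1_hat))
```

With real data, `d_obs` and `g_obs` would be replaced by the recorded densities and gains; everything else stays the same.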
The aim of this HW is to provide a procedure for converting gain
into density when the gauge is in operation. Keep in mind that
the calibration experiment was conducted by varying density and
measuring the response in gain, but when the gauge is ultimately in
use, the density is to be estimated from the measured gain.
1. Raw data: Fit a regression line to the data and plot the fit.
Examine the residual plot and explain why a transformation
may be necessary.
2. Transformed data: Determine an appropriate transformation
and fit the model to the transformed data. Plot the new fit
and examine the residuals. Justify your final model using both
theoretical and empirical arguments.
3. Robustness: Suppose the densities of the polyethylene blocks
are not reported exactly. How might this affect the fit? Use a
simulation to answer this question.
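One possible shape for such a simulation is sketched below. It assumes a classical measurement-error setup (each block's reported density equals the true density plus independent noise, while the gain responds to the true density). The coefficients, noise levels, and density grid are hypothetical, and `mean_fitted_slope` is a name invented here.

```python
import math
import random


def slope(x, y):
    """Least-squares slope of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    return sxy / sxx


# Hypothetical values, NOT estimates from the real calibration run.
A_TRUE, BETA_TRUE, LOG_NOISE_SD = 430.0, -4.7, 0.05
DENSITIES = [0.001, 0.08, 0.15, 0.22, 0.32, 0.41, 0.508, 0.60, 0.69]


def mean_fitted_slope(report_sd, rng, reps=200, per_block=10):
    """Average fitted slope when each block's density is misreported once."""
    total = 0.0
    for _ in range(reps):
        x, y = [], []
        for d in DENSITIES:
            reported = d + rng.gauss(0.0, report_sd)  # what the analyst sees
            for _ in range(per_block):
                # The gain still responds to the TRUE density d.
                y.append(math.log(A_TRUE) + BETA_TRUE * d
                         + rng.gauss(0.0, LOG_NOISE_SD))
                x.append(reported)
        total += slope(x, y)
    return total / reps


rng = random.Random(1)
mean_slopes = {sd: mean_fitted_slope(sd, rng) for sd in (0.0, 0.02, 0.1)}
for sd, b in mean_slopes.items():
    print(f"reporting error sd={sd}: mean fitted slope {b:.3f}")
```

Under these assumptions the fitted slope tends to shrink toward zero as the reporting error grows, which in turn would distort densities recovered by inverting the line. Other error models (for example, a systematic bias in the reported densities) would need a different simulation.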
4. Forward prediction: Produce point estimates and uncertainty
bands for predicting the gain (in the original scale) as a
function of the measured density. Can some gains be predicted
more accurately than others? Consider specific prediction
intervals for densities of 0.508 and 0.001 and compare these
intervals to the range of measured gains for those densities.
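A hedged sketch of one way to build such bands: compute the usual simple-regression prediction interval on the log scale, then exponentiate its endpoints. The data below are synthetic with hypothetical coefficients, `fit_summary` and `predict_gain` are illustrative names, and a normal quantile stands in for the t quantile (a large-sample approximation, since the standard library has no t distribution).

```python
import math
import random
from statistics import NormalDist


def fit_summary(x, y):
    """OLS fit of y = b0 + b1 * x plus the quantities needed for intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1,
            "s": math.sqrt(rss / (n - 2))}


def predict_gain(fit, density, level=0.95):
    """Return (lower, point, upper) for a new gain reading at `density`."""
    q = NormalDist().inv_cdf(0.5 + level / 2)  # normal stand-in for the t quantile
    center = fit["b0"] + fit["b1"] * density
    half = q * fit["s"] * math.sqrt(
        1 + 1 / fit["n"] + (density - fit["x_bar"]) ** 2 / fit["sxx"])
    # The interval is built on the log scale, then mapped back with exp().
    return math.exp(center - half), math.exp(center), math.exp(center + half)


# Synthetic calibration run (NOT the real data): hypothetical coefficients.
rng = random.Random(2)
A_TRUE, BETA_TRUE = 430.0, -4.7
DENSITIES = [0.001, 0.08, 0.15, 0.22, 0.32, 0.41, 0.508, 0.60, 0.69]
x = [d for d in DENSITIES for _ in range(10)]
y = [math.log(A_TRUE) + BETA_TRUE * d + rng.gauss(0.0, 0.05) for d in x]
fit = fit_summary(x, y)

for d0 in (0.508, 0.001):
    low, point, high = predict_gain(fit, d0)
    print(f"density {d0}: gain {point:.1f}, interval ({low:.1f}, {high:.1f})")
```

Because the band has nearly constant width on the log scale, its width in gain units scales with the predicted gain, so in this sketch large gains (low densities) get wider absolute intervals than small gains.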
5. Reverse prediction: The average measured gains for the
density values of 0.508 and 0.001 are 38.6 and 426.7,
respectively. Invert the forward prediction line and uncertainty
bands to produce point estimates and prediction intervals for
the density that correspond to the gain measurements 38.6
and 426.7. How do the reverse predictions compare to the
true density values? Are some densities harder to predict than
other densities?
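Inverting the bands can be sketched as follows: for an observed gain, find the densities at which the lower and upper log-scale prediction bands equal the log of that gain. The sketch assumes a negative fitted slope, uses synthetic data with hypothetical coefficients, substitutes a normal quantile for the t quantile, and uses invented names (`fit_summary`, `density_interval`).

```python
import math
import random
from statistics import NormalDist


def fit_summary(x, y):
    """OLS fit of y = b0 + b1 * x plus the quantities needed for intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1,
            "s": math.sqrt(rss / (n - 2))}


def density_interval(fit, gain, level=0.95):
    """Return (lower, point, upper) density for an observed gain (slope < 0)."""
    q = NormalDist().inv_cdf(0.5 + level / 2)  # normal stand-in for the t quantile
    y0 = math.log(gain)

    def band(d, sign):
        half = q * fit["s"] * math.sqrt(
            1 + 1 / fit["n"] + (d - fit["x_bar"]) ** 2 / fit["sxx"])
        return fit["b0"] + fit["b1"] * d + sign * half

    def solve(sign, lo=-1.0, hi=2.0):
        # Bisection: each band is decreasing in d when the slope is negative.
        for _ in range(60):
            mid = (lo + hi) / 2
            if band(mid, sign) > y0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return solve(-1), (y0 - fit["b0"]) / fit["b1"], solve(+1)


# Synthetic calibration run (NOT the real data): hypothetical coefficients.
rng = random.Random(4)
A_TRUE, BETA_TRUE = 430.0, -4.7
DENSITIES = [0.001, 0.08, 0.15, 0.22, 0.32, 0.41, 0.508, 0.60, 0.69]
x = [d for d in DENSITIES for _ in range(10)]
y = [math.log(A_TRUE) + BETA_TRUE * d + rng.gauss(0.0, 0.05) for d in x]
fit = fit_summary(x, y)

for gain in (38.6, 426.7):
    low, point, high = density_interval(fit, gain)
    print(f"gain {gain}: density {point:.3f}, interval ({low:.3f}, {high:.3f})")
```

For gains near the top of the range the lower endpoint can fall below zero in a sketch like this; whether to truncate at zero is a modeling decision left to the analyst.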
6. Cross-Validation: The reverse prediction may be influenced
by the fact that the measurement corresponding to the
densities 0.508 and 0.001 were included in the fitting. To
avoid this, omit the set of measurements corresponding to the
block of density 0.508, apply your estimation/calibration
procedure to the remaining data, and provide an interval
estimate for the density of a block with an average reading of
38.6. Where does the actual density fall in the interval? Try
the same test, for the set of measurements at the 0.001
density.
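A leave-one-block-out check along these lines might look like the sketch below. It holds out all readings of one block, refits on the rest, and inverts the prediction band at the held-out block's average gain. The data are synthetic with hypothetical coefficients, the function names are invented, and the band is the one for a single new reading with a normal quantile in place of the t quantile, so it is likely conservative for an average of 10 readings.

```python
import math
import random
from statistics import NormalDist


def fit_summary(x, y):
    """OLS fit of y = b0 + b1 * x plus the quantities needed for intervals."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return {"n": n, "x_bar": x_bar, "sxx": sxx, "b0": b0, "b1": b1,
            "s": math.sqrt(rss / (n - 2))}


def density_interval(fit, gain, level=0.95):
    """Return (lower, point, upper) density for an observed gain (slope < 0)."""
    q = NormalDist().inv_cdf(0.5 + level / 2)
    y0 = math.log(gain)

    def band(d, sign):
        half = q * fit["s"] * math.sqrt(
            1 + 1 / fit["n"] + (d - fit["x_bar"]) ** 2 / fit["sxx"])
        return fit["b0"] + fit["b1"] * d + sign * half

    def solve(sign, lo=-1.0, hi=2.0):
        for _ in range(60):  # bisection on a decreasing band
            mid = (lo + hi) / 2
            if band(mid, sign) > y0:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2

    return solve(-1), (y0 - fit["b0"]) / fit["b1"], solve(+1)


def holdout_interval(blocks, held_out):
    """Fit without the `held_out` block, then invert at its average gain."""
    x, y = [], []
    for d, gains in blocks.items():
        if d == held_out:
            continue
        for g in gains:
            x.append(d)
            y.append(math.log(g))
    mean_gain = sum(blocks[held_out]) / len(blocks[held_out])
    return density_interval(fit_summary(x, y), mean_gain)


# Synthetic calibration run (NOT the real data): hypothetical coefficients.
rng = random.Random(5)
A_TRUE, BETA_TRUE = 430.0, -4.7
DENSITIES = [0.001, 0.08, 0.15, 0.22, 0.32, 0.41, 0.508, 0.60, 0.69]
blocks = {d: [A_TRUE * math.exp(BETA_TRUE * d + rng.gauss(0.0, 0.05))
              for _ in range(10)] for d in DENSITIES}

for d in (0.508, 0.001):
    low, point, high = holdout_interval(blocks, d)
    print(f"held-out density {d}: estimate {point:.3f}, "
          f"interval ({low:.3f}, {high:.3f})")
```

Holding out the block at the edge of the density range turns the prediction into an extrapolation, which is one reason the two held-out cases may behave differently.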
MATH 189: Exploratory Data Analysis and Inference
Spring 2021
HW Submission Format
Objective
One of the most important goals of this course is to learn how to write a data analysis report. The HW
is to be submitted in a format similar to a data analysis report. The difference is that the HW will be
more structured so that it can be more easily composed and graded.
Structure
The overall structure of the HW report should be as follows:
0. Header
1. Introduction
2. Analysis
3. Conclusion(s)/Discussion
4. Appendix/Appendices
Now let’s consider the basic outline of the data analysis report in more detail:
0. Header. This includes important general information:
• Title: Choose a succinct but specific title that reflects the goals of the analysis.
• Author contributions: Include a brief description of the respective contribution of each of
the team members.
1. Introduction. Good features for the Introduction include:
• Brief summary of the study and data, as well as any relevant substantive context,
background, or framing issues.
• The “big questions” answered by your data analyses, and summaries of your conclusions
about these questions. These questions should include: 1) the questions posed by the HW
prompts; 2) other questions that you may propose.
• Brief outline of remainder of the report.
2. Basic analysis. In this format, the analysis is organized by research questions. Devote a
subsection for each question raised in the Introduction. These questions should be organized
according to the HW prompts. Within each subsection, statistical method, analyses, and
conclusion would be described (for each question). For example:
2.1 Data processing and summaries
Methods
Analysis
Conclusions
2.2 Comparison between males and females
Methods
Analysis
Conclusions
2.3 Effect of Age
Methods
Analysis
Conclusions
Etc. . .
3. Advanced analysis. This section contains analysis that goes beyond the HW prompts. It will
display your own interest and creativity. It may include:
• An additional analysis question, e.g. estimating another parameter, considering the effect of
another variable in the data, evaluating the validity of statistical assumptions not
considered, etc.
• Using a more advanced analysis method to answer one of the HW questions or a new
question.
4. Conclusion(s)/Discussion. This section closes the report:
• Conclusion summary: It should reprise the questions and goals of the analysis stated in the
introduction. It should also summarize the findings and compare them to the original goals.
• Discussion: If relevant, include additional observations or details gleaned from the analysis
section. If relevant, discuss relevance to the science and other studies. Discuss data
limitations. New questions, future work, etc., can also be raised here.
5. Appendix/Appendices. This section is not mandatory but it may be necessary depending on
what you do. This is the place to put details and ancillary materials, that is, materials that you
want to include but would disturb the reading flow if they were put in the main text. These
might include such items as
• Technical descriptions of (unusual) statistical procedures
• Detailed tables or computer output
• Figures and Tables that were not central to the arguments presented in the body of the
report
6. Computer code. In a general data analysis report, computer code may be included in the
Appendix. In our course, code should be submitted as a separate file. Make sure to
document your code by including appropriate section headers, text sentences, comments and
annotations, to make it easier for the reader to follow what you are doing.
Formatting and length
A good data analysis report should present all the necessary information in a concise fashion. To
exercise this and facilitate grading, please abide by the following constraints:
• Use 12-point font for the main text, with full space between lines.
• Start every section in a new page. This will make it easier for you to mark which pages
correspond to each graded item in Gradescope.
• Length guidelines:
o Header + Introduction: 1 page
o Each question: 1 to 2 pages each, including tables and figures
o Advanced analysis: 1 to 2 pages, including tables and figures
o Summary/conclusions/discussion: 1 page
• The total length of the report should not exceed 10 pages (not including Appendix or code).
Any additional material, if it is really necessary, should go in the Appendix.
Presentation style
Points will be given for good presentation style and abiding by the formatting constraints. As a
guideline, a good data analysis report has several important features:
• It is organized in a way that makes it easy for different audiences to skim/fish through it to find
the topics and the level of detail that are of interest to them.
• The writing is as invisible/unremarkable as possible, so that the content of the analysis is what
the reader remembers, not distracting quirks or tics in the writing. Examples of distractions
include:
– Extra sentences, overly formal or flowery prose, or at the other extreme overly casual or
overly brief prose.
– Grammatical and spelling errors.
– Placing the data analysis in too broad or too narrow a context for the questions of
interest to your primary audience.
– Focusing on process rather than reporting procedures and outcomes.
– Getting bogged down in technical details, rather than presenting what is necessary to
properly understand your conclusions on substantive questions of interest to the primary
audience.
• Tables are well organized, with well labeled columns and rows. Keep each table small enough
that it can be easily followed and the reader does not get lost.
• Figures are well composed, with well labeled axes and large enough fonts. If relevant, use
colors and line types to distinguish between different results and include a legend. Keep each
figure uncluttered so that it can be easily understood.
Smoking in Mothers Result in Decreasing Birth Weights
Benjamin Pham and Xinran Wang
Disclaimer
The report provided here IS NOT the definitive answer key for Homework 1.
This is meant to serve as an example of what we think might be an adequate
submission. There are multiple possible answers for these parts that can also get
full marks.
(This is the header mentioned in the HW guidelines - includes Title, Authors, and Contribution
Statement. This DOES NOT count towards the report 10 page limit.)
0. Contribution Statement
Both Benjamin Pham and Xinran Wang wrote R code according to their written parts in
this work. Both students discussed and implemented the data processing section. Benjamin
Pham wrote the Numerical Analysis section, and Graphical Analysis section. Xinran wrote
the Incidence section and Conclusion. Both students contributed equally to the Introduction,
Advanced Analysis section, and Conclusion. In addition, both students reviewed and added
changes to the whole report.
1. Introduction
(Background abbreviated, should include literature reviews + citations in motivating the
analysis, 1 page)
Smoking has remained a highly addictive and destructive habit among adults in the past 50
years despite multitudes of public health advances. Addiction to nicotine cigarettes causes
80% of people who try to quit to fail and to indulge in their habit despite
knowing the dangers of doing so (1). It is no surprise that soon-to-be pregnant mothers, who
may have educated themselves on the hazards of smoking while pregnant, start or continue
to smoke. Smoking during pregnancy is known to cause adverse effects to fetal development.
Small birth weights and early gestational periods due to smoking during pregnancy usually
result in a lower survival rate for babies from various problems such as restricted oxygen
and nutritional transfer during fetal development (2).
The main goal of this analysis is to investigate the differences in distributions of babies’ birth
weight to smoking mothers versus non-smoking mothers. In this analysis, we use numerical
summaries and graphical methods to describe the distribution of babies’ birth weight, and
experiment on our estimates on the low-weight birth weight rate. Numerical summaries
include the minimum, maximum, mean, median, standard deviations, kurtosis, skewness,
and quantiles of the birth weights for babies born to women who smoked and did not smoke
during their pregnancy. Graphical methods, including histograms and Q-Q plots, compare the
distributions of the two groups. Incidence experiments, run under different classification
standards for low-birth-weight babies, assess the robustness of our estimates. We then
use the Chi-Squared Test of Independence, a hypothesis testing method, to determine
if smoking status is associated with low birth weight. Combining all the evidence above, we
determine whether the differences observed between groups are important.
Data
The data from babies.txt is part of the Child Health and Development Studies database
which details pregnancies occurring between 1960 and 1967 of women enrolled in the Kaiser
Foundation Health plan in the Oakland area. The data include women of different races.
The dataset consists of 1236 male babies who have lived at least 28 days and were all single
births (no twins). The two variables of interest are the baby’s birth weight, which is a
numerical, discrete variable measured in ounces, and smoking status, which is a categorical
variable coded as an integer indicator: 1 if the mother smoked
during her pregnancy and 0 if the mother did not smoke during her pregnancy.
2. Basic Analysis
(For each question, Provide a methods, analysis, and conclusion section as shown in the
guidelines. The method section describes what was conducted to yield the results. The
analysis section shows the results. The conclusion section shows the interpretation of the
results. You are NOT limited to talking purely about a specific subsection in each conclusion.
You can call back to stated results from prior sections. Each section should be 1-2 pages as
needed.)
2.1. Data Processing
Methods
The data was loaded with R. Our basic analysis mainly focused on birth weight (bwt) and
mother smoking status (smoke). The data was cleaned where observations with missing
values in these columns were removed from our analysis.
Analysis
The data originally had 1,236 observations. Removing the observations with missing
smoke values reduced the dataset to 1,226 observations. The observations in the dataset are
distributed unevenly, as there are 742 Non-Smoker mothers and 484 Smoker mothers.
Conclusion
Removing observations in the dataset results in a loss of data. However, there is still an
extremely large number of observations in the dataset which shows that the analysis will not
be majorly affected by the loss of 10 observations. It is possible to trim even more data from
the dataset if missing values in other columns are considered. However, this could result in a
loss of potentially important data points.
2.2. Numerical Analysis for Birth Weight Distribution of Smoker
vs. Non-Smoker Babies
Methods
A five number summary of the birth weights for babies of both Smoker and Non-Smoker
mothers was generated to initially examine the data. A five number summary of data
consists of the minimum, first quartile, median, third quartile, and maximum of the data.
The skewness and kurtosis of birth weights with different smoke statuses were individually
calculated. These calculations were then compared to determine the similarity of these
distributions to both each other and the Normal Distribution. By this analysis, a normal
distribution has a skewness coefficient of approximately 0 and a kurtosis coefficient of
approximately 3. To validate this, the kurtosis and skewness of the birth weights in each
smoke category were compared to their respective expected normal distribution.
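Skewness is defined as:
$$\text{Skewness} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^3$$
Kurtosis is defined as:
$$\text{Kurtosis} = \frac{1}{n} \sum_{i=1}^{n} \left( \frac{X_i - \bar{X}}{s} \right)^4$$
where $n$ is the number of observations, $X_i$ is the $i$-th observation, $\bar{X}$ is the mean, and $s$ is the
standard deviation.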
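As a rough illustration, the two definitions translate directly into Python. The report does not say whether $s$ uses $n$ or $n - 1$ in its denominator, so the sketch exposes that choice as a `ddof` argument (an assumption of this example, as are the function names).

```python
import math


def standardized_moment(values, power, ddof=1):
    """(1/n) * sum(((x - mean) / s) ** power), where s divides by n - ddof."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - ddof))
    return sum(((v - mean) / s) ** power for v in values) / n


def skewness(values, ddof=1):
    """Third standardized moment; near 0 for symmetric data."""
    return standardized_moment(values, 3, ddof)


def kurtosis(values, ddof=1):
    """Fourth standardized moment; about 3 for normally distributed data."""
    return standardized_moment(values, 4, ddof)


sample = [1, 2, 3, 4, 5]  # small symmetric toy sample, not birth weights
print("skewness:", skewness(sample))
print("kurtosis (ddof=0):", kurtosis(sample, ddof=0))
```

For large samples such as the ones in this report, the choice of `ddof` makes very little difference to either statistic.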
Analysis
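Table 1: Summary Statistics of Smoker and Non-Smoker Mothers. The five point summary and the
number of observations in each smoker status group.
| Group          | Min | 1st QRT | Median | Mean     | 3rd QRT | Max | Number of Observations |
|----------------|-----|---------|--------|----------|---------|-----|------------------------|
| Smoker BWT     | 58  | 102     | 115    | 114.1095 | 126     | 163 | 484                    |
| Non-Smoker BWT | 55  | 113     | 123    | 123.0472 | 134     | 176 | 742                    |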
The Smoker and Non-Smoker birthweight distributions have different five-number summary
statistics. It is notable that the mean birthweight of babies from Smoker mothers is smaller
than that of Non-Smoker mothers.
Table 2: Kurtosis and Skewness of Smoker and Non-Smoker Mothers Compared to Expected Normal
Distribution. The kurtosis and skewness of birthweight measurements of each smoke category are
compared to those of the expected normal distribution of their respective sample size.
| Distribution                           | Kurtosis | Skewness   |
|----------------------------------------|----------|------------|
| Smokers BWT                            | 2.975698 | -0.0334909 |
| Non-Smokers BWT                        | 4.026186 | -0.1866062 |
| Expected Normal Distribution Smoke     | 2.965161 | 0.0962498  |
| Expected Normal Distribution Non-Smoke | 2.958478 | 0.1042179  |
From these calculations, the Smoker birthweight distribution has nearly the same kurtosis and
skewness as a Normal distribution. The Non-Smoker birthweight distribution seems to deviate
a bit from the Smoker birthweight distribution and the Normal distribution, with a kurtosis
of 4.03. All distributions are roughly symmetric since their skewness values are close to 0.
Conclusion
From the kurtosis and skewness calculations, it can be seen that the Smoker baby birthweights
are indeed different from the Non-Smoker birthweights. The Smoker birthweight distribution
seems to be more normal than the Non-Smoker birthweights since the Smoker birthweights
have a very similar kurtosis to a random normal distribution. The Non-Smoker birthweight distribution has a more pronounced peak than the Normal Distribution and Smoker
birthweights since its kurtosis is larger. Even though the Non-Smoker birthweight distribution
appears to be different from the Normal Distribution, it is considered weakly normally
distributed due to Law of Large Numbers (LLN) and Central Limit Theorem (CLT) since
there are a sufficiently large number of observations, 742 observations as shown in Section 2.1.
This is not enough to confirm that the Smoking and Non-Smoking distributions are normally
distributed. From the initial look in the five-point summaries of Smoker Birth Weights and
Non-Smoker Birth Weights, there are some slight differences in the summary statistics which
can be indicative of different distributions.
2.3. Graphical Analysis for Birth Weight Distribution of Smoker
vs. Non-Smoker Babies
To confirm that the Smoking and Non-Smoking distributions are normally distributed,
graphical methods must be used to visualize each respective distribution.
Methods
Histograms of both the Smoking and Non-Smoking birthweights were created because
birthweight is a continuous numeric variable. To compare to a normal distribution, an
expected normal curve with the mean and standard deviation of the Smoking and Non-Smoking
birthweights, respectively, was drawn in red over the respective histograms. Q-Q plots are
then used to check whether the Smoking and Non-Smoking birthweights do indeed come from a
Normal Distribution with their mean and sd parameters.
Analysis
[Figure 1 panels: "Histogram of smoke$bwt" and "Histogram of non_smoke$bwt" (x-axes: smoke$bwt and non_smoke$bwt; y-axis: Density), and two "Normal Q-Q Plot" panels (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles).]
Figure 1: Histogram of Birthweights by Smoking Status (top) and Q-Q plot of Birthweights by
Smoking Status (bottom)
The data in both Smoking and Non-Smoking birthweights seem to follow the general shape
of their respective expected random Normal density curve, which is depicted in red. The
dashed blue line represents the mean of each respective birthweight distribution.
In each Q-Q plot, the red line represents the theoretical normal distribution. Because the
data points are aligned on the red line in both Q-Q plots, this shows that the Smoking and
Non-Smoking birthweights are very close to their expected theoretical normal distribution
with some minor deviations at the tails of the distribution. These plots highlight potential
outliers, which are explored further with boxplots (see Figure 3 in the Appendix).
Conclusion
From the histograms and the Q-Q plots, we conclude that the Smoking and Non-Smoking
birthweights are normally distributed. However, they do not share the same normal distribution since the Non-Smoker birthweight distribution is narrower than the Smoker birthweight
distribution. The mean of the Smoker birthweights is smaller than the mean of the
Non-Smoker birthweights.
2.4. Incidence of Low Birth Weight Babies
Methods
We propose to use the number of babies classified as low-birth-weight to estimate the incidence.
The incidence rate of low birth weight is defined as:
$$\text{incidence rate} = \frac{n_{\text{low-birth-weight}}}{n_{\text{total}}}$$
In order to understand how the incidence of low birth weight changes when the threshold of
low birth weight classification is changed, a list of thresholds for the low birth weight standard
is generated to examine the robustness of our estimates. The pattern of the change in
proportion estimates as the threshold changes will be examined in this section through a
scatterplot of low birth weight proportions against possible classification thresholds. The
standard deviation of the proportions is also calculated to assess estimate reliability. We
will consider our estimate reliable if this value does not vary much when
slightly changing the classification standard.
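A small Python sketch of the threshold sweep described in this method. The birth weights here are simulated from a made-up normal distribution rather than read from babies.txt, and `incidence_rate` and the threshold grid are illustrative choices, not the report's code.

```python
import random


def incidence_rate(weights, threshold):
    """Share of birth weights strictly below `threshold` (in ounces)."""
    return sum(w < threshold for w in weights) / len(weights)


# Hypothetical birth weights in ounces, NOT the babies.txt data.
rng = random.Random(3)
weights = [rng.gauss(120.0, 18.0) for _ in range(1000)]

thresholds = [80 + 2 * k for k in range(11)]  # 80, 82, ..., 100 ounces
rates = [incidence_rate(weights, t) for t in thresholds]
for t, r in zip(thresholds, rates):
    print(f"threshold {t} oz: incidence {r:.3f}")
```

Running the same sweep separately for each smoking group, and comparing how quickly the rates change near 88.2 ounces, gives the kind of robustness check the method describes.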
Analysis
Using the provided standard (birth weight less than 88.2 oz), there were 40 out of 484 (8.26%)
low-birth-weight babies from the smoking mother group, and 23 out of 742 (3.10%) from
the non-smoking mother group. Numerically, it is observed that the incidence rate for
low-birth-weight babies is lower in the non-smoking mother group compared to the smoking
mother group.
In the left panel of Figure 2, each point represents the low-birth-weight rate of each group using
a sequence of potential classification thresholds. Since more babies will be classified as
low-weight babies when the threshold is moved up, we expected a monotonically increasing
trend, as the scatterplot shows. We visually observed that in the neighborhood
of the threshold standard that we use (88.2 ounces), no substantial jumps in the incidence
estimate are triggered by slight movements in the threshold.
[Figure 2 panels: "Low Birth Weight Rate vs Standard" (x-axis: Threshold for Low Birth Weight (ounces); y-axis: Proportion of Babies Lower than Threshold) and "SD of Incidence Rate vs Window Size" (x-axis: Window Size; y-axis: Standard Deviation of Incidence Rate). Legend: Smoking, Non-Smoking.]
Figure 2: Proportion of Babies Classified as Low Birth Weight vs Potential Low Birth Weight Baby
Thresholds (left) and Standard Deviation of Incidence Rate vs Window Size (right).
The right panel of Figure 2 illustrates the changes in the standard deviation of the incidence rate when
changing the examining window around the 88.2 ounces standard. We observed that the
estimate is more robust for the non-smoking group, as the rise of standard deviation remains
slow when the window enlarges. Compared to the non-smoking group, the rise in
standard deviation is slightly steeper in the smoker group. However, this standard deviation
does not look substantial even when the window size becomes as large as 20.
Conclusion
Based on the analysis above, we find that the incidence rate for low-birth-weight babies is
higher among the smoking mother group compared to the non-smoking one in our sample
(8.26% vs 3.10%). From our experiments, we conclude that the estimate for the low-birth-weight babies is reliable and robust. The estimate does not vary much when a few more or
fewer babies are classified as low birth weight.
3. Advanced Analysis
(Use methods to answer an additional question not asked. You can also use additional
methods not covered in class. 1-2 pages.)
We have observed from the previous analysis that there are some groupwise differences in
the distributions. We have also suggested that the incidence rate is a reliable estimate. We
would then want to assess if there are any statistically significant differences in the incidence
rate between the smoking mother group and the non-smoking mother group, to analyze if a
mother’s smoking status is associated with babies weighing under 88.2 ounces.
Methods
In assessing the importance of differences between the incidence rates, and further whether smoking
status is associated with the low-birth-weight classification, we propose to use a Yates-Corrected Chi-Squared Test of Independence. The null hypothesis in this test is that there is
no association between mother’s smoking status and baby’s birth weight. The alternative
hypothesis is that there exists an association between smoking status and baby birthweight.
Under a significance level of 0.05, we plan to reject the null hypothesis if our test statistic is
above a critical value of 3.84.
Analysis
From the result of the Chi-Squared Test of Independence, we observed a p-value less than our
level of significance 0.05 (p = 0.00011). The test statistic 14.99 is also larger than our critical
value of 3.84 (see Figure 4 for a visualization with the probability density function (pdf)).
Therefore, we decided to reject the null hypothesis and conclude that there is a statistically
significant association between mother’s smoking status and the incidence of low-birth-weight
babies, under the significance level of 0.05.
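For readers who want to trace the arithmetic, a Yates-corrected statistic for a 2x2 table can be computed by hand as in the sketch below. It uses the counts quoted in Section 2.4 (40 of 484 in the smoking group, 23 of 742 in the non-smoking group). The function name is invented here, and the p-value relies on the identity that a chi-squared variable with 1 degree of freedom is a squared standard normal.

```python
import math


def yates_chi_squared(a, b, c, d):
    """Yates-corrected chi-squared statistic and p-value for [[a, b], [c, d]]."""
    n = a + b + c + d
    # Continuity correction: shrink |ad - bc| by n / 2, but not below zero.
    numerator = n * max(abs(a * d - b * c) - n / 2, 0.0) ** 2
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    stat = numerator / denominator
    # Upper tail of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Rows: smoking, non-smoking. Columns: low birth weight, not low birth weight.
stat, p_value = yates_chi_squared(40, 484 - 40, 23, 742 - 23)
print(f"test statistic: {stat:.2f}, p-value: {p_value:.5f}")
print("reject at the 0.05 level:", stat > 3.84)
```

The result should land close to the reported test statistic of 14.99 and p-value of 0.00011; a statistics package's built-in test would normally be preferred over a hand-rolled version like this.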
Conclusion
Using a chi-squared test of independence (α = 0.05), we conclude that a mother’s smoking
status is associated with the incidence of low-birth-weight babies, defined as weighing less than 88.2
ounces.
4. Discussion and Conclusion
(Summarize your main findings here. Compare and contrast the results from your separate
analyses. Compare your overall findings to findings found in other studies - Does it match
what others have found? If not, why? Are there limitations in the data? 1 page.)
The numerical analysis only confirms that the Smoker birthweights are normally
distributed, although the Non-Smoker birthweights are weakly shown to be normally distributed
as well due to CLT and LLN. The graphical analysis confirms that both the Smoker and
Non-Smoker birthweights are normally distributed. The incidence analysis shows that there
is a higher proportion of low-birth-weight babies in the smoking group. From the experiments
conducted, we also concluded that our estimate of the incidence is reliable and robust. Lastly,
the chi-squared test of independence suggests that there is an association between mother’s
smoking status and the incidence of low-birth-weight babies when using the original threshold
of 88.2 ounces.
Several confounders should be considered in the investigation process. This study must
account for confounders because the data came from a retrospective observational study.
Since the data were not produced by a controlled experiment, we can only infer association
and cannot establish a causal relationship. Another limitation is that the study covered a
group of mothers that is potentially not representative of all mothers. All of these mothers
had single births of male babies who survived for at least 28 days. There may also be
additional confounders that could influence the conclusion. For instance, socio-economic
factors could affect the mother’s health and, by extension, the baby’s health. Gestational
age could also affect birth weight, since birth weight increases with gestational age (3). A
future direction in expanding this analysis is to investigate the effect of smoking on gestational
age to determine whether a smaller gestational age acts as a mediator in the relationship
between smoking status and low birth weight.
Although we cannot establish a causal relationship between smoking during pregnancy and
low birth weights, this report found a strong association between these two variables. (More
writing should tie back to the scientific question and assess whether this difference is
important to the health of the baby; it is abbreviated here.) Furthermore, there are extensive
studies that emphasize birth weight as an indicator of the baby’s health, since low birth
weight babies are more likely to develop complications such as cognitive deficits, motor
delays, cerebral palsy, and psychological problems. In fact, low birth weight babies are 20
times more likely to develop fatal complications and die in comparison to normal birth weight
babies (4). Therefore, although a causal relationship cannot be established, smoking during
pregnancy is something that should not be overlooked.
The Work Cited and Appendix sections do NOT count toward the 10-page limit
5. Work Cited
(Abbreviated. We recommend storing citations with a citation manager such as Mendeley
(https://www.mendeley.com/download-desktop-new/) or Zotero (https://www.zotero.org/)
so you can store your citations and easily make a work cited page. The studies cited can be
found in popular scientific literature sites such as pubmed
(https://www.ncbi.nlm.nih.gov/pmc/). For this report, I used the citation format commonly
found in Nature, but you can choose whichever citation format you would like to use, such as
MLA or ALA.)
1. Benowitz, N. L. Nicotine addiction. N. Engl. J. Med. 362, 2295–2303 (2010).
2. Wickstrom, R. Effects of Nicotine During Pregnancy: Human and Experimental
Evidence. CN 5, 213–222 (2007).
3. Topçu, H. O. et al. Birth weight for gestational age: A reference study in a tertiary
referral hospital in the middle region of Turkey. Journal of the Chinese Medical
Association 77, 578–582 (2014).
4. K. C., A., Basel, P. L. & Singh, S. Low birth weight and its associated risk factors:
Health facility-based case-control study. PLoS ONE 15, e0234907 (2020).
6. Appendix
[Figure 3 image: two panels, "Boxplot of Non-Smoker bwt" and "Boxplot of Smoker bwt"]
Figure 3: Boxplot of birthweights separated by Smoking Status. There are a lot of observations
with low birthweights in the non-smoker data that can potentially skew the analysis.
[Figure 4 image: "Chi-Squared Density Plot df = 1"; x-axis: x; y-axis: dchisq(x, df = 1)]
Figure 4: Density of Chi-Squared Distribution df = 1. The red line represents the test statistic from
the chi-squared test. The p-value is the area under the curve to the right of the red line. The
blue line represents the critical value at the 0.05 significance level. The p-value is very small
because the area under the curve to the right of the test statistic is extremely small.
density  gain (ten readings per density)
0.6860   17.60  17.30  16.90  16.20  18.50  18.70  17.40  18.60  16.80  17.10
0.6040   25.90  26.30  24.80  24.80  27.60  30.50  28.40  27.70  24.80  28.50
0.5080   39.40  37.60  37.70  36.30  38.70  39.40  38.80  40.30  38.10  39.20
0.4120   60.00  58.30  59.60  59.10  55.00  52.90  54.10  56.90  56.00  56.30
0.3180   92.70  90.50  85.80  87.50  88.30  88.20  88.60  84.70  87.00  91.60
0.2230   128.0  130.0  129.0  127.0  129.0  132.0  133.0  133.0  131.0  134.0
0.1480   199.0  204.0  199.0  207.0  200.0  205.0  202.0  199.0  199.0  200.0
0.0800   298.0  297.0  288.0  296.0  293.0  299.0  298.0  293.0  298.0  301.0
0.0010   423.0  421.0  428.0  436.0  427.0  426.0  428.0  429.0  422.0  427.0
[Figure 2 image: left panel "Low Birth Weight Rate vs Standard" (x-axis: Threshold for Low Birth Weight (ounces); y-axis: Proportion of Babies Lower than Threshold); right panel "SD of Incidence Rate vs Window Size" (x-axis: Window Size; y-axis: Standard Deviation of Incidence Rate); legend: Smoking, Non-Smoking]
Figure 2: Proportion of Babies Classified as Low Birth Weight vs Potential Low Birth Weight Baby
Thresholds (left) and Standard Deviation of Incidence Rate vs Window Size (right).
The scatterplot below illustrates the changes of standard deviation in incidence rate when
changing the examining window around the 88.2 ounces standard. We observed that the
[Figure 1 image: top panels "Histogram of smoke$bwt" and "Histogram of non_smoke$bwt" (y-axis: Density); bottom panels "Normal Q-Q Plot" for each group (x-axis: Theoretical Quantiles; y-axis: Sample Quantiles)]
Figure 1: Histogram of Birthweights by Smoking Status (top) and Q-Q plot of Birthweights by
Smoking Status (bottom)