From 6daad5c6d32c7e66729a5444709e1dd4de82b1e0 Mon Sep 17 00:00:00 2001 From: aaronshaw Date: Thu, 16 May 2019 08:57:45 -0500 Subject: [PATCH 1/1] adding wk7 solutions; tweak to wk 6 --- problem_sets/week_06/ps6-worked-solution.html | 4 +- problem_sets/week_07/ps7-worked-solutions.Rmd | 230 ++++++++ .../week_07/ps7-worked-solutions.html | 508 ++++++++++++++++++ 3 files changed, 740 insertions(+), 2 deletions(-) create mode 100644 problem_sets/week_07/ps7-worked-solutions.Rmd create mode 100644 problem_sets/week_07/ps7-worked-solutions.html diff --git a/problem_sets/week_06/ps6-worked-solution.html b/problem_sets/week_06/ps6-worked-solution.html index a0c9e48..d182029 100644 --- a/problem_sets/week_06/ps6-worked-solution.html +++ b/problem_sets/week_06/ps6-worked-solution.html @@ -324,11 +324,11 @@ library(ggridges)
ridge_plot <- ggplot(data=df, aes(x=weeks_alive, y = dose)) + geom_density_ridges(jittered_points = T, fill = 'orange')
 ridge_plot
## Picking joint bandwidth of 10.5
-

+

# add a fancy minimalist theme to make it prettier:
 ridge_plot + theme_minimal()
## Picking joint bandwidth of 10.5
-

+

A two-sample t-test assumes independence and normality. An ANOVA assumes independence, normality, and equal variance. It’s a bit tough to tell, but the assumption of equal variance seems reasonable overall. Normality is a harder sell, both within groups and overall. Nevertheless, most analysts would march ahead with the analysis despite these violations of the assumptions. We can discuss how you might think and talk about this in class.
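Before marching ahead, it can’t hurt to check those assumptions formally. Here is a sketch, assuming the df, weeks_alive, and dose objects from the surrounding chunks (and that dose can be treated as a grouping factor):

bartlett.test(weeks_alive ~ factor(dose), data = df)  # tests equal variance across dose groups

shapiro.test(df$weeks_alive)  # tests normality of the outcome overall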

The global mean is

mean(df$weeks_alive)
diff --git a/problem_sets/week_07/ps7-worked-solutions.Rmd b/problem_sets/week_07/ps7-worked-solutions.Rmd new file mode 100644 index 0000000..f7fb58d --- /dev/null +++ b/problem_sets/week_07/ps7-worked-solutions.Rmd @@ -0,0 +1,230 @@ +--- +title: "Week 7 problem set: Worked solutions" +author: "Jeremy Foote & Aaron Shaw" +date: "May 16, 2019" +output: html_document +--- + +```{r setup, include=FALSE} +knitr::opts_chunk$set(echo = TRUE) +``` +## Programming Challenges + +### PC1-3 + +We'll import the .dta file first using an appropriate command from the `readstata13` package. Also, after looking through the dataverse files, it turns out there is a version of the data which is a TSV file, and can be imported with `read_delim()` + +```{r} +library(readstata13) + +df <- read.dta13('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta') + +## Same result +## df.t <- read.delim('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.tab') + +head(df) +summary(df) +``` +There are a few strange things about the dataset. One is the `neob`, which the codebook says means "not equal to obama"; in other words, it's the converse of the `obama` column. The `treat_year` column is a unique index of the `obama` column and the `year` column. See the codebook for more information. Happily, we only need to use the first two columns for now. + +### PC4 + +The `table` function is a great way to create contingency tables. We'll recode the variables as logical TRUE/FALSE values first. + +```{r} + +# Change both measures into T/F +df$obama = as.logical(df$obama) +df$fruit = as.logical(df$fruit) + +# create the table with nice labels +obama.tbl <- table(fruit=df$fruit, flotus=df$obama) +obama.tbl + +``` + +### PC5 + +The simplest way to determine if the groups are independent is a $\chi^2$ test. Since it's a 2x2 comparison, we could also test for a difference in proportions using the `prop.test()` function. + +```{r} + +chisq.test(obama.tbl) + +prop.test(obama.tbl) + +``` +Notice that both functions report identical $\chi^2$ test results and p-values. + +There are many ways you could answer the question about why these results are different from the regression results presented in the paper. We'll discuss some of them in class this week and some more next week as part of our discussion of multiple regression. + +### PC6 + +First we want to get the proportion and standard error for fruit in each group. These can be calculated individually or using a function (guess which one we'll document here). Also, note that I'm just going to use the `complete.cases()` function to eliminate the missing items for the sake of simplicity. 
+
+```{r}
+df <- df[complete.cases(df),]
+
+prop.se = function(values){# Takes in a vector of T/F values
+  N = length(values)
+  prop = mean(as.numeric(values))
+  se = sqrt(prop * (1-prop)/N)
+  return(c(prop, se))
+}
+
+prop.se(df$fruit[df$obama])
+prop.se(df$fruit[!df$obama])
+
+```
+In order to graph that it will help to convert the results into a data frame:
+```{r}
+library(ggplot2)
+
+prop.and.se <- data.frame(rbind(
+  prop.se(df$fruit[df$obama]),
+  prop.se(df$fruit[!df$obama])
+))
+names(prop.and.se) <- c("proportion", "se")
+prop.and.se$obama <- c(TRUE, FALSE)
+
+ggplot(prop.and.se, aes(x=obama,y=proportion)) + 
+  geom_point(aes(color=obama), size=5) +  # Add the points for the proportions
+  geom_errorbar(aes(ymin=proportion - 1.96 * se, # Add error bars
+                    ymax=proportion + 1.96 * se,
+                    width=0, # Remove the whiskers
+                    color=obama),size=1.1) + # Make them a little bigger and color them
+  coord_flip() + # Flip the chart
+  theme_light() + # Change the theme (theme_minimal is also nice)
+  scale_color_manual(values = c('gray','black'), guide=F) + # Change the colors
+  ylim(0,.5) + # Change the y axis to go from 0 to .5
+  ylab('Proportion choosing fruit') + # Add labels
+  xlab('Picture shown was Michelle Obama')
+
+```
+
+Another way to do that involves a slightly different version of the function we created above, together with the `group_by` and `summarize` functions from the `dplyr` library.
+```{r}
+
+prop.se = function(values){# Takes in a vector of T/F values
+  N = length(values)
+  prop = mean(values)
+  se = sqrt(prop * (1-prop)/N)
+  return(se)
+}
+
+library(dplyr)
+
+prop.and.se = df %>% filter(!is.na(fruit)) %>%
+  group_by(obama) %>%
+  summarize(
+    proportion=mean(fruit),
+    se = prop.se(fruit)
+  )
+
+## Same exact plotting code
+ggplot(prop.and.se, aes(x=obama,y=proportion)) + 
+  geom_point(aes(color=obama), size=5) + 
+  geom_errorbar(aes(ymin=proportion - 1.96 * se,
+                    ymax=proportion + 1.96 * se,
+                    width=0,
+                    color=obama),size=1.1) + 
+  coord_flip() + 
+  theme_light() +
+  scale_color_manual(values = c('gray','black'), guide=F) +
+  ylim(0,.5) + 
+  ylab('Proportion choosing fruit') +
+  xlab('Picture shown was Michelle Obama')
+
+```
+
+### PC7
+
+Here's one way to export our table, using `write.csv`:
+```{r}
+write.csv(obama.tbl, file = 'crosstabs.csv')
+
+```
+We can make sure it worked:
+```{r}
+read.csv('crosstabs.csv')
+```
+We lost some information, because the CSV output doesn't preserve the table's dimension labels. Another way to do this would be to change the table into a data frame first, like this:
+```{r}
+as.data.frame(obama.tbl)
+```
+and then save that data frame.
+```{r}
+write.csv(as.data.frame(obama.tbl), file = 'crosstabs.csv')
+read.csv('crosstabs.csv')
+```
+You could also use the `xtable` package to do this. The package has many functions to customize table outputs, but a relatively simple way to generate an html table looks like this:
+```{r}
+library(xtable)
+print(xtable(obama.tbl), type="html")
+```
+There is a lot of documentation and examples online to help you customize as you see fit.
+
+## Statistical Questions
+
+### SQ1—7.6
+a) Husband and wife ages are positively, linearly correlated, with a few possible outliers.
+b) Husband and wife heights also appear to be positively and linearly correlated, but only very weakly.
+c) The age plot shows a much more evident linear trend. The data points also align more tightly.
+d) No. Correlation is not affected by rescaling. Revisit the formulas for calculating correlation for more depth on this. A quick demo follows below.
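+
+The claim in (d) is easy to verify directly. Here is a quick demo with simulated heights (the numbers are made up for illustration, not from the problem set): converting inches to centimeters is a linear rescaling, so the correlation is unchanged.
+
+```{r}
+set.seed(42)
+husband_in <- rnorm(20, mean = 70, sd = 3)  # simulated heights in inches
+wife_in <- 0.9 * husband_in + rnorm(20)     # correlated partner heights
+cor(husband_in, wife_in)
+cor(husband_in * 2.54, wife_in * 2.54)      # identical correlation after converting to centimeters
+```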
+
+### SQ2—7.30
+a) $\widehat{heart~weight} = -0.357 + 4.034 \times body\_wt$
+b) If a cat had a body weight of 0, we would expect it to have a heart weight of -0.357 grams, which is obviously nonsensical. The intercept just serves to anchor the regression line and often has no substantive meaning in the real world.
+c) For every additional kilogram in body weight, heart weight is expected to increase by 4.034 grams on average.
+d) About 64% of the variance in heart weights is explained by body weight.
+e) $\sqrt{0.6466} = .8041$
+
+### SQ3—7.40
+a) $\bar{y} = b_0 + b_1 \bar{x}$, so $3.9983 = 4.010 + b_1 \times -0.0883$. Solving for $b_1$ in R:
+```{r}
+b_1 = (3.9983-4.010) / -0.0883
+b_1
+```
+b) The null hypothesis is $H_0: \beta_1 = 0$. With a t-score of 4.13 and a p-value of \~0.00, there is strong evidence of a positive relationship between beauty and teaching scores.
+c) Checking the model conditions:
+  1. Nearly normal residuals: The distribution of residuals looks a little bit skewed but fairly normal.
+  2. Homoscedasticity of residuals: The residuals have approximately equal variance across the range of beauty scores, centered around zero.
+  3. Independent observations/residuals: The question does not really shed much light on this. Looking at the original paper, it is actually a bit unclear how the instructors were chosen and whether or not they are a random sample of all instructors at UT Austin.
+  4. Linear relationship between independent/dependent variables: From the scatterplot, the relationship appears to be linear (or at least not obviously non-linear).
+
+### SQ4—7.42
+a) Substituting into the regression equation:
+$\hat{y} = 3.91 + .78 \times 28$
+```{r}
+
+y_hat = 3.91 + .78 * 28
+
+y_hat
+
+```
+b) If our null hypothesis is that $\beta_{gestational\_age} = 0$, we calculate the t-score as
+$$T = \frac{0.78 - 0}{.35} = 2.23$$
+We can look up a p-value for this in a t-table for $df = 23$ or we can figure it out in R. The `pt(q,df)` function gives the proportion of the t-distribution which is less than `q` with `df` degrees of freedom.
+
+```{r}
+(1 - pt(q = 2.23, df = 23)) * 2 # To get 2-tailed p-value, we multiply this by 2
+
+pt(q = 2.23, df = 23, lower.tail = FALSE) * 2 # Alternatively, we can get the area of the upper tail and multiply that by 2 (multiplying by 2 works because t-distributions are symmetrical)
+```
+The p-value is less than 0.05, so with a traditional $\alpha = 0.05$ we would reject the null hypothesis and conclude that there is substantial evidence of an association between gestational age and head circumference for low birth-weight babies.
+
+## Empirical Questions
+
+### EQ1.
+Final score had a strong positive correlation with "Modded", "Starting score", and "Karma", and a strong negative correlation with "Anonymous user". Other correlations were fairly weak.
+
+### EQ2.
+The authors have presented a histogram of the dependent variable, and while it's not quite normal, it does not seem so skewed that we would be too worried about fitting a linear model. It might be helpful to see more information about the linearity of the relationships between the independent variables and the outcome. In addition, the fact that this data was collected over time suggests that there may be temporal dependencies that undermine the assumptions of a linear model. It might also be helpful to see a residual plot and a QQ-plot to check for homogeneity of variance and normal distribution of errors.
+
+### EQ3.
+The coefficients in the table represent the expected amount by which the final score would change for a one-unit change in a given measure (in this sense, multiple regression is the same as regression with a single predictor). For example, focusing on the first row of the table, for every 1-point increase in starting score, we would expect the final score to increase by 1.08 on average.
+
+The t-statistic is (coefficient - 0) / standard error, and the p-value is the proportion of the t-distribution which is as extreme or more extreme than that t-statistic. For the first coefficient, the extremely high t-statistic and extremely low p-value indicate that the probability of observing a relationship this strong under the null hypothesis of no association between starting score and final score is very, very small. This indicates that we can reject the null hypothesis.
+
+The $R^2$ value is the amount of variation in final scores which is explained by these measures. In this case, the $R^2$ value of 0.52 indicates that the variables in the model explain a substantial amount of the variation in the final scores, but that other explanatory factors likely exist that are not captured by the model.
diff --git a/problem_sets/week_07/ps7-worked-solutions.html b/problem_sets/week_07/ps7-worked-solutions.html
new file mode 100644
index 0000000..6beaeae
--- /dev/null
+++ b/problem_sets/week_07/ps7-worked-solutions.html
@@ -0,0 +1,508 @@
+[HTML head, scripts, and styles omitted in this extract]
+Week 7 problem set: Worked solutions
+

Programming Challenges

+
+

PC1-3

+

We’ll import the .dta file using an appropriate command from the readstata13 package. Also, after looking through the Dataverse files, it turns out there is a tab-separated (TSV) version of the data as well, which can be imported with read.delim()

+
library(readstata13)
+
+df <- read.dta13('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.dta')
+
+## Same result
+## df.t <- read.delim('~/Documents/Teaching/2019/stats/data/week_07/Halloween2012-2014-2015_PLOS.tab')
+
+head(df)
+
##   obama fruit year age male neob treat_year
+## 1     0     0 2014   6    0    1          4
+## 2     0     1 2014   5    0    1          4
+## 3     0     0 2014   9    1    1          4
+## 4     0     0 2014   5    1    1          4
+## 5     0     0 2014   7    0    1          4
+## 6     0     0 2014   9    0    1          4
+
summary(df)
+
##      obama            fruit             year           age       
+##  Min.   :0.0000   Min.   :0.0000   Min.   :2012   Min.   : 2.00  
+##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:2014   1st Qu.: 6.00  
+##  Median :0.0000   Median :0.0000   Median :2015   Median : 8.00  
+##  Mean   :0.3639   Mean   :0.2512   Mean   :2014   Mean   : 8.52  
+##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:2015   3rd Qu.:11.00  
+##  Max.   :1.0000   Max.   :1.0000   Max.   :2015   Max.   :19.00  
+##                   NA's   :1                                      
+##       male             neob          treat_year   
+##  Min.   :0.0000   Min.   :0.0000   Min.   :1.000  
+##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:3.000  
+##  Median :1.0000   Median :1.0000   Median :5.000  
+##  Mean   :0.5262   Mean   :0.6361   Mean   :4.406  
+##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:6.000  
+##  Max.   :1.0000   Max.   :1.0000   Max.   :6.000  
+##  NA's   :1
+

There are a few strange things about the dataset. One is the neob column, which the codebook says means “not equal to obama”; in other words, it’s the converse of the obama column. The treat_year column is a unique index of the obama column and the year column. See the codebook for more information. Happily, we only need to use the first two columns for now.
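
Both of those claims are easy to check directly. A quick sketch, using the raw 0/1 codings (before the recoding below):

stopifnot(all(df$neob == 1 - df$obama))  # neob is the converse of obama

unique(df[, c('treat_year', 'obama', 'year')])  # treat_year indexes the obama x year combinations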

+
+
+

PC4

+

The table function is a great way to create contingency tables. We’ll recode the variables as logical TRUE/FALSE values first.

+
# Change both measures into T/F
+df$obama = as.logical(df$obama)
+df$fruit = as.logical(df$fruit)
+
+# create the table with nice labels
+obama.tbl <- table(fruit=df$fruit, flotus=df$obama)
+obama.tbl
+
##        flotus
+## fruit   FALSE TRUE
+##   FALSE   593  322
+##   TRUE    185  122
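
If you also want row and column totals alongside the counts, base R’s addmargins() function will append them:

addmargins(obama.tbl)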
+
+
+

PC5

+

The simplest way to determine if the groups are independent is a \(\chi^2\) test. Since it’s a 2x2 comparison, we could also test for a difference in proportions using the prop.test() function.

+
chisq.test(obama.tbl)
+
## 
+##  Pearson's Chi-squared test with Yates' continuity correction
+## 
+## data:  obama.tbl
+## X-squared = 1.8637, df = 1, p-value = 0.1722
+
prop.test(obama.tbl)
+
## 
+##  2-sample test for equality of proportions with continuity
+##  correction
+## 
+## data:  obama.tbl
+## X-squared = 1.8637, df = 1, p-value = 0.1722
+## alternative hypothesis: two.sided
+## 95 percent confidence interval:
+##  -0.01957437  0.11053751
+## sample estimates:
+##    prop 1    prop 2 
+## 0.6480874 0.6026059
+

Notice that both functions report identical \(\chi^2\) test results and p-values.
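
To demystify what both functions are doing, here is a sketch of Pearson’s \(\chi^2\) computed by hand from the contingency table (without the Yates continuity correction, so the statistic comes out slightly larger than the corrected value reported above):

observed <- obama.tbl
expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)  # expected counts under independence
X2 <- sum((observed - expected)^2 / expected)  # Pearson's chi-squared statistic
pchisq(X2, df = 1, lower.tail = FALSE)  # p-value from the chi-squared distribution with 1 df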

+

There are many ways you could answer the question about why these results are different from the regression results presented in the paper. We’ll discuss some of them in class this week and some more next week as part of our discussion of multiple regression.

+
+
+

PC6

+

First we want to get the proportion and standard error for fruit in each group. These can be calculated individually or using a function (guess which one we’ll document here). Also, note that I’m just going to use the complete.cases() function to eliminate the rows with missing items for the sake of simplicity.

+
df <- df[complete.cases(df),]
+
+prop.se = function(values){# Takes in a vector of T/F values
+  N = length(values)
+  prop = mean(as.numeric(values))
+  se = sqrt(prop * (1-prop)/N)
+  return(c(prop, se))
+}
+
+prop.se(df$fruit[df$obama])
+
## [1] 0.27477477 0.02118524
+
prop.se(df$fruit[!df$obama])
+
## [1] 0.23680824 0.01525123
+

In order to graph that it will help to convert the results into a data frame:

+
library(ggplot2)
+
## Registered S3 methods overwritten by 'ggplot2':
+##   method         from 
+##   [.quosures     rlang
+##   c.quosures     rlang
+##   print.quosures rlang
+
prop.and.se <- data.frame(rbind(
+  prop.se(df$fruit[df$obama]),
+  prop.se(df$fruit[!df$obama])
+))
+names(prop.and.se) <- c("proportion", "se")
+prop.and.se$obama <- c(TRUE, FALSE)
+
+ggplot(prop.and.se, aes(x=obama,y=proportion)) + 
+  geom_point(aes(color=obama), size=5) +  # Add the points for the proportions
+  geom_errorbar(aes(ymin=proportion - 1.96 * se, # Add error bars
+                    ymax=proportion + 1.96 * se,
+                    width=0, # Remove the whiskers
+                    color=obama),size=1.1) + # Make them a little bigger and color them
+  coord_flip() + # Flip the chart
+  theme_light() + # Change the theme (theme_minimal is also nice)
+  scale_color_manual(values = c('gray','black'), guide=F) + # Change the colors
+  ylim(0,.5) + # Change the y axis to go from 0 to .5
+  ylab('Proportion choosing fruit') + # Add labels
+  xlab('Picture shown was Michelle Obama')
+

+

Another way to do that involves a slightly different version of the function we created above, together with the group_by and summarize functions from the dplyr library.

+
prop.se = function(values){# Takes in a vector of T/F values
+  N = length(values)
+  prop = mean(values)
+  se = sqrt(prop * (1-prop)/N)
+  return(se)
+}
+
+library(dplyr)
+
## 
+## Attaching package: 'dplyr'
+
## The following objects are masked from 'package:stats':
+## 
+##     filter, lag
+
## The following objects are masked from 'package:base':
+## 
+##     intersect, setdiff, setequal, union
+
prop.and.se = df %>% filter(!is.na(fruit)) %>%
+  group_by(obama) %>%
+  summarize(
+    proportion=mean(fruit),
+    se = prop.se(fruit)
+  )
+
+## Same exact plotting code
+ggplot(prop.and.se, aes(x=obama,y=proportion)) + 
+  geom_point(aes(color=obama), size=5) + 
+  geom_errorbar(aes(ymin=proportion - 1.96 * se,
+                    ymax=proportion + 1.96 * se,
+                    width=0,
+                    color=obama),size=1.1) + 
+  coord_flip() + 
+  theme_light() +
+  scale_color_manual(values = c('gray','black'), guide=F) +
+  ylim(0,.5) + 
+  ylab('Proportion choosing fruit') +
+  xlab('Picture shown was Michelle Obama')
+

+
+
+

PC7

+

Here’s one way to export our table, using write.csv:

+
write.csv(obama.tbl, file = 'crosstabs.csv')
+

We can make sure it worked:

+
read.csv('crosstabs.csv')
+
##       X FALSE. TRUE.
+## 1 FALSE    593   322
+## 2  TRUE    185   122
+

We lost some information, because the CSV output doesn’t preserve the table’s dimension labels. Another way to do this would be to change the table into a data frame first, like this:

+
as.data.frame(obama.tbl)
+
##   fruit flotus Freq
+## 1 FALSE  FALSE  593
+## 2  TRUE  FALSE  185
+## 3 FALSE   TRUE  322
+## 4  TRUE   TRUE  122
+

and then save that data frame.

+
write.csv(as.data.frame(obama.tbl), file = 'crosstabs.csv')
+read.csv('crosstabs.csv')
+
##   X fruit flotus Freq
+## 1 1 FALSE  FALSE  593
+## 2 2  TRUE  FALSE  185
+## 3 3 FALSE   TRUE  322
+## 4 4  TRUE   TRUE  122
+

You could also use the xtable package to do this. The package has many functions to customize table outputs, but a relatively simple way to generate an html table looks like this:

+
library(xtable)
+print(xtable(obama.tbl), type="html")
+
## <!-- html table generated in R 3.6.0 by xtable 1.8-4 package -->
+## <!-- Thu May 16 08:40:44 2019 -->
+## <table border=1>
+## <tr> <th>  </th> <th> FALSE </th> <th> TRUE </th>  </tr>
+##   <tr> <td align="right"> FALSE </td> <td align="right"> 593 </td> <td align="right"> 322 </td> </tr>
+##   <tr> <td align="right"> TRUE </td> <td align="right"> 185 </td> <td align="right"> 122 </td> </tr>
+##    </table>
+

There is a lot of documentation and examples online to help you customize as you see fit.
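
If you want that html in a standalone file rather than printed at the console, print.xtable() can write to disk directly (the filename here is our own choice):

print(xtable(obama.tbl), type="html", file="crosstabs.html")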

+
+
+
+

Statistical Questions

+
+

SQ1—7.6

+
a) Husband and wife ages are positively, linearly correlated, with a few possible outliers.
b) Husband and wife heights also appear to be positively and linearly correlated, but only very weakly.
c) The age plot shows a much more evident linear trend. The data points also align more tightly.
d) No. Correlation is not affected by rescaling. Revisit the formulas for calculating correlation for more depth on this.
+
+
+

SQ2—7.30

+
a) \(\widehat{heart~weight} = -0.357 + 4.034 \times body\_wt\)
b) If a cat had a body weight of 0, we would expect it to have a heart weight of -0.357 grams, which is obviously nonsensical. The intercept just serves to anchor the regression line and often has no substantive meaning in the real world.
c) For every additional kilogram in body weight, heart weight is expected to increase by 4.034 grams on average. (A quick check appears below.)
d) About 64% of the variance in heart weights is explained by body weight.
e) \(\sqrt{0.6466} = .8041\)
+
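As the quick check promised in (c): plugging a hypothetical 3 kg cat into the fitted equation (our own example, not part of the problem):

-0.357 + 4.034 * 3  # predicted heart weight for a 3 kg cat: 11.745 grams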
+
+

SQ3—7.40

+
a) \(\bar{y} = b_0 + b_1 \bar{x}\), so \(3.9983 = 4.010 + b_1 \times -0.0883\). Solving for \(b_1\) in R:
+
b_1 = (3.9983-4.010) / -0.0883
+b_1
+
## [1] 0.1325028
+
b) The null hypothesis is \(H_0: \beta_1 = 0\). With a t-score of 4.13 and a p-value of ~0.00, there is strong evidence of a positive relationship between beauty and teaching scores.
c) Checking the model conditions:
  1. Nearly normal residuals: The distribution of residuals looks a little bit skewed but fairly normal.
  2. Homoscedasticity of residuals: The residuals have approximately equal variance across the range of beauty scores, centered around zero.
  3. Independent observations/residuals: The question does not really shed much light on this. Looking at the original paper, it is actually a bit unclear how the instructors were chosen and whether or not they are a random sample of all instructors at UT Austin.
  4. Linear relationship between independent/dependent variables: From the scatterplot, the relationship appears to be linear (or at least not obviously non-linear).
+
+
+

SQ4—7.42

+
a) Substituting into the regression equation: \(\hat{y} = 3.91 + .78 \times 28\)
+
y_hat = 3.91 + .78 * 28
+
+y_hat
+
## [1] 25.75
+
b) If our null hypothesis is that \(\beta_{gestational\_age} = 0\), we calculate the t-score as \[T = \frac{0.78 - 0}{.35} = 2.23\] We can look up a p-value for this in a t-table for \(df = 23\) or we can figure it out in R. The pt(q,df) function gives the proportion of the t-distribution which is less than q with df degrees of freedom.
+
(1 - pt(q = 2.23, df = 23)) * 2 # To get 2-tailed p-value, we multiply this by 2
+
## [1] 0.03579437
+
pt(q = 2.23, df = 23, lower.tail = FALSE) * 2 # Alternatively, we can get the area of the upper tail and multiply that by 2 (multiplying by 2 works because t-distributions are symmetrical)
+
## [1] 0.03579437
+

The p-value is less than 0.05, so with a traditional \(\alpha = 0.05\) we would reject the null hypothesis and conclude that there is substantial evidence of an association between gestational age and head circumference for low birth-weight babies.
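
Another way to reach the same conclusion is to build a 95% confidence interval for the slope from the estimate, its standard error, and the critical t-value with 23 degrees of freedom; the resulting interval excludes zero:

0.78 + c(-1, 1) * qt(0.975, df = 23) * 0.35  # 95% CI for the slope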

+
+
+
+

Empirical Questions

+
+

EQ1.

+

Final score had a strong positive correlation with “Modded”, “Starting score”, and “Karma”, and a strong negative correlation with “Anonymous user”. Other correlations were fairly weak.

+
+
+

EQ2.

+

The authors have presented a histogram of the dependent variable, and while it’s not quite normal, it does not seem so skewed that we would be too worried about fitting a linear model. It might be helpful to see more information about the linearity of the relationships between the independent variables and the outcome. In addition, the fact that this data was collected over time suggests that there may be temporal dependencies that undermine the assumptions of a linear model. It might also be helpful to see a residual plot and a QQ-plot to check for homogeneity of variance and normal distribution of errors.
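
For reference, if we had the authors’ fitted model object in R (call it m; hypothetical here, since we only have their published tables), those diagnostics are one line each:

plot(m, which = 1)  # residuals vs. fitted values: checks homogeneity of variance

plot(m, which = 2)  # normal Q-Q plot: checks normality of the residuals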

+
+
+
+

EQ3.

+

The coefficients in the table represent the expected amount by which the final score would change for a one-unit change in a given measure (in this sense, multiple regression is the same as regression with a single predictor). For example, focusing on the first row of the table, for every 1-point increase in starting score, we would expect the final score to increase by 1.08 on average.

+

The t-statistic is (coefficient - 0) / standard error, and the p-value is the proportion of the t-distribution which is as extreme or more extreme than that t-statistic. For the first coefficient, the extremely high t-statistic and extremely low p-value indicate that the probability of observing a relationship this strong under the null hypothesis of no association between starting score and final score is very, very small. This indicates that we can reject the null hypothesis.

+

The \(R^2\) value is the amount of variation in final scores which is explained by these measures. In this case, the \(R^2\) value of 0.52 indicates that the variables in the model explain a substantial amount of the variation in the final scores, but that other explanatory factors likely exist that are not captured by the model.
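
To make the table’s columns concrete, here is a minimal sketch with simulated data (variable names and coefficients are ours, not the authors’) showing where each reported quantity lives in R’s lm() output:

set.seed(1)
d <- data.frame(start = rnorm(100))
d$final <- 1.08 * d$start + rnorm(100)  # simulate a final ~ starting score relationship
m <- lm(final ~ start, data = d)
summary(m)$coefficients  # Estimate, Std. Error, t value, Pr(>|t|)
summary(m)$r.squared  # proportion of variance explained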

+
--
2.39.5