problem_sets/week_05/ps5-worked_solution.Rmd

   1 ---
   2 title: 'Week 5 problem set: Worked solutions'
   3 subtitle: "Statistics and statistical programming  \nNorthwestern University  \nMTS 525"
   4 author: "Aaron Shaw"
   5 date: "May 2, 2019"
   6 output: html_document
   7 ---
   8
   9 ```{r setup, include=FALSE}
  10 knitr::opts_chunk$set(echo = TRUE)
  11 ```
  12
  13 ## PC0
  14
  15 Load the dataset. As usual, you'll need to edit this block of code to point at the appropriate file location to make it work on your machine.
  16 ```{r}
  17 pop <- read.delim(file="~/Documents/Teaching/2019/stats/data/week_05/population.tsv")
  18 ```
  19
  20 ## PC1
  21
  22 ```{r}
  23 mean(pop$x, na.rm=T)
  24
  25 ## Insert code for your group as needed:
  26 w3 <- read.csv(file="~/Documents/Teaching/2019/stats/data/week_03/group_03.csv", row.names=NULL)
  27
  28 mean(w3$x, na.rm=T)
  29 ```
  30 Given the structure of the full dataset, it's also easy to calculate all of the group means:
  31 ```{r}
  32 s.means <- tapply(pop$x, pop$group, mean, na.rm=T)
  33 s.means
  34 ```
  35 We will discuss the relationship of the individual group means to population mean in class. Basically, we can think of each group as a sample, so the sample means are the *sampling distribution* of the population mean.
  36
  37 ## PC2
  38
  39 I'll do this two ways. First, just plugging the values from the group sample into the formula for the standard error, I can then add/subtract twice the standard from the mean to find the 95% CI.
  40 ```{r}
  41 se <- sd(w3$x, na.rm=T) / sqrt(length(w3$x[!is.na(w3$x)]))
  42 mean(w3$x, na.rm=T)-(2*se)
  43 mean(w3$x, na.rm=T)+(2*se)
  44 ```
  45
  46 Now, I'll write a more general function to calculate confidence intervals. Note that I create an 'alpha' argument with a default value of 0.05. I then divide alpha by 2. Can you explain why this division step is necessary?
  47
  48 ```{r}
  49 ci <- function (x, alpha=0.05) {
  50     x <- x[!is.na(x)]
  51     probs <- c(alpha/2, 1-alpha/2)
  52     critical.values <- qnorm(probs, mean=0, sd=1)
  53     se <- sd(x)/sqrt(length(x))
  54     mean(x) + critical.values * se
  55 }
  56
  57 ## Here I run the function on the group 3 data
  58 ci(w3$x)
  59 ```
  60
  61 Again, it's easy to estimate this for every group:
  62
  63 ```{r}
  64 group.confints <- tapply(pop$x, pop$group, ci)
  65 group.confints
  66 ```
  67 ## PC3
  68
  69 We'll discuss this one in class too. Since the samples are (random) samples, we should not be surprised that their individual group means are different from the population mean. We should also not be surprised that the 95% CI for the population mean estimated from at least one of the samples does *not* include the true population mean. Since our confidence interval is 95%, we would expect to be wrong about 1/20 times on average!
  70
  71 ## PC4
  72
  73 Comparing the distributions (first for the single comparison)
  74
  75 ```{r}
  76 hist(pop$x)
  77 hist(w3$x)
  78
  79 summary(pop$x)
  80 sd(pop$x, na.rm=T)
  81
  82 summary(w3$x)
  83 sd(w3$x, na.rm=T)
  84 ```
  85
  86 And here's code to do it for each group. A ggplot2 "faceted" histogram will look great and makes for concise code.
  87
  88 ```{r}
  89
  90 library(ggplot2)
  91 ggplot(pop, aes(x)) + geom_histogram() + facet_wrap(. ~ group)
  92
  93 tapply(pop$x, pop$group, summary)
  94
  95 tapply(pop$x, pop$group, sd, na.rm=T)
  96 ```
  97 They all look a little bit different from each other and from the population distribution. We'll discuss these differences in class. Again, none of this should be shocking given the relationship of the samples to the population.
  98
  99 ## PC5
 100
 101 ```{r}
 102 s.means <- tapply(pop$x, pop$group, mean, na.rm=T)
 103 s.means
 104
 105 sd(s.means)
 106
 107 ## My standard error from one of the groups above:
 108 se
 109 ```
 110 We will discuss the relationship of these values in class. As mentioned earlier, the distribution of sample means drawn from the population is the *sampling distribution*. The standard error of the mean estimated from any of the individual groups/samples should be a good approximation of (but not necessarily equal to!) the standard deviation of the sampling distribution of the means.
 111
 112 ## PC 6
 113
 114 ### (a)
 115 Since there's going to be some randomness in the next few commands, I'll set a seed value to ensure the results are consistent on other laptops.
 116 ```{r}
 117 set.seed(20190501)
 118 pop.unif <- sample(seq(0, 9), 10000, replace=TRUE)
 119 ```
 120
 121 ### (b)
 122
 123 ```{r}
 124 mean(pop.unif)
 125 hist(pop.unif)
 126 ```
 127
 128 ### (c)
 129
 130 ```{r}
 131 hist(
 132   sapply( rep(1, 100), function (x) {
 133     mean(sample(pop.unif, 2))}
 134     )
 135   )
 136
 137 ```
 138
 139 ## (d)
 140 Same commands as (c) just changing the sample size
 141 ```{r}
 142 hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 10))}))
 143
 144 hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 100))}))
 145 ```
 146
 147 ## PC7
 148
 149 We'll discuss this in class. Noteable things you might observe include that the sampling distribution of the means approaches normality as it gets larger in size whether the population we draw from is uniform, log-normal, or really just about any other distribution. This is an illustration of some aspects of the *central limit theorem*. It is also an illustration of the *t-distribution* (the basis for the t-tests that you learned about this week).
 150
 151 # Statistical Questions
 152
 153 ## SQ1 — 5.22
 154
 155 (a) We want to calculate the following:
 156
 157 $$ \overline{x}_{diff} \pm t^*_{df}\frac{s_{diff}}{\sqrt{n}}$$
 158
 159 Plug in the corresponding values and the Z-score for a 95% confidence interval (1.98):
 160
 161 $$-0.545 \pm 1.98 \times \frac{8.887}{\sqrt{200}} $$
 162 I'll let R take it from here:
 163
 164 ```{r}
 165 -0.545 - (1.98 * (8.887/sqrt(200)))
 166 -0.545 + (1.98 * (8.887/sqrt(200)))
 167 ```
 168
 169 (b) With 95% confidence, the average difference between reading and writing test scores falls between -1.79 and 0.70.
 170
 171 (c) No, because the interval includes/crosses zero.
 172
 173 ## SQ2 — 5.28
 174
 175 We want to test the following hypotheses:
 176 $$H_0~:~\mu_{0.99}=\mu_{1}$$
 177 $$H_A~:~\mu_{0.99}\neq\mu_{1}$$
 178 To do so, we can use a two-sample t-test to compare the two sample means. The conditions we'd like to satisfy are independence and normality. Re: independence, the samples are random and not terribly large (presumably less than 10% of all the diamonds of each carat rating on earth), so we should be good to go. Re: normality, visual inspection of the histograms presented in the textbook suggests that what skew may be present in either distribution is not extreme.
 179
 180 Given that the conditions are met, here's how you could construct the test statistic $T$:
 181
 182 $$T~=~\frac{Point~estimate~-~Null}{Standard~error_{difference}}$$
 183
 184 Plugging in formulas from chapter 5 of the textbook this looks like:
 185
 186 $$T~=~ \frac{(\overline{x}_1-\overline{x}_2)-(0)}{\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}}$$
 187 Now, plug in values from the table:
 188
 189 $$T~=~ \frac{(44.51-56.81)-(0)}{\sqrt{\frac{13.32^2}{23}+\frac{16.13^2}{23}}}$$
 190 Work that out and you should have $T~=~-2.82$. The degrees of freedom are estimated by the smaller of $n_1-1$ or $n_2-1$ (which are equal in this case), so $df~=~22$. Consulting the table of T-statistics from the back of the book, we find:
 191 $$p_{value}=P(T_{22} \gt 2.82) = 0.01$$
 192 Assuming we're okay with a fale positive rate of $p\leq0.05$, this provides support for the alternative hypothesis and we can reject the null of no difference between the average standardized prices of 0.99 and 1 carat diamonds.
 193
 194 ## SQ3 — 5.30
 195
 196 I want to calculate the following:
 197
 198 $$(\overline{x_1}-\overline{x_2})~\pm~t^*_{df}\sqrt{\frac{s^2_1}{n_1}+\frac{s^2_2}{n_2}}$$
 199 Since the critical values for the distribution of the test statistic $T$ is available in R using the `qt()` function, I can use a similar procedure here as in my `ci()` function from earlier in the problem set to calculate the value of $t_{df}$ (in this case, $t_{22}$). Note that if you didn't use the `qt()` part of this you could have looked up the critical value for a two-tailed test in the t distribution table at the back of the textbook (or elsewhere).
 200
 201 ```{r}
 202
 203 diff.means <- 44.51-56.81
 204
 205 t.star <- qt(0.025, df=22)
 206
 207 se <- sqrt( ((13.32^2)/23) + ((16.13^2)/23) )
 208
 209 diff.means - (t.star*se)
 210 diff.means + (t.star*se)
 211 ```
 212
 213 In words, I am 95% confident that the average difference between the standardized prices of 0.99 carat diamond and a 1 carat diamond falls between \$3.27 and \$21.33 (the 1 carat diamond costs more).
 214
 215 ## SQ4 — 5.48
 216
 217 (a) Hypotheses:
 218
 219 $H_0:$ The mean hours worked for the groups are all equal.
 220
 221 $$\mu_{\lt~hs} = \mu_{hs} = \mu_{jc} = \mu_{ba} = \mu_{grad} $$
 222 $H_A:$ The mean hours worked vary by education level. In other words, the means are not equal.
 223
 224 (b) Conditions and assumptions necessary for unbiased ANOVA estimates include:
 225
 226 Independent observations, normal(ish) distributions, and constant(ish) variance. The problem doesn't say much about the sample to help evaluate the independence of the observations, but it's definitely less than 10% of the population and is likely a fairly good approximation of a random sample (thereby satisfying the rule of thumb). From the boxplots the distributions all look fairly normal. The standard deviations are also similar. We'll assume that the conditions are met for the purposes of the test.
 227
 228 (c) Working across the blanks in the table by row:
 229
 230 * The degrees of freedom for degree $= 5-1 = 4$
 231 * The Sum of Squares between degree levels $= 501.54 \times 4 = 2006.16$
 232 * The F value $= Sum~Sq~degree / Mean~Sq~residuals = 501.54 / 229.12 = 2.189$
 233 * The degrees of freedom for Residuals $= 1171-4 = 1167$
 234 * Mean Square Residuals (Error) $= 267382/1167 = 229.12$
 235 * Total degrees of freedom $=1172 - 1 = 1171$
 236 * Total Sum of squares $=2006.16+267382 = 269388.16$
 237
 238 (d) According to the ANOVA results, we cannot reject the null hypothesis at a $p \leq 0.05$ level, suggesting that the mean hours worked may be equal across education levels.
 239
 240 ## SQ5 — 5.52
 241
 242 (a) False. The ANOVA procedure does not evaluate the pairwise comparisons, but the overall variation across groups.
 243 (b) True, otherwise the F-value will not be large enough to reject the null hypothesis.
 244 (c) False. It is possible that none of the pairwise comparisons will be significantly different even if the ANOVA rejects the null.
 245 (d) Assuming this question is about the Bonferroni correction, False. The correction does not divide $\alpha$ by the number of groups, but rather the number of pairwise tests. In this case, 4 groups yields ${4}\choose{2} = 6$ pairs, meaning that the corrected value for $\alpha = 0.05/6 = 0.0083$. Other corrections exist even though they were not discussed in the book and they may choose other values for $\alpha$.
 246
 247 ## SQ6 — Reinhart
 248
 249 We'll discuss this one as a group. Personally, I find the focus on p-values somewhat thought-stopping and would prefer to see researchers and publications embrace reporting standards that yield more intuitive, flexible, and meaningful understanding. Confidence intervals seem to do this more effectively than p-values and, in that respect, I like confidence intervals quite a lot!
 250
 251 # Empirical Paper Questions
 252
 253 ## EQ1  — RQ4 questions
 254
 255 (a) For RQ4, the units of analysis are the 109 respondents. The credibility index (a group of five survey items evaluating how credible the respondents perceived the blogs they read to be) was split at the mean and these high/low groups were the independent (grouping) variable in the ANOVA. The six perceived relationship management factors (survey items) were the dependent variables for the ANOVA.
 256
 257 (b) The analysis tests whether the mean responses on the survey items composing each of the different relationship management factors were equal across the high/low blog credibility groups. The alternative hypotheses are whether there are differences between the groups for any of the perceived relationship management dimensions.
 258
 259 (c) None of the ANOVA tests rejected the null hypothesis of no difference. In other words, there was no evidence that perceptions of relationship management dimensions varied across individuals perceiving blogs as low or high credibiliy.
 260
 261 (d) It is (usually) a bit hard to say much from a null result! See the answer to (c) above.
 262
 263 ## EQ5 — RQ5 questions
 264
 265 (a) Again, the units are the 109 respondents and the partitioned (low/high) credibility index serves as the independent (grouping) variable. The crisis index is the dependent variable.
 266
 267 (b) The ANOVA tests whether average assessments of perceived crisis in the organization in question were equal by whether participants perceived the blogs to be low/high credibility. The alternative hypotheses are whether there are differences between the groups for perceptions of the organization being in crisis.
 268
 269 (c) The results of the ANOVA reject the null, suggesting support for the alternative hypothesis that participants reporting low credibility blogs reported different (higher) scores on the crisis index.
 270
 271 (d) I find the reported differences compelling, but would like more information in order to determine more specific takeaways. For example, I would like to see descriptive statistics about the various measures to help evaluate whether they meet the assumptions for identifying the ANOVA. Survey indices like this are a bit slippery insofar as they can seem to yield results when the differences are really artifacts of the measurements and how they are coded. I am also a bit concerned that the questions seemed to ask about blog credibility in general rather than the specific credibility of the specific blogs read by the study participants? The presumed relationship between credibility and the assignment to the blogs in question is not confirmed empirically, meaning that the differences in perceptions of organizational crisis might be more related to baseline attitudes than to anything specific about the treatment conditions in the experiment. I would also like to know more about the conditional means and standard errors in order to evaluate whether the pairwise average perceptions of organizational crisis vary across perceived credibility levels.
 272
 273 ## EQ6 — RQ6 questions
 274
 275 (a) Analogous to RQ5 except that the (six) different dimensions of relationship management separated into high/low categories served as the independent (grouping) variables in the ANOVA. Perceptions of organizational crisis remained the dependent variable.
 276
 277 (b) This set of ANOVAs test whether assessments of perceived organizational crisis were equal or varied depending on the prevalence of specific relationship management strategies.
 278
 279 (c) The results of the ANOVAs are mixed, rejecting the null hypothesis of no difference for two of the relaionship management dimensions (task sharing and responsiveness/customer service), but failing to do so for the other four dimensions. For the two dimensions in question, higher prevalence of each strategy appeared to reduce the perception of crisis.
 280
 281 (d) Again, the differences reported are compelling insofar as they indicate larger varation across the groups in terms of perceived crisis than would be expected in a world where no difference existed. That said, in addition to all the issues mentioned above for EQ5 part (d), this set of tests also raisesw issues of multiple comparisons. Not only is it useful to consider multiple comparisons in the context of a single ANOVA, but when running many ANOVAs across many grouping variables. If the authors really wanted to test whether the specific PR strategies mentioned here altered perceptions of organizational crisis across different types of blogs, they should/could have incorporated those variations into their experimental design much more directly. As such, the results remain suggestive that some of the observed relationships may exist, but can provide only weak support for even those that reject the null hypotheses in statistical tests.