From: aaronshaw <aaron.d.shaw@gmail.com>
Date: Fri, 3 May 2019 17:34:41 +0000 (-0500)
Subject: updates after class
X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/07da118471a31405d2044b68b4eb30334dea9d8b?ds=sidebyside;hp=6b6a21a00c0fc0eff5645b65b080c5cae393fe25

updates after class
---
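Note on the PC2 change in this patch: `length()` counts `NA` entries, so the old denominator overstated the sample size and understated the standard error whenever `w3$x` contains missing values. A minimal sketch of the effect follows. The vector `x` is made-up toy data (not the course dataset) and `se_mean` is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical toy data with two missing values; not the course dataset.
x <- c(2.1, 3.4, NA, 1.8, 2.9, NA, 3.3, 2.5)

# Standard error of the mean, using only the non-missing observations.
# (se_mean is an illustrative helper, not part of the worked solution.)
se_mean <- function(v) {
  v <- v[!is.na(v)]
  sd(v) / sqrt(length(v))
}

se_old <- sd(x, na.rm = TRUE) / sqrt(length(x))  # n includes the NAs
se_new <- se_mean(x)                             # n excludes the NAs

length(x)        # 8: counts the NAs
sum(!is.na(x))   # 6: observations actually used by sd()
se_old < se_new  # TRUE: the old formula understates the SE
```

Because `sd(..., na.rm = TRUE)` already drops the missing values, the numerator is the same in both versions; only the divisor changes, which is why the corrected interval in the rendered HTML is slightly wider.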

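Note on the PC3 and PC5 explanations added in this patch: the two claims (a single-sample standard error approximates the standard deviation of the sampling distribution, and a 95% CI misses the true mean roughly 1 time in 20) can be checked by simulation. This is only a sketch under stated assumptions: it reuses the uniform population construction from PC6, while the seed, the sample size `n`, the number of repetitions `reps`, and all other variable names are arbitrary choices made here, not values from the problem set.

```r
set.seed(2019)  # arbitrary seed, chosen only for reproducibility

# Uniform population built as in PC6; everything below is illustrative.
pop.unif <- sample(seq(0, 9), 10000, replace = TRUE)
mu <- mean(pop.unif)

n <- 100      # size of each sample (assumption)
reps <- 2000  # number of samples drawn (assumption)

# For each sample, keep its mean and its estimated standard error.
draws <- replicate(reps, {
  s <- sample(pop.unif, n)
  c(mean = mean(s), se = sd(s) / sqrt(n))
})

# Standard deviation of the sample means vs. the average single-sample SE.
sd.of.means <- sd(draws["mean", ])
typical.se <- mean(draws["se", ])

# Share of t-based 95% intervals that contain the population mean.
t.star <- qt(0.975, df = n - 1)
covered <- abs(draws["mean", ] - mu) <= t.star * draws["se", ]
coverage <- mean(covered)

c(sd.of.means = sd.of.means, typical.se = typical.se, coverage = coverage)
```

The first two numbers should land close to each other, and the coverage should land near 0.95; the exact values depend on the seed, so none are quoted here.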
diff --git a/problem_sets/week_05/ps5-worked_solution.Rmd b/problem_sets/week_05/ps5-worked_solution.Rmd
index 658fded..33572be 100644
--- a/problem_sets/week_05/ps5-worked_solution.Rmd
+++ b/problem_sets/week_05/ps5-worked_solution.Rmd
@@ -32,13 +32,13 @@ Given the structure of the full dataset, it's also easy to calculate all of the
 s.means <- tapply(pop$x, pop$group, mean, na.rm=T)
 s.means
 ```
-We will discuss the relationship of the individual group means to population mean in class.
+We will discuss the relationship of the individual group means to the population mean in class. Basically, we can think of each group as a sample, so the distribution of the group means approximates the *sampling distribution* of the sample mean.
 
 ## PC2  
 
-I'll do this two ways. First, just plugging the values into the formula for the standard error, I can then add/subtract twice the standard from the mean to find the 95% CI.
+I'll do this two ways. First, just plugging the values from the group sample into the formula for the standard error, I can then add/subtract twice the standard error to/from the mean to find an approximate 95% CI.
 ```{r}
-se <- sd(w3$x, na.rm=T) / sqrt(length(w3$x))
+se <- sd(w3$x, na.rm=T) / sqrt(length(w3$x[!is.na(w3$x)]))
 mean(w3$x, na.rm=T)-(2*se)
 mean(w3$x, na.rm=T)+(2*se)
 ```
@@ -66,7 +66,7 @@ group.confints
 ```
 ## PC3  
 
-We'll discuss this one in class.  
+We'll discuss this one in class too. Since the groups are random samples, we should not be surprised that their individual means differ from the population mean. We should also not be surprised that the 95% CI for the population mean estimated from at least one of the samples does *not* include the true population mean. Since our confidence level is 95%, we would expect an interval to miss the true mean about 1 time in 20 on average!
 
 ## PC4  
 
@@ -94,7 +94,7 @@ tapply(pop$x, pop$group, summary)
 
 tapply(pop$x, pop$group, sd, na.rm=T)
 ```
-They all look a little bit different from each other and from the population distribution. We'll discuss these differences in class.  
+They all look a little bit different from each other and from the population distribution. We'll discuss these differences in class. Again, none of this should be shocking given the relationship of the samples to the population.  
 
 ## PC5  
 
@@ -107,7 +107,7 @@ sd(s.means)
 ## My standard error from one of the groups above:
 se
 ```
-We will discuss the relationship of these values in class.  
+We will discuss the relationship of these values in class. As mentioned earlier, the distribution of sample means drawn from the population is the *sampling distribution*. The standard error of the mean estimated from any of the individual groups/samples should be a good approximation of (but not necessarily equal to!) the standard deviation of the sampling distribution of the means. 
 
 ## PC 6  
 
@@ -146,7 +146,7 @@ hist(sapply(rep(1, 100), function (x) { mean(sample(pop.unif, 100))}))
 
 ## PC7  
 
-We'll discuss this in class.
+We'll discuss this in class. Notable things you might observe include that the sampling distribution of the means approaches normality as it gets larger in size, whether the population we draw from is uniform, log-normal, or really just about any other distribution. This is an illustration of some aspects of the *central limit theorem*. It is also an illustration of the *t-distribution* (the basis for the t-tests that you learned about this week).
 
 # Statistical Questions
 
@@ -260,7 +260,7 @@ We'll discuss this one as a group. Personally, I find the focus on p-values some
 
 (d) It is (usually) a bit hard to say much from a null result! See the answer to (c) above.
 
-### EQ5 – RQ5 questions 
+## EQ5 – RQ5 questions 
 
 (a) Again, the units are the 109 respondents and the partitioned (low/high) credibility index serves as the independent (grouping) variable. The crisis index is the dependent variable.  
 
@@ -270,7 +270,7 @@ We'll discuss this one as a group. Personally, I find the focus on p-values some
 
 (d) I find the reported differences compelling, but would like more information in order to determine more specific takeaways. For example, I would like to see descriptive statistics about the various measures to help evaluate whether they meet the assumptions for identifying the ANOVA. Survey indices like this are a bit slippery insofar as they can seem to yield results when the differences are really artifacts of the measurements and how they are coded. I am also a bit concerned that the questions seemed to ask about blog credibility in general rather than the specific credibility of the specific blogs read by the study participants? The presumed relationship between credibility and the assignment to the blogs in question is not confirmed empirically, meaning that the differences in perceptions of organizational crisis might be more related to baseline attitudes than to anything specific about the treatment conditions in the experiment. I would also like to know more about the conditional means and standard errors in order to evaluate whether the pairwise average perceptions of organizational crisis vary across perceived credibility levels.
 
-### EQ6 – RQ6 questions  
+## EQ6 – RQ6 questions  
 
 (a) Analogous to RQ5 except that the (six) different dimensions of relationship management separated into high/low categories served as the independent (grouping) variables in the ANOVA. Perceptions of organizational crisis remained the dependent variable. 
 
diff --git a/problem_sets/week_05/ps5-worked_solution.html b/problem_sets/week_05/ps5-worked_solution.html
index d4f103d..007b07d 100644
--- a/problem_sets/week_05/ps5-worked_solution.html
+++ b/problem_sets/week_05/ps5-worked_solution.html
@@ -222,16 +222,16 @@ s.means</code></pre>
 ## 2.887230 2.892782 2.376018 2.456387 2.489604 2.572719 2.786722 2.535294 
 ## group_17 group_18 group_19 group_20 
 ## 2.592676 2.354645 3.016203 2.314035</code></pre>
-<p>We will discuss the relationship of the individual group means to population mean in class.</p>
+<p>We will discuss the relationship of the individual group means to the population mean in class. Basically, we can think of each group as a sample, so the distribution of the group means approximates the <em>sampling distribution</em> of the sample mean.</p>
 </div>
 <div id="pc2" class="section level2">
 <h2>PC2</h2>
-<p>I’ll do this two ways. First, just plugging the values into the formula for the standard error, I can then add/subtract twice the standard from the mean to find the 95% CI.</p>
-<pre class="r"><code>se &lt;- sd(w3$x, na.rm=T) / sqrt(length(w3$x))
+<p>I’ll do this two ways. First, just plugging the values from the group sample into the formula for the standard error, I can then add/subtract twice the standard error to/from the mean to find an approximate 95% CI.</p>
+<pre class="r"><code>se &lt;- sd(w3$x, na.rm=T) / sqrt(length(w3$x[!is.na(w3$x)]))
 mean(w3$x, na.rm=T)-(2*se)</code></pre>
-<pre><code>## [1] 2.245946</code></pre>
+<pre><code>## [1] 2.232594</code></pre>
 <pre class="r"><code>mean(w3$x, na.rm=T)+(2*se)</code></pre>
-<pre><code>## [1] 3.273947</code></pre>
+<pre><code>## [1] 3.2873</code></pre>
 <p>Now, I’ll write a more general function to calculate confidence intervals. Note that I create an “alpha” argument with a default value of 0.05. I then divide alpha by 2. Can you explain why this division step is necessary?</p>
 <pre class="r"><code>ci &lt;- function (x, alpha=0.05) {
     x &lt;- x[!is.na(x)]
@@ -309,7 +309,7 @@ group.confints</code></pre>
 </div>
 <div id="pc3" class="section level2">
 <h2>PC3</h2>
-<p>We’ll discuss this one in class.</p>
+<p>We’ll discuss this one in class too. Since the groups are random samples, we should not be surprised that their individual means differ from the population mean. We should also not be surprised that the 95% CI for the population mean estimated from at least one of the samples does <em>not</em> include the true population mean. Since our confidence level is 95%, we would expect an interval to miss the true mean about 1 time in 20 on average!</p>
 </div>
 <div id="pc4" class="section level2">
 <h2>PC4</h2>
@@ -426,7 +426,7 @@ group.confints</code></pre>
 ## 2.717638 2.450142 2.183341 2.059100 2.211206 2.030818 2.210043 2.341134 
 ## group_17 group_18 group_19 group_20 
 ## 2.060548 2.310790 2.289548 1.968713</code></pre>
-<p>They all look a little bit different from each other and from the population distribution. We’ll discuss these differences in class.</p>
+<p>They all look a little bit different from each other and from the population distribution. We’ll discuss these differences in class. Again, none of this should be shocking given the relationship of the samples to the population.</p>
 </div>
 <div id="pc5" class="section level2">
 <h2>PC5</h2>
@@ -442,8 +442,8 @@ s.means</code></pre>
 <pre><code>## [1] 0.2696987</code></pre>
 <pre class="r"><code>## My standard error from one of the groups above:
 se</code></pre>
-<pre><code>## [1] 0.2570002</code></pre>
-<p>We will discuss the relationship of these values in class.</p>
+<pre><code>## [1] 0.2636766</code></pre>
+<p>We will discuss the relationship of these values in class. As mentioned earlier, the distribution of sample means drawn from the population is the <em>sampling distribution</em>. The standard error of the mean estimated from any of the individual groups/samples should be a good approximation of (but not necessarily equal to!) the standard deviation of the sampling distribution of the means.</p>
 </div>
 <div id="pc-6" class="section level2">
 <h2>PC 6</h2>
@@ -480,7 +480,7 @@ pop.unif &lt;- sample(seq(0, 9), 10000, replace=TRUE)</code></pre>
 </div>
 <div id="pc7" class="section level2">
 <h2>PC7</h2>
-<p>We’ll discuss this in class.</p>
+<p>We’ll discuss this in class. Notable things you might observe include that the sampling distribution of the means approaches normality as it gets larger in size, whether the population we draw from is uniform, log-normal, or really just about any other distribution. This is an illustration of some aspects of the <em>central limit theorem</em>. It is also an illustration of the <em>t-distribution</em> (the basis for the t-tests that you learned about this week).</p>
 </div>
 <div id="statistical-questions" class="section level1">
 <h1>Statistical Questions</h1>
@@ -587,8 +587,9 @@ diff.means - (t.star*se)</code></pre>
 <li><p>None of the ANOVA tests rejected the null hypothesis of no difference. In other words, there was no evidence that perceptions of relationship management dimensions varied across individuals perceiving blogs as low or high credibility.</p></li>
 <li><p>It is (usually) a bit hard to say much from a null result! See the answer to (c) above.</p></li>
 </ol>
-<div id="eq5-rq5-questions" class="section level3">
-<h3>EQ5 – RQ5 questions</h3>
+</div>
+<div id="eq5-rq5-questions" class="section level2">
+<h2>EQ5 – RQ5 questions</h2>
 <ol style="list-style-type: lower-alpha">
 <li><p>Again, the units are the 109 respondents and the partitioned (low/high) credibility index serves as the independent (grouping) variable. The crisis index is the dependent variable.</p></li>
 <li><p>The ANOVA tests whether average assessments of perceived crisis in the organization in question were equal by whether participants perceived the blogs to be low/high credibility. The alternative hypotheses are whether there are differences between the groups for perceptions of the organization being in crisis.</p></li>
@@ -596,8 +597,8 @@ diff.means - (t.star*se)</code></pre>
 <li><p>I find the reported differences compelling, but would like more information in order to determine more specific takeaways. For example, I would like to see descriptive statistics about the various measures to help evaluate whether they meet the assumptions for identifying the ANOVA. Survey indices like this are a bit slippery insofar as they can seem to yield results when the differences are really artifacts of the measurements and how they are coded. I am also a bit concerned that the questions seemed to ask about blog credibility in general rather than the specific credibility of the specific blogs read by the study participants? The presumed relationship between credibility and the assignment to the blogs in question is not confirmed empirically, meaning that the differences in perceptions of organizational crisis might be more related to baseline attitudes than to anything specific about the treatment conditions in the experiment. I would also like to know more about the conditional means and standard errors in order to evaluate whether the pairwise average perceptions of organizational crisis vary across perceived credibility levels.</p></li>
 </ol>
 </div>
-<div id="eq6-rq6-questions" class="section level3">
-<h3>EQ6 – RQ6 questions</h3>
+<div id="eq6-rq6-questions" class="section level2">
+<h2>EQ6 – RQ6 questions</h2>
 <ol style="list-style-type: lower-alpha">
 <li><p>Analogous to RQ5 except that the (six) different dimensions of relationship management separated into high/low categories served as the independent (grouping) variables in the ANOVA. Perceptions of organizational crisis remained the dependent variable.</p></li>
 <li><p>This set of ANOVAs test whether assessments of perceived organizational crisis were equal or varied depending on the prevalence of specific relationship management strategies.</p></li>
@@ -606,7 +607,6 @@ diff.means - (t.star*se)</code></pre>
 </ol>
 </div>
 </div>
-</div>