From: aaronshaw Date: Thu, 18 Apr 2019 13:39:45 +0000 (-0500) Subject: w04 materials X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/2a64d6586cf9e84bc59cc01434ea92744dff537e?ds=inline;hp=f21e8d18685e0ed07df48b7eb5050a616ee316dd w04 materials --- diff --git a/r_lectures/w04-R_lecture.Rmd b/r_lectures/w04-R_lecture.Rmd index c7c84ff..a693dd9 100644 --- a/r_lectures/w04-R_lecture.Rmd +++ b/r_lectures/w04-R_lecture.Rmd @@ -1,5 +1,5 @@ --- -title: "Week 3 R lecture" +title: "Week 4 R lecture" subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525" author: "Aaron Shaw" date: "April 18, 2019" @@ -25,35 +25,117 @@ class(a.good.time) a.good.time ``` +## Binomial and factorial functions + +In Chapter 3 (and in last week's problem set), you needed to calculate some binomial choice arithmetic and/or factorials. They weren't absolutely necessary for the problem set, but here are the corresponding functions in R. + +Let's say we want to calculate how many possible pairs you can draw from a population of ten individuals, a.k.a., $10 \choose 2$ or, instead you wanted to calculate $10!$ +```{r} +choose(10,2) + +factorial(10) +``` + ## Distribution functions -distribution functions: lets focus on *unif(): the key is on page 222 of Verzani -The “d” functions return the p.d.f. of the distribution -dunif(x=1, min=0, max=3) # 1/3 of the area is the to the left 1 -The “p” functions return the c.d.f. of the distribution. -dunif(q=2, min=0, max=3) #1/(b-a) is 2/3 -The “q” functions return the quantiles. -qunif(p=0.5, min=0, max=3) # half way between 0 and 3 -The “r” functions return random samples from a distribution. -runif(n=1, min=0, max=3) # a random value in [0,3] +R has a number of built-in functions to help you work with distributions in various ways that also started to come up in *OpenIntro* Chapter 3. I will introduce a couple of points about them here, but I also highly recommend you look at the relevant section of the Verzani *Using R Introductory Statistics* book (pp 222-229) for more on this (and, honestly, for more on most of the topics we're covering in R). + +The key to this is that R has a set of distributions (e.g. uniform, normal, binomial, log-normal, etc.) and a set of functions (`d`, `p`, `q`, and `r`) that can be run for each distribution. I'll use a uniform distribuition (a distribution between any two values (*min*, *max*) where the values may occur with uniform probability) for my example below. Verzani has others for when you need them. + +The `d` function gets you information about the density function of the distribution. The `p` function works with the cumulative probabilities. The `q` function gets you quantiles from the distribution. The `r` function allows you to generate random samples from the distribution. As you can see, the letters corresponding to each function *almost* make sense...<*sigh*>. They also each take specific arguments that can vary a bit depending on which kind of distribution you are using them with (as always, the help pages and the internet can be helpful here). + +Onwards to the example code, which looks at a uniform distribution between 0 and 3: + +```{r} +dunif(x=1, min=0, max=3) # What proportion of the area is the to the left of 1? + +punif(q=1, min=0, max=3) # Same as the prior example in this case. + +qunif(p=0.5, min=0, max=3) # 50th percentile + +runif(n=4, min=0, max=3) # Random values in [0,3] +``` +Look at the Verzani text for additional examples, including several that can solve binomial probability calculations (e.g., if you flip a fair coin 100 times, what are the odds of observing heads 60 or more times?). + +### A quick simulation (using a for-loop!) + +Beyond proving invaluable for rapid calculations of solutions to problem set questions, the distribution functions are very, very useful for running simulations. We won't really spend a lot of time on simulations in class, but I'll give you an example here that can generalize to more complicated problems. I also use a programming technique we haven't talked about yet called a for-loop to help repeat the sampling process multiple times. + +For my simulation let's say that I want to repeatedly draw random samples from a distribution and examine the distribution of the resulting sample means. I'll start by generating a vector of 10,000 random data points drawn from a log-normal distribution where the mean and standard deviation of the log-transformed values are 0 and 1 respectively: + +```{r} +d <- rlnorm(10000, meanlog=0, sdlog=1) + +head(d) +mean(d) +sd(d) +hist(d) +``` + +Okay, now, I want to draw 500 samples of 100 observations from this population and take the mean of each sample. Time to write a function! Notice that I require two inputs into my function: the population data and the sample size. -## Doing simple simulations with random data -runif() -rnorm() +```{r} +sample.mean <- function(pop, n){ + s <- sample(pop, n) + return(mean(s)) +} -## A quick simulation +## Run it once to see how it goes: +sample.mean(d, 100) +``` +Next step: let's run that 500 times. Here's where the for-loop comes in handy. A couple of things about the syntax of for-loops in R: The basic syntax of a for-loop is that you call some operation to occur over some index. Here's a simple example that illustrates how they work. The loop iterates through the integers between 1-10 and prints the square of each value: +```{r} +for(x in c(1:10)){ + print(x^2) +} +``` -In case you don't believe the central limit theorem, let's put together a quick simulation to illustrate it in R. +Since I want to store the output of my sample means loop, I will first create an object `s.means` that is a numeric vector with one value (0) that will be replaced in a moment. +```{r} +s.means <- 0 +``` +Onwards to the loop itself. In the block of code below, you'll notice that I once again declare an index over which to iterate. That's what happens inside that first set of parentheses where I have `i in c(1:30)`. That's telling R to go through the loop for each value from 1:30 and to call the current index value `i` during each loop. Each time through the loop, the value of `i` advances through the index (in this case, it goes up by 1). The result is that each time through it will take the output of my `sample.mean` function and append it as the $i^{th}$ value of `s.means`. The `next` call at the end is optional, but can be important sometimes to help you keep track of what's going on. -Write a function to repeatedly take the mean of a sample. +```{r} +for(i in c(1:500)){ + s.means[i] <- sample.mean(d, 100) + next +} +``` +The `s.means` variable now contains a distribution of sample means! What are the characteristics of the distribution? You already know how to do that. -Experiment by changing the size of the sample +```{r} +summary(s.means) +``` +Let's plot it too: +```{r} +library(ggplot2) +qplot(s.means, geom="density") +``` +That looks pretty "normal." + +Experiment with this example by changing the size of the sample and/or the number of samples we draw. + +Now, think back to the original vector `d.` Can you explain what fundamental statistical principle is illustrated in this example? Why do the values in `s.means` fluctuate so much? What is the relationship of `s.means` to `d`? ## Quantile quantile plots +Last, but not least, you might have admired the quantile-quantile plots presented in some of the examples in *OpenIntro*. The usual idea with "Q-Q- plots" is that you want to see how the observed (empirical) quantiles of some data compare against the theoretical quantiles of a normal distribution. You too can create these plots! -## Binomial and factorial functions -Choose, factorial +Here's an example that visualizes the result of our simulation (labeled "sample") against a normal distribution with the same mean and standard deviation (labeled "theoretical"). Notice that to accommodate ggplot2 I have to turn `s.means` into a data frame first. +```{r} +s.means <- data.frame(s.means) +ggplot(s.means, aes(sample=s.means)) + geom_qq() + geom_qq_line(color="red") + +``` + + +And/or (finally) we could even standardize the values of `s.means` as z-scores using the `scale()` function: + +```{r} +s.z <- data.frame(scale(s.means)); names(s.z) <- "z" +ggplot(s.z, aes(sample=z)) + geom_qq() + geom_qq_line(color="red") +``` diff --git a/r_lectures/w04-R_lecture.html b/r_lectures/w04-R_lecture.html new file mode 100644 index 0000000..65bc60f --- /dev/null +++ b/r_lectures/w04-R_lecture.html @@ -0,0 +1,339 @@ + + + + + + + + + + + + + + + +Week 4 R lecture + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
+ + + + + + + +

This week, we’ll focus on one more way to manage date-time objects and some ways to generate distributions.

+
+

as.Date

+

First, something I meant to include in last week’s materials. The as.Date() function provides an alternative to as.POSIX() that is far more memorable and readable, but far less precise. Note that it truncates the time of day and the timezone from the ouput

+
m <- "2019-02-21 04:35:00"
+class(m)
+
## [1] "character"
+
a.good.time <- as.Date(m, format="%Y-%m-%d %H:%M:%S", tz="CDT")
+class(a.good.time)
+
## [1] "Date"
+
a.good.time
+
## [1] "2019-02-21"
+
+
+

Binomial and factorial functions

+

In Chapter 3 (and in last week’s problem set), you needed to calculate some binomial choice arithmetic and/or factorials. They weren’t absolutely necessary for the problem set, but here are the corresponding functions in R.

+

Let’s say we want to calculate how many possible pairs you can draw from a population of ten individuals, a.k.a., \(10 \choose 2\) or, instead you wanted to calculate \(10!\)

+
choose(10,2)
+
## [1] 45
+
factorial(10)
+
## [1] 3628800
+
+
+

Distribution functions

+

R has a number of built-in functions to help you work with distributions in various ways that also started to come up in OpenIntro Chapter 3. I will introduce a couple of points about them here, but I also highly recommend you look at the relevant section of the Verzani Using R Introductory Statistics book (pp 222-229) for more on this (and, honestly, for more on most of the topics we’re covering in R).

+

The key to this is that R has a set of distributions (e.g. uniform, normal, binomial, log-normal, etc.) and a set of functions (d, p, q, and r) that can be run for each distribution. I’ll use a uniform distribuition (a distribution between any two values (min, max) where the values may occur with uniform probability) for my example below. Verzani has others for when you need them.

+

The d function gets you information about the density function of the distribution. The p function works with the cumulative probabilities. The q function gets you quantiles from the distribution. The r function allows you to generate random samples from the distribution. As you can see, the letters corresponding to each function almost make sense…<sigh>. They also each take specific arguments that can vary a bit depending on which kind of distribution you are using them with (as always, the help pages and the internet can be helpful here).

+

Onwards to the example code, which looks at a uniform distribution between 0 and 3:

+
dunif(x=1, min=0, max=3) # What proportion of the area is the to the left of 1?
+
## [1] 0.3333333
+
punif(q=1, min=0, max=3) # Same as the prior example in this case.
+
## [1] 0.3333333
+
qunif(p=0.5, min=0, max=3) # 50th percentile
+
## [1] 1.5
+
runif(n=4, min=0, max=3) # Random values in [0,3]
+
## [1] 2.3559855 2.1470659 0.1627627 1.0328450
+

Look at the Verzani text for additional examples, including several that can solve binomial probability calculations (e.g., if you flip a fair coin 100 times, what are the odds of observing heads 60 or more times?).

+
+

A quick simulation (using a for-loop!)

+

Beyond proving invaluable for rapid calculations of solutions to problem set questions, the distribution functions are very, very useful for running simulations. We won’t really spend a lot of time on simulations in class, but I’ll give you an example here that can generalize to more complicated problems. I also use a programming technique we haven’t talked about yet called a for-loop to help repeat the sampling process multiple times.

+

For my simulation let’s say that I want to repeatedly draw random samples from a distribution and examine the distribution of the resulting sample means. I’ll start by generating a vector of 10,000 random data points drawn from a log-normal distribution where the mean and standard deviation of the log-transformed values are 0 and 1 respectively:

+
d <- rlnorm(10000, meanlog=0, sdlog=1)
+
+head(d)
+
## [1] 0.6233752 0.2127751 0.2600851 0.3407706 0.5642591 0.6269948
+
mean(d)
+
## [1] 1.621906
+
sd(d)
+
## [1] 2.074704
+
hist(d)
+

+

Okay, now, I want to draw 500 samples of 100 observations from this population and take the mean of each sample. Time to write a function! Notice that I require two inputs into my function: the population data and the sample size.

+
sample.mean <- function(pop, n){
+  s <- sample(pop, n)
+  return(mean(s))
+}
+
+## Run it once to see how it goes:
+sample.mean(d, 100)
+
## [1] 1.799963
+

Next step: let’s run that 500 times. Here’s where the for-loop comes in handy. A couple of things about the syntax of for-loops in R: The basic syntax of a for-loop is that you call some operation to occur over some index. Here’s a simple example that illustrates how they work. The loop iterates through the integers between 1-10 and prints the square of each value:

+
for(x in c(1:10)){
+  print(x^2)
+}
+
## [1] 1
+## [1] 4
+## [1] 9
+## [1] 16
+## [1] 25
+## [1] 36
+## [1] 49
+## [1] 64
+## [1] 81
+## [1] 100
+

Since I want to store the output of my sample means loop, I will first create an object s.means that is a numeric vector with one value (0) that will be replaced in a moment.

+
s.means <- 0
+

Onwards to the loop itself. In the block of code below, you’ll notice that I once again declare an index over which to iterate. That’s what happens inside that first set of parentheses where I have i in c(1:30). That’s telling R to go through the loop for each value from 1:30 and to call the current index value i during each loop. Each time through the loop, the value of i advances through the index (in this case, it goes up by 1). The result is that each time through it will take the output of my sample.mean function and append it as the \(i^{th}\) value of s.means. The next call at the end is optional, but can be important sometimes to help you keep track of what’s going on.

+
for(i in c(1:500)){
+  s.means[i] <- sample.mean(d, 100)
+  next
+}
+

The s.means variable now contains a distribution of sample means! What are the characteristics of the distribution? You already know how to do that.

+
summary(s.means)
+
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
+##   1.190   1.477   1.607   1.626   1.765   2.268
+

Let’s plot it too:

+
library(ggplot2)
+qplot(s.means, geom="density")
+

+

That looks pretty “normal.”

+

Experiment with this example by changing the size of the sample and/or the number of samples we draw.

+

Now, think back to the original vector d. Can you explain what fundamental statistical principle is illustrated in this example? Why do the values in s.means fluctuate so much? What is the relationship of s.means to d?

+
+
+
+

Quantile quantile plots

+

Last, but not least, you might have admired the quantile-quantile plots presented in some of the examples in OpenIntro. The usual idea with “Q-Q- plots” is that you want to see how the observed (empirical) quantiles of some data compare against the theoretical quantiles of a normal distribution. You too can create these plots!

+

Here’s an example that visualizes the result of our simulation (labeled “sample”) against a normal distribution with the same mean and standard deviation (labeled “theoretical”). Notice that to accommodate ggplot2 I have to turn s.means into a data frame first.

+
s.means <- data.frame(s.means)
+ggplot(s.means, aes(sample=s.means)) + geom_qq() + geom_qq_line(color="red")
+

+

And/or (finally) we could even standardize the values of s.means as z-scores using the scale() function:

+
s.z <- data.frame(scale(s.means)); names(s.z) <- "z"
+ggplot(s.z, aes(sample=z)) + geom_qq() + geom_qq_line(color="red")
+

+
+ + + + +
+ + + + + + + +