This week, we'll focus on one more way to manage date-time objects and some ways to generate distributions.
First, something I meant to include in last week's materials. The as.Date() function provides an alternative to as.POSIXct() that is far more memorable and readable, but far less precise. Note that it truncates the time of day and the timezone from the output:
m <- "2019-02-21 04:35:00"
class(m)
## [1] "character"
a.good.time <- as.Date(m, format="%Y-%m-%d %H:%M:%S", tz="CDT")
class(a.good.time)
## [1] "Date"
a.good.time
## [1] "2019-02-21"
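The format argument is the part worth learning: as.Date() can parse almost any layout if you describe it with the matching format codes. A quick sketch with a made-up date string written in day/month/year order (the string below is just an illustration, not from the lecture):

```r
# Parse a date written day/month/year; "%d/%m/%Y" describes that layout.
as.Date("21/02/2019", format = "%d/%m/%Y")
## [1] "2019-02-21"
```

See ?strptime for the full list of format codes.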
In Chapter 3 (and in last week's problem set), you needed to calculate some binomial coefficients and/or factorials. They weren't absolutely necessary for the problem set, but here are the corresponding functions in R.
Let's say we want to calculate how many possible pairs you can draw from a population of ten individuals, a.k.a. \(10 \choose 2\), or instead you want to calculate \(10!\):
-choose(10,2)
## [1] 45
factorial(10)
## [1] 3628800
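If you want to convince yourself these two functions agree, choose() is just the standard binomial coefficient formula \(\frac{n!}{k!(n-k)!}\) written out:

```r
# choose(n, k) equals n! / (k! * (n - k)!); verify for n = 10, k = 2.
factorial(10) / (factorial(2) * factorial(8))
## [1] 45
```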
R has a number of built-in functions to help you work with distributions in various ways that also started to come up in OpenIntro Chapter 3. I will introduce a couple of points about them here, but I also highly recommend you look at the relevant section of the Verzani Using R Introductory Statistics book (pp 222-229) for more on this (and, honestly, for more on most of the topics we're covering in R).
The key to this is that R has a set of distributions (e.g., uniform, normal, binomial, log-normal) and a set of functions (d, p, q, and r) that can be run for each distribution. I'll use a uniform distribution (a distribution between any two values, min and max, where every value occurs with equal probability) for my example below. Verzani has others for when you need them.
The d function gets you information about the density function of the distribution. The p function works with the cumulative probabilities. The q function gets you quantiles from the distribution. The r function allows you to generate random samples from the distribution. As you can see, the letters corresponding to each function almost make sense…<sigh>. They also each take specific arguments that can vary a bit depending on which kind of distribution you are using them with (as always, the help pages and the internet can be helpful here).
Onwards to the example code, which looks at a uniform distribution between 0 and 3:
dunif(x=1, min=0, max=3) # Density (the height of the curve) at x=1
## [1] 0.3333333
punif(q=1, min=0, max=3) # Proportion of the area to the left of 1
## [1] 0.3333333
qunif(p=0.5, min=0, max=3) # 50th percentile
## [1] 1.5
runif(n=4, min=0, max=3) # Random values in [0,3]
## [1] 2.3559855 2.1470659 0.1627627 1.0328450
Look at the Verzani text for additional examples, including several that can solve binomial probability calculations (e.g., if you flip a fair coin 100 times, what are the odds of observing heads 60 or more times?).
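To sketch how that coin-flip question would look using the binomial analogues of the functions above: pbinom() gives cumulative probabilities, so the probability of 60 or more heads is the upper tail starting just above 59.

```r
# P(X >= 60) for X ~ Binomial(100, 0.5): upper tail of the binomial CDF.
pbinom(59, size = 100, prob = 0.5, lower.tail = FALSE)  # about 0.028
```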
Beyond proving invaluable for rapid calculations of solutions to problem set questions, the distribution functions are very useful for running simulations. We won't spend a lot of time on simulations in class, but I'll give you an example here that can generalize to more complicated problems. I also use a programming technique we haven't talked about yet called a for-loop to help repeat the sampling process multiple times.
For my simulation, let's say that I want to repeatedly draw random samples from a distribution and examine the distribution of the resulting sample means. I'll start by generating a vector of 10,000 random data points drawn from a log-normal distribution where the mean and standard deviation of the log-transformed values are 0 and 1 respectively:
-d <- rlnorm(10000, meanlog=0, sdlog=1)

head(d)
## [1] 0.6233752 0.2127751 0.2600851 0.3407706 0.5642591 0.6269948
mean(d)
## [1] 1.621906
sd(d)
## [1] 2.074704
hist(d)
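One quick way to see what "log-normal" means here (a side check, not part of the original lecture): taking the log of d should recover an approximately normal distribution with mean near 0 and standard deviation near 1.

```r
d <- rlnorm(10000, meanlog = 0, sdlog = 1)  # same call as above
hist(log(d))    # roughly bell-shaped and centered near 0
mean(log(d))    # should be close to meanlog = 0
sd(log(d))      # should be close to sdlog = 1
```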
Okay. Now I want to draw 500 samples of 100 observations from this population and take the mean of each sample. Time to write a function! Notice that my function requires two inputs: the population data and the sample size.
-sample.mean <- function(pop, n){
  s <- sample(pop, n)
  return(mean(s))
}

## Run it once to see how it goes:
sample.mean(d, 100)
## [1] 1.799963
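Because sample() draws at random, running sample.mean() again will give a different answer every time. If you need reproducible draws (for a problem set write-up, say), set the random seed first; the seed value 42 below is arbitrary:

```r
set.seed(42)     # any fixed seed works; 42 is arbitrary
sample(1:10, 3)
set.seed(42)     # resetting the seed reproduces the same draw
sample(1:10, 3)
```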
Next step: let's run that 500 times. Here's where the for-loop comes in handy. The basic syntax of a for-loop in R is that you call some operation to occur over some index. Here's a simple example that illustrates how they work. The loop iterates through the integers between 1 and 10 and prints the square of each value:
-for(x in c(1:10)){
  print(x^2)
}
## [1] 1
## [1] 4
## [1] 9
## [1] 16
## [1] 25
## [1] 36
## [1] 49
## [1] 64
## [1] 81
## [1] 100
Since I want to store the output of my sample means loop, I will first create an object s.means that is a numeric vector with one value (0) that will be replaced in a moment. (Preallocating the full length with numeric(500) would be more efficient, but growing the vector works fine at this scale.)
s.means <- 0
Onwards to the loop itself. In the block of code below, you'll notice that I once again declare an index over which to iterate. That's what happens inside that first set of parentheses where I have i in c(1:500). That's telling R to go through the loop for each value from 1 to 500 and to call the current index value i during each loop. Each time through the loop, the value of i advances through the index (in this case, it goes up by 1). The result is that each time through it will take the output of my sample.mean function and store it as the \(i^{th}\) value of s.means. The next call at the end is optional and actually redundant here: it just tells R to skip ahead to the next iteration, which would happen anyway once the loop body ends.
for(i in c(1:500)){
  s.means[i] <- sample.mean(d, 100)
  next
}
The s.means variable now contains a distribution of sample means! What are the characteristics of the distribution? You already know how to find out.
summary(s.means)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.190   1.477   1.607   1.626   1.765   2.268
Let's plot it too:
library(ggplot2)
qplot(s.means, geom="density")
That looks pretty "normal."
Experiment with this example by changing the size of the sample and/or the number of samples we draw.
Now, think back to the original vector d. Can you explain what fundamental statistical principle is illustrated in this example? Why do the values in s.means fluctuate so much less than the values in d? What is the relationship of s.means to d?
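If you want to check your answer to that last question numerically (this is a hint, so skip it if you'd rather reason it out first): the spread of the sample means should track the standard deviation of d divided by the square root of the sample size.

```r
d <- rlnorm(10000, meanlog = 0, sdlog = 1)       # population, as above
s.means <- replicate(500, mean(sample(d, 100)))  # 500 means of n = 100
sd(s.means)        # empirical spread of the sample means
sd(d) / sqrt(100)  # theoretical standard error; should be similar
```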
Last, but not least, you might have admired the quantile-quantile plots presented in some of the examples in OpenIntro. The usual idea with "Q-Q plots" is that you want to see how the observed (empirical) quantiles of some data compare against the theoretical quantiles of a normal distribution. You too can create these plots!
Here's an example that visualizes the result of our simulation (labeled "sample") against a normal distribution with the same mean and standard deviation (labeled "theoretical"). Notice that to accommodate ggplot2 I have to turn s.means into a data frame first.
s.means <- data.frame(s.means)
ggplot(s.means, aes(sample=s.means)) + geom_qq() + geom_qq_line(color="red")
Finally, we could even standardize the values of s.means as z-scores using the scale() function:
s.z <- data.frame(scale(s.means)); names(s.z) <- "z"
ggplot(s.z, aes(sample=z)) + geom_qq() + geom_qq_line(color="red")
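Base R can draw the same kind of plot without ggplot2, in case you run into it elsewhere: qqnorm() plots the empirical quantiles against theoretical normal quantiles, and qqline() adds the reference line. A sketch with stand-in data (substitute your own vector of sample means for x):

```r
x <- rnorm(500)         # stand-in data; use your s.means vector instead
qqnorm(x)               # empirical vs. theoretical normal quantiles
qqline(x, col = "red")  # reference line through the quartiles
```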