From: aaronshaw
Date: Fri, 29 Mar 2019 17:05:24 +0000 (-0500)
Subject: debugging symlinks...
X-Git-Url: https://code.communitydata.science/stats_class_2019.git/commitdiff_plain/3f4231cfc3c84843a9398a66945d29ffb160d451?ds=sidebyside

debugging symlinks...
---

diff --git a/r_lectures/w01-R_lecture.Rmd b/r_lectures/w01-R_lecture.Rmd
new file mode 100644
index 0000000..0c47151
--- /dev/null
+++ b/r_lectures/w01-R_lecture.Rmd
@@ -0,0 +1,321 @@
+---
+title: "Week 1 R Lecture"
+subtitle: "Statistics and Statistical Programming \nNorthwestern University \nMTS 525"
+author: "Aaron Shaw"
+date: "April 1, 2019"
+output: html_document
+urlcolor: blue
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+
+# Screencast #1
+## Very brief introduction to R (and R Studio and R Markdown)
+
+If this is your first foray into R (or just your latest attempt), welcome! I hope this introduction helps you get started.
+
+This file accompanies a screencast. The idea is that you can watch the screencast with RStudio and the file open on your own machine. You can read the compiled file as an HTML document and interact with it more directly by opening the accompanying .Rmd (RMarkdown) file in R Studio.
+
+We'll begin with a few basics. What is R? What is R Studio? What is the R Console? What is R Markdown? After that (in the second screencast), we'll move on to performing some basic operations that you'll need to understand to actually use R to perform statistical programming.
+
+## What is R?
+
+R is a free software environment and programming language for statistical computing. At its core, it's a very flexible system that you can use to conduct any kind of statistical computing you (or anyone else) can imagine. People may say "R" to refer to the language and the software environment interchangeably. You can read more about R on the [R project home page](https://www.r-project.org/about.html).
+
+## What is R Studio?
+
+[R Studio](https://www.rstudio.com) is an "integrated development environment" (IDE) that you can use with R. In other words, it's an application built to make it relatively easy to conduct statistical analysis, manage datasets, generate plots, and generally interact with R in a whole variety of ways.
+
+R Studio has a bunch of options that you can use to adjust the look, feel, and organization of the interface. You can find them under the 'Tools' menu and 'Global Options'.
+
+R Studio also has a number of very, very helpful keyboard shortcuts. Personally, I love keyboard shortcuts and I find that they vastly improve my experience using R Studio. If you want to learn the keyboard shortcuts, print out a copy of a cheatsheet [like this](https://github.com/rstudio/cheatsheets/raw/master/rstudio-ide.pdf) and make sure it's handy any time you even think about using R Studio. You'll improve quickly.
+
+The most important keyboard shortcut when you're using R Studio is probably 'CTRL-Enter.' It lets you send a command from the scripting window to the R console.
+
+## What is the R Console?
+
+If you're reading this in RStudio, you should see a window nearby labeled "Console." This window allows you to enter direct commands to R. You can type these commands after the little sideways caret symbol ('>') or send them to the console from a script or notebook file like this one. I will demonstrate how I do both of these things in the screencast.
+
+The important thing to know about the R Console is that when you type anything (a 'statement') after the sideways caret and press 'Enter' R will try to evaluate the statement and do whatever it says. If the console cannot evaluate the statement successfully, it will generate an error.
+
+## What is R Markdown?
+
+This file (sample_notebook.Rmd) is an R Markdown document. Markdown is a simple formatting syntax for authoring documents that can compile in many formats, including HTML, PDF, and MS Word documents.
R Markdown is an implementation of Markdown specially created to work with R. For more details on using R Markdown see .
+
+Think of RMarkdown notebooks as files where you can write, execute, and compile a combination of text and R code that can then be "knitted" together. When you click the **Knit** button a document will be generated that includes both the text content as well as the output of any embedded R code "chunks" within the document. You can embed an R code chunk like this:
+
+```{r cars}
+summary(cars)
+```
+
+That chunk calls a built-in function 'summary' to provide information about a built-in dataset called 'cars'. R has many built-in functions and a few built-in datasets. We'll come back to them later. For now, the point is that you can see how RMarkdown integrates text and code.
+
+Embedding a new code chunk is easy. There is an 'Insert chunk' option in the 'Code' menu as well as an 'Insert' dropdown at the top of the .Rmd window. You can also use the CTRL-ALT-i keyboard shortcut (recommended!).
+
+RMarkdown also lets you format text in a variety of ways including *italics* and **bold**.[^1] While it is not required for my course, I strongly recommend that you do your problem sets using R Markdown. The results will be clean, easy-to-read HTML files that can integrate R code, analysis output, and graphics.
+
+[^1]: It even does footnotes!
+
+### Including Plots in R Markdown
+
+You can also embed plots, for example:
+
+
+```{r pressure, echo=FALSE}
+plot(pressure)
+```
+
+Note that the `echo = FALSE` parameter was added to the code chunk to prevent printing of the R code that generated the plot. It is sometimes helpful to run code chunks without printing them.
+
+## Some other basics of R Studio
+
+R Studio is being very actively developed and has many features that I don't know much/anything about. You can learn a **lot** more on the R Studio site and online (more on that in a moment).
For now, I want to make sure you know how to do a few other things that will make it possible to complete your assignments for my class.
+
+### Setting preferences and options
+
+The appearance and some features of R Studio can be customized "globally" (across all projects) through the 'Global Options' item in the 'Tools' menu. For example, I prefer a darker editor theme that feels more relaxing on my eyes.
+
+### Working with projects
+
+An R Studio 'project' is a bundle of data, code, figures, output, and more that you want to keep bundled together. A project might contain multiple data files or notebooks. It also might contain other material such as a README file (documentation), supplementary materials, or a finished paper. For the purposes of class, you should treat each problem set as a project. **I ask you to submit each problem set as an entire (compressed) project directory.** For the purposes of the rest of your life, what counts as a project is really up to you.
+
+R Studio projects are saved as '.Rproj' files accompanied by whatever else the project may entail. You can open them with R Studio and/or create new ones from the 'File' menu (select 'New Project'). Note that R Studio can have multiple scripts open, but seems to only be able to have one project open at a time.
+
+### Creating and saving a new R Markdown script
+
+Creating new R Markdown scripts is also very straightforward in R Studio. From the 'File' menu, select 'New File' and 'R Markdown'. This will let you define some key attributes of the new file and automatically populate the .Rmd with some basic information.
+
+### Getting help
+
+There are many, many ways to get help figuring out how to do things in R, R Studio, and R Markdown. I'll talk more about getting help with R functions in the second Screencast, but for now you should make sure you also have some idea of where/how to look things up when you have questions about R Studio or R Markdown.
For example, try out the 'Help' menu items and identify some of the cheatsheets (like the one I mentioned earlier) that you think you might want to have around while you're learning to use these tools. The R Studio website links to several other resources and tutorials that you might find useful. [StackOverflow](https://www.stackoverflow.com) also has extensive Q&A activity for questions about R, R Studio, Markdown, and related topics.
+
+
+
+\newpage
+
+# Screencast #2
+## Basics of R
+
+This second screencast focuses on building basic skills with R. It can/will be far more interactive. The rest of the R Markdown script is intentionally short and is basically just an outline of the topics that will be covered. Please run these commands and experiment with R yourself in parallel as you watch/listen.
+
+## Using R as a calculator
+
+R is a very fast calculator. You can enter simple arithmetic operations (addition, subtraction, multiplication, division, exponentiation) directly into the console or via your scripts, e.g.:
+
+
+```{r}
+2 + 2
+6/3
+10^5
+```
+
+Try entering some others at the console yourself!
+
+## Variables
+
+In R, you can use variables to do many things. The basic idea is that a variable allows you to 'assign' a value or set of values to a name. You indicate assignment by typing `<-` (keyboard shortcut: 'Alt--') or `=`. Here's an example:
+
+```{r}
+x <- 2
+x
+```
+
+In the first line, I assigned a value of '2' to be called 'x'. In the second line, I just type 'x', which tells R to print the value for x. Surprise, surprise, it prints '2'. (More on why it also prints `[1]` in a moment...)
+
+Try this out yourself at the R console. Then try assigning another value to 'x' and ask R to print x again.
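
The exercise just described might look something like the sketch below. It assumes you start from the `x <- 2` assignment shown earlier; the name `y` is a hypothetical variable added only for illustration and is not used anywhere else in the lecture.

```{r}
x <- 2     # the original assignment from above
x <- 10    # assigning again replaces the old value
x          # now prints 10

y = x + 1  # `=` assigns too; `y` is a hypothetical name for illustration
y          # prints 11
```

Both assignment operators behave the same way at the console here; the lecture's own examples use `<-` throughout.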
+
+For the most part, you can assign any value or set of values to any variable name and you can then use the variable name instead of the value(s):
+
+```{r}
+cups.of.coffee <- 3
+cups.of.coffee + 1
+cups.of.coffee*3
+```
+
+Some variable names and words are 'special', however, in that R has pre-assigned values to them or pre-assigned functions. We will encounter many of these. For one example of a pre-assigned variable, try typing `pi` at the console and press 'Enter'.
+
+One other special value a variable may take is `NA` (no quotes!) which means it is missing. If a value is missing, you may not be able to do mathematical operations with it:
+
+```{r}
+cups.of.coffee <- NA
+cups.of.coffee-1
+```
+
+## Types (also known as classes)
+
+Every variable has a 'type' or 'class'. For example, we've already created a few variables which are 'numeric'. These can be whole integers or have decimals. If you ever want to know what a variable's type is, you can ask R to tell you using the `class()` function like this:
+
+```{r}
+class(x)
+```
+
+We'll come back to functions in a moment. In the meantime, other important types of variables are 'character' and 'logical':
+
+```{r}
+my.name <- "Aaron"
+class(my.name)
+my.answer <- TRUE ## Note the capitalization!
+class(my.answer)
+```
+
+It is often important to know what class a variable is because R lets you perform some operations on certain kinds of variables, but not on others.
+
+## Functions
+
+In R, you use functions to do just about everything (e.g., inquire about the class or type of a variable as we did above). Every function takes some input (called an argument) usually in parentheses and provides some output (sometimes called the return value). Some functions take multiple inputs and return multiple outputs. You can also write your own functions and edit existing functions. This is part of what makes R so powerful and flexible.
+
+Arguably the most important function is `help()`.
The help function will retrieve the documentation for any function. To learn more about help, try entering `help(help)` at the console.
+
+Another useful function allows you to delete a variable: `rm()` or `remove()`. Try creating a variable and removing it.
+
+There are many built-in functions. Some are common mathematical operations like `sqrt()`, `log()`, or `log1p()`. Others help you manage your workspace like `ls()`.
+
+Check your reference card for many, many more examples.
+
+## Vectors
+
+You can think of a vector as a set of things that are all the same type. In R, all variables are vectors even though they may have just one thing in them! That's why the R Console prints out `[1]` next to the value of a variable with just one value:
+
+```{r}
+my.name
+```
+
+You can make vectors with a special function `c()`:
+
+```{r}
+ages <- c(36, 50, 38)
+ages
+
+```
+Vectors can be of any type but they can have only one type:
+```{r}
+class(ages)
+painters <- c("frida", "diego", "daniel")
+class(painters)
+```
+
+If you mix vectors of different types together, they will be "coerced" to a single type. The results can be surprising (and sometimes annoying).
+
+```{r}
+class(c(ages, painters)) ## Notice that you can "nest" functions within each other!
+```
+### Indexing
+
+You can index the elements in a vector using square brackets and a number like this:
+
+```{r}
+painters[2]
+```
+You can also use indexing to refer to multiple elements in a vector:
+```{r}
+painters[2:3] ## A sequence of the second and third elements
+```
+You can even assign new values to an item (or add items) in a vector using indexing:
+
+```{r}
+ages[2] <- 52
+ages
+```
+
+### Recycling
+
+Mathematical operations are "recycled" when applied to a vector:
+
+```{r}
+ages*2
+ages/2
+```
+
+### Naming items
+
+You can apply a name to any item in a vector:
+```{r}
+names(ages)
+names(ages) <- c("Wilma", "Fred", "Barney")
+names(ages)
+```
+Now you can index into 'ages' using the name of each item:
+```{r}
+ages["Barney"]
+```
+
+### Working with vectors with multiple elements
+
+Some functions are very handy for working with vectors that have multiple elements:
+
+```{r}
+length(ages)
+sum(ages)
+mean(ages)
+sd(ages) ## Standard deviation. More on that later.
+sort(ages)
+range(ages)
+summary(ages)
+table(ages)
+```
+
+You can also construct new vectors by performing logical comparisons on an existing vector:
+
+```{r}
+ages < 39
+ages != 38
+
+painters == "Diego"
+painters == "diego"
+painters != "frida"
+```
+
+This is very useful for indexing and recoding a variable. In this case I'll use the built-in variable 'rivers', which contains the lengths in miles of 141 major North American rivers (type `help(rivers)` to learn more):
+
+```{r}
+rivers
+head(rivers) ## 'head()' shows you the first six values of a vector
+rivers < 300 ## Recycles the comparison and returns TRUE or FALSE for each river
+rivers[rivers < 300] ## A subset of the data
+
+little.rivers <- rivers[rivers < 300]
+big.rivers <- rivers; big.rivers[big.rivers < 300] <- NA ## Two commands, one line. Recodes the short rivers as 'Missing'
+```
+
+## Basic plotting and visualizations
+
+Visualizations can help you explore data and interpret results. Use them often!
+
+```{r}
+table(rivers>300)
+hist(rivers)
+boxplot(rivers)
+```
+
+
+## Packages
+
+By default, R has many built-in functions and example datasets. However, many people have extended R by creating additional functions. Often these additional functions are collected together and distributed as "packages" or "libraries" that may also include additional datasets. RStudio gives you a couple of ways to work with these. The traditional method is via the following commands (note that the use of 'eval=FALSE' in the .Rmd file means that R will not execute the code; I've done that because this generates a bunch of output we don't need and you only need to install each package once anyway):
+
+```{r eval=FALSE}
+
+install.packages("UsingR") ## note the quotation marks. This package accompanies the Verzani book.
+install.packages("openintro") ## This package goes along with our textbook.
+
+## Then you can load the package this way:
+library(UsingR) ## No quotes!
+library(openintro)
+```
+Run these commands on your system. Use the 'Packages' tab to explore the documentation of the functions and datasets available through the `openintro` package.
+
+## Loading datasets
+
+Often datasets will be located online or locally on your computer and you'll want to load them directly. For '.Rdata' files you can do this using the `load()` command. For others you may want to use commands like `read.csv()`, `read.table()`, or `read.dta()` (that last one reads Stata files and requires the 'foreign' package, so you'll need to load it first). RStudio also has a drop-down menu item ('File' → 'Import dataset') that can help you load a local file.
+
+## Environment and History
+
+By default, R Studio allows you to see all the variables or 'objects' currently available to you in a particular session. Find the window/tab called "Environment" and take a look at what's there.
+
+There's another tab (likely in the same window) called "History" that contains all the commands you have run in the current session.
This can be super helpful when you're trying to piece together what you did a few moments ago or why that command you just ran worked and the one you tried before did not.
+
+## Getting help
+
+As mentioned earlier, the `help()` command is your friend. RStudio also has a 'Help' tab in one of the default windows. You can also use the RStudio cheatsheets, StackOverflow, the Verzani textbook, the [Quick-R tutorials](https://www.statmethods.net/index.html), and/or many, many other resources on the internet, including the [rseek search engine](https://rseek.org/) (which just searches the web for R-related resources).
\ No newline at end of file
diff --git a/r_lectures/w02-R_lecture.Rmd b/r_lectures/w02-R_lecture.Rmd
deleted file mode 120000
index 48f6b08..0000000
--- a/r_lectures/w02-R_lecture.Rmd
+++ /dev/null
@@ -1 +0,0 @@
-projects/w02-R_lecture/w02-R_lecture.Rmd
\ No newline at end of file
diff --git a/r_lectures/w02-R_lecture.Rmd b/r_lectures/w02-R_lecture.Rmd
new file mode 100644
index 0000000..8db89ef
--- /dev/null
+++ b/r_lectures/w02-R_lecture.Rmd
@@ -0,0 +1,273 @@
+---
+title: "Week 2 R lecture"
+subtitle: "Statistics and statistical programming \nNorthwestern University \nMTS 525"
+author: "Aaron Shaw"
+date: "April 4, 2019"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE)
+```
+
+## Adding comments to your code
+
+Sorry, I realized that I forgot to explain this in last week's R lecture!
+
+R interprets the `#` character and anything that comes after it as a comment. R will not try to interpret whatever comes next as a command:
+
+```{r}
+2+2
+
+# This is a comment. The next line is too:
+# 2+2
+```
+
+
+## More advanced variable types:
+
+### Factors
+
+These are for categorical data. You can create them with the `factor()` command or by running `as.factor()` on a character vector.
+
+```{r}
+cities <- factor(c("Chicago", "Detroit", "Milwaukee"))
+summary(cities)
+class(cities)
+
+more.cities <- c("Oakland", "Seattle", "San Diego")
+summary(more.cities)
+class(more.cities)
+
+more.cities <- as.factor(more.cities)
+summary(more.cities)
+class(more.cities)
+
+```
+
+### Lists
+
+Lists are a lot like vectors, but can contain any kind of object (e.g., other variables).
+
+```{r}
+cities.list <- list(cities, more.cities)
+cities.list
+```
+We can name the items in the list just like we did for a vector:
+```{r}
+names(cities.list) <- c("midwest", "west")
+
+# This works too:
+cities.list <- list("midwest" = cities, "west" = more.cities)
+```
+You can index into the list just like you can with a vector except that instead of one set of square brackets you have to use two:
+```{r}
+cities.list[["midwest"]]
+cities.list[[2]]
+```
+With a list you can also index recursively (down into the individual objects contained in the list). For example:
+```{r}
+cities.list[["west"]][2]
+```
+
+Some functions "just work" on lists. Others don't or produce weird output that you probably weren't expecting. You should be careful and check the output to see what happens:
+
+```{r}
+
+# summary works as you might hope:
+summary(cities.list)
+
+# table produces something very weird:
+table(cities.list)
+```
+
+### Matrices
+
+Matrices are a little less common for everyday use, but it's good to know they're there and that you can do matrix arithmetic to your heart's content. An example is below. Check out the help documentation for the `matrix()` function for more.
+
+```{r}
+
+m1 <- matrix(c(1:12), nrow=4, byrow=FALSE)
+m1
+
+m2 <- matrix(seq(2,24,2), nrow=4, byrow=FALSE)
+m2
+
+m1*m2 # element-wise multiplication
+t(m2) # transposition
+
+```
+
+### Data frames
+
+A data frame is a format for storing tabular data. Formally, it consists of a list of vectors of equal length.
+
+For our purposes, data frames are the most important data structure (or type) in R. We will use them constantly.
They have rows (usually units or observations) and columns (usually variables). There are also many functions designed to work especially (even exclusively) with data frames. Let's take a look at another built-in example dataset, `faithful` (note: read the help documentation on `faithful` to learn about the dataset!):
+
+```{r}
+faithful <- faithful # This makes the dataset visible in the "Environment" tab in RStudio.
+
+dim(faithful) # often the first thing I do with any data frame
+nrow(faithful)
+
+names(faithful) ## try colnames(faithful) too
+
+head(faithful) ## look at the first few rows of data
+
+summary(faithful)
+
+```
+
+You can index into a data frame using numeric values or variable names. The notation uses square brackets again and requires you to remember the convention of `[row, column]`:
+
+```{r}
+faithful[1,1] # The item in the first row of the first column
+
+faithful[,2] # all of the items in the second column
+
+faithful[10:20, 2] # ranges work too
+
+faithful[37, "eruptions"]
+```
+
+It is very useful to work with column (variable) names in a data frame using the `$` symbol:
+
+```{r}
+faithful$eruptions
+
+mean(faithful$waiting)
+
+boxplot(faithful$waiting)
+```
+
+Data frames are very useful for bivariate analyses (e.g., plots and tables). The base R notation for a bivariate presentation usually uses the `~` character. If both of the variables in your bivariate comparison are within the same data frame you can use the `data=` argument. For example, here is a scatterplot of eruption time (Y axis) over waiting time (X axis):
+
+```{r}
+plot(eruptions ~ waiting, data=faithful)
+```
+
+
+Data frames can have an arbitrary number of columns (variables). Another built-in dataset used frequently in R documentation and examples is `mtcars` (read the help documentation! it contains a codebook that tells you about each variable).
Let's look at that one next:
+
+```{r}
+mtcars <- mtcars
+
+dim(mtcars)
+
+head(mtcars)
+```
+
+There are many ways to create and modify data frames. Here is an example playing with the `mtcars` data. I use the `data.frame` command to build a new data frame from three vectors:
+
+```{r}
+
+my.mpg <- mtcars$mpg
+my.cyl <- mtcars$cyl
+my.disp <- mtcars$disp
+
+df.small <- data.frame(my.mpg, my.cyl, my.disp)
+class(df.small)
+head(df.small)
+
+# recode a value as missing
+df.small[5,1] <- NA
+
+# removing a column
+df.small[,3] <- NULL
+dim(df.small)
+head(df.small)
+
+```
+
+Creating new variables, recoding, and transformations look very similar to working with vectors. Notice the `na.rm=TRUE` argument I am passing to the `mean` function in the first line here:
+
+```{r}
+df.small$mpg.big <- df.small$my.mpg > mean(df.small$my.mpg, na.rm=TRUE)
+
+table(df.small$mpg.big)
+
+df.small$mpg.l <- log1p(df.small$my.mpg) # notice: log1p()
+head(df.small$mpg.l)
+## convert a number into a factor:
+
+df.small$my.cyl.factor <- factor(df.small$my.cyl)
+summary(df.small$my.cyl.factor)
+```
+
+Some special functions are particularly useful for working with data frames:
+
+```{r}
+is.na(df.small$my.mpg)
+
+sum(is.na(df.small$my.mpg)) # sum() works in mysterious ways sometimes...
+
+complete.cases(df.small)
+sum(complete.cases(df.small))
+```
+## "Apply" functions and beyond
+
+R has some special functions to help apply operations over vectors, lists, etc. These can seem a little complicated at first, but they are super, super useful.
+
+Most of the base R versions of these have "apply" in the name. There are also alternatives (some created by the same people who created ggplot2 that you can read more about in, for example, the Healy *Data Visualization* book). I will stick to the base R versions here. Please feel free to read about and use the alternatives!
+
+Let's start with an example using the `mtcars` dataset again.
The `sapply()` and `lapply()` functions both "apply" the second argument (a function) iteratively to the items (variables) in the first argument:
+
+```{r}
+sapply(mtcars, quantile)
+
+lapply(mtcars, quantile) # Same output, different format/class
+```
+
+Experiment with that idea on your own a little bit before moving on. For example, can you find the mean of each variable using either `sapply` or `lapply`?
+
+
+The `tapply` function allows you to apply functions conditionally. For example, below I find mean gas mileage by number of cylinders. The second argument (`mtcars$cyl`) provides an index into the first (`mtcars$mpg`) before the third argument (`mean`) is applied to each of the conditional subsets:
+
+```{r}
+tapply(mtcars$mpg, mtcars$cyl, mean)
+```
+
+Try some other calculations using `tapply()`. Can you calculate the average engine displacement conditional on number of cylinders? What about the average miles per gallon conditional on whether the car has an automatic transmission?
+
+Note that `apply()` works pretty smoothly with matrices, but it can be a bit complicated/surprising otherwise.
+
+### Some basic graphs with ggplot2
+
+ggplot2 is what I like to use for plotting so I'll develop examples with it from here on out.
+
+Make sure you've installed the package with `install.packages()` and load it with `library()`.
+
+There is another built-in (automotive) dataset that comes along with the ggplot2 package called `mpg`. This dataset includes variables of several types and is used in much of the package documentation, so it's helpful to become familiar with it.
+
+I'll develop a few simple examples below. For more, please take a look at the (extensive) [ggplot2 documentation](https://ggplot2.tidyverse.org/reference/index.html). There are **many** options and arguments for most of these functions and many more functions to help you produce publication-ready graphics.
Chapter 3 of the Healy book is also an extraordinary resource for getting started creating visualizations with ggplot2.
+
+```{r}
+library(ggplot2)
+mpg <- mpg
+
+# First thing, call ggplot() to start building up a plot
+# aes() indicates which variables to use as "aesthetic" mappings
+
+p <- ggplot(data=mpg, aes(manufacturer, hwy))
+
+p + geom_boxplot()
+
+# another relationship:
+p <- ggplot(data=mpg, aes(fl, hwy))
+p + geom_boxplot()
+p + geom_violin()
+
+```
+
+Here's another that visualizes the relationship between miles per gallon (mpg) in the city vs. mpg on the highway:
+
+```{r}
+p <- ggplot(data=mpg, aes(cty, hwy))
+
+p+geom_point()
+
+# Multivariate graphical displays can get pretty wild
+p + geom_point(aes(color=factor(class), shape=factor(cyl)))
+```
+
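
As one small step toward the publication-ready graphics mentioned above, here is a sketch that adds readable labels to the city-vs-highway scatterplot. It assumes ggplot2 is installed; the object name `p.labeled` and the label text are illustrative choices of mine, not part of the original lecture.

```{r}
library(ggplot2)

# Same city-vs-highway scatterplot as above, colored by vehicle class.
# labs() sets the axis labels, legend title, and plot title.
# 'p.labeled' is a hypothetical name used only for this example.
p.labeled <- ggplot(data=mpg, aes(cty, hwy)) +
  geom_point(aes(color=factor(class))) +
  labs(x="City miles per gallon",
       y="Highway miles per gallon",
       color="Vehicle class",
       title="Fuel economy in the mpg dataset")

p.labeled
```

Storing the plot in a variable before printing it makes it easy to keep adding layers or labels later, in the same way the lecture builds on `p`.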