X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/3f31579d7ac0cbade8c9fc684ea0130682160cd5..5884215d06117cc9edb5774934b5bc5914ee9fcb:/r_tutorials/w04-R_tutorial.html?ds=inline

diff --git a/r_tutorials/w04-R_tutorial.html b/r_tutorials/w04-R_tutorial.html
index 9a5cb2e..46618e4 100644
--- a/r_tutorials/w04-R_tutorial.html
+++ b/r_tutorials/w04-R_tutorial.html
@@ -1572,8 +1572,54 @@ head(more.cars)
## 
 ## TRUE 
 ##  384
-

When find yourself trying to load a tabular data file that consists of plain text, but has some idiosyncratic difference from a csv (e.g., it is tab-separated instead of comma-separated), you should use read.delim().

-

How will you know what to use? Get to know your data first! Seriously, try opening it the file (or at least opening up part of it) using a text editor and/or spreadsheet software. Looking at the “raw” plain text can help you figure out what arguments you need to use to make the data load up exactly the way you want it.

+

When you find yourself trying to load a tabular data file that consists of plain text, but has some idiosyncratic difference from a csv (e.g., it is tab-separated instead of comma-separated), you may want to use read.delim().
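
To make the read.delim() case concrete, here is a minimal sketch that writes a small tab-separated file and reads it back. The temporary file and the object names (tsv.path, some.cars) are made up for illustration and are not part of the course data.

```r
## Write a few rows of the built-in mtcars data to a temporary tab-separated file
tsv.path <- tempfile(fileext = ".tsv")
write.table(head(mtcars), tsv.path, sep = "\t", quote = FALSE, col.names = NA)

## Peek at the raw text: the fields are separated by tabs, not commas
readLines(tsv.path, n = 3)

## read.delim() expects tab-separated values with a header row by default
some.cars <- read.delim(tsv.path, row.names = 1)
head(some.cars)
```

If your file uses a different separator (say, a semicolon or a pipe), the sep argument to read.delim() or read.table() is probably what you need.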

+

Note that the Tidyverse has some related functions, notably read_csv(), which can also import csv files very efficiently and comes with helpful defaults that try to guess what kinds of variables you’re working with. The guesses are usually pretty good! Here’s an example using read_csv():

+
library(tidyverse)
+
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
+
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
+## ✓ tibble  3.0.3     ✓ dplyr   1.0.2
+## ✓ tidyr   1.1.2     ✓ stringr 1.4.0
+## ✓ readr   1.3.1     ✓ forcats 0.5.0
+
## ── Conflicts ────────────────────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
+## x dplyr::filter() masks stats::filter()
+## x dplyr::lag()    masks stats::lag()
+
yet.more.cars <- read_csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/mtcars.csv"))
+
## Warning: Missing column names filled in: 'X1' [1]
+
## Parsed with column specification:
+## cols(
+##   X1 = col_character(),
+##   mpg = col_double(),
+##   cyl = col_double(),
+##   disp = col_double(),
+##   hp = col_double(),
+##   drat = col_double(),
+##   wt = col_double(),
+##   qsec = col_double(),
+##   vs = col_double(),
+##   am = col_double(),
+##   gear = col_double(),
+##   carb = col_double()
+## )
+
yet.more.cars
+
## # A tibble: 32 x 12
+##    X1            mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
+##    <chr>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+##  1 Mazda RX4    21       6  160    110  3.9   2.62  16.5     0     1     4     4
+##  2 Mazda RX4 …  21       6  160    110  3.9   2.88  17.0     0     1     4     4
+##  3 Datsun 710   22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
+##  4 Hornet 4 D…  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
+##  5 Hornet Spo…  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
+##  6 Valiant      18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
+##  7 Duster 360   14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
+##  8 Merc 240D    24.4     4  147.    62  3.69  3.19  20       1     0     4     2
+##  9 Merc 230     22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
+## 10 Merc 280     19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
+## # … with 22 more rows
+
table(yet.more.cars == my.mtcars)
+
## 
+## TRUE 
+##  384
+

How will you know what to use? The best practice is always to get to know your data first! Seriously, try opening the file (or at least part of it) using a text editor and/or spreadsheet software. Looking at the “raw” plain text can help you figure out what arguments you need to use to make the data load up exactly the way you want it.

For example, you might notice that my import of the mtcars.csv file introduces an important difference from the original mtcars dataset. In the original mtcars, the car model names are row.names attributes of the dataframe instead of a variable.

head(mtcars)
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
@@ -1595,7 +1641,7 @@ head(more.cars)
## [25] "Pontiac Firebird"    "Fiat X1-9"           "Porsche 914-2"
## [28] "Lotus Europa"        "Ford Pantera L"      "Ferrari Dino"
## [31] "Maserati Bora"       "Volvo 142E"
-

Since it’s the first column of the raw data, you can fix this with an additional argument to read.csv:

+

Since it’s the first column of the raw data, you can fix this with an additional argument to read.csv (I’m sure there’s also a way to do this with read_csv() but I didn’t look it up in time to include in this tutorial):

my.mtcars <- read.csv(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/mtcars.csv"), 
                       row.names=1)
 head(my.mtcars)
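
For the read_csv() route, one possible approach (a sketch, not necessarily the canonical one) is to import the file as usual and then move the first column into the row names with column_to_rownames() from the tibble package. To keep the example self-contained, it writes the built-in mtcars data to a temporary file instead of downloading the course file; csv.path, raw.cars, and tidy.mtcars are made-up names.

```r
library(readr)   # provides read_csv()
library(tibble)  # provides column_to_rownames()

## Stand-in for the course file: write the built-in mtcars to a temporary csv
csv.path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv.path)

## read_csv() imports the model names as an ordinary (auto-named) first column
raw.cars <- read_csv(csv.path)

## Move that first column, whatever name readr gave it, into the row names
tidy.mtcars <- column_to_rownames(raw.cars, var = names(raw.cars)[1])
head(row.names(tidy.mtcars))
```

Note that column_to_rownames() returns a plain data frame rather than a tibble, since tibbles are designed not to carry row names.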
@@ -1639,16 +1685,9 @@ sapply(mtcars[variables], function(v){

2.2 Conditional means in Tidyverse code

I should reiterate here that the *apply functions are part of Base R. Other functions exist to do similar things, and you may find them more intelligible or useful. In particular, this is the sort of task that the Tidyverse handles in an arguably more intuitive fashion than Base R: it lets you organize functions as a sequence of actions whose names are verbs, which generally leads to more readable code. Here’s an example snippet that replicates the same output:

-
library(tidyverse)
-
## ── Attaching packages ───────────────────────────────────────────────────── tidyverse 1.3.0 ──
-
## ✓ ggplot2 3.3.2     ✓ purrr   0.3.4
-## ✓ tibble  3.0.1     ✓ dplyr   1.0.2
-## ✓ tidyr   1.1.1     ✓ stringr 1.4.0
-## ✓ readr   1.3.1     ✓ forcats 0.5.0
-
## ── Conflicts ──────────────────────────────────────────────────────── tidyverse_conflicts() ──
-## x dplyr::filter() masks stats::filter()
-## x dplyr::lag()    masks stats::lag()
-
mtcars %>% 
+
library(tidyverse)
+
+mtcars %>% 
   group_by(cyl) %>%
   summarize( across(
     .cols=c(mpg, disp, hp, wt),
@@ -1750,9 +1789,9 @@ head(cyl.conditional.means)

Again, you could run row.names(cyl.conditional.means) <- NULL to reset the row numbers.

By now you might have noticed that some of my code in this section replicates parts of the output that the Tidyverse example above created by default. As I said earlier, the Tidyverse really shines when it comes to tidying data (shocking, I know). There are also functions (“verbs”) in the Tidyverse libraries to perform all of the additional steps (such as sorting data by the values of a column). If you are interested in doing so, check out more of the Tidyverse documentation and the Wickham and Grolemund R for Data Science book listed among the course resources.
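
As a rough illustration of the sorting “verb,” here is a sketch that computes conditional means by cylinder count and then sorts them with arrange(). The object name sorted.means is made up, and the choice to sort by mpg in descending order is just an example.

```r
library(dplyr)

## Conditional means by number of cylinders, sorted from highest to lowest mean mpg
sorted.means <- mtcars %>%
  group_by(cyl) %>%
  summarize(across(c(mpg, disp, hp, wt), mean)) %>%
  arrange(desc(mpg))

sorted.means
```

Dropping desc() would sort in ascending order instead.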

-
-

2.5 Working with dates

-

Date and time objects are another more advanced data class in R. Managing date and time data can be a headache. This is because dates and times tend to be formatted inconsistently and are usually given to you as character variables, so you will need to transform them into a format that R can “understand” as dates. There are many packages for working with dates and times, but for now I’ll introduce you to the Base R way of doing so. This uses a data type formally called “POSIX” (no need to worry about why it’s called that).

+
+

2.5 Working with dates (in Base R)

+

Date and time objects are another more advanced data class in R. Managing date and time data can be a headache. This is because dates and times tend to be formatted inconsistently and are usually given to you as character variables, so you should know how to transform them into a format that R can “understand” as dates. There are many packages for working with dates and times, but for now I’ll introduce you to the Base R way of doing so. This uses a data type formally called “POSIX” (no need to worry about why it’s called that).

To build up an example, I’ll create some date-time values, add a little noise, and convert them into a character vector:

add.an.hour <- seq(0, 3600*24, by=3600)
 some.hours <- as.character(Sys.time() + add.an.hour) ## Look up Sys.time() to see what it does.
@@ -1760,34 +1799,48 @@ some.hours <- as.character(Sys.time() + add.an.hour) ## Look up Sys.time() to
class(some.hours)
## [1] "character"
head(some.hours)
-
## [1] "2020-09-28 12:06:04" "2020-09-28 13:06:04" "2020-09-28 14:06:04"
-## [4] "2020-09-28 15:06:04" "2020-09-28 16:06:04" "2020-09-28 17:06:04"
+
## [1] "2020-09-29 12:23:26" "2020-09-29 13:23:26" "2020-09-29 14:23:26"
+## [4] "2020-09-29 15:23:26" "2020-09-29 16:23:26" "2020-09-29 17:23:26"

These are beautifully formatted timestamps, but R will not understand them as such. This is often how you might receive data in, for example, a dataset you import from Qualtrics, scrape from the web, or elsewhere. You can convert the some.hours vector into an object class that R will recognize as a time object using the as.POSIXct() function. Notice that it even adds a timezone back in!

as.POSIXct(some.hours)
-
##  [1] "2020-09-28 12:06:04 CDT" "2020-09-28 13:06:04 CDT"
-##  [3] "2020-09-28 14:06:04 CDT" "2020-09-28 15:06:04 CDT"
-##  [5] "2020-09-28 16:06:04 CDT" "2020-09-28 17:06:04 CDT"
-##  [7] "2020-09-28 18:06:04 CDT" "2020-09-28 19:06:04 CDT"
-##  [9] "2020-09-28 20:06:04 CDT" "2020-09-28 21:06:04 CDT"
-## [11] "2020-09-28 22:06:04 CDT" "2020-09-28 23:06:04 CDT"
-## [13] "2020-09-29 00:06:04 CDT" "2020-09-29 01:06:04 CDT"
-## [15] "2020-09-29 02:06:04 CDT" "2020-09-29 03:06:04 CDT"
-## [17] "2020-09-29 04:06:04 CDT" "2020-09-29 05:06:04 CDT"
-## [19] "2020-09-29 06:06:04 CDT" "2020-09-29 07:06:04 CDT"
-## [21] "2020-09-29 08:06:04 CDT" "2020-09-29 09:06:04 CDT"
-## [23] "2020-09-29 10:06:04 CDT" "2020-09-29 11:06:04 CDT"
-## [25] "2020-09-29 12:06:04 CDT"
+
##  [1] "2020-09-29 12:23:26 CDT" "2020-09-29 13:23:26 CDT"
+##  [3] "2020-09-29 14:23:26 CDT" "2020-09-29 15:23:26 CDT"
+##  [5] "2020-09-29 16:23:26 CDT" "2020-09-29 17:23:26 CDT"
+##  [7] "2020-09-29 18:23:26 CDT" "2020-09-29 19:23:26 CDT"
+##  [9] "2020-09-29 20:23:26 CDT" "2020-09-29 21:23:26 CDT"
+## [11] "2020-09-29 22:23:26 CDT" "2020-09-29 23:23:26 CDT"
+## [13] "2020-09-30 00:23:26 CDT" "2020-09-30 01:23:26 CDT"
+## [15] "2020-09-30 02:23:26 CDT" "2020-09-30 03:23:26 CDT"
+## [17] "2020-09-30 04:23:26 CDT" "2020-09-30 05:23:26 CDT"
+## [19] "2020-09-30 06:23:26 CDT" "2020-09-30 07:23:26 CDT"
+## [21] "2020-09-30 08:23:26 CDT" "2020-09-30 09:23:26 CDT"
+## [23] "2020-09-30 10:23:26 CDT" "2020-09-30 11:23:26 CDT"
+## [25] "2020-09-30 12:23:26 CDT"

If things aren’t formatted in quite the way R expects, you can also tell it how to parse a character string as a POSIXct object:

-
m <- "2019-02-21 04:35:00"
-class(m)
+
sometime <- "2019-02-21 04:35:00"
+class(sometime)
## [1] "character"
-
a.good.time <- as.POSIXct(m, format="%Y-%m-%d %H:%M:%S", tz="CDT")
+
a.good.time <- as.POSIXct(sometime, format="%Y-%m-%d %H:%M:%S", tz="America/Chicago")
 class(a.good.time)
## [1] "POSIXct" "POSIXt"
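
The format argument matters most when the text does not follow the year-month-day layout. Here is a small sketch with a made-up, day-first timestamp (messy.time, parsed.time, and bad.parse are illustrative names, and I use UTC just to keep the example simple):

```r
## A day-first timestamp with slashes and no seconds (made-up value)
messy.time <- "21/02/2019 04:35"

## Describe the layout piece by piece: day/month/year hour:minute
parsed.time <- as.POSIXct(messy.time, format = "%d/%m/%Y %H:%M", tz = "UTC")
class(parsed.time)
format(parsed.time, "%Y-%m-%d %H:%M:%S")

## A format string that does not match the text gives NA instead of an error
bad.parse <- as.POSIXct(messy.time, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
is.na(bad.parse)
```

That silent NA is worth checking for after any conversion, since it is easy to miss in a long vector.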

Once you have a time object, you can even do date arithmetic with difftime() (but watch out as this can get complicated):

difftime(Sys.time(), a.good.time, units="weeks")
-
## Time difference of 83.64594 weeks
+
## Time difference of 83.79052 weeks

This calculated the number of weeks elapsed between the current time and an example date/time I created above.
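
Since Sys.time() changes every time you run it, a sketch with two fixed timestamps may be easier to check by hand. The values and object names below are made up. The second part shows one way this “can get complicated”: a calendar day that crosses the start of daylight saving time is only 23 hours long.

```r
## Two fixed timestamps exactly 14 days apart (no clock change in between)
start.time <- as.POSIXct("2019-02-21 04:35:00", tz = "America/Chicago")
end.time   <- as.POSIXct("2019-03-07 04:35:00", tz = "America/Chicago")
two.weeks  <- difftime(end.time, start.time, units = "weeks")
as.numeric(two.weeks, units = "weeks")   # 2

## Noon to noon across the 2019 spring clock change in this time zone
before.dst <- as.POSIXct("2019-03-09 12:00:00", tz = "America/Chicago")
after.dst  <- as.POSIXct("2019-03-10 12:00:00", tz = "America/Chicago")
dst.day    <- difftime(after.dst, before.dst, units = "hours")
as.numeric(dst.day, units = "hours")     # 23, not 24
```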

+
+

2.5.1 as.Date() vs. as.POSIXct()

+

The as.Date() function provides an alternative to as.POSIXct() that is far more memorable and readable, but far less precise. Note that it truncates the time of day and the timezone from the output. I’ll use the same sometime string as above:

+
a.good.time <- as.Date(sometime, format="%Y-%m-%d %H:%M:%S", tz="CDT")
+class(a.good.time)
+
## [1] "Date"
+
a.good.time
+
## [1] "2019-02-21"
+
+
+

2.5.2 An easier way (most of the time) in the Tidyverse

+

The Tidyverse (via the lubridate package) usually handles dates and times quite well. When you need to import and work with date and time objects, you may find it best to try Tidyverse data import tools (e.g., read_csv()) as a starting point for this reason.
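
Here is a minimal sketch of what that can look like with lubridate directly. Depending on your Tidyverse version, lubridate may need to be loaded on its own, as I do here. The object names are made up, and the timestamp is the same example value used earlier.

```r
library(lubridate)

## ymd_hms() parses year-month-day hour:minute:second text into a POSIXct object
parsed <- ymd_hms("2019-02-21 04:35:00", tz = "America/Chicago")
class(parsed)

## mdy() handles month-day-year text and returns a Date object
us.style <- mdy("02/21/2019")
class(us.style)
us.style
```

The function names spell out the order of the components, so you don’t have to remember the %-codes used by the Base R functions.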

+

I highly recommend reading this chapter of Wickham and Grolemund on dates and times as they introduce a number of challenges and nuances that can befuddle even the most brilliant statistical programmers.

+
@@ -1805,31 +1858,31 @@ head(odds)
more.odds <- rep(seq(from=1, to=100, by=2), 5)

You can use sample() to draw samples from an object:

sample(x=odds, size=3)
-
## [1] 73 59 87
+
## [1] 39 13 71
sample(x=evens, size=3)
-
## [1] 12 36 14
+
## [1] 58 22 32

You can also sample “with replacement.” Here I take 100 random draws from the binomial distribution, which is analogous to 100 independent trials of, say, a coin flip:

draws <- sample(x=c(0,1), size=100, replace=TRUE)
 
 table(draws)
## draws
 ##  0  1 
-## 51 49
+## 52 48
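
As an aside, Base R also has a function for drawing from the binomial distribution directly. A sketch of the same kind of coin-flip draws using rbinom() (the name more.draws is made up):

```r
## 100 independent trials, each a single fair coin flip coded as 0 or 1
more.draws <- rbinom(n = 100, size = 1, prob = 0.5)

table(more.draws)
```

Changing prob gives you an unfair coin, which would take more work to set up with sample().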

What if you wanted to take a random set of 10 observations from a dataframe? You can use sample on an index of the rows:

odds.n.evens <- data.frame(odds, evens)
 
 odds.n.evens[ sample(row.names(odds.n.evens), 10), ]
##    odds evens
-## 34   67    68
-## 12   23    24
-## 33   65    66
 ## 35   69    70
-## 43   85    86
-## 15   29    30
+## 44   87    88
 ## 5     9    10
-## 10   19    20
-## 23   45    46
-## 9    17    18
+## 3     5     6
+## 43   85    86
+## 8    15    16
+## 36   71    72
+## 21   41    42
+## 47   93    94
+## 41   81    82
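
Sampling integer row positions instead of row names is another option, and it works even when the row names are not simple numbers. This sketch rebuilds odds and evens so it runs on its own; I’m assuming they match the vectors defined earlier, and picked.rows and ten.rows are made-up names.

```r
## Assumed to match the earlier definitions of these vectors
odds  <- seq(from = 1, to = 99, by = 2)
evens <- seq(from = 2, to = 100, by = 2)
odds.n.evens <- data.frame(odds, evens)

## Draw 10 distinct row positions, then use them to index the rows
picked.rows <- sample(nrow(odds.n.evens), 10)
ten.rows <- odds.n.evens[picked.rows, ]
ten.rows
```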

3.1 Managing randomness

Try running one of the sample commands above again. You will (probably!) get a different result because sample makes a random draw each time it runs.

@@ -1989,7 +2042,7 @@ $(document).ready(function () {
     theme: "bootstrap3",
     context: '.toc-content',
     hashGenerator: function (text) {
-      return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_').toLowerCase();
+      return text.replace(/[.\\/?&!#<>]/g, '').replace(/\s/g, '_');
     },
     ignoreSelector: ".toc-ignore",
     scrollTo: 0