X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/3f31579d7ac0cbade8c9fc684ea0130682160cd5..b5b6d3406dd061c2f83e5082812718b686da0d5a:/r_tutorials/w04-R_tutorial.html diff --git a/r_tutorials/w04-R_tutorial.html b/r_tutorials/w04-R_tutorial.html index 9a5cb2e..46618e4 100644 --- a/r_tutorials/w04-R_tutorial.html +++ b/r_tutorials/w04-R_tutorial.html @@ -1572,8 +1572,54 @@ head(more.cars)
##
## TRUE
## 384
-When you find yourself trying to load a tabular data file that consists of plain text, but has some idiosyncratic difference from a csv (e.g., it is tab-separated instead of comma-separated), you should use read.delim()
.
How will you know what to use? Get to know your data first! Seriously, try opening the file (or at least opening up part of it) using a text editor and/or spreadsheet software. Looking at the "raw" plain text can help you figure out what arguments you need to use to make the data load up exactly the way you want it.
+When you find yourself trying to load a tabular data file that consists of plain text, but has some idiosyncratic difference from a csv (e.g., it is tab-separated instead of comma-separated), you may want to use read.delim()
.
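As a minimal sketch of what that can look like, the snippet below writes a few rows of the built-in mtcars data to a temporary tab-separated file and reads it back with read.delim(). The temporary file and the object names (tsv.path, tsv.cars) are made up for illustration and are not part of the course materials.

```r
# Write a small tab-separated file to a temporary location (illustrative data only).
# col.names = NA adds a blank header cell above the row-name column.
tsv.path <- tempfile(fileext = ".tsv")
write.table(head(mtcars), file = tsv.path, sep = "\t", quote = FALSE, col.names = NA)

# read.delim() defaults to tab-separated input with a header row.
# row.names = 1 turns the first column (the car models) into row names.
tsv.cars <- read.delim(tsv.path, row.names = 1)
head(tsv.cars)
```

Because the separator is a tab, the spaces inside the car model names do not confuse the import.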
Note that the Tidyverse has some related functions, namely read_csv()
that can also import csv files very efficiently and with helpful defaults to try and guess what kinds of variables you're working with. The guesses are usually pretty good! Here's an example using read_csv():
library(tidyverse)
+## ── Attaching packages ──────────────────────────────────────── tidyverse 1.3.0 ──
+## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
+## ✓ tibble 3.0.3 ✓ dplyr 1.0.2
+## ✓ tidyr 1.1.2 ✓ stringr 1.4.0
+## ✓ readr 1.3.1 ✓ forcats 0.5.0
+## ── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
+## x dplyr::filter() masks stats::filter()
+## x dplyr::lag() masks stats::lag()
+yet.more.cars <- read_csv(url("https://communitydata.cc/~ads/teaching/2019/stats/data/week_03/mtcars.csv"))
+## Warning: Missing column names filled in: 'X1' [1]
+## Parsed with column specification:
+## cols(
+## X1 = col_character(),
+## mpg = col_double(),
+## cyl = col_double(),
+## disp = col_double(),
+## hp = col_double(),
+## drat = col_double(),
+## wt = col_double(),
+## qsec = col_double(),
+## vs = col_double(),
+## am = col_double(),
+## gear = col_double(),
+## carb = col_double()
+## )
+yet.more.cars
+## # A tibble: 32 x 12
+## X1 mpg cyl disp hp drat wt qsec vs am gear carb
+## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
+## 1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
+## 2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
+## 3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
+## 4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
+## 5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
+## 6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
+## 7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
+## 8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
+## 9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
+## 10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
+## # … with 22 more rows
+table(yet.more.cars == my.mtcars)
+##
+## TRUE
+## 384
+How will you know what to use? The best practice is always to get to know your data first! Seriously, try opening the file (or at least opening up part of it) using a text editor and/or spreadsheet software. Looking at the "raw" plain text can help you figure out what arguments you need to use to make the data load up exactly the way you want it.
For example, you might notice that my import of the mtcars.csv file introduces an important difference from the original mtcars
dataset. In the original mtcars
, the car model names are row.names
attributes of the dataframe instead of a variable.
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb
@@ -1595,7 +1641,7 @@ head(more.cars)
## [25] "Pontiac Firebird" "Fiat X1-9" "Porsche 914-2"
## [28] "Lotus Europa" "Ford Pantera L" "Ferrari Dino"
## [31] "Maserati Bora" "Volvo 142E"
-Since it's the first column of the raw data, you can fix this with an additional argument to read.csv
:
Since it's the first column of the raw data, you can fix this with an additional argument to read.csv
(I'm sure there's also a way to do this with read_csv()
but I didn't look it up in time to include in this tutorial):
my.mtcars <- read.csv(url("https://communitydata.science/~ads/teaching/2020/stats/data/week_04/mtcars.csv"),
row.names=1)
head(my.mtcars)
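One possible way to get a similar result with read_csv() is sketched below. It relies on column_to_rownames() from the tibble package to move the first column into the row names after the import. The temporary file and the object names (csv.path, cars.tbl, cars.df) are assumptions made for illustration, and this is not necessarily the approach the tutorial would have chosen.

```r
library(readr)
library(tibble)

# Write the built-in mtcars data to a temporary csv so the example is self-contained.
csv.path <- tempfile(fileext = ".csv")
write.csv(mtcars, file = csv.path)

# read_csv() imports the model names as an ordinary first column.
# col_types = cols() keeps the default type guessing but silences the column report.
cars.tbl <- read_csv(csv.path, col_types = cols())

# Move that first column into the row names. The auto-generated column name
# differs across readr versions, so look it up instead of hard-coding it.
cars.df <- column_to_rownames(cars.tbl, var = names(cars.tbl)[1])
head(cars.df)
```

Keep in mind that tibbles deliberately avoid row names, so the result here is a plain data frame rather than a tibble.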
@@ -1639,16 +1685,9 @@ sapply(mtcars[variables], function(v){
I should reiterate here that the *apply
functions are part of Base R. Other functions that do similar things exist, and you may find them more intelligible or useful. In particular, this is the sort of task that the Tidyverse handles in an arguably more intuitive fashion than Base R: it lets you organize functions as a sequence of actions whose names are verbs, which generally leads to more readable code. Here's an example snippet that replicates the same output:
library(tidyverse)
-## ── Attaching packages ──────────────────────────────────────── tidyverse 1.3.0 ──
-## ✓ ggplot2 3.3.2 ✓ purrr 0.3.4
-## ✓ tibble 3.0.1 ✓ dplyr 1.0.2
-## ✓ tidyr 1.1.1 ✓ stringr 1.4.0
-## ✓ readr 1.3.1 ✓ forcats 0.5.0
-## ── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
-## x dplyr::filter() masks stats::filter()
-## x dplyr::lag() masks stats::lag()
-mtcars %>%
+library(tidyverse)
+
+mtcars %>%
group_by(cyl) %>%
summarize( across(
.cols=c(mpg, disp, hp, wt),
@@ -1750,9 +1789,9 @@ head(cyl.conditional.means)
Again, you could run row.names(cyl.conditional.means)
<- NULL to reset the row numbers.
By now you might have noticed that some of my code in this section replicates parts of the output that the Tidyverse example above created by default. As I said earlier, the Tidyverse really shines when it comes to tidying data (shocking, I know). There are also functions ("verbs") in the Tidyverse libraries to perform all of the additional steps (such as sorting data by the values of a column). If you are interested in doing so, check out more of the Tidyverse documentation and the Wickham and Grolemund R for Data Science book listed among the course resources.
Date and time objects are another more advanced data class in R. Managing date and time data can be a headache. This is because dates and times tend to be formatted inconsistently and are usually given to you as character variables, so you will need to transform them into a format that R can "understand" as dates. There are many packages for working with dates and times, but for now I'll introduce you to the Base R way of doing so. This uses a data type formally called "POSIX" (no need to worry about why it's called that).
+Date and time objects are another more advanced data class in R. Managing date and time data can be a headache. This is because dates and times tend to be formatted inconsistently and are usually given to you as character variables, so you should know how to transform them into a format that R can "understand" as dates. There are many packages for working with dates and times, but for now I'll introduce you to the Base R way of doing so. This uses a data type formally called "POSIX" (no need to worry about why it's called that).
To build up an example, I'll create some date-time values, add a little noise, and convert them into a character vector:
add.an.hour <- seq(0, 3600*24, by=3600)
some.hours <- as.character(Sys.time() + add.an.hour) ## Look up Sys.time() to see what it does.
@@ -1760,34 +1799,48 @@ some.hours <- as.character(Sys.time() + add.an.hour) ## Look up Sys.time() to
class(some.hours)
## [1] "character"
head(some.hours)
-## [1] "2020-09-28 12:06:04" "2020-09-28 13:06:04" "2020-09-28 14:06:04"
-## [4] "2020-09-28 15:06:04" "2020-09-28 16:06:04" "2020-09-28 17:06:04"
+## [1] "2020-09-29 12:23:26" "2020-09-29 13:23:26" "2020-09-29 14:23:26"
+## [4] "2020-09-29 15:23:26" "2020-09-29 16:23:26" "2020-09-29 17:23:26"
These are beautifully formatted timestamps, but R will not understand them as such. This is often how you receive timestamps in, for example, a dataset you import from Qualtrics, scrape from the web, or obtain elsewhere. You can convert the some.hours
vector into an object class that R will recognize as a time object using the as.POSIXct()
function. Notice that it even adds a timezone back in!
as.POSIXct(some.hours)
-## [1] "2020-09-28 12:06:04 CDT" "2020-09-28 13:06:04 CDT"
-## [3] "2020-09-28 14:06:04 CDT" "2020-09-28 15:06:04 CDT"
-## [5] "2020-09-28 16:06:04 CDT" "2020-09-28 17:06:04 CDT"
-## [7] "2020-09-28 18:06:04 CDT" "2020-09-28 19:06:04 CDT"
-## [9] "2020-09-28 20:06:04 CDT" "2020-09-28 21:06:04 CDT"
-## [11] "2020-09-28 22:06:04 CDT" "2020-09-28 23:06:04 CDT"
-## [13] "2020-09-29 00:06:04 CDT" "2020-09-29 01:06:04 CDT"
-## [15] "2020-09-29 02:06:04 CDT" "2020-09-29 03:06:04 CDT"
-## [17] "2020-09-29 04:06:04 CDT" "2020-09-29 05:06:04 CDT"
-## [19] "2020-09-29 06:06:04 CDT" "2020-09-29 07:06:04 CDT"
-## [21] "2020-09-29 08:06:04 CDT" "2020-09-29 09:06:04 CDT"
-## [23] "2020-09-29 10:06:04 CDT" "2020-09-29 11:06:04 CDT"
-## [25] "2020-09-29 12:06:04 CDT"
+## [1] "2020-09-29 12:23:26 CDT" "2020-09-29 13:23:26 CDT"
+## [3] "2020-09-29 14:23:26 CDT" "2020-09-29 15:23:26 CDT"
+## [5] "2020-09-29 16:23:26 CDT" "2020-09-29 17:23:26 CDT"
+## [7] "2020-09-29 18:23:26 CDT" "2020-09-29 19:23:26 CDT"
+## [9] "2020-09-29 20:23:26 CDT" "2020-09-29 21:23:26 CDT"
+## [11] "2020-09-29 22:23:26 CDT" "2020-09-29 23:23:26 CDT"
+## [13] "2020-09-30 00:23:26 CDT" "2020-09-30 01:23:26 CDT"
+## [15] "2020-09-30 02:23:26 CDT" "2020-09-30 03:23:26 CDT"
+## [17] "2020-09-30 04:23:26 CDT" "2020-09-30 05:23:26 CDT"
+## [19] "2020-09-30 06:23:26 CDT" "2020-09-30 07:23:26 CDT"
+## [21] "2020-09-30 08:23:26 CDT" "2020-09-30 09:23:26 CDT"
+## [23] "2020-09-30 10:23:26 CDT" "2020-09-30 11:23:26 CDT"
+## [25] "2020-09-30 12:23:26 CDT"
If things aren't formatted in quite the way R expects, you can also tell it how to parse a character string as a POSIXct object:
-m <- "2019-02-21 04:35:00"
-class(m)
+sometime <- "2019-02-21 04:35:00"
+class(sometime)
## [1] "character"
-a.good.time <- as.POSIXct(m, format="%Y-%m-%d %H:%M:%S", tz="CDT")
+a.good.time <- as.POSIXct(sometime, format="%Y-%m-%d %H:%M:%S", tz="CDT")
class(a.good.time)
## [1] "POSIXct" "POSIXt"
Once you have a time object, you can even do date arithmetic with difftime()
(but watch out as this can get complicated):
difftime(Sys.time(), a.good.time, units="weeks")
-## Time difference of 83.64594 weeks
+## Time difference of 83.79052 weeks
This calculated the number of weeks elapsed between the current time and an example date/time I created above.
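Because Sys.time() changes every time you run it, the result above is hard to check by hand. Here is a small sketch with two fixed timestamps exactly one week apart. The object names (start.time, end.time, elapsed) are made up for illustration. I also use the time zone name "America/Chicago" here rather than an abbreviation, on the assumption that your system has the standard time zone database; abbreviations such as "CDT" are not guaranteed to be recognized as time zone names on every system, and OlsonNames() lists the names your system knows.

```r
# Two fixed timestamps, exactly seven days apart, with an explicit time zone name.
start.time <- as.POSIXct("2019-02-21 04:35:00", format = "%Y-%m-%d %H:%M:%S",
                         tz = "America/Chicago")
end.time <- as.POSIXct("2019-02-28 04:35:00", format = "%Y-%m-%d %H:%M:%S",
                       tz = "America/Chicago")

# difftime() returns an object that carries its units with it.
elapsed <- difftime(end.time, start.time, units = "weeks")
elapsed

# Convert to a plain number when you need it for further calculations.
as.numeric(elapsed)
```

No daylight saving change falls between these two dates, so the difference works out to exactly one week.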
+
+2.5.1 as.Date
vs. as.POSIXct()
+The as.Date()
function provides an alternative to as.POSIXct()
that is far more memorable and readable, but far less precise. Note that it truncates the time of day and the timezone from the output. I'll use the same sometime string as above:
+a.good.time <- as.Date(sometime, format="%Y-%m-%d %H:%M:%S", tz="CDT")
+class(a.good.time)
+## [1] "Date"
+a.good.time
+## [1] "2019-02-21"
+
+
+2.5.2 An easier way (most of the time) in the Tidyverse
+The Tidyverse (via the lubridate
package) usually handles dates and times quite well. When you need to import and work with date and time objects, you may find it best to try Tidyverse data import tools (e.g., read_csv()
) as a starting point for this reason.
+I highly recommend reading this chapter of Wickham and Grolemund on dates and times as they introduce a number of challenges and nuances that can befuddle even the most brilliant statistical programmers.
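To give a flavor of what that looks like, here is a minimal sketch that parses the same timestamp string with lubridate. It uses ymd_hms(), as_date(), and hour() from that package. The object name parsed.time and the choice of "America/Chicago" as the time zone are my assumptions for the example.

```r
library(lubridate)

sometime <- "2019-02-21 04:35:00"

# ymd_hms() expects year-month-day hour-minute-second and needs no format string.
parsed.time <- ymd_hms(sometime, tz = "America/Chicago")
class(parsed.time)

# Drop the time of day when you only need the calendar date.
as_date(parsed.time)

# Accessor functions pull out individual components.
hour(parsed.time)
```

The function name spells out the order of the components, so a string written as day-month-year would call for dmy_hms() instead.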
+
more.odds <- rep(seq(from=1, to=100, by=2), 5)
You can use sample()
to draw samples from an object:
sample(x=odds, size=3)
-## [1] 73 59 87
+## [1] 39 13 71
sample(x=evens, size=3)
-## [1] 12 36 14
+## [1] 58 22 32
You can also sample "with replacement." Here I take 100 random draws from the binomial distribution, which is analogous to 100 independent trials of, say, a coin flip:
draws <- sample(x=c(0,1), size=100, replace=TRUE)
table(draws)
## draws
## 0 1
-## 51 49
+## 52 48
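If you would rather draw from the binomial distribution directly, Base R also provides rbinom(). The sketch below should be roughly equivalent to the sample() call above, assuming a fair coin (prob=0.5). The object name coin.flips is made up for illustration.

```r
# 100 independent trials, each with one "coin flip" and a 50% chance of a 1.
coin.flips <- rbinom(n = 100, size = 1, prob = 0.5)
table(coin.flips)

# The proportion of 1s should land somewhere near 0.5.
mean(coin.flips)
```

Changing prob lets you simulate an unfair coin, which sample() can also do through its prob argument.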
What if you wanted to take a random set of 10 observations from a dataframe? You can use sample
on an index of the rows:
odds.n.evens <- data.frame(odds, evens)
odds.n.evens[ sample(row.names(odds.n.evens), 10), ]
## odds evens
-## 34 67 68
-## 12 23 24
-## 33 65 66
## 35 69 70
-## 43 85 86
-## 15 29 30
+## 44 87 88
## 5 9 10
-## 10 19 20
-## 23 45 46
-## 9 17 18
+## 3 5 6
+## 43 85 86
+## 8 15 16
+## 36 71 72
+## 21 41 42
+## 47 93 94
+## 41 81 82
Try running one of the sample
commands above again. You will (probably!) get a different result because sample
makes a random draw each time it runs.
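When you need the same "random" result every time (for example, so that someone else can reproduce your analysis), you can fix the state of the random number generator with set.seed() before you sample. Here is a short sketch. The seed value 42 is arbitrary, and the object names are made up for illustration.

```r
# Setting the same seed before each call makes the draws repeatable.
set.seed(42)
first.draw <- sample(x = 1:100, size = 3)

set.seed(42)
second.draw <- sample(x = 1:100, size = 3)

# Both draws contain the same three values in the same order.
identical(first.draw, second.draw)
```

The exact numbers you get for a given seed can differ across R versions, so I have not printed them here.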