X-Git-Url: https://code.communitydata.science/stats_class_2020.git/blobdiff_plain/0a581181eaac0541c14bab5a28584879d1ff9f63..31486cfa5c1c5a6b42e9ad55d2d8c2bd908b28e7:/r_tutorials/w05a-R_tutorial.html?ds=inline

diff --git a/r_tutorials/w05a-R_tutorial.html b/r_tutorials/w05a-R_tutorial.html
index 6813b21..3c76d1d 100644
--- a/r_tutorials/w05a-R_tutorial.html
+++ b/r_tutorials/w05a-R_tutorial.html
@@ -1559,7 +1559,7 @@ data_url <- url("https://raw.githubusercontent.com/nytimes/covid-19-data
d <- read_csv(data_url)
d
-## # A tibble: 12,059 x 5
+## # A tibble: 12,334 x 5
## date state fips cases deaths
## <date> <chr> <chr> <dbl> <dbl>
## 1 2020-01-21 Washington 53 1 0
@@ -1572,49 +1572,49 @@ d
## 8 2020-01-25 Washington 53 1 0
## 9 2020-01-26 Arizona 04 1 0
## 10 2020-01-26 California 06 2 0
-## # … with 12,049 more rows
-For the sake of my examples, I'm planning to work with the date
, state
, cases
, and deaths
variables. Notice that by using the read_csv()
function to import the data, R already recognizes the date
column as dates. Also notice that the column names for cases and deaths don't reflect the fact that both variables are cumulative counts. Also also, notice that it looks like I need to convert the state variable to a factor. I'll start there and then get a quick sense of how much data I have for each state with a univariate table.
+For the sake of my examples, I'm planning to work with the date
, state
, cases
, and deaths
variables. Notice that by using the read_csv()
function to import the data, R already recognizes the date
column as dates. Also notice that the column names for cases and deaths don't reflect the fact that both variables are cumulative counts. Also also, notice that it looks like I will want to convert the state variable to a factor (since that's a more accurate representation of the data and it will likely make my analysis/plotting work easier later on). I'll start there and then get a quick sense of how much data I have for each state with a univariate table.
d$state <- factor(d$state)
table(d$state)
##
## Alabama Alaska Arizona
-## 209 210 256
+## 214 215 261
## Arkansas California Colorado
-## 211 257 217
+## 216 262 222
## Connecticut Delaware District of Columbia
-## 214 211 215
+## 219 216 220
## Florida Georgia Guam
-## 221 220 207
+## 226 225 212
## Hawaii Idaho Illinois
-## 216 209 258
+## 221 214 263
## Indiana Iowa Kansas
-## 216 214 215
+## 221 219 220
## Kentucky Louisiana Maine
-## 216 213 210
+## 221 218 215
## Maryland Massachusetts Michigan
-## 217 250 212
+## 222 255 217
## Minnesota Mississippi Missouri
-## 216 211 215
+## 221 216 220
## Montana Nebraska Nevada
-## 209 234 217
+## 214 239 222
## New Hampshire New Jersey New Mexico
-## 220 218 211
+## 225 223 216
## New York North Carolina North Dakota
-## 221 219 211
+## 226 224 216
## Northern Mariana Islands Ohio Oklahoma
-## 194 213 216
+## 199 218 221
## Oregon Pennsylvania Puerto Rico
-## 223 216 209
+## 228 221 214
## Rhode Island South Carolina South Dakota
-## 221 216 212
+## 226 221 217
## Tennessee Texas Utah
-## 217 239 226
+## 222 244 231
## Vermont Virgin Islands Virginia
-## 215 208 215
+## 220 213 220
## Washington West Virginia Wisconsin
-## 261 205 246
+## 266 210 251
## Wyoming
-## 211
+## 216
Two things to point out here: (1) not all of our "states" are technically states (e.g., Puerto Rico, District of Columbia, Virgin Islands, Northern Mariana Islands, Guam). I prefer to think of this as the NYT data scientist team quietly reminding us that the United States maintains a number of colonial properties without formal political representation! The second thing (2) is that not all states have the same number of observations/rows. You can probably figure out exactly why this might be the case from the documentation of the data sources and/or from thinking more carefully about the context (e.g., some states had cases much earlier in 2020 than others). Anyhow, just some things to be aware of as we move forward with our analysis.
I can start by just plotting the cumulative cases for all of the states and work towards the specific plot we want from there:
ggplot(data = d, aes(date, cases)) +
geom_line()
-
+
Notice that ggplot handles the date
variable quite well by default! It recognizes the units of time and generates axis labels in terms of months. Also notice that ggplot handles the axis labels for the cases
variable…less well. I don't know about you, but my brain doesn't parse scientific notation quickly/easily. Finally, the fact that this figure incorporates all the state-level observations as cumulative counts means that there is just a huge clutter of points/lines in this figure. It's impossible to really figure out what's going on, much less learn anything other than that the cumulative number of cases within states appears to have increased over time (thanks for nothing, ggplot).
That's already much less cluttered and much clearer. It also looks plausibly accurate (it's always good to sanity check your data visualizations as you go: weird anomalies in a graph are usually a good indicator of something weird happening in the underlying code and/or data).
Now onwards to converting my cumulative case counts into weekly case counts. When I wrote this tutorial, the first way I thought to do this involved making calls to the Tidyverse mutate
, group_by
, and summarize
verbs. After a little trial and error, I got it to work with the following code (which I'll walk through in detail below):
il_weekly_cases <- d %>%
@@ -1647,7 +1647,7 @@ table(d$state)
summarize(new_cases = sum(diff_cases, na.rm = TRUE))
il_weekly_cases
-## # A tibble: 38 x 2
+## # A tibble: 39 x 2
## weekdate new_cases
## <fct> <dbl>
## 1 2020-01-20 1
@@ -1660,7 +1660,7 @@ il_weekly_cases
## 8 2020-03-09 87
## 9 2020-03-16 953
## 10 2020-03-23 3568
-## # … with 28 more rows
+## # … with 29 more rows
There's quite a lot happening there so let's go through it verb-by-verb.
First, I filter
my cases to restrict the set to Illinois data. Then I use mutate
to create a diff_cases
variable that disaggregates the cumulative values of cases
(read the documentation for diff
to learn more about this one). Differenced values alone wouldn't produce the correct number of items (try running length(1:10)
and compare that with length(diff(1:10, 1))
to see what I mean), so I store the first value of my cases
variable and then append the differenced values after that (note that this assumes and takes advantage of the fact that the data is sorted by date. I could add a call to arrange(date)
before doing my mutation to ensure the correct ordering, but won't bother with that for now). Within the same call to mutate I also create a new variable weekdate
that collapses the dates into weeks (see the documentation for cut.Date
) and stores the resulting strings as factors (e.g., a factor where the levels correspond to a series of Mondays: "2020-01-20", "2020-01-27"…). Hopefully, so far so good?
Next, I use group_by
to aggregate everything by my weekdate
factor values. This is essentially creating conditional groupings of the data that I can then summarize in my next command.
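If the differencing and week-collapsing steps described above feel abstract, a toy example may help. The sketch below is only an illustration: the toy data frame and its numbers are made up (not the NYT data), and base R's tapply() stands in for the group_by/summarize step.

```r
# Made-up cumulative counts for ten consecutive days (2020-03-02 was a Monday)
toy <- data.frame(
  date  = as.Date("2020-03-02") + 0:9,
  cases = c(1, 1, 3, 6, 10, 15, 21, 21, 25, 30)
)

# diff() returns one fewer value than its input...
length(toy$cases)        # 10
length(diff(toy$cases))  # 9

# ...so prepend the first cumulative value to keep the lengths aligned
# (this relies on the rows already being sorted by date)
toy$diff_cases <- c(toy$cases[1], diff(toy$cases))

# cut() on a Date vector collapses days into weeks; the result is a factor
# whose levels are the first day of each week
toy$weekdate <- cut(toy$date, "week")

# Sum the daily differences within each week
weekly_new <- tapply(toy$diff_cases, toy$weekdate, sum)
weekly_new
```

The weekly sums add back up to the last cumulative value, which is a handy sanity check on this kind of disaggregation.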
il_weekly_cases %>%
ggplot(aes(weekdate, new_cases)) +
geom_line()
-
+
Hmm. Looks like I have a problem here. My first guess is that there's something funny going on with my weekdate
variable because it looks very different on the x-axis. Let's troubleshoot:
class(il_weekly_cases$weekdate)
## [1] "factor"
Whoops. Indeed, I need to convert that weekdate
variable back into an object of class "date" so that it will work with ggplot. There are a number of ways I could do this, but I'll just make a new variable by first coercing weekdate
to a character vector and then coercing that into a date using as.Date
(and remember that it is sometimes easier to read these "nested" commands from the inside-out).
il_weekly_cases$date <- as.Date(as.character(il_weekly_cases$weekdate))
il_weekly_cases
-## # A tibble: 38 x 3
+## # A tibble: 39 x 3
## weekdate new_cases date
## <fct> <dbl> <date>
## 1 2020-01-20 1 2020-01-20
@@ -1689,14 +1689,14 @@ il_weekly_cases
## 8 2020-03-09 87 2020-03-09
## 9 2020-03-16 953 2020-03-16
## 10 2020-03-23 3568 2020-03-23
-## # … with 28 more rows
+## # … with 29 more rows
That ought to work for plotting now:
plot1 <- il_weekly_cases %>%
ggplot(aes(date, new_cases)) +
geom_line()
plot1
-
+
Much better! Notice that the final week of the data appears to fall off a cliff. That's just an artifact of the way that the NYT has published the data for part of the most recent week. Once it updates, the case count probably won't tumble like that (yikes).
To start, let's see whether there might be any way I want to improve the x-axis labels. The ggplot defaults for my date
variable are pretty good already, but maybe I want to incorporate a label ("break") for each month as well as a more granular grid in the background ("minor_breaks") that shows the weeks? Also, I like the date labels along the axis as abbreviations of the month names, so I'll keep that with a call to date_labels
. Here's what all of that looks like:
plot2 <- plot1 + scale_x_date(date_labels = "%b", date_breaks = "1 month", date_minor_breaks = "1 week")
plot2
-
+
The ggplot documentation for scale_date
can give you some other examples and ideas. Also, notice how I appended the scale_date
layer to my existing plot and stored it as a new object? This can make it easier to work iteratively on a single plot, adding new layers as I go without losing existing material along the way.
Now I can fix up the y-axis labels a bit using a call to the labels
argument after I load the scales
package (why doesn't ggplot support this kind of labeling itself? I have no clue).
library(scales)
plot3 <- plot2 + scale_y_continuous(labels = comma)
plot3
-
+
Nearly done. All that's left is a title and better axis names. I'll do that with yet another layer call to labs
. The arguments here are pretty intuitive.
plot4 <- plot3 + labs(x = "Week (in 2020)", y = "New cases", title = "COVID-19 cases in Illinois")
plot4
-
+
Last, but not least, I mentioned in our class session that ggplot also has "themes" that can be useful for styling plots. One I have used for publications is the "light" theme. Here I apply that theme as…yet another layer:
plot4 + theme_light()
-
+
That's looking much better than when we started! If you wanted to export it as a standalone file (e.g., .png, .pdf, or whatever), I recommend looking at the documentation for the ggsave()
function, which is available via ggplot2. Base R can also export plots by opening a graphics device such as png() or pdf() and closing it with dev.off()
(note that base R's save() function stores R objects rather than plots), although that approach can be a bit more complicated to get comfortable with.
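For a sense of what exporting might look like, here is a minimal sketch. The toy plot, the output path, and the dimensions are all assumptions chosen for illustration; substitute your own plot object and file name.

```r
library(ggplot2)

# Toy plot standing in for the finished figure (made-up data)
toy_plot <- ggplot(data.frame(x = 1:5, y = c(2, 4, 3, 6, 5)), aes(x, y)) +
  geom_line() +
  theme_light()

# Hypothetical output path; a temporary directory keeps the example self-contained
out_file <- file.path(tempdir(), "toy_plot.png")

# ggsave() picks the graphics device from the file extension
ggsave(out_file, plot = toy_plot, width = 6, height = 4, units = "in", dpi = 150)
```

Passing the plot explicitly via the plot argument avoids relying on whichever plot happened to be displayed last.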
Thus far, we have worked mostly with "wide" format data where (nearly) every row corresponds to a single unit/observation and every column corresponds to a distinct variable (for which we usually have no more than one value attributed to any unit/observation). This often results in wider format data that is great for many things. However, it turns out that longer format data can be super helpful for a number of purposes. Producing richer, multidimensional ggplot visualizations is one of them.
Consider the format of my tidied dataframe that I used for plotting:
il_weekly_cases
-## # A tibble: 38 x 3
+## # A tibble: 39 x 3
## weekdate new_cases date
## <fct> <dbl> <date>
## 1 2020-01-20 1 2020-01-20
@@ -1744,11 +1744,11 @@ plot4
## 8 2020-03-09 87 2020-03-09
## 9 2020-03-16 953 2020-03-16
## 10 2020-03-23 3568 2020-03-23
-## # … with 28 more rows
+## # … with 29 more rows
This dataframe is in a pretty "long" format. Each row is a week and each column is a variable unique to that week (okay, I could consolidate my weekdate
and date
columns into just one, but that's not really the point here. The idea is that there's minimal redundant information in the rows and in the columns).
Our original dataframe was also pretty "long":
d
-## # A tibble: 12,059 x 5
+## # A tibble: 12,334 x 5
## date state fips cases deaths
## <date> <fct> <chr> <dbl> <dbl>
## 1 2020-01-21 Washington 53 1 0
@@ -1761,7 +1761,7 @@ plot4
## 8 2020-01-25 Washington 53 1 0
## 9 2020-01-26 Arizona 04 1 0
## 10 2020-01-26 California 06 2 0
-## # … with 12,049 more rows
+## # … with 12,324 more rows
Here we have multiple observations per state (I think I would say the units or rows correspond to "state-dates" or something like that). It's not as "long" as possible, though, because we also have multiple columns corresponding to the two variables of interest: cases
and deaths
.
For the purposes of producing a multi-state and multivariate set of plots, the most important thing I want to do is consolidate my dataset into a format where I have the following columns: date
(collapsed into weeks), state
, variable
(which will either have a value of new cases
or new deaths
), and a column for value
that will hold the corresponding state-week count for the variable in each row. If that doesn't make sense, don't worry, we'll get there soon enough.
Doing this involves a different approach to tidying up my data. I'll start by dropping the step where I filtered by state=="Illinois"
and replacing it with a group_by
step before I create my weekdate
variable. I'm also going to go ahead and drop the date
and fips
variables because they're just getting in my way.
-## # A tibble: 12,059 x 4
+## # A tibble: 12,334 x 4
## # Groups: state [55]
## state cases deaths weekdate
## <fct> <dbl> <dbl> <fct>
@@ -1786,7 +1786,7 @@ weekly
## 8 Washington 1 0 2020-01-20
## 9 Arizona 1 0 2020-01-20
## 10 California 2 0 2020-01-20
-## # … with 12,049 more rows
+## # … with 12,324 more rows
Now I've got multiple observations for each state-week spread across multiple rows (because my rows were structured around a more granular measure of time). My next move is to collapse these into a single observation for each state-week. Remember that my cases
and deaths
variables are still cumulative counts, so as I do this aggregation by week I will only need to store the maximum value for each state-week in order to calculate the number of new cases per state-week.
tidy_weekly <- weekly %>%
group_by(state, weekdate) %>%
@@ -1807,7 +1807,7 @@ weekly
)
tidy_weekly
-## # A tibble: 1,780 x 6
+## # A tibble: 1,835 x 6
## # Groups: state [55]
## state weekdate cum_cases cum_deaths new_cases new_deaths
## <fct> <date> <dbl> <dbl> <dbl> <dbl>
@@ -1821,7 +1821,7 @@ tidy_weekly
## 8 Massachusetts 2020-01-27 1 0 1 0
## 9 Washington 2020-01-27 1 0 0 0
## 10 Arizona 2020-02-03 1 0 0 0
-## # … with 1,770 more rows
+## # … with 1,825 more rows
We're much closer to our goal now!
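To make the "take the weekly maximum, then difference" idea concrete, here is a small sketch on made-up data. The toy tibble, the state names "A" and "B", and the exact sequence of verbs are my own assumptions for illustration; this is one way to do it rather than the only way.

```r
library(dplyr)

# Made-up cumulative counts: two states, two observations per week
toy <- tibble(
  state    = rep(c("A", "B"), each = 4),
  weekdate = rep(as.Date(c("2020-03-02", "2020-03-02",
                           "2020-03-09", "2020-03-09")), times = 2),
  cases    = c(1, 4, 9, 12, 0, 2, 2, 7)
)

toy_weekly <- toy %>%
  # Cumulative counts never decrease, so the weekly max is the end-of-week total
  group_by(state, weekdate) %>%
  summarize(cum_cases = max(cases), .groups = "drop") %>%
  # Difference the weekly totals within each state, in date order
  group_by(state) %>%
  arrange(weekdate, .by_group = TRUE) %>%
  mutate(new_cases = c(cum_cases[1], diff(cum_cases))) %>%
  ungroup()

toy_weekly
```

Grouping by state before differencing matters here: without it, the first week of state "B" would be differenced against the last week of state "A".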
I can go ahead and drop the cumulative cases and deaths columns with a call to select
in my next step. Then the big next (and nearly final) step is to "pivot" the data to organize the new_cases
and new_deaths
measures in the way I described above. To manage this, I'll use the pivot_longer()
function (part of the tidyr
package from the tidyverse):
long_weekly <- tidy_weekly %>%
@@ -1833,7 +1833,7 @@ tidy_weekly
)
long_weekly
-## # A tibble: 3,560 x 4
+## # A tibble: 3,670 x 4
## # Groups: state [55]
## state weekdate variable value
## <fct> <date> <chr> <dbl>
@@ -1847,7 +1847,7 @@ long_weekly
## 8 Washington 2020-01-20 new_deaths 0
## 9 Arizona 2020-01-27 new_cases 0
## 10 Arizona 2020-01-27 new_deaths 0
-## # … with 3,550 more rows
+## # … with 3,660 more rows
Can you see what that did? I now have two rows of data for every state-week. One row contains a value for new_cases
and one contains a value for new_deaths
. Both of those variables have been "pivoted" into a single variable
column and their corresponding values recorded in another new column. Note that this makes our dataframe a little longer even though it does not technically reduce the "width" of this particular dataset (because we've taken two columns and pivoted them to create…two different columns). However, consider that we could accommodate as many additional numerical variables and values as we might like in this manner and you can start to see how this pivoting step could result in much longer data (the length becomes a function of the number of units in your dataset and the variables you include in your pivoting step).
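Here is a minimal sketch of that pivot on a toy data frame. The data are made up and the states "A" and "B" are hypothetical; the column names mirror the output shown above.

```r
library(tidyr)

# Made-up weekly counts in the "wider" layout: one column per measure
toy_wide <- data.frame(
  state      = c("A", "A", "B"),
  weekdate   = as.Date(c("2020-03-02", "2020-03-09", "2020-03-02")),
  new_cases  = c(4, 8, 2),
  new_deaths = c(0, 1, 0)
)

# Stack the two measure columns: the old column names go into `variable`
# and the old cell values go into `value`
toy_long <- pivot_longer(
  toy_wide,
  cols      = c(new_cases, new_deaths),
  names_to  = "variable",
  values_to = "value"
)

toy_long
```

Three state-weeks with two measures each become six rows, which is the doubling in row count visible in the tibble sizes above.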
Before we move forward I'm also going to clean up the values of variable
. This turns out to be helpful later on when we're plotting, but makes more sense to implement here before I start creating any plot layers.
long_weekly <- long_weekly %>%
@@ -1861,13 +1861,13 @@ long_weekly
) %>%
ggplot(aes(weekdate, value)) +
geom_line()
-
+
With our "longer" data format, we can plot Illinois cases against deaths from the same tibble by incorporating a color=variable
argument:
long_weekly %>%
filter(state == "Illinois") %>%
ggplot(aes(weekdate, value, color = variable)) +
geom_line()
-
+
Unfortunately, that plot isn't so great because the death counts are dwarfed by the case counts (thank goodness!).
Now let's compare Illinois case counts against some of the neighboring states in the upper midwest:
upper_midwest <- c("Illinois", "Michigan", "Wisconsin", "Iowa", "Minnesota")
@@ -1876,7 +1876,7 @@ long_weekly %>%
filter(state %in% upper_midwest & variable == "new cases") %>%
ggplot(aes(weekdate, value, color = state)) +
geom_line()
-
+
Notice that I use the %in%
operator to filter for the values of the state
vector that are "in" the upper_midwest
vector (see help("%in%")
for more).
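A quick standalone illustration of how the operator behaves (the states vector here is made up for the example):

```r
upper_midwest <- c("Illinois", "Michigan", "Wisconsin", "Iowa", "Minnesota")

# A made-up vector of values to test for membership
states <- c("Illinois", "Texas", "Iowa", "Ohio")

# %in% returns one TRUE/FALSE per element of the left-hand vector
in_midwest <- states %in% upper_midwest
in_midwest          # TRUE FALSE TRUE FALSE

# That logical vector can then be used to subset, which is what filter() does
states[in_midwest]  # "Illinois" "Iowa"
```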
Also notice that we now have ourselves a multivariate time series!
So now how about finding some way to also incorporate those death counts? If I just add them to this same plot we'll run into the same issue we did with the Illinois data because the death counts look tiny plotted on the same scale as the case counts. A good solution in such a situation is to create a second plot for weekly deaths that we can display together with this weekly cases plot that uses a differently scaled y-axis. The ggplot way to do this involves another type of layer called "facets." Here's an example that creates a faceted "grid" (not much of a grid since there are only two variables or categories we're using to do the faceting) of weekly case counts and deaths for the same five states.
@@ -1887,10 +1887,10 @@ long_weekly %>%
facet_grid(rows = vars(variable), scales = "free_y")
midwest_plot
-
+
Nice! Now we can clean up some of the other elements we worked on with the original plot (axes, title, etc.). I'll bake that into a single chunk below.
midwest_plot +
scale_x_date(date_labels = "%b", date_breaks = "1 month", date_minor_breaks = "1 week") +
scale_y_continuous(labels = comma) +
labs(x = "Week (in 2020)", y = "", title = "COVID-19 cases in the Upper Midwest") +
theme_light()
-
+
That's it! Mission accomplished. We've got ourselves a nice concise visualization of weekly COVID-19 cases and deaths across five upper midwest states over nearly 8 months of the pandemic.