From: aaronshaw
Date: Thu, 8 Oct 2020 15:26:16 +0000 (-0500)
Subject: typos and text improvements.
X-Git-Url: https://code.communitydata.science/stats_class_2020.git/commitdiff_plain/0a581181eaac0541c14bab5a28584879d1ff9f63?ds=sidebyside;hp=4e1671977b9073cc4c6d46ab614a3767008ad6c6

typos and text improvements.
---

diff --git a/r_tutorials/w05a-R_tutorial.html b/r_tutorials/w05a-R_tutorial.html
index 137badd..6813b21 100644
--- a/r_tutorials/w05a-R_tutorial.html
+++ b/r_tutorials/w05a-R_tutorial.html
@@ -1549,8 +1549,8 @@ MTS 525

1 Getting started (more better plots)

-

This is a supplement to the Week 5 R tutorial focused on elaborating some examples of time series plots and more polished plots using ggplot2. I’ll work some data on state-level COVID-19 in the United States published by The New York Times (NYT). You can access the data an details about the sources, measurement, and different datasets available via the NYT github repository.

-

To start, I’ll load up the tidyverse library and also attach the lubridate package to help handle dates and times. Then I’ll import the “raw csv” from the web, and take a look at the dataset:

+

This is a supplement to the Week 5 R tutorial focused on elaborating some examples of time series plots and more polished plots using ggplot2. I’ll work with some data on state-level COVID-19 in the United States published by The New York Times (NYT). You can access the data as well as details about the sources, measurement, and related available datasets via the NYT GitHub repository.

+

To start, I’ll load up the tidyverse library and also attach the lubridate package, which can help to handle dates and times. Then I’ll import the “raw csv” of my dataset from the web, and take a look at it:

library(tidyverse)
 library(lubridate)
 
@@ -1559,7 +1559,7 @@ data_url <- url("https://raw.githubusercontent.com/nytimes/covid-19-data
 d <- read_csv(data_url)
 
 d
-
## # A tibble: 12,004 x 5
+
## # A tibble: 12,059 x 5
 ##    date       state      fips  cases deaths
 ##    <date>     <chr>      <chr> <dbl>  <dbl>
 ##  1 2020-01-21 Washington 53        1      0
@@ -1572,69 +1572,71 @@ d
 ##  8 2020-01-25 Washington 53        1      0
 ##  9 2020-01-26 Arizona    04        1      0
 ## 10 2020-01-26 California 06        2      0
-## # … with 11,994 more rows
-

For the sake of my examples, I’m planning to work with the date, state, cases, and deaths variables. Notice that by using the read_csv() function to import the data, R already recognizes the date column as dates. It looks like I need to convert the state variable to a factor, however. After I do that I can get a quick sense of how much data I have for each state with a univariate table that just counts the number of observations (rows) for each value of state.

+## # … with 12,049 more rows
+

For the sake of my examples, I’m planning to work with the date, state, cases, and deaths variables. Notice that by using the read_csv() function to import the data, R already recognizes the date column as dates. Also notice that the column names for cases and deaths don’t reflect the fact that both variables are cumulative counts. Also also, notice that it looks like I need to convert the state variable to a factor. I’ll start there and then get a quick sense of how much data I have for each state with a univariate table.

d$state <- factor(d$state)
 table(d$state)
## 
 ##                  Alabama                   Alaska                  Arizona 
-##                      208                      209                      255 
+##                      209                      210                      256 
 ##                 Arkansas               California                 Colorado 
-##                      210                      256                      216 
+##                      211                      257                      217 
 ##              Connecticut                 Delaware     District of Columbia 
-##                      213                      210                      214 
+##                      214                      211                      215 
 ##                  Florida                  Georgia                     Guam 
-##                      220                      219                      206 
+##                      221                      220                      207 
 ##                   Hawaii                    Idaho                 Illinois 
-##                      215                      208                      257 
+##                      216                      209                      258 
 ##                  Indiana                     Iowa                   Kansas 
-##                      215                      213                      214 
+##                      216                      214                      215 
 ##                 Kentucky                Louisiana                    Maine 
-##                      215                      212                      209 
+##                      216                      213                      210 
 ##                 Maryland            Massachusetts                 Michigan 
-##                      216                      249                      211 
+##                      217                      250                      212 
 ##                Minnesota              Mississippi                 Missouri 
-##                      215                      210                      214 
+##                      216                      211                      215 
 ##                  Montana                 Nebraska                   Nevada 
-##                      208                      233                      216 
+##                      209                      234                      217 
 ##            New Hampshire               New Jersey               New Mexico 
-##                      219                      217                      210 
+##                      220                      218                      211 
 ##                 New York           North Carolina             North Dakota 
-##                      220                      218                      210 
+##                      221                      219                      211 
 ## Northern Mariana Islands                     Ohio                 Oklahoma 
-##                      193                      212                      215 
+##                      194                      213                      216 
 ##                   Oregon             Pennsylvania              Puerto Rico 
-##                      222                      215                      208 
+##                      223                      216                      209 
 ##             Rhode Island           South Carolina             South Dakota 
-##                      220                      215                      211 
+##                      221                      216                      212 
 ##                Tennessee                    Texas                     Utah 
-##                      216                      238                      225 
+##                      217                      239                      226 
 ##                  Vermont           Virgin Islands                 Virginia 
-##                      214                      207                      214 
+##                      215                      208                      215 
 ##               Washington            West Virginia                Wisconsin 
-##                      260                      204                      245 
+##                      261                      205                      246 
 ##                  Wyoming 
-##                      210
+##                      211
+

Two things to point out here: (1) not all of our “states” are technically states (e.g., Puerto Rico, District of Columbia, Virgin Islands, Northern Mariana Islands, Guam). I prefer to think of this as the NYT data scientist team quietly reminding us that the United States maintains a number of colonial properties without formal political representation! The second thing (2) is that not all states have the same number of observations/rows. You can probably figure out exactly why this might be the case from the documentation of the data sources and/or from thinking more carefully about the context (e.g., some states had cases much earlier in 2020 than others). Anyhow, just some things to be aware of as we move forward with our analysis.

2 Plotting a univariate time series

-

I recommend using geom_path() to create univariate time series plots. Specifically, I’ll call geom_line(), which is a specialized version of geom_path() that connects observations in order according to the values of variable that is mapped to the x-axis. By convention, a univariate time series maps dates to the x-axis, so this will just plot a line connecting the dots over time.

-

For my first example, I want to build up a plot of weekly case counts in Illinois. I can start off by just plotting the cumulative cases for all of the states and work my way towards the specific plot I want from there:

+

A univariate time series is just a fancy term for a plot of a single variable for which you have repeated observations collected over time. I recommend using geom_path() (that’s a hyperlink to the documentation) to create univariate time series plots. Specifically, I’ll call geom_line(), which is a specialized (masked) version of geom_path() that connects observations in order according to the values of the variable that is mapped to the x-axis. By convention, a univariate time series maps dates to the x-axis, so this will just plot a line connecting my y-values over time.

+

For a univariate example, let’s build a plot of weekly case counts in Illinois.

+

I can start by just plotting the cumulative cases for all of the states and work towards the specific plot we want from there:

ggplot(data = d, aes(date, cases)) +
   geom_line()
-

-

Notice that ggplot handles the date variable quite well by default! It recognizes the units of time and generates axis labels in terms of months. Also notice that ggplot handles the axis labels for the cases variable…less well. I don’t know about you, but my brain doesn’t parse scientific notation quickly/easily.

-
-
-

3 Tidying timeseries data for better plots

-

Okay, let’s get to work cleaning all this up. At this point, my next steps are to (1) restrict the data to the Illinois cases; (2) reorganize the cumulative daily case counts into weekly counts; and (3) plot it again with better axis labels and a nice title.

+

+

Notice that ggplot handles the date variable quite well by default! It recognizes the units of time and generates axis labels in terms of months. Also notice that ggplot handles the axis labels for the cases variable…less well. I don’t know about you, but my brain doesn’t parse scientific notation quickly/easily. Finally, the fact that this figure incorporates all the state-level observations as cumulative counts means that there is just a huge clutter of points/lines in this figure. It’s impossible to really figure out what’s going on, much less learn anything other than that the cumulative number of cases within states appears to have increased over time (thanks for nothing, ggplot).

+
+

2.1 Tidying some timeseries data

+

Okay, let’s get to work cleaning this up. At this point, my next steps are to (1) restrict the data to the Illinois cases; (2) reorganize the cumulative daily case counts into weekly counts; and (3) plot it again with better axis labels and a nice title.

I can restrict the data to Illinois in a few ways. Since I’m using ggplot, I’ll work with Tidyverse “pipes” (%>%) and “verbs” (in this case, filter):

d %>%
   filter(state == "Illinois") %>%
   ggplot(aes(date, cases)) +
   geom_line()
-

-

That’s already much less cluttered. Inserting a call to the Tidyverse mutate, group_by, and summarize verbs can help me generate the weekly counts I’m looking for. Here’s the code to produce a new object. I’ll walk through it below:

+

+

That’s already much less cluttered and much clearer. It also looks plausibly accurate (it’s always good to sanity check your data visualizations as you go, since weird anomalies in a graph are usually a good indicator of something weird happening in the underlying code and/or data).

+

Now onwards to converting my cumulative case counts into weekly case counts. When I wrote this tutorial, the first way I thought to do this involved making calls to the Tidyverse mutate, group_by, and summarize verbs. After a little trial and error, I got it to work with the following code (which I’ll walk through in detail below):

il_weekly_cases <- d %>%
   filter(state == "Illinois") %>%
   mutate(
@@ -1659,14 +1661,19 @@ il_weekly_cases
 ##  9 2020-03-16       953
 ## 10 2020-03-23      3568
 ## # … with 28 more rows
-

There’s quite a lot happening there. I’ll go through it verb-by-verb.

-

First, I use mutate to create a diff_cases variable that disaggregates the cumulative values of cases (read the documentation for diff to learn more about this one). Differenced values alone wouldn’t produce the same number of items (try running length(1:10) and compare that with length(diff(1:10, 1)) to see what I mean), so I stores the first value of my cases variable and then append the differenced values after that. Within the same call to mutate I also create a new variable weekdate that collapses the dates into weeks (see the documentation for cut.Date) and stores the resulting strings as factors (e.g., a factor where the levels correspond to a series of Mondays: “2020-01-20”, “2020-01-27”…). Hopefully, so far so good?

-

Next, I use group_by to aggregate everything by my weekdate factor values.

-

Finally I use summarize to reshape my data and collapse everything into weekly counts of new cases (notice that I use sum inside the summarize call to add up the case counts within the grouping variable). Okay, let’s see about plotting this now:

-

Hmm. looks like I have a problem with my dates. Let’s troubleshoot this:

+

There’s quite a lot happening there so let’s go through it verb-by-verb.

+

First, I filter my cases to restrict the set to Illinois data. Then I use mutate to create a diff_cases variable that disaggregates the cumulative values of cases (read the documentation for diff to learn more about this one). Differenced values alone wouldn’t produce the correct number of items (try running length(1:10) and compare that with length(diff(1:10, 1)) to see what I mean), so I store the first value of my cases variable and then append the differenced values after that (Note that this assumes and takes advantage of the fact that the data is sorted by date. I could add a call to arrange(-desc()) before doing my mutation to ensure the correct ordering, but won’t bother with that for now). Within the same call to mutate I also create a new variable weekdate that collapses the dates into weeks (see the documentation for cut.Date) and stores the resulting strings as factors (e.g., a factor where the levels correspond to a series of Mondays: “2020-01-20”, “2020-01-27”…). Hopefully, so far so good?
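To make the differencing step concrete, here is a minimal sketch in base R. The cum_cases vector is hypothetical (made up for illustration, not taken from the NYT data) and is assumed to already be sorted by date.

```r
# Hypothetical cumulative counts for five consecutive days (not real data)
cum_cases <- c(1, 1, 3, 7, 12)

# diff() returns one fewer element than its input
n_diffs <- length(diff(cum_cases))

# Prepend the first cumulative value so the result lines up with the rows
new_cases <- c(cum_cases[1], diff(cum_cases))

# The daily counts should add back up to the final cumulative value
total_matches <- sum(new_cases) == tail(cum_cases, 1)
```

If the input were not sorted by date, the differences could come out negative, which is one quick way to spot an ordering problem.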

+

Next, I use group_by to aggregate everything by my weekdate factor values. This is essentially creating conditional groupings of the data that I can then summarize in my next command.

+

Finally I use summarize to reshape my data and collapse everything into weekly counts of new cases (notice that I use sum inside the summarize call to add up the case counts within the grouping variable). The result is a brand new two-column tibble consisting of weekdates and weekly counts of new cases. Excellent!

+

Okay, let’s see about plotting this now:

+
il_weekly_cases %>%
+  ggplot(aes(weekdate, new_cases)) +
+  geom_line()
+

+

Hmm. Looks like I have a problem here. My first guess is that there’s something funny going on with my weekdate variable because it looks very different on the x-axis. Let’s troubleshoot:

class(il_weekly_cases$weekdate)
## [1] "factor"
-

Whoops. It looks like I need to convert that weekdate variable into an object of class “date” so that it will work with ggplot. There are a number of ways I could do this, but I’ll just make a new variable by first converting weekdate to a character vector and then converting that into a date using as.Date (and remember that it is sometimes easier to read these “nested” commands from the inside-out).

+

Whoops. Indeed, I need to convert that weekdate variable back into an object of class “date” so that it will work with ggplot. There are a number of ways I could do this, but I’ll just make a new variable by first coercing weekdate to a character vector and then coercing that into a date using as.Date (and remember that it is sometimes easier to read these “nested” commands from the inside-out).
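Here is a small sketch of that round trip in base R, assuming a hypothetical vector of four dates (not from the dataset). It shows that cutting dates by week produces a factor labeled by each week’s Monday and that the nested coercion recovers Date values.

```r
# Hypothetical dates: a Wednesday, a Friday, then the following Monday and Tuesday
dates <- as.Date(c("2020-03-18", "2020-03-20", "2020-03-23", "2020-03-24"))

# cut() on Date objects with "week" returns a factor; weeks start on Monday by default
weeks <- cut(dates, "week")
weeks_class <- class(weeks)
week_levels <- levels(weeks)

# Read from the inside out: factor -> character -> Date
week_dates <- as.Date(as.character(weeks))
week_dates_class <- class(week_dates)
```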

il_weekly_cases$date <- as.Date(as.character((il_weekly_cases$weekdate)))
 il_weekly_cases
## # A tibble: 38 x 3
@@ -1683,41 +1690,45 @@ il_weekly_cases
 ##  9 2020-03-16       953 2020-03-16
 ## 10 2020-03-23      3568 2020-03-23
 ## # … with 28 more rows
-

That ought to work now:

+

That ought to work for plotting now:

plot1 <- il_weekly_cases %>%
   ggplot(aes(date, new_cases)) +
   geom_line()
 
 plot1
-

-

Much better! Notice that the final week of the data appears to fall off a cliff. That’s just an artifact of the way that the NYT has published the data for part of the most recent week. Once it updates, the case count probably won’t drop like that (yikes). Anyhow, onwards to cleaning things up and adding a title.

+

+

Much better! Notice that the final week of the data appears to fall off a cliff. That’s just an artifact of the way that the NYT has published the data for part of the most recent week. Once it updates, the case count probably won’t tumble like that (yikes).

-
-

4 Working on ggplot axis labels, titles, and scales

-

As I mentioned briefly in class ggplot2 treats labels, titles, and scales as “layers” within it’s “grammar of graphics” (and yes, I’m rolling my eyes as I type those scare-quotes). For the purposes of our example here I’m going to use scale_date to work with the x-axis, scale_continuous to work with the y-axis, and labs to clean up the title and axis labels.

-

For starters, let’s see whether there might be any way I want to improve the axis labels. The ggplot defaults for my date variable are pretty good already, but maybe I want to incorporate a label/break for each month as well as a more granular grid in the background that shows the weeks? Here’s what all of that looks like:

+
+

2.2 Working on ggplot axis labels, titles, and scales

+

Now we can style the plot. As I mentioned briefly in class, ggplot2 treats labels, titles, and scales as “layers” within its “grammar of graphics” (that sound you hear is me rolling my eyes as I type those scare-quotes). For the purposes of our example here I’m going to use scale_x_date to work with the x-axis, scale_y_continuous to work with the y-axis, and labs to clean up the title and axis labels. Each of those has documentation and should appear on the ggplot2 cheatsheet available via RStudio/Tidyverse.

+

To start, let’s see whether there might be any way I want to improve the x-axis labels. The ggplot defaults for my date variable are pretty good already, but maybe I want to incorporate a label (“break”) for each month as well as a more granular grid in the background (“minor_breaks”) that shows the weeks? Also, I like the date labels along the axis as abbreviations of the month names, so I’ll keep that with a call to date_labels. Here’s what all of that looks like:

plot2 <- plot1 + scale_x_date(date_labels = "%b", date_breaks = "1 month", date_minor_breaks = "1 week")
 plot2
-

-

The ggplot documentation for scale_date can give you some other examples and ideas. Also, notice how I appended the scale_date layer to my existing plot and stored it as a new object? This can make it easier to work iteratively without losing any of my earlier layers along the way.

-

Now I can fix up the y-axis labels a bit using a call to the labels argument after I load the scales package.

+

+

The ggplot documentation for scale_x_date can give you some other examples and ideas. Also, notice how I appended the scale_x_date layer to my existing plot and stored it as a new object? This can make it easier to work iteratively on a single plot, adding new layers as I go without losing existing material along the way.

+

Now I can fix up the y-axis labels a bit by setting the labels argument of scale_y_continuous after I load the scales package (why doesn’t ggplot support this kind of labeling itself? I have no clue).

library(scales)
 plot3 <- plot2 + scale_y_continuous(labels = comma)
 plot3
-

-

Nearly done. All that’s left is a title and better axis names. I’ll do that with yet another layer.

+

+

Nearly done. All that’s left is a title and better axis names. I’ll do that with yet another layer call to labs. The arguments here are pretty intuitive.

plot4 <- plot3 + labs(x = "Week (in 2020)", y = "New cases", title = "COVID-19 cases in Illinois")
 plot4
-

-

Last, but not least, I mentioned in our class session that ggplot also has “themes” that can be useful for styling plots. One I have used for publications is the “light” theme. Here’s how to apply that:

+

+

Last, but not least, I mentioned in our class session that ggplot also has “themes” that can be useful for styling plots. One I have used for publications is the “light” theme. Here I apply that theme as…yet another layer:

plot4 + theme_light()
-

+

That’s looking much better than when we started! If you wanted to export it as a standalone file (e.g., .png, .pdf, or whatever), I recommend looking at the documentation for the ggsave() function, which is available via ggplot2. Base R also has a save() function that you can work with, but it stores R objects (such as the plot object itself) rather than rendering an image file, so it can be a bit more complicated to get comfortable with.
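As a rough sketch of what exporting could look like, the example below builds a small plot from hypothetical data and writes it to a temporary PNG with ggsave(). The toy data frame, the file name, and the width, height, and dpi values are all assumptions chosen for illustration.

```r
library(ggplot2)

# Hypothetical data, only here so that there is something to save
toy <- data.frame(x = 1:5, y = c(2, 4, 3, 6, 5))

p <- ggplot(toy, aes(x, y)) +
  geom_line() +
  theme_light()

# Write to a temporary directory; the file extension determines the output format
out_file <- file.path(tempdir(), "toy_plot.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 300)
```

Passing the plot object explicitly avoids depending on whichever plot happened to be displayed last.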

-
-

5 Long versus wide data (and why long data is often helpful)

-

So what if you wanted to plot a multivariate time series (e.g., the same plot for more than one state and/or for more than one measure)? As always, you have a number of options, but the most effective way to achieve this with ggplot involves learning to work with “long” format data.

-

Thus far, we have worked mostly with “wide” format data where (nearly) every row corresponds to a single unit/observation and every column corresponds to a variable (for which we usually have no more than one value attributed to any unit/observation). Wide format data is great for many things, but it turns out that learning to work with long format data can be super helpful for a number of purposes. Producing richer, multidimensional ggplot visualizations is one of them.

+
+
+

3 Multivariate and multidimensional time series plots

+

Okay, that’s a lovely univariate time series plot. Now let’s make this more sophisticated and interesting by incorporating more data, more dimensions, and more variables. In order to do that, I want to start with a little detour into data structures. Try to stay with me—this turns out to be super important for working more efficiently with tools like ggplot as well as learning to manage more complex statistical analysis strategies (that we won’t really cover in the course, but so be it).

+
+

3.1 Long versus wide data (and why long data is often helpful)

+

So now you want to plot a multivariate time series (e.g., the same plot for more than one state and/or for more than one measure). As always, you have a number of options, but the most effective way to achieve this with ggplot involves learning to work with “longer” data.

+

Thus far, we have worked mostly with “wide” format data where (nearly) every row corresponds to a single unit/observation and every column corresponds to a distinct variable (for which we usually have no more than one value attributed to any unit/observation). This often results in wider format data that is great for many things. However, it turns out that longer format data can be super helpful for a number of purposes. Producing richer, multidimensional ggplot visualizations is one of them.

Consider the format of my tidied dataframe that I used for plotting:

il_weekly_cases
## # A tibble: 38 x 3
@@ -1734,10 +1745,10 @@ plot4
 ##  9 2020-03-16       953 2020-03-16
 ## 10 2020-03-23      3568 2020-03-23
 ## # … with 28 more rows
-

This dataframe is in a “wide” format. Each row is a week and each column is a variable unique to that week.

-

Our original dataframe was a bit “longer”:

+

This dataframe is in a pretty “long” format. Each row is a week and each column is a variable unique to that week (okay, I could consolidate my weekdate and date columns into just one, but that’s not really the point here. The idea is that there’s minimal redundant information in the rows and in the columns).

+

Our original dataframe was also pretty “long”:

d
-
## # A tibble: 12,004 x 5
+
## # A tibble: 12,059 x 5
 ##    date       state      fips  cases deaths
 ##    <date>     <fct>      <chr> <dbl>  <dbl>
 ##  1 2020-01-21 Washington 53        1      0
@@ -1750,8 +1761,10 @@ plot4
 ##  8 2020-01-25 Washington 53        1      0
 ##  9 2020-01-26 Arizona    04        1      0
 ## 10 2020-01-26 California 06        2      0
-## # … with 11,994 more rows
-

We see multiple observations per state (I think I would say the units or rows correspond to “state-dates” or something like that). It’s not completely “long” however, because we also have multiple columns corresponding to the two variables of interest: cases and deaths. The point I want to make is that there are a number of ways we can make this data “longer.” For the purposes of producing a multi-state plot like the one above, the most important of these is going to involve dropping the step where I filtered by state=="Illinois" and replacing by a group_by step before I create my weekdate variable. I’m also going to go ahead and drop the date and fips variables because they’re just getting in my way at this point. I’ll start there

+## # … with 12,049 more rows +

Here we have multiple observations per state (I think I would say the units or rows correspond to “state-dates” or something like that). It’s not as “long” as possible, though, because we also have multiple columns corresponding to the two variables of interest: cases and deaths.

+

For the purposes of producing a multi-state and multivariate set of plots, the most important thing I want to do is consolidate my dataset into a format where I have the following columns: date (collapsed into weeks), state, variable (which will either have a value of new cases or new deaths), and a column for value that will hold the corresponding state-week count for the variable in each row. If that doesn’t make sense, don’t worry, we’ll get there soon enough.

+

Doing this involves a different approach to tidying up my data. I’ll start by dropping the step where I filtered by state=="Illinois" and replacing it with a group_by step before I create my weekdate variable. I’m also going to go ahead and drop the date and fips variables because they’re just getting in my way.

weekly <- d %>%
   group_by(state) %>%
   mutate(
@@ -1759,7 +1772,7 @@ plot4
   ) %>%
   select(state, cases, deaths, weekdate)
 
 weekly
-
## # A tibble: 12,004 x 4
+
## # A tibble: 12,059 x 4
 ## # Groups:   state [55]
 ##    state      cases deaths weekdate  
 ##    <fct>      <dbl>  <dbl> <fct>     
@@ -1773,17 +1786,19 @@ weekly
 ##  8 Washington     1      0 2020-01-20
 ##  9 Arizona        1      0 2020-01-20
 ## 10 California     2      0 2020-01-20
-## # … with 11,994 more rows
-

I’m getting somewhere with this, I promise. One of the principles of “tidy” data is to make it so that every variable has a column, every observation has a row, and every value has a cell. Right now, I’ve got multiple observations for each state-week spread across multiple rows. Remember that my cases and deaths variables are actually cumulative counts, so I really only need to store the maximum value for each state-week in order to calculate the new cases per state-week. Let’s see what to do about that:

+## # … with 12,049 more rows
+

Now I’ve got multiple observations for each state-week spread across multiple rows (because my rows were structured around a more granular measure of time). My next move is to collapse these into a single observation for each state-week. Remember that my cases and deaths variables are still cumulative counts, so as I do this aggregation by week I will only need to store the maximum value for each state-week in order to calculate the number of new cases per state-week.

tidy_weekly <- weekly %>%
   group_by(state, weekdate) %>%
   summarize(
     cum_cases = max(cases, na.rm = T),
     cum_deaths = max(deaths, na.rm = T)
   )
-
tidy_weekly$weekdate <- as.Date(as.character(tidy_weekly$weekdate))
-
-tidy_weekly <- tidy_weekly %>%
+

Notice that the call to group_by groups by multiple variables. The order here matters! If I reversed it to read group_by(weekdate, state), the rows would be sorted by week first and the summarized result would remain grouped by weekdate instead of state. With the correct ordering, I have things bundled up into state-week sub-groups and then I move on to calculate the maximum value of cumulative cases within each bundle.
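To see the grouped maximum in isolation, here is a minimal sketch using a hypothetical toy tibble (two made-up states, two weeks, two daily rows per week). The .groups = "drop" argument is my addition to return an ungrouped result and assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical cumulative counts: two states, two weeks, two days per week
toy <- tibble(
  state = rep(c("A", "B"), each = 4),
  weekdate = rep(c("2020-03-16", "2020-03-16", "2020-03-23", "2020-03-23"), 2),
  cases = c(1, 3, 5, 9, 0, 2, 2, 6)
)

# Because the counts are cumulative, the weekly maximum is the end-of-week total
toy_weekly <- toy %>%
  group_by(state, weekdate) %>%
  summarize(cum_cases = max(cases, na.rm = TRUE), .groups = "drop")

toy_weekly
```

Each state-week collapses to a single row, which is the shape needed before differencing the weekly totals.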

+

Next, I can fix up my weekdate variable again so that it is a Date object.

+
tidy_weekly$weekdate <- as.Date(as.character(tidy_weekly$weekdate))
+

This will allow me to do some sorting within my state-week bundles to ensure things are in the proper order before I convert my weekly cumulative case count into weekly new case counts.

+
tidy_weekly <- tidy_weekly %>%
   group_by(state) %>%
   arrange(-desc(weekdate)) %>%
   mutate(
@@ -1807,7 +1822,8 @@ tidy_weekly
 ##  9 Washington 2020-01-27         1          0         0          0
 ## 10 Arizona    2020-02-03         1          0         0          0
 ## # … with 1,770 more rows
-

This is headed in the right direction. For some purposes, though, it’s still not quite “long” enough For starters, I can drop the cumulative cases and deaths columns. The other thing I can do is “pivot” the data to organize the new_cases and new_deaths measures a little differently. To manage this, I’ll use the pivot_longer() function (part of the tidyr package from the tidyverse). I will also go ahead and coerce my weekdate into a Date object again:

+

We’re much closer to our goal now!

+

I can go ahead and drop the cumulative cases and deaths columns with a call to select in my next step. Then the big next (and nearly final) step is to “pivot” the data to organize the new_cases and new_deaths measures in the way I described above. To manage this, I’ll use the pivot_longer() function (part of the tidyr package from the tidyverse):

long_weekly <- tidy_weekly %>%
   select(state, weekdate, new_cases, new_deaths) %>%
   pivot_longer(
@@ -1832,37 +1848,38 @@ long_weekly
 ##  9 Arizona 2020-01-27 new_cases      0
 ## 10 Arizona 2020-01-27 new_deaths     0
 ## # … with 3,550 more rows
-

Can you see what that did? I now have two rows of data for every state-week. One that contains a value for new_cases and one that contains a value for new_deaths. Both of those variables have been “pivoted” into a single variable column.

-

Before we move forward I’m going to clean up the values of variable.

+

Can you see what that did? I now have two rows of data for every state-week. One row contains a value for new_cases and one contains a value for new_deaths. Both of those variables have been “pivoted” into a single variable column and their corresponding values recorded in another new column. Note that this makes our dataframe a little longer even though it does not technically reduce the “width” of this particular dataset (because we’ve taken two columns and pivoted them to create…two different columns). However, consider that we could accommodate as many additional numerical variables and values as we might like in this manner and you can start to see how this pivoting step could result in much longer data (the length becomes a function of the number of units in your dataset and the variables you include in your pivoting step).
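A tiny sketch of the same kind of pivot may help. The wide data frame below is hypothetical, and the names_to and values_to choices are assumptions picked to match the variable and value columns shown in the output above.

```r
library(tidyr)

# Hypothetical one-week snapshot for two made-up states
wide <- data.frame(
  state = c("A", "B"),
  new_cases = c(10, 20),
  new_deaths = c(1, 2)
)

# Stack the two measure columns into name/value pairs
long <- pivot_longer(
  wide,
  cols = c(new_cases, new_deaths),
  names_to = "variable",
  values_to = "value"
)

long
```

Two rows and two measures become four rows, with one row per state-measure pair.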

+

Before we move forward I’m also going to clean up the values of variable. This turns out to be helpful later on when we’re plotting, but it makes more sense to implement here before I start creating any plot layers.

long_weekly <- long_weekly %>%
   mutate(
     variable = recode(variable, new_cases = "new cases", new_deaths = "new deaths")
   )
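If the behavior of recode() is unfamiliar, it can help to try it on a bare character vector first. This is just a sketch with a made-up vector (raw_labels and the "other" value are hypothetical); it shows that values without a matching replacement pass through unchanged when the input is a character vector:

```r
library(dplyr)

# Hypothetical vector of labels, including one value with no replacement
raw_labels <- c("new_cases", "new_deaths", "other")

# Named arguments map old values (names) to new values (strings)
clean_labels <- recode(raw_labels, new_cases = "new cases", new_deaths = "new deaths")

clean_labels
```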
-

Okay, prepared with my tidy_weekly and my long_weekly tibbles, I’m now ready to generate some more interesting multidimensional plots. Let’s start with the same sort of time series of new cases we made for Illinois before so we can see how to replicate that with this new data structure:

+

Okay, prepared with my long_weekly tibble, I’m now ready to generate some more interesting and multidimensional plots. Let’s start with the same univariate time series of new cases we made for Illinois before so we can see how to replicate that figure with this new data structure:

long_weekly %>%
   filter(
     state == "Illinois" & variable == "new cases"
   ) %>%
   ggplot(aes(weekdate, value)) +
   geom_line()
-

-

Now we can easily plot Illinois cases against deaths from the same tibble:

+

+

With our “longer” data format, we can plot Illinois cases against deaths from the same tibble by incorporating a color = variable argument:

long_weekly %>%
   filter(state == "Illinois") %>%
   ggplot(aes(weekdate, value, color = variable)) +
   geom_line()
-

-

That plot isn’t so great because the death counts are dwarfed by the case counts. Thank goodness!

-

Now let’s compare Illinois case counts against some of its neighbors in the upper midwest:

+

+

Unfortunately, that plot isn’t so great because the death counts are dwarfed by the case counts (thank goodness!).

+

Now let’s compare Illinois case counts against some of the neighboring states in the upper midwest:

upper_midwest <- c("Illinois", "Michigan", "Wisconsin", "Iowa", "Minnesota")
 
 long_weekly %>%
   filter(state %in% upper_midwest & variable == "new cases") %>%
   ggplot(aes(weekdate, value, color = state)) +
   geom_line()
-

-

Now that’s getting a bit more interesting.

-

What about finding some way to also incorporate the death counts? Well, ggplot has another layer option called “facets” that can help produce multiple plots and present them alongside each other (or in a grid). Here’s an example that creates a faceted “grid” (really just a side-by-side comparison) of case counts and deaths for the same five states.

+

+

Notice that I use the %in% operator to filter for the values of the state vector that are “in” the upper_midwest vector (see help("%in%") for more; the quotes are needed because %in% is an operator).
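A quick way to see what %in% returns is to run it on a small vector outside of filter(). The some_states vector here is made up for illustration:

```r
upper_midwest <- c("Illinois", "Michigan", "Wisconsin", "Iowa", "Minnesota")

# Hypothetical vector to test against the upper_midwest vector
some_states <- c("Illinois", "Texas", "Iowa")

# One TRUE/FALSE per element of the left-hand vector
is_midwest <- some_states %in% upper_midwest
is_midwest

# The logical vector can then be used to subset, which is what filter() does row by row
some_states[is_midwest]
```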

+

Also notice that we now have ourselves a multivariate time series!

+

So now how about finding some way to also incorporate those death counts? If I just add them to this same plot we’ll run into the same issue we did with the Illinois data because the death counts look tiny plotted on the same scale as the case counts. A good solution in such a situation is to create a second plot for weekly deaths, with its own differently scaled y-axis, that we can display together with this weekly cases plot. The ggplot way to do this involves another type of layer called “facets.” Here’s an example that creates a faceted “grid” (not much of a grid since there are only two variables or categories we’re using to do the faceting) of weekly case counts and deaths for the same five states.

midwest_plot <- long_weekly %>%
   filter(state %in% upper_midwest) %>%
   ggplot(aes(weekdate, value, color = state)) +
@@ -1870,10 +1887,12 @@ long_weekly %>%
   facet_grid(rows = vars(variable), scales = "free_y")
 
 midwest_plot
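To see what scales = "free_y" does in isolation, here is a small sketch on made-up data where the two measures differ by about two orders of magnitude. The toy data frame and its values are hypothetical, and I am assuming a geom_line() layer like the one used in the earlier plots:

```r
library(ggplot2)

# Hypothetical long-format data: 2 states x 2 measures x 3 weeks
toy <- data.frame(
  weekdate = rep(as.Date(c("2020-03-02", "2020-03-09", "2020-03-16")), times = 4),
  state = rep(c("A", "B"), each = 3, times = 2),
  variable = rep(c("new cases", "new deaths"), each = 6),
  value = c(100, 250, 400, 80, 120, 300, 2, 5, 9, 1, 3, 4)
)

# One panel (row) per value of `variable`; each panel gets its own y-axis range
toy_plot <- ggplot(toy, aes(weekdate, value, color = state)) +
  geom_line() +
  facet_grid(rows = vars(variable), scales = "free_y")

toy_plot
```

Dropping the scales argument forces both panels onto a shared y-axis, which flattens the deaths panel in the same way the single Illinois plot did.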
-

-

Now we can clean up some of the other elements we worked on with the original plot (axes, title, etc.). I’ll bake that into a single chunk below.

+

+

Nice! Now we can clean up some of the other elements we worked on with the original plot (axes, title, etc.). I’ll bake that into a single chunk below.

midwest_plot +
  scale_x_date(date_labels = "%b", date_breaks = "1 month", date_minor_breaks = "1 week") +
  scale_y_continuous(label = comma) +
  labs(x = "Week (in 2020)", y = "", title = "COVID-19 cases in the Upper Midwest") +
  theme_light()
-

+

+

That’s it! Mission accomplished. We’ve got ourselves a nice concise visualization of weekly COVID-19 cases and deaths across five upper midwest states over nearly 8 months of the pandemic.

+
diff --git a/r_tutorials/w05a-R_tutorial.pdf b/r_tutorials/w05a-R_tutorial.pdf index a74dd77..ffd18b0 100644 Binary files a/r_tutorials/w05a-R_tutorial.pdf and b/r_tutorials/w05a-R_tutorial.pdf differ diff --git a/r_tutorials/w05a-R_tutorial.rmd b/r_tutorials/w05a-R_tutorial.rmd index 7059784..f079d7e 100644 --- a/r_tutorials/w05a-R_tutorial.rmd +++ b/r_tutorials/w05a-R_tutorial.rmd @@ -4,9 +4,6 @@ subtitle: "Statistics and statistical programming \nNorthwestern University \n author: "Aaron Shaw" date: "October 13, 3030" output: - pdf_document: - toc: yes - toc_depth: '3' html_document: toc: yes number_sections: true @@ -15,6 +12,9 @@ output: collapsed: false smooth_scroll: true theme: readable + pdf_document: + toc: yes + toc_depth: '3' --- ```{r setup, include=FALSE} @@ -23,9 +23,9 @@ knitr::opts_chunk$set(echo = TRUE, tidy='styler', message = FALSE) ``` # Getting started (more better plots) -This is a supplement to the Week 5 R tutorial focused on elaborating some examples of time series plots and more polished plots using [`ggplot2`](https://ggplot2.tidyverse.org). I'll work some data on state-level COVID-19 in the United States published by *The New York Times* (*NYT*). You can access the data an details about the sources, measurement, and different datasets available via [the *NYT* github repository](https://github.com/nytimes/covid-19-data). +This is a supplement to the Week 5 R tutorial focused on elaborating some examples of time series plots and more polished plots using [`ggplot2`](https://ggplot2.tidyverse.org). I'll work with some data on state-level COVID-19 in the United States published by *The New York Times* (*NYT*). You can access the data as well as details about the sources, measurement, and related available datasets via [the *NYT* github repository](https://github.com/nytimes/covid-19-data). -To start, I'll load up the `tidyverse` library and also attach the `lubridate` package to help handle dates and times. 
Then I'll import the "raw csv" from the web, and take a look at the dataset: +To start, I'll load up the `tidyverse` library and also attach the `lubridate` package, which can help to handle dates and times. Then I'll import the "raw csv" of my dataset from the web, and take a look at it: ```{r} library(tidyverse) @@ -38,29 +38,31 @@ d <- read_csv(data_url) d ``` -For the sake of my examples, I'm planning to work with the `date`, `state`, `cases`, and `deaths` variables. Notice that by using the `read_csv()` function to import the data, R already recognizes the `date` column as dates. It looks like I need to convert the state variable to a factor, however. After I do that I can get a quick sense of how much data I have for each state with a univariate table that just counts the number of observations (rows) for each value of `state`. +For the sake of my examples, I'm planning to work with the `date`, `state`, `cases`, and `deaths` variables. Notice that by using the `read_csv()` function to import the data, R already recognizes the `date` column as dates. Also notice that the column names for cases and deaths don't reflect the fact that both variables are *cumulative* counts. Also also, notice that it looks like I need to convert the state variable to a factor. I'll start there and then get a quick sense of how much data I have for each state with a univariate table. ```{r} d$state <- factor(d$state) table(d$state) ``` - +Two things to point out here: (1) not all of our "states" are technically states (e.g., Puerto Rico, District of Columbia, Virgin Islands, Northern Mariana Islands, Guam). I prefer to think of this as the *NYT* data scientist team quietly reminding us that the United States maintains a number of colonial properties without formal political representation! The second thing (2) is that not all states have the same number of observations/rows. 
You can probably figure out exactly why this might be the case from the documentation of the data sources and or from thinking more carefully about the context (e.g., some states had cases much earlier in 2020 than others). Anyhow, just some things to be aware of as we move forward with our analysis. # Plotting a univariate time series -I recommend using [`geom_path()`](https://ggplot2.tidyverse.org/reference/geom_path.html) to create univariate time series plots. Specifically, I'll call `geom_line()`, which is a specialized version of `geom_path()` that connects observations in order according to the values of variable that is mapped to the x-axis. By convention, a univariate time series maps dates to the x-axis, so this will just plot a line connecting the dots over time. +A univariate time series is just a fancy term for a plot of a single variable for which you have repeated observations collected over time. I recommend using [`geom_path()`](https://ggplot2.tidyverse.org/reference/geom_path.html) (that's a hyperlink to the documentation) to create univariate time series plots. Specifically, I'll call `geom_line()`, which is a specialized (masked) version of `geom_path()` that connects observations in order according to the values of variable that is mapped to the x-axis. By convention, a univariate time series maps dates to the x-axis, so this will just plot a line connecting the values of my y-values over time. -For my first example, I want to build up a plot of weekly case counts in Illinois. I can start off by just plotting the cumulative cases for all of the states and work my way towards the specific plot I want from there: +For a univariate example, let's build a plot of weekly case counts in Illinois. 
+ +I can start by just plotting the cumulative cases for all of the states and work towards the specific plot we want from there: ```{r} ggplot(data=d, aes(date, cases)) + geom_line() ``` -Notice that ggplot handles the `date` variable quite well by default! It recognizes the units of time and generates axis labels in terms of months. Also notice that ggplot handles the axis labels for the `cases` variable...less well. I don't know about you, but my brain doesn't parse scientific notation quickly/easily. +Notice that ggplot handles the `date` variable quite well by default! It recognizes the units of time and generates axis labels in terms of months. Also notice that ggplot handles the axis labels for the `cases` variable...less well. I don't know about you, but my brain doesn't parse scientific notation quickly/easily. Finally, the fact that this figure incorporates all the state-level observations as cumulative counts means that there is just a huge clutter of points/lines in this figure. It's impossible to really figure out what's going on, much less learn anything other than the cumulative number of cases within states appears to have increased over time (thanks for nothing, ggplot). -# Tidying timeseries data for better plots +## Tidying some timeseries data -Okay, let's get to work cleaning all this up. At this point, my next steps are to (1) restrict the data to the Illinois cases; (2) reorganize the *cumulative* daily case counts into weekly counts; and (3) plot it again with better axis labels and a nice title. +Okay, let's get to work cleaning this up. At this point, my next steps are to (1) restrict the data to the Illinois cases; (2) reorganize the *cumulative* daily case counts into weekly counts; and (3) plot it again with better axis labels and a nice title. I can restrict the data to Illinois in a few ways. 
Since I'm using ggplot, I'll work with Tidyverse "pipes" (`%>%`) and "verbs" (in this case, `filter`): ```{r} @@ -69,37 +71,45 @@ d %>% ggplot(aes(date, cases)) + geom_line() ``` -That's already much less cluttered. Inserting a call to the Tidyverse `mutate`, `group_by`, and `summarize` verbs can help me generate the weekly counts I'm looking for. Here's the code to produce a new object. I'll walk through it below: +That's already much less cluttered and much clearer. It also looks plausibly accurate (it's always good to sanity check your data visualizations as you go—weird anomalies in a graph are usually a good indicator of something weird happening in the underlying code and/or data. + +Now onwards to converting my cumulative case counts into weekly case counts. When I wrote this tutorial, the first way I thought to do this involved making calls to the Tidyverse `mutate`, `group_by`, and `summarize` verbs. After a little trial and error, I got it to work with the following code (which I'll walk through in detail below): ```{r} il_weekly_cases <- d %>% filter(state == "Illinois") %>% - mutate(diff_cases = c(cases[1], diff(cases, lag = 1)), - weekdate = cut(date, "week")) %>% + mutate( + diff_cases = c(cases[1], diff(cases, lag = 1)), + weekdate = cut(date, "week")) %>% group_by(weekdate) %>% summarize(new_cases = sum(diff_cases, na.rm = T),) il_weekly_cases ``` -There's quite a lot happening there. I'll go through it verb-by-verb. +There's quite a lot happening there so let's go through it verb-by-verb. -First, I use `mutate` to create a `diff_cases` variable that disaggregates the cumulative values of `cases` (read the documentation for `diff` to learn more about this one). Differenced values alone wouldn't produce the same number of items (try running `length(1:10)` and compare that with `length(diff(1:10, 1))` to see what I mean), so I stores the first value of my `cases` variable and then append the differenced values after that. 
Within the same call to mutate I also create a new variable `weekdate` that collapses the dates into weeks (see the documentation for `cut.Date`) and stores the resulting strings as factors (e.g., a factor where the levels correspond to a series of Mondays: "2020-01-20", "2020-01-27"...). Hopefully, so far so good? +First, I `filter` my cases to restrict the set to Illinois data. Then I use `mutate` to create a `diff_cases` variable that disaggregates the cumulative values of `cases` (read the documentation for `diff` to learn more about this one). Differenced values alone wouldn't produce the correct number of items (try running `length(1:10)` and compare that with `length(diff(1:10, 1))` to see what I mean), so I store the first value of my `cases` variable and then append the differenced values after that (Note that this assumes and takes advantage of the fact that the data is sorted by date. I could add a call to `arrange(-desc())` before doing my mutation to ensure the correct ordering, but won't bother with that for now). Within the same call to mutate I also create a new variable `weekdate` that collapses the dates into weeks (see the documentation for `cut.Date`) and stores the resulting strings as factors (e.g., a factor where the levels correspond to a series of Mondays: "2020-01-20", "2020-01-27"...). Hopefully, so far so good? -Next, I use `group_by` to aggregate everything by my `weekdate` factor values. +Next, I use `group_by` to aggregate everything by my `weekdate` factor values. This is essentially creating conditional groupings of the data that I can then summarize in my next command. -Finally I use `summarize` to reshape my data and collapse everything into weekly counts of new cases (notice that I use `sum` inside the `summarize` call to add up the case counts within the grouping variable). 
-Okay, let's see about plotting this now: +Finally I use `summarize` to reshape my data and collapse everything into weekly counts of new cases (notice that I use `sum` inside the `summarize` call to add up the case counts within the grouping variable). The result is a brand new two-column tibble consisting of weekdates and weekly counts of new cases. Excellent! -Hmm. looks like I have a problem with my dates. Let's troubleshoot this: +Okay, let's see about plotting this now: +```{r} +il_weekly_cases %>% + ggplot(aes(weekdate, new_cases)) + geom_line() +``` +Hmm. looks like I have a problem here. My first guess is that there's something funny going on with my `weekdate` variable because it looks very different on the x-axis. Let's troubleshoot: ```{r} class(il_weekly_cases$weekdate) ``` -Whoops. It looks like I need to convert that `weekdate` variable into an object of class "date" so that it will work with ggplot. There are a number of ways I could do this, but I'll just make a new variable by first converting `weekdate` to a character vector and then converting that into a date using `as.Date` (and remember that it is sometimes easier to read these "nested" commands from the inside-out). + +Whoops. Indeed, I need to convert that `weekdate` variable back into an object of class "date" so that it will work with ggplot. There are a number of ways I could do this, but I'll just make a new variable by first coercing `weekdate` to a character vector and then coercing that into a date using `as.Date` (and remember that it is sometimes easier to read these "nested" commands from the inside-out). ```{r} il_weekly_cases$date = as.Date(as.character((il_weekly_cases$weekdate))) il_weekly_cases ``` -That ought to work now: +That ought to work for plotting now: ```{r} plot1 <- il_weekly_cases %>% ggplot(aes(date, new_cases)) + geom_line() @@ -107,57 +117,65 @@ plot1 <- il_weekly_cases %>% plot1 ``` -Much better! 
Notice that the final week of the data appears to fall off a cliff. That's just an artifact of the way that the *NYT* has published the data for part of the most recent week. Once it updates, the case count probably won't drop like that (yikes). Anyhow, onwards to cleaning things up and adding a title. +Much better! Notice that the final week of the data appears to fall off a cliff. That's just an artifact of the way that the *NYT* has published the data for part of the most recent week. Once it updates, the case count probably won't tumble like that (yikes). -# Working on ggplot axis labels, titles, and scales -As I mentioned briefly in class `ggplot2` treats labels, titles, and scales as "layers" within it's "grammar of graphics" (and yes, I'm rolling my eyes as I type those scare-quotes). For the purposes of our example here I'm going to use `scale_date` to work with the x-axis, `scale_continuous` to work with the y-axis, and `labs` to clean up the title and axis labels. +## Working on ggplot axis labels, titles, and scales +Now we can style the plot. As I mentioned briefly in class `ggplot2` treats labels, titles, and scales as "layers" within it's "grammar of graphics" (that sound you hear is me rolling my eyes as I type those scare-quotes). For the purposes of our example here I'm going to use `scale_date` to work with the x-axis, `scale_continuous` to work with the y-axis, and `labs` to clean up the title and axis labels. Each of those have documentation and should appear on the `ggplot2` cheatsheet available via RStudio/Tidyverse. -For starters, let's see whether there might be any way I want to improve the axis labels. The ggplot defaults for my `date` variable are pretty good already, but maybe I want to incorporate a label/break for each month as well as a more granular grid in the background that shows the weeks? Here's what all of that looks like: +To start, let's see whether there might be any way I want to improve the x-axis labels. 
The ggplot defaults for my `date` variable are pretty good already, but maybe I want to incorporate a label ("break") for each month as well as a more granular grid in the background ("minor_breaks") that shows the weeks? Also, I like the date labels along the axis as abbreviations of the month names, so I'll keep that with a call to `date_labels`. Here's what all of that looks like: ```{r} plot2 <- plot1 + scale_x_date(date_labels = "%b", date_breaks= "1 month", date_minor_breaks = "1 week") plot2 ``` -The ggplot documentation for [`scale_date`](https://ggplot2.tidyverse.org/reference/scale_date.html) can give you some other examples and ideas. Also, notice how I appended the `scale_date` layer to my existing plot and stored it as a new object? This can make it easier to work iteratively without losing any of my earlier layers along the way. +The ggplot documentation for [`scale_date`](https://ggplot2.tidyverse.org/reference/scale_date.html) can give you some other examples and ideas. Also, notice how I appended the `scale_date` layer to my existing plot and stored it as a new object? This can make it easier to work iteratively on a single plot, adding new layers as I go without losing existing material along the way. -Now I can fix up the y-axis labels a bit using a call to the `labels` argument after I load the `scales` package. +Now I can fix up the y-axis labels a bit using a call to the `labels` argument after I load the `scales` package (why doesn't ggplot support this kind of labeling itself? I have no clue). ```{r} library(scales) plot3 <- plot2 + scale_y_continuous(label=comma) plot3 ``` -Nearly done. All that's left is a title and better axis names. I'll do that with yet another layer. +Nearly done. All that's left is a title and better axis names. I'll do that with yet another layer call to `labs`. The arguments here are pretty intuitive. 
```{r} plot4 <- plot3 + labs(x="Week (in 2020)", y="New cases", title="COVID-19 cases in Illinois") plot4 ``` -Last, but not least, I mentioned in our class session that ggplot also has "themes" that can be useful for styling plots. One I have used for publications is the "light" theme. Here's how to apply that: +Last, but not least, I mentioned in our class session that ggplot also has "themes" that can be useful for styling plots. One I have used for publications is the "light" theme. Here I apply that theme as...yet another layer: ```{r} plot4 + theme_light() ``` That's looking much better than when we started! If you wanted to export it as a standalone file (e.g., .png, .pdf, or whatever), I recommend looking at the documentation for the `ggsave()` function, which is available via ggplot2. Base R also has a `save()` function that you can work with, although it can be a bit more complicated to get comfortable with. -# Long versus wide data (and why long data is often helpful) +# Multivariate and multidimensional time series plots + +Okay, that's a lovely univariate time series plot. Now let's make this more sophisticated and interesting by incorporating more data, more dimensions, and more variables. In order to do that, I want to start with a little detour into data structures. Try to stay with me—this turns out to be super important for working more efficiently with tools like ggplot as well as learning to manage more complex statistical analysis strategies (that we won't really cover in the course, but so be it). -So what if you wanted to plot a multivariate time series (e.g., the same plot for more than one state and/or for more than one measure)? As always, you have a number of options, but the most effective way to achieve this with ggplot involves learning to work with "long" format data. 
+## Long versus wide data (and why long data is often helpful) -Thus far, we have worked mostly with "wide" format data where (nearly) every row corresponds to a single unit/observation and every column corresponds to a variable (for which we usually have no more than one value attributed to any unit/observation). Wide format data is great for many things, but it turns out that learning to work with long format data can be super helpful for a number of purposes. Producing richer, multidimensional ggplot visualizations is one of them. +So now you want to plot a multivariate time series (e.g., the same plot for more than one state and/or for more than one measure). As always, you have a number of options, but the most effective way to achieve this with ggplot involves learning to work with "longer" data. + +Thus far, we have worked mostly with "wide" format data where (nearly) every row corresponds to a single unit/observation and every column corresponds to a distinct variable (for which we usually have no more than one value attributed to any unit/observation). This often results in wider format data that is great for many things. However, it turns out that longer format data can be super helpful for a number of purposes. Producing richer, multidimensional ggplot visualizations is one of them. Consider the format of my tidied dataframe that I used for plotting: ```{r} il_weekly_cases ``` -This dataframe is in a "wide" format. Each row is a week and each column is a variable unique to that week. +This dataframe is in a pretty "long" format. Each row is a week and each column is a variable unique to that week (okay, I could consolidate my `weekdate` and `date` columns into just one, but that's not really the point here. The idea is that there's minimal redundant information in the rows and in the columns). 
-Our original dataframe was a bit "longer": +Our original dataframe was also pretty "long": ```{r} d ``` -We see multiple observations per state (I think I would say the units or rows correspond to "state-dates" or something like that). It's not completely "long" however, because we also have multiple columns corresponding to the two variables of interest: `cases` and `deaths`. The point I want to make is that there are a number of ways we can make this data "longer." For the purposes of producing a multi-state plot like the one above, the most important of these is going to involve dropping the step where I filtered by `state=="Illinois"` and replacing by a `group_by` step before I create my `weekdate` variable. I'm also going to go ahead and drop the `date` and `fips` variables because they're just getting in my way at this point. I'll start there +Here we have multiple observations per state (I think I would say the units or rows correspond to "state-dates" or something like that). It's not as "long" as possible, though, because we also have multiple columns corresponding to the two variables of interest: `cases` and `deaths`. + +For the purposes of producing a multi-state and multivariate set of plots, the most important thing I want to do is consolidate my dataset into a format where I have the following columns: `date` (collapsed into weeks), `state`, `variable` (which will either have a value of `new cases` or `new deaths`), and a column for `value` that will hold the corresponding state-week count for the variable in each row. If that doesn't make sense, don't worry, we'll get there soon enough. + +Doing this involves a different approach to tidying up my data. I'll start by dropping the step where I filtered by `state=="Illinois"` and replacing it with a `group_by` step before I create my `weekdate` variable. I'm also going to go ahead and drop the `date` and `fips` variables because they're just getting in my way. 
```{r} weekly <- d %>% group_by(state) %>% @@ -166,8 +184,8 @@ weekly <- d %>% ) %>% select(state, cases, deaths, weekdate) weekly ``` -I'm getting somewhere with this, I promise. One of the principles of "tidy" data is to make it so that every variable has a column, every observation has a row, and every value has a cell. Right now, I've got multiple observations for each state-week spread across multiple rows. Remember that my `cases` and `deaths` variables are actually cumulative counts, so I really only need to store the maximum value for each state-week in order to calculate the new cases per state-week. Let's see what to do about that: +Now I've got multiple observations for each state-week spread across multiple rows (because my rows were structured around a more granular measure of time). My next move is to collapse these into a single observation for each state-week. Remember that my `cases` and `deaths` variables are still cumulative counts, so as I do this aggregation by week I will only need to store the maximum value for each state-week in order to calculate the number of new cases per state-week. ```{r} tidy_weekly <- weekly %>% group_by(state, weekdate) %>% @@ -177,9 +195,15 @@ tidy_weekly <- weekly %>% ) ``` +Notice that the call to `group_by` groups by multiple variables. The order here matters! If I reversed it to read `group_by(weekdate, state)` the results would be very different. With the correct ordering, I have things bundled up into state-week sub-groups and then I move on to calculate the maximum value of cumulative cases within each bundle. + +Next, I can fix up my `weekdate` variable again so that it is a Date object. ```{r} tidy_weekly$weekdate <- as.Date(as.character(tidy_weekly$weekdate)) +``` +This will allow me to do some sorting within my state-week bundles to ensure things are in the proper order before I convert my weekly cumulative case count into weekly new case counts. 
+```{r} tidy_weekly <- tidy_weekly %>% group_by(state) %>% arrange(-desc(weekdate)) %>% @@ -190,7 +214,10 @@ tidy_weekly <- tidy_weekly %>% tidy_weekly ``` -This is headed in the right direction. For some purposes, though, it's still not quite "long" enough For starters, I can drop the cumulative cases and deaths columns. The other thing I can do is "pivot" the data to organize the `new_cases` and `new_deaths` measures a little differently. To manage this, I'll use the `pivot_longer()` function (part of the `tidyr` package from the tidyverse). I will also go ahead and coerce my `weekdate` into a Date object again: + +We're much closer to our goal now! + +I can go ahead and drop the cumulative cases and deaths columns with a call to `select` in my next step. Then the big next (and nearly final) step is to "pivot" the data to organize the `new_cases` and `new_deaths` measures in the way I described above. To manage this, I'll use the `pivot_longer()` function (part of the `tidyr` package from the tidyverse): ```{r} long_weekly <- tidy_weekly %>% select(state, weekdate, new_cases, new_deaths) %>% @@ -202,9 +229,10 @@ long_weekly <- tidy_weekly %>% long_weekly ``` -Can you see what that did? I now have two rows of data for every state-week. One that contains a value for `new_cases` and one that contains a value for `new_deaths`. Both of those variables have been "pivoted" into a single `variable` column. -Before we move forward I'm going to clean up the values of `variable`. +Can you see what that did? I now have two rows of data for every state-week. One row contains a value for `new_cases` and one contains a value for `new_deaths`. Both of those variables have been "pivoted" into a single `variable` column and their corresponding values recorded in another new column. 
Note that this makes our dataframe a little longer even though it does not technically reduce the "width" of this particular dataset (because we've taken two columns and pivoted them to create...two different columns). However, consider that we could accommodate as many additional numerical variables and values as we might like in this manner and you can start to see how this pivoting step could result in much longer data (the length becomes a function of the number of units in your dataset and the variables you include in your pivoting step). + +Before we move forward I'm also going to clean up the values of `variable`. This turns out to be helpful later on when we're plotting, but makes more sense to implement here before I start creating any plot layers. ```{r} long_weekly <- long_weekly %>% mutate( @@ -212,7 +240,8 @@ long_weekly <- long_weekly %>% ) ``` -Okay, prepared with my `tidy_weekly` and my `long_weekly` tibbles, I'm now ready to generate some more interesting multidimensional plots. Let's start with the same sort of time series of new cases we made for Illinois before so we can see how to replicate that with this new data structure: + +Okay, prepared with my `long_weekly` tibble, I'm now ready to generate some more interesting and multidimensional plots. Let's start with the same univariate time series of new cases we made for Illinois before so we can see how to replicate that figure with this new data structure: ```{r} long_weekly %>% filter( state == "Illinois" & variable == "new cases" @@ -220,15 +249,15 @@ long_weekly %>% filter( ``` -Now we can easily plot Illinois cases against deaths from the same tibble: +With our "longer" data format, we can plot Illinois cases against deaths from the same tibble by incorporating a `color=variable` argument : ```{r} long_weekly %>% filter(state == "Illinois") %>% ggplot(aes(weekdate, value, color=variable)) + geom_line() ``` -That plot isn't so great because the death counts are dwarfed by the case counts. 
Thank goodness! +Unfortunately, that plot isn't so great because the death counts are dwarfed by the case counts (thank goodness!). -Now let's compare Illinois case counts against some its neighbors in the upper midwest: +Now let's compare Illinois case counts against some of the neighboring states in the upper midwest: ```{r} upper_midwest = c("Illinois", "Michigan", "Wisconsin", "Iowa", "Minnesota") @@ -237,9 +266,11 @@ long_weekly %>% ggplot(aes(weekdate, value, color=state)) + geom_line() ``` -Now that's getting a bit more interesting. +Notice that I use the `%in%` operator to filter for the values of the `state` vector that are "in" the `upper_midwest` vector (see `help("%in%")` for more). + +Also notice that we now have ourselves a multivariate time series! -What about finding some way to also incorporate the death counts? Well, ggplot has another layer option called "facets" that can help produce multiple plots and present them alongside each other (or in a grid). Here's an example that creates a faceted "grid" (really just a side-by-side comparison) of case counts and deaths for the same five states. +So now how about finding some way to also incorporate those death counts? If I just add them to this same plot we'll run into the same issue we did with the Illinois data because the death counts look tiny plotted on the same scale as the case counts. A good solution in such a situation is to create a second plot for weekly deaths that we can display together with this weekly cases plot that uses a differently scaled y-axis. The ggplot way to do this involves another type of layer called "facets." Here's an example that creates a faceted "grid" (not much of a grid since there are only two variables or categories we're using to do the faceting) of weekly case counts and deaths for the same five states. 
```{r} midwest_plot <- long_weekly %>% filter(state %in% upper_midwest) %>% @@ -248,9 +279,10 @@ midwest_plot <- long_weekly %>% midwest_plot ``` -Now we can clean up some of the other elements we worked on with the original plot (axes, title, etc.). I'll bake that into a single chunk below. +Nice! Now we can clean up some of the other elements we worked on with the original plot (axes, title, etc.). I'll bake that into a single chunk below. ```{r} midwest_plot + scale_x_date(date_labels = "%b", date_breaks= "1 month", date_minor_breaks = "1 week") + scale_y_continuous(label=comma) + labs(x="Week (in 2020)", y="", title="COVID-19 cases in the Upper Midwest") + theme_light() ``` +That's it! Mission accomplished. We've got ourselves a nice concise visualization of weekly COVID-19 cases and deaths across five upper midwest states over nearly 8 months of the pandemic. \ No newline at end of file
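
If the pivoting step described above still feels abstract, it may help to try it on a tiny made-up tibble where you can count the rows yourself. The following is only a minimal sketch: `toy_weekly` and `toy_long` are hypothetical names, the numbers are invented for illustration (they are not from the NYT data), and the `names_to` and `values_to` arguments are assumed to match the `variable` and `value` columns used in the plots above.

```{r}
library(dplyr)
library(tidyr)

# Hypothetical toy data (NOT the NYT dataset): two states, two weeks each
toy_weekly <- tibble(
  state = c("Illinois", "Illinois", "Iowa", "Iowa"),
  weekdate = as.Date(c("2020-09-06", "2020-09-13", "2020-09-06", "2020-09-13")),
  new_cases = c(100, 120, 40, 55),
  new_deaths = c(3, 4, 1, 2)
)

# Pivot the two measure columns into a name column and a value column
toy_long <- toy_weekly %>%
  pivot_longer(
    cols = c(new_cases, new_deaths),
    names_to = "variable",
    values_to = "value"
  )

toy_long
```

Each of the four state-weeks should now appear twice, once for `new_cases` and once for `new_deaths`, which is the "two rows of data for every state-week" structure that the `color=variable` and faceting examples rely on.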