PC 1: Access and describe a dataset provided in an R library

PC1.1

## This first command installs the package. I've commented it out here because it only needs to run once.
## You can also install packages from the 'Packages' tab of the RStudio interface.
##
## install.packages("openintro")  

library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data("county")

PC1.2

class(county)
## [1] "tbl_df"     "tbl"        "data.frame"

Notice that the last call to class() produced something you might not have expected: three different values! This is because the OpenIntro authors have provided this dataset as a “tibble” (that’s the tbl_df and tbl bits), which is basically a souped-up Tidyverse analogue to a Base-R dataframe. You can do some nice things with tibbles. For example, you can invoke one directly without worrying that R will print a massive amount of data at your console. The output also includes the dimensions of the tibble and the types/classes of its columns/variables. Here’s what it looks like:

county
## # A tibble: 3,142 x 15
##    name  state pop2000 pop2010 pop2017 pop_change poverty homeownership
##    <chr> <fct>   <dbl>   <dbl>   <int>      <dbl>   <dbl>         <dbl>
##  1 Auta… Alab…   43671   54571   55504       1.48    13.7          77.5
##  2 Bald… Alab…  140415  182265  212628       9.19    11.8          76.7
##  3 Barb… Alab…   29038   27457   25270      -6.22    27.2          68  
##  4 Bibb… Alab…   20826   22915   22668       0.73    15.2          82.9
##  5 Blou… Alab…   51024   57322   58013       0.68    15.6          82  
##  6 Bull… Alab…   11714   10914   10309      -2.28    28.5          76.9
##  7 Butl… Alab…   21399   20947   19825      -2.69    24.4          69  
##  8 Calh… Alab…  112249  118572  114728      -1.51    18.6          70.7
##  9 Cham… Alab…   36583   34215   33713      -1.2     18.8          71.4
## 10 Cher… Alab…   23988   25989   25857      -0.6     16.1          77.5
## # … with 3,132 more rows, and 7 more variables: multi_unit <dbl>,
## #   unemployment_rate <dbl>, metro <fct>, median_edu <fct>,
## #   per_capita_income <dbl>, median_hh_income <int>, smoking_ban <fct>
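
As a side note on the three-element class() result: an R object can carry several classes at once, listed from most to least specific. The sketch below illustrates this with a toy data frame built in Base-R rather than the county data. The class name my_table is made up for illustration and is not part of any package.

```r
# Toy data frame built with Base-R only (not the county data)
toy <- data.frame(name = c("a", "b"), pop = c(10L, 20L))
class(toy)

# Prepend a made-up class name, much as a tibble lists "tbl_df" and "tbl"
# ahead of "data.frame"
class(toy) <- c("my_table", class(toy))
class(toy)

# The object still counts as a data.frame, so data.frame tools keep working
inherits(toy, "data.frame")
is.data.frame(toy)
dim(toy)
```

Because "data.frame" is still in the class vector, functions written for dataframes (such as dim() here) behave as usual, which is why the tibble can be used with the Base-R commands in the rest of this worked solution.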

Notice that the tibble labels some numeric variables as “<dbl>” (“doubles”) and others as “<int>” (“integers”). The latter maps to the colloquial idea of an integer. A “double” is programming-language speak for a double-precision floating-point number, that is, a numeric variable that can take non-integer values.

The results of this call are actually sufficient to answer PC1.3 and PC1.4 below.
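
To make the integer/double distinction concrete, here is a minimal sketch using two made-up values (x_int and x_dbl are illustrative names, not variables from the dataset). It also shows why a double column reports its class as "numeric".

```r
x_int <- 5L   # the L suffix asks R to store an integer
x_dbl <- 5    # without the suffix, R stores a double

typeof(x_int)
## [1] "integer"
typeof(x_dbl)
## [1] "double"

# class() reports doubles as "numeric"
class(x_dbl)
## [1] "numeric"

# Both kinds count as numeric and compare as equal
is.numeric(x_int)
## [1] TRUE
x_int == x_dbl
## [1] TRUE

# Division returns a double even when both inputs are integers
typeof(x_int / 2L)
## [1] "double"
```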

PC1.3

If you didn’t know or didn’t realize that printing a tibble directly answers this, here are some other ways to find the dimensions of a dataset:

dim(county)
## [1] 3142   15
nrow(county)
## [1] 3142
ncol(county)
## [1] 15

PC1.4

Again, here are some additional tools/approaches you might use to answer this if the tibble method isn’t available or known to you. You can find the class of any one variable easily with the class() command. Repeating this by hand for every variable name in a dataframe is feasible, but tedious and inefficient, so I provide a more concise method with lapply() below:

names(county)
##  [1] "name"              "state"             "pop2000"          
##  [4] "pop2010"           "pop2017"           "pop_change"       
##  [7] "poverty"           "homeownership"     "multi_unit"       
## [10] "unemployment_rate" "metro"             "median_edu"       
## [13] "per_capita_income" "median_hh_income"  "smoking_ban"
## just an example here:
class(county$poverty)
## [1] "numeric"
## lapply() is useful for doing this over all variables in a dataframe/tibble
lapply(county, class)
## $name
## [1] "character"
## 
## $state
## [1] "factor"
## 
## $pop2000
## [1] "numeric"
## 
## $pop2010
## [1] "numeric"
## 
## $pop2017
## [1] "integer"
## 
## $pop_change
## [1] "numeric"
## 
## $poverty
## [1] "numeric"
## 
## $homeownership
## [1] "numeric"
## 
## $multi_unit
## [1] "numeric"
## 
## $unemployment_rate
## [1] "numeric"
## 
## $metro
## [1] "factor"
## 
## $median_edu
## [1] "factor"
## 
## $per_capita_income
## [1] "numeric"
## 
## $median_hh_income
## [1] "integer"
## 
## $smoking_ban
## [1] "factor"
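
If the list output from lapply() feels long, one possible alternative is vapply(), which returns a compact named vector. This is only a sketch on a made-up toy dataframe (the column names and values are invented for illustration). Taking class(col)[1] is an assumption that the first class is the one you care about, which matters only when a column has more than one class.

```r
# Toy dataframe with one column of each common type (invented values)
toy <- data.frame(
  name  = c("a", "b", "c"),
  pop   = c(10L, 20L, 30L),
  rate  = c(1.5, 2.5, NA),
  metro = factor(c("yes", "no", "yes")),
  stringsAsFactors = FALSE
)

# One class per column, returned as a named character vector
col_classes <- vapply(toy, function(col) class(col)[1], character(1))
col_classes

# Count how many columns there are of each class
table(col_classes)

# str() gives a similar overview along with the first few values
str(toy)
```

The same calls should work on the county tibble by swapping toy for county.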

PC1.5

For my example, I’ll work with the poverty variable:

length(county$poverty)
## [1] 3142
min(county$poverty, na.rm=TRUE) ## That na.rm=TRUE part is crucial!
## [1] 2.4
max(county$poverty, na.rm=TRUE)
## [1] 52
mean(county$poverty, na.rm=TRUE) ## So many significant digits... 
## [1] 15.96885
sd(county$poverty, na.rm=TRUE)
## [1] 6.515682
## And here's a built-in command that covers many of these:
summary(county$poverty)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##    2.40   11.30   15.20   15.97   19.40   52.00       2
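
To see why na.rm=TRUE matters, here is a small sketch on a made-up vector with two missing values (x is illustrative and not taken from the dataset). It also shows that length() counts missing values and that round() can trim the extra digits.

```r
x <- c(2.4, 15.2, NA, 52.0, NA)

# Without na.rm=TRUE, a single NA makes the result NA
mean(x)
## [1] NA

# With na.rm=TRUE, the NAs are dropped before computing
mean(x, na.rm = TRUE)
## [1] 23.2

# length() counts every element, missing or not
length(x)
## [1] 5

# Count the missing and non-missing values explicitly
n_missing <- sum(is.na(x))
n_observed <- sum(!is.na(x))
n_missing
## [1] 2
n_observed
## [1] 3

# round() trims the output to a chosen number of decimal places
mean_rounded <- round(mean(x, na.rm = TRUE), 1)
mean_rounded
## [1] 23.2
```

Applied to the poverty variable, this means the 3142 reported by length() includes the 2 NAs that summary() reports, so the statistics above are based on 3140 counties.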

PC1.6

hist(county$poverty)