## This first command can install the package. I've commented it out here because it only needs to run once.
## You can also install packages from the 'Packages' tab of the RStudio interface.
##
## install.packages("openintro")
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
data("county")
class(county)
## [1] "tbl_df" "tbl" "data.frame"
Notice that that last call to class() produced something you might not have expected: three different values! This is because the OpenIntro authors have provided this dataset as something called a “tibble” (that’s the tbl_df and tbl bits), which is basically a souped-up Tidyverse analogue to a Base-R dataframe. You can do some nice things with Tidyverse tibbles. For example, you can invoke them directly and not worry that R is going to print out a massive amount of data at your console. The output also includes information about the dimensions and the types/classes of the columns/variables in the tibble. Here’s what it looks like:
county
## # A tibble: 3,142 x 15
## name state pop2000 pop2010 pop2017 pop_change poverty homeownership
## <chr> <fct> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 Auta… Alab… 43671 54571 55504 1.48 13.7 77.5
## 2 Bald… Alab… 140415 182265 212628 9.19 11.8 76.7
## 3 Barb… Alab… 29038 27457 25270 -6.22 27.2 68
## 4 Bibb… Alab… 20826 22915 22668 0.73 15.2 82.9
## 5 Blou… Alab… 51024 57322 58013 0.68 15.6 82
## 6 Bull… Alab… 11714 10914 10309 -2.28 28.5 76.9
## 7 Butl… Alab… 21399 20947 19825 -2.69 24.4 69
## 8 Calh… Alab… 112249 118572 114728 -1.51 18.6 70.7
## 9 Cham… Alab… 36583 34215 33713 -1.2 18.8 71.4
## 10 Cher… Alab… 23988 25989 25857 -0.6 16.1 77.5
## # … with 3,132 more rows, and 7 more variables: multi_unit <dbl>,
## # unemployment_rate <dbl>, metro <fct>, median_edu <fct>,
## # per_capita_income <dbl>, median_hh_income <int>, smoking_ban <fct>
Notice that the tibble labels some numeric variables as “dbl” (double-precision numbers) and others as “int” (integers) in the row just under the column names.
The results of this call are actually sufficient to answer PC1.3 and PC1.4 below.
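As an aside on the three values that class(county) returned earlier: an R object can carry several classes at once, listed from most specific to most general. Here is a minimal sketch of that idea in base R, using a toy data frame with made-up values. The class name my_table is hypothetical, invented only for this illustration; it is not part of openintro or the Tidyverse.

```r
# A small toy data frame (invented values, not the county data)
toy <- data.frame(id = 1:3, score = c(2.5, 3.1, 4.8))

# Prepend a hypothetical extra class, much as tibbles carry "tbl_df" and "tbl"
# in front of "data.frame"
class(toy) <- c("my_table", class(toy))

class(toy)                   # two values: "my_table" "data.frame"
inherits(toy, "data.frame")  # TRUE: the object still counts as a data frame
is.data.frame(toy)           # TRUE for the same reason
```

This is why functions written for ordinary dataframes generally keep working on tibbles: the "data.frame" class is still in the list.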
If you didn’t know or didn’t realize that calling a Tibble directly answered this, here are some other ways to find the dimensions of a dataset:
dim(county)
## [1] 3142 15
nrow(county)
## [1] 3142
ncol(county)
## [1] 15
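These three calls are closely related: dim() returns a two-element vector with the number of rows first and the number of columns second, so you can pull out either piece by position. A quick sketch with a toy data frame (the values are invented for illustration):

```r
# Toy data frame with 4 rows and 2 columns (invented values)
toy <- data.frame(name = c("a", "b", "c", "d"),
                  value = c(10, 20, 30, 40))

d <- dim(toy)      # c(rows, columns)
d[1] == nrow(toy)  # TRUE: first element is the row count
d[2] == ncol(toy)  # TRUE: second element is the column count
```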
Again, here are some additional tools/approaches you might use to answer this if the tibble method isn’t available or known to you. Note that you can find the class of any one variable easily with the class() command. Iterating this over the names of all the variables in a dataframe is feasible, but tedious and inefficient, so I provide a more concise method with lapply() below:
names(county)
## [1] "name" "state" "pop2000"
## [4] "pop2010" "pop2017" "pop_change"
## [7] "poverty" "homeownership" "multi_unit"
## [10] "unemployment_rate" "metro" "median_edu"
## [13] "per_capita_income" "median_hh_income" "smoking_ban"
## just an example here:
class(county$poverty)
## [1] "numeric"
## lapply() is useful for doing this over all variables in a dataframe/tibble
lapply(county, class)
## $name
## [1] "character"
##
## $state
## [1] "factor"
##
## $pop2000
## [1] "numeric"
##
## $pop2010
## [1] "numeric"
##
## $pop2017
## [1] "integer"
##
## $pop_change
## [1] "numeric"
##
## $poverty
## [1] "numeric"
##
## $homeownership
## [1] "numeric"
##
## $multi_unit
## [1] "numeric"
##
## $unemployment_rate
## [1] "numeric"
##
## $metro
## [1] "factor"
##
## $median_edu
## [1] "factor"
##
## $per_capita_income
## [1] "numeric"
##
## $median_hh_income
## [1] "integer"
##
## $smoking_ban
## [1] "factor"
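If the list output from lapply() feels long, sapply() is a base-R relative that tries to simplify the result into a vector. The sketch below uses a toy data frame with invented values rather than the county data. One caveat: sapply() only gives back a tidy character vector when every column has exactly one class, which is an assumption that holds for this toy example.

```r
# Toy data frame with one column of each common type (invented values)
toy <- data.frame(
  county  = c("A", "B", "C"),
  pop     = c(100L, 250L, 75L),
  poverty = c(12.5, 20.1, NA),
  metro   = factor(c("yes", "no", "no")),
  stringsAsFactors = FALSE
)

# Named character vector: one class per column
col_classes <- sapply(toy, class)
col_classes

# Count how many columns there are of each class
table(col_classes)
```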
For my example, I’ll work with the poverty variable:
length(county$poverty)
## [1] 3142
min(county$poverty, na.rm=TRUE) ## That na.rm=TRUE part is crucial!
## [1] 2.4
max(county$poverty, na.rm=TRUE)
## [1] 52
mean(county$poverty, na.rm=TRUE) ## So many significant digits...
## [1] 15.96885
sd(county$poverty, na.rm=TRUE)
## [1] 6.515682
## And here's a built-in command that covers many of these:
summary(county$poverty)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2.40 11.30 15.20 15.97 19.40 52.00 2
hist(county$poverty)
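To see why the na.rm=TRUE part matters, here is a small sketch with an invented vector that contains one missing value. Without na.rm=TRUE, a single NA turns the whole result into NA. The sketch also shows round(), which is one way to trim the extra digits when reporting a result.

```r
x <- c(10, 20, NA, 30)  # invented values with one missing observation

mean(x)                 # NA: the missing value propagates
mean(x, na.rm = TRUE)   # 20: computed from the three observed values
sum(is.na(x))           # 1: number of missing values

round(sd(x, na.rm = TRUE), 2)  # 10
```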