X-Git-Url: https://code.communitydata.science/social-media-chapter.git/blobdiff_plain/f452c9cf299c06a9aa14466aedec0731604b4a05..9e0cdeefb742c2c6284195a22e1b7412d43dcbf7:/README.md?ds=sidebyside diff --git a/README.md b/README.md index a554433..6bb09f5 100644 --- a/README.md +++ b/README.md @@ -3,9 +3,11 @@ title: Software and data for "A Computational Analysis of Social Media Scholarsh output: html_document --- + + > **Authors:** [Jeremy Foote](http://jeremydfoote.com/), [Aaron Shaw](http://aaronshaw.org/), [Benjamin Mako Hill](https://mako.cc/academic/)
> **Archival copies of code and data:**
-> **License:** see [COPYING file](COPYING): code is released under [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html) or any later version for code; paper is released as [CC BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode). +> **License:** see [COPYING file](COPYING): code is released under [GNU GPLv3](https://www.gnu.org/licenses/gpl-3.0.en.html) or any later version; chapter is released as [CC BY-NC-SA](https://creativecommons.org/licenses/by-nc-sa/4.0/legalcode).
@@ -16,17 +18,16 @@ Dramatic increases in large-scale data generated through social media, combined To that end, we wrote a book chapter in the [Sage Handbook of Social Media](https://us.sagepub.com/en-us/nam/the-sage-handbook-of-social-media/book245739) in which we obtain a large-scale dataset of metadata about social media research papers which we analyze using a few commonly-used computational methods. This document is designed to tell you exactly how we did that and to walk you through how to reproduce our results and our paper by running the code we wrote. -This document is meant to be read alongside our chapter. You can find the chapter here: +This document is meant to be read alongside our chapter. The rest of this document describes how to download our code and data and how to reproduce the analyses in the chapter on your own computer. You can find and cite the chapter at the following:
> Foote, Jeremy D., Aaron Shaw, and Benjamin Mako Hill. 2017. “A Computational Analysis of Social Media Scholarship.” In The SAGE Handbook of Social Media, edited by Jean Burgess, Alice Marwick, and Thomas Poell, 111–34. London, UK: SAGE. [[Official Link](https://uk.sagepub.com/en-gb/eur/the-sage-handbook-of-social-media/book245739)] [[Preprint PDF](http://mako.cc/academic/foote_shaw_hill-computational_analysis_of_social_media.pdf)] -The rest of this document describes how to download our code and data and how to reproduce the analyses in the paper on your own computer. ### Requirements and Expectations We will be as explicit as possible in this document, and try to make it accessible to less-technical readers. However, we do make a few assumptions: -* You have access to and basic familiarity with [a POSIX command line interface](https://en.wikipedia.org/wiki/POSIX). The instructions here are written for and tested using [Debian](https://www.debian.org/) and [Ubuntu](https://www.ubuntu.com/) GNU/Linux. That said, these instructions should work without modification on most Linux systems. Although MacOS users may need to tweak a few things, they should work there, too. Microsoft Windows users will likely need to tweak more things. This is particularly true for the last step—building the paper itself. If you can get a simple example like [this one](https://github.com/yihui/knitr-examples/blob/master/005-latex.Rtex) working, then there's a decent chance you can get the paper to build. +* You have access to and basic familiarity with [a POSIX command line interface](https://en.wikipedia.org/wiki/POSIX). The instructions here are written for and tested using [Debian](https://www.debian.org/) and [Ubuntu](https://www.ubuntu.com/) GNU/Linux. That said, these instructions should work without modification on most Linux systems. Although MacOS users may need to [tweak a few things](MacInstallNotes), they should work there, too. 
Microsoft Windows users will likely need to tweak more things. This is particularly true for the last step—building the paper itself. If you can get a simple example like [this one](https://github.com/yihui/knitr-examples/blob/master/005-latex.Rtex) working, then there's a decent chance you can get the chapter to build. * You have [Python 3.x](https://www.python.org/downloads/) installed. Many users will already have it installed. Debian and Ubuntu users can install it with `apt install python3`. Others can download it from [the Python download page](https://www.python.org/downloads/). * You have [GNU R 3.x](https://www.r-project.org/) installed. Debian and Ubuntu users can install it with `apt install r-base`. Others can install it from [the R homepage](https://www.r-project.org/). In our testing we used GNU R versions 3.3.2 and 3.4.1. * To conduct the bibliometric network analysis, you'll need the [igraph library](http://igraph.org/). To install it on Debian or Ubuntu you can run `apt install libigraph0v5`. @@ -67,38 +68,38 @@ We will be as explicit as possible in this document, and try to make it accessib We have ensured that every piece of software used in this analysis is [free/open source software](https://www.gnu.org/philosophy/free-sw.en.html) which means it is both available at no cost and, like our analysis code, is transparent and inspectable. -### 1. Setting up Your Environment +### 1. Set up Your Environment -You'll want to start by creating a new directory. Download and extract following file from the [the Harvard Dataverse repository for the paper](https://dx.doi.org/10.7910/DVN/W31PH5): +You'll want to start by creating a new directory. 
Then, download and extract the following file from [the Harvard Dataverse repository for the paper](https://dx.doi.org/10.7910/DVN/W31PH5): * [`code_and_paper.tar.gz`](https://dataverse.harvard.edu/file.xhtml?fileId=3107305) (this file is publicly accessible) If this is successful, you should have two subdirectories: `code` and `paper`. -### 2. Getting the Scopus Data +### 2. Get the Scopus Data -The data for this project came from the [Scopus API](https://dev.elsevier.com/). An API ([Application Programming Interface](https://en.wikipedia.org/wiki/Application_programming_interface)) is basically a way for computers to talk to each other directly. In our case, we asked the Scopus API to give us metadata about a set of research papers. In order to replicate the paper, you have two options for getting this metadata. +The data for this project came from the [Scopus API](https://dev.elsevier.com/). An API ([Application Programming Interface](https://en.wikipedia.org/wiki/Application_programming_interface)) is basically a way for computers to talk to each other directly. In our case, we asked the Scopus API to give us metadata about a set of research papers. In order to replicate the analysis from the chapter, you have two options for getting this metadata. -#### 2.1. Option 1: Downloading Collected Data from Harvard Dataverse +#### 2.1. Option 1: Download the Data We Collected from Harvard Dataverse As part of writing this paper, we did the work of downloading the metadata from Scopus and it is available in the [`raw_data.tar.gz`](https://dataverse.harvard.edu/file.xhtml?fileId=3106986) file on the Dataverse page. This dataset includes metadata about all of the papers that include the term "*social media*" in their title, abstract, or keywords, as well as all of the papers which cite these papers, as of February 2017. 
-Scopus won't let us make that dataset publicly accessible on the web so you'll need to request access to it through [the Harvard Dataverse](https://dataverse.harvard.edu/) and we'll ask that you only use the data for the purpose of reproducing this paper. +Scopus won't let us make that dataset publicly accessible on the web, but we can grant you access to it through [the Harvard Dataverse](https://dataverse.harvard.edu/). Just request it through Dataverse. We ask that you only use the data for the purpose of reproducing this chapter. Once you have downloaded the data, unpack it in the same directory where you unpacked the `code_and_paper.tar.gz` file. Now you should have a third subdirectory: `raw_data`. -#### 2.2. Option 2: Getting the Data Yourself +#### 2.2. Option 2: Get the Data from Scopus Yourself If you want to download the data yourself instead of using the `raw_data.tar.gz` file we have prepared, follow the instructions in this section. There are several reasons you might want to do this. For example, you might want to retrieve a new version of the dataset with details of more recent papers. Or you might want to do a similar analysis with a different search term. In order to do this: -1. You must belong to an institution that has access to [Scopus](https://www.scopus.com/). +1. You must belong to an institution that has access to [Scopus](https://www.scopus.com/) or buy your own (absurdly expensive) subscription. 2. You will need to be patient. Scopus has a weekly limit on requests to their API, and it may take multiple weeks to download all of the results. You will also need to get an API key from . In the `code/data_collection` directory, edit the `scopus_api.py` file so that it has your key. The file should look something like `key = 'XXXXXXXXXXXXXXXXX'` where your key replaces the Xs. 
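Before starting a download that may run for weeks, it can help to confirm that the placeholder key was actually replaced. The following is only a minimal sketch and is not part of the repository: the helper name `looks_like_placeholder` is hypothetical, and the only thing it assumes about `scopus_api.py` is the `key = '...'` line described above.

```python
import re


def looks_like_placeholder(key: str) -> bool:
    """Return True if the key is empty or made up only of the letter X."""
    stripped = key.strip()
    return stripped == "" or re.fullmatch(r"[Xx]+", stripped) is not None


# Same placeholder as in the instructions above; a real key would go here.
key = 'XXXXXXXXXXXXXXXXX'

if looks_like_placeholder(key):
    print("scopus_api.py still contains the placeholder key")
```

A check like this says nothing about whether Scopus will accept the key; it only catches the case where the file was never edited.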
-The Scopus API is a little odd, in that you are authenticated based on your key, but your permissions change based on your IP ([authentication documentation](https://dev.elsevier.com/tecdoc_api_authentication.html)). I noticed that I had to be logged into our university VPN (even if I was on campus) in order to have all of the permissions needed to carry this out. +The Scopus API is a little odd, in that you are authenticated based on your key, but your permissions change based on your IP ([authentication documentation](https://dev.elsevier.com/tecdoc_api_authentication.html)). Jeremy noticed that he had to be logged into our Northwestern University VPN (even if he was on campus) in order to have all of the permissions needed to carry this out. Getting the data is a three-step process: @@ -117,11 +118,11 @@ Getting the data is a three-step process: python3 code/data_collection/02_get_cited_by.py -i raw_data/search_results.json -o raw_data/cited_by.json -### 3. Cleaning the Data +### 3. Clean the Data Whichever option you used above, you should now have a `raw_data` subdirectory which contains the files `search_results.json`, `abstracts_and_citations.json`, and `cited_by.json`. -These raw data files are "raw" in the sense that they contain lots of data we are looking for as well as lots of things we won't use in this analysis. They are also in JSON format which is not the best format for bringing data into most of the tools we'll be using for our analysis. As a result, we will clean them up to make them easier to work with in the tools that we'll be using. +These raw data files are "raw" in the sense that they contain lots of data we are looking for as well as lots of things we didn't use in this analysis. They are also in JSON format which is not the best format for most of the tools we'll use for our analysis. As a result, we will clean them up to make them easier to work with. 
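To give a sense of what this kind of cleaning involves, here is a minimal sketch that flattens JSON records into tab-separated text. It is not the code in `code/data_processing`: the function name `records_to_tsv`, the field names, and the sample records are all hypothetical, and the real Scopus records are more deeply nested.

```python
import csv
import io
import json


def records_to_tsv(records, fields):
    """Write one tab-separated row per record, keeping only the named fields."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(fields)
    for record in records:
        # Fields missing from a record become empty cells instead of errors.
        writer.writerow([record.get(field, "") for field in fields])
    return buffer.getvalue()


# Hypothetical stand-in for the contents of a raw JSON file.
raw_json = (
    '[{"id": "p1", "title": "Paper one", "unused": {"nested": 1}},'
    ' {"id": "p2", "title": "Paper two"}]'
)

tsv_text = records_to_tsv(json.loads(raw_json), ["id", "title"])
print(tsv_text)
```

The point of the sketch is the shape of the transformation: nested JSON with many unused fields goes in, and a flat table with only the needed columns comes out, which is easier to load into R.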
All of the following commands should be run from the directory where you downloaded the material. @@ -146,7 +147,7 @@ All of the following commands should be run from the directory where you downloa python3 code/data_processing/03_make_paper_aff_table.py -i raw_data/abstracts_and_citations.json -o processed_data/paper_aff_table.tsv python3 code/data_processing/04_make_paper_subject_table.py -i raw_data/abstracts_and_citations.json -o processed_data/paper_subject_table.tsv -Once we have all the data cleaned and prepared, we are ready to proceed to our analysis. Our general workflow for analysis will be to run our analysis code and then save the output to an RData file in the `paper/data/` subdirectory. We will then import these RData files into the paper to create figures and tables. +Once we have all the data cleaned and prepared, we are ready to proceed to our analysis. Our general workflow for analysis will be to run our analysis code and then save the output to an RData file in the `paper/data/` subdirectory. We will then import these RData files into the chapter to create figures and tables. Doing this will require two final steps: @@ -154,7 +155,7 @@ Doing this will require two final steps: mkdir paper/data -7. Before we get to the analysis, though, we'll save some portions of the processed datasets to the `paper/data` subdirectory by running the following command, which will help us access the processed data directly from our paper in order to report descriptive statistics: +7. Before we get to the analysis, though, we'll save some portions of the processed datasets to the `paper/data` subdirectory by running the following command, which will help us access the processed data directly from our chapter in order to report descriptive statistics: Rscript code/data_processing/05_save_descriptives.R @@ -164,7 +165,7 @@ Doing this will require two final steps: The code used for our bibliometric analysis is contained within the `code/bibliometrics/` subdirectory. 
-We've included two copies of our Python code for our bibliometric analysis in the files `00_citation_network_analysis.py` and `00_citation_network_analysis.ipynb`. We will describe using the former in this section. If you have [Juypter](https://jupyter.org/) installed you can open the file in a a notebook format used by many scientists by running `jupyter-notebook citation_network_analysis.ipynb`. If you want to try Jupyter, Debian and Ubuntu users can install it with `apt install jupyter-notebook` and other users can download it [here](https://jupyter.org/install.html). +We've included two copies of our Python code for our bibliometric analysis in the files `00_citation_network_analysis.py` and `00_citation_network_analysis.ipynb`. We will describe using the former in this section. If you have [Jupyter](https://jupyter.org/) installed, you can open the latter in a notebook format used by many scientists by running `jupyter-notebook 00_citation_network_analysis.ipynb`. If you want to try Jupyter, Debian and Ubuntu users can install it with `apt install jupyter-notebook` and other users can download it [here](https://jupyter.org/install.html). Our bibliometric analysis code does require one additional piece of software called [Infomap](http://www.mapequation.org/) which we use to identify clusters in our citation network. There are some [instructions online](https://github.com/mapequation/infomap) but you can download and install it with the following commands run from the `code/bibliometrics` subdirectory: @@ -181,56 +182,56 @@ This will save the output to `paper/data/network_data.RData`. This data is used #### 4.2 Network Visualizations -If you want to create our two network diagrams, you'll need one additional piece of software: [Gephi, the Open Graph Viz Platform](https://gephi.org/). Figure 2 in the paper is a "hairball" network graph of the citation network in our dataset. 
Like all the software used here, it is free/open source software and available for [download](https://gephi.org/users/download/). +If you want to create our two network diagrams, you'll need one additional piece of software: [Gephi, the Open Graph Viz Platform](https://gephi.org/). We used it to generate Figure 2 in the chapter, a "hairball" network graph of the citation network in our dataset. Like all the software used here, it is free/open source software and available for [download](https://gephi.org/users/download/). Gephi is a graphical and interactive tool so, unlike the rest of our analysis, it will take some clicking around to reproduce our graph exactly. -We created the the big network "hairball" in Figure 2 by starting in the following way: +We created Figure 2 in the following way: -1. Opening the file `code/bibliometrics/g_sm.graphml` -2. On the top part of the left sidebar, clicking on: *Appearance*, *Node*, the color palette icon, *Partition*, selecting "cluster" from the drop-down, and then *▶Apply*. -3. In the *Layout* tab, Selecting "Fruchterman Reingold" and then clicking *Run*. (This will take a really long time!) +1. Open the file `code/bibliometrics/g_sm.graphml` +2. On the top part of the left sidebar, click on: *Appearance*, *Node*, the color palette icon, *Partition*, select "cluster" from the drop-down, and then *▶Apply*. +3. In the *Layout* tab, Select "Fruchterman Reingold" and then click *Run*. (This will take a really long time!) -To create the small network cluster in Figure 3 we began in the following way: +To create the small network cluster in Figure 3: -1. Opening the file `code/bibliometrics/cluster.graphml` -2. In the *Layout* tab, selecting "Fruchterman Reingold" and then clicking *Run*. -3. Clicking on: *Appearance*, *Node*, the color palette icon, *Partition*, selecting "name" from the drop-down, and then clicking *▶Apply*. -4. 
Clicking on: *Appearance*, *Node*, the size icon concentric circles, *Ranking*, selecting "papers" from the drop-down, fiddling with the min and max sizes, and then clicking *▶Apply*. +1. Open the file `code/bibliometrics/cluster.graphml` +2. In the *Layout* tab, select "Fruchterman Reingold" and then click *Run*. +3. Click on: *Appearance*, *Node*, the color palette icon, *Partition*, select "name" from the drop-down, and then click *▶Apply*. +4. Click on: *Appearance*, *Node*, the size icon (concentric circles), *Ranking*, select "papers" from the drop-down, fiddle with the min and max sizes, and then click *▶Apply*. Getting things just right will take some fiddling! We've included our Gephi files (`code/bibliometrics/clusters.gephi` and `code/bibliometrics/g_sm.gephi`) which are finished or mostly finished versions that you can open up and use if you'd like. ### 5. Topic Modeling Analysis -The topic in the `code/topic_modeling` directory applies [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) topic modeling to the social media abstracts. +The code in the `code/topic_modeling` directory applies [latent Dirichlet allocation](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) (LDA) topic modeling to the social media abstracts. LDA takes in a set of documents and produces a set of topics and a distribution of topics for each document. The first file takes in the abstract file and creates two outputs: the abstracts together with their topic distributions, and a set of topics with the top words associated with each. Our topic modeling analysis includes the following steps: -1. Running the following Python script that extracts our topics: +1. Run the following Python script that extracts our topics: python3 code/topic_modeling/00_topics_extraction.py > Note that this will take some time (5-10 minutes on a decent laptop). -2. 
Running a second file which (a) makes a couple of tables of the top words for each topic and (b) generates some summary statistics for how the topics change over time. These statistics are used, e.g., to create Figure 4. You can run this with: +2. Run a second file which (a) makes a couple of tables of the top words for each topic and (b) generates some summary statistics for how the topics change over time. These statistics are used, e.g., to create Figure 4. You can run this with: python3 code/topic_modeling/01_make_paper_files.py -Topic modeling is a stochastic process, and you may notice differences—potentially large differences—between the results in our paper and the results when you run it. If the order of topics has changed (or if you think other labels would be appropriate for some topics), then you can adjust the topic names by editing the `topic_names` list in the `01_make_papers.py` file. +Topic modeling is a stochastic process, and you may notice differences—potentially large differences—between the results in our chapter and the results when you run it. If the order of topics has changed (or if you think other labels would be appropriate for some topics), then you can adjust the topic names by editing the `topic_names` list in the `01_make_paper_files.py` file. ### 6. Prediction Analysis For the prediction analysis, we use features of the papers to predict whether or not a paper gets cited. -These commands require a computer with a large amount of memory (i.e., RAM). We had trouble running this step on our laptops which did not have enough. If you don't have access to such a computer, then you can change the `n_features` variable in the `00_ngram_extraction.py` file from `100000` to something like `3000`. This will change how many terms are included in the prediction analysis, but shouldn't make an important difference in the results. +These commands require a computer with a large amount of memory (i.e., RAM). 
We had trouble running these steps on our laptops which did not have enough memory. If you don't have access to such a computer, then you can change the `n_features` variable in the `00_ngram_extraction.py` file from `100000` to something like `3000`. This will change how many terms are included in the prediction analysis, but shouldn't make an important difference in the results. We ran the following steps: -1. Because one of the features we use is the text of the abstracts, we start by getting uni-, bi-, and tri-grams from the abstracts. This is done with the `00_ngram_extraction` python file: +1. Because one of the features we use is the text of the abstracts, we start by getting uni-, bi-, and tri-grams from the abstracts. This is done with the `00_ngram_extraction.py` python file: python3 code/prediction/00_ngram_extraction.py @@ -256,12 +257,11 @@ We ran the following steps: Each of these final two steps will take a long time to run. -### 7. Building the Paper - +### 7. Build the Chapter -Now we should have all of the data that we need sitting in the `paper/data` directory. We use a really neat package called [knitr](https://yihui.name/knitr/) to load and manipulate data directly in the document. Nearly all of the tables and figures in the document are created using code that is in the document, and knitr is the magic behind the scenes that makes that possible. +Now we should have all of the data that we need sitting in the `paper/data` directory. We use a package called [knitr](https://yihui.name/knitr/) to load and manipulate data directly in the chapter. With the exception of the Gephi graphs, all of the tables and figures in the chapter are created using code that is in the same file where we write the text of the chapter. The knitr package is the magic behind the scenes that makes that possible. -To build the paper, we will be using a [Makefile](https://www.gnu.org/software/make/manual/make.html) which requires a set of utilities to run. 
This is the part that might get a little tricky for Windows users. +To build the chapter, we will be using a [Makefile](https://www.gnu.org/software/make/manual/make.html) which requires a set of utilities to run. This is the part that might get a little tricky for Windows users. You can install the packages you'll need from Debian or Ubuntu with the following command: @@ -271,19 +271,18 @@ After this, you should change to the `paper` directory and simply run the comman This will produce a quickly scrolling output to standard out, and if everything has worked, then in the end it will produce a bunch of files in the `paper` directory, one of which will be the final PDF file! -### Errors, Improvements, and Updates +An alternative approach that does not involve installing software is to upload the entire paper subdirectory (including the paper/data subdirectory) as a paper repository to the service [ShareLaTeX](https://www.sharelatex.com/). In order to make it work, you'll also need to rename the file ending with `.Rnw` to `.Rtex`. We spent some of the time writing our paper using ShareLaTeX so this should work. -Although we have tried to make this document as clear as possible and although we have tested it carefully ourselves, there are many ways it might fail. You might have a different software environment or the software dependencies we have used might have [bit rot](https://en.wikipedia.org/wiki/Software_rot) over time in a way that leads to things breaking. +### Help us Find Errors, Improvements, and Updates + +Although we have tried to make this document as clear as possible and although we have tested it ourselves, there are many ways it might fail. You might have a different software environment or the software dependencies we have used might have [bit rot](https://en.wikipedia.org/wiki/Software_rot) over time in a way that leads to things breaking. 
You are welcome to get in touch with us if you have questions and we have provided our webpages and emails here for that purpose: * [Jeremy Foote](http://jeremydfoote.com/) <> -* [Aaron Shaw](http://aaronshaw.org/) <> +* [Aaron Shaw](http://aaronshaw.org/) <> * [Benjamin Mako Hill](https://mako.cc/academic/) <> -If you can fix issues you run into, find ways to clarify our instructions, or make fixes to our code, please tell us! - -In addition to the [archival version in the Dataverse](https://dx.doi.org/10.7910/DVN/W31PH5), we have hosted our code in a [git revision control management](https://git-scm.com/) repository here: -Even the page you are reading is included in our repository. If you notice any typos or errors, please fix it and send us your fix! +If you can fix issues you run into, find ways to clarify our instructions, or make fixes to our code, please tell us! We'd love to add helpful improvements to these materials. -Please feel to follow the [instructions we have posted](https://code.communitydata.cc/) for sending updated versions of our code or documentation so that others trying to replicate and learn from our work can benefit from your work and improvements! +In addition to the [archival version in the Dataverse](https://dx.doi.org/10.7910/DVN/W31PH5), we have hosted our code in a [git revision control management](https://git-scm.com/) repository here: Even the page you are reading is included in our repository. If you notice any typos or errors, please send us your fix! To do so, you can follow the [instructions we have posted](https://code.communitydata.cc/) for sending updated versions of our code or documentation. Your work and improvements can help others trying to replicate and learn from this project.