Copyright (C)  2018  Nathan TeBlunthuis.
Permission is granted to copy, distribute and/or modify this document
under the terms of the GNU Free Documentation License, Version 1.3
or any later version published by the Free Software Foundation;
with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.
A copy of the license is included in the file entitled "fdl-1.3.md".

# Replication data for "Revisiting 'The Rise and Decline' in a Population of Peer Production Projects" #

## Overview ##

This archive contains code and data for reproducing the analysis in
"Revisiting 'The Rise and Decline' in a Population of Peer Production
Projects". Depending on what you hope to do with the data, you
probably do not want to download all of the files. Depending on your
computational resources, you may not be able to run all stages of the
analysis.

The code for all stages of the analysis, including typesetting the
manuscript and running the analysis, is in code.tar.

If you only want to run the final analysis or to play with datasets
used in the analysis of the paper, you want intermediate_data.7z or
the uncompressed tab and csv files.

The data files are created in a four-stage process. The first stage
uses the program "wikiq" to create tsv files that have edit data for
each wiki, extracted from the mediawiki xml dumps. The second stage
generates the all.edits.RDS file, which contains edit metadata from
these tsv files. This file is expensive to generate and, at 1.5GB, is
pretty big. The third stage builds smaller intermediate files that
contain the analytical variables from all.edits.RDS. The fourth stage
uses the intermediate files to generate smaller RDS files that contain
the results. Finally, knitr and latex typeset the manuscript.

A stage runs only if its outputs do not already exist. So if the
intermediate files exist, they will not be regenerated and only the
final analysis will run. The exception is stage 4, fitting models and
generating plots, which always runs.
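The skip-if-present rule can be pictured with a short shell sketch. The `run_stage` helper below is hypothetical and is not part of `code.tar`; the actual scripts may implement the check differently. It only illustrates the idea of running a command when its expected output file is missing.

```shell
# Hypothetical helper (not part of code.tar): run a command only when
# the output file it is expected to produce does not exist yet.
run_stage() {
    output="$1"   # file the stage is expected to create
    shift         # the remaining arguments are the command to run
    if [ -e "$output" ]; then
        echo "skip: $output already exists"
    else
        echo "run: building $output"
        "$@"
    fi
}

# Example usage (illustrative only; how the scripts are invoked is an assumption):
#   run_stage all.edits.RDS Rscript 01_build_datasets.R
```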

If you only want to replicate from the second stage onward, you want
wikiq_tsvs.7z. If you want to replicate everything, you want
wikia_mediawiki_xml_dumps.7z.001 and wikia_mediawiki_xml_dumps.7z.002.

These instructions work backwards from building the manuscript using
knitr, loading the datasets, running the analysis, to building the
intermediate datasets.

## Building the manuscript using knitr ##
This requires working latex, latexmk, and knitr
installations. Depending on your operating system you might install
these packages in different ways. On Debian Linux you can run `apt
install r-cran-knitr latexmk texlive-latex-extra`. Alternatively, you
can upload the necessary files to a project on Sharelatex.com or
Overleaf.com.

1. Download `code.tar`. This has everything you need to typeset the manuscript.
2. Unpack the tar archive. On a unix system this can be done by running `tar xf code.tar`.
3. Navigate to `code/paper_source`.
4. Install R dependencies. In R, run `install.packages(c("data.table","scales","ggplot2","lubridate","texreg"))`.
5. On a unix system you should be able to run `make` to build the
   manuscript `generalizable_wiki.pdf`. Otherwise you should try
   uploading all of the files (including the tables, figure, and knitr
   folders) to a new project on ShareLatex.com.
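Before running `make`, it can help to confirm that the command-line tools are on your `PATH`. The following is a minimal sketch using the standard `command -v` shell builtin. The `check_tools` helper is hypothetical, and the tool names in the usage comment (`pdflatex`, `latexmk`, `Rscript`, `make`) are assumptions about what a typical setup provides. It does not check for the knitr R package itself.

```shell
# Hypothetical helper: report which of the named command-line tools
# can be found on PATH. Returns 0 when all are found and 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Example usage (tool names are assumptions, adjust for your system):
#   check_tools pdflatex latexmk Rscript make
```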

## Loading intermediate datasets ##
The intermediate datasets are found in the `intermediate_data.7z`
archive. They can be extracted on a unix system using the command `7z
x intermediate_data.7z`. The files are 95MB uncompressed. These are
RDS (R data set) files and can be loaded in R using the `readRDS`
function. For example, `newcomer.ds <- readRDS("newcomers.RDS")`. If
you wish to work with these datasets using a tool other than R, you
might prefer to work with the .tab files.
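If you take the .tab route, a quick look from the shell can show what a file contains before you load it elsewhere. This sketch assumes the .tab files are tab-separated with a header row, which this README does not state. The helpers `tab_columns` and `tab_rows` and the file name `newcomers.tab` in the usage comment are hypothetical.

```shell
# Hypothetical helper: list the column names of a tab-separated file,
# one per line. Assumes the first line of the file is a header row.
tab_columns() {
    head -n 1 "$1" | tr '\t' '\n'
}

# Hypothetical helper: count the data rows, that is, every line after
# the assumed header row.
tab_rows() {
    tail -n +2 "$1" | wc -l | tr -d ' '
}

# Example usage (the file name is an assumption):
#   tab_columns newcomers.tab
#   tab_rows newcomers.tab
```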

## Running the analysis ##

Fitting the models may not work on machines with less than 32GB of
RAM. If you have trouble, you may find the functions in
lib-01-sample-datasets.R useful to create stratified samples of data
for fitting models. See line 89 of 02_model_newcomer_survival.R for an
example.
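On Linux, you can get a rough idea of whether a machine meets the 32GB guideline by reading `/proc/meminfo`. The `total_ram_gb` helper below is a hypothetical sketch, not part of `code.tar`, and it works only where a meminfo-style file is available. The reported total is usually slightly below the installed amount, so treat the result as approximate.

```shell
# Hypothetical helper: print total memory in whole GB (rounded down),
# read from a meminfo-style file. Defaults to /proc/meminfo, which
# exists on Linux only. The MemTotal value is reported in kB.
total_ram_gb() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1048576; exit }' "${1:-/proc/meminfo}"
}

# Example usage:
#   total_ram_gb
```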

1. Download `code.tar` and `intermediate_data.7z` to your working
   folder and extract both archives. On a unix system this can be done
   with the command `tar xf code.tar && 7z x intermediate_data.7z`.
2. Install R dependencies. In R, run
   `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. On a unix system you can simply run `regen.all.sh` to fit the
   models, build the plots and create the RDS files.

## Generating datasets ##

### Building the intermediate files ###
The intermediate files are generated from all.edits.RDS. This process requires about 20GB of memory.

1. Download `all.edits.RDS`, `userroles_data.7z`, `selected.wikis.csv`,
   and `code.tar`. Unpack `code.tar` and `userroles_data.7z`. On a
   unix system this can be done using `tar xf code.tar && 7z x
   userroles_data.7z`.
2. Install R dependencies. In R, run
   `install.packages(c("data.table","ggplot2","urltools","texreg","optimx","lme4","bootstrap","scales","effects","lubridate","devtools","roxygen2"))`.
3. Run `01_build_datasets.R`.

### Building all.edits.RDS ###

The intermediate RDS files used in the analysis are created from
`all.edits.RDS`, which is in turn generated from the tsv files
produced by wikiq. To replicate building `all.edits.RDS`, you only
need to run `01_build_datasets.R` when neither the intermediate RDS
files nor `all.edits.RDS` exist in the working directory. This may
take several hours. By default, building the dataset will use all
available CPU cores. If you want to change this, modify line 26 of
`lib-01-build_newcomer_table.R`.

1. Download `selected.wikis.csv`, `userroles_data.7z`, `wikiq_tsvs.7z`, and
   `code.tar`. Unpack the archives. On a unix system this can be done by
   running `7z x userroles_data.7z && 7z x wikiq_tsvs.7z && tar xf
   code.tar`.
2. Run `01_build_datasets.R` to generate `all.edits.RDS` and the intermediate files.

### Running Wikiq to generate tsv files ###
If you want to regenerate the datasets all the way from the xml dumps
and data from the Wikia api you will have to run the python script
`wikiq`. This is a fairly computationally intensive process. It may
take over a day unless you can run the computations in parallel.

1. Download `code.tar`, `wikia_mediawiki_xml_dumps.7z.001`,
   `wikia_mediawiki_xml_dumps.7z.002`, and
   `userroles_data.7z`. Extract the archives. On a unix system this
   can be done by running `tar xf code.tar && 7z x
   wikia_mediawiki_xml_dumps.7z.001 && 7z x userroles_data.7z`.
2. Make sure python3 and python3-pip are installed, then use pip3 to install `argparse`: `pip3 install argparse`.
3. Edit `runwikiq.sh` to set N_THREADS.
4. Run `runwikiq.sh` to generate the tsv files.
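Because the xml dumps are split into two volumes, extraction needs both parts in place. `7z` reads the remaining volumes itself when given the `.001` file, which is why the extraction command in step 1 names only the first one. The `check_volumes` helper below is a hypothetical sketch, not part of `code.tar`, that confirms the parts exist and are non-empty before a long extraction starts.

```shell
# Hypothetical helper: check that every listed archive volume exists
# and is non-empty. Returns 0 when all are present and 1 otherwise.
check_volumes() {
    rc=0
    for part in "$@"; do
        if [ ! -s "$part" ]; then
            echo "missing or empty: $part"
            rc=1
        fi
    done
    return "$rc"
}

# Example usage: extract only when both volumes are in place.
#   check_volumes wikia_mediawiki_xml_dumps.7z.001 \
#                 wikia_mediawiki_xml_dumps.7z.002 \
#       && 7z x wikia_mediawiki_xml_dumps.7z.001
```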

### Obtaining Bot and Admin data from the Wikia API ###
For the purposes of supporting an audit of our research project, this
repository includes the code that we used to obtain Bot and Admin data
from the Wikia API. Unfortunately, since we ran the script, the API
has changed and this code no longer works.

Our research group maintains a tool for scraping the Wikia API
available at https://code.communitydata.cc/wikia_userroles_scraper. This can
be used to download user roles for the wikis in this dataset. Follow
the instructions found in that package.