changes to a bunch of the wikipedia view code
author Benjamin Mako Hill <mako@atdot.cc>
Wed, 1 Apr 2020 14:15:12 +0000 (07:15 -0700)
committer Benjamin Mako Hill <mako@atdot.cc>
Wed, 1 Apr 2020 14:15:12 +0000 (07:15 -0700)
commit 38fdd07b39f63de88dd985787eb2ac3a5866670c
tree 617b27129d26a00dfe7dd03c558a53b953da5404
parent 72bf7bcd3787ffbda4ec2c47204896483e8069c9

- Renamed articles.txt to the more specific
  enwp_wikiproject_covid19_articles.txt

Changes to both scripts:

- Updated filenames to match the new standard
- Reworked the logging code so that it writes to stderr by
  default. Because we can only call logging.basicConfig() once, this
  ended up being a bigger change.
- Made the scripts output the git commit and export details so that
  we can track which code produced which dataset.
- Made the programs take output files instead of output
  directories (this allows us to run them more than once a day).
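
As an illustration of the changes above, here is a minimal sketch of how
a script might configure logging exactly once (stderr by default), take
an output file rather than a directory, and look up the current git
commit. The flag names and function names are hypothetical and are not
taken from the actual scripts.

```python
import argparse
import logging
import subprocess
import sys


def parse_args(argv=None):
    # Hypothetical flag names; the real scripts may use different ones.
    parser = argparse.ArgumentParser(description="Illustrative fetch script.")
    parser.add_argument("-o", "--output_file", default=None,
                        help="Output file (stdout if omitted).")
    parser.add_argument("-L", "--logging_level", default="info",
                        help="Logging level name, e.g. debug or info.")
    parser.add_argument("-W", "--logging_destination", default=None,
                        help="Log file (stderr if omitted).")
    return parser.parse_args(argv)


def configure_logging(level_name, destination=None):
    # Resolve the level name (e.g. "info") to its numeric constant.
    level = getattr(logging, level_name.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"unknown logging level: {level_name}")
    # basicConfig() does nothing once the root logger has handlers, so
    # all options are collected first and it is called a single time.
    kwargs = {"level": level}
    if destination:
        kwargs["filename"] = destination
    else:
        kwargs["stream"] = sys.stderr
    logging.basicConfig(**kwargs)


def current_git_commit():
    # Ask git for the hash of HEAD; fall back to "unknown" when git is
    # missing or the working directory is not a repository.
    try:
        out = subprocess.check_output(["git", "rev-parse", "HEAD"],
                                      stderr=subprocess.DEVNULL)
    except (OSError, subprocess.CalledProcessError):
        return "unknown"
    return out.decode().strip()
```

A script would call parse_args() and configure_logging() once at
startup and then log something like
logging.info("Last commit: %s", current_git_commit()) so that each
dataset can be matched to the code that produced it.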

Changes to wikipedia_views/scripts/fetch_daily_views.py:

- Changed the output to a sequence of JSON dictionaries (one
  per line), as per the standard we agreed to, which is also what
  Twitter, GitHub, and other dumps do. The previous behavior was to
  output a single JSON list object.
- A number of other small changes and tweaks throughout.
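
To make the difference between the two output formats concrete, here is
a minimal sketch that writes the same records both ways. The function
names and the record fields ("article", "views") are made up for
illustration and are not taken from fetch_daily_views.py.

```python
import io
import json


def write_json_lines(records, out):
    # New behavior: one JSON dictionary per line.
    for record in records:
        out.write(json.dumps(record) + "\n")


def write_json_list(records, out):
    # Previous behavior: a single JSON list holding every record.
    json.dump(list(records), out)


def to_json_lines(records):
    # Convenience wrapper that returns the line-based output as a string.
    buf = io.StringIO()
    write_json_lines(records, buf)
    return buf.getvalue()


def read_json_lines(text):
    # Each non-empty line parses on its own, so a reader can stream the
    # file without loading the whole dataset into memory.
    return [json.loads(line) for line in text.splitlines() if line.strip()]
```

One practical benefit of the line-based format is that output can be
appended to or processed line by line, whereas a single list has to be
parsed as a whole.
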
wikipedia_views/resources/enwp_wikiproject_covid19_articles.txt [moved from wikipedia_views/resources/articles.txt with 86% similarity]
wikipedia_views/scripts/fetch_daily_views.py
wikipedia_views/scripts/wikiproject_scraper.py
