code.communitydata.science - cdsc_reddit.git/log
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:32:00 +0000 (23:32 -0700)]
Fix whitespace at top of file.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:27:18 +0000 (23:27 -0700)]
Secondary sort for the by_author dataset should be CreatedAt.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:24:40 +0000 (23:24 -0700)]
Create a second dataset sorted by author.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:20:17 +0000 (23:20 -0700)]
Create parquet datasets of reddit submissions from pushshift.
Nate E TeBlunthuis [Fri, 3 Jul 2020 21:00:36 +0000 (14:00 -0700)]
Rename spark script to reflect that it is for comments.
Nate E TeBlunthuis [Fri, 3 Jul 2020 20:55:25 +0000 (13:55 -0700)]
update .gitignore
Nate E TeBlunthuis [Fri, 3 Jul 2020 20:54:55 +0000 (13:54 -0700)]
bugfix in retrieving old data and rename file.
Nate E TeBlunthuis [Fri, 3 Jul 2020 20:35:46 +0000 (13:35 -0700)]
Script for checking shas for submissions.
Nate E TeBlunthuis [Fri, 3 Jul 2020 18:38:43 +0000 (11:38 -0700)]
Bugfix: use timestamp types
Also change the canonical file path.
Nate E TeBlunthuis [Fri, 3 Jul 2020 17:41:13 +0000 (10:41 -0700)]
update the reddit comment dumps
Nate E TeBlunthuis [Fri, 3 Jul 2020 17:40:43 +0000 (10:40 -0700)]
Don't clobber old dumps so that we can just download the new ones.
Nate E TeBlunthuis [Fri, 3 Jul 2020 00:40:17 +0000 (17:40 -0700)]
script for getting submissions dumps from pushshift.
Nate E TeBlunthuis [Thu, 2 Jul 2020 21:06:36 +0000 (14:06 -0700)]
Extract variables from pushshift comment to parquet
A spark script