code.communitydata.science - cdsc_reddit.git/log

Nate E TeBlunthuis [Tue, 7 Jul 2020 18:45:43 +0000 (11:45 -0700)]

Build comments dataset similarly to submissions and improve partitioning scheme

Nate E TeBlunthuis [Tue, 7 Jul 2020 07:58:26 +0000 (00:58 -0700)]

update .gitignore

Nate E TeBlunthuis [Tue, 7 Jul 2020 07:57:05 +0000 (00:57 -0700)]

Script for example of streaming pyarrow.

Nate E TeBlunthuis [Tue, 7 Jul 2020 07:51:40 +0000 (00:51 -0700)]

Script to demonstrate reading parquet.

Nate E TeBlunthuis [Tue, 7 Jul 2020 06:31:52 +0000 (23:31 -0700)]

Check the shas when we download dumps
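
A checksum check of this kind could look roughly like the sketch below. It is not the repository's script: the use of SHA-256, the `<digest>  <filename>` listing format, and the helper names `sha256_of_file`, `parse_sha_listing`, and `verify_dump` are all assumptions made for illustration.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large dumps never have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_sha_listing(text):
    """Parse lines of the (assumed) form '<hex digest>  <filename>' into a dict."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            expected[name] = digest.lower()
    return expected


def verify_dump(path, name, expected):
    """Return True only if the file's digest matches the listed one for `name`."""
    return expected.get(name) == sha256_of_file(path)
```

A file whose name is missing from the listing fails verification rather than passing silently, which is the safer default when deciding whether to re-download.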

Nate E TeBlunthuis [Tue, 7 Jul 2020 06:27:18 +0000 (23:27 -0700)]

Script to run both parts of submissions_2_parquet.sh

Nate E TeBlunthuis [Tue, 7 Jul 2020 05:30:04 +0000 (22:30 -0700)]

Cache before sorting so we don't extract twice.

Nate E TeBlunthuis [Tue, 7 Jul 2020 05:26:29 +0000 (22:26 -0700)]

Move the spark part of submissions_2_parquet to a separate script.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:32:00 +0000 (23:32 -0700)]

Fix whitespace at top of file.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:27:18 +0000 (23:27 -0700)]

Secondary sort for the by_author dataset should be CreatedAt.
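
The ordering this commit describes can be illustrated in plain Python. This is a stand-in for the Spark job, not the repository's code: the sample rows and the `sort_by_author` helper are hypothetical, and only the `author` and `CreatedAt` field names come from the commit messages.

```python
from operator import itemgetter

# Hypothetical sample rows; real data comes from the pushshift dumps.
rows = [
    {"author": "bob", "CreatedAt": 30, "id": "c"},
    {"author": "alice", "CreatedAt": 20, "id": "b"},
    {"author": "bob", "CreatedAt": 10, "id": "a"},
    {"author": "alice", "CreatedAt": 5, "id": "d"},
]


def sort_by_author(rows):
    """Sort by author first, then by CreatedAt within each author."""
    return sorted(rows, key=itemgetter("author", "CreatedAt"))


ordered = sort_by_author(rows)
```

With this ordering, each author's rows are contiguous and already in chronological order, so per-author time series can be read without a further sort.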

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:24:40 +0000 (23:24 -0700)]

Create a second dataset sorted by author.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:20:17 +0000 (23:20 -0700)]

Create parquet datasets of reddit submissions from pushshift.

Nate E TeBlunthuis [Fri, 3 Jul 2020 21:00:36 +0000 (14:00 -0700)]

Rename spark script to reflect that it is for comments.

Nate E TeBlunthuis [Fri, 3 Jul 2020 20:55:25 +0000 (13:55 -0700)]

update .gitignore

Nate E TeBlunthuis [Fri, 3 Jul 2020 20:54:55 +0000 (13:54 -0700)]

bugfix in retrieving old data and rename file.

Nate E TeBlunthuis [Fri, 3 Jul 2020 20:35:46 +0000 (13:35 -0700)]

Script for checking shas for submissions.

Nate E TeBlunthuis [Fri, 3 Jul 2020 18:38:43 +0000 (11:38 -0700)]

Bugfix: use timestamp types

Also change the canonical file path.

Nate E TeBlunthuis [Fri, 3 Jul 2020 17:41:13 +0000 (10:41 -0700)]

update the reddit comment dumps

Nate E TeBlunthuis [Fri, 3 Jul 2020 17:40:43 +0000 (10:40 -0700)]

Don't clobber old dumps so that we can just download the new ones.

Nate E TeBlunthuis [Fri, 3 Jul 2020 00:40:17 +0000 (17:40 -0700)]

script for getting submissions dumps from pushshift.

Nate E TeBlunthuis [Thu, 2 Jul 2020 21:06:36 +0000 (14:06 -0700)]

Extract variables from pushshift comment to parquet

A spark script

Building parquet tables from pushshift reddit dumps.
