code.communitydata.science - cdsc_reddit.git/log
cdsc_reddit.git
4 years ago Compute IDF for terms and authors.
Nate E TeBlunthuis [Sun, 23 Aug 2020 18:57:55 +0000 (11:57 -0700)]
Compute IDF for terms and authors.

4 years ago Update submissions to parse using the backfill queue.
Nate E TeBlunthuis [Wed, 12 Aug 2020 05:37:36 +0000 (22:37 -0700)]
Update submissions to parse using the backfill queue.

4 years ago bugfix in checking submission shas
Nate E TeBlunthuis [Tue, 11 Aug 2020 21:21:54 +0000 (14:21 -0700)]
bugfix in checking submission shas

4 years ago Use multiword expressions in tf.
Nate E TeBlunthuis [Mon, 10 Aug 2020 23:57:46 +0000 (16:57 -0700)]
Use multiword expressions in tf.

4 years ago Finish generating multiword expressions.
Nate E TeBlunthuis [Mon, 10 Aug 2020 05:42:23 +0000 (22:42 -0700)]
Finish generating multiword expressions.

4 years ago Bugfix
Nate E TeBlunthuis [Sun, 9 Aug 2020 09:34:42 +0000 (02:34 -0700)]
Bugfix

4 years ago Use groupby - joins instead of windows
Nate E TeBlunthuis [Sun, 9 Aug 2020 07:21:50 +0000 (00:21 -0700)]
Use groupby - joins instead of windows

4 years ago rename tf_comments part 2.
Nate E TeBlunthuis [Tue, 4 Aug 2020 20:39:49 +0000 (13:39 -0700)]
rename tf_comments part 2.

4 years ago rename tf_reddit_comments.py step1.
Nate E TeBlunthuis [Tue, 4 Aug 2020 20:39:20 +0000 (13:39 -0700)]
rename tf_reddit_comments.py step1.

4 years ago Improve tokenization following data. Generate author counts.
Nate E TeBlunthuis [Tue, 4 Aug 2020 20:24:37 +0000 (13:24 -0700)]
Improve tokenization following data. Generate author counts.

4 years ago improve tokenizer.
Nate E TeBlunthuis [Tue, 4 Aug 2020 05:55:10 +0000 (22:55 -0700)]
improve tokenizer.

4 years ago TF reddit comments.
Nate E TeBlunthuis [Tue, 4 Aug 2020 05:43:57 +0000 (22:43 -0700)]
TF reddit comments.

4 years ago code to sort tf
Nate E TeBlunthuis [Tue, 4 Aug 2020 00:56:36 +0000 (17:56 -0700)]
code to sort tf

4 years ago remove is_submitter field from submissions which doesn't exist.
Nate E TeBlunthuis [Fri, 10 Jul 2020 00:12:14 +0000 (17:12 -0700)]
remove is_submitter field from submissions which doesn't exist.

4 years ago Bugfixes in scripts.
Nate E TeBlunthuis [Wed, 8 Jul 2020 06:29:36 +0000 (23:29 -0700)]
Bugfixes in scripts.

4 years ago clean up comments in streaming example.
Nate E TeBlunthuis [Tue, 7 Jul 2020 19:28:57 +0000 (12:28 -0700)]
clean up comments in streaming example.

4 years ago update .gitignore
Nate E TeBlunthuis [Tue, 7 Jul 2020 19:28:44 +0000 (12:28 -0700)]
update .gitignore

4 years ago update examples with working streaming
Nate E TeBlunthuis [Tue, 7 Jul 2020 18:47:17 +0000 (11:47 -0700)]
update examples with working streaming

4 years ago Build comments dataset similarly to submissions and improve partitioning scheme
Nate E TeBlunthuis [Tue, 7 Jul 2020 18:45:43 +0000 (11:45 -0700)]
Build comments dataset similarly to submissions and improve partitioning scheme

4 years ago update .gitignore
Nate E TeBlunthuis [Tue, 7 Jul 2020 07:58:26 +0000 (00:58 -0700)]
update .gitignore

4 years ago Script for example of streaming pyarrow.
Nate E TeBlunthuis [Tue, 7 Jul 2020 07:57:05 +0000 (00:57 -0700)]
Script for example of streaming pyarrow.

4 years ago Script to demonstrate reading parquet.
Nate E TeBlunthuis [Tue, 7 Jul 2020 07:51:40 +0000 (00:51 -0700)]
Script to demonstrate reading parquet.

4 years ago Check the shas when we download dumps
Nate E TeBlunthuis [Tue, 7 Jul 2020 06:31:52 +0000 (23:31 -0700)]
Check the shas when we download dumps

4 years ago Script to run both parts of submissions_2_parquet.sh
Nate E TeBlunthuis [Tue, 7 Jul 2020 06:27:18 +0000 (23:27 -0700)]
Script to run both parts of submissions_2_parquet.sh

4 years ago Cache before sorting so we don't extract twice.
Nate E TeBlunthuis [Tue, 7 Jul 2020 05:30:04 +0000 (22:30 -0700)]
Cache before sorting so we don't extract twice.

4 years ago Move the spark part of submissions_2_parquet to a separate script.
Nate E TeBlunthuis [Tue, 7 Jul 2020 05:26:29 +0000 (22:26 -0700)]
Move the spark part of submissions_2_parquet to a separate script.

4 years ago Fix whitespace at top of file.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:32:00 +0000 (23:32 -0700)]
Fix whitespace at top of file.

4 years ago Secondary sort for the by_author dataset should be CreatedAt.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:27:18 +0000 (23:27 -0700)]
Secondary sort for the by_author dataset should be CreatedAt.

4 years ago Create a second dataset sorted by author.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:24:40 +0000 (23:24 -0700)]
Create a second dataset sorted by author.

4 years ago Create parquet datasets of reddit submissions from pushshift.
Nate E TeBlunthuis [Mon, 6 Jul 2020 06:20:17 +0000 (23:20 -0700)]
Create parquet datasets of reddit submissions from pushshift.

4 years ago Rename spark script to reflect that it is for comments.
Nate E TeBlunthuis [Fri, 3 Jul 2020 21:00:36 +0000 (14:00 -0700)]
Rename spark script to reflect that it is for comments.

4 years ago update .gitignore
Nate E TeBlunthuis [Fri, 3 Jul 2020 20:55:25 +0000 (13:55 -0700)]
update .gitignore

4 years ago bugfix in retrieving old data and rename file.
Nate E TeBlunthuis [Fri, 3 Jul 2020 20:54:55 +0000 (13:54 -0700)]
bugfix in retrieving old data and rename file.

4 years ago Script for checking shas for submissions.
Nate E TeBlunthuis [Fri, 3 Jul 2020 20:35:46 +0000 (13:35 -0700)]
Script for checking shas for submissions.

4 years ago Bugfix: use timestamp types
Nate E TeBlunthuis [Fri, 3 Jul 2020 18:38:43 +0000 (11:38 -0700)]
Bugfix: use timestamp types

Also change the canonical file path.

4 years ago update the reddit comment dumps
Nate E TeBlunthuis [Fri, 3 Jul 2020 17:41:13 +0000 (10:41 -0700)]
update the reddit comment dumps

4 years ago Don't clobber old dumps so that we can just download the new ones.
Nate E TeBlunthuis [Fri, 3 Jul 2020 17:40:43 +0000 (10:40 -0700)]
Don't clobber old dumps so that we can just download the new ones.

4 years ago script for getting submissions dumps from pushshift.
Nate E TeBlunthuis [Fri, 3 Jul 2020 00:40:17 +0000 (17:40 -0700)]
script for getting submissions dumps from pushshift.

4 years ago Extract variables from pushshift comment to parquet
Nate E TeBlunthuis [Thu, 2 Jul 2020 21:06:36 +0000 (14:06 -0700)]
Extract variables from pushshift comment to parquet

A spark script
