code.communitydata.science - cdsc_reddit.git/log

Nate E TeBlunthuis [Sun, 13 Dec 2020 04:42:47 +0000 (20:42 -0800)]

Some improvements to run affinity clustering on larger dataset and
compute density.

Nate E TeBlunthuis [Wed, 9 Dec 2020 01:32:20 +0000 (17:32 -0800)]

Refactor and reorganize.

Nate E TeBlunthuis [Wed, 2 Dec 2020 06:54:48 +0000 (22:54 -0800)]

Add code for running tf-idf at the weekly level.

Nathan TeBlunthuis [Wed, 18 Nov 2020 00:46:49 +0000 (16:46 -0800)]

refactor visualization code.

Nathan TeBlunthuis [Wed, 18 Nov 2020 00:33:14 +0000 (16:33 -0800)]

Merge remote-tracking branch 'refs/remotes/origin/master' into master

Nathan TeBlunthuis [Wed, 18 Nov 2020 00:33:13 +0000 (16:33 -0800)]

git-annex in nathante@nate-x1:~/cdsc_reddit

Nate E TeBlunthuis [Wed, 18 Nov 2020 00:31:48 +0000 (16:31 -0800)]

git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit

Nate E TeBlunthuis [Tue, 17 Nov 2020 23:59:20 +0000 (15:59 -0800)]

Update code for clustering + tsne.

Nate E TeBlunthuis [Tue, 17 Nov 2020 20:52:48 +0000 (12:52 -0800)]

Update code for building similarity matrices.

Nate E TeBlunthuis [Thu, 12 Nov 2020 19:47:53 +0000 (11:47 -0800)]

bugfix in completing tfidf similarity matrices.

Nate E TeBlunthuis [Thu, 12 Nov 2020 00:58:39 +0000 (16:58 -0800)]

increase learning rate.

Nate E TeBlunthuis [Thu, 12 Nov 2020 00:55:39 +0000 (16:55 -0800)]

increase iterations, perplexity, and early_exaggeration

Nate E TeBlunthuis [Thu, 12 Nov 2020 00:48:41 +0000 (16:48 -0800)]

increase learning rate

Nate E TeBlunthuis [Thu, 12 Nov 2020 00:43:41 +0000 (16:43 -0800)]

Fix bug in tsne.

Nate E TeBlunthuis [Thu, 12 Nov 2020 00:39:44 +0000 (16:39 -0800)]

git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit

Nate E TeBlunthuis [Thu, 12 Nov 2020 00:38:22 +0000 (16:38 -0800)]

split fitting and plotting tsne.

Nathan TeBlunthuis [Thu, 12 Nov 2020 00:05:36 +0000 (16:05 -0800)]

Add file to plot related subreddits using tsne.

Nate E TeBlunthuis [Tue, 10 Nov 2020 21:38:11 +0000 (13:38 -0800)]

Bugfix (typo)

Nate E TeBlunthuis [Tue, 10 Nov 2020 21:18:57 +0000 (13:18 -0800)]

Reuse code for term and author cosine similarity.

Nate E TeBlunthuis [Tue, 10 Nov 2020 21:18:19 +0000 (13:18 -0800)]

Refactor tfidf code for code reuse.

Nate E TeBlunthuis [Tue, 10 Nov 2020 21:16:55 +0000 (13:16 -0800)]

rename 'idf' files to 'tfidf'

Nate E TeBlunthuis [Tue, 10 Nov 2020 21:12:11 +0000 (13:12 -0800)]

Improvements to idf code

Nate E TeBlunthuis [Mon, 2 Nov 2020 18:40:12 +0000 (10:40 -0800)]

Merge branch 'master' of code:cdsc_reddit

Nate E TeBlunthuis [Mon, 2 Nov 2020 18:40:02 +0000 (10:40 -0800)]

add term_cosine_similarity.py

Nathan TeBlunthuis [Mon, 2 Nov 2020 17:48:10 +0000 (09:48 -0800)]

Add Cosine similarities to README.md

Nathan TeBlunthuis [Mon, 2 Nov 2020 16:42:13 +0000 (08:42 -0800)]

Update Readme.

Nathan TeBlunthuis [Mon, 2 Nov 2020 05:50:44 +0000 (21:50 -0800)]

Merge branch 'master' of code:cdsc_reddit into master

Nathan TeBlunthuis [Mon, 2 Nov 2020 05:50:27 +0000 (21:50 -0800)]

Create README.md

Nate E TeBlunthuis [Sat, 3 Oct 2020 23:42:22 +0000 (16:42 -0700)]

Update reddit comments data with daily dumps.

Nate E TeBlunthuis [Sun, 23 Aug 2020 18:57:55 +0000 (11:57 -0700)]

Compute IDF for terms and authors.

Nate E TeBlunthuis [Wed, 12 Aug 2020 05:37:36 +0000 (22:37 -0700)]

Update submissions to parse using the backfill queue.

Nate E TeBlunthuis [Tue, 11 Aug 2020 21:21:54 +0000 (14:21 -0700)]

bugfix in checking submission shas

Nate E TeBlunthuis [Mon, 10 Aug 2020 23:57:46 +0000 (16:57 -0700)]

Use multiword expressions in tf.

Nate E TeBlunthuis [Mon, 10 Aug 2020 05:42:23 +0000 (22:42 -0700)]

Finish generating multiword expressions.

Nate E TeBlunthuis [Sun, 9 Aug 2020 09:34:42 +0000 (02:34 -0700)]

Bugfix

Nate E TeBlunthuis [Sun, 9 Aug 2020 07:21:50 +0000 (00:21 -0700)]

Use groupby - joins instead of windows

Nate E TeBlunthuis [Tue, 4 Aug 2020 20:39:49 +0000 (13:39 -0700)]

rename tf_comments part 2.

Nate E TeBlunthuis [Tue, 4 Aug 2020 20:39:20 +0000 (13:39 -0700)]

rename tf_reddit_comments.py step1.

Nate E TeBlunthuis [Tue, 4 Aug 2020 20:24:37 +0000 (13:24 -0700)]

Improve tokenization following data. Generate author counts.

Nate E TeBlunthuis [Tue, 4 Aug 2020 05:55:10 +0000 (22:55 -0700)]

improve tokenizer.

Nate E TeBlunthuis [Tue, 4 Aug 2020 05:43:57 +0000 (22:43 -0700)]

TF reddit comments.

Nate E TeBlunthuis [Tue, 4 Aug 2020 00:56:36 +0000 (17:56 -0700)]

code to sort tf

Nate E TeBlunthuis [Fri, 10 Jul 2020 00:12:14 +0000 (17:12 -0700)]

remove the is_submitter field, which doesn't exist, from submissions.

Nate E TeBlunthuis [Wed, 8 Jul 2020 06:29:36 +0000 (23:29 -0700)]

Bugfixes in scripts.

Nate E TeBlunthuis [Tue, 7 Jul 2020 19:28:57 +0000 (12:28 -0700)]

clean up comments in streaming example.

Nate E TeBlunthuis [Tue, 7 Jul 2020 19:28:44 +0000 (12:28 -0700)]

update .gitignore

Nate E TeBlunthuis [Tue, 7 Jul 2020 18:47:17 +0000 (11:47 -0700)]

update examples with working streaming

Nate E TeBlunthuis [Tue, 7 Jul 2020 18:45:43 +0000 (11:45 -0700)]

Build comments dataset similarly to submissions and improve partitioning scheme

Nate E TeBlunthuis [Tue, 7 Jul 2020 07:58:26 +0000 (00:58 -0700)]

update .gitignore

Nate E TeBlunthuis [Tue, 7 Jul 2020 07:57:05 +0000 (00:57 -0700)]

Script for example of streaming pyarrow.

Nate E TeBlunthuis [Tue, 7 Jul 2020 07:51:40 +0000 (00:51 -0700)]

Script to demonstrate reading parquet.

Nate E TeBlunthuis [Tue, 7 Jul 2020 06:31:52 +0000 (23:31 -0700)]

Check the shas when we download dumps

Nate E TeBlunthuis [Tue, 7 Jul 2020 06:27:18 +0000 (23:27 -0700)]

Script to run both parts of submissions_2_parquet.sh

Nate E TeBlunthuis [Tue, 7 Jul 2020 05:30:04 +0000 (22:30 -0700)]

Cache before sorting so we don't extract twice.

Nate E TeBlunthuis [Tue, 7 Jul 2020 05:26:29 +0000 (22:26 -0700)]

Move the spark part of submissions_2_parquet to a separate script.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:32:00 +0000 (23:32 -0700)]

Fix whitespace at top of file.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:27:18 +0000 (23:27 -0700)]

Secondary sort for the by_author dataset should be CreatedAt.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:24:40 +0000 (23:24 -0700)]

Create a second dataset sorted by author.

Nate E TeBlunthuis [Mon, 6 Jul 2020 06:20:17 +0000 (23:20 -0700)]

Create parquet datasets of reddit submissions from pushshift.

Nate E TeBlunthuis [Fri, 3 Jul 2020 21:00:36 +0000 (14:00 -0700)]

Rename spark script to reflect that it is for comments.

Nate E TeBlunthuis [Fri, 3 Jul 2020 20:55:25 +0000 (13:55 -0700)]

update .gitignore

Nate E TeBlunthuis [Fri, 3 Jul 2020 20:54:55 +0000 (13:54 -0700)]

bugfix in retrieving old data and rename file.

Nate E TeBlunthuis [Fri, 3 Jul 2020 20:35:46 +0000 (13:35 -0700)]

Script for checking shas for submissions.

Nate E TeBlunthuis [Fri, 3 Jul 2020 18:38:43 +0000 (11:38 -0700)]

Bugfix: use timestamp types

Also change the canonical file path.

Nate E TeBlunthuis [Fri, 3 Jul 2020 17:41:13 +0000 (10:41 -0700)]

update the reddit comment dumps

Nate E TeBlunthuis [Fri, 3 Jul 2020 17:40:43 +0000 (10:40 -0700)]

Don't clobber old dumps so that we can just download the new ones.

Nate E TeBlunthuis [Fri, 3 Jul 2020 00:40:17 +0000 (17:40 -0700)]

script for getting submissions dumps from pushshift.

Nate E TeBlunthuis [Thu, 2 Jul 2020 21:06:36 +0000 (14:06 -0700)]

Extract variables from pushshift comment to parquet

A spark script

Building parquet tables from pushshift reddit dumps.
