code.communitydata.science - cdsc_reddit.git/log
cdsc_reddit.git
2 years ago refactor similarities to use submodule. factor_out_similarities
Nathan TeBlunthuis [Wed, 19 Jan 2022 23:05:49 +0000 (15:05 -0800)]
refactor similarities to use submodule.

2 years ago update pushshift dumps.
Nathan TeBlunthuis [Sat, 11 Dec 2021 05:23:32 +0000 (21:23 -0800)]
update pushshift dumps.

2 years ago lsi support for weekly similarities
Nathan TeBlunthuis [Thu, 12 Aug 2021 05:48:33 +0000 (22:48 -0700)]
lsi support for weekly similarities

2 years ago Merge branch 'master' of code:cdsc_reddit into excise_reindex
Nathan TeBlunthuis [Tue, 3 Aug 2021 22:13:39 +0000 (15:13 -0700)]
Merge branch 'master' of code:cdsc_reddit into excise_reindex

2 years ago Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex
Nathan TeBlunthuis [Tue, 3 Aug 2021 22:13:21 +0000 (15:13 -0700)]
Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex

2 years ago Updates to similarities code for smap project.
Nathan TeBlunthuis [Tue, 3 Aug 2021 22:06:48 +0000 (15:06 -0700)]
Updates to similarities code for smap project.

2 years ago Merge branch 'master' of code:cdsc_reddit
Nathan TeBlunthuis [Tue, 3 Aug 2021 22:03:40 +0000 (15:03 -0700)]
Merge branch 'master' of code:cdsc_reddit

2 years ago Merge branch 'master' of code:cdsc_reddit into excise_reindex
Nathan TeBlunthuis [Tue, 3 Aug 2021 22:02:08 +0000 (15:02 -0700)]
Merge branch 'master' of code:cdsc_reddit into excise_reindex

2 years ago update clustering scripts
Nate E TeBlunthuis [Tue, 3 Aug 2021 21:55:02 +0000 (14:55 -0700)]
update clustering scripts

2 years ago Merge branch 'master' of code:cdsc_reddit
Nate E TeBlunthuis [Wed, 28 Jul 2021 22:32:21 +0000 (15:32 -0700)]
Merge branch 'master' of code:cdsc_reddit

2 years ago no longer do we need to get daily dumps
Nate E TeBlunthuis [Wed, 28 Jul 2021 22:32:04 +0000 (15:32 -0700)]
no longer do we need to get daily dumps

2 years ago script for picking the best clustering given constraints
Nate E TeBlunthuis [Sat, 15 May 2021 02:10:36 +0000 (19:10 -0700)]
script for picking the best clustering given constraints

2 years ago Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex
Nate E TeBlunthuis [Fri, 14 May 2021 05:28:31 +0000 (22:28 -0700)]
Merge branch 'excise_reindex' of code:cdsc_reddit into excise_reindex

2 years ago support isolates in visualization
Nate E TeBlunthuis [Fri, 14 May 2021 05:26:58 +0000 (22:26 -0700)]
support isolates in visualization

2 years ago bug fix in affinity clustering
Nate E TeBlunthuis [Fri, 14 May 2021 05:26:03 +0000 (22:26 -0700)]
bug fix in affinity clustering

2 years ago Merge remote-tracking branch 'origin/excise_reindex' into temp
Nate E TeBlunthuis [Tue, 11 May 2021 01:32:03 +0000 (18:32 -0700)]
Merge remote-tracking branch 'origin/excise_reindex' into temp

2 years ago add script for pulling cluster timeseries
Nate E TeBlunthuis [Tue, 11 May 2021 01:24:22 +0000 (18:24 -0700)]
add script for pulling cluster timeseries

2 years ago Refactor to make a decent api.
Nate E TeBlunthuis [Mon, 10 May 2021 20:46:49 +0000 (13:46 -0700)]
Refactor to make a decent api.

2 years ago refactor clustring in object oriented style
Nate E TeBlunthuis [Sat, 8 May 2021 05:33:26 +0000 (22:33 -0700)]
refactor clustring in object oriented style

2 years ago refactor clustering.py into method-specific files.
Nate E TeBlunthuis [Mon, 3 May 2021 18:28:48 +0000 (11:28 -0700)]
refactor clustering.py into method-specific files.

2 years ago Remove 'exclude phrases' parameter.
Nate E TeBlunthuis [Mon, 3 May 2021 17:37:09 +0000 (10:37 -0700)]
Remove 'exclude phrases' parameter.

2 years ago Use Latent semantic indexing and hdbscan
Nate E TeBlunthuis [Mon, 3 May 2021 06:39:55 +0000 (23:39 -0700)]
Use Latent semantic indexing and hdbscan

3 years ago reindex tfidf in memory instead of using spark
Nate E TeBlunthuis [Fri, 30 Apr 2021 19:48:19 +0000 (12:48 -0700)]
reindex tfidf in memory instead of using spark

3 years ago bugfix
Nate E TeBlunthuis [Tue, 27 Apr 2021 05:31:05 +0000 (22:31 -0700)]
bugfix

3 years ago Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch
Nate E TeBlunthuis [Mon, 26 Apr 2021 20:22:29 +0000 (13:22 -0700)]
Merge branch 'charliepatch' of code:cdsc_reddit into charliepatch

3 years ago support passing in list of tfidf vectors.
Nate E TeBlunthuis [Mon, 26 Apr 2021 18:16:28 +0000 (11:16 -0700)]
support passing in list of tfidf vectors.

Also lowercases included subreddits.

3 years ago support passing in list of tfidf vectors.
Nate E TeBlunthuis [Mon, 26 Apr 2021 18:16:28 +0000 (11:16 -0700)]
support passing in list of tfidf vectors.

Also lowercases included subreddits.

3 years ago Merge branch 'master' of code:cdsc_reddit
Nate E TeBlunthuis [Thu, 22 Apr 2021 17:46:26 +0000 (10:46 -0700)]
Merge branch 'master' of code:cdsc_reddit

3 years ago version of weekly_cosine_similarities.py from klone
Nate E TeBlunthuis [Thu, 22 Apr 2021 17:38:10 +0000 (10:38 -0700)]
version of weekly_cosine_similarities.py from klone

3 years ago bugfix in weekly similarities
Nate E TeBlunthuis [Thu, 22 Apr 2021 17:37:04 +0000 (10:37 -0700)]
bugfix in weekly similarities

3 years ago bugfixes in clustering selection.
Nate E TeBlunthuis [Wed, 21 Apr 2021 23:56:25 +0000 (16:56 -0700)]
bugfixes in clustering selection.

3 years ago calculate some user-level attributes to detect bots
Nate E TeBlunthuis [Tue, 20 Apr 2021 18:34:36 +0000 (11:34 -0700)]
calculate some user-level attributes to detect bots

3 years ago grid sweep selection for clustering hyperparameters
Nate E TeBlunthuis [Tue, 20 Apr 2021 18:33:54 +0000 (11:33 -0700)]
grid sweep selection for clustering hyperparameters

3 years ago Merge branch 'master' of code:cdsc_reddit
Nate E TeBlunthuis [Tue, 6 Apr 2021 06:21:35 +0000 (23:21 -0700)]
Merge branch 'master' of code:cdsc_reddit

3 years ago Changes for cosine similarities on klone.
Nate E TeBlunthuis [Tue, 6 Apr 2021 06:21:06 +0000 (23:21 -0700)]
Changes for cosine similarities on klone.

3 years ago export timeseries functions
Nate E TeBlunthuis [Thu, 25 Mar 2021 00:18:30 +0000 (17:18 -0700)]
export timeseries functions

3 years ago add code for pulling activity time series from parquet.
Nate E TeBlunthuis [Wed, 24 Mar 2021 23:08:57 +0000 (16:08 -0700)]
add code for pulling activity time series from parquet.

3 years ago add included_subreddits parameter to cosine similarities.
Nate E TeBlunthuis [Tue, 23 Feb 2021 02:38:34 +0000 (18:38 -0800)]
add included_subreddits parameter to cosine similarities.

3 years ago Changes from hyak.
Nate E TeBlunthuis [Tue, 23 Feb 2021 00:03:48 +0000 (16:03 -0800)]
Changes from hyak.

3 years ago fix bug in viz.
Nate E TeBlunthuis [Thu, 28 Jan 2021 04:26:15 +0000 (20:26 -0800)]
fix bug in viz.

3 years ago add visualization for 10000 subreddits based on author-tf similarities.
Nate E TeBlunthuis [Thu, 28 Jan 2021 04:22:24 +0000 (20:22 -0800)]
add visualization for 10000 subreddits based on author-tf similarities.

3 years ago Merge branch 'master' of code:cdsc_reddit
Nate E TeBlunthuis [Thu, 28 Jan 2021 04:09:23 +0000 (20:09 -0800)]
Merge branch 'master' of code:cdsc_reddit

3 years ago add cluster selection to visualization
Nathan TeBlunthuis [Thu, 28 Jan 2021 04:08:07 +0000 (20:08 -0800)]
add cluster selection to visualization

3 years ago remove nsfw subs from topN
Nate E TeBlunthuis [Tue, 29 Dec 2020 05:11:44 +0000 (21:11 -0800)]
remove nsfw subs from topN

3 years ago Updating to support wang-style user overlaps.
Nate E TeBlunthuis [Fri, 25 Dec 2020 06:38:04 +0000 (22:38 -0800)]
Updating to support wang-style user overlaps.

3 years ago Some improvements to run affinity clustering on larger dataset and
Nate E TeBlunthuis [Sun, 13 Dec 2020 04:42:47 +0000 (20:42 -0800)]
Some improvements to run affinity clustering on larger dataset and
compute density.

3 years ago Refactor and reorganze.
Nate E TeBlunthuis [Wed, 9 Dec 2020 01:32:20 +0000 (17:32 -0800)]
Refactor and reorganze.

3 years ago Add code for running tf-idf at the weekly level.
Nate E TeBlunthuis [Wed, 2 Dec 2020 06:54:48 +0000 (22:54 -0800)]
Add code for running tf-idf at the weekly level.

3 years ago refactor visualization code.
Nathan TeBlunthuis [Wed, 18 Nov 2020 00:46:49 +0000 (16:46 -0800)]
refactor visualization code.

3 years ago Merge remote-tracking branch 'refs/remotes/origin/master' into master synced/master
Nathan TeBlunthuis [Wed, 18 Nov 2020 00:33:14 +0000 (16:33 -0800)]
Merge remote-tracking branch 'refs/remotes/origin/master' into master

3 years ago git-annex in nathante@nate-x1:~/cdsc_reddit
Nathan TeBlunthuis [Wed, 18 Nov 2020 00:33:13 +0000 (16:33 -0800)]
git-annex in nathante@nate-x1:~/cdsc_reddit

3 years ago git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
Nate E TeBlunthuis [Wed, 18 Nov 2020 00:31:48 +0000 (16:31 -0800)]
git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit

3 years ago Update code for clustering + tsne.
Nate E TeBlunthuis [Tue, 17 Nov 2020 23:59:20 +0000 (15:59 -0800)]
Update code for clustering + tsne.

3 years ago Update code for building simlarity matrices.
Nate E TeBlunthuis [Tue, 17 Nov 2020 20:52:48 +0000 (12:52 -0800)]
Update code for building simlarity matrices.

3 years ago bugfix in completing tfidf similarity matrices.
Nate E TeBlunthuis [Thu, 12 Nov 2020 19:47:53 +0000 (11:47 -0800)]
bugfix in completing tfidf similarity matrices.

3 years ago increase learning rate.
Nate E TeBlunthuis [Thu, 12 Nov 2020 00:58:39 +0000 (16:58 -0800)]
increase learning rate.

3 years ago increase iterations and perplectity and early_exaggeration
Nate E TeBlunthuis [Thu, 12 Nov 2020 00:55:39 +0000 (16:55 -0800)]
increase iterations and perplectity and early_exaggeration

3 years ago increase learning rate
Nate E TeBlunthuis [Thu, 12 Nov 2020 00:48:41 +0000 (16:48 -0800)]
increase learning rate

3 years ago Fix bug in tsne.
Nate E TeBlunthuis [Thu, 12 Nov 2020 00:43:41 +0000 (16:43 -0800)]
Fix bug in tsne.

3 years ago git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit
Nate E TeBlunthuis [Thu, 12 Nov 2020 00:39:44 +0000 (16:39 -0800)]
git-annex in nathante@mox2.hyak.local:/gscratch/comdata/users/nathante/cdsc-reddit

3 years ago split fitting and plotting tsne.
Nate E TeBlunthuis [Thu, 12 Nov 2020 00:38:22 +0000 (16:38 -0800)]
split fitting and plotting tsne.

3 years ago Add file to plot related subreddits using tsne.
Nathan TeBlunthuis [Thu, 12 Nov 2020 00:05:36 +0000 (16:05 -0800)]
Add file to plot related subreddits using tsne.

3 years ago Bugfix (typo)
Nate E TeBlunthuis [Tue, 10 Nov 2020 21:38:11 +0000 (13:38 -0800)]
Bugfix (typo)

3 years ago Reuse code for term and author cosine similarity.
Nate E TeBlunthuis [Tue, 10 Nov 2020 21:18:57 +0000 (13:18 -0800)]
Reuse code for term and author cosine similarity.

3 years ago Refactor tfidf code to for code resuse.
Nate E TeBlunthuis [Tue, 10 Nov 2020 21:18:19 +0000 (13:18 -0800)]
Refactor tfidf code to for code resuse.

3 years ago rename 'idf' files to 'tfidf'
Nate E TeBlunthuis [Tue, 10 Nov 2020 21:16:55 +0000 (13:16 -0800)]
rename 'idf' files to 'tfidf'

3 years ago Improvements to idf code
Nate E TeBlunthuis [Tue, 10 Nov 2020 21:12:11 +0000 (13:12 -0800)]
Improvements to idf code

3 years ago Merge branch 'master' of code:cdsc_reddit
Nate E TeBlunthuis [Mon, 2 Nov 2020 18:40:12 +0000 (10:40 -0800)]
Merge branch 'master' of code:cdsc_reddit

3 years ago add term_cosine_similarity.py
Nate E TeBlunthuis [Mon, 2 Nov 2020 18:40:02 +0000 (10:40 -0800)]
add term_cosine_similarity.py

3 years ago Add Cosine similarities to README.md
Nathan TeBlunthuis [Mon, 2 Nov 2020 17:48:10 +0000 (09:48 -0800)]
Add Cosine similarities to README.md

3 years ago Update Readme.
Nathan TeBlunthuis [Mon, 2 Nov 2020 16:42:13 +0000 (08:42 -0800)]
Update Readme.

3 years ago Merge branch 'master' of code:cdsc_reddit into master
Nathan TeBlunthuis [Mon, 2 Nov 2020 05:50:44 +0000 (21:50 -0800)]
Merge branch 'master' of code:cdsc_reddit into master

3 years ago Create README.md
Nathan TeBlunthuis [Mon, 2 Nov 2020 05:50:27 +0000 (21:50 -0800)]
Create README.md

3 years ago Update reddit comments data with daily dumps.
Nate E TeBlunthuis [Sat, 3 Oct 2020 23:42:22 +0000 (16:42 -0700)]
Update reddit comments data with daily dumps.

3 years ago Compute IDF for terms and authors.
Nate E TeBlunthuis [Sun, 23 Aug 2020 18:57:55 +0000 (11:57 -0700)]
Compute IDF for terms and authors.

3 years ago Update submissions to parse using the backfill queue.
Nate E TeBlunthuis [Wed, 12 Aug 2020 05:37:36 +0000 (22:37 -0700)]
Update submissions to parse using the backfill queue.

3 years ago bugfix in checking submission shas
Nate E TeBlunthuis [Tue, 11 Aug 2020 21:21:54 +0000 (14:21 -0700)]
bugfix in checking submission shas

3 years ago Use multiword expressions in tf.
Nate E TeBlunthuis [Mon, 10 Aug 2020 23:57:46 +0000 (16:57 -0700)]
Use multiword expressions in tf.

3 years ago Finish generating multiword expressions.
Nate E TeBlunthuis [Mon, 10 Aug 2020 05:42:23 +0000 (22:42 -0700)]
Finish generating multiword expressions.

3 years ago Bugfix
Nate E TeBlunthuis [Sun, 9 Aug 2020 09:34:42 +0000 (02:34 -0700)]
Bugfix

3 years ago Use groupby - joins instead of windows
Nate E TeBlunthuis [Sun, 9 Aug 2020 07:21:50 +0000 (00:21 -0700)]
Use groupby - joins instead of windows

3 years ago renamte tf_comments part 2.
Nate E TeBlunthuis [Tue, 4 Aug 2020 20:39:49 +0000 (13:39 -0700)]
renamte tf_comments part 2.

3 years ago rename tf_reddit_comments.py step1.
Nate E TeBlunthuis [Tue, 4 Aug 2020 20:39:20 +0000 (13:39 -0700)]
rename tf_reddit_comments.py step1.

3 years ago Improve tokenization following data. Generate author counts.
Nate E TeBlunthuis [Tue, 4 Aug 2020 20:24:37 +0000 (13:24 -0700)]
Improve tokenization following data. Generate author counts.

3 years ago improve tokenizer.
Nate E TeBlunthuis [Tue, 4 Aug 2020 05:55:10 +0000 (22:55 -0700)]
improve tokenizer.

3 years ago TF reddit comments.
Nate E TeBlunthuis [Tue, 4 Aug 2020 05:43:57 +0000 (22:43 -0700)]
TF reddit comments.

3 years ago code to sort tf
Nate E TeBlunthuis [Tue, 4 Aug 2020 00:56:36 +0000 (17:56 -0700)]
code to sort tf

3 years ago remove is_submitter field from submissions which doesn't exist.
Nate E TeBlunthuis [Fri, 10 Jul 2020 00:12:14 +0000 (17:12 -0700)]
remove is_submitter field from submissions which doesn't exist.

3 years ago Bugfixes in scripts.
Nate E TeBlunthuis [Wed, 8 Jul 2020 06:29:36 +0000 (23:29 -0700)]
Bugfixes in scripts.

3 years ago clean up comments in streaming example.
Nate E TeBlunthuis [Tue, 7 Jul 2020 19:28:57 +0000 (12:28 -0700)]
clean up comments in streaming example.

3 years ago update .gitignore
Nate E TeBlunthuis [Tue, 7 Jul 2020 19:28:44 +0000 (12:28 -0700)]
update .gitignore

3 years ago update examples with working streaming
Nate E TeBlunthuis [Tue, 7 Jul 2020 18:47:17 +0000 (11:47 -0700)]
update examples with working streaming

3 years ago Build comments dataset similarly to submissions and improve partitioning scheme
Nate E TeBlunthuis [Tue, 7 Jul 2020 18:45:43 +0000 (11:45 -0700)]
Build comments dataset similarly to submissions and improve partitioning scheme

3 years ago update .gitignore
Nate E TeBlunthuis [Tue, 7 Jul 2020 07:58:26 +0000 (00:58 -0700)]
update .gitignore

3 years ago Script for example of streaming pyarrow.
Nate E TeBlunthuis [Tue, 7 Jul 2020 07:57:05 +0000 (00:57 -0700)]
Script for example of streaming pyarrow.

3 years ago Script to demonstrate reading parquet.
Nate E TeBlunthuis [Tue, 7 Jul 2020 07:51:40 +0000 (00:51 -0700)]
Script to demonstrate reading parquet.

3 years ago Check the shas when we download dumps
Nate E TeBlunthuis [Tue, 7 Jul 2020 06:31:52 +0000 (23:31 -0700)]
Check the shas when we download dumps

3 years ago Script to run both parts of submissions_2_parquet.sh
Nate E TeBlunthuis [Tue, 7 Jul 2020 06:27:18 +0000 (23:27 -0700)]
Script to run both parts of submissions_2_parquet.sh

3 years ago Cache before sorting so we don't extract twice.
Nate E TeBlunthuis [Tue, 7 Jul 2020 05:30:04 +0000 (22:30 -0700)]
Cache before sorting so we don't extract twice.

3 years ago Move the spark part of submissions_2_parquet to a separate script.
Nate E TeBlunthuis [Tue, 7 Jul 2020 05:26:29 +0000 (22:26 -0700)]
Move the spark part of submissions_2_parquet to a separate script.
