cdsc_reddit.git
2020-12-02
Nate E TeBlunthuis
Add code for running tf-idf at the weekly level.
2020-11-18
Nathan TeBlunthuis
refactor visualization code.
2020-11-18
Nathan TeBlunthuis
Merge remote-tracking branch 'refs/remotes/origin/maste...
synced/master
2020-11-18
Nathan TeBlunthuis
git-annex in nathante@nate-x1:~/cdsc_reddit
2020-11-18
Nate E TeBlunthuis
git-annex in nathante@mox2.hyak.local:/gscratch/comdata...
2020-11-17
Nate E TeBlunthuis
Update code for clustering + tsne.
2020-11-17
Nate E TeBlunthuis
Update code for building simlarity matrices.
2020-11-12
Nate E TeBlunthuis
bugfix in completing tfidf similarity matrices.
2020-11-12
Nate E TeBlunthuis
increase learning rate.
2020-11-12
Nate E TeBlunthuis
increase iterations and perplectity and early_exaggeration
2020-11-12
Nate E TeBlunthuis
increase learning rate
2020-11-12
Nate E TeBlunthuis
Fix bug in tsne.
2020-11-12
Nate E TeBlunthuis
git-annex in nathante@mox2.hyak.local:/gscratch/comdata...
2020-11-12
Nate E TeBlunthuis
split fitting and plotting tsne.
2020-11-12
Nathan TeBlunthuis
Add file to plot related subreddits using tsne.
2020-11-10
Nate E TeBlunthuis
Bugfix (typo)
2020-11-10
Nate E TeBlunthuis
Reuse code for term and author cosine similarity.
2020-11-10
Nate E TeBlunthuis
Refactor tfidf code to for code resuse.
2020-11-10
Nate E TeBlunthuis
rename 'idf' files to 'tfidf'
2020-11-10
Nate E TeBlunthuis
Improvements to idf code
2020-11-02
Nate E TeBlunthuis
Merge branch 'master' of code:cdsc_reddit
2020-11-02
Nate E TeBlunthuis
add term_cosine_similarity.py
2020-11-02
Nathan TeBlunthuis
Add Cosine similarities to README.md
2020-11-02
Nathan TeBlunthuis
Update Readme.
2020-11-02
Nathan TeBlunthuis
Merge branch 'master' of code:cdsc_reddit into master
2020-11-02
Nathan TeBlunthuis
Create README.md
2020-10-03
Nate E TeBlunthuis
Update reddit comments data with daily dumps.
2020-08-23
Nate E TeBlunthuis
Compute IDF for terms and authors.
2020-08-12
Nate E TeBlunthuis
Update submissions to parse using the backfill queue.
2020-08-11
Nate E TeBlunthuis
bugfix in checking submission shas
2020-08-10
Nate E TeBlunthuis
Use multiword expressions in tf.
2020-08-10
Nate E TeBlunthuis
Finish generating multiword expressions.
2020-08-09
Nate E TeBlunthuis
Bugfix
2020-08-09
Nate E TeBlunthuis
Use groupby - joins instead of windows
2020-08-04
Nate E TeBlunthuis
renamte tf_comments part 2.
2020-08-04
Nate E TeBlunthuis
rename tf_reddit_comments.py step1.
2020-08-04
Nate E TeBlunthuis
Improve tokenization following data. Generate author...
2020-08-04
Nate E TeBlunthuis
improve tokenizer.
2020-08-04
Nate E TeBlunthuis
TF reddit comments.
2020-08-04
Nate E TeBlunthuis
code to sort tf
2020-07-10
Nate E TeBlunthuis
remove is_submitter field from submissions which doesn...
2020-07-08
Nate E TeBlunthuis
Bugfixes in scripts.
2020-07-07
Nate E TeBlunthuis
clean up comments in streaming example.
2020-07-07
Nate E TeBlunthuis
update .gitignore
2020-07-07
Nate E TeBlunthuis
update examples with working streaming
2020-07-07
Nate E TeBlunthuis
Build comments dataset similarly to submissions and...
2020-07-07
Nate E TeBlunthuis
update .gitignore
2020-07-07
Nate E TeBlunthuis
Script for example of streaming pyarrow.
2020-07-07
Nate E TeBlunthuis
Script to demonstrate reading parquet.
2020-07-07
Nate E TeBlunthuis
Check the shas when we download dumps
2020-07-07
Nate E TeBlunthuis
Script to run both parts of submissions_2_parquet.sh
2020-07-07
Nate E TeBlunthuis
Cache before sorting so we don't extract twice.
2020-07-07
Nate E TeBlunthuis
Move the spark part of submissions_2_parquet to a separ...
2020-07-06
Nate E TeBlunthuis
Fix whitespace at top of file.
2020-07-06
Nate E TeBlunthuis
Secondary sort for the by_author dataset should be...
2020-07-06
Nate E TeBlunthuis
Create a second dataset sorted by author.
2020-07-06
Nate E TeBlunthuis
Create parquet datasets of reddit submissions from...
2020-07-03
Nate E TeBlunthuis
Rename spark script to reflect that it is for comments.
2020-07-03
Nate E TeBlunthuis
update .gitignore
2020-07-03
Nate E TeBlunthuis
bugfix in retrieving old data and rename file.
2020-07-03
Nate E TeBlunthuis
Script for checking shas for submissions.
2020-07-03
Nate E TeBlunthuis
Bugfix: use timestamp types
2020-07-03
Nate E TeBlunthuis
update the reddit comment dumps
2020-07-03
Nate E TeBlunthuis
Don't clobber old dumps so that we can just download...
2020-07-03
Nate E TeBlunthuis
script for getting submissions dumps from pushshift.
2020-07-02
Nate E TeBlunthuis
Extract variables from pushshift comment to parquet
Community Data Science Collective