cdsc_reddit.git
2020-07-07
Nate E TeBlunthuis
update examples with working streaming
2020-07-07
Nate E TeBlunthuis
Build comments dataset similarly to submissions and...
2020-07-07
Nate E TeBlunthuis
update .gitignore
2020-07-07
Nate E TeBlunthuis
Script for example of streaming pyarrow.
2020-07-07
Nate E TeBlunthuis
Script to demonstrate reading parquet.
2020-07-07
Nate E TeBlunthuis
Check the shas when we download dumps
2020-07-07
Nate E TeBlunthuis
Script to run both parts of submissions_2_parquet.sh
2020-07-07
Nate E TeBlunthuis
Cache before sorting so we don't extract twice.
2020-07-07
Nate E TeBlunthuis
Move the spark part of submissions_2_parquet to a separ...
2020-07-06
Nate E TeBlunthuis
Fix whitespace at top of file.
2020-07-06
Nate E TeBlunthuis
Secondary sort for the by_author dataset should be...
2020-07-06
Nate E TeBlunthuis
Create a second dataset sorted by author.
2020-07-06
Nate E TeBlunthuis
Create parquet datasets of reddit submissions from...
2020-07-03
Nate E TeBlunthuis
Rename spark script to reflect that it is for comments.
2020-07-03
Nate E TeBlunthuis
update .gitignore
2020-07-03
Nate E TeBlunthuis
bugfix in retrieving old data and rename file.
2020-07-03
Nate E TeBlunthuis
Script for checking shas for submissions.
2020-07-03
Nate E TeBlunthuis
Bugfix: use timestamp types
2020-07-03
Nate E TeBlunthuis
update the reddit comment dumps
2020-07-03
Nate E TeBlunthuis
Don't clobber old dumps so that we can just download...
2020-07-03
Nate E TeBlunthuis
script for getting submissions dumps from pushshift.
2020-07-02
Nate E TeBlunthuis
Extract variables from pushshift comment to parquet
Community Data Science Collective