code.communitydata.science - cdsc_reddit.git/shortlog

2020-07-07	Nate E TeBlunthuis	update examples with working streaming
2020-07-07	Nate E TeBlunthuis	Build comments dataset similarly to submissions and...
2020-07-07	Nate E TeBlunthuis	update .gitignore
2020-07-07	Nate E TeBlunthuis	Script for example of streaming pyarrow.
2020-07-07	Nate E TeBlunthuis	Script to demonstrate reading parquet.
2020-07-07	Nate E TeBlunthuis	Check the shas when we download dumps
2020-07-07	Nate E TeBlunthuis	Script to run both parts of submissions_2_parquet.sh
2020-07-07	Nate E TeBlunthuis	Cache before sorting so we don't extract twice.
2020-07-07	Nate E TeBlunthuis	Move the spark part of submissions_2_parquet to a separ...
2020-07-06	Nate E TeBlunthuis	Fix whitespace at top of file.
2020-07-06	Nate E TeBlunthuis	Secondary sort for the by_author dataset should be...
2020-07-06	Nate E TeBlunthuis	Create a second dataset sorted by author.
2020-07-06	Nate E TeBlunthuis	Create parquet datasets of reddit submissions from...
2020-07-03	Nate E TeBlunthuis	Rename spark script to reflect that it is for comments.
2020-07-03	Nate E TeBlunthuis	update .gitignore
2020-07-03	Nate E TeBlunthuis	bugfix in retrieving old data and rename file.
2020-07-03	Nate E TeBlunthuis	Script for checking shas for submissions.
2020-07-03	Nate E TeBlunthuis	Bugfix: use timestamp types
2020-07-03	Nate E TeBlunthuis	update the reddit comment dumps
2020-07-03	Nate E TeBlunthuis	Don't clobber old dumps so that we can just download...
2020-07-03	Nate E TeBlunthuis	script for getting submissions dumps from pushshift.
2020-07-02	Nate E TeBlunthuis	Extract variables from pushshift comment to parquet

Building parquet tables from pushshift reddit dumps.

Community Data Science Collective