code.communitydata.science - cdsc_reddit.git/shortlog
cdsc_reddit.git
2020-11-02  Nate E TeBlunthuis  Merge branch 'master' of code:cdsc_reddit
2020-11-02  Nate E TeBlunthuis  add term_cosine_similarity.py
2020-11-02  Nathan TeBlunthuis  Add Cosine similarities to README.md
2020-11-02  Nathan TeBlunthuis  Update Readme.
2020-11-02  Nathan TeBlunthuis  Merge branch 'master' of code:cdsc_reddit into master
2020-11-02  Nathan TeBlunthuis  Create README.md
2020-10-03  Nate E TeBlunthuis  Update reddit comments data with daily dumps.
2020-08-23  Nate E TeBlunthuis  Compute IDF for terms and authors.
2020-08-12  Nate E TeBlunthuis  Update submissions to parse using the backfill queue.
2020-08-11  Nate E TeBlunthuis  bugfix in checking submission shas
2020-08-10  Nate E TeBlunthuis  Use multiword expressions in tf.
2020-08-10  Nate E TeBlunthuis  Finish generating multiword expressions.
2020-08-09  Nate E TeBlunthuis  Bugfix
2020-08-09  Nate E TeBlunthuis  Use groupby - joins instead of windows
2020-08-04  Nate E TeBlunthuis  renamte tf_comments part 2.
2020-08-04  Nate E TeBlunthuis  rename tf_reddit_comments.py step1.
2020-08-04  Nate E TeBlunthuis  Improve tokenization following data. Generate author...
2020-08-04  Nate E TeBlunthuis  improve tokenizer.
2020-08-04  Nate E TeBlunthuis  TF reddit comments.
2020-08-04  Nate E TeBlunthuis  code to sort tf
2020-07-10  Nate E TeBlunthuis  remove is_submitter field from submissions which doesn...
2020-07-08  Nate E TeBlunthuis  Bugfixes in scripts.
2020-07-07  Nate E TeBlunthuis  clean up comments in streaming example.
2020-07-07  Nate E TeBlunthuis  update .gitignore
2020-07-07  Nate E TeBlunthuis  update examples with working streaming
2020-07-07  Nate E TeBlunthuis  Build comments dataset similarly to submissions and...
2020-07-07  Nate E TeBlunthuis  update .gitignore
2020-07-07  Nate E TeBlunthuis  Script for example of streaming pyarrow.
2020-07-07  Nate E TeBlunthuis  Script to demonstrate reading parquet.
2020-07-07  Nate E TeBlunthuis  Check the shas when we download dumps
2020-07-07  Nate E TeBlunthuis  Script to run both parts of submissions_2_parquet.sh
2020-07-07  Nate E TeBlunthuis  Cache before sorting so we don't extract twice.
2020-07-07  Nate E TeBlunthuis  Move the spark part of submissions_2_parquet to a separ...
2020-07-06  Nate E TeBlunthuis  Fix whitespace at top of file.
2020-07-06  Nate E TeBlunthuis  Secondary sort for the by_author dataset should be...
2020-07-06  Nate E TeBlunthuis  Create a second dataset sorted by author.
2020-07-06  Nate E TeBlunthuis  Create parquet datasets of reddit submissions from...
2020-07-03  Nate E TeBlunthuis  Rename spark script to reflect that it is for comments.
2020-07-03  Nate E TeBlunthuis  update .gitignore
2020-07-03  Nate E TeBlunthuis  bugfix in retrieving old data and rename file.
2020-07-03  Nate E TeBlunthuis  Script for checking shas for submissions.
2020-07-03  Nate E TeBlunthuis  Bugfix: use timestamp types
2020-07-03  Nate E TeBlunthuis  update the reddit comment dumps
2020-07-03  Nate E TeBlunthuis  Don't clobber old dumps so that we can just download...
2020-07-03  Nate E TeBlunthuis  script for getting submissions dumps from pushshift.
2020-07-02  Nate E TeBlunthuis  Extract variables from pushshift comment to parquet