code.communitydata.science - mediawiki_dump_tools.git/log
Nathan TeBlunthuis [Wed, 3 May 2023 17:23:30 +0000 (10:23 -0700)]
code review.
Benjamin Mako Hill [Sat, 29 Apr 2023 18:52:13 +0000 (11:52 -0700)]
fix because pandas testing API has changed
Benjamin Mako Hill [Sat, 29 Apr 2023 18:44:48 +0000 (11:44 -0700)]
rename variables to be more consistent
I changed regex_match_revision to be regex_revision_match so that it matches
the way that the other revisions are named so that they are all of the form:
regex_<thing_being_matched_again>_<variable>
I made the same change for comments.
Benjamin Mako Hill [Sat, 29 Apr 2023 18:40:03 +0000 (11:40 -0700)]
added counting functionality to regex code
The regex code has historically returned the actual matched patterns and the
named capture groups within regexes. When trying to count common and/or large
patterns, this leads to very large outputs.
I've added two new options, -RPc and -CPc, that cause wikiq to return
counts of each pattern (0 when there are no matches). The options apply to all
comment or revision patterns. I considered interfaces that would make it possible
to count some patterns but not others, but concluded this would be too complicated.
This code should be checked before it's merged.
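The difference between returning matches and returning counts can be sketched roughly as follows. This is a minimal illustration and not wikiq's actual code: the function name `count_pattern_matches` and the label-to-regex dictionary are assumptions made for the example.

```python
import re


def count_pattern_matches(patterns, text):
    """Count non-overlapping matches of each labelled pattern in text.

    `patterns` maps a label to a regular expression string (hypothetical
    structure, chosen for this sketch). Every label appears in the result,
    with a count of 0 when its pattern does not match.
    """
    counts = {}
    for label, regex in patterns.items():
        # finditer yields one match object per non-overlapping match,
        # so we only keep a running total instead of the matched text.
        counts[label] = sum(1 for _ in re.finditer(regex, text))
    return counts


if __name__ == "__main__":
    example = count_pattern_matches({"a": r"a", "z": r"z"}, "banana")
    print(example)
```

Returning one integer per pattern keeps the output size fixed no matter how often a common pattern occurs in a revision or comment.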
Benjamin Mako Hill [Fri, 28 Apr 2023 21:40:18 +0000 (14:40 -0700)]
updated README file
- added information on Python dependencies
- wrapped lines in a previous paragraph (no changes)
Benjamin Mako Hill [Fri, 28 Apr 2023 21:30:42 +0000 (14:30 -0700)]
make sure that content is defined before testing for search patterns
This appears to have been causing a bug with comments/text that were deleted.
Kaylea fixed and I adapted the code.
Benjamin Mako Hill [Fri, 28 Apr 2023 21:21:21 +0000 (14:21 -0700)]
added a line to fix persistence with deleted revs
kaylea realized that we need to initialize the old_rev_data dictionary or it
fails when the first revision to a page is deleted. This patch is from kaylea
and modified by mako.
Nathan TeBlunthuis [Mon, 11 Nov 2019 19:28:48 +0000 (11:28 -0800)]
remove commented code
Nathan TeBlunthuis [Sat, 9 Nov 2019 21:07:46 +0000 (13:07 -0800)]
refactor regex matching in a tidier object oriented style
Nathan TeBlunthuis [Sat, 9 Nov 2019 20:19:55 +0000 (12:19 -0800)]
validate tests and add asserts and baselines for regex tests.
sohyeonhwang [Thu, 7 Nov 2019 20:06:15 +0000 (14:06 -0600)]
added regex scanner v2's dump unit test file regextest.xml.bz2
sohyeonhwang [Thu, 7 Nov 2019 19:28:17 +0000 (13:28 -0600)]
merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests
groceryheist [Mon, 7 Oct 2019 22:02:30 +0000 (15:02 -0700)]
add unit tests for configuring revert_radius
groceryheist [Mon, 7 Oct 2019 20:57:49 +0000 (13:57 -0700)]
make revert radius configurable
groceryheist [Sun, 6 Oct 2019 01:17:03 +0000 (18:17 -0700)]
Merge branch 'master' into regex_scanner
groceryheist [Sat, 5 Oct 2019 23:36:07 +0000 (16:36 -0700)]
update baseline outputs
groceryheist [Sat, 5 Oct 2019 23:13:11 +0000 (16:13 -0700)]
bugfix, remove old legacy persistence flag
sohyeonhwang [Sat, 5 Oct 2019 20:36:58 +0000 (15:36 -0500)]
changes for regex scanner addition
groceryheist [Sun, 22 Sep 2019 22:54:17 +0000 (15:54 -0700)]
don't compute persistence by default
groceryheist [Sun, 22 Sep 2019 22:11:59 +0000 (15:11 -0700)]
elaborate docstring for persistence
groceryheist [Mon, 3 Sep 2018 18:30:12 +0000 (11:30 -0700)]
improve help for namespace-include
groceryheist [Mon, 3 Sep 2018 18:21:49 +0000 (11:21 -0700)]
sub assertEquals assertEqual
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:25:08 +0000 (18:25 -0700)]
add namespace filter parameter
groceryheist [Fri, 24 Aug 2018 01:52:54 +0000 (18:52 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence
groceryheist [Fri, 24 Aug 2018 01:27:09 +0000 (18:27 -0700)]
Add parameter for selecting specific namespaces.
groceryheist [Fri, 24 Aug 2018 01:27:09 +0000 (18:27 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:23:36 +0000 (18:23 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:25:08 +0000 (18:25 -0700)]
add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:23:36 +0000 (18:23 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:02:56 +0000 (18:02 -0700)]
add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:02:56 +0000 (18:02 -0700)]
add namespace filter parameter
groceryheist [Mon, 20 Aug 2018 23:08:16 +0000 (16:08 -0700)]
add support for persistence with segment matching
groceryheist [Tue, 10 Jul 2018 05:11:17 +0000 (22:11 -0700)]
Prefix page titles with namespace names.
groceryheist [Thu, 5 Jul 2018 08:16:00 +0000 (01:16 -0700)]
migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project.
groceryheist [Thu, 5 Jul 2018 02:06:07 +0000 (19:06 -0700)]
migrate to mwpersistence. This fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy flag.
groceryheist [Wed, 4 Jul 2018 22:29:48 +0000 (15:29 -0700)]
migrate reverts to python-mwreverts
groceryheist [Wed, 4 Jul 2018 22:20:52 +0000 (15:20 -0700)]
add note to readme about dependency on compression software
groceryheist [Wed, 4 Jul 2018 22:08:30 +0000 (15:08 -0700)]
add tests for wikipedia, malformed xml, bzip2, correct bz2 bug in wikiq.
groceryheist [Wed, 4 Jul 2018 06:43:47 +0000 (23:43 -0700)]
create baseline tests for xml dump processing
Benjamin Mako Hill [Thu, 17 May 2018 21:37:20 +0000 (14:37 -0700)]
a number of small updates and fixes
- fix regex for filename/filetype matches
- unload all files, not just ones that end with xml, in 7z archives
- fix bug that broke stdout
- minor cosmetic fixes
- updated mediawiki-utilities submodule to latest version
groceryheist [Thu, 7 Dec 2017 23:10:56 +0000 (15:10 -0800)]
support 7z archives with multiple files. add urlencode parameter
Benjamin Mako Hill [Tue, 7 Feb 2017 02:25:17 +0000 (18:25 -0800)]
fix code to work with bzip files
Benjamin Mako Hill [Thu, 23 Jul 2015 19:16:31 +0000 (12:16 -0700)]
added list of compressed dump files to .gitignore
Benjamin Mako Hill [Thu, 23 Jul 2015 19:12:20 +0000 (12:12 -0700)]
added support to parse namespaces from title
This is necessary for wikis (e.g., Wikia XML dumps) that do not include
namespace metadata as tags within each <page>.
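One way to infer a namespace from a title alone is sketched below. It is an illustration under stated assumptions, not the code in this commit: the function name `namespace_from_title`, the name-to-id mapping, and the default of 0 for the main namespace are all hypothetical.

```python
def namespace_from_title(title, namespaces, default=0):
    """Guess a page's namespace id from its title prefix.

    `namespaces` maps namespace names to ids (assumed structure for this
    sketch). A title such as "Talk:Foo" is split at the first colon; if the
    prefix is a known namespace name its id is returned, otherwise the
    default (main namespace) is used.
    """
    prefix, sep, _rest = title.partition(":")
    if sep and prefix in namespaces:
        return namespaces[prefix]
    return default


if __name__ == "__main__":
    known = {"Talk": 1, "User": 2}
    print(namespace_from_title("Talk:Foo", known))
    print(namespace_from_title("Foo: a subtitle", known))
```

Checking the prefix against a list of known namespace names avoids misreading ordinary titles that happen to contain a colon.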
Benjamin Mako Hill [Thu, 23 Jul 2015 02:55:08 +0000 (19:55 -0700)]
added README file to document the submodule
Benjamin Mako Hill [Thu, 23 Jul 2015 02:44:52 +0000 (19:44 -0700)]
created new repository for wikiq with Mediawiki-Utilities as a submodule
Community Data Science Collective