mediawiki_dump_tools.git
6 months ago: remove commented code [master]
Nathan TeBlunthuis [Mon, 11 Nov 2019 19:28:48 +0000 (11:28 -0800)]
remove commented code

6 months ago: refactor regex matching in a tidier object oriented style
Nathan TeBlunthuis [Sat, 9 Nov 2019 21:07:46 +0000 (13:07 -0800)]
refactor regex matching in a tidier object oriented style

6 months ago: validate tests and add asserts and baselines for regex tests.
Nathan TeBlunthuis [Sat, 9 Nov 2019 20:19:55 +0000 (12:19 -0800)]
validate tests and add asserts and baselines for regex tests.

6 months ago: added regex scanner v2's dump unit test file regextest.xml.bz2
sohyeonhwang [Thu, 7 Nov 2019 20:06:15 +0000 (14:06 -0600)]
added regex scanner v2's dump unit test file regextest.xml.bz2

6 months ago: merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests
sohyeonhwang [Thu, 7 Nov 2019 19:28:17 +0000 (13:28 -0600)]
merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests

7 months ago: add unit tests for configuring revert_radius
groceryheist [Mon, 7 Oct 2019 22:02:30 +0000 (15:02 -0700)]
add unit tests for configuring revert_radius

7 months ago: make revert radius configurable
groceryheist [Mon, 7 Oct 2019 20:57:49 +0000 (13:57 -0700)]
make revert radius configurable

7 months ago: Merge branch 'master' into regex_scanner
groceryheist [Sun, 6 Oct 2019 01:17:03 +0000 (18:17 -0700)]
Merge branch 'master' into regex_scanner

7 months ago: update baseline outputs
groceryheist [Sat, 5 Oct 2019 23:36:07 +0000 (16:36 -0700)]
update baseline outputs

7 months ago: bugfix, remove old legacy persistence flag
groceryheist [Sat, 5 Oct 2019 23:13:11 +0000 (16:13 -0700)]
bugfix, remove old legacy persistence flag

7 months ago: changes for regex scanner addition
sohyeonhwang [Sat, 5 Oct 2019 20:36:58 +0000 (15:36 -0500)]
changes for regex scanner addition

8 months ago: don't compute persistence by default
groceryheist [Sun, 22 Sep 2019 22:54:17 +0000 (15:54 -0700)]
don't compute persistence by default

8 months ago: elaborate docstring for persistence
groceryheist [Sun, 22 Sep 2019 22:11:59 +0000 (15:11 -0700)]
elaborate docstring for persistence

21 months ago: improve help for namespace-include
groceryheist [Mon, 3 Sep 2018 18:30:12 +0000 (11:30 -0700)]
improve help for namespace-include

21 months ago: sub assertEquals assertEqual [advanced_persistence]
groceryheist [Mon, 3 Sep 2018 18:21:49 +0000 (11:21 -0700)]
sub assertEquals assertEqual

21 months ago: add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:25:08 +0000 (18:25 -0700)]
add namespace filter parameter

21 months ago: Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools...
groceryheist [Fri, 24 Aug 2018 01:52:54 +0000 (18:52 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence

21 months ago: Add parameter for selecting specific namespaces.
groceryheist [Fri, 24 Aug 2018 01:27:09 +0000 (18:27 -0700)]
Add parameter for selecting specific namespaces.

21 months ago: Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools...
groceryheist [Fri, 24 Aug 2018 01:27:09 +0000 (18:27 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence

21 months ago: Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools...
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:23:36 +0000 (18:23 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence

21 months ago: add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:25:08 +0000 (18:25 -0700)]
add namespace filter parameter

21 months ago: Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools...
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:23:36 +0000 (18:23 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence

21 months ago: add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:02:56 +0000 (18:02 -0700)]
add namespace filter parameter

21 months ago: add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:02:56 +0000 (18:02 -0700)]
add namespace filter parameter

21 months ago: add support for persistence with segment matching
groceryheist [Mon, 20 Aug 2018 23:08:16 +0000 (16:08 -0700)]
add support for persistence with segment matching

22 months ago: Prefix page titles with namespace names. [mediawiki-utils-migration]
groceryheist [Tue, 10 Jul 2018 05:11:17 +0000 (22:11 -0700)]
Prefix page titles with namespace names.

23 months ago: migrate to mwxml. This completes the migration away from python-mediawiki-utilities...
groceryheist [Thu, 5 Jul 2018 08:16:00 +0000 (01:16 -0700)]
migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project.

23 months ago: migrate to mwpersistence. this fixes many issues. We preserve legacy persistence...
groceryheist [Thu, 5 Jul 2018 02:06:07 +0000 (19:06 -0700)]
migrate to mwpersistence. This fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy flag.

23 months ago: migrate reverts to python-mwreverts
groceryheist [Wed, 4 Jul 2018 22:29:48 +0000 (15:29 -0700)]
migrate reverts to python-mwreverts

23 months ago: add note to readme about dependency on compression software
groceryheist [Wed, 4 Jul 2018 22:20:52 +0000 (15:20 -0700)]
add note to readme about dependency on compression software

23 months ago: add tests for wikipedia, malformed xml, bzip2, correct bz2 bug in wikiq.
groceryheist [Wed, 4 Jul 2018 22:08:30 +0000 (15:08 -0700)]
add tests for wikipedia, malformed xml, bzip2, correct bz2 bug in wikiq.

23 months ago: create baseline tests for xml dump processing
groceryheist [Wed, 4 Jul 2018 06:43:47 +0000 (23:43 -0700)]
create baseline tests for xml dump processing

2 years ago: a number of small updates and fixes
Benjamin Mako Hill [Thu, 17 May 2018 21:37:20 +0000 (14:37 -0700)]
a number of small updates and fixes

- fix regex for filename/filetype matches
- unload all files, not just ones that end with xml, in 7z archives
- fix bug that broke stdout
- minor cosmetic fixes
- updated mediawiki-utilities submodule to latest version

2 years ago: support 7z archives with multiple files. add urlencode parameter
groceryheist [Thu, 7 Dec 2017 23:10:56 +0000 (15:10 -0800)]
support 7z archives with multiple files. add urlencode parameter

3 years ago: fix code to work with bzip files
Benjamin Mako Hill [Tue, 7 Feb 2017 02:25:17 +0000 (18:25 -0800)]
fix code to work with bzip files

4 years ago: added list of compressed dump files to .gitignore
Benjamin Mako Hill [Thu, 23 Jul 2015 19:16:31 +0000 (12:16 -0700)]
added list of compressed dump files to .gitignore

4 years ago: added support to parse namespaces from title
Benjamin Mako Hill [Thu, 23 Jul 2015 19:12:20 +0000 (12:12 -0700)]
added support to parse namespaces from title

This is necessary for wikis (e.g., Wikia XML dumps) that do not include
namespace metadata as tags within each <page>.

4 years ago: added README file to document the submodule
Benjamin Mako Hill [Thu, 23 Jul 2015 02:55:08 +0000 (19:55 -0700)]
added README file to document the submodule

4 years ago: created new repository for wikiq with Mediawiki-Utilities as a submodule
Benjamin Mako Hill [Thu, 23 Jul 2015 02:44:52 +0000 (19:44 -0700)]
created new repository for wikiq with Mediawiki-Utilities as a submodule
