code.communitydata.science - mediawiki_dump_tools.git/log
12 months ago code review. [mako_changes-20230429]
Nathan TeBlunthuis [Wed, 3 May 2023 17:23:30 +0000 (10:23 -0700)]
code review.

12 months ago fix because pandas testing API has changed
Benjamin Mako Hill [Sat, 29 Apr 2023 18:52:13 +0000 (11:52 -0700)]
fix because pandas testing API has changed

12 months ago rename variables to be more consistent
Benjamin Mako Hill [Sat, 29 Apr 2023 18:44:48 +0000 (11:44 -0700)]
rename variables to be more consistent

I changed regex_match_revision to regex_revision_match so that it matches
the way the other variables are named. They are now all of the form:
regex_<thing_being_matched_against>_<variable>

I made the same change for comments.

12 months ago added counting functionality to regex code
Benjamin Mako Hill [Sat, 29 Apr 2023 18:40:03 +0000 (11:40 -0700)]
added counting functionality to regex code

The regex code has historically returned the actual matched patterns and the
named capture groups within regexes.  When trying to count common and/or large
patterns, this leads to very large outputs.

I've added two new options, -RPc and -CPc, that cause wikiq to return counts
of each pattern (0 when there are no matches). The options apply to all
comment or revision patterns. I considered interfaces that would make it
possible to count some patterns but not others, but concluded this would be
too complicated an interface.

This code should be checked before it's merged.
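
The difference between returning matches and returning counts can be sketched
as follows. This is a minimal illustration and not wikiq's code: the function
names find_matches and count_matches and the sample pattern are hypothetical.

```python
import re

def find_matches(pattern, text):
    """Return every matched string (output grows with the number of matches)."""
    return [m.group(0) for m in re.finditer(pattern, text)]

def count_matches(pattern, text):
    """Return only the number of matches; 0 when there are none."""
    return sum(1 for _ in re.finditer(pattern, text))

sample = "the cat and the hat and the bat"
matches = find_matches(r"\bthe\b", sample)
count = count_matches(r"\bthe\b", sample)
no_hits = count_matches(r"dog", sample)
```

For common patterns the count is a single integer per revision, whereas the
list of matches scales with the text.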

12 months ago updated README file
Benjamin Mako Hill [Fri, 28 Apr 2023 21:40:18 +0000 (14:40 -0700)]
updated README file

- added information on Python dependencies
- wrapped lines in a previous paragraph (no changes)

12 months ago make sure that content is defined before testing for search patterns
Benjamin Mako Hill [Fri, 28 Apr 2023 21:30:42 +0000 (14:30 -0700)]
make sure that content is defined before testing for search patterns

This appears to have been causing a bug with comments/text that were deleted.
Kaylea fixed it and I adapted the code.
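
A guard of this kind might look like the sketch below. It assumes deleted
comments or text show up as None, which is an assumption for illustration;
the function name matches_in and the sample revisions are hypothetical.

```python
import re

def matches_in(pattern, content):
    """Return all matches of pattern in content, tolerating missing content."""
    # Assumption: a deleted comment or text is represented as None.
    # re.finditer would raise TypeError on None, so check first.
    if content is None:
        return []
    return [m.group(0) for m in re.finditer(pattern, content)]

revisions = [
    {"id": 1, "comment": "fixed typo"},
    {"id": 2, "comment": None},  # comment was deleted
]
results = {rev["id"]: matches_in(r"typo", rev["comment"]) for rev in revisions}
```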

12 months ago added a line to fix persistence with deleted revs
Benjamin Mako Hill [Fri, 28 Apr 2023 21:21:21 +0000 (14:21 -0700)]
added a line to fix persistence with deleted revs

Kaylea realized that we need to initialize the old_rev_data dictionary;
otherwise it fails when the first revision to a page is deleted. This patch
is from Kaylea, modified by Mako.

4 years ago remove commented code [master]
Nathan TeBlunthuis [Mon, 11 Nov 2019 19:28:48 +0000 (11:28 -0800)]
remove commented code

4 years ago refactor regex matching in a tidier object oriented style
Nathan TeBlunthuis [Sat, 9 Nov 2019 21:07:46 +0000 (13:07 -0800)]
refactor regex matching in a tidier object oriented style

4 years ago validate tests and add asserts and baselines for regex tests.
Nathan TeBlunthuis [Sat, 9 Nov 2019 20:19:55 +0000 (12:19 -0800)]
validate tests and add asserts and baselines for regex tests.

4 years ago added regex scanner v2's dump unit test file regextest.xml.bz2
sohyeonhwang [Thu, 7 Nov 2019 20:06:15 +0000 (14:06 -0600)]
added regex scanner v2's dump unit test file regextest.xml.bz2

4 years ago merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests
sohyeonhwang [Thu, 7 Nov 2019 19:28:17 +0000 (13:28 -0600)]
merging pull containing revert-radius with 2nd version of regex scanner w/ unit tests

4 years ago add unit tests for configuring revert_radius
groceryheist [Mon, 7 Oct 2019 22:02:30 +0000 (15:02 -0700)]
add unit tests for configuring revert_radius

4 years ago make revert radius configurable
groceryheist [Mon, 7 Oct 2019 20:57:49 +0000 (13:57 -0700)]
make revert radius configurable
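
As a rough illustration of what a revert radius controls, the sketch below
detects identity reverts by comparing content checksums within a bounded
window. It is a simplified stand-in written for this note and is not the
implementation wikiq or python-mwreverts uses; detect_reverts and its return
format are hypothetical.

```python
import hashlib
from collections import deque

def detect_reverts(texts, radius):
    """Return (reverting_index, reverted_to_index) pairs.

    A revision counts as a revert if its content is identical to an earlier
    revision and at most `radius` revisions lie in between.
    """
    # Keep the last radius + 1 revisions as (index, checksum) pairs.
    window = deque(maxlen=radius + 1)
    reverts = []
    for i, text in enumerate(texts):
        checksum = hashlib.sha1(text.encode("utf-8")).hexdigest()
        # Search newest-first for an earlier revision with the same content.
        for j, old_checksum in reversed(window):
            if old_checksum == checksum:
                if i - j > 1:  # at least one revision was undone
                    reverts.append((i, j))
                break
        window.append((i, checksum))
    return reverts

history = ["A", "B", "C", "A"]
wide = detect_reverts(history, radius=2)
narrow = detect_reverts(history, radius=1)
```

With radius=2 the last revision is seen as restoring the first one; with
radius=1 the first revision has already left the window, so nothing is
reported.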

4 years ago Merge branch 'master' into regex_scanner
groceryheist [Sun, 6 Oct 2019 01:17:03 +0000 (18:17 -0700)]
Merge branch 'master' into regex_scanner

4 years ago update baseline outputs
groceryheist [Sat, 5 Oct 2019 23:36:07 +0000 (16:36 -0700)]
update baseline outputs

4 years ago bugfix, remove old legacy persistence flag
groceryheist [Sat, 5 Oct 2019 23:13:11 +0000 (16:13 -0700)]
bugfix, remove old legacy persistence flag

4 years ago changes for regex scanner addition
sohyeonhwang [Sat, 5 Oct 2019 20:36:58 +0000 (15:36 -0500)]
changes for regex scanner addition

4 years ago don't compute persistence by default
groceryheist [Sun, 22 Sep 2019 22:54:17 +0000 (15:54 -0700)]
don't compute persistence by default

4 years ago elaborate docstring for persistence
groceryheist [Sun, 22 Sep 2019 22:11:59 +0000 (15:11 -0700)]
elaborate docstring for persistence

5 years ago improve help for namespace-include
groceryheist [Mon, 3 Sep 2018 18:30:12 +0000 (11:30 -0700)]
improve help for namespace-include

5 years ago sub assertEquals assertEqual [advanced_persistence]
groceryheist [Mon, 3 Sep 2018 18:21:49 +0000 (11:21 -0700)]
sub assertEquals assertEqual
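
For context, assertEquals was a deprecated alias of assertEqual in Python's
unittest module and was removed in Python 3.12. A minimal sketch of the
supported spelling follows; the test class and its data are hypothetical.

```python
import unittest

class ExampleTest(unittest.TestCase):
    def test_row_count(self):
        rows = ["a", "b", "c"]
        # assertEqual is the supported name; the old assertEquals alias
        # no longer exists on Python 3.12 and later.
        self.assertEqual(len(rows), 3)

suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```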

5 years ago add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:25:08 +0000 (18:25 -0700)]
add namespace filter parameter

5 years ago Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools...
groceryheist [Fri, 24 Aug 2018 01:52:54 +0000 (18:52 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence

5 years ago Add parameter for selecting specific namespaces.
groceryheist [Fri, 24 Aug 2018 01:27:09 +0000 (18:27 -0700)]
Add parameter for selecting specific namespaces.

5 years ago Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools...
groceryheist [Fri, 24 Aug 2018 01:27:09 +0000 (18:27 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:mediawiki_dump_tools into advanced_persistence

5 years ago Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools...
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:23:36 +0000 (18:23 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence

5 years ago add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:25:08 +0000 (18:25 -0700)]
add namespace filter parameter

5 years ago Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools...
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:23:36 +0000 (18:23 -0700)]
Merge branch 'advanced_persistence' of code.communitydata.cc:/mediawiki_dump_tools into advanced_persistence

5 years ago add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:02:56 +0000 (18:02 -0700)]
add namespace filter parameter

5 years ago add namespace filter parameter
Nate E TeBlunthuis [Fri, 24 Aug 2018 01:02:56 +0000 (18:02 -0700)]
add namespace filter parameter

5 years ago add support for persistence with segment matching
groceryheist [Mon, 20 Aug 2018 23:08:16 +0000 (16:08 -0700)]
add support for persistence with segment matching

5 years ago Prefix page titles with namespace names. [mediawiki-utils-migration]
groceryheist [Tue, 10 Jul 2018 05:11:17 +0000 (22:11 -0700)]
Prefix page titles with namespace names.

5 years ago migrate to mwxml. This completes the migration away from python-mediawiki-utilities...
groceryheist [Thu, 5 Jul 2018 08:16:00 +0000 (01:16 -0700)]
migrate to mwxml. This completes the migration away from python-mediawiki-utilities. Except for preserving legacy persistence behavior, we can safely use the nice updates from the mediawiki-utils project.

5 years ago migrate to mwpersistence. this fixes many issues. We preserve legacy persistence...
groceryheist [Thu, 5 Jul 2018 02:06:07 +0000 (19:06 -0700)]
migrate to mwpersistence. this fixes many issues. We preserve legacy persistence behavior using the --persistence-legacy flag.

5 years ago migrate reverts to python-mwreverts
groceryheist [Wed, 4 Jul 2018 22:29:48 +0000 (15:29 -0700)]
migrate reverts to python-mwreverts

5 years ago add note to readme about dependency on compression software
groceryheist [Wed, 4 Jul 2018 22:20:52 +0000 (15:20 -0700)]
add note to readme about dependency on compression software

5 years ago add tests for wikipedia, malformed xml, bzip2, correct bz2 bug in wikiq.
groceryheist [Wed, 4 Jul 2018 22:08:30 +0000 (15:08 -0700)]
add tests for wikipedia, malformed xml, bzip2, correct bz2 bug in wikiq.

5 years ago create baseline tests for xml dump processing
groceryheist [Wed, 4 Jul 2018 06:43:47 +0000 (23:43 -0700)]
create baseline tests for xml dump processing

5 years ago a number of small updates and fixes
Benjamin Mako Hill [Thu, 17 May 2018 21:37:20 +0000 (14:37 -0700)]
a number of small updates and fixes

- fix regex for filename/filetype matches
- unload all files, not just ones that end with xml, in 7z archives
- fix bug that broke stdout
- minor cosmetic fixes
- updated mediawiki-utilities submodule to latest version

6 years ago support 7z archives with multiple files. add urlencode parameter
groceryheist [Thu, 7 Dec 2017 23:10:56 +0000 (15:10 -0800)]
support 7z archives with multiple files. add urlencode parameter

7 years ago fix code to work with bzip files
Benjamin Mako Hill [Tue, 7 Feb 2017 02:25:17 +0000 (18:25 -0800)]
fix code to work with bzip files

8 years ago added list of compressed dump files to .gitignore
Benjamin Mako Hill [Thu, 23 Jul 2015 19:16:31 +0000 (12:16 -0700)]
added list of compressed dump files to .gitignore

8 years ago added support to parse namespaces from title
Benjamin Mako Hill [Thu, 23 Jul 2015 19:12:20 +0000 (12:12 -0700)]
added support to parse namespaces from title

This is necessary for wikis (e.g., Wikia XML dumps) that do not include
namespace metadata as tags within each <page>.
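
Inferring a namespace from a title could be sketched roughly as below. The
function split_namespace and the small namespace table are hypothetical, and
the sketch does not reproduce how wikiq actually does it; it assumes the
wiki's namespace names are known in advance.

```python
def split_namespace(title, namespaces):
    """Return (namespace_id, namespace_name, page_name) inferred from a title."""
    prefix, sep, rest = title.partition(":")
    # Only treat the prefix as a namespace if it is a known namespace name,
    # so ordinary titles that contain a colon stay in the main namespace.
    if sep and prefix in namespaces:
        return namespaces[prefix], prefix, rest
    return 0, "", title  # main namespace has id 0 and an empty name

# Illustrative subset of namespace names and their MediaWiki ids.
namespaces = {"Talk": 1, "User": 2, "User talk": 3}

user_talk = split_namespace("User talk:Example", namespaces)
article = split_namespace("Star Wars: A New Hope", namespaces)
```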

8 years ago added README file to document the submodule
Benjamin Mako Hill [Thu, 23 Jul 2015 02:55:08 +0000 (19:55 -0700)]
added README file to document the submodule

8 years ago created new repository for wikiq with Mediawiki-Utilities as a submodule
Benjamin Mako Hill [Thu, 23 Jul 2015 02:44:52 +0000 (19:44 -0700)]
created new repository for wikiq with Mediawiki-Utilities as a submodule
