Issue2636
Created on 2008-04-15 11:57 by timehorse, last changed 2010-03-16 21:37 by vbr.
| Files | ||||
|---|---|---|---|---|
| File name | Uploaded | Description | Edit | Remove |
| issue2636-patches.tar.bz2 | timehorse, 2008-06-17 17:43 | Contains ALL potential patches, including ones for Atomic Grouping, Shared Constants, Smart Caching and so on... | ||
| issue2636-02.patch | timehorse, 2008-06-17 19:07 | |||
| issue2636-01+09-02+17_backport.diff | mrabarnett, 2008-09-30 00:45 | |||
| issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff | mrabarnett, 2008-09-30 23:42 | |||
| issue2636-features.diff | mrabarnett, 2009-02-03 23:07 | |||
| issue2636-features-2.diff | mrabarnett, 2009-02-08 00:39 | |||
| issue2636-features-3.diff | mrabarnett, 2009-02-24 19:28 | |||
| issue2636-features-4.diff | mrabarnett, 2009-02-26 01:22 | |||
| issue2636-features-5.diff | mrabarnett, 2009-03-01 01:42 | |||
| issue2636-features-6.diff | mrabarnett, 2009-03-07 02:47 | |||
| issue2636-patch-1.diff | mrabarnett, 2009-03-29 00:43 | |||
| issue2636-patch-2.diff | mrabarnett, 2009-04-16 14:58 | |||
| issue2636-20090726.zip | mrabarnett, 2009-07-26 19:11 | |||
| issue2636-20090727.zip | mrabarnett, 2009-07-27 16:13 | |||
| issue2636-20090729.zip | mrabarnett, 2009-07-29 11:10 | |||
| issue2636-20090804.zip | mrabarnett, 2009-08-04 01:30 | |||
| issue2636-20090810.zip | mrabarnett, 2009-08-10 14:18 | |||
| issue2636-20090810#2.zip | mrabarnett, 2009-08-10 15:04 | |||
| issue2636-20090810#3.zip | mrabarnett, 2009-08-10 22:42 | |||
| issue2636-20090815.zip | mrabarnett, 2009-08-15 16:12 | |||
| issue2636-20100116.zip | mrabarnett, 2010-01-16 03:00 | |||
| issue2636-20100204.zip | mrabarnett, 2010-02-04 02:34 | |||
| issue2636-20100210.zip | mrabarnett, 2010-02-10 02:20 | |||
| issue2636-20100211.zip | mrabarnett, 2010-02-11 02:16 | |||
| issue2636-20100217.zip | mrabarnett, 2010-02-17 04:09 | |||
| issue2636-20100218.zip | mrabarnett, 2010-02-18 03:03 | |||
| issue2636-20100219.zip | mrabarnett, 2010-02-19 01:31 | |||
| Features-backslashes.patch | moreati, 2010-02-21 14:46 | |||
| issue2636-20100222.zip | mrabarnett, 2010-02-22 23:28 | |||
| issue2636-20100223.zip | mrabarnett, 2010-02-23 00:39 | |||
| issue2636-20100224.zip | mrabarnett, 2010-02-24 20:25 | |||
| issue2636-20100225.zip | mrabarnett, 2010-02-25 00:12 | |||
| issue2636-20100226.zip | mrabarnett, 2010-02-26 03:20 | |||
| issue2636-20100304.zip | mrabarnett, 2010-03-04 00:41 | |||
| issue2636-20100305.zip | mrabarnett, 2010-03-05 03:27 | |||
| regex_test-20100316 | moreati, 2010-03-16 15:56 | Python 2.6.5 re test run against regex-20100305 | ||
| Messages (145) | |||
|---|---|---|---|
| msg65513 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-15 11:57 | |
I am working on adding features to the current Regexp implementation, which is now set to 2.2.2. These features are to bring the Regexp code closer in line with Perl 5.10 as well as add a few python-specific niceties and potential speed-ups and clean-ups. I will be posting regular patch updates to this thread when major milestones have been reached, with a description of the feature(s) added. Currently, the list of proposed changes is (in no particular order):
1) Fix issue 433030 (http://bugs.python.org/issue433030) by adding support for Atomic Grouping and Possessive Qualifiers
2) Make named matches direct attributes of the match object; i.e. instead of m.group('foo'), one will be able to write simply m.foo.
3) (maybe) make Match objects subscriptable, such that m[n] is equivalent to m.group(n) and allow slicing.
4) Implement Perl-style back-references including relative back-references.
5) Add a well-formed, python-specific comment modifier, e.g. (?P#...); the difference between (?P#...) and Perl/Python's (?#...) is that the former will allow nested parentheses as well as parenthetical escaping, as in a pattern of the form '(?P# Evaluate (the following) expression, 3\) using some other technique)'. (?P#...) will interpret this entire expression as a comment, whereas with (?#...) everything from ' expression...' onward would be considered part of the match. (?P#...) will necessarily be slower than (?#...) and so should only be used if a richer commenting style is required but the verbose mode is not desired.
6) Add official support for fast, non-repeating capture groups with the Template option. Template is unofficially supported and disables all repeat operators (*, + and ?). This would mainly consist of documenting its behavior.
7) Modify the re compiled expression cache to better handle the thrashing condition. Currently, when regular expressions are compiled, the result is cached so that if the same expression is compiled again, it is retrieved from the cache and no extra work has to be done. This cache supports up to 100 entries. Once the 100th entry is reached, the cache is cleared and a new compile must occur. The danger, albeit rare, is that one may compile the 100th expression only to find that one recompiles it and has to do the same work all over again when it may have been done 3 expressions ago. By modifying this logic slightly, it is possible to establish an arbitrary counter that gives a time stamp to each compiled entry and, instead of clearing the entire cache when it reaches capacity, only eliminate the oldest half of the cache, keeping the half that is more recent. This should limit the possibility of thrashing to cases where a very large number of Regular Expressions are continually recompiled. In addition to this, I will update the limit to 256 entries, meaning that the 128 most recent are kept.
8) Emacs/Perl style character classes, e.g. [:alphanum:]. For instance, :alphanum: would not include the '_' in the character class.
9) C-Engine speed-ups. I am commenting and cleaning up the _sre.c Regexp engine to make it flow more linearly, rather than with all the current gotos, and replacing the switch-case statements with lookup tables, which in tests have shown to be faster. This will also include adding many more comments to the C code in order to make it easier for future developers to follow. These changes are subject to testing and some modifications may not be included in the final release if they are shown to be slower than the existing code. Also, a number of Macros are being eliminated where appropriate.
10) Export any (not already) shared value between the Python Code and the C code, e.g. the default Maximum Repeat count (65536); this will allow those constants to be changed in 1 central place.
11) Various other Perl 5.10 conformance modifications, TBD.
More items may come and suggestions are welcome.
-----
Currently, I have code which implements 5) and 7), have done some work on 10) and am almost done with 9). When 9) is complete, I will work on 1), some of which, such as parsing, is already done, then probably 8) and 4) because they should not require too much work -- 4) is parser-only AFAICT. Then, I will attempt 2) and 3), though those will require changes at the C-Code level. Then I will investigate what additional elements of 11) I can easily implement. Finally, I will write documentation for all of these features, including 6).
In a few days, I will provide a patch with my interim results and will update the patches with regular updates when Milestones are reached. |
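The eviction scheme proposed in item 7 can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the class name `HalfEvictingCache`, its `get` method, and the choice to refresh an entry's stamp on every hit are all assumptions made for the example.

```python
import re


class HalfEvictingCache:
    """Hypothetical sketch of the item 7 idea; not the actual patch."""

    def __init__(self, max_size=256):
        self.max_size = max_size
        self._counter = 0      # arbitrary counter used as a time stamp
        self._entries = {}     # key -> (stamp, value)

    def get(self, key, factory):
        """Return the cached value for key, building it with factory on a miss."""
        self._counter += 1
        if key in self._entries:
            value = self._entries[key][1]
        else:
            if len(self._entries) >= self.max_size:
                self._evict_oldest_half()
            value = factory(key)
        # Assumption: a hit also refreshes the stamp, so recently used
        # entries survive the next eviction.
        self._entries[key] = (self._counter, value)
        return value

    def _evict_oldest_half(self):
        # Drop the half of the entries with the smallest stamps.
        by_age = sorted(self._entries, key=lambda k: self._entries[k][0])
        for key in by_age[: len(by_age) // 2]:
            del self._entries[key]

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


cache = HalfEvictingCache(max_size=4)
for text in ("a", "b", "c", "d"):
    cache.get(text, re.compile)
cache.get("a", re.compile)   # hit: "a" becomes the most recent entry
cache.get("e", re.compile)   # cache is full: "b" and "c" are evicted first
```

With a limit of 256, a full cache would shed its 128 oldest entries and keep the 128 most recent, which matches the numbers given in the post.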
|||
| msg65593 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-17 22:06 | |
I am very sorry to report (at least for me) that as of this moment, item 9), although not yet complete, is stable and able to pass all the existing python regexp tests. Because these tests are timed, I am using the timings from the first suite of tests to perform a benchmark of performance between old and new code. Based on discussion with Andrew Kuchling, I have decided for the sake of simplicity that the "timing" of each version is to be calculated as the absolute minimum execution time observed, because it is believed this execution would have had the most continuous CPU cycles and thus most closely represents the true execution time.
It is this current conclusion that greatly saddens me, not that the effort has not been valuable in understanding the current engine. Indeed, I understand the current engine now well enough that I could proceed with the other modifications as-is rather than implementing them with the new engine. Mind you, I will likely not bring over the copious comments that the new engine received when I translated it to a form without C_Macros and gotos, as that would require too much effort IMHO.
Anyway, all that being said, and keeping in mind that I am not 100% satisfied with the new engine and may still be able to wring some timing out of it -- not that I will spend much more time on this -- here is where we currently stand:
Old Engine: 6.574s
New Engine: 7.239s
This makes the Old Engine 665ms faster over the entire first test_re.py suite: it takes about 9% less time than the New Engine (equivalently, the New Engine is about 10% slower). |
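The "minimum observed time" approach and the two percentages above can be reproduced with a short sketch. The helper names `best_time`, `slowdown` and `time_saved` are made up for this illustration, and the sample pattern is arbitrary; only `timeit.repeat` and `min` are standard.

```python
import re
import timeit


def best_time(func, repeat=3, number=1000):
    """Minimum total time over several runs, taken as the least disturbed run."""
    return min(timeit.repeat(func, repeat=repeat, number=number))


def slowdown(old_seconds, new_seconds):
    """How much slower the new code is, relative to the old code."""
    return (new_seconds - old_seconds) / old_seconds


def time_saved(old_seconds, new_seconds):
    """How much less time the old code takes, relative to the new code."""
    return (new_seconds - old_seconds) / new_seconds


# Figures quoted in the post.
old_engine, new_engine = 6.574, 7.239
print("difference: %.3fs" % (new_engine - old_engine))
print("new engine slowdown: %.1f%%" % (100 * slowdown(old_engine, new_engine)))
print("old engine saving:   %.1f%%" % (100 * time_saved(old_engine, new_engine)))

# Timing an arbitrary pattern the same way.
pattern = re.compile(r'(\w+)@(\w+)\.com')
elapsed = best_time(lambda: pattern.search('mail bob@example.com now'))
```

The two ratios differ only in the baseline, which is why the same 665ms gap can be described as roughly 9% or roughly 10%.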
|||
| msg65613 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-18 13:38 | |
Here are the modifications so far for item 9) in _sre.c plus some small modifications to sre_constants.h which are only to get _sre.c to compile; normally sre_constants.h is generated by sre_constants.py, so this is not the final version of that file. I also would have intended to make SRE_CHARSET and SRE_COUNT use lookup tables, as well as maybe others, but not likely any other lookup tables. I also want to remove alloc_pos out of the self object and make it a parameter to the ALLOC parameter and probably get rid of the op_code attribute since it is only used in 1 place to save one subtract in a very rare case. But I want to resolve the 10% problem first, so I would appreciate it if people could look at the REMOVE_SRE_MATCH_MACROS section of code, compare it to the non-REMOVE_SRE_MATCH_MACROS version of SRE_MATCH, and see if you can suggest anything to make the former (new code) faster to get me that elusive 10%. |
|||
| msg65614 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-18 14:23 | |
Here is a patch to implement item 7) |
|||
| msg65617 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-18 14:50 | |
This simple patch adds (?P#...)-style comment support. |
|||
| msg65725 | Author: Jim Jewett (jimjjewett) | Date: 2008-04-24 14:23 | |
> These features are to bring the Regexp code closer in line with Perl 5.10
Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?
> as well as add a few python-specific
because this also adds to the scope.
> 2) Make named matches direct attributes
> of the match object; i.e. instead of m.group('foo'),
> one will be able to write simply m.foo.
> 3) (maybe) make Match objects subscriptable, such
> that m[n] is equivalent to m.group(n) and allow slicing.
(2) and (3) would both be nice, but I'm not sure it makes sense to do
*both* instead of picking one.
> 5) Add a well-formed, python-specific comment modifier,
> e.g. (?P#...);
[handles parens in comments without turning on verbose, but is slower]
Why? It adds another incompatibility, so it has to be very useful or
clear. What exactly is the advantage over just turning on verbose?
> 9) C-Engine speed-ups. ...
> a number of Macros are being eliminated where appropriate.
Be careful on those, particularly on str/unicode and different compile options.
|
|||
| msg65726 | Author: Amaury Forgeot d'Arc (amaury.forgeotdarc) * | Date: 2008-04-24 14:31 | |
> > These features are to bring the Regexp code closer in line
> > with Perl 5.10
>
> Why 5.1 instead of 5.8 or at least 5.6? Is it just a scope-creep issue?
5.10.0 comes after 5.8 and is the latest version (2007/12/18)! Yes it is confusing. |
|||
| msg65727 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-24 16:06 | |
Thanks Jim for your thoughts!
Amaury has already explained about Perl 5.10.0. I suppose it's like Macintosh version numbering, since Mac Tiger went from version 10.4.9 to 10.4.10 and 10.4.11 a few years ago. Maybe we should call Python 2.6 Python 2.06 just in case. But 2.6 is the known last in the 2 series so it's not a problem for us! :)
>> as well as add a few python-specific
>
> because this also adds to the scope.
At this point the only python-specific changes I am proposing would be items 2, 3 (discussed below), 5 (discussed below), 6 and 7. 6 is only a documentation change; the code is already implemented. 7 is just a better behavior. I think it is RARE one compiles more than 100 unique regular expressions, but you never know as projects tend to grow over time, and in the old code the 101st would be recompiled even if it was just compiled 2 minutes ago. The patch is available so I leave it to the community to judge for themselves whether it is worth it, but as you can see, it's not a very large change.
>> 2) Make named matches direct attributes
>> of the match object; i.e. instead of m.group('foo'),
>> one will be able to write simply m.foo.
>
>> 3) (maybe) make Match objects subscriptable, such
>> that m[n] is equivalent to m.group(n) and allow slicing.
>
> (2) and (3) would both be nice, but I'm not sure it makes sense to do
> *both* instead of picking one.
Well, I think named matches are better than numbered ones, so I'd definitely go with 2. The problem with 2, though, is that it still leaves the rather typographically intense m.group(n), since I cannot write m.3. However, since capture groups are always numbered sequentially, it models a list very nicely. So I think for indexing by group number, the subscripting operator makes sense. I was not originally suggesting m['foo'] be supported, but I can see how that may come out of 3. But there is a restriction on python named matches that they have to be valid python identifiers, and that restriction fits 2 better than 3, because 3 would not require it but 2 would. So at least I want 2, but it seems IMHO m[1] is better than m.group(1) and not in the least a hard or confusing way of retrieving the given group. Mind you, the Match object is a C-struct with python binding and I'm not exactly sure how to add either feature to it, but I'm sure the C-API manual will help with that.
>> 5) Add a well-formed, python-specific comment modifier,
>> e.g. (?P#...);
>
> [handles parens in comments without turning on verbose, but is slower]
>
> Why? It adds another incompatibility, so it has to be very useful or
> clear. What exactly is the advantage over just turning on verbose?
Well, Larry Wall and Guido agreed long ago that we, the python community, own all expressions of the form (?P...) and although it'd be my preference to make (?#...) more in conformance with understanding parenthesis nesting, changing the logic behind THAT would make python non-standard. So as far as any conflicting design, we needn't worry.
As for speed, this all occurs in the parser and does not affect the compiler or engine. It occurs only after a (?P has been read and then only as the last check before failure, so it should not be much slower except when the expression is invalid. The actual execution time to find the closing parenthesis of (?P#...) is a bit slower than that for (?#...) but not by much.
Verbose is generally a good idea for anything more than a trivial Regular Expression. However, it can have overhead if not included as the first flag: an expression is always checked for verbose post-compilation and if it is encountered, the expression is compiled a second time, which is somewhat wasteful. But the reason I like (?P#...) over (?#...) is that I think people would tend to assume:
r'He(?# 2 (TWO) ls)llo'
should match "Hello" but it doesn't. That expression only matches "He ls)llo", so I created (?P#...) to make the comment match type more intuitive:
r'He(?P# 2 (TWO) ls)llo'
matches "Hello".
>> 9) C-Engine speed-ups. ...
>> a number of Macros are being eliminated where appropriate.
>
> Be careful on those, particular on str/unicode and different
> compile options.
Will do; thanks for the advice! I have only observed the UNICODE flag controlling whether certain code is used (besides the ones I've added) and have tried to stay true to that when I encounter it. Mind you, unless I can get my extra 10% it's unlikely I'd actually go with item 9 here, even if it is easier to read IMHO. However, I want to run the new engine proposal through gprof to see if I can track down some bottlenecks.
At some point, I hope to get my current changes on Launchpad if I can get that working. If I do, I'll give a link to how people can check out my working code here as well. |
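The behaviour of (?#...) with nested parentheses can be checked against the standard `re` module with a small sketch. The helper `matches_hello` is hypothetical and deliberately treats a compile error the same as a failed match, since the stray ")" left after the comment may be rejected at compile time as an unbalanced parenthesis rather than matched literally. The proposed (?P#...) syntax is not part of the standard `re` module, so it is not exercised here; verbose mode is shown as the existing way to write a comment containing parentheses.

```python
import re


def matches_hello(pattern, flags=0):
    """Return True only if pattern compiles and matches "Hello" in full."""
    try:
        compiled = re.compile(pattern, flags)
    except re.error:
        # A pattern that does not compile cannot match anything.
        return False
    return compiled.fullmatch("Hello") is not None


# (?#...) ends at the first ")", so "(TWO" swallows the comment's close.
nested_comment = r'He(?# 2 (TWO) ls)llo'

# The same comment without inner parentheses behaves as expected.
flat_comment = r'He(?# 2 ls)llo'

# In verbose mode a "#" comment runs to the end of the line,
# so parentheses inside it are harmless.
verbose_comment = """He   # 2 (TWO) ls
llo"""

print(matches_hello(nested_comment))
print(matches_hello(flat_comment))
print(matches_hello(verbose_comment, re.VERBOSE))
```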
|||
| msg65734 | Author: Jim Jewett (jimjjewett) | Date: 2008-04-24 18:09 | |
Python 2.6 isn't the last, but Guido has said that there won't be a 2.10.
> Match object is a C-struct with python binding
> and I'm not exactly sure how to add either feature to it
I may be misunderstanding -- isn't this just a matter of writing the
function and setting it in the tp_as_sequence and tp_as_mapping slots?
> Larry Wall and Guido agreed long ago that we, the python
> community, own all expressions of the form (?P...)
Cool -- that reference should probably be added to the docs. For someone
trying to learn or translate regular expressions, it helps to know that (?P
...) is explicitly a python extension (even if Perl adopts it later).
Definitely put the example in the doc.
r'He(?# 2 (TWO) ls)llo' should match "Hello" but it doesn't. Maybe
even without the change, as doco on the current situation.
Does VERBOSE really have to be the first flag, or does it just have to be on
the whole pattern instead of an internal switch?
I'm not sure I fully understand what you said about template. Is this a
special undocumented switch, or just an internal optimization mode that
should be triggered whenever the repeat operators don't happen to occur?
|
|||
| msg65838 | Author: Antoine Pitrou (pitrou) | Date: 2008-04-26 10:08 | |
I don't know anything about regexp implementation, but if you replace a switch-case with a function lookup table, it isn't surprising that the new version ends up slower. A local jump is always faster than a function call, because of the setup overhead and stack manipulation the latter involves. So you might try to do the cleanup while keeping the switch-case structure, if possible. |
|||
| msg65841 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-04-26 11:51 | |
Thank you and Merci Antoine! That is a good point. It is clearly specific to the compiler whether a switch-case will be turned into a series of conditional branches or simply creating an internal jump table with lookup. And it is true that most compilers, if I understand correctly, use the jump-table approach for any switch-case over 2 or 3 entries when the cases are tightly grouped and near 0. That is probably why the original code worked so fast. I'll see if I can combine the best of both approaches. Thanks again! |
|||
| msg66033 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-05-01 14:15 | |
I am making my changes in a Bazaar branch hosted on Launchpad. It took me quite a while to get things set up more-or-less logically but there they are and I'm currently trying to re-apply my local changes up to today into the various branches I have. Each of the 11 issues I outlined originally has its own branch, with a root branch from which all these branches are derived to serve as a place for a) merging in concurrent python 2.6 alpha development and b) applying any additional re changes that don't fall into any of the other categories, of which I have so far found only 2 small ones. Anyway, if anyone is interested in monitoring my progress, it is available at: https://code.launchpad.net/~timehorse/ I will still post major milestones here, but one can monitor day-to-day progress on Launchpad. Also on Launchpad you will find more detail on the plans for each of the 11 modifications, for the curious. Thanks again for all the advice! |
|||
| msg67309 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-05-24 21:38 | |
I am finally making progress again, after a month of changing my patches from my local svn repository to bazaar hosted on launchpad.net, as stated in my last update. I also have more or less finished the probably easiest item, #5, so I have a full patch for that available now. First, though, I want to update my "No matter what" patch, which is to say these are the changes I want to make if any changes are made to the Regexp code. |
|||
| msg67447 | Author: Mark Summerfield (mark) | Date: 2008-05-28 13:38 | |
AFAIK if you have a regex with named capture groups there is no direct
way to relate them to the capture group numbers.
You could do (untested; Python 3 syntax):
    d = {v: k for k, v in match.groupdict().items()}  # captured text -> name
    for i in range(1, match.lastindex + 1):  # group 0 is the whole match
        print(i, match.group(i), d.get(match.group(i)))  # None if unnamed
One possible solution would be a grouptuples() function that returned a
tuple of 3-tuples (index, name, captured_text) with the name being None
for unnamed groups.
Anyway, good luck with all your improvements, I will be especially glad
if you manage to do (2) and (8) (and maybe (3)).
|
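A grouptuples() helper along the lines suggested above can be sketched on top of the existing API. The function name and the decision to start at group 1 are assumptions for the illustration; it relies on `groupindex` and `groups`, which compiled pattern objects already expose, to map group numbers back to names.

```python
import re


def grouptuples(match):
    """Return a tuple of (index, name, captured_text) for groups 1..n.

    name is None for unnamed groups. Hypothetical helper, not part of `re`.
    """
    # groupindex maps name -> group number; invert it for lookup by number.
    names = {index: name for name, index in match.re.groupindex.items()}
    return tuple(
        (index, names.get(index), match.group(index))
        for index in range(1, match.re.groups + 1)
    )


m = re.match(r'(?P<year>\d{4})-(\d{2})-(?P<day>\d{2})', '2008-05-28')
for index, name, text in grouptuples(m):
    print(index, name, text)
```

Looking names up by group number avoids the ambiguity of looking them up by captured text, which breaks as soon as two groups capture the same string.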
|||
| msg67448 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-05-28 13:57 | |
Mark scribbled:
> One possible solution would be a grouptuples() function that returned
> a tuple of 3-tuples (index, name, captured_text) with the name being
> None for unnamed groups.
Hmm. Well, that's not a bad idea at all IMHO and would, AFAICT, probably be easier to do than (2). I would still do (2), but will try to add that to one of the existing items or spawn another item for it, since it is kind of a distinct feature.
My preference right now is to finish off the test cases for (7) because it is already coded, then finish the work on (1) as that was the original reason for modification, then on to (2) and then (3) as they are related, and then I don't mind tackling (8) because I think that one shouldn't be too hard. Interestingly, the existing engine code (sre_parse.py) has a place-holder, commented out, for character classes but it was never properly implemented. And I will warn that with Unicode, I THINK all the character classes exist as unicode functions or can be implemented as multiple unicode functions, but I'm not 100% sure, so if I run into that problem, some character classes may initially be left out while I work on another item.
Anyway, thanks for the input, Mark! |
|||
| msg68336 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-06-17 17:43 | |
Well, it's time for another update on my progress...
Some good news first: Atomic Grouping is now completed, tested and documented, and as stated above, is classified as issue2636-01 and related patches. Secondly, with caveats listed below, Named Match Group Attributes on a match object (item 2) is also more or less complete at issue2636-02 -- it only lacks documentation.
Now, I want to also update my list of items. We left off at 11: Other Perl-specific modifications. Since that time, I have spawned a number of other branches, the first of which (issue2636-12) I am happy to announce is also complete!
12) Implement the changes to the documentation of re as per Jim J. Jewett's suggestion from 2008-04-24 14:09. Again, this has been done.
13) Implement a grouptuples(...) method as per Mark Summerfield's suggestion on 2008-05-28 09:38. grouptuples would take the same filtering parameters as the other group* functions, and would return a list of 3-tuples (unless only 1 group was requested). It should default to all match groups (1..n, not group 0, the matching string).
14) As per PEP-3131 and the move to Python 3.0, python will begin to allow full UNICODE-compliant identifier names. Correspondingly, it would be the responsibility of this item to allow UNICODE names for match groups. This would allow retrieval of UNICODE names via the group* functions or, when combined with Item 3, the getitem handler (m[u'...']) (03+14) and the attribute name itself (e.g. getattr(m, u'...')) when combined with item 2 (02+14).
15) Change the Pattern_Type, Match_Type and Scanner_Type (experimental) to become richer Python Types. Specifically, add __doc__ strings to each of these types' methods and members.
16) Implement various FIXMEs.
16-1) Implement the FIXME such that if m is a MatchObject, del m.string will disassociate the original matched string from the match object; string would be the only member that would allow modification or deletion and you will not be able to modify the m.string value, only delete it.
-----
Finally, I want to add a couple of notes about Item 2:
Firstly, as noted in Item 14, I wish to add support for UNICODE match group names, and the current version of the C-code would not allow that; it would only make sense to add UNICODE support if 14 is implemented, so adding support for UNICODE match object attributes would depend on both items 2 and 14. Thus, that would be implemented in issue2636-02+14.
Secondly, there is a FIXME which I discussed in Item 16; I gave that problem its own item and branch. Also, as stated in Item 15, I would like to add more robust help code to the Match object and bind __doc__ strings to the fixed attributes. Although this would not directly affect the Item 2 implementation, it would probably involve moving some code around in its vicinity.
Finally, I would like suggestions on how to handle name collisions when match group names are provided as attributes. For instance, an expression like '(?P<pos>.*)' would match more or less any string and assign it to the name "pos". But "pos" is already an attribute of the Match object, and therefore pos cannot be exposed as a named match group attribute, since match.pos will return the usual meaning of pos for a match object, not the value of the capture group named "pos".
I have 3 proposals as to how to handle this:
a) Simply disallow the exposure of match group name attributes if the names collide with an existing member of the basic Match Object interface.
b) Expose the reserved names through a special prefix notation, and for forward compatibility, expose all names via this prefix notation. In other words, if the prefix was 'k', match.kpos could be used to access pos; if it was '_', match._pos would be used. If Item 3 is implemented, it may be sufficient to allow access via match['pos'] as the canonical way of handling match group names using reserved words.
c) Don't expose the names directly; only expose them through a prefixed name, e.g. match._pos or match.kpos.
Personally, I like a) because if Item 3 is implemented, it makes a fairly useful shorthand for retrieving keyword names when a keyword is used for a name. Also, we could put a deprecation warning in to inform users that match group names that are keywords in the Match Object will eventually be disallowed. However, I don't support restricting the match group names any more than they already are (they must be a valid python identifier only) so again I would go with a) and nothing more and that's what's implemented in issue2636-02.patch.
-----
Now, rather than posting umpteen patch files I am posting one bz2-compressed tar of ALL patch files for all threads, where each file is of the form:
issue2636(-\d\d|+\d\d)*(-only)?.patch
For instance, issue2636-01.patch is the p1 patch that is a difference between the current Python trunk and all that would need to be implemented to support Atomic Grouping / Possessive Qualifiers. Combined branches are combined with a PLUS ('+') and sub-branches concatenated with a DASH ('-'). Thus, "issue2636-01+09-01-01+10.patch" is a patch which combines the work from Item 1: Atomic Grouping / Possessive Qualifiers, the sub-sub branch of Item 9: Engine Cleanups and Item 10: Shared Constants. Item 9 has both a child and a grandchild. The child (09-01) is my proposed engine redesign with the single loop; the grandchild (09-01-01) is the redesign with the triple loop. Finally, the optional "-only" flag means that the diff is against the core SRE modifications branch and thus does not include the core branch changes.
As noted above, Items 01, 02, 05, 07 and 12 should be considered more or less complete and ready for merging assuming I don't identify in my implementation of the other items that I neglected something in these. The rest, including the combined items, are all provided in the given tarball. |
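The "pos" collision described above is easy to see with the existing Match API. This short sketch uses only standard `re` calls; the variable names are illustrative, and the subscript form assumes a Python version in which match objects support m[...] (3.6 or later).

```python
import re

m = re.match(r'(?P<pos>.*)', 'any string')

# The built-in attribute: the start offset passed to the search, not the group.
start_offset = m.pos

# The capture group named "pos" has to be fetched explicitly.
by_method = m.group('pos')

# Subscript access (Item 3 style) reaches the group without any ambiguity.
by_subscript = m['pos']

print(start_offset, repr(by_method), repr(by_subscript))
```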
|||
| msg68339 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-06-17 19:07 | |
Sorry, as I stated in the last post, I generated the patches then realized that I was missing the documentation for Item 2, so I have updated the issue2636-02.patch file and am attaching that separately until the next release of the patch tarball. issue2636-02-only.patch should be ignored and I will only regenerate it with the correct documentation in the next tarball release so I can move on to either Character Classes or Relative Back-references. I wanna pause Item 3 for the moment because 2, 3, 13, 14, 15 and 16 all seem closely related and I need a break to allow my mind to wrap around the big picture before I try and tackle each one. |
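The patch naming scheme used in the tarball (branches joined with '+', sub-branches with '-', optional "-only" suffix) can be parsed with a small sketch. The helper `parse_patch_name` and the constant `PATCH_NAME` are hypothetical; the regular expression is an interpretation of the informal pattern quoted in the thread, with the '+' and '.' treated as literal characters.

```python
import re

# Interpretation of "issue2636(-\d\d|+\d\d)*(-only)?.patch" as a real regex.
PATCH_NAME = re.compile(r'issue2636((?:[-+]\d\d)*)(-only)?\.patch')


def parse_patch_name(filename):
    """Return (branches, only) for a patch file name, or None if it does not fit.

    branches is a list such as ['01', '09-01-01', '10'];
    only is True when the "-only" suffix is present.
    """
    m = PATCH_NAME.fullmatch(filename)
    if m is None:
        return None
    spec = m.group(1)
    # Drop the leading separator, then split combined branches on '+'.
    branches = spec[1:].split('+') if spec else []
    return branches, m.group(2) is not None


print(parse_patch_name("issue2636-01+09-01-01+10.patch"))
print(parse_patch_name("issue2636-02-only.patch"))
```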
|||
| msg68358 | Author: Mark Summerfield (mark) | Date: 2008-06-18 07:13 | |
[snip]
> 13) Implement a grouptuples(...) method as per Mark Summerfield's
> suggestion on 2008-05-28 09:38. grouptuples would take the same filtering
> parameters as the other group* functions, and would return a list of
> 3-tuples (unless only 1 group was requested). It should default to all
> match groups (1..n, not group 0, the matching string).
:-)
[snip]
> Finally, I would like suggestions on how to handle name collisions when
> match group names are provided as attributes. For instance, an
> expression like '(?P<pos>.*)' would match more or less any string and
> assign it to the name "pos". But "pos" is already an attribute of the
> Match object, and therefore pos cannot be exposed as a named match group
> attribute, since match.pos will return the usual meaning of pos for a
> match object, not the value of the capture group names "pos".
>
> I have 3 proposals as to how to handle this:
>
> a) Simply disallow the exposure of match group name attributes if the
> names collide with an existing member of the basic Match Object
> interface.
I don't like the prefix ideas, and now that you've spelt it out I don't
like that sometimes m.foo will work and sometimes it won't. So I prefer
m['foo'] to be the canonical way because that guarantees your code is
always consistent.
------------------------------------------------------------
BTW I wanted to do a simple regex to match a string that might or might
not be quoted, and that could contain quotes (but not those used to
delimit it). My first attempt was illegal:
(?P<quote>['"])?([^(?=quote)])+(?(quote)(?=quote))
It isn't hard to work round but it did highlight the fact that you can't
use captures inside character classes. I don't know if Perl allows this;
I guess if it doesn't then Python shouldn't either since GvR wants the
engine to be Perl compatible.
|
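One way to work round the restriction mentioned above, without a capture inside a character class, is a negative lookahead on a back-reference. The sketch below is an illustration only, not something proposed in the thread; the group name `body` and the variable `QUOTED` are made up for the example. It relies on the fact that a back-reference to a group that did not participate fails to match, so the negative lookahead succeeds when no opening quote was seen.

```python
import re

QUOTED = re.compile(
    r'''(?P<quote>['"])?'''              # optional opening quote
    r'''(?P<body>(?:(?!(?P=quote)).)+)'''  # any run of chars that are not that quote
    r'''(?(quote)(?P=quote))'''          # closing quote only if one was opened
)

for text in ('"it\'s"', "'say \"hi\"'", 'plain'):
    m = QUOTED.fullmatch(text)
    print(text, '->', m.group('body'))
```

Input with an unmatched opening quote may still be accepted by backtracking to the unquoted case, so a stricter check would be needed if that matters.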
|||
| msg68399 | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-06-19 12:01 | |
Thanks for weighing in, Mark! Your point is valid and quite fair, though I would not assume that Item 3 would be included just because Item 2 isn't. I will do my best to develop both, but I do not make the final decision as to what Python includes. That said, Item 3 seems very likely at this point, so we may be okay, but let me give this one more try, as I think I have a better solution to make Item 2 more palatable. Let's say we have 5 choices here:
> a) Simply disallow the exposure of match group name attributes if the
> names collide with an existing member of the basic Match Object
> interface.
>
> b) Expose the reserved names through a special prefix notation, and
> for forward compatibility, expose all names via this prefix notation.
> In other words, if the prefix was 'k', match.kpos could be used to
> access pos; if it was '_', match._pos would be used. If Item 3 is
> implemented, it may be sufficient to allow access via match['pos'] as
> the canonical way of handling match group names using reserved words.
>
> c) Don't expose the names directly; only expose them through a
> prefixed name, e.g. match._pos or match.kpos.
d) (As Mark suggested) we drop Item 2 completely. I have not invested much work in this so that would not bother me, but IMHO I actually prefer Item 2 to 3, so I would really like to see it preserved in some form.
e) Add an option, re.MATCH_ATTRIBUTES, that is used as a Match Creation flag. When the re.MATCH_ATTRIBUTES or re.A flag is included in the compile, or (?a) is included in the pattern, it will do 2 things. First, it will raise an exception if either a) there exists an unnamed capture group or b) the capture group name is a reserved keyword. In addition to this, I would put in a hook to support a from __future__ so that any post-2.6 changes to the match object type can be smoothly integrated a version early, giving programmers time to adapt before those changes arrive. Secondly, I would *conditionally* allow arbitrary capture group names via the __getattr__ handler IFF that flag was present; otherwise you could not access Capture Groups by name via match.foo.
I really like the idea of e), so I'm taking Item 2 out of the "ready for merge" category and putting it in the queue for the modifications spelled out above. I'm not too worried about our flags differing from Perl's too much, as we did base our first 4 on Perl (x, s, m, i), but subsequently added Unicode and Locale, which Perl does not have, and never implemented o (since our caching semantic already pretty much gives every expression that), e (which is specific to Perl syntax AFAICT) and g (which can be simulated via re.split). So I propose we take A and implement it as I've specified, and that is the current goal of Item 2. Once this is done and working, we can decide whether it should be included in the Python trunk. How does that sound to you, Mark, and anyone else who wishes to weigh in? |
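To make option e) more concrete, here is a minimal sketch in today's Python. Both `AttrMatch` and `colliding_group_names` are hypothetical names invented for illustration; neither is part of the re module nor of any patch attached to this issue. The sketch only assumes the existing `group()` and `groupindex` interfaces.

```python
import re

# Attribute names already used by match objects (string, pos, end, group, ...).
_MATCH_TYPE = type(re.match("", ""))
RESERVED_NAMES = frozenset(n for n in dir(_MATCH_TYPE) if not n.startswith("_"))


def colliding_group_names(pattern):
    """Return group names of a compiled pattern that shadow match attributes."""
    return set(pattern.groupindex) & RESERVED_NAMES


class AttrMatch:
    """Hypothetical wrapper exposing named groups as attributes."""

    def __init__(self, match):
        self._match = match

    def __getattr__(self, name):
        # Called only when normal lookup fails; private names are never groups.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self._match.group(name)
        except IndexError:
            raise AttributeError(name) from None


m = AttrMatch(re.match(r"(?P<year>\d{4})-(?P<month>\d{2})", "2008-06"))
print(m.year, m.month)  # 2008 06
print(colliding_group_names(re.compile(r"(?P<string>\w+) (?P<foo>\w+)")))  # {'string'}
```

A compile-time check like `colliding_group_names` is roughly what the proposed flag would have to perform before raising its exception.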
|||
| msg68409 - (view) | Author: Mark Summerfield (mark) | Date: 2008-06-19 14:15 | |
[snip]
It seems to me that both using a special prefix or adding an option are
adding a lot of baggage and will increase the learning curve.
The nice thing about (3) (even without slicing) is that it seems a v.
natural extension. But (2) seems magical (i.e., Perl-like rather than
Pythonic) which I really don't like.
BTW I just noticed this:
>>> "{0!r}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!s}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
>>> "{0!a}".format(rx)
'<_sre.SRE_Pattern object at 0x9ded020>'
That's fair enough, but maybe for !s the output should be rx.pattern?
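For comparison, a small sketch of how to get at the pattern text explicitly. The example pattern is an assumption chosen for illustration; what `str()` prints for a compiled pattern differs between Python versions, so no output is shown for it.

```python
import re

rx = re.compile(r"\d+")

# !s goes through str(); the exact text depends on the Python version.
print("{0!s}".format(rx))

# The source pattern is always available through the .pattern attribute.
print(rx.pattern)  # \d+
```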
|
|||
| msg73185 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2008-09-13 13:40 | |
See also #3825. |
|||
| msg73295 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-16 11:59 | |
Update 16 Sep 2008: Based on the work for issue #3825, I would like to simply update the item list as follows:
1) Atomic Grouping / Possessive Quantifiers (See also Issue #433030) [Complete]
2) Match group names as attributes (e.g. match.foo) [Complete save issues outlined above]
3) Match group indexing (e.g. match['foo'], match[3])
4) Perl-style back-references (e.g. compile(r'(a)\g{-1}')), and possibly adding the r'\k' escape sequence for keywords.
5) Parenthesis-Aware Python Comment (e.g. r'(?P#...)') [Complete]
6) Expose support for Template expressions (expressions without repeat operators), adding test cases and documentation for existing code.
7) Larger compiled Regexp cache (256 vs. 100) and reduced thrashing risk. [Complete]
8) Character Classes (e.g. r'[:alphanum:]')
9) Proposed Engine redesigns and cleanups (core item only contains cleanups and comments to the current design but does not modify the design).
9-1) Single-loop Engine redesign that runs 8% slower than current. [Complete]
9-1-1) 3-loop Engine redesign that runs 10% slower than current. [Complete]
9-2) Matthew Barnett's Engine redesign as per issue #3825
10) Have all C-Python shared constants stored in 1 place (sre_constants.py) and generated by that into C constants (sre_constants.h). [Complete AFAICT]
11) Scan Perl 5.10.0 for other potential additions that could be implemented for Python.
12) Documentation suggestions by Jim J. Jewett [Complete]
13) Add grouptuples method to the Match object (i.e. match.grouptuples() returns (<index>, <name or None>, <value>) ) suitable for iteration.
14) UNICODE match group names, as per PEP-3131.
15) Add __doc__ strings and other Python niceties to the Pattern_Type, Match_Type and Scanner_Type (experimental).
16) Implement any remaining TODOs and FIXMEs in the Regexp modules.
16-1) Allow for the disassociation of a source string from a Match_Type, assuming this will still leave the object in a "reasonable" state.
17) Variable-length [Positive and Negative] Look-behind assertions, as described and implemented in Issue #3825.
---
Now, we have a combination of Items 1, 9-2 and 17 available in issue #3825, so for now, refer to that issue for the 01+09-02+17 combined solution. Eventually, I hope to merge the work between this and that issue.
I sadly admit I have made no progress on this since June because I am managing 30-some lines of development, some of which have complex diamond branching, e.g.:
01 is the child of Issue2636
09 is the child of Issue2636
10 is the child of Issue2636
09-01 is the child of 09
09-01-01 is the child of 09-01
01+09 is the child of 01 and 09
01+10 is the child of 01 and 10
09+10 is the child of 09 and 10
01+09-01 is the child of 01 and 09-01
01+09-01-01 is the child of 01 and 09-01-01
09-01+10 is the child of 09-01 and 10
09-01-01+10 is the child of 09-01-01 and 10
Which all seems rather simple until you wrap your head around:
01+09+10 is the child of 01, 09, 10, 01+09, 01+10 AND 09+10!
Keep in mind the reason for all this complex numbering is that many issues cannot be implemented in a vacuum: if you want Atomic Grouping, that's 1 implementation; if you want Shared Constants, that's a different implementation; but if you want BOTH Atomic Grouping and Shared Constants, that is a wholly other implementation because each implementation affects the other. Thus, I end up with a plethora of branches and a nightmare when it comes to merging, which is why I've been so slow in making progress.
Bazaar seems to be very confused when it comes to a merge in 6 parts between, for example, 01, 09, 10, 01+09, 01+10 and 09+10, as above. It gets confused when it sees changes that were already applied in a previous merge applied again. Instead of simply realizing that the change in one branch since the last merge is EXACTLY the same as the change in the other, so that effectively there is nothing to do, Bazaar starts treating code that did NOT change since the last merge as if it had changed, and thus tries to roll back the 01+09+10-specific changes and generates a conflict. Oh, that I could only have a version control system that understood the kind of complex branching that I require! Anyway, that's the state of things; this is me, signing out! |
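The growth in parent branches described above can be sketched with a few lines of Python. The function name `parent_branches` is made up for this illustration, and it only models the simple "+" combinations, not the dashed sub-items such as 09-01.

```python
from itertools import combinations


def parent_branches(items):
    """List every branch built from a proper, non-empty subset of items."""
    parents = []
    for size in range(1, len(items)):
        for combo in combinations(items, size):
            parents.append("+".join(combo))
    return parents


print(parent_branches(["01", "09", "10"]))
# ['01', '09', '10', '01+09', '01+10', '09+10']
```

A branch combining n items has 2**n - 2 such parents, which is why the merge work grows so quickly as items are combined.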
|||
| msg73714 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-24 14:28 | |
Comparing item 2 and item 3, I think that item 3 is the Pythonic choice and item 2 is a bad idea. Item 4: back-references in the pattern are like \1 and (?P=name), not \g<1> or \g<name>, and in the replacement string are like \g<1> and \g<name>, not \1 (or (?P=name)). I'd like to suggest that back-references in the pattern also include \g<1> and \g<name> and \g<-1> for relative back-references. Interestingly, Perl names groups with (?<name>...) whereas Python uses (?P<name>...). A permissible alternative? |
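The current split between pattern syntax and replacement syntax can be seen in a short sketch using the standard re module. The sample text and group name are assumptions chosen for illustration.

```python
import re

# Inside a pattern, an earlier group is referenced as \1 or (?P=name).
numbered = re.compile(r"(\w+) \1")
named = re.compile(r"(?P<word>\w+) (?P=word)")

text = "the the cat"
print(numbered.search(text).group(1))     # the
print(named.search(text).group("word"))   # the

# Inside a replacement string, \g<1> and \g<name> refer to the groups.
print(numbered.sub(r"\g<1>", text))       # the cat
print(named.sub(r"\g<word>", text))       # the cat
```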
|||
| msg73717 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-24 15:09 | |
Thanks for weighing in, Matthew! Yeah, I do get some flak for item 2 because originally item 3 wasn't supposed to cover named groups, but on investigation it made sense that it should. I still prefer 2 overall, but the nice thing about them being separate items is that we can accept 2 or 3 or both or neither. For the most part, development for the first phase of 2 is complete, though there is still IMHO the issue of UNICODE name groups (vis-à-vis item 14) and the name collision problem, which I propose fixing with an Attribute / re.A flag. So, I think it may end up that we could support both: 3 by default and 2 via a flag. Or maybe 3 and 2 both, but with 2 as is, with name collisions hidden (i.e. if you have r'(?P<string>...)' as your capture group, typing m.string will still give you the original comparison string, as per the current Python documentation), and with collision-checking via the Attribute flag, so that r'(?A)(?P<string>...)' would not compile because string is a reserved word.
Your interpretation of 4 matches mine, though, and I would definitely suggest using Perl's \g<-n> notation for relative back-references. Further, I was thinking, if not as part of 4 then as part of the catch-all item 11, of adding support for Perl's (?<name>...) as a synonym for Python's (?P<name>...) and Perl's \k<name> for Python's (?P=name) notation.
The evolution of Perl's named groups is actually interesting. Years ago, Guido had a conversation with Larry Wall about using the (?P...) capture sequence for Python-specific Regular Expression blocks. So Python went ahead and implemented named capture groups. Years later, the Perl folks thought named capture groups were a neat idea and adopted them in the (?<...>...) form because Python had restricted the (?P...) notation to themselves, so they couldn't use ours even if they wanted to. Now, though, with Perl adopting (?<...>...), I think it inevitable that Java and even C++ may see this as the de facto standard. So I 100% agree, we should consider supporting (?<name>...) in the parser.
Oh, and as I suggested in Issue 3825, I have these new item proposals:
Item 18: Add a re.REVERSE, re.R (?r) flag for reversing the direction of the String Evaluation against a given Regular Expression pattern. See issue 516762, as implemented in Issue 3825.
Item 19: Make various in-line flags positionally dependent; for example, (?i) leaves the pattern before it case-sensitive but makes the pattern after it case-insensitive. See Issue 433024, as implemented in Issue 3825.
Item 20: Allow the negation of in-line flags to cancel their effect in conditionally flagged expressions, for example (?-i). See Issue 433027, as implemented in Issue 3825.
Item 21: Allow for scoped flagged expressions, i.e. (?i:...), where the flag(s) is applied to the expression within the parentheses. See Issue 433028, as implemented in Issue 3825.
Item 22: Zero-width regular expression split: when splitting via a regular expression of zero length, this should return an expression equivalent to splitting at each character boundary, with a null string at the beginning and end representing the space before the first and after the last character. See issue 3262.
Item 23: Character class ranges over case-insensitive matches, i.e. does "(?i)[9-A]" contain '_', whose ord is greater than the ord of 'A' and less than the ord of 'a'? See issue 3511.
And I shall create a Bazaar repository for your current development line with the unfortunately unwieldy name of lp:~timehorse/python/issue2636-01+09-02+17+18+19+20+21 as that would, AFAICT, cover all the items you've fixed in your latest patch. Anyway, great work, Matthew, and I look forward to working with you on Regexp 2.7! |
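Two of the proposals above can be tried in later versions of the standard re module, which makes their intent easier to see. This is only a sketch of the behaviour as it exists there, assuming Python 3.7 or newer; it says nothing about how the patches under discussion implement it.

```python
import re

# Item 21: a scoped flag applies only inside its group (Python 3.6+).
scoped = re.compile(r"(?i:foo)bar")
print(bool(scoped.fullmatch("FOObar")))  # True
print(bool(scoped.fullmatch("fooBAR")))  # False

# Item 22: splitting on a zero-width pattern (Python 3.7+) splits at every
# character boundary, with empty strings at both ends.
print(re.split(r"", "abc"))  # ['', 'a', 'b', 'c', '']
```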
|||
| msg73721 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-24 15:48 | |
Regarding item 22: there's also #1647489 ("zero-length match confuses re.finditer()"). This had me stumped for a while, but I might have a solution. I'll see whether it'll fix item 22 too. I wasn't planning on doing any more major changes on my branch, just tweaking and commenting and seeing whether I've missed any tricks in the speed stakes. Half the task is finding out what's achievable, and how! |
|||
| msg73730 - (view) | Author: Georg Brandl (georg.brandl) | Date: 2008-09-24 16:33 | |
Though I can't look at the code at this time, I just want to express how good it feels that you both are doing these great things for regular expressions in Python! Especially atomic grouping is something I've often wished for when writing lexers for Pygments... Keep up the good work! |
|||
| msg73752 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-24 19:45 | |
Good catch on issue 1647489, Matthew; it looks like this is where that bug fix will end up going. But I am unsure if the solution for this issue is going to be the same as for 3262. I think the solution here is to add an internal flag that will keep track of whether the current character had previously participated in a Zero-Width match and thus not allow any subsequent zero-width matches beyond the first, while at the same time not consuming any characters in a Zero-width match. Thus, I have allocated this fix as Item 24, but it may later be merged with 22 if the solutions turn out to be more or less the same, likely via a 22+24 thread. The main difference, though, as I see it, is that the change in 24 may be considered a bug, whereas the general consensus on 22 is that it is more of a feature request. Given Guido's acceptance of a flag-based approach, I suggest we allocate re.ZEROWIDTH, re.Z and (?z) flags to turn on the behaviour you and I expect, but I still think that would be best as a 2.7 / 3.1 solution. I would also like to add a from __future__ import ZeroWidthRegularExpressions or some such to make this the default behaviour, so that by version 3.2 it may indeed be considered the default. Anyway, I've allocated all the new items in the Launchpad repository, so feel free to go to http://www.bazaar-vcs.org/ and install Bazaar for Windows so you can download any of the individual item development threads and try them out for yourself. Also, please consider setting up a free Launchpad account of your very own so that I can perhaps create a group that would allow us to better share development. Thanks again, Matthew, for all your greatly appreciated contributions! |
|||
| msg73766 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-25 00:06 | |
I've moved all the development branches to the ~pythonregexp2.7 team so that we can work collaboratively. You just need to install Bazaar, join www.launchpad.net, upload your public SSH key and then request to be added to the pythonregexp2.7 team. At that point, you can check out any code via: bzr co lp:~pythonregexp2.7/python/issue2636-* This should make co-operative development easier. |
|||
| msg73779 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-25 11:56 | |
Just out of interest, is there any plan to include #1160 while we're at it? |
|||
| msg73780 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-25 11:57 | |
I've enumerated the current list of Item Numbers at the official Launchpad page for this issue: https://launchpad.net/~pythonregexp2.7 There you will find links to each development branch associated with each item, where a broader description of each issue may be found. I will no longer enumerate the entire list here as it has grown too long to keep repeating; please consult that web page for the most up-to-date list of items we will try to tackle in the Python Regexp 2.7 update. Also, anyone wanting to join the development team who already has a Launchpad account can just go to the Python Regexp 2.7 web site above and request to join. You will need Bazaar to check out, pull or branch code from the repository, which is available at www.bazaar-vcs.org. |
|||
| msg73782 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-25 12:23 | |
Good catch, Matthew, and if you spot any other outstanding Regular Expression issues feel free to mention them here. I'll give issue 1160 an item number of 25 and think all we need to do here is change SRE_CODE to be typedefed to an unsigned long and change the repeat count constants (which would be easier if we assume item 10: shared constants). |
|||
| msg73791 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-25 13:43 | |
For reference, these are all the regex-related issues that I've found (including this one!):
id : activity : title
#2636 : 25/09/08 : Regexp 2.7 (modifications to current re 2.2.2)
#1160 : 25/09/08 : Medium size regexp crashes python
#1647489 : 24/09/08 : zero-length match confuses re.finditer()
#3511 : 24/09/08 : Incorrect charset range handling with ignore case flag?
#3825 : 24/09/08 : Major reworking of Python 2.5.2 re module
#433028 : 24/09/08 : SRE: (?flag:...) is not supported
#433027 : 24/09/08 : SRE: (?-flag) is not supported.
#433024 : 24/09/08 : SRE: (?flag) isn't properly scoped
#3262 : 22/09/08 : re.split doesn't split with zero-width regex
#3299 : 17/09/08 : invalid object destruction in re.finditer()
#3665 : 24/08/08 : Support \u and \U escapes in regexes
#3482 : 15/08/08 : re.split, re.sub and re.subn should support flags
#1519638 : 11/07/08 : Unmatched Group issue - workaround
#1662581 : 09/07/08 : the re module can perform poorly: O(2**n) versus O(n**2)
#3255 : 02/07/08 : [proposal] alternative for re.sub
#2650 : 28/06/08 : re.escape should not escape underscore
#433030 : 17/06/08 : SRE: Atomic Grouping (?>...) is not supported
#1721518 : 24/04/08 : Small case which hangs
#1693050 : 24/04/08 : \w not helpful for non-Roman scripts
#2537 : 24/04/08 : re.compile(r'((x|y+)*)*') should fail
#1633953 : 23/02/08 : re.compile("(.*$){1,4}", re.MULTILINE) fails
#1282 : 06/01/08 : re module needs to support bytes / memoryview well
#814253 : 11/09/07 : Grouprefs in lookbehind assertions
#214033 : 10/09/07 : re incompatibility in sre
#1708652 : 01/05/07 : Exact matching
#694374 : 28/06/03 : Recursive regular expressions
#433029 : 14/06/01 : SRE: posix classes aren't supported |
|||
| msg73794 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-25 14:17 | |
Hmmm. Well, some of those are already covered:
#2636 : self
#1160 : Item 25
#1647489 : Item 24
#3511 : Item 23
#3825 : Item 9-2
#433028 : Item 21
#433027 : Item 20
#433024 : Item 19
#3262 : Item 22
#3299 : TBD
#3665 : TBD
#3482 : TBD
#1519638 : TBD
#1662581 : TBD
#3255 : TBD
#2650 : TBD
#433030 : Item 1
#1721518 : TBD
#1693050 : TBD
#2537 : TBD
#1633953 : TBD
#1282 : TBD
#814253 : TBD (but I think you implemented this, didn't you Matthew?)
#214033 : TBD
#1708652 : TBD
#694374 : TBD
#433029 : Item 8
I'll have to get nosy and go over the rest of these to see if any of them have already been solved, like the duplicate test case issue from a while ago, but someone forgot to close them. I'm thinking specifically of the '\u' escape sequence one. |
|||
| msg73798 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-25 15:57 | |
#814253 is part of the fix for variable-width lookbehind. BTW, I've just tried a second time to register with Launchpad, but still no reply. :-( |
|||
| msg73801 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-25 16:32 | |
Yes, I see in your rc2+2 diff it was added into that. I will have to allocate a new number for that fix though, as technically it's a different feature than variable-length look-behind. For now I'm having a hard time merging your diffs in with my code base. Lots and lots of conflicts, alas. BTW, what UID did you try to register under at Launchpad? Maybe I can see if it's registered but just forgetting to send you e-mail. |
|||
| msg73803 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-25 17:01 | |
Tried bazaar@mrabarnett.plus.com twice, no reply. Succeeded with mrabarnett@freeuk.com. |
|||
| msg73805 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-25 17:36 | |
Thanks, Matthew. You are now part of the pythonregexp2.7 team. I want to handle integrating Branch 01+09-02+17 myself for now, and the other branches will need to be renamed because I need to add Item 26: Capture Groups in Look-Behind expressions, which would mean the order of your patches is:
01+09-02+17: regex_2.6rc2.diff regex_2.6rc2+1.diff
01+09-02+17+26: regex_2.6rc2+2.diff
01+09-02+17+18+26: regex_2.6rc2+3.diff regex_2.6rc2+4.diff
01+09-02+17+18+19+20+21+26: regex_2.6rc2+5 regex_2.6rc2+6
It is my intention, therefore, to check a version of each of these patches in to their corresponding repository, sequentially, starting with 0, which is what I am working on now. I am worried about a straight copy to each thread though, as there are some basic cleanups provided through the core issue2636 patch, the item 1 patch and the item 9 patch. The best way to see what these changes are is to download http://bugs.python.org/file10645/issue2636-patches.tar.bz2 and look at the issue2636-01+09.patch file or to type the following into Bazaar:
bzr diff --old lp:~pythonregexp2.7/python/base --new lp:~pythonregexp2.7/python/issue2636+01+09
Which is more up-to-date than my June patches -- I really need to regenerate those! |
|||
| msg73827 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-25 23:59 | |
I've been completely unable to get Bazaar to work with Launchpad: authentication errors and bzrlib.errors.TooManyConcurrentRequests. |
|||
| msg73848 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-26 13:11 | |
Matthew, did you upload a public SSH key to your Launchpad account? You're on MS Windows, right? I can try and do an install on an MS Windows XP box or 2 I have lying around and see how that works, but we should try and solve this vexing thing I've noticed about Windows development, which is that Windows cannot understand Unix-style file permissions, and so when I check out Python on Windows and then check it back in, I've noticed that EVERY Python and C file is "changed" by virtue of its permissions having changed. I would hope there's some way to tell Bazaar to ignore 'permissions' changes because I know our edits really have nothing to do with that. Anyway, I'll try a few things vis-à-vis Windows to see if I get a similar problem; there's also the https://answers.launchpad.net/bazaar forum where you can post your Bazaar issues and see if the community can help. Search previous questions or click the "Ask a question" button and type your subject. Launchpad's UI is even smart enough to scan your question title for similar ones, so you may be able to find a solution right away that way. I use the Launchpad Answers section all the time and have found it usually is a great way of getting help. |
|||
| msg73853 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-26 15:16 | |
I have it working finally! |
|||
| msg73854 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-26 15:43 | |
Great, Matthew!! Now, I'm still in the process of setting up branches related to your work; generally they should be created from a core and the set of features implemented. For example, to get from Version 2 to Version 3 of your Engine, I had to first check out lp:~pythonregexp2.7/python/issue2636-01+09-02+17 and then "push" it back onto Launchpad as lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26. This way the check-in logs become coherent. So, please hold off on checking your code in until I have your current patch-set checked in, which I should finish by today; I also need to rename some of the projects based on the fact that you also implemented item 26 in most of your patches. Actually, I keep a general To-Do list of what I am up to on the https://code.launchpad.net/~pythonregexp2.7/python/issue2636 whiteboard, which you can also edit, if you want to see what I'm up to. But I'll try to have that list complete by today, fingers crossed! In the meantime, would you mind seeing if you are getting the file permissions issue by doing a checkout or pull or branch and then calling "bzr stat" to see if this caused Bazaar to add your entire project for check-in because the permissions changed? Thanks and congratulations again! |
|||
| msg73855 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-26 16:00 | |
I did a search on the permissions problem: https://answers.launchpad.net/bzr/+question/34332. |
|||
| msg73861 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-26 16:28 | |
Thanks, Matthew. My reading of that Answer is that you should be okay because you, I assume, installed the Windows-Native package rather than the cygwin that I first tested. I think the problem is specific to Cygwin as well as the circumstances described in the article. Still, it should be quite easy to verify if you just check out python and then do a stat, as this will show all files whose permissions have changed as well as general changes. Unfortunately, I am still working on setting up those branches, but once I finish documenting each of the branches, I should proceed more rapidly. |
|||
| msg73875 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-26 18:04 | |
Phew! Okay, all your patches have been applied as I said in a previous message, and you should now be able to check out lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26 where you can then apply your latest known patch (rc2+7) to add a fix for the findall / finditer bug. However, please review my changes to:
a) lp:~pythonregexp2.7/python/issue2636-01+09-02+17
b) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+26
c) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+26
d) lp:~pythonregexp2.7/python/issue2636-01+09-02+17+18+19+20+21+26
to make sure my merges are what your code snapshots should be. I did get one conflict with patch 5 IIRC, where a reverse attribute was added to the SRE_STATE struct. I also get a weird grouping error when running the tests for (a) and (b), which I think is a typo; a compile error regarding the aforementioned missing reverse attribute from patch 3 or 4 in (c); and the SRE_FLAG_REVERSE seems to have been lost in (d) for some reason. Also, if you feel like tackling any other issues, whether they have numbers or not, and implementing them in your current development line, please let me know so I can get all the documentation and development branches set up. Thanks and good luck! |
|||
| msg73955 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-28 02:51 | |
I haven't yet found out how to turn on compression when getting the branches, so I've only looked at lp:~pythonregexp2.7/python/issue2636+01+09-02+17+18+19+20+21+24+26. I did see that the SRE_FLAG_REVERSE flag was missing. BTW, I ran re.findall(r"(?m)^(.*re\..+\\m)$", text) where text was 67MB of emails. Python v2.5.2 took 2.4secs and the new version 5.6secs. Ouch! I added 4 lines to _sre.c and tried again. 1.1secs. Nice! :-) |
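A measurement like the one above can be reproduced with a small timing harness. The sketch below uses the same pattern, but the sample text is synthetic and the helper name `time_findall` is made up for illustration, so the absolute timings will not match the figures quoted in the message.

```python
import re
import time

PATTERN = re.compile(r"(?m)^(.*re\..+\\m)$")


def time_findall(text, repeats=3):
    """Return (best_seconds, matches) for PATTERN.findall over text."""
    best = float("inf")
    matches = []
    for _ in range(repeats):
        start = time.perf_counter()
        matches = PATTERN.findall(text)
        best = min(best, time.perf_counter() - start)
    return best, matches


# Synthetic stand-in for the e-mail corpus: one matching line, one not.
sample = "see re.sub and \\m\nno match here\n" * 1000
elapsed, found = time_findall(sample)
print(f"{len(found)} matches, best of 3 runs: {elapsed:.4f}s")
```

Taking the best of several runs reduces noise from other processes when comparing two builds of the engine.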
|||
| msg74025 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-29 11:47 | |
Good work, Matthew. Now, another Bazaar hint: IMHO, one of my favourite commands is switch. I generally develop all in one directory, rather than getting a new directory for each branch. One does have to be VERY careful to type "bzr info" to make sure the branch you're editing is the one you think it is! But with "bzr switch", you do a differential branch switch that allows you to change your development branch quickly and painlessly. This assumes you did a "bzr checkout" and not a "bzr pull". If you did a pull, you can still turn this into a "checkout", where all VCS actions are mirrored on the server, by using the 'bind' command. Make sure you push your branch first. You don't need to worry about all this "bind"ing, "push"ing and "pull"ing if you choose checkout, but OTOH, if your connection is overall very slow, you may still be better off with a "pull"ed branch rather than a "checkout"ed one. Anyway, good catch on those 4 lines and I'll see if I can get your earlier branches up to date. |
|||
| msg74026 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2008-09-29 12:36 | |
Matthew, I've traced down the patch failures in my merges and now each of the 4 versions of code on Launchpad should compile, though the first 2 do not pass all the negative look-behind tests, though your later 2 do. Any chance you could back-port that fix to the lp:~pythonregexp2.7/python/issue2636-01+09-02+17 branch? If you can, I can propagate that fix to the higher levels pretty quickly. |
|||
| msg74058 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-30 00:45 | |
issue2636-01+09-02+17_backport.diff is the backport fix. Still unable to compress the download, so that's >200MB each time! |
|||
| msg74104 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-09-30 23:42 | |
The explanation of the zero-width bug is incorrect. What happens is this: The functions for finditer(), findall(), etc, perform searches and want the next one to continue from where the previous match ended. However, if the match was actually zero-width then that would've made it search from where the previous search _started_, and it would be stuck forever. Therefore, after a zero-width match the caller of the search consumes a character. Unfortunately, that can result in a character being 'missed'. The bug in re.split() is also the result of an incorrect fix to this zero-width problem. I suggest that the regex code should include the fix for the zero-width split bug; we can have code to turn it off unless a re.ZEROWIDTH flag is present, if that's the decision. The patch issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff includes some speedups. |
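The "consume a character after a zero-width match" strategy described above can be sketched in pure Python. The generator name `iter_matches` is hypothetical; the real logic lives in the C code of _sre.c, so this is only an illustration of the idea, not the module's implementation.

```python
import re


def iter_matches(pattern, text):
    """Yield successive matches, stepping one character past empty matches."""
    pos = 0
    while pos <= len(text):
        m = pattern.search(text, pos)
        if m is None:
            break
        yield m
        if m.end() == m.start():
            # Restarting at m.end() would find the same empty match forever,
            # so the caller consumes one character instead.
            pos = m.end() + 1
        else:
            pos = m.end()


spans = [m.span() for m in iter_matches(re.compile(r"x*"), "axb")]
print(spans)  # [(0, 0), (1, 2), (2, 2), (3, 3)]
```

The forced one-character step after an empty match is the point at which a character can be passed over without the pattern being tried against it.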
|||
| msg74174 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-10-02 16:48 | |
I've found an interesting difference between Python and Perl regular
expressions:
In Python:
\Z matches at the end of the string
In Perl:
\Z matches at the end of the string or before a newline at the
end of the string
\z matches at the end of the string
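On the Python side, the difference is easy to check: `\Z` matches only at the very end of the string, while `$` (without re.MULTILINE) also matches just before a trailing newline. A small sketch with an assumed sample string:

```python
import re

text = "line\n"

# \Z matches only at the very end, so a trailing newline prevents the match.
print(re.search(r"line\Z", text))            # None

# $ also matches just before a trailing newline.
print(re.search(r"line$", text).span())      # (0, 4)

# Including the newline explicitly lets \Z match.
print(re.search(r"line\n\Z", text).span())   # (0, 5)
```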
|
|||
| msg74203 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-10-02 22:49 | |
Perl v5.10 offers the ability to have duplicate capture group numbers in
branches. For example:
(?|(a)|(b))
would number both of the capture groups as group 1.
Something to include?
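For contrast, here is how the standard re module numbers the groups in a plain alternation, where each branch gets its own group number. This is a sketch of existing behaviour only; the `(?|...)` syntax itself is not used because the stock module does not accept it.

```python
import re

pattern = re.compile(r"(a)|(b)")
m = pattern.match("b")

# Each branch has its own group number, so only group 2 participated.
print(m.groups())    # (None, 'b')
print(m.lastindex)   # 2

# Without a branch reset, the caller has to pick the group that matched.
print(m.group(m.lastindex))  # b
```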
|
|||
| msg74204 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-10-02 22:51 | |
I've extended the group referencing. It now has:
Forward group references
(\2two|(one))+
\g-type group references
(n is name or number)
\g<n> (Python re replacement string)
\g{n} (Perl)
\g'n' (Perl)
\g"n" (because ' and " are interchangeable)
\gn (n is single digit) (Perl)
(n is number)
\g<+n>
\g<-n>
\g{+n} (Perl)
\g{-n} (Perl)
\k-type group references
(n is group name)
\k<n> (Perl)
\k{n} (Perl)
\k'n' (Perl)
\k"n" (because ' and " are interchangeable)
|
|||
| msg74904 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2008-10-17 12:28 | |
Further to msg74203, I can see no reason why we can't allow duplicate capture group names if the groups are on different branches and are thus mutually exclusive. For example: (?P<name>a)|(?P<name>b) Apart from this I think that duplicate names should continue to raise an exception. |
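The current behaviour that this proposal would relax can be checked with the standard re module, which rejects a repeated group name even across alternation branches. The helper `compiles` is a made-up name for this sketch.

```python
import re


def compiles(pattern):
    """Return True if the standard re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# The same name on two mutually exclusive branches is rejected.
print(compiles(r"(?P<name>a)|(?P<name>b)"))      # False

# Distinct names on each branch are accepted.
print(compiles(r"(?P<first>a)|(?P<second>b)"))   # True
```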
|||
| msg80916 - (view) | Author: Alex Willmer (moreati) | Date: 2009-02-01 19:25 | |
I've been trying, and failing to understand the state of play with this bug. The most recent upload is issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff, but I can't seem to apply that to anything. Nearly every hunk fails when I try against 25-maint, 26-maint or trunk. How does one apply this? Do I need to apply mrabarnett's patches from bug 3825? |
|||
| msg81112 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-02-03 23:07 | |
issue2636-features.diff is based on Python 2.6. It includes:
Named Unicode characters eg \N{LATIN CAPITAL LETTER A}
Unicode character properties eg \p{Lu} (uppercase letter) and \P{Lu} (not uppercase letter)
Other character properties not restricted to Unicode eg \p{Alnum} and \P{Alnum}
Issue #3511 : Incorrect charset range handling with ignore case flag?
Issue #3665 : Support \u and \U escapes in regexes
Issue #1519638 Unmatched Group issue - workaround
Issue #1693050 \w not helpful for non-Roman scripts
The next 2 seemed a good idea at the time. :-)
Octal escape \onnn
Extended hex escape \x{n} |
|||
| msg81236 - (view) | Author: Robert Xiao (nneonneo) | Date: 2009-02-05 23:13 | |
I'm glad to see that the unmatched group issue is finally being addressed. Thanks! |
|||
| msg81238 - (view) | Author: Russ Cox (rsc) | Date: 2009-02-05 23:52 | |
> Named Unicode characters eg \N{LATIN CAPITAL LETTER A}
These descriptions are not as stable as, say, Unicode code
point values or language names. Are you sure it is a good idea
to depend on them not being adjusted in the future?
It's certainly nice and self-documenting, but it doesn't seem
better from a future-proofing point of view than \u0041.
Do other languages implement this?
Russ
|
|||
| msg81239 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-02-06 00:03 | |
Python 2.6 does (and probably Python 3.x, although I haven't checked):
>>> u"\N{LATIN CAPITAL LETTER A}"
u'A'
If it's good enough for Python's Unicode string literals then it's good
enough for Python's re module. :-)
|
|||
| msg81240 - (view) | Author: Robert Xiao (nneonneo) | Date: 2009-02-06 00:06 | |
In fact, it works for Python 2.4, 2.5, 2.6 and 3.0 from my rather
limited testing.
In Python 2.4:
>>> u"\N{LATIN CAPITAL LETTER A}"
u'A'
>>> u"\N{MUSICAL SYMBOL DOUBLE SHARP}"
u'\U0001d12a'
In Python 3.0:
>>> "\N{LATIN CAPITAL LETTER A}"
'A'
>>> ord("\N{MUSICAL SYMBOL DOUBLE SHARP}")
119082
|
|||
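A small sketch of the same checks written for Python 3, assuming only the standard unicodedata module; it shows that the \N{...} escape in a string literal and a lookup by name agree with each other. The variable names are illustrative.

```python
import unicodedata

# The \N{...} escape is resolved when the string literal is compiled.
letter_a = "\N{LATIN CAPITAL LETTER A}"
double_sharp = "\N{MUSICAL SYMBOL DOUBLE SHARP}"

# The same names can be resolved at run time, and reversed.
looked_up = unicodedata.lookup("LATIN CAPITAL LETTER A")
name_of_a = unicodedata.name("A")
```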
| msg81359 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-02-08 00:39 | |
issue2636-features-2.diff is based on Python 2.6. Bugfix. No new features. |
|||
| msg81473 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-02-09 19:09 | |
Besides the fact that this is probably great work, I really wonder who will have enough time and skills to review such a huge patch... :-S In any case, some recommendations:
- please provide patches against trunk; there is no way such big changes will get committed against 2.6, which is in maintenance mode
- avoid, as far as possible, doing changes in style, whitespace or indentation; this will make the patch slightly smaller or cleaner
- avoid C++-style comments (use /* ... */ instead)
- don't hesitate to add extensive comments and documentation about what you've added
Once you think your patch is ready, you may post it to http://codereview.appspot.com/, in the hope that it makes reviewing easier.
|
|||
| msg81475 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-02-09 19:17 | |
One thing I forgot: - please don't make lines longer than 80 characters :-) Once the code has settled down, it would also be interesting to know if performance has changed compared to the previous implementation. |
|||
| msg82673 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-02-24 19:28 | |
issue2636-features-3.diff is based on the 2.x trunk.
Added comments.
Restricted line lengths to no more than 80 characters.
Added common POSIX character classes like [[:alpha:]].
Added further checks to reduce unnecessary backtracking.
I've decided to remove \onnn and \x{n} because they aren't supported elsewhere in the language.
|
|||
| msg82739 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-02-26 01:22 | |
issue2636-features-4.diff includes:
Bugfixes
msg74203: duplicate capture group numbers
msg74904: duplicate capture group names
|
|||
| msg82950 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-03-01 01:42 | |
issue2636-features-5.diff includes:
Bugfixes
Added \G anchor (from Perl).
\G is the anchor at the start of a search, so re.search(r'\G(\w)') is the same as re.match(r'(\w)').
re.findall normally performs a series of searches, each starting where the previous one finished, but if the pattern starts with \G then it's like a series of matches:
>>> re.findall(r'\w', 'abc def')
['a', 'b', 'c', 'd', 'e', 'f']
>>> re.findall(r'\G\w', 'abc def')
['a', 'b', 'c']
Notice how it failed to match at the space, so no more results.
|
|||
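The "series of matches" behaviour described for \G can be approximated without the patch. The following is a rough sketch, assuming the standard library re module on Python 3, which has no \G anchor; findall_anchored is a hypothetical helper, not part of any attached diff.

```python
import re


def findall_anchored(pattern, text):
    """Collect consecutive matches, each anchored where the last one ended.

    Stops at the first position where the pattern fails to match, which
    is what a leading \\G is described as doing in the message above.
    """
    rx = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(text):
        m = rx.match(text, pos)  # match() anchors at pos, unlike search()
        if m is None:
            break
        results.append(m.group(0))
        # Step past an empty match so the loop always makes progress.
        pos = m.end() if m.end() > pos else pos + 1
    return results
```

With the pattern r'\w' and the text 'abc def', the loop stops at the space, mirroring the \G example in the message.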
| msg83271 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-03-07 02:47 | |
issue2636-features-6.diff includes:
Bugfixes
Added group access via subscripting.
>>> m = re.search("(\D*)(?<number>\d+)(\D*)", "abc123def")
>>> len(m)
4
>>> m[0]
'abc123def'
>>> m[1]
'abc'
>>> m[2]
'123'
>>> m[3]
'def'
>>> m[1 : 4]
('abc', '123', 'def')
>>> m[ : ]
('abc123def', 'abc', '123', 'def')
>>> m["number"]
'123'
|
|||
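For comparison, a sketch of what the standard library offers, assuming Python 3.6 or later, where match objects accept an integer or a group name as a subscript. Slicing and len() as shown in the message are not assumed here; group() and groups() are used for those cases. Note the stdlib spelling (?P<number>...) for the named group.

```python
import re

m = re.search(r"(\D*)(?P<number>\d+)(\D*)", "abc123def")

whole = m[0]          # same as m.group(0)
first = m[1]          # same as m.group(1)
named = m["number"]   # same as m.group("number")

# Stand-in for the m[:] slice shown above: group 0 plus every subgroup.
all_groups = (m.group(0),) + m.groups()
```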
| msg83277 - (view) | Author: Martin v. Löwis (loewis) | Date: 2009-03-07 11:27 | |
I don't think it will be possible to accept these patches in the current form and way in which they are presented. I randomly picked issue2636-features-2.diff, and see that it contains lots of style and formatting changes, which is completely taboo for this kind of contribution. I propose to split up the patches into separate tracker issues, one issue per proposed new feature. No need to migrate all changes to new issues - start with the one single change that you think is already complete, and acceptance is likely without debate. Leave a note in this issue what change has been moved to what issue. For each such new issue, describe what precisely the patch is supposed to do. Make sure it is complete with respect to this specific change, and remove any code not contributing to the change. Also procedurally, it is not quite clear to me who is contributing these changes: Jeffrey C. Jacobs, or Matthew Barnett. We will need copyright forms from the original contributor. |
|||
| msg83390 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2009-03-09 15:15 | |
Martin and Matthew, I've been far too busy in the new year to keep up with all your updates to this issue, but since Martin wanted some clarification on direction and copyright, Matthew and I are co-developers, but there is clear delineation between each of our work where the patches uploaded by Matthew (mrabarnett) were uploaded by him and totally a product of his work.
The ones uploaded by me are more complicated, as I have always intended this to be a piecemeal project, not one patch fixes all, which is why I created the Bazaar repository hierarchy (https://launchpad.net/~pythonregexp2.7) with 36 or so branches of mostly independent development at various stages of completion. Here is where the copyrights get more complicated, but not much so. As I said, there are branches where multiple issues are combined (with the plus operator (+)). In general, I consider primary development the single-number branch and only create combined branches where I feel there may be a cross-dependency between one branch and the other. Working this way is VERY time consuming: one spends more time merging branches than actually developing. Matthew, on the other hand, has worked fairly linearly so his branches generally have long number trains to indicate all the issues solved in each. What's more, the last time I updated the repository was last summer so all of Matthew's latest patches have not been catalogued and documented. But, what is there that is more or less 100% copyright and thanks to Matthew's diligent work always contains his first contribution, the new RegExp engine, thread 09-02. So, any items which contain ...+09-02+... are pretty much Matthew's work and the rest are mine.
All that said, I personally like having all this development in one place, but also like having the separate branch development model I've set up in Bazaar. If new issues are created from this one, I would thus hope they would still follow the outline specified on the Launchpad page. I prefer keeping everything in one issue though as IMHO it makes things easier to keep track of.
As for the stuff I've worked on, I first should forewarn that there is a root patch at (https://code.launchpad.net/~pythonregexp2.7/python/issue2636) and as issue2636.patch in the tar.bz2 patch library I posted last June. This patch contains various code cleanups and most notably a realignment of the documentation to follow 72-column rule. I know Python's documentation is supposed to be 80-column, but some of the lines were going out even past that and by making it 72 it allows for incremental expansion before having to reformat any lines. However, textually, the issue2636 version of re.rst is no different than the last version it's based off of, which I verified by generating Sphinx hierarchies for both versions. I therefore suggest this as the only change which is 'massive restructuring' as it does not affect the actual documentation, it just makes it more legible in reStructuredText form. This and other suggested changes in the root issue2636 thread are intended to be applied if at least 1 of the other issues is accepted, and as such is the root branch of every other branch. Understanding that even these small changes may not in fact be acceptable, I have always generated 2 sets of patches for each issue: one diff'ed against the python snapshot stored in base (https://code.launchpad.net/~pythonregexp2.7/python/base) and one that is diff'ed against the issue2636 root so if the changes in issue2636 root are nonetheless unacceptable, they can easily be disregarded.
Now, with respect to work ready for analysis and merging prepared by me, I have 4 threads ready for analysis, with documentation updated and test cases written and passing:
1: Atomic Grouping / Possessive Quantifiers
5: Added a Python-specific RegExp comment group, (?P#...) which supports parenthetical nesting (see the issue for details)
7: Better caching algorithm for the RegExp compiler with more entries in the cache and reduced possibility of thrashing.
12: Clarify the python documentation for RegExp comments; this was only a change in re.rst.
The branches 09-01 and 09-01-01 are engine redesigns that I used to better understand the current RegExp engine but neither is faster than the existing engine so they will probably be abandoned. 10 is also nearly complete and affects the implementation of 01 (whence 01+10) if accepted, but I have not done a final analysis to determine if any other variables can be consolidated to be defined only in one place. Thread 2 is in a near-complete form, but has been snagged by a decision as to what the interface to it should be -- see the discussion above and specifically http://bugs.python.org/msg68336 and http://bugs.python.org/msg68399. The stand-alone patch by me is the latest version and implements the version called (a) in those notes. I prefer to implement (e). I don't think I'd had a chance to do any significant work on any of the other threads and got really bogged down with changing thread 2 as described above, trying to maintain threads for Matthew and just performing all those merges in Bazaar!
So that's the news from me, and nothing new to contribute at this time, but if you want separate, piecemeal solutions, feel free to crack open http://bugs.python.org/file10645/issue2636-patches.tar.bz2 and grab them for at least items 1, 5, 7 and 12. |
|||
| msg83411 - (view) | Author: Martin v. Löwis (loewis) | Date: 2009-03-09 23:09 | |
> I've been far too busy in the new year to keep up with all your updates
> to this issue, but since Martin wanted some clarification on direction
> and copyright,
Thanks for the clarification. So I think we should focus on Matthew's patches first, and come back to yours when you have time to contribute them.
|
|||
| msg83427 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2009-03-10 12:00 | |
Okay, as I said, Atomic Grouping, etc., off a recent 2.6 is already available and I can do any cleanups requested to those already mentioned, I just don't want to start any new items at the moment. As it is, we are still over a year from any of this seeing the light of day as it's not going to be merged until we start 2.7 / 3.1 alpha.
Fortunately, I think Matthew here DOES have a lot of potential to have everything wrapped up by then, but I think to summarize everyone's concern, we really would like to be able to examine each change incrementally, rather than as a whole. So, for the purposes of this, I would recommend that you, Matthew, make a version of your new engine WITHOUT any Atomic Group, variable length look behind / ahead assertions, reverse string scanning, positional, negated or scoped inline flags, group key indexing or any other feature described in the various issues, and that we then evaluate purely on the merits of the engine itself whether it is worth moving to that engine, and having made that decision officially move all work to that design if warranted. Personally, I'd like to see that 'pure' engine for myself and maybe we can all develop an appropriate benchmark suite to test it fairly against the existing engine.
I also think we should consider things like presentation (are all lines terminated by column 80), number of comments, and general readability. IMHO, the current code is conformant in the line length, but VERY deficient WRT comments and readability, the latter of which it sacrifices for speed (as well as being retrofitted for iteration rather than recursion). I'm no fan of switch-case, but I found that by turning the various case statements into bite-sized functions and adding many, MANY comments, the code became MUCH more readable at the minor cost of speed. As I think speed trumps readability (though not blindly), I abandoned my work on the engines, but do feel that if we are going to keep the old engine, I should try and adapt my comments to the old framework to make the current code a bit easier to understand since the framework is more or less the same code as in the existing engine, just re-arranged.
I think all of the things you've added to your engine, Matthew, can, with varying levels of difficulty be implemented in the existing Regexp Engine, though I'm not suggesting that we start that effort. Simply, let's evaluate fairly whether your engine is worth the switch over. Personally, I think the engine has some potential -- though not much better than current WRT readability -- but we've only heard anecdotal evidence of its superior speed. Even if the engine isn't faster, developing speed benchmarks that fairly gauge any potential new engine would be handy for the next person to have a great idea for a rewrite, so perhaps while you peruse the stripped down version of your engine, the rest of us can work on modifying regex_tests.py, test_re.py and re_tests.py in Lib/test specifically for the purpose of benchmarking.
If we can focus on just these two issues ('pure' engine and fair benchmarks) I think I can devote some time to the latter as I've dealt a lot with benchmarking (WRT the compiler-cache) and test cases and hope to be a bit more active here. |
|||
| msg83428 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-03-10 12:08 | |
> Okay, as I said, Atomic Grouping, etc., off a recent 2.6 is already
> available and I can do any cleanups requested to those already
> mentioned, I just don't want to start any new items at the moment. As
> it is, we are still over a year from any of this seeing the light of day
> as it's not going to be merged until we start 2.7 / 3.1 alpha.
3.1 will actually be released, if all goes well, before July of this year. The first alpha was released a couple of days ago. The goal is to fix most deficiencies of the 3.0 release. See http://www.python.org/dev/peps/pep-0375/ for the planned release schedule.
|
|||
| msg83429 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2009-03-10 12:14 | |
Thanks, Antoine! Then I think for the most part any changes to Regexp will have to wait for 3.2 / 2.7. |
|||
| msg83988 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-03-22 23:33 | |
An additional feature that could be borrowed, though in slightly
modified form, from Perl is case-changing controls in replacement
strings. Roughly the idea is to add these forms to the replacement string:
\g<1> provides capture group 1
\u\g<1> provides capture group 1 with the first character in uppercase
\U\g<1> provides capture group 1 with all the characters in uppercase
\l\g<1> provides capture group 1 with the first character in lowercase
\L\g<1> provides capture group 1 with all the characters in lowercase
In Perl titlecase is achieved by using both \u and \L, and the same
could be done in Python:
\u\L\g<1> provides capture group 1 with the first character in
uppercase after putting all the characters in all lowercase
although internally it would do proper titlecase.
I'm suggesting restricting the action to only the following group. Note
that this is actually syntactically unambiguous.
|
|||
| msg83989 - (view) | Author: Robert Xiao (nneonneo) | Date: 2009-03-23 00:08 | |
Frankly, I don't really like that idea; I think it muddles up the RE syntax to have such a group-modifying operator, and seems rather unpythonic: the existing way to do this -- use .upper(), .lower() or .title() to format the groups in a match object as necessary -- seems to be much more readable and reasonable in this sense. I think the proposed changes look good, but I agree that the focus should be on breaking up the megapatch into more digestible feature additions, starting from the barebones engine. Until that's done, I doubt *anyone* will want to review it, let alone merge it into the main Python distribution. So, I think we should hold off on any new features until this raft of changes can be properly broken up, reviewed and (hopefully) merged in. |
|||
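The alternative mentioned in the reply above (formatting the groups with string methods) can be sketched with a replacement function. This assumes the standard library re module on Python 3; the function names and sample strings are illustrative.

```python
import re


def upper_group(m):
    """Rough stand-in for the proposed \\U\\g<1>: whole group in uppercase."""
    return m.group(1).upper()


def capitalize_group(m):
    """Rough stand-in for the proposed \\u\\L\\g<1>: first letter upper, rest lower."""
    return m.group(1).capitalize()


shouted = re.sub(r"(\w+)", upper_group, "abc def")
titled = re.sub(r"(\w+)", capitalize_group, "hello WORLD")
```

re.sub calls the function once per match and substitutes its return value, so the case change stays in ordinary Python rather than in the replacement-string syntax.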
| msg83993 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-03-23 01:42 | |
Ah, too Perlish! :-) Another feature request that I've decided not to consider any further is recursive regular expressions. There are other tools available for that kind of thing, and I don't want the re module to go the way of Perl 6's rules; such things belong elsewhere, IMHO. |
|||
| msg84350 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-03-29 00:43 | |
Patch issue2636-patch-1.diff contains a stripped down version of my regex engine and the other changes that are necessary to make it work. |
|||
| msg86004 - (view) | Author: Gregory P. Smith (gregory.p.smith) * | Date: 2009-04-15 23:13 | |
fyi - I can't compile issue2636-patch-1.diff when applied to trunk (2.7) using gcc 4.0.3. many errors. |
|||
| msg86032 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-04-16 14:58 | |
Try issue2636-patch-2.diff. |
|||
| msg89632 - (view) | Author: Akira Kitada (akitada) | Date: 2009-06-23 16:29 | |
Thanks for this great work! Does Regexp 2.7 include Unicode Scripts support? http://www.regular-expressions.info/unicode.html Perl and Ruby support it and it's pretty handy. |
|||
| msg89634 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-06-23 17:01 | |
It includes Unicode character properties, but not the Unicode script identification, because the Python Unicode database contains the former but not the latter. Although they could be added to the re module, IMHO their proper place is in the Unicode database, from which the re module could access them. |
|||
| msg89643 - (view) | Author: Walter Dörwald (doerwalter) * | Date: 2009-06-23 20:52 | |
http://bugs.python.org/6331 is a patch that adds unicode script info to the unicode database. |
|||
| msg90954 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-07-26 19:11 | |
issue2636-20090726.zip is a new implementation of the re engine. It replaces re.py, sre.py, sre_constants.py, sre_parse.py and sre_compile.py with a new re.py and replaces sre_constants.h, sre.h and _sre.c with _re.h and _re.c. The internal engine no longer interprets a form of bytecode but instead follows a linked set of nodes, and it can work breadth-wise as well as depth-first, which makes it perform much better when faced with one of those 'pathological' regexes. It supports scoped flags, variable-length lookbehind, Unicode properties, named characters, atomic groups, possessive quantifiers, and will handle zero-width splits correctly when the ZEROWIDTH flag is set. There are a few more things to add, like allowing indexing for capture groups, and further speed improvements might be possible (at worst it's roughly the same speed as the existing re module). I'll be adding some documentation about how it works and the slight differences in behaviour later. |
|||
| msg90961 - (view) | Author: Georg Brandl (georg.brandl) | Date: 2009-07-26 21:29 | |
Sounds like this is an awesome piece of work! Since the patch is obviously a very large piece and will be hard to review, may I suggest releasing the new engine as a standalone package and spreading the word, so that people can stress-test it? By the time 2.7 is ready to release, if it has had considerable exposure to the public, that will help acceptance greatly. The Unicode script identification might not be hard to add to unicodedata; maybe Martin can do that? |
|||
| msg90985 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-07-27 16:13 | |
issue2636-20090727.zip contains regex.py, _regex.h, _regex.c and also _regex.pyd (for Python 2.6 on Windows). For Windows machines just put regex.py and _regex.pyd into Python's Lib\site-packages folder. I've changed the name so that it won't hide the re module. |
|||
| msg90986 - (view) | Author: Gregory P. Smith (gregory.p.smith) * | Date: 2009-07-27 17:36 | |
Agreed, a standalone release combined with a public announcement about its availability is a must if we want to get any sort of widespread testing. It'd be great if we had a fully characterized set of tests for the behavior of the existing engine... but we don't. So widespread testing is important. |
|||
| msg90989 - (view) | Author: A.M. Kuchling (akuchling) * | Date: 2009-07-27 17:53 | |
We have lengthy sets of tests in Lib/test/regex_tests.py and Lib/test/test_re.py. While widespread testing of a standalone module would certainly be good, I doubt that will exercise many corner cases and the more esoteric features. Most actual code probably uses relatively few regex pattern constructs. |
|||
| msg91028 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-07-29 00:56 | |
issue2636-20090729.zip contains regex.py, _regex.h, _regex.c which will work with Python 2.5 as well as Python 2.6, and also 2 builds of _regex.pyd (for Python 2.5 and Python 2.6 on Windows). This version supports accessing the capture groups by subscripting the match object, for example:
>>> m = regex.match("(?<foo>.)(?<bar>.)", "abc")
>>> len(m)
3
>>> m[0]
'ab'
>>> m[1 : 3]
['a', 'b']
>>> m["foo"]
'a'
|
|||
| msg91035 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-07-29 11:10 | |
Unfortunately I found a bug in regex.py, caused when I made it compatible with Python 2.5. :-( issue2636-20090729.zip is now corrected. |
|||
| msg91038 - (view) | Author: Ezio Melotti (ezio.melotti) | Date: 2009-07-29 13:01 | |
Apparently Perl has a quite comprehensive set of tests at http://perl5.git.perl.org/perl.git/blob/HEAD:/t/op/re_tests . If we want the engine to be Perl-compatible, it might be a good idea to reuse (part of) their tests (if their license allows it). |
|||
| msg91245 - (view) | Author: John Machin (sjmachin) | Date: 2009-08-03 22:36 | |
Problem is memory leak from repeated calls of e.g.
compiled_pattern.search(some_text). Task Manager performance panel shows
increasing memory usage with regex but not with re. It appears to be
cumulative i.e. changing to another pattern or text doesn't release memory.
Environment: Python 2.6.2, Windows XP SP3, latest (29 July) regex zip file.
Example:
8<-- regex_timer.py
import sys
import time
if sys.platform == 'win32':
    timer = time.clock
else:
    timer = time.time
module = __import__(sys.argv[1])
count = int(sys.argv[2])
pattern = sys.argv[3]
expected = sys.argv[4]
text = 80 * '~' + 'qwerty'
rx = module.compile(pattern)
t0 = timer()
for i in xrange(count):
    assert rx.search(text).group(0) == expected
t1 = timer()
print "%d iterations in %.6f seconds" % (count, t1 - t0)
8<---
Here are the results of running this (plus observed difference between
peak memory usage and base memory usage):
dos-prompt>\python26\python regex_timer.py regex 1000000 "~" "~"
1000000 iterations in 3.811500 seconds [60 Mb]
dos-prompt>\python26\python regex_timer.py regex 2000000 "~" "~"
2000000 iterations in 7.581335 seconds [128 Mb]
dos-prompt>\python26\python regex_timer.py re 2000000 "~" "~"
2000000 iterations in 2.549738 seconds [3 Mb]
This happens on a variety of patterns: "w", "wert", "[a-z]+", "[a-z]+t",
...
|
|||
| msg91250 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-08-04 01:30 | |
issue2636-20090804.zip is a new version of the regex module. The memory leak has been fixed. |
|||
| msg91437 - (view) | Author: Vlastimil Brom (vbr) | Date: 2009-08-10 08:54 | |
First, many thanks for this contribution; it's great, that the re module gets updated in that comprehensive way! I'd like to report some issue with the current version (issue2636-20090804.zip). Using an empty string as the search pattern ends up consuming system resources and the function doesn't return anything nor raise an exception or crash (within several minutes I tried). The current re engine simply returns the empty matches on all character boundaries in this case. I use win XPh SP3, the behaviour is the same on python 2.5.4 and 2.6.2: It should be reproducible with the following simple code:
>>> import re
>>> import regex
>>> re.findall("", "abcde")
['', '', '', '', '', '']
>>> regex.findall("", "abcde")
_
regards vbr
|
|||
| msg91439 - (view) | Author: John Machin (sjmachin) | Date: 2009-08-10 10:58 | |
Adding to vbr's report: [2.6.2, Win XP SP3] (1) bug mallocs memory
inside loop (2) also happens to regex.findall with patterns 'a{0,0}' and
'\B' (3) regex.sub('', 'x', 'abcde') has similar problem BUT 'a{0,0}'
and '\B' appear to work OK.
|
|||
| msg91448 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-08-10 14:18 | |
issue2636-20090810.zip should fix the empty-string bug. |
|||
| msg91450 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-08-10 15:04 | |
issue2636-20090810#2.zip has some further improvements and bugfixes. |
|||
| msg91460 - (view) | Author: Vlastimil Brom (vbr) | Date: 2009-08-10 19:27 | |
I'd like to confirm, that the above reported error is fixed in issue2636-20090810#2.zip
While testing the new features a bit, I noticed some irregularity in handling the Unicode Character Properties; I tried randomly some of those mentioned at http://www.regular-expressions.info/unicode.html using the simple findall like above. It seems, that only the short abbreviated forms of the properties are supported, however, the long variants are handled in different ways. Namely, the property names containing whitespace or other non-letter characters cause some probably unexpected exception:
>>> regex.findall(ur"\p{Ll}", u"abcDEF")
[u'a', u'b', u'c']
# works ok
\p{LowercaseLetter} isn't supported, but seems to be handled, as it throws "error: undefined property name" at the end of the traceback.
\p{Lowercase Letter} \p{Lowercase_Letter} \p{Lowercase-Letter} isn't probably expected, the traceback is:
>>> regex.findall(ur"\p{Lowercase_Letter}", u"abcDEF")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "C:\Python25\lib\regex.py", line 194, in findall
    return _compile(pattern, flags).findall(string)
  File "C:\Python25\lib\regex.py", line 386, in _compile
    parsed = _parse_pattern(source, info)
  File "C:\Python25\lib\regex.py", line 465, in _parse_pattern
    branches = [_parse_sequence(source, info)]
  File "C:\Python25\lib\regex.py", line 477, in _parse_sequence
    item = _parse_item(source, info)
  File "C:\Python25\lib\regex.py", line 485, in _parse_item
    element = _parse_element(source, info)
  File "C:\Python25\lib\regex.py", line 610, in _parse_element
    return _parse_escape(source, info, False)
  File "C:\Python25\lib\regex.py", line 844, in _parse_escape
    return _parse_property(source, ch == "p", here, in_set)
  File "C:\Python25\lib\regex.py", line 983, in _parse_property
    if info.local_flags & IGNORECASE and not in_set:
NameError: global name 'info' is not defined
>>>
Of course, arbitrary strings other than property names are handled identically. Python 2.6.2 version behaves the same like 2.5.4.
vbr
|
|||
| msg91462 - (view) | Author: Gregory P. Smith (gregory.p.smith) * | Date: 2009-08-10 22:02 | |
for each of these discrepancies that you're finding, please consider submitting them as patches that add a unittest to the existing test suite. otherwise their behavior guarantees will be lost regardless of if the suite in this issue is adopted. thanks! I'll happily commit any passing re module unittest additions. |
|||
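As a rough illustration of the kind of unittest addition requested above, here is a sketch written against the standard library re module on Python 3. The class and method names are hypothetical, and the expected values describe the stdlib engine, not the regex module under discussion.

```python
import re
import unittest


class EmptyMatchTests(unittest.TestCase):
    """Pin down how findall treats patterns that match the empty string."""

    def test_findall_empty_pattern(self):
        # One empty match at each of the six positions in "abcde".
        self.assertEqual(re.findall("", "abcde"), [""] * 6)

    def test_findall_not_word_boundary(self):
        # \B matches only between two word characters: four positions.
        self.assertEqual(re.findall(r"\B", "abcde"), [""] * 4)
```

In the real test suite these cases would go into Lib/test/test_re.py rather than a standalone file.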
| msg91463 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-08-10 22:42 | |
issue2636-20090810#3.zip adds more Unicode character properties such as "\p{Lowercase_Letter}", and also Unicode script ranges. In addition, the 'findall' method now accepts an 'overlapped' argument for finding overlapped matches. For example:
>>> regex.findall(r"(..)", "abc")
['ab']
>>> regex.findall(r"(..)", "abc", overlapped=True)
['ab', 'bc']
|
|||
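Overlapped results like these can be approximated in the existing engine by wrapping the pattern in a lookahead. A small sketch, assuming the standard library re module on Python 3; this is a common workaround, not how the overlapped argument is implemented.

```python
import re

# Ordinary findall consumes "ab", leaving only "c", so one match.
plain = re.findall(r"(..)", "abc")

# A lookahead consumes nothing, so the scan advances one character at
# a time and the capture group reports every overlapping pair.
overlapped = re.findall(r"(?=(..))", "abc")
```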
| msg91473 - (view) | Author: Vlastimil Brom (vbr) | Date: 2009-08-11 11:15 | |
Sorry for the dumb question, which may also suggest, that I'm unfortunately unable to contribute at this level (with zero knowledge of C and only "working" one for Python): Where can I find the sources for tests etc. and how they are eventually to be submitted? Is some other account needed besides the one for bugs.python.org?
Anyway, the long character properties now work in the latest version issue2636-20090810#3.zip
In the mentioned overview http://www.regular-expressions.info/unicode.html there is a statement for the property names: "You may omit the underscores or use hyphens or spaces instead." While I'm not sure, that it is a good thing to have that many variations, they should probably be handled in the same way. Now, the whitespace (and also non ascii characters) in the property name seem to confuse the parser: these pass silently (don't match anything) and don't throw an exception like "undefined property name". cf.
>>> regex.findall(ur"\p{Dummy Property}", u"abcDEF")
[]
>>> regex.findall(ur"\p{DümmýPrópërtý}", u"abcDEF")
[]
>>> regex.findall(ur"\p{DummyProperty}", u"abcDEF")
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "regex.pyc", line 195, in findall
  File "regex.pyc", line 563, in _compile
  File "regex.pyc", line 642, in _parse_pattern
  File "regex.pyc", line 654, in _parse_sequence
  File "regex.pyc", line 662, in _parse_item
  File "regex.pyc", line 787, in _parse_element
  File "regex.pyc", line 1021, in _parse_escape
  File "regex.pyc", line 1159, in _parse_property
error: undefined property name 'DummyProperty'
>>>
vbr
|
|||
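One way to treat the spelling variants uniformly is to normalise the name before looking it up, and to raise for anything unknown. The sketch below assumes Python 3 and the standard unicodedata module; the alias table is a hypothetical three-entry sample, and none of this is taken from regex.py.

```python
import unicodedata

# Hypothetical alias table covering only three General_Category values.
_CATEGORY_ALIASES = {
    "ll": "Ll", "lowercaseletter": "Ll",
    "lu": "Lu", "uppercaseletter": "Lu",
    "nd": "Nd", "decimalnumber": "Nd",
}


def resolve_category(name):
    """Map a loosely written property name to a General_Category code.

    Spaces, underscores and hyphens are dropped and case is ignored, so
    "Lowercase_Letter", "Lowercase Letter" and "lowercase-letter" agree.
    """
    key = "".join(ch for ch in name if ch not in " _-").lower()
    try:
        return _CATEGORY_ALIASES[key]
    except KeyError:
        raise ValueError("undefined property name %r" % name)


def find_by_category(name, text):
    """Return the characters of text whose category matches the property."""
    category = resolve_category(name)
    return [ch for ch in text if unicodedata.category(ch) == category]
```

With this shape, "Dummy Property" and "DummyProperty" normalise to the same key and fail in the same way instead of one passing silently.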
| msg91474 - (view) | Author: R. David Murray (r.david.murray) * | Date: 2009-08-11 12:59 | |
Take a look at the dev FAQ, linked from http://www.python.org/dev. The tests are in Lib/test in a distribution installed from source, but ideally you would be (anonymously) pulling the trunk from SVN (when it is back) and creating your patches with respect to that code as explained in the FAQ. You would be adding unit test code to Lib/test/test_re.py, though it looks like re_tests.py might be an interesting file to look at as well. As the dev docs say, anyone can contribute, and writing tests is a great way to start, so please don't feel like you aren't qualified to contribute, you are. If you have questions, come to #python-dev on freenode. |
|||
| msg91490 - (view) | Author: John Machin (sjmachin) | Date: 2009-08-12 03:00 | |
What is the expected timing comparison with re? Running the Aug10#3 version on Win XP SP3 with Python 2.6.3, I see regex typically running at only 20% to 50% of the speed of re in ASCII mode, with not-very-atypical tests (find all Python identifiers in a line, failing search for a Python identifier in an 80-byte text). Is the supplied _regex.pyd from some sort of debug or unoptimised build? Here are some results:
dos-prompt>\python26\python -mtimeit -s"import re as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t=' def __init__(self, arg1, arg2):\n'" "r.findall(t)"
100000 loops, best of 3: 5.32 usec per loop
dos-prompt>\python26\python -mtimeit -s"import regex as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t=' def __init__(self, arg1, arg2):\n'" "r.findall(t)"
100000 loops, best of 3: 12.2 usec per loop
dos-prompt>\python26\python -mtimeit -s"import re as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='1234567890'*8" "r.search(t)"
1000000 loops, best of 3: 1.61 usec per loop
dos-prompt>\python26\python -mtimeit -s"import regex as x;r=x.compile(r'[A-Za-z_][A-Za-z0-9_]+');t='1234567890'*8" "r.search(t)"
100000 loops, best of 3: 7.62 usec per loop
Here's the worst case that I've found so far:
dos-prompt>\python26\python -mtimeit -s"import re as x;r=x.compile(r'z{80}');t='z'*79" "r.search(t)"
1000000 loops, best of 3: 1.19 usec per loop
dos-prompt>\python26\python -mtimeit -s"import regex as x;r=x.compile(r'z{80}');t='z'*79" "r.search(t)"
1000 loops, best of 3: 334 usec per loop
See Friedl: "length cognizance". Corresponding figures for match() are 1.11 and 8.5.
|
|||
| msg91495 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2009-08-12 12:04 | |
</lurk> Re: timings Thanks for the info, John. First of all, I really like those tests; could you please submit a patch or other document so that we could combine them into the python test suite? The python test suite, which can be run as part of 'make test' or IIRC there is a way to run JUST the 2 re test suites which I seem to have senior moment'd, includes a built-in timing output over some of the tests, though I don't recall which ones were being timed: standard cases or pathological (rare) ones. Either way, we should include some timings that are of a standard nature in the test suite to make Matthew's and any other developer's work easier. So, John, if you are not familiar with the test suite, I can look into adding the specific cases you've developed into the test suite so we can have a more representative timing of things. Remember, though, that when run as a single instance, at least in the existing engine, the re compiler caches recent compiles, so repeatedly compiling an expression flattens the overhead in a single run to a single compile and lookup, whereas your tests recompile at each test (though I'm not sure what timeit is doing: if it invokes a new instance of python each time, it is recompiling each time; if it is reusing the instance, it is only compiling once). Having not looked at Matthew's regex code recently (nice name, BTW), I don't know if it also contains the compiled expression cache, in which case, adding it in might help timings. Originally, the cache worked by storing ~100 entries and cleared itself when full; I have a modification which increases this to 256 (IIRC) and only removes the 128 oldest to prevent thrashing at the boundary which I think is better if only for a particular pathological case. In any case, don't despair at these numbers, Matthew: you have a lot of time and potentially a lot of ways to make your engine faster by the time 1.7 alpha is coined. 
But also be forewarned, because, knowing what I know about the current re engine and what it is further capable of, I don't think your regex will be replacing re in 1.7 if it isn't at least as fast as the existing engine for some standard set of agreed upon tests, no matter how many features you can add. I have no doubt, with a little extra monkey grease, we could implement all new features in the existing engine. I don't want to have to reinvent the wheel, of course, and if Matthew's engine can pick up some speed everybody wins! So, keep up the good work Matthew, as it's greatly appreciated! Thanks all! Jeffrey. <lurk> |
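The cache policy described in this message (hold up to 256 compiled patterns and, when full, drop only the 128 oldest instead of clearing everything) could be sketched roughly as below. This is an illustration only, not code from re or from the patch: the class name `HalfEvictingCache` and its `get` method are hypothetical, and "oldest" here means oldest by insertion, which is an assumption about the intended policy.

```python
import re


class HalfEvictingCache:
    """Bounded cache that drops the oldest half of its entries when full.

    Hypothetical illustration of the policy described above; not taken
    from re or regex.
    """

    def __init__(self, maxsize=256):
        self.maxsize = maxsize
        self._entries = {}  # dicts keep insertion order (Python 3.7+)

    def get(self, key, factory):
        """Return the cached value for key, building it with factory(key) on a miss."""
        try:
            return self._entries[key]
        except KeyError:
            pass
        if len(self._entries) >= self.maxsize:
            # Evict only the oldest half (by insertion order), so a workload
            # sitting right at the size limit does not recompile everything.
            for old_key in list(self._entries)[: self.maxsize // 2]:
                del self._entries[old_key]
        value = self._entries[key] = factory(key)
        return value

    def __contains__(self, key):
        return key in self._entries

    def __len__(self):
        return len(self._entries)


# Small demonstration with a tiny limit so the eviction is easy to see.
cache = HalfEvictingCache(maxsize=4)
for pattern in ["a", "b", "c", "d", "e"]:
    cache.get(pattern, re.compile)
print(sorted(cache._entries))
```

With a limit of 4, inserting the fifth pattern evicts the two oldest entries and keeps the rest, which is the "no thrashing at the boundary" behaviour the message argues for.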
|||
| msg91496 - (view) | Author: Antoine Pitrou (pitrou) | Date: 2009-08-12 12:16 | |
> Remember, though, that
> when run as a single instance, at least in the existing engine, the re
> compiler caches recent compiles, so repeatedly compiling an expression
> flattens the overhead in a single run to a single compile and lookup,
> whereas your tests recompile at each test
They don't. The pattern is compiled only once. Please take a look at http://docs.python.org/library/timeit.html#command-line-interface
|
|||
| msg91497 - (view) | Author: Jeffrey C. Jacobs (timehorse) | Date: 2009-08-12 12:29 | |
Mea culpa et mes apologies, The '-s' option to John's expressions are indeed executed only once -- they are one-time setup lines. The final quoted expression is what's run multiple times. In other words, improving caching in regex will not help. >sigh< Merci, Antoine! Jeffrey. |
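To make the point about setup code concrete, here is a minimal sketch using the `timeit` module from Python code rather than the command line. It uses the stdlib `re` only (the `regex` module is not assumed to be installed), and the `calls` counter is a hypothetical helper added purely to show how often each part runs; the `globals` argument requires Python 3.5 or later.

```python
import timeit

# Setup is executed once; only the statement is repeated `number` times.
calls = {"setup": 0, "stmt": 0}
timeit.timeit(
    "calls['stmt'] += 1",
    setup="calls['setup'] += 1",
    number=1000,
    globals={"calls": calls},
)
print(calls)

# The same structure as the command-line runs in msg91490: the pattern is
# compiled in the setup, so the timed statement measures matching only.
setup = (
    "import re\n"
    "r = re.compile(r'[A-Za-z_][A-Za-z0-9_]+')\n"
    "t = '    def __init__(self, arg1, arg2):\\n'"
)
loops = 10000
elapsed = timeit.timeit("r.findall(t)", setup=setup, number=loops)
print("%.2f usec per loop" % (elapsed / loops * 1e6))
```

Because compilation happens in the setup, a larger or smarter compile cache would not change these figures.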
|||
| msg91500 - (view) | Author: Collin Winter (collinwinter) * | Date: 2009-08-12 18:01 | |
FYI, Unladen Swallow includes several regex benchmark suites: a port of V8's regex benchmarks (regex_v8); some of the regexes used when tuning the existing sre engine 7-8 years ago (regex_effbot); and a regex_compile benchmark that tests regex compilation time. See http://code.google.com/p/unladen-swallow/wiki/Benchmarks for more details, including how to check out and run the benchmark suite. You'll need to modify your experimental Python build to have "import re" import the proposed regex engine, rather than _sre. The benchmark command would look something like `./perf.py -r -b regex /control/python /experiment/python`, which will run all the regex benchmarks in rigorous mode. I'll be happy to answer any questions you have about our benchmarks. I'd be very interested to see how the proposed regex engine performs on these tests. |
|||
| msg91535 - (view) | Author: Alex Willmer (moreati) | Date: 2009-08-13 21:14 | |
I've made an installable package of Matthew Barnett's patch. It may get this to a wider audience. http://pypi.python.org/pypi/regex Next I'll look at incorporating Andrew Kuchling's suggestion of the re tests from CPython. |
|||
| msg91598 - (view) | Author: Mark Summerfield (mark) | Date: 2009-08-15 07:49 | |
Hi, I've noticed 3 differences between the re and regex engines. I don't know if they are intended or not, but thought it best to mention them. (I used the issue2636-20090810#3.zip version.)
Python 2.6.2 (r262:71600, Apr 20 2009, 09:25:38)
[GCC 4.3.2 20081105 (Red Hat 4.3.2-7)] on linux2
IDLE 2.6.2
>>> import re, regex
>>> ############################################################ 1 of 3
>>> re1= re.compile(r"""
(?!<\w)(?P<name>[-\w]+)=
(?P<quote>(?P<single>')|(?P<double>"))?
(?P<value>(?(single)[^']+?|(?(double)[^"]+?|\S+)))
(?(quote)(?P=quote))
""", re.VERBOSE)
>>> re2= regex.compile(r"""
(?!<\w)(?P<name>[-\w]+)=
(?P<quote>(?P<single>')|(?P<double>"))?
(?P<value>(?(single)[^']+?|(?(double)[^"]+?|\S+)))
(?(quote)(?P=quote))
""", re.VERBOSE)
>>> text = "<table border='1'>"
>>> re1.findall(text)
[('border', "'", "'", '', '1')]
>>> re2.findall(text)
[]
>>> text = "<table border=1>"
>>> re1.findall(text)
[('border', '', '', '', '1>')]
>>> re2.findall(text)
[]
>>> ############################################################ 2 of 3
>>> re1 = re.compile(r"""^[ \t]*
(?P<parenthesis>\()?
[- ]?
(?P<area>\d{3})
(?(parenthesis)\))
[- ]?
(?P<local_a>\d{3})
[- ]?
(?P<local_b>\d{4})
[ \t]*$
""", re.VERBOSE)
>>> re2 = regex.compile(r"""^[ \t]*
(?P<parenthesis>\()?
[- ]?
(?P<area>\d{3})
(?(parenthesis)\))
[- ]?
(?P<local_a>\d{3})
[- ]?
(?P<local_b>\d{4})
[ \t]*$
""", re.VERBOSE)
>>> data = ("179-829-2116", "(187) 160 0880", "(286)-771-3878",
"(291) 835-9634", "353-896-0505", "(555) 555 5555", "(555) 555-5555",
"(555)-555-5555", "555 555 5555", "555 555-5555", "555-555-5555",
"601 805 3142", "(675) 372 3135", "810 329 7071", "(820) 951 3885",
"942 818-5280", "(983)8792282")
>>> for d in data:
    ans1 = re1.findall(d)
    ans2 = re2.findall(d)
    print "re=%s rx=%s %d" % (ans1, ans2, ans1 == ans2)
re=[('', '179', '829', '2116')] rx=[('', '179', '829', '2116')] 1
re=[('(', '187', '160', '0880')] rx=[] 0
re=[('(', '286', '771', '3878')] rx=[('(', '286', '771', '3878')] 1
re=[('(', '291', '835', '9634')] rx=[] 0
re=[('', '353', '896', '0505')] rx=[('', '353', '896', '0505')] 1
re=[('(', '555', '555', '5555')] rx=[] 0
re=[('(', '555', '555', '5555')] rx=[] 0
re=[('(', '555', '555', '5555')] rx=[('(', '555', '555', '5555')] 1
re=[('', '555', '555', '5555')] rx=[] 0
re=[('', '555', '555', '5555')] rx=[] 0
re=[('', '555', '555', '5555')] rx=[('', '555', '555', '5555')] 1
re=[('', '601', '805', '3142')] rx=[] 0
re=[('(', '675', '372', '3135')] rx=[] 0
re=[('', '810', '329', '7071')] rx=[] 0
re=[('(', '820', '951', '3885')] rx=[] 0
re=[('', '942', '818', '5280')] rx=[] 0
re=[('(', '983', '879', '2282')] rx=[('(', '983', '879', '2282')] 1
>>> ############################################################ 3 of 3
>>> re1 = re.compile(r"""
<img\s+[^>]*?src=(?:(?P<quote>["'])(?P<qimage>[^\1>]+?)
(?P=quote)|(?P<uimage>[^"' >]+))[^>]*?>""", re.VERBOSE)
>>> re2 = regex.compile(r"""
<img\s+[^>]*?src=(?:(?P<quote>["'])(?P<qimage>[^\1>]+?)
(?P=quote)|(?P<uimage>[^"' >]+))[^>]*?>""", re.VERBOSE)
>>> data = """<body> <img src='a.png'>
<img alt='picture' src="b.png">
<img alt="picture" src="Big C.png" other="xyx">
<img src=icon.png alt=icon>
<img src="I'm here!.jpg" alt="aren't I?">"""
>>> data = data.split("\n")
>>> data = [x.strip() for x in data]
>>> for d in data:
    ans1 = re1.findall(d)
    ans2 = re2.findall(d)
    print "re=%s rx=%s %d" % (ans1, ans2, ans1 == ans2)
re=[("'", 'a.png', '')] rx=[("'", 'a.png', '')] 1
re=[('"', 'b.png', '')] rx=[('"', 'b.png', '')] 1
re=[('"', 'Big C.png', '')] rx=[('"', 'Big C.png', '')] 1
re=[('', '', 'icon.png')] rx=[('', '', 'icon.png alt=icon')] 0
re=[('"', "I'm here!.jpg", '')] rx=[('"', "I'm here!.jpg", '')] 1
I'm sorry I haven't had the time to try to minimize the examples, but I hope that at least they will prove helpful. Number 3 looks like a problem with non-greedy matching; I don't know about the others.
|
|||
| msg91607 - (view) | Author: John Machin (sjmachin) | Date: 2009-08-15 14:02 | |
Simplification of mark's first two problems:
Problem 1: looks like regex's negative lookahead assertion is broken
>>> re.findall(r'(?!a)\w', 'abracadabra')
['b', 'r', 'c', 'd', 'b', 'r']
>>> regex.findall(r'(?!a)\w', 'abracadabra')
[]
Problem 2: in VERBOSE mode, regex appears to be ignoring spaces inside
character classes
>>> import re, regex
>>> pat = r'(\w)([- ]?)(\w{4})'
>>> for data in ['abbbb', 'a-bbbb', 'a bbbb']:
...     print re.compile(pat).findall(data), regex.compile(pat).findall(data)
...     print re.compile(pat, re.VERBOSE).findall(data), regex.compile(pat, regex.VERBOSE).findall(data)
...
[('a', '', 'bbbb')] [('a', '', 'bbbb')]
[('a', '', 'bbbb')] [('a', '', 'bbbb')]
[('a', '-', 'bbbb')] [('a', '-', 'bbbb')]
[('a', '-', 'bbbb')] [('a', '-', 'bbbb')]
[('a', ' ', 'bbbb')] [('a', ' ', 'bbbb')]
[('a', ' ', 'bbbb')] []
HTH,
John
|
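For reference, the two behaviours that re exhibits in the simplified cases above can be written as a small Python 3 check. This sketch uses only the stdlib `re` (it says nothing about any particular regex build), and the variable names are illustrative.

```python
import re

# Problem 1: a negative lookahead should let through every word character
# that is not "a".
lookahead_hits = re.findall(r"(?!a)\w", "abracadabra")
print(lookahead_hits)

# Problem 2: in VERBOSE mode, whitespace is ignored in the pattern, but a
# space inside a character class such as [- ] is still significant.
pat = r"(\w)([- ]?)(\w{4})"
plain = re.compile(pat)
verbose = re.compile(pat, re.VERBOSE)
for data in ["abbbb", "a-bbbb", "a bbbb"]:
    print(plain.findall(data), verbose.findall(data))
```

A check along these lines could be dropped into a compatibility test that runs the same patterns against both modules.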
|||
| msg91610 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2009-08-15 16:12 | |
issue2636-20090815.zip fixes the bugs found in msg91598 and msg91607. The regex engine currently lacks some of the optimisations that the re engine has, but I've concluded that even with them the extra work that the engine needs to do to make it easy to switch to breadth-wise matching when needed is slowing it down too much (if it's matching only depth-first then it can save only the changes to the 'context', but if it's matching breadth-wise then it needs to duplicate the entire 'context'). I'm therefore seeing whether I can have 2 engines internally, one optimised for depth-first and the other for breadth-wise, and switch from the former to the latter if matching is taking too long. |
|||
| msg91671 - (view) | Author: Alex Willmer (moreati) | Date: 2009-08-17 20:29 | |
Matthew's issue2636-20090815.zip attachment is now on PyPI. This one, having a more complete MANIFEST, will build for people other than me. |
|||
| msg91917 - (view) | Author: Vlastimil Brom (vbr) | Date: 2009-08-24 12:55 | |
I'd like to add some detail to the previous msg91473. The current behaviour of the character properties looks a bit surprising sometimes:
>>> regex.findall(ur"\p{UppercaseLetter}", u"QW\p{UppercaseLetter}as")
[u'Q', u'W', u'U', u'L']
>>> regex.findall(ur"\p{Uppercase Letter}", u"QW\p{Uppercase Letter}as")
[u'\\p{Uppercase Letter}']
>>> regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")
[u'\\p{Uppercase\xc4\xc4\xc4Letter}']
>>> regex.findall(ur"\p{UppercaseQQQLetter}", u"QW\p{UppercaseQQQLetter}as")
Traceback (most recent call last):
File "<pyshell#34>", line 1, in <module>
regex.findall(ur"\p{UppercaseQQQLetter}", u"QW\p{UppercaseQQQLetter}as")
...
File "C:\Python26\lib\regex.py", line 1178, in _parse_property
raise error("undefined property name '%s'" % name)
error: undefined property name 'UppercaseQQQLetter'
>>>
i.e. potential property names consisting only of ASCII letters (plus _ and -) are looked up and either used or an error is raised; other names (containing whitespace or non-ASCII letters) aren't treated as a special expression, hence they either match their literal value or simply don't match (without errors).
Is this the intended behaviour? I am not sure whether it is maybe defined somewhere, or whether there are some de-facto standards for this... I guess the space in the property names might be allowed (unless there are some implications for the parser...); otherwise the fallback handling of invalid property names as normal strings is probably the expected way.
vbr
|
|||
| msg97860 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-01-16 03:00 | |
issue2636-20100116.zip is a new version of the regex module. I've given up on the breadth-wise matching - it was too difficult finding a pattern structure that would work well for both depth-first and breadth-wise. It probably still needs some tweaks and tidying up, but I thought I might as well release something! |
|||
| msg98809 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-04 02:34 | |
issue2636-20100204.zip is a new version of the regex module. I've added splititer and added a build for Python 3.1. |
|||
| msg99072 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-08 23:45 | |
Hi, thanks for the update! Just in the unlikely case that it hasn't been noticed so far: using Python 2.6.4 or 2.5.4 with the regex build issue2636-20100204.zip, I am getting the following easy-to-fix error:
Python 2.6.4 (r264:75708, Oct 26 2009, 08:23:19) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import regex
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "C:\Python26\lib\regex.py", line 2003
print "Header file written at %s\n" % os.path.abspath(header_file.name))
^
SyntaxError: invalid syntax
After removing the extra closing paren in regex.py, line 2003, everything seems ok.
vbr
|
|||
| msg99132 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-09 17:38 | |
I'd like to add another issue I encountered with the latest version of regex - issue2636-20100204.zip. It seems that there is an error in handling some quantifiers in Python 2.5. On
Python 2.5.4 (r254:67916, Dec 23 2008, 15:10:54) [MSC v.1310 32 bit (Intel)] on win32
I get e.g.:
>>> regex.findall(ur"q*", u"qqwe")
Traceback (most recent call last):
File "<pyshell#35>", line 1, in <module>
regex.findall(ur"q*", u"qqwe")
File "C:\Python25\lib\regex.py", line 213, in findall
return _compile(pattern, flags).findall(string, overlapped=overlapped)
File "C:\Python25\lib\regex.py", line 633, in _compile
p = _regex.compile(pattern, info.global_flags | info.local_flags, code, info.group_index, index_group)
RuntimeError: invalid RE code
There is the same error for other possibly "infinite" quantifiers like "q+", "q{0,}" etc., with their non-greedy and possessive variants. On Python 2.6 and 3.1 all these patterns work without errors.
vbr
|
|||
| msg99148 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-10 02:20 | |
issue2636-20100210.zip is a new version of the regex module. The reported bugs appear to be fixed now. |
|||
| msg99186 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-11 01:09 | |
Thanks for the quick update, I confirm the fix for both issues; just another finding (while testing the behaviour mentioned previously - msg91917). The property name normalisation seems to be much more robust now; I just encountered an encoding error using a rather artificial input (in Python 2.5, 2.6):
>>> regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")
Traceback (most recent call last):
File "<pyshell#4>", line 1, in <module>
regex.findall(ur"\p{UppercaseÄÄÄLetter}", u"QW\p{UppercaseÄÄÄLetter}as")
File "C:\Python25\lib\regex.py", line 213, in findall
return _compile(pattern, flags).findall(string, overlapped=overlapped)
File "C:\Python25\lib\regex.py", line 599, in _compile
parsed = _parse_pattern(source, info)
File "C:\Python25\lib\regex.py", line 690, in _parse_pattern
branches = [_parse_sequence(source, info)]
File "C:\Python25\lib\regex.py", line 702, in _parse_sequence
item = _parse_item(source, info)
File "C:\Python25\lib\regex.py", line 710, in _parse_item
element = _parse_element(source, info)
File "C:\Python25\lib\regex.py", line 837, in _parse_element
return _parse_escape(source, info, False)
File "C:\Python25\lib\regex.py", line 1098, in _parse_escape
return _parse_property(source, info, in_set, ch)
File "C:\Python25\lib\regex.py", line 1240, in _parse_property
raise error("undefined property name '%s'" % name)
error: <unprintable error object>
>>>
Not sure how this would be fixed (i.e. whether the error message should be changed to unicode, if applicable). Not surprisingly, in Python 3.1 there is a correct message at the end:
regex.error: undefined property name 'UppercaseÄÄÄLetter'
vbr
|
|||
| msg99190 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-11 02:16 | |
I've been aware for some time that exception messages in Python 2 can't be Unicode, but I wasn't sure which encoding to use, so I've decided to use that of sys.stdout. It appears to work OK in IDLE and at the Python prompt. issue2636-20100211.zip is the new version of the regex module. |
|||
| msg99462 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-17 04:09 | |
issue2636-20100217.zip is a new version of the regex module. It includes a fix for issue #7940. |
|||
| msg99470 - (view) | Author: Alex Willmer (moreati) | Date: 2010-02-17 13:01 | |
I've packaged this latest revision and uploaded to PyPI http://pypi.python.org/pypi/regex |
|||
| msg99479 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-17 19:35 | |
The main text at http://pypi.python.org/pypi/regex appears to have lost its backslashes, for example: The Unicode escapes uxxxx and Uxxxxxxxx are supported. instead of: The Unicode escapes \uxxxx and \Uxxxxxxxx are supported. |
|||
| msg99481 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-17 23:43 | |
I just tested the fix for unicode tracebacks and found some possibly weird results (not sure how/whether it should be fixed, as these inputs are indeed rather artificial...).
(Win XP SP3 Czech, Python 2.6.4)
Using the cmd console, the output is fine (for the characters it can accept and display)
>>> regex.findall(ur"\p{InBasicLatinĚ}", u"aé")
Traceback (most recent call last):
...
File "C:\Python26\lib\regex.py", line 1244, in _parse_property
raise error("undefined property name '%s'" % name)
regex.error: undefined property name 'InBasicLatinĚ'
>>>
(same result for other distorted "property names" containing e.g. ěščřžýáíéúůßäëiöüîô ...)
However, in Idle the output differs depending on the characters present
>>> regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
yields the expected
...
File "C:\Python26\lib\regex.py", line 1244, in _parse_property
raise error("undefined property name '%s'" % name)
error: undefined property name 'InBasicLatinÉ'
but
>>> regex.findall(ur"\p{InBasicLatinĚ}", u"ab c")
Traceback (most recent call last):
...
File "C:\Python26\lib\regex.py", line 1244, in _parse_property
raise error("undefined property name '%s'" % name)
File "C:\Python26\lib\regex.py", line 167, in __init__
message = message.encode(sys.stdout.encoding)
File "C:\Python26\lib\encodings\cp1250.py", line 12, in encode
return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\xcc' in position 37: character maps to <undefined>
>>>
which might be surprising, as cp1250 should be able to encode "Ě", maybe there is some intermediate ascii step?
using the wxpython pyShell I get its specific encoding error:
regex.findall(ur"\p{InBasicLatinÉ}", u"ab c")
Traceback (most recent call last):
...
File "C:\Python26\lib\regex.py", line 1102, in _parse_escape
return _parse_property(source, info, in_set, ch)
File "C:\Python26\lib\regex.py", line 1244, in _parse_property
raise error("undefined property name '%s'" % name)
File "C:\Python26\lib\regex.py", line 167, in __init__
message = message.encode(sys.stdout.encoding)
AttributeError: PseudoFileOut instance has no attribute 'encoding'
(the same for \p{InBasicLatinĚ} etc.)
In python 3.1 in Idle, all of these exceptions are displayed correctly, also in other scripts or with special characters.
Maybe in python 2.x e.g. repr(...) of the unicode error messages could be used in order to avoid these problems, but I don't know, what the conventions are in these cases.
Another issue I found here (unrelated to tracebacks) is backslashes or punctuation (except the handled -_) in the property names, which just lead to failed matches and no exceptions about unknown property names:
regex.findall(u"\p{InBasic.Latin}", u"ab c")
[]
I was also surprised by the added pos/endpos parameters, as I used flags as a non-keyword third parameter for the re functions in my code (probably my fault ...)
re.findall(pattern, string, flags=0)
regex.findall(pattern, string, pos=None, endpos=None, flags=0, overlapped=False)
(is there a specific reason for this order, or could it be changed to maintain compatibility with the current re module?)
I hope, at least some of these remarks make some sense;
thanks for the continued work on this module!
vbr
|
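One way to avoid both failures seen above (the UnicodeEncodeError and the missing `encoding` attribute) is to fall back to ASCII and to use an error handler that never raises. The sketch below is written for Python 3 and is only an illustration of that idea; the helper name `encode_for_stream` and the stand-in stream classes are hypothetical and are not part of regex.py.

```python
import sys


def encode_for_stream(message, stream=None):
    """Return message restricted to characters that stream can represent."""
    if stream is None:
        stream = sys.stdout
    # Replacement streams (GUI shells, test harnesses) may have no
    # `encoding` attribute, or may set it to None; fall back to ASCII.
    encoding = getattr(stream, "encoding", None) or "ascii"
    # "backslashreplace" never raises: unencodable characters become escapes.
    return message.encode(encoding, "backslashreplace").decode(encoding)


class NoEncodingStream:
    """Stand-in for a stream object without an `encoding` attribute."""


class Cp1250Stream:
    """Stand-in for a console using the cp1250 code page."""
    encoding = "cp1250"


message = "undefined property name 'InBasicLatin\u011a'"
print(encode_for_stream(message, NoEncodingStream()))
print(encode_for_stream(message, Cp1250Stream()) == message)
```

The trade-off is that the name may be shown with escapes instead of the original characters, but the exception itself can always be constructed and printed.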
|||
| msg99494 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-18 03:03 | |
issue2636-20100218.zip is a new version of the regex module. I've added '.' to the permitted characters when parsing the name of a property. The name itself is no longer reported in the error message. I've also corrected the positions of the 'pos' and 'endpos' arguments: regex.findall(pattern, string, flags=0, pos=None, endpos=None, overlapped=False) |
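Why the argument order matters for callers that pass flags positionally can be shown with two thin wrappers around the stdlib `re`. Both wrappers are hypothetical and exist only to contrast the two signatures quoted in this thread; they are not the regex module's implementation.

```python
import re


def findall_pos_first(pattern, string, pos=None, endpos=None, flags=0):
    """Hypothetical wrapper with the earlier order: pos/endpos before flags."""
    compiled = re.compile(pattern, flags)
    start = 0 if pos is None else pos
    end = len(string) if endpos is None else endpos
    return compiled.findall(string, start, end)


def findall_flags_first(pattern, string, flags=0, pos=None, endpos=None):
    """Hypothetical wrapper with the re-compatible order: flags third."""
    compiled = re.compile(pattern, flags)
    start = 0 if pos is None else pos
    end = len(string) if endpos is None else endpos
    return compiled.findall(string, start, end)


# Existing code written against re often passes flags as the third positional
# argument. With the earlier order, re.I (integer value 2) is silently taken
# as a start position and the pattern is compiled without any flags.
old_order_result = findall_pos_first(r"[ab]", "aB", re.I)
new_order_result = findall_flags_first(r"[ab]", "aB", re.I)
print(old_order_result, new_order_result)
```

No exception is raised in the first case, which is what makes this kind of incompatibility easy to miss.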
|||
| msg99548 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-19 00:29 | |
Thanks for fixing the argument positions; unfortunately, it seems there might be some other problem that makes my code work differently than with the builtin re; it seems that in character classes the ignorecase flag is ignored somehow:
>>> regex.findall(r"[ab]", "aB", regex.I)
['a']
>>> re.findall(r"[ab]", "aB", re.I)
['a', 'B']
>>>
(The same with the flag set in the pattern.) Outside of the character class the case seems to be handled normally, or am I missing something?
vbr
|
|||
| msg99552 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-19 01:31 | |
issue2636-20100219.zip is a new version of the regex module. The regex module should give the same results as the re module for backwards compatibility. The ignorecase bug is now fixed. This new version releases the GIL when matching on str and bytes (str and unicode in Python 2.x). |
|||
| msg99665 - (view) | Author: Alex Willmer (moreati) | Date: 2010-02-21 14:46 | |
On 17 February 2010 19:35, Matthew Barnett <report@bugs.python.org> wrote:
> The main text at http://pypi.python.org/pypi/regex appears to have lost its backslashes, for example:
>
> The Unicode escapes uxxxx and Uxxxxxxxx are supported.
>
> instead of:
>
> The Unicode escapes \uxxxx and \Uxxxxxxxx are supported.
Matthew, as you no doubt realised, that text is read straight from the Features.txt file. PyPI interprets it as reStructuredText, which uses \ as an escape character in various cases. Do you intentionally write Features.txt as reStructuredText? If so, here is a patch that escapes the \ characters as appropriate; otherwise I'll work out how to make PyPI read it as plain text.
Regards, Alex
--
Alex Willmer <alex@moreati.org.uk>
http://moreati.org.uk/blog
|
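The escaping mentioned here can be illustrated in a couple of lines: in reStructuredText a backslash escapes the character after it, so a literal backslash in running text has to be doubled. The function name below is hypothetical, and this sketch is not the attached patch; it also ignores literal blocks and inline literals, where backslashes need no doubling.

```python
def escape_rst_backslashes(text):
    """Double each backslash so reStructuredText renders it literally."""
    return text.replace("\\", "\\\\")


line = r"The Unicode escapes \uxxxx and \Uxxxxxxxx are supported."
escaped = escape_rst_backslashes(line)
print(escaped)
```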
|||
| msg99668 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-21 16:21 | |
To me the extension .txt means plain text. Is there a specific extension for ReStructuredText, eg .rst? |
|||
| msg99835 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-22 21:24 | |
issue2636-20100222.zip is a new version of the regex module. This new version adds reverse searching. The 'features' now come in ReStructuredText (.rst) and HTML. |
|||
| msg99863 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-22 22:51 | |
Is the issue2636-20100222.zip archive supposed to be complete? I can't find the rst or html "features" in it, nor, more importantly, the py and pyd files for the particular versions. Anyway, I just skimmed through the regular-expressions.info documentation and found that most features which I missed in the builtin re seem to be present in the regex module; a few possibly notable exceptions being some Unicode features: http://www.regular-expressions.info/unicode.html Support for Unicode script properties might be needlessly complex (maybe unless http://bugs.python.org/issue6331 is implemented). On the other hand, \X for matching any single grapheme might be useful; according to the mentioned page, the currently working equivalent would be \P{M}\p{M}* However, I am not sure about the compatibility concerns; it is possible that the modifier characters as a part of graphemes might cause some discrepancies in the text indices etc. A feature where I personally (currently) can't find a use case is \G and continuing matches (but no doubt there would be some cases for this). regards vbr |
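The \P{M}\p{M}* idea (one non-mark character followed by any number of combining marks) can be approximated with the stdlib `unicodedata` module, which also shows why indices and lengths may differ from what a reader expects. This is a rough sketch under that simple definition, not the grapheme algorithm used by any regex engine; the function name is hypothetical.

```python
import unicodedata


def simple_graphemes(text):
    """Group each character with the combining marks that follow it."""
    clusters = []
    for ch in text:
        is_mark = unicodedata.category(ch).startswith("M")
        if is_mark and clusters:
            # Attach the mark to the preceding base character.
            clusters[-1] += ch
        else:
            # A base character starts a new cluster. A mark at the very start
            # of the text also lands here as its own cluster, whereas
            # \P{M}\p{M}* would not match it at all.
            clusters.append(ch)
    return clusters


text = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
clusters = simple_graphemes(text)
print(len(text), len(clusters))
```

The string has three code points but two clusters, which is the kind of index discrepancy raised above.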
|||
| msg99872 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-22 23:28 | |
I don't know what happened there. I didn't notice that the zip file was way too small. Here's a replacement (still called issue2636-20100222.zip). Unicode script properties are already included, at least those whose definitions are at http://www.regular-expressions.info/unicode.html I hadn't noticed \X before. I'll have a look at it. As for \G, .findall performs searches normally, but when using \G it effectively performs contiguous matches only, which can be useful when you need it! |
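The difference between ordinary searching and the contiguous matching that \G gives can be emulated with the stdlib `re`, which has no \G, by anchoring each attempt at the end of the previous match. The helper below is a hypothetical sketch of that effect, not how regex implements \G.

```python
import re


def contiguous_findall(pattern, string):
    """Collect matches that each start exactly where the previous one ended."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while pos <= len(string):
        match = compiled.match(string, pos)  # anchored at pos, like \G
        if match is None or match.end() == pos:
            # Stop at the first gap; also stop on an empty match so the
            # loop cannot spin forever.
            break
        results.append(match.group())  # whole match, groups are ignored
        pos = match.end()
    return results


searched = re.findall(r"\d", "123a45")
contiguous = contiguous_findall(r"\d", "123a45")
print(searched, contiguous)
```

Plain findall skips over the "a" and keeps going, while the contiguous version stops at the first position where the pattern fails, which is useful for tokenising input that must be consumed without gaps.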
|||
| msg99888 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-23 00:39 | |
OK, you've convinced me, \X is supported. :-) issue2636-20100223.zip is a new version of the regex module. |
|||
| msg99890 - (view) | Author: Alex Willmer (moreati) | Date: 2010-02-23 00:47 | |
On 22 Feb 2010, at 21:24, Matthew Barnett <report@bugs.python.org> wrote: > issue2636-20100222.zip is a new version of the regex module. > > This new version adds reverse searching. > > The 'features' now come in ReStructuredText (.rst) and HTML Thank you, Matthew. My laptop is out of action, so it will be a few days before I can upload a new version to PyPI. If you would prefer to have control of the PyPI package, or to share control, please let me know. Alex |
|||
| msg99892 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-23 01:31 | |
Wow, that's what can be called rapid development :-), thanks very much!
I hadn't noticed before that \G had been implemented already.
\X works fine for me, it also maintains the input string indices correctly.
We can use Unicode character properties like \p{Letter} and Unicode block properties like \p{inBasicLatin};
the script properties like \p{Latin} or \p{IsLatin} return "undefined property name".
I guess this would require access to the respective information in unicodedata, where it isn't available now (there also seem to be many more scripts than those mentioned at regular-expressions.info
cf.
http://www.unicode.org/Public/UNIDATA/Scripts.txt
http://www.unicode.org/Public/UNIDATA/PropertyValueAliases.txt (under "# Script (sc)").
vbr
|
|||
| msg100066 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-24 20:25 | |
issue2636-20100224.zip is a new version of the regex module. It includes support for matching based on Unicode scripts as well as on Unicode blocks and properties. |
|||
| msg100076 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-02-24 23:14 | |
Thanks, it's indeed a very nice addition to the library... Just a marginal remark: it seems that the script names also cover some non-BMP characters, whereas the Unicode block ranges cover only the BMP. http://www.unicode.org/Public/UNIDATA/Blocks.txt Am I missing something more complex as to why the 10000.. - ..10FFFF ranges weren't included in _BLOCKS? Maybe building these ranges is expensive, in contrast to rare uses of these properties? (Not that I am able to reliably test it on my "narrow" python build on windows, but currently, obviously, e.g. \p{InGothic} gives "undefined property name" whereas \p{Gothic} is accepted.) vbr |
|||
| msg100080 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-25 00:12 | |
It was more of an oversight. issue2636-20100225.zip now contains the full list of both blocks and scripts. |
|||
| msg100134 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-02-26 03:20 | |
issue2636-20100226.zip is a new version of the regex module. It now supports the branch reset (?|...|...), enabling the different branches of an alternation to reuse group numbers. |
|||
| msg100152 - (view) | Author: Alex Willmer (moreati) | Date: 2010-02-26 14:36 | |
On 26 February 2010 03:20, Matthew Barnett <report@bugs.python.org> wrote: > Added file: http://bugs.python.org/file16375/issue2636-20100226.zip This is now uploaded to PyPI http://pypi.python.org/pypi/regex/0.1.20100226 -- Alex Willmer <alex@moreati.org.uk> http://moreati.org.uk/blog |
|||
| msg100359 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-03-03 23:48 | |
I just noticed a corner case with the newly introduced grapheme matcher \X, if it is used in a character set:
>>> regex.findall("\X", "abc")
['a', 'b', 'c']
>>> regex.findall("[\X]", "abc")
Traceback (most recent call last):
File "<input>", line 1, in <module>
File "regex.pyc", line 218, in findall
File "regex.pyc", line 1435, in _compile
File "regex.pyc", line 2351, in optimise
File "regex.pyc", line 2705, in optimise
File "regex.pyc", line 2798, in optimise
File "regex.pyc", line 2268, in __hash__
AttributeError: '_Sequence' object has no attribute '_key'
It obviously doesn't make much sense to use this universal literal in the character class (the same with "." in its metacharacter role) and also http://www.regular-expressions.info/refunicode.html doesn't mention this possibility; but the error message might probably be more descriptive, or the pattern might match "X" or "\" and "\X" (?)
I was originally thinking about the possibility to combine the positive and negative character classes, where e.g. \X would be a kind of base; I am not aware of any re engine supporting this, but I eventually found a Unicode guideline for regular expressions, which also covers this:
http://unicode.org/reports/tr18/#Subtraction_and_Intersection
It also surprises a bit, that these are all included in
Basic Unicode Support: Level 1; (even with arbitrary unions, intersections, differences ...) it suggests, that there is probably no implementation available (AFAIK) - even on this basic level, according to this guideline.
Among other features on this level, the section
http://unicode.org/reports/tr18/#Supplementary_Characters
seems useful, especially the handling of the characters beyond \uffff, also in the form of surrogate pairs as single characters.
This might be useful on the narrow python builds, but it is possible that there would be an incompatibility with the handling of these data in "narrow" python itself.
Just some suggestions or rather remarks, as you already implemented many advanced features and are also considering some different approaches ...:-)
vbr
|
|||
| msg100362 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-03-04 00:41 | |
\X shouldn't be allowed in a character class because it's equivalent to \P{M}\p{M}*. It's a bug, now fixed in issue2636-20100304.zip.
I'm not convinced about the set intersection and difference stuff. Isn't that overdoing it a little? :-)
|
|||
| msg100370 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-03-04 01:45 | |
Actually I had that impression too, but I was mainly surprised by these requirements being on the lowest level of Unicode support. Anyway, maybe the relevance of these guidelines for real libraries is lower than I expected.
Probably the simpler cases are adequately handled with lookarounds, e.g. (?:\w(?<!\p{Greek}))+ and the complex examples like symmetric differences seem to be beyond the normal scope of re anyway.
Personally, I would find the surrogate handling more useful, but I see, that it isn't actually the job for the re library, given that the narrow build of python doesn't support indexing, slicing, len of these characters either...
vbr
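To make the surrogate issue concrete, here is a small sketch, assuming a modern Python 3 interpreter, that shows how a character beyond \uffff maps to two UTF-16 code units. On a narrow build those two units were what len, indexing and slicing operated on. The helper names are hypothetical.

```python
def utf16_units(text):
    # Number of 16-bit code units needed to store text as UTF-16.
    return len(text.encode("utf-16-le")) // 2


def surrogate_pair(code_point):
    # Split a supplementary code point into (high, low) surrogates.
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


ch = "\U00010400"  # DESERET CAPITAL LETTER LONG I
print(utf16_units(ch))                             # 2
print([hex(u) for u in surrogate_pair(ord(ch))])   # ['0xd801', '0xdc00']
```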
|
|||
| msg100452 - (view) | Author: Matthew Barnett (mrabarnett) | Date: 2010-03-05 03:27 | |
issue2636-20100305.zip is a new version of the regex module. Just a few tweaks. |
|||
| msg101172 - (view) | Author: Alex Willmer (moreati) | Date: 2010-03-16 15:56 | |
I've adapted the Python 2.6.5 test_re.py as follows,

 from test.test_support import verbose, run_unittest
-import re
-from re import Scanner
+import regex as re
+from regex import Scanner

and run it against regex-2010305. Three tests failed, and the report is attached. |
|||
| msg101181 - (view) | Author: Ezio Melotti (ezio.melotti) | Date: 2010-03-16 19:31 | |
Does regex.py have its own test suite (which also includes tests for all the problems reported in the last few messages)? If so, the new tests could be merged in re's test_re. This will simplify the testing of regex.py and will improve the test coverage of re.py, possibly finding new bugs. It will also be useful to check if the two libraries behave in the same way. |
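One way to check whether the two libraries behave the same is to run identical cases through both and report the differences. The following is only a sketch under stated assumptions: the case list and the helper names summarize and compare are hypothetical, and the regex module is imported by name so that the script still runs when it is not installed.

```python
import importlib
import re

# Hypothetical sample cases as (pattern, text) pairs.
CASES = [(r"a+b", "caaab"), (r"(\d+)-(\d+)", "10-20"), (r"^x", "yx")]


def summarize(module, pattern, text):
    # Reduce a search result to plain data that can be compared.
    m = module.search(pattern, text)
    return None if m is None else (m.group(0), m.groups(), m.span())


def compare(cases, other_name="regex"):
    # Return the cases where the other module disagrees with re,
    # or None if the other module cannot be imported.
    try:
        other = importlib.import_module(other_name)
    except ImportError:
        return None
    return [
        (pattern, text)
        for pattern, text in cases
        if summarize(re, pattern, text) != summarize(other, pattern, text)
    ]


print(compare(CASES))
```

A real comparison would need to cover many more cases, and the full test suite remains the better tool, but this kind of check gives a quick first signal.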
|||
| msg101193 - (view) | Author: Vlastimil Brom (vbr) | Date: 2010-03-16 21:37 | |
I am not sure about the test suite for this regex module, but it seems to me that many of the problems reported here probably don't apply to the current builtin re, as they are connected with the new features of regex. After the suggestion in msg91462, I briefly checked the re test suite and found it very comprehensive, given the feature set. Of course, most (or all?) re tests should apply to regex, but probably not vice versa. vbr |
|||
| History | |||
|---|---|---|---|
| Date | User | Action | Args |
| 2010-03-16 21:37:50 | vbr | set | messages: + msg101193 |
| 2010-03-16 19:31:24 | ezio.melotti | set | messages: + msg101181 |
| 2010-03-16 15:56:37 | moreati | set | files: + regex_test-20100316 messages: + msg101172 |
| 2010-03-05 03:27:29 | mrabarnett | set | files: + issue2636-20100305.zip messages: + msg100452 |
| 2010-03-04 01:45:27 | vbr | set | messages: + msg100370 |
| 2010-03-04 00:41:57 | mrabarnett | set | files: + issue2636-20100304.zip messages: + msg100362 |
| 2010-03-03 23:48:24 | vbr | set | messages: + msg100359 |
| 2010-02-26 14:36:17 | moreati | set | messages: + msg100152 |
| 2010-02-26 03:20:17 | mrabarnett | set | files: + issue2636-20100226.zip messages: + msg100134 |
| 2010-02-25 00:12:54 | mrabarnett | set | files: + issue2636-20100225.zip messages: + msg100080 |
| 2010-02-24 23:14:04 | vbr | set | messages: + msg100076 |
| 2010-02-24 20:25:00 | mrabarnett | set | files: + issue2636-20100224.zip messages: + msg100066 |
| 2010-02-23 01:31:05 | vbr | set | messages: + msg99892 |
| 2010-02-23 00:47:49 | moreati | set | messages: + msg99890 |
| 2010-02-23 00:39:04 | mrabarnett | set | files: + issue2636-20100223.zip messages: + msg99888 |
| 2010-02-22 23:28:30 | mrabarnett | set | files: + issue2636-20100222.zip messages: + msg99872 |
| 2010-02-22 23:10:55 | mrabarnett | set | files: - issue2636-20100222.zip |
| 2010-02-22 22:51:33 | vbr | set | messages: + msg99863 |
| 2010-02-22 21:24:31 | mrabarnett | set | files: + issue2636-20100222.zip messages: + msg99835 |
| 2010-02-21 16:21:20 | mrabarnett | set | messages: + msg99668 |
| 2010-02-21 14:46:40 | moreati | set | files: + Features-backslashes.patch messages: + msg99665 |
| 2010-02-19 01:31:23 | mrabarnett | set | files: + issue2636-20100219.zip messages: + msg99552 |
| 2010-02-19 00:29:46 | vbr | set | messages: + msg99548 |
| 2010-02-18 03:03:19 | mrabarnett | set | files: + issue2636-20100218.zip messages: + msg99494 |
| 2010-02-17 23:43:25 | vbr | set | messages: + msg99481 |
| 2010-02-17 19:35:45 | mrabarnett | set | messages: + msg99479 |
| 2010-02-17 13:01:55 | moreati | set | messages: + msg99470 |
| 2010-02-17 04:09:28 | mrabarnett | set | files: + issue2636-20100217.zip messages: + msg99462 |
| 2010-02-11 02:16:55 | mrabarnett | set | files: + issue2636-20100211.zip messages: + msg99190 |
| 2010-02-11 01:09:51 | vbr | set | messages: + msg99186 |
| 2010-02-10 02:20:06 | mrabarnett | set | files: + issue2636-20100210.zip messages: + msg99148 |
| 2010-02-09 17:38:03 | vbr | set | messages: + msg99132 |
| 2010-02-08 23:45:59 | vbr | set | messages: + msg99072 |
| 2010-02-04 02:34:44 | mrabarnett | set | files: + issue2636-20100204.zip messages: + msg98809 versions: + Python 3.1 |
| 2010-01-16 03:00:02 | mrabarnett | set | files: + issue2636-20100116.zip messages: + msg97860 |
| 2009-12-31 15:26:36 | ezio.melotti | set | priority: normal |
| 2009-08-24 12:55:50 | vbr | set | messages: + msg91917 |
| 2009-08-17 20:29:51 | moreati | set | messages: + msg91671 |
| 2009-08-15 16:12:29 | mrabarnett | set | files: + issue2636-20090815.zip messages: + msg91610 |
| 2009-08-15 14:02:20 | sjmachin | set | messages: + msg91607 |
| 2009-08-15 07:49:47 | mark | set | messages: + msg91598 |
| 2009-08-13 21:14:03 | moreati | set | messages: + msg91535 |
| 2009-08-12 18:01:38 | collinwinter | set | messages: + msg91500 |
| 2009-08-12 12:42:50 | doerwalter | set | nosy: - doerwalter |
| 2009-08-12 12:29:12 | timehorse | set | messages: + msg91497 |
| 2009-08-12 12:16:21 | pitrou | set | messages: + msg91496 |
| 2009-08-12 12:04:09 | timehorse | set | messages: + msg91495 |
| 2009-08-12 03:00:20 | sjmachin | set | messages: + msg91490 |
| 2009-08-11 12:59:22 | r.david.murray | set | nosy: + r.david.murray messages: + msg91474 |
| 2009-08-11 11:15:30 | vbr | set | messages: + msg91473 |
| 2009-08-10 22:42:18 | mrabarnett | set | files: + issue2636-20090810#3.zip messages: + msg91463 |
| 2009-08-10 22:02:00 | gregory.p.smith | set | messages: + msg91462 |
| 2009-08-10 19:27:46 | vbr | set | messages: + msg91460 |
| 2009-08-10 15:04:49 | mrabarnett | set | files: + issue2636-20090810#2.zip messages: + msg91450 |
| 2009-08-10 14:18:57 | mrabarnett | set | files: + issue2636-20090810.zip messages: + msg91448 |
| 2009-08-10 10:58:09 | sjmachin | set | messages: + msg91439 |
| 2009-08-10 08:54:54 | vbr | set | nosy: + vbr messages: + msg91437 |
| 2009-08-04 01:30:19 | mrabarnett | set | files: + issue2636-20090804.zip messages: + msg91250 |
| 2009-08-03 22:36:34 | sjmachin | set | nosy: + sjmachin messages: + msg91245 |
| 2009-07-29 13:01:31 | ezio.melotti | set | messages: + msg91038 |
| 2009-07-29 11:10:25 | mrabarnett | set | files: + issue2636-20090729.zip messages: + msg91035 |
| 2009-07-29 11:09:49 | mrabarnett | set | files: - issue2636-20090729.zip |
| 2009-07-29 00:56:31 | mrabarnett | set | files: + issue2636-20090729.zip messages: + msg91028 |
| 2009-07-27 17:53:10 | akuchling | set | messages: + msg90989 |
| 2009-07-27 17:36:54 | gregory.p.smith | set | messages: + msg90986 |
| 2009-07-27 16:13:03 | mrabarnett | set | files: + issue2636-20090727.zip messages: + msg90985 |
| 2009-07-26 21:29:23 | georg.brandl | set | messages: + msg90961 |
| 2009-07-26 19:11:52 | mrabarnett | set | files: + issue2636-20090726.zip messages: + msg90954 |
| 2009-06-23 20:52:48 | doerwalter | set | nosy: + doerwalter messages: + msg89643 |
| 2009-06-23 17:01:34 | mrabarnett | set | messages: + msg89634 |
| 2009-06-23 16:29:08 | akitada | set | nosy: + akitada messages: + msg89632 |
| 2009-05-20 01:31:06 | rhettinger | unlink | issue5337 dependencies |
| 2009-04-16 14:58:26 | mrabarnett | set | files: + issue2636-patch-2.diff messages: + msg86032 |
| 2009-04-15 23:13:41 | gregory.p.smith | set | messages: + msg86004 |
| 2009-04-15 22:59:42 | gregory.p.smith | set | nosy: + gregory.p.smith |
| 2009-03-31 21:11:02 | georg.brandl | link | issue5337 dependencies |
| 2009-03-29 00:44:33 | mrabarnett | set | files: + issue2636-patch-1.diff messages: + msg84350 |
| 2009-03-23 01:42:54 | mrabarnett | set | messages: + msg83993 |
| 2009-03-23 00:08:38 | nneonneo | set | messages: + msg83989 |
| 2009-03-22 23:33:29 | mrabarnett | set | messages: + msg83988 |
| 2009-03-10 12:14:22 | timehorse | set | messages: + msg83429 |
| 2009-03-10 12:08:04 | pitrou | set | messages: + msg83428 |
| 2009-03-10 12:00:47 | timehorse | set | messages: + msg83427 |
| 2009-03-09 23:09:54 | loewis | set | messages: + msg83411 |
| 2009-03-09 15:15:54 | timehorse | set | messages: + msg83390 |
| 2009-03-07 14:19:11 | jaylogan | set | nosy: + jaylogan |
| 2009-03-07 11:27:06 | loewis | set | nosy: + loewis messages: + msg83277 |
| 2009-03-07 02:48:16 | mrabarnett | set | files: + issue2636-features-6.diff messages: + msg83271 |
| 2009-03-01 01:42:47 | mrabarnett | set | files: + issue2636-features-5.diff messages: + msg82950 |
| 2009-02-26 01:23:14 | mrabarnett | set | files: + issue2636-features-4.diff messages: + msg82739 |
| 2009-02-26 00:42:48 | collinwinter | set | nosy: + collinwinter |
| 2009-02-24 19:29:15 | mrabarnett | set | files: + issue2636-features-3.diff messages: + msg82673 |
| 2009-02-09 19:17:44 | pitrou | set | messages: + msg81475 |
| 2009-02-09 19:09:55 | pitrou | set | messages: + msg81473 |
| 2009-02-08 08:44:52 | ezio.melotti | set | nosy: + ezio.melotti |
| 2009-02-08 00:39:45 | mrabarnett | set | files: + issue2636-features-2.diff messages: + msg81359 |
| 2009-02-06 00:06:03 | nneonneo | set | messages: + msg81240 |
| 2009-02-06 00:03:00 | mrabarnett | set | messages: + msg81239 |
| 2009-02-05 23:52:49 | rsc | set | messages: + msg81238 |
| 2009-02-05 23:13:07 | nneonneo | set | nosy: + nneonneo messages: + msg81236 |
| 2009-02-03 23:08:08 | mrabarnett | set | files: + issue2636-features.diff messages: + msg81112 |
| 2009-02-01 19:25:08 | moreati | set | messages: + msg80916 |
| 2008-10-18 22:54:49 | moreati | set | nosy: + moreati |
| 2008-10-17 12:28:06 | mrabarnett | set | messages: + msg74904 |
| 2008-10-02 22:51:06 | mrabarnett | set | messages: + msg74204 |
| 2008-10-02 22:49:59 | mrabarnett | set | messages: + msg74203 |
| 2008-10-02 16:48:15 | mrabarnett | set | messages: + msg74174 |
| 2008-09-30 23:42:31 | mrabarnett | set | files: + issue2636+01+09-02+17+18+19+20+21+24+26_speedup.diff messages: + msg74104 |
| 2008-09-30 00:45:09 | mrabarnett | set | files: + issue2636-01+09-02+17_backport.diff messages: + msg74058 |
| 2008-09-29 12:36:07 | timehorse | set | messages: + msg74026 |
| 2008-09-29 11:48:00 | timehorse | set | messages: + msg74025 |
| 2008-09-28 02:52:00 | mrabarnett | set | messages: + msg73955 |
| 2008-09-26 18:04:38 | timehorse | set | messages: + msg73875 |
| 2008-09-26 16:28:10 | timehorse | set | messages: + msg73861 |
| 2008-09-26 16:00:54 | mrabarnett | set | messages: + msg73855 |
| 2008-09-26 15:43:46 | timehorse | set | messages: + msg73854 |
| 2008-09-26 15:16:18 | mrabarnett | set | messages: + msg73853 |
| 2008-09-26 13:11:22 | timehorse | set | messages: + msg73848 |
| 2008-09-25 23:59:05 | mrabarnett | set | messages: + msg73827 |
| 2008-09-25 17:36:07 | timehorse | set | messages: + msg73805 |
| 2008-09-25 17:01:11 | mrabarnett | set | messages: + msg73803 |
| 2008-09-25 16:32:38 | timehorse | set | messages: + msg73801 |
| 2008-09-25 15:57:45 | mrabarnett | set | messages: + msg73798 |
| 2008-09-25 14:17:06 | timehorse | set | messages: + msg73794 |
| 2008-09-25 13:43:28 | mrabarnett | set | messages: + msg73791 |
| 2008-09-25 12:23:25 | timehorse | set | messages: + msg73782 |
| 2008-09-25 11:57:54 | timehorse | set | messages: + msg73780 |
| 2008-09-25 11:56:40 | mrabarnett | set | messages: + msg73779 |
| 2008-09-25 00:06:33 | timehorse | set | messages: + msg73766 |
| 2008-09-24 19:45:57 | timehorse | set | messages: + msg73752 |
| 2008-09-24 16:33:35 | georg.brandl | set | nosy: + georg.brandl messages: + msg73730 |
| 2008-09-24 15:48:49 | mrabarnett | set | messages: + msg73721 |
| 2008-09-24 15:09:28 | timehorse | set | messages: + msg73717 |
| 2008-09-24 14:28:03 | mrabarnett | set | nosy: + mrabarnett messages: + msg73714 |
| 2008-09-22 21:31:44 | georg.brandl | link | issue433031 superseder |
| 2008-09-16 11:59:48 | timehorse | set | title: Regexp 2.6 (modifications to current re 2.2.2) -> Regexp 2.7 (modifications to current re 2.2.2) messages: + msg73295 versions: + Python 2.7, - Python 2.6 |
| 2008-09-13 13:40:22 | pitrou | set | messages: + msg73185 |
| 2008-06-19 14:15:54 | mark | set | messages: + msg68409 |
| 2008-06-19 12:01:31 | timehorse | set | messages: + msg68399 |
| 2008-06-18 07:13:25 | mark | set | messages: + msg68358 |
| 2008-06-17 19:07:22 | timehorse | set | files: + issue2636-02.patch messages: + msg68339 |
| 2008-06-17 17:44:14 | timehorse | set | files: - issue2636-07-only.diff |
| 2008-06-17 17:44:10 | timehorse | set | files: - issue2636-07.diff |
| 2008-06-17 17:44:06 | timehorse | set | files: - issue2636-05.diff |
| 2008-06-17 17:44:03 | timehorse | set | files: - issue2636.diff |
| 2008-06-17 17:43:59 | timehorse | set | files: - issue2636-05-only.diff |
| 2008-06-17 17:43:54 | timehorse | set | files: - issue2636-09.patch |
| 2008-06-17 17:43:39 | timehorse | set | files: + issue2636-patches.tar.bz2 messages: + msg68336 |
| 2008-05-29 19:00:39 | timehorse | set | files: - issue2636-07.patch |
| 2008-05-29 19:00:25 | timehorse | set | files: + issue2636-07-only.diff |
| 2008-05-29 18:59:39 | timehorse | set | files: + issue2636-07.diff |
| 2008-05-29 18:58:37 | timehorse | set | files: - issue2636-05.diff |
| 2008-05-29 18:58:22 | timehorse | set | files: + issue2636-05.diff |
| 2008-05-29 18:57:34 | timehorse | set | files: - issue2636.diff |
| 2008-05-29 18:56:29 | timehorse | set | files: + issue2636.diff |
| 2008-05-28 13:57:25 | timehorse | set | messages: + msg67448 |
| 2008-05-28 13:38:46 | mark | set | nosy: + mark messages: + msg67447 |
| 2008-05-24 21:40:35 | timehorse | set | files: - issue2636-05.patch |
| 2008-05-24 21:40:24 | timehorse | set | files: + issue2636-05.diff |
| 2008-05-24 21:39:57 | timehorse | set | files: + issue2636-05-only.diff |
| 2008-05-24 21:39:09 | timehorse | set | files: + issue2636.diff messages: + msg67309 |
| 2008-05-01 14:16:21 | timehorse | set | messages: + msg66033 |
| 2008-04-26 11:51:14 | timehorse | set | messages: + msg65841 |
| 2008-04-26 10:08:05 | pitrou | set | nosy: + pitrou messages: + msg65838 |
| 2008-04-24 20:55:49 | rsc | set | nosy: + rsc |
| 2008-04-24 18:09:25 | jimjjewett | set | messages: + msg65734 |
| 2008-04-24 16:06:27 | timehorse | set | messages: + msg65727 |
| 2008-04-24 14:31:53 | amaury.forgeotdarc | set | nosy: + amaury.forgeotdarc messages: + msg65726 |
| 2008-04-24 14:23:35 | jimjjewett | set | nosy: + jimjjewett messages: + msg65725 |
| 2008-04-18 14:50:44 | timehorse | set | files: + issue2636-05.patch messages: + msg65617 |
| 2008-04-18 14:23:19 | timehorse | set | files: + issue2636-07.patch messages: + msg65614 |
| 2008-04-18 13:38:57 | timehorse | set | files: + issue2636-09.patch keywords: + patch messages: + msg65613 |
| 2008-04-17 22:07:00 | timehorse | set | messages: + msg65593 |
| 2008-04-15 13:22:10 | akuchling | set | components: + Regular Expressions, - Library (Lib) |
| 2008-04-15 12:49:43 | akuchling | set | nosy: + akuchling |
| 2008-04-15 11:57:51 | timehorse | create | |