Regex: How should backreferences contained in the capturing match they reference to work?
On 9/2/07, liorean <liorean at gmail.com> wrote:
var re=/(a\1)+|b/, a=[], key, aab=re.exec('aab') aaab=re.exec('aaab'); for(key in aab) a.push('aab["'+key+'"]: '+aab[key]); for(key in aaab) a.push('aaab["'+key+'"]: '+aaab[key]); a.join('\r\n');
From the spec it's pretty clear (to me anyhow) that SpiderMonkey and
Opera are correct here (and the ES4 reference implementation, whose RegEx engine was modelled directly on the spec, agrees). The reasoning is that (a) the undefined value in the captures array matches any input without advancing the input pointer [15.10.2.9 step 8 substep 3], (b) every element in the captures array starts out as undefined [15.10.2.2 step 2 substep 4], (c) every element in the captures array affected by a particular set of capturing parens is reset to undefined every time those parens are entered [15.10.2.5 case "RepeatMatcher" step 4], and (d) the captures array is only updated after the capturing parens are exited during matching [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5]. Therefore the pattern above is always the same as /(a)+|b/. Therefore the results are what they are.
It seems to me that, looking at the situation from the perspective of a user of regex, that there are two ways of tackling this problem that makes sense, and they are hinged on whether a captured submatch is considered to be undefined or the empty string by default.
In ES3 they are clearly undefined by default.
If the capture is considered to be undefined by default, then the ES3 solution makes most sense... except for one thing: a capturing submatch containing a backreference to the containing capture means a guaranteed failure to match in that path, and is pretty easy to detect at compile time.
The way I understand the spec it means a guaranteed success. Still pretty silly, to be sure.
Throwing a syntax error at compile time seems more appropriate since I hardly think developers want their regex to spend time in constructs that cannot possibly match. Alternatively, engines can simply throw out that entire path at compile time and just set all the contained captures to undefined - behaviour would still be identical to what ES3 specifies. I.e. it's a choice between loud or silent failure.
Or a vacuous and silent success, as in this case.
I'm not sure what problem you're trying to solve, but my hope is that once ES4 is widely implemented programmers will switch to named submatches to avoid the problem with numeric backreferences entirely. In the mean time, there's obviously room for a good implementation to warn about ineffectual submatches like these. Changing the semantics to an error is not very appealing, on the one hand I doubt it will break a lot of sites, on the other hand I doubt it will fix many either.
Sorry for delaying my response on this. Wanted to get some public opinions on it: uri:http://web-graphics.com/2007/09/05/a-quick-js-quizz-for-anybody-who-think-they-know-regex/
Sadly I only got responses from two developers, but they both seem to agree with me (and I was careful to not state my opinion there)...
On 9/2/07, liorean <liorean at gmail.com> wrote:
var re=/(a\1)+|b/, a=[], key, aab=re.exec('aab') aaab=re.exec('aaab'); for(key in aab) a.push('aab["'+key+'"]: '+aab[key]); for(key in aaab) a.push('aaab["'+key+'"]: '+aaab[key]); a.join('\r\n');
On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
From the spec it's pretty clear (to me anyhow) that SpiderMonkey and Opera are correct here (and the ES4 reference implementation, whose RegEx engine was modelled directly on the spec, agrees). The reasoning is that (a) the undefined value in the captures array matches any input without advancing the input pointer [15.10.2.9 step 8 substep 3], (b) every element in the captures array starts out as undefined [15.10.2.2 step 2 substep 4], (c) every element in the captures array affected by a particular set of capturing parens is reset to undefined every time those parens are entered [15.10.2.5 case "RepeatMatcher" step 4], and (d) the captures array is only updated after the capturing parens are exited during matching [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5]. Therefore the pattern above is always the same as /(a)+|b/. Therefore the results are what they are.
Seems I've been looking in entirely the wrong places. Yes, that seems correct.
On 9/2/07, liorean <liorean at gmail.com> wrote:
It seems to me that, looking at the situation from the perspective of a user of regex, that there are two ways of tackling this problem that makes sense, and they are hinged on whether a captured submatch is considered to be undefined or the empty string by default.
On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
In ES3 they are clearly undefined by default.
This was, as stated, looking from the perspective of a user or regex, not an implementer. The ES3 spec is actually the solution that makes least sense from the user perspective, if you ask me. I'd even call it a weird choice, given the alternatives. To throw out a submatch before trying to match it the next time instead of just replacing the submatch value after the next match is unintuitive behaviour.
On 9/2/07, liorean <liorean at gmail.com> wrote:
If the capture is considered to be undefined by default, then the ES3 solution makes most sense... except for one thing: a capturing submatch containing a backreference to the containing capture means a guaranteed failure to match in that path, and is pretty easy to detect at compile time.
On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
The way I understand the spec it means a guaranteed success. Still pretty silly, to be sure.
This is what I was misunderstanding - that the behaviour for when the capture was undefined actually meant a match for the empty string.
On 9/2/07, liorean <liorean at gmail.com> wrote:
Throwing a syntax error at compile time seems more appropriate since I hardly think developers want their regex to spend time in constructs that cannot possibly match. Alternatively, engines can simply throw out that entire path at compile time and just set all the contained captures to undefined - behaviour would still be identical to what ES3 specifies. I.e. it's a choice between loud or silent failure.
On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
Or a vacuous and silent success, as in this case.
Which makes less sense than either the JavaScriptCore or the JScript behaviour, IMO.
On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
I'm not sure what problem you're trying to solve, but my hope is that once ES4 is widely implemented programmers will switch to named submatches to avoid the problem with numeric backreferences entirely.
The problem is the same whether it's named or not, I think. Consider:
/(?P<as>a(?P=as))+|b/.exec('aaab');
Is the behaviour of that any different from the behaviour of these:
/(a\1)+|b/.exec('aaab');
/(?P<as>a\1)+|b/.exec('aaab');
Since in this case, unless I've misunderstood the proposal, (?P=as) is just an alias for \1.
On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
In the mean time, there's obviously room for a good implementation to warn about ineffectual submatches like these. Changing the semantics to an error is not very appealing, on the one hand I doubt it will break a lot of sites, on the other hand I doubt it will fix many either.
I don't believe there is just about any live code that actually uses backreferences within the capturing match they refer to. If there were, I don't think a behaviour as divergent from the other browsers as that of JavaScriptCore would have been able to fly under the radar.
What I was really asking for here is: Which behaviour should I implement. From the user perspective, I think it's best if a backreference to the containing submatch (independent of whether it's a named backreference or an indexed one) is a failure, like in JavaScriptCore. The reason is that it's simple to understand using the following reasoning based on preexpanding the backreference:
- The pattern (\1) matches zero characters plus an expansion of itself, which is zero characters. Thus that's equal to ((?:){Infinity}), which is still zero characters.
- However, (a\1) matches one character plus an expansion of itself (exactly which character is irrelevant, which should be obvious soon), which is one character plus an expansion of itself, which is two characters... ad inifinitum. Thus that is equal to (a{Infinity}) for which there cannot possibly exist a matching sting, since strings are finite.
In other ways, a backreference to the containing capture is logically just plain nonsense.
From an implementor perspective, it doesn't matter much whether this
is a silent or loud failure, I think. Making it silent would just mean we can ignore that whole code path anyway.
I agree making it a syntax error is probably not the best idea, since that could kill an entire script, even if it's never even used. Emitting a warning is probably a good idea, though.
Second best, from a user perspective, is to not reset the captures, like in JScript. This fits well into the following iterative reasoning about what the pattern does:
-
The pattern (\1)+ is equivalent to ((?:)) first repetition, a match for the first capture which is the empty string. The capturing submatch is the new value for the first capture.
-
The pattern (\1)+ is equivalent to (?:)((?:)) second repetition, a match for first repetition which is the empty string plus zero characters plus the first capture which is the empty string. The capturing submatch is the new value for the first capture.
-
The pattern (a\1)+ is equivalent to (a(?:)) first repetition, which matches 'a' plus the first capture which is the empty string. The capturing submatch is the new value for the first capture.
-
The pattern (a\1)+ is equivalent to (?:a)(a(?:a)) second repetition, which matches the first repetition which is 'a' plus 'a' plus the first capture which is 'a'. The capturing submatch is the new value for the first capture.
-
The pattern (a\1)+ is equivalent to (?:aaa)(a(?:aa)) third repetition, which matches the second repetition which is 'aaa' plus 'a' plus the first capture which is 'a'a. The capturing submatch is the new value for the first capture.
I'm sure you recognise the pattern:)
From an implementor perspective, I hardly think not having to reset
part of the capture array for each repetition is bad.
How does it work with capturing submatches within the submatch that is repeated?
/(?:(a)(b)?)+/.exec('aaba');
JavaScriptCore and JScript: ['aaba','a','b'] SpiderMonkey and futhark: ['aaba','a','']
/(?:(\2|a)(b)?)+/.exec('abba');
JavaScriptCore: ['abba','a','b'] JScript: ['ab','a','b'] SpiderMonkey and futhark: ['abba','a','']
ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.)
I repeat: I think ES3 (and SpiderMonkey and futhark) has taken the wrong choice here. It doesn't make sense to throw away submatches after they have been captured, and it's not what regex users expect. And a reasonable handling of a backreference to a containing submatch is that it's nonsense since it logically can't match anything. But even if they're not deemed nonsense, they should refer to last repetition's capture, not be reset with each repetition.
So I guess what I'm asking you is if you think the ES3 algorithm is appropriate, and if a change to a more sensible interpretation could be done in ES4. Since we have three totally different interpretations in browser hosted engines, I bet there's precious little code out there that relies on the ES3 behaviour, or for that matter the particular implementation in any of the browser hosted engines.
It is true that named submatches are just aliases for numbered submatches. My argument was only that programmers are vastly less likely to nest a named backreference inside the capturing submatch of that name than they are when the backreference is a number, because it will be much more obvious what's going on when you can see both the name of the submatch and the name of the back reference and their respective positions.
My answer to you is obviously that you should implement the ES3 behavior (as should JScript and JavaScriptCore). The current behavior is well-defined; it's not a hardship for anyone; the incompatibilities among the engines are probably not a big deal (thus the incompatible engines can be changed so that they conform); thus there is no compelling reason to change the spec either.
/(?:(\2|a)(b)?)+/.exec('abba');
JavaScriptCore: ['abba','a','b'] JScript: ['ab','a','b'] SpiderMonkey and futhark: ['abba','a','']
ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.)
FWIW, the ES4 RI agrees with Mozilla and Opera.
Trying to trace the matching above, we have: ...+ (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so ...+ starts backtracking [step 1 of the continuation for RepeatMatcher] so (\2|a) succeeds, consuming "a" and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "ab" with captures [a,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "b" with captures [,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so + starts backtracking so (\2|a) succeeds, consuming "a" and (b)? succeeds without consuming input so (?:...) succeeds consuming "a" with captures [a,] so ...+ succeeds, matching "abba" with final capture array [a,]
Sorry for the lack of rigor in the above, but I believe that's correct.
On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
My answer to you is obviously that you should implement the ES3 behavior (as should JScript and JavaScriptCore).
Of course I'll implement the ES3 behaviour if ES4 makes no change to it, I just think it would be better to change the behaviour to something that actually matches developer expectations instead.
The current behavior is well-defined; it's not a hardship for anyone; the incompatibilities among the engines are probably not a big deal (thus the incompatible engines can be changed so that they conform); thus there is no compelling reason to change the spec either.
That the spec doesn't match expectations and that there are behaviours that do make sense to replace it with, coupled with the fact there seems to be no obvious compatibility problem with changing it (otherwise JScript and JavaScriptCore surely would have been changed to match the ES3 behaviour), makes no compelling reason?
ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.)
FWIW, the ES4 RI agrees with Mozilla and Opera.
Trying to trace the matching above, we have: ...+ (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so ...+ starts backtracking [step 1 of the continuation for RepeatMatcher] so (\2|a) succeeds, consuming "a" and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "ab" with captures [a,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "b" with captures [,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so + starts backtracking so (\2|a) succeeds, consuming "a" and (b)? succeeds without consuming input so (?:...) succeeds consuming "a" with captures [a,] so ...+ succeeds, matching "abba" with final capture array [a,]
Sorry for the lack of rigor in the above, but I believe that's correct.
Ah, the "undefined really means a match with the empty string" thing again. I should have known...
On 9/13/07, liorean <liorean at gmail.com> wrote:
On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
The current behavior is well-defined; it's not a hardship for anyone; the incompatibilities among the engines are probably not a big deal (thus the incompatible engines can be changed so that they conform); thus there is no compelling reason to change the spec either.
That the spec doesn't match expectations and that there are behaviours that do make sense to replace it with, coupled with the fact there seems to be no obvious compatibility problem with changing it (otherwise JScript and JavaScriptCore surely would have been changed to match the ES3 behaviour), makes no compelling reason?
I hope I'm not being overly flip when I say that it is the spec that circumscribes the set of expectations you are allowed to have. And the spec is entirely clear here, ie what it says is not in question, even if it's not always easy to find out what it says. A somewhat determined developer who needs to rely on the behavior can discover what behavior is expected (though he obviously can't trust the implementations to get it right).
It's your contention that developers will be surprised by the current behavior. What can I say? Edge cases will always surprise somebody. I don't feel surprised that captures are thrown away when a repeat matcher repeats, even if you do.
Warnings are pointless on the web, and the language has no warnings now. If we're going to do something about backreferences inside their own captures, it would have to be to outlaw them and require a syntax error.
I doubt we're inclined to make any incompatible changes to the matching algorithm, and (again) I don't really see the point.
(It is said a lawyer is a person who's not appaled by something if it can be shown to be the consequence of a law. I guess I'm a (language) lawyer on this one :-)
I am not looking to make trouble here, believe me, but I want to
point out two things that could help David's case:
- JS regexps were modeled by me (to lwall's horror ;-) on Perl
regexen. Here's what perl (5.8.8) does:
$ perl "aaab" =~ /(a\1)+|b/; print "$& ($1)\n"; b ()
It's no surprise JavaScriptCore agrees, since it is based on PCRE.
Tamarin is too -- do you know what it does?
- IE JScript does not agree with any of Perl/JavaScriptCore or ES3
and conformant implementations. That does not mean it should prevail
for this edge case. But it does suggest we could change to match
Perl, and match David's two-person (three counting him; perhaps four
if I count ;-) developer sample.
On 9/14/07, Brendan Eich <brendan at mozilla.org> wrote:
I am not looking to make trouble here, believe me, but I want to point out two things that could help David's case:
- JS regexps were modeled by me (to lwall's horror ;-) on Perl regexen. Here's what perl (5.8.8) does:
$ perl "aaab" =~ /(a\1)+|b/; print "$& ($1)\n"; b ()
It's no surprise JavaScriptCore agrees, since it is based on PCRE. Tamarin is too -- do you know what it does?
- IE JScript does not agree with any of Perl/JavaScriptCore or ES3 and conformant implementations. That does not mean it should prevail for this edge case. But it does suggest we could change to match Perl, and match David's two-person (three counting him; perhaps four if I count ;-) developer sample.
I oppose this on merely Perl compatibility grounds -- ES regexes have since evolved through other influences -- but sure, we'll give it a fair hearing. I propose that everyone involved wait until the draft library spec is released (soon, if all goes well) and that specific proposals to changes in regexp semantics are made in terms of that draft spec. (It should be somewhat easier to deal with than the ES3 spec would be.)
On 9/13/07, liorean <liorean at gmail.com> wrote:
That the spec doesn't match expectations and that there are behaviours that do make sense to replace it with, coupled with the fact there seems to be no obvious compatibility problem with changing it (otherwise JScript and JavaScriptCore surely would have been changed to match the ES3 behaviour), makes no compelling reason?
On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
I hope I'm not being overly flip when I say that it is the spec that circumscribes the set of expectations you are allowed to have. And the spec is entirely clear here, ie what it says is not in question, even if it's not always easy to find out what it says. A somewhat determined developer who needs to rely on the behavior can discover what behavior is expected (though he obviously can't trust the implementations to get it right).
Expectations are never formed from the spec, and that is especially true of ECMAScript since the spec is quite hard to read for laymen. Expectations are formed from a logical application of the perceived concepts underlying the language features. Regex are a special case here as well, since most regex articles and tutorials are not particularly language specific, they are typically only specific to the generalised flavour (the Perl flavour and POSIX flavour, respectively). In fact, for a long time the only JavaScript specific regex article that a fast search could find that was the one I wrote for eVolt in 2002.
It's your contention that developers will be surprised by the current behavior. What can I say? Edge cases will always surprise somebody. I don't feel surprised that captures are thrown away when a repeat matcher repeats, even if you do.
Maybe you don't. But almost all regex implementations that implement capturing submatches that are specifically implementing the ES3 algorithm agree with me.
Warnings are pointless on the web, and the language has no warnings now. If we're going to do something about backreferences inside their own captures, it would have to be to outlaw them and require a syntax error.
Warnings are pretty much useless today, yes. But you can always hope for better debugging and development tools.
I doubt we're inclined to make any incompatible changes to the matching algorithm, and (again) I don't really see the point.
That's why I stressed that JScript and JavaScriptCore, and according to Brendan above also Tamarin, fail to implement this according to ES3. It's not an incompatible change - it can't be if the two most common engines are incompatible with the spec. (I gather that'd be JScript and Tamarin.) That pretty much guarantees that there's no code relying on the spec behaviour.
On 14/09/2007, Lars T Hansen <lth at acm.org> wrote:
I oppose this on merely Perl compatibility grounds -- ES regexes have since evolved through other influences -- but sure, we'll give it a fair hearing.
I'm glad to hear that it'll get a fair hearing. Just throw some logs on the fire however, these all agree on backreferences to failed submatches failing to match anything-including-the-empty-string:
JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE, POSIX BRE
(according to uri:http://www.regular-expressions.info/refflavors.html,
I haven't actually tested in all those)
You can of course argue that "failed submatches fail to match" is not the same as "not-yet-captured submatches fail to match" - that would require some testing, and the only one that I have accessible personally right now is Perl.
On 14/09/2007, Garrett Smith <dhtmlkitchen at gmail.com> wrote:
You're post on ES4 list would make for an excellent tutorial on RegExps.
I've been asked to do RegExp tutorials before...
uri:http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/
I think a lot of web devs, including myself, don't have such deep knowledge of RegExps. The differences in script engines show that it's more than web devs who misunderstand the spec. You're examples, including the ones on your blog, indicate that RegExps in js have some counter-intuitive behavior.
Yes. I've been meaning to flesh that out in another blog entry. Probably will have to wait until next week however.
I think it would be excellent material for an article on developer.mozilla.org. Something to consider, at least.
Ah. The problem of choosing where to place your articles... in fact, I'd probably prefer to place any future articles on dev.opera.com (I think Opera is seriously under appreciated by developers, so anything to help...). Or on eVolt, since I like that place.
I think I can sum up the change I think is appropriate by these things:
- undefined should be a failure to match instead of a match to the empty string
- captures should only be set to undefined in two cases - when the regex matching is started, and if inside a negative lookahead
On Sep 14, 2007, at 12:42 PM, liorean wrote:
On 14/09/2007, liorean <liorean at gmail.com> wrote:
Maybe you don't. But almost all regex implementations that implement capturing submatches that are specifically implementing the ES3 algorithm agree with me.
s/that are specifically/that are not specifically/
"Almost all" excludes IE JScript, right? Those implementations are
JavaScriptCore, based on PCRE, and what else?
Again, it's the Perl connection.
On 14/09/2007, Brendan Eich <brendan at mozilla.org> wrote:
"Almost all" excludes IE JScript, right?
Well, it doesn't agree with me on what is the right approach, but it does agree with me that the ES3 approach is wrong.
Those implementations are JavaScriptCore, based on PCRE, and what else?
The only implementations that I know of that specifically implement the ES3 algorithms are those in SpiderMonkey, linear_b and futhark (I have no idea about SEE, Rhino, and a lot of other fringe JS engines however). And the ES4 refimpl I guess.
I only said "almost all" because I haven't tested each and every one. It could be possible that the only ones behaving according to the ES3 algorithm at all are indeed those written to match the ES3 algorithms. It's also possible that there are those that do match the ES3 algorithms but are not written specifically for them - I just don't know of any such implementation.
JavaScriptCore uses PCRE, so of course it's not written specifically for the ES3 algorithms.
The JScript one is shared with at least VBScript, and I wouldn't be surprised if it was shared with VB, VBA and other COM environments as well. I don't think it's written specifically for the ES3 algorithms. It's peculiar, and I think it doesn't agree with anybody else on a few edge cases besides those I've brought up.
Again, it's the Perl connection.
Not only. Both Perl and those POSIX flavour regex that implement capturing submatches...
liorean wrote:
I think I can sum up the change I think is appropriate by these things:
- undefined should be a failure to match instead of a match to the empty string
- captures should only be set to undefined in two cases - when the regex matching is started, and if inside a negative lookahead
I am in complete agreement with liorean. The ES3 handling on these points is non-intuitive, non-compatible with other regex libraries, far less useful than the alternative (I could give countless examples of where the Perl-style handling has real-world practicality), and creates future compatibility issues if ECMAScript were to implement certain features from Perl or other Perl-derivative flavors such as capturing-group-based conditionals.
Lars T Hansen wrote:
I hope I'm not being overly flip when I say that it is the spec that circumscribes the set of expectations you are allowed to have. And the spec is entirely clear here, ie what it says is not in question, even if it's not always easy to find out what it says. A somewhat determined developer who needs to rely on the behavior can discover what behavior is expected (though he obviously can't trust the implementations to get it right).
I consider ES3's handling of capturing group participation and so forth (summed up by liorean) to be bugs in the spec (perhaps I'm being overly flip, but these are my biggest gripes with ES3 regexes). The fact that I consider them bugs which might be fixed in future versions means that I cannot rely on the behavior even though I understand what the spec prescribes.
StevenLevithan wrote:
liorean wrote:
I think I can sum up the change I think is appropriate by these things:
- undefined should be a failure to match instead of a match to the empty string
- captures should only be set to undefined in two cases - when the regex matching is started, and if inside a negative lookahead
I am in complete agreement with liorean. The ES3 handling on these points is non-intuitive, non-compatible with other regex libraries, far less useful than the alternative (I could give countless examples of where the Perl-style handling has real-world practicality), and creates future compatibility issues if ECMAScript were to implement certain features from Perl or other Perl-derivative flavors such as capturing-group-based conditionals.
To support my last comment with a couple of the "countless examples", I can point to my blog:
-
If backreferences to non-participating capturing groups resulted in failure rather than a match of the empty string, it would be possible to mimic conditionals as shown at blog.stevenlevithan.com/archives/mimic-conditionals, blog.stevenlevithan.com/archives/mimic-conditionals . That page presents some generalized patterns, but simpler cases can often be taken advantage of.
-
If captures were only set to undefined when the regex matching starts and after a negative lookahead which contains captures, it would be possible to do things like shown at blog.stevenlevithan.com/archives/multi-attr-capture, blog.stevenlevithan.com/archives/multi-attr-capture .
Those are particular things I've posted about in the past, but I run into these spec bugs ;-) on a regular basis. I have never run into a case where I wished for the ES3 handling of these issues.
Tried to sum up the issue a little more coherently and well articulated in a blog post: uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/
Doesn't really say anything I haven't said somewhere in this thread already, though.
liorean wrote:
Tried to sum up the issue a little more coherently and well articulated in a blog post: uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/
And I've posted a follow-up blog.stevenlevithan.com/archives/es3-regexes-broken here (which in truth is more of a long rant).
BTW, here's another variation on the use of Perl-style group participation handling that I posted on a regex forum yesterday: regexadvice.com/forums/permalink/25764/36926/ShowThread.aspx#36926 Reg Expression for matching [multiple required attributes] (that regex also uses an atomic group, but that's not imperative in this case and can be emulated even in ES3 by using capturing groups in lookahead, if necessary). The Perl-style handling is intuitive and ripe for creative exploitation. The ES3-style handling is essentially useless. But I'll stop beating this horse now...
StevenLevithan wrote:
And I've posted a follow-up blog.stevenlevithan.com/archives/es3-regexes-broken here (which in truth is more of a long rant).
At the risk of spamming the mailing list, I'll note that I've just toned down a lot of the previous, undue negativity towards the draft ES4 regex spec in that post. I'm not really a hateful ingrate. :)
Hello,
I've been playing around with implementing ES3/ES4 regex, (more based on observation of what the browser hosted engines seem to do than by the spec, since I'm trying to build a non-backtracking engine) and noted this case where the browsers don't agree:
JavaScriptCore: aab["0"]: b aab["index"]: 2 aab["input"]: aab aaab["0"]: b aaab["index"]: 3 aaab["input"]: aaab JScript: aab["input"]: aab aab["index"]: 0 aab["lastIndex"]: 1 aab["0"]: a aab["1"]: a aaab["input"]: aaab aaab["index"]: 0 aaab["lastIndex"]: 3 aaab["0"]: aaa aaab["1"]: aa SpiderMonkey: aab["0"]: aa aab["1"]: a aab["index"]: 0 aab["input"]: aab aaab["0"]: aaa aaab["1"]: a aaab["index"]: 0 aaab["input"]: aaab futhark: aab["0"]: aa aab["1"]: a aab["index"]: 0 aab["input"]: aab aaab["0"]: aaa aaab["1"]: a aaab["index"]: 0 aaab["input"]: aaab linear_b: aab["0"]: aa aab["1"]: a aab["index"]: 0 aab["input"]: aab aaab["0"]: aaa aaab["1"]: a aaab["index"]: 0 aaab["input"]: aaab ES3 spec: From my reading of 15.10.2.9 AtomEscape, this matches what JavaScriptCore does except for that ES3 defines the captures and sets them to undefined, whereas JavaScriptCore doesn't seem to define them at all - They are missing from the results rather than included with a value of undefined.
As you can see, three different ways to interpret the handling: JavaScriptCore: Full match on 'aaab': 'b', First capture: not defined
backreference fails to match anything, which means it goes to the alternate path.
submatch is on 'a'+''='a', which is the new value of the first capture. Second repetition, first capture is 'a', thus the submatch is on 'a'+'a'='aa', which is the new value for the first capture. This is consistent with what happens as you add more 'a's to the string being matched.
submatch is on 'a'+''='a', which is the new value of the first capture. Second repetition, first capture is reset to '', thus the submatch is on 'a'+''='a', which is the new value for the first capture. Third repetition, first capture is reset to '', thus the submatch is on 'a'+''='a', which is the new value for the first capture.
It seems to me that, looking at the situation from the perspective of a user of regex, that there are two ways of tackling this problem that makes sense, and they are hinged on whether a captured submatch is considered to be undefined or the empty string by default.
If the capture is considered to be undefined by default, then the ES3 solution makes most sense... except for one thing: a capturing submatch containing a backreference to the containing capture means a guaranteed failure to match in that path, and is pretty easy to detect at compile time. Throwing a syntax error at compile time seems more appropriate since I hardly think developers want their regex to spend time in constructs that cannot possibly match. Alternatively, engines can simply throw out that entire path at compile time and just set all the contained captures to undefined - behaviour would still be identical to what ES3 specifies. I.e. it's a choice between loud or silent failure.
If the capture is considered to be the empty string by default, then the JScript solution makes most sense. There's little logic in resetting the capture to the empty string once the submatch has been captured.