Regexp capturing groups.
2008/9/4 Markus Jarderot <marjar-4 at student.ltu.se>:
When I first noticed this in Firefox I thought it was a bug. After some investigation it turns out that the problem was in the specification. What I am talking about is that ES discards the capturing groups on repetition. I don't know of any Regexp engine which is not based on the ECMA-262 standard that behaves like this.
Neither do I, which I pointed out a few times in the prior discussions on the topic.
I don't know if any web application depends on this behavior, but I wouldn't write any code that did.
Well, I'm sure you will find some code out there if you were to spider for it. However, changing the behaviour would:
- Make the behaviour match that of regex in other languages, and realign the ECMAScript regex with the Perl regex they are based on.
- Make the behaviour more intuitive even for developers without prior experience of regex in other languages or who read regex tutorials that are not language specific.
- Increase the overall usefulness of ECMAScript regex by increasing the problem space they can provide solutions within.
- Probably break a very small amount of live code.
This problem, and that of back-references to non-participating groups, have been discussed on this list before, but nothing seems to have come out of it. esdiscuss/2007-September/thread.html#4513, esdiscuss/2007-September/thread.html#4574
I also made my argument in uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/
and Steven Levithan had already made an argument like it in uri:http://blog.stevenlevithan.com/archives/npcg-javascript.
I've not seen anything come out of it, which I blame myself for because I raised the issue but didn't enter a ticket on it in trac. Of course, the reason I didn't enter it was that Steven Levithan says in this blog entry url:http://blog.stevenlevithan.com/archives/es3-regexes-broken he
had planned to submit it but wanted to see where his other regex tickets went first. So it ended up just a mailing list discussion and not much else.
On Sep 4, 2008, at 11:10 AM, liorean wrote:
I also made my argument in uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/ and Steven Levithan had already made an argument like it in uri:http://blog.stevenlevithan.com/archives/npcg-javascript.
I've not seen anything come out of it, which I blame myself for because I raised the issue but didn't enter a ticket on it in trac. Of course, the reason I didn't enter it was that Steven Levithan says in this blog entry url:http://blog.stevenlevithan.com/archives/es3-regexes-broken he had planned to submit it but wanted to see where his other regex tickets went first. So it ended up just a mailing list discussion and not much else.
Steven did file
bugs.ecmascript.org/ticket/376
I'll champion fixing this, somehow, in Harmony. We should get
Waldemar's opinion on it.
Brendan Eich wrote:
Steven did file
bugs.ecmascript.org/ticket/376
I'll champion fixing this, somehow, in Harmony. We should get
Waldemar's opinion on it.
/be
#376 seems to concern only the issue of back-references to non-participating capture groups. What I described was capture groups within a repetition. As the algorithm in ECMA-262 v3 is currently written, any captures from the previous iteration are discarded when the next iteration begins. #376 mentions it, but also says that this deserves a separate ticket. I have not found any ticket on this specific issue.
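To make the difference between the two issues concrete, here is a small sketch. The function names are hypothetical, and the results in the comments are what I understand the ECMA-262 algorithm to produce; an engine that keeps captures across iterations would presumably match only "a" in the second case.

```javascript
// Issue from ticket 376: a back-reference to a group that did not participate.
// ECMAScript treats the unset capture as matching the empty string.
function backrefToUnsetGroup() {
  return /(?:(a)|b)\1/.exec("b");
}

// Issue described here: captures inside a repeated group are cleared at the
// start of every iteration, so \1 in the second iteration sees no capture.
function backrefInsideRepetition() {
  return /(?:(a)|b\1)+/.exec("ab");
}

console.log(backrefToUnsetGroup());     // matches "b", group 1 is undefined
console.log(backrefInsideRepetition()); // matches "ab", group 1 is undefined
```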
I have recently found that Google Chrome also keeps the captures between iterations.
URI: javascript:alert(/(?:(a)|(b))*/.exec("ababa"))
Firefox 3.0.1: "ababa,a," <-- no "b"
Internet Explorer 7.0.5730.11 and Google Chrome 0.2.149.29: "ababa,a,b" <-- "b" in the end
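For anyone who wants to check an engine without the javascript: URI, the same test can be written as a short script. This is only a sketch; the function name is made up for illustration, and the values in the comments are what the ECMA-262 algorithm yields (the Firefox result above), not the IE/Chrome result.

```javascript
// Runs the test pattern from above and returns the match array.
function repeatedGroupCaptures() {
  return /(?:(a)|(b))*/.exec("ababa");
}

const m = repeatedGroupCaptures();
console.log(m[0]);      // "ababa"
console.log(m[1]);      // "a" (set by the final iteration)
console.log(m[2]);      // undefined (the earlier "b" capture was cleared)
console.log(String(m)); // "ababa,a," which is what alert() displays
```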
-- Markus Jarderot
Good point -- could you please file a separate ticket and cite it
here in reply? Thanks,
Brendan Eich wrote:
Good point -- could you please file a separate ticket and cite it
here in reply? Thanks,
I filed a ticket to fix this bug in Chrome.
Waldemar
On Sep 22, 2008, at 3:32 PM, Waldemar Horwat wrote:
I filed a ticket to fix this bug in Chrome.
The ticket I was asking for, which Markus is going to file now that
he has access to the bugs.ecmascript.org trac, is against
ECMA-262.
Are you asking for Chrome (V8, I mean) to deviate from ES3 in order
to find out what breaks? If so, great, but how about coordination
among Mozilla, Chrome, and WebKit (SFX)?
Brendan Eich wrote:
The ticket I was asking for, which Markus is going to file now that he has access to the bugs.ecmascript.org trac, is against ECMA-262.
Are you asking for Chrome (V8, I mean) to deviate from ES3 in order to find out what breaks? If so, great, but how about coordination among Mozilla, Chrome, and WebKit (SFX)?
/be
No; it's the opposite. I'm asking Chrome to conform to ES3.
Waldemar
On Sep 22, 2008, at 4:17 PM, Waldemar Horwat wrote:
No; it's the opposite. I'm asking Chrome to conform to ES3.
Could you cite the chromium issue link? I'd like to learn what, if
any, web compatibility knowledge was gained in this case, or could
still be gained.
When I first noticed this in Firefox I thought it was a bug. After some investigation it turns out that the problem was in the specification. What I am talking about is that ES discards the capturing groups on repetition. I don't know of any Regexp engine which is not based on the ECMA-262 standard that behaves like this.
(Using JavaScript as implemented in Mozilla Firefox 3.0.1)
A simple example: /(?:(a)|(b))*/.exec("ababa") -> ["ababa", "a", ""]
It recognizes each letter in turn, but when it is time to match the next one it discards the result of the last repetition.
A slightly more practical example, URL query key/value matching:
    var match = /\/thread\.php(?:[&?]key1=([^&#]*)|[&?]key2=([^&#]*)|[&?][^&#]*)*/.exec(url);
    var value1 = match[1];
    var value2 = match[2];
On most other Regexp engines this would store the value after key1 in group 1 and the value after key2 in group 2, independent of their order in the input string. On ECMA-262 based engines, only the last matching value is kept. The same technique could be applied to attributes in HTML tags.
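A runnable sketch of that query example follows. It assumes the pattern was meant to read with `*` quantifiers and an escaped path; the function name and the example.com URLs are hypothetical. The comments show what an ECMA-262 conforming engine returns.

```javascript
// Assumed form of the pattern: each [&?]name=value pair is one iteration
// of the repeated non-capturing group.
const queryPattern =
  /\/thread\.php(?:[&?]key1=([^&#]*)|[&?]key2=([^&#]*)|[&?][^&#]*)*/;

function extractInOnePass(url) {
  const match = queryPattern.exec(url);
  if (match === null) return null;
  return { value1: match[1], value2: match[2] };
}

// Only the group matched by the final iteration survives.
console.log(extractInOnePass("http://example.com/thread.php?key1=foo&key2=bar"));
// value1 is undefined, value2 is "bar"
console.log(extractInOnePass("http://example.com/thread.php?key2=bar&key1=foo"));
// value1 is "foo", value2 is undefined
```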
To get this to work with ECMA-262 based engines, you could first pick out the query string with one Regexp, and then look for each key in turn:
    var query = /\/thread\.php(\?[^#]*)/.exec(url)[1];
    var value1 = /[&?]key1=([^&#]*)/.exec(query)[1];
    var value2 = /[&?]key2=([^&#]*)/.exec(query)[1];
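The two-step workaround could be wrapped up roughly as below. This is a sketch, not a tested library routine: the function name and the example.com URL are hypothetical, the patterns assume the `*` quantifiers and escaped path, and null checks are added because `exec` returns null when a key is absent.

```javascript
function extractInTwoSteps(url) {
  // Step 1: isolate the query string, from "?" up to an optional "#fragment".
  const queryMatch = /\/thread\.php(\?[^#]*)/.exec(url);
  if (queryMatch === null) return null;
  const query = queryMatch[1];

  // Step 2: look up each key separately, so no repeated group is involved.
  const m1 = /[&?]key1=([^&#]*)/.exec(query);
  const m2 = /[&?]key2=([^&#]*)/.exec(query);
  return {
    value1: m1 ? m1[1] : undefined,
    value2: m2 ? m2[1] : undefined,
  };
}

console.log(extractInTwoSteps("http://example.com/thread.php?key1=foo&key2=bar#top"));
// value1 is "foo", value2 is "bar"
```

Both values are found regardless of their order, at the cost of scanning the query string once per key.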
I don't know if any web application depends on this behavior, but I wouldn't write any code that did.
This problem, and that of back-references to non-participating groups, have been discussed on this list before, but nothing seems to have come out of it. esdiscuss/2007-September/thread.html#4513, esdiscuss/2007-September/thread.html#4574