Regex: How should backreferences contained in the capturing match they reference to work?

# liorean (18 years ago)

Hello,

I've been playing around with implementing ES3/ES4 regex, (more based on observation of what the browser hosted engines seem to do than by the spec, since I'm trying to build a non-backtracking engine) and noted this case where the browsers don't agree:

var
    re=/(a\1)+|b/,
    a=[],
    key,
    aab=re.exec('aab')
    aaab=re.exec('aaab');
for(key in aab)
     a.push('aab["'+key+'"]: '+aab[key]);
for(key in aaab)
     a.push('aaab["'+key+'"]: '+aaab[key]);
a.join('\r\n');

JavaScriptCore: aab["0"]: b aab["index"]: 2 aab["input"]: aab aaab["0"]: b aaab["index"]: 3 aaab["input"]: aaab JScript: aab["input"]: aab aab["index"]: 0 aab["lastIndex"]: 1 aab["0"]: a aab["1"]: a aaab["input"]: aaab aaab["index"]: 0 aaab["lastIndex"]: 3 aaab["0"]: aaa aaab["1"]: aa SpiderMonkey: aab["0"]: aa aab["1"]: a aab["index"]: 0 aab["input"]: aab aaab["0"]: aaa aaab["1"]: a aaab["index"]: 0 aaab["input"]: aaab futhark: aab["0"]: aa aab["1"]: a aab["index"]: 0 aab["input"]: aab aaab["0"]: aaa aaab["1"]: a aaab["index"]: 0 aaab["input"]: aaab linear_b: aab["0"]: aa aab["1"]: a aab["index"]: 0 aab["input"]: aab aaab["0"]: aaa aaab["1"]: a aaab["index"]: 0 aaab["input"]: aaab ES3 spec: From my reading of 15.10.2.9 AtomEscape, this matches what JavaScriptCore does except for that ES3 defines the captures and sets them to undefined, whereas JavaScriptCore doesn't seem to define them at all - They are missing from the results rather than included with a value of undefined.

As you can see, three different ways to interpret the handling: JavaScriptCore: Full match on 'aaab': 'b', First capture: not defined

    My guesson how this parsing works:
        First repetition, capture is not-yet-defined so the

backreference fails to match anything, which means it goes to the alternate path.

JScript:
    Full match on 'aaab': 'aaa',
    First capture: 'aa'

    My guess on how this parsing works:
        First repetition, the first capture is '', thus the

submatch is on 'a'+''='a', which is the new value of the first capture. Second repetition, first capture is 'a', thus the submatch is on 'a'+'a'='aa', which is the new value for the first capture. This is consistent with what happens as you add more 'a's to the string being matched.

SpiderMonkey, futhark, linear_b:
    Full match on 'aaab': 'aaa',
    First capture: 'a'

    My guess on how this parsing works:
        First repetition, the first capture is set to '', thus the

submatch is on 'a'+''='a', which is the new value of the first capture. Second repetition, first capture is reset to '', thus the submatch is on 'a'+''='a', which is the new value for the first capture. Third repetition, first capture is reset to '', thus the submatch is on 'a'+''='a', which is the new value for the first capture.

It seems to me that, looking at the situation from the perspective of a user of regex, that there are two ways of tackling this problem that makes sense, and they are hinged on whether a captured submatch is considered to be undefined or the empty string by default.

If the capture is considered to be undefined by default, then the ES3 solution makes most sense... except for one thing: a capturing submatch containing a backreference to the containing capture means a guaranteed failure to match in that path, and is pretty easy to detect at compile time. Throwing a syntax error at compile time seems more appropriate since I hardly think developers want their regex to spend time in constructs that cannot possibly match. Alternatively, engines can simply throw out that entire path at compile time and just set all the contained captures to undefined - behaviour would still be identical to what ES3 specifies. I.e. it's a choice between loud or silent failure.

If the capture is considered to be the empty string by default, then the JScript solution makes most sense. There's little logic in resetting the capture to the empty string once the submatch has been captured.

Hello,

I've been playing around with implementing ES3/ES4 regex, (more based
on observation of what the browser hosted engines seem to do than by
the spec, since I'm trying to build a non-backtracking engine) and
noted this case where the browsers don't agree:

    var
        re=/(a\1)+|b/,
        a=[],
        key,
        aab=re.exec('aab')
        aaab=re.exec('aaab');
    for(key in aab)
         a.push('aab["'+key+'"]: '+aab[key]);
    for(key in aaab)
         a.push('aaab["'+key+'"]: '+aaab[key]);
    a.join('\r\n');

JavaScriptCore:
    aab["0"]: b
    aab["index"]: 2
    aab["input"]: aab
    aaab["0"]: b
    aaab["index"]: 3
    aaab["input"]: aaab
JScript:
    aab["input"]: aab
    aab["index"]: 0
    aab["lastIndex"]: 1
    aab["0"]: a
    aab["1"]: a
    aaab["input"]: aaab
    aaab["index"]: 0
    aaab["lastIndex"]: 3
    aaab["0"]: aaa
    aaab["1"]: aa
SpiderMonkey:
    aab["0"]: aa
    aab["1"]: a
    aab["index"]: 0
    aab["input"]: aab
    aaab["0"]: aaa
    aaab["1"]: a
    aaab["index"]: 0
    aaab["input"]: aaab
futhark:
    aab["0"]: aa
    aab["1"]: a
    aab["index"]: 0
    aab["input"]: aab
    aaab["0"]: aaa
    aaab["1"]: a
    aaab["index"]: 0
    aaab["input"]: aaab
linear_b:
    aab["0"]: aa
    aab["1"]: a
    aab["index"]: 0
    aab["input"]: aab
    aaab["0"]: aaa
    aaab["1"]: a
    aaab["index"]: 0
    aaab["input"]: aaab
ES3 spec:
    From my reading of 15.10.2.9 AtomEscape, this matches what
JavaScriptCore does except for that ES3 defines the captures and sets
them to undefined, whereas JavaScriptCore doesn't seem to define them
at all - They are missing from the results rather than included with a
value of undefined.

As you can see, three different ways to interpret the handling:
    JavaScriptCore:
        Full match on 'aaab': 'b',
        First capture: not defined

        My guesson how this parsing works:
            First repetition, capture is not-yet-defined so the
backreference fails to match anything, which means it goes to the
alternate path.

    JScript:
        Full match on 'aaab': 'aaa',
        First capture: 'aa'

        My guess on how this parsing works:
            First repetition, the first capture is '', thus the
submatch is on 'a'+''='a', which is the new value of the first
capture.
            Second repetition, first capture is 'a', thus the submatch
is on 'a'+'a'='aa', which is the new value for the first capture.
            This is consistent with what happens as you add more 'a's
to the string being matched.

    SpiderMonkey, futhark, linear_b:
        Full match on 'aaab': 'aaa',
        First capture: 'a'

        My guess on how this parsing works:
            First repetition, the first capture is set to '', thus the
submatch is on 'a'+''='a', which is the new value of the first
capture.
            Second repetition, first capture is reset to '', thus the
submatch is on 'a'+''='a', which is the new value for the first
capture.
            Third repetition, first capture is reset to '', thus the
submatch is on 'a'+''='a', which is the new value for the first
capture.



It seems to me that, looking at the situation from the perspective of
a user of regex, that there are two ways of tackling this problem that
makes sense, and they are hinged on whether a captured submatch is
considered to be undefined or the empty string by default.


If the capture is considered to be undefined by default, then the ES3
solution makes most sense... except for one thing: a capturing
submatch containing a backreference to the containing capture means a
guaranteed failure to match in that path, and is pretty easy to detect
at compile time. Throwing a syntax error at compile time seems more
appropriate since I hardly think developers want their regex to spend
time in constructs that cannot possibly match. Alternatively, engines
can simply throw out that entire path at compile time and just set all
the contained captures to undefined - behaviour would still be
identical to what ES3 specifies. I.e. it's a choice between loud or
silent failure.

If the capture is considered to be the empty string by default, then
the JScript solution makes most sense. There's little logic in
resetting the capture to the empty string once the submatch has been
captured.
-- 
David "liorean" Andersson

# Lars T Hansen (18 years ago)

On 9/2/07, liorean <liorean at gmail.com> wrote:

var
    re=/(a\1)+|b/,
    a=[],
    key,
    aab=re.exec('aab')
    aaab=re.exec('aaab');
for(key in aab)
     a.push('aab["'+key+'"]: '+aab[key]);
for(key in aaab)
     a.push('aaab["'+key+'"]: '+aaab[key]);
a.join('\r\n');

From the spec it's pretty clear (to me anyhow) that SpiderMonkey and

Opera are correct here (and the ES4 reference implementation, whose RegEx engine was modelled directly on the spec, agrees). The reasoning is that (a) the undefined value in the captures array matches any input without advancing the input pointer [15.10.2.9 step 8 substep 3], (b) every element in the captures array starts out as undefined [15.10.2.2 step 2 substep 4], (c) every element in the captures array affected by a particular set of capturing parens is reset to undefined every time those parens are entered [15.10.2.5 case "RepeatMatcher" step 4], and (d) the captures array is only updated after the capturing parens are exited during matching [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5]. Therefore the pattern above is always the same as /(a)+|b/. Therefore the results are what they are.

It seems to me that, looking at the situation from the perspective of a user of regex, that there are two ways of tackling this problem that makes sense, and they are hinged on whether a captured submatch is considered to be undefined or the empty string by default.

In ES3 they are clearly undefined by default.

If the capture is considered to be undefined by default, then the ES3 solution makes most sense... except for one thing: a capturing submatch containing a backreference to the containing capture means a guaranteed failure to match in that path, and is pretty easy to detect at compile time.

The way I understand the spec it means a guaranteed success. Still pretty silly, to be sure.

Throwing a syntax error at compile time seems more appropriate since I hardly think developers want their regex to spend time in constructs that cannot possibly match. Alternatively, engines can simply throw out that entire path at compile time and just set all the contained captures to undefined - behaviour would still be identical to what ES3 specifies. I.e. it's a choice between loud or silent failure.

Or a vacuous and silent success, as in this case.

I'm not sure what problem you're trying to solve, but my hope is that once ES4 is widely implemented programmers will switch to named submatches to avoid the problem with numeric backreferences entirely. In the mean time, there's obviously room for a good implementation to warn about ineffectual submatches like these. Changing the semantics to an error is not very appealing, on the one hand I doubt it will break a lot of sites, on the other hand I doubt it will fix many either.

On 9/2/07, liorean <liorean at gmail.com> wrote:
>
>     var
>         re=/(a\1)+|b/,
>         a=[],
>         key,
>         aab=re.exec('aab')
>         aaab=re.exec('aaab');
>     for(key in aab)
>          a.push('aab["'+key+'"]: '+aab[key]);
>     for(key in aaab)
>          a.push('aaab["'+key+'"]: '+aaab[key]);
>     a.join('\r\n');

>From the spec it's pretty clear (to me anyhow) that SpiderMonkey and
Opera are correct here (and the ES4 reference implementation, whose
RegEx engine was modelled directly on the spec, agrees).  The
reasoning is that (a) the **undefined** value in the captures array
matches any input without advancing the input pointer [15.10.2.9 step
8 substep 3], (b) every element in the captures array starts out as
**undefined** [15.10.2.2 step 2 substep 4], (c) every element in the
captures array affected by a particular set of capturing parens is
reset to **undefined** every time those parens are entered [15.10.2.5
case "RepeatMatcher" step 4], and (d) the captures array is only
updated after the capturing parens are exited during matching
[15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5].
Therefore the pattern above is always the same as /(a)+|b/.  Therefore
the results are what they are.

> It seems to me that, looking at the situation from the perspective of
> a user of regex, that there are two ways of tackling this problem that
> makes sense, and they are hinged on whether a captured submatch is
> considered to be undefined or the empty string by default.

In ES3 they are clearly **undefined** by default.

> If the capture is considered to be undefined by default, then the ES3
> solution makes most sense... except for one thing: a capturing
> submatch containing a backreference to the containing capture means a
> guaranteed failure to match in that path, and is pretty easy to detect
> at compile time.

The way I understand the spec it means a guaranteed success.  Still
pretty silly, to be sure.

> Throwing a syntax error at compile time seems more
> appropriate since I hardly think developers want their regex to spend
> time in constructs that cannot possibly match. Alternatively, engines
> can simply throw out that entire path at compile time and just set all
> the contained captures to undefined - behaviour would still be
> identical to what ES3 specifies. I.e. it's a choice between loud or
> silent failure.

Or a vacuous and silent success, as in this case.

I'm not sure what problem you're trying to solve, but my hope is that
once ES4 is widely implemented programmers will switch to named
submatches to avoid the problem with numeric backreferences entirely.
In the mean time, there's obviously room for a good implementation to
warn about ineffectual submatches like these.  Changing the semantics
to an error is not very appealing, on the one hand I doubt it will
break a lot of sites, on the other hand I doubt it will fix many
either.

--lars

# liorean (18 years ago)

Sorry for delaying my response on this. Wanted to get some public opinions on it: uri:http://web-graphics.com/2007/09/05/a-quick-js-quizz-for-anybody-who-think-they-know-regex/

Sadly I only got responses from two developers, but they both seem to agree with me (and I was careful to not state my opinion there)...

On 9/2/07, liorean <liorean at gmail.com> wrote:

var
    re=/(a\1)+|b/,
    a=[],
    key,
    aab=re.exec('aab')
    aaab=re.exec('aaab');
for(key in aab)
     a.push('aab["'+key+'"]: '+aab[key]);
for(key in aaab)
     a.push('aaab["'+key+'"]: '+aaab[key]);
a.join('\r\n');

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:

From the spec it's pretty clear (to me anyhow) that SpiderMonkey and Opera are correct here (and the ES4 reference implementation, whose RegEx engine was modelled directly on the spec, agrees). The reasoning is that (a) the undefined value in the captures array matches any input without advancing the input pointer [15.10.2.9 step 8 substep 3], (b) every element in the captures array starts out as undefined [15.10.2.2 step 2 substep 4], (c) every element in the captures array affected by a particular set of capturing parens is reset to undefined every time those parens are entered [15.10.2.5 case "RepeatMatcher" step 4], and (d) the captures array is only updated after the capturing parens are exited during matching [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5]. Therefore the pattern above is always the same as /(a)+|b/. Therefore the results are what they are.

Seems I've been looking in entirely the wrong places. Yes, that seems correct.

On 9/2/07, liorean <liorean at gmail.com> wrote:

It seems to me that, looking at the situation from the perspective of a user of regex, that there are two ways of tackling this problem that makes sense, and they are hinged on whether a captured submatch is considered to be undefined or the empty string by default.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:

In ES3 they are clearly undefined by default.

This was, as stated, looking from the perspective of a user or regex, not an implementer. The ES3 spec is actually the solution that makes least sense from the user perspective, if you ask me. I'd even call it a weird choice, given the alternatives. To throw out a submatch before trying to match it the next time instead of just replacing the submatch value after the next match is unintuitive behaviour.

On 9/2/07, liorean <liorean at gmail.com> wrote:

If the capture is considered to be undefined by default, then the ES3 solution makes most sense... except for one thing: a capturing submatch containing a backreference to the containing capture means a guaranteed failure to match in that path, and is pretty easy to detect at compile time.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:

The way I understand the spec it means a guaranteed success. Still pretty silly, to be sure.

This is what I was misunderstanding - that the behaviour for when the capture was undefined actually meant a match for the empty string.

On 9/2/07, liorean <liorean at gmail.com> wrote:

Throwing a syntax error at compile time seems more appropriate since I hardly think developers want their regex to spend time in constructs that cannot possibly match. Alternatively, engines can simply throw out that entire path at compile time and just set all the contained captures to undefined - behaviour would still be identical to what ES3 specifies. I.e. it's a choice between loud or silent failure.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:

Or a vacuous and silent success, as in this case.

Which makes less sense than either the JavaScriptCore or the JScript behaviour, IMO.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:

I'm not sure what problem you're trying to solve, but my hope is that once ES4 is widely implemented programmers will switch to named submatches to avoid the problem with numeric backreferences entirely.

The problem is the same whether it's named or not, I think. Consider:

/(?P<as>a(?P=as))+|b/.exec('aaab');

Is the behaviour of that any different from the behaviour of these:

/(a\1)+|b/.exec('aaab');
/(?P<as>a\1)+|b/.exec('aaab');

Since in this case, unless I've misunderstood the proposal, (?P=as) is just an alias for \1.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:

In the mean time, there's obviously room for a good implementation to warn about ineffectual submatches like these. Changing the semantics to an error is not very appealing, on the one hand I doubt it will break a lot of sites, on the other hand I doubt it will fix many either.

I don't believe there is just about any live code that actually uses backreferences within the capturing match they refer to. If there were, I don't think a behaviour as divergent from the other browsers as that of JavaScriptCore would have been able to fly under the radar.

What I was really asking for here is: Which behaviour should I implement. From the user perspective, I think it's best if a backreference to the containing submatch (independent of whether it's a named backreference or an indexed one) is a failure, like in JavaScriptCore. The reason is that it's simple to understand using the following reasoning based on preexpanding the backreference:

The pattern (\1) matches zero characters plus an expansion of itself, which is zero characters. Thus that's equal to ((?:){Infinity}), which is still zero characters.
However, (a\1) matches one character plus an expansion of itself (exactly which character is irrelevant, which should be obvious soon), which is one character plus an expansion of itself, which is two characters... ad inifinitum. Thus that is equal to (a{Infinity}) for which there cannot possibly exist a matching sting, since strings are finite.

In other ways, a backreference to the containing capture is logically just plain nonsense.

From an implementor perspective, it doesn't matter much whether this

is a silent or loud failure, I think. Making it silent would just mean we can ignore that whole code path anyway.

I agree making it a syntax error is probably not the best idea, since that could kill an entire script, even if it's never even used. Emitting a warning is probably a good idea, though.

Second best, from a user perspective, is to not reset the captures, like in JScript. This fits well into the following iterative reasoning about what the pattern does:

The pattern (\1)+ is equivalent to ((?:)) first repetition, a match for the first capture which is the empty string. The capturing submatch is the new value for the first capture.
The pattern (\1)+ is equivalent to (?:)((?:)) second repetition, a match for first repetition which is the empty string plus zero characters plus the first capture which is the empty string. The capturing submatch is the new value for the first capture.
The pattern (a\1)+ is equivalent to (a(?:)) first repetition, which matches 'a' plus the first capture which is the empty string. The capturing submatch is the new value for the first capture.
The pattern (a\1)+ is equivalent to (?:a)(a(?:a)) second repetition, which matches the first repetition which is 'a' plus 'a' plus the first capture which is 'a'. The capturing submatch is the new value for the first capture.
The pattern (a\1)+ is equivalent to (?:aaa)(a(?:aa)) third repetition, which matches the second repetition which is 'aaa' plus 'a' plus the first capture which is 'a'a. The capturing submatch is the new value for the first capture.

I'm sure you recognise the pattern:)

From an implementor perspective, I hardly think not having to reset

part of the capture array for each repetition is bad.

How does it work with capturing submatches within the submatch that is repeated?

/(?:(a)(b)?)+/.exec('aaba');

JavaScriptCore and JScript: ['aaba','a','b'] SpiderMonkey and futhark: ['aaba','a','']

/(?:(\2|a)(b)?)+/.exec('abba');

JavaScriptCore: ['abba','a','b'] JScript: ['ab','a','b'] SpiderMonkey and futhark: ['abba','a','']

ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.)

I repeat: I think ES3 (and SpiderMonkey and futhark) has taken the wrong choice here. It doesn't make sense to throw away submatches after they have been captured, and it's not what regex users expect. And a reasonable handling of a backreference to a containing submatch is that it's nonsense since it logically can't match anything. But even if they're not deemed nonsense, they should refer to last repetition's capture, not be reset with each repetition.

So I guess what I'm asking you is if you think the ES3 algorithm is appropriate, and if a change to a more sensible interpretation could be done in ES4. Since we have three totally different interpretations in browser hosted engines, I bet there's precious little code out there that relies on the ES3 behaviour, or for that matter the particular implementation in any of the browser hosted engines.

Sorry for delaying my response on this. Wanted to get some public
opinions on it:
<uri:http://web-graphics.com/2007/09/05/a-quick-js-quizz-for-anybody-who-think-they-know-regex/>

Sadly I only got responses from two developers, but they both seem to
agree with me (and I was careful to not state my opinion there)...

> On 9/2/07, liorean <liorean at gmail.com> wrote:
> >     var
> >         re=/(a\1)+|b/,
> >         a=[],
> >         key,
> >         aab=re.exec('aab')
> >         aaab=re.exec('aaab');
> >     for(key in aab)
> >          a.push('aab["'+key+'"]: '+aab[key]);
> >     for(key in aaab)
> >          a.push('aaab["'+key+'"]: '+aaab[key]);
> >     a.join('\r\n');

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> From the spec it's pretty clear (to me anyhow) that SpiderMonkey and
> Opera are correct here (and the ES4 reference implementation, whose
> RegEx engine was modelled directly on the spec, agrees).  The
> reasoning is that (a) the **undefined** value in the captures array
> matches any input without advancing the input pointer [15.10.2.9 step
> 8 substep 3], (b) every element in the captures array starts out as
> **undefined** [15.10.2.2 step 2 substep 4], (c) every element in the
> captures array affected by a particular set of capturing parens is
> reset to **undefined** every time those parens are entered [15.10.2.5
> case "RepeatMatcher" step 4], and (d) the captures array is only
> updated after the capturing parens are exited during matching
> [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5].
> Therefore the pattern above is always the same as /(a)+|b/.  Therefore
> the results are what they are.

Seems I've been looking in entirely the wrong places. Yes, that seems correct.

> On 9/2/07, liorean <liorean at gmail.com> wrote:
> > It seems to me that, looking at the situation from the perspective of
> > a user of regex, that there are two ways of tackling this problem that
> > makes sense, and they are hinged on whether a captured submatch is
> > considered to be undefined or the empty string by default.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> In ES3 they are clearly **undefined** by default.

This was, as stated, looking from the perspective of a *user* or
regex, not an implementer. The ES3 spec is actually the solution that
makes least sense from the user perspective, if you ask me. I'd even
call it a weird choice, given the alternatives. To throw out a
submatch before trying to match it the next time instead of just
replacing the submatch value after the next match is unintuitive
behaviour.

> On 9/2/07, liorean <liorean at gmail.com> wrote:
> > If the capture is considered to be undefined by default, then the ES3
> > solution makes most sense... except for one thing: a capturing
> > submatch containing a backreference to the containing capture means a
> > guaranteed failure to match in that path, and is pretty easy to detect
> > at compile time.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> The way I understand the spec it means a guaranteed success.  Still
> pretty silly, to be sure.

This is what I was misunderstanding - that the behaviour for when the
capture was undefined actually meant a match for the empty string.

> On 9/2/07, liorean <liorean at gmail.com> wrote:
> > Throwing a syntax error at compile time seems more
> > appropriate since I hardly think developers want their regex to spend
> > time in constructs that cannot possibly match. Alternatively, engines
> > can simply throw out that entire path at compile time and just set all
> > the contained captures to undefined - behaviour would still be
> > identical to what ES3 specifies. I.e. it's a choice between loud or
> > silent failure.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> Or a vacuous and silent success, as in this case.

Which makes less sense than either the JavaScriptCore or the JScript
behaviour, IMO.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> I'm not sure what problem you're trying to solve, but my hope is that
> once ES4 is widely implemented programmers will switch to named
> submatches to avoid the problem with numeric backreferences entirely.

The problem is the same whether it's named or not, I think. Consider:

    /(?P<as>a(?P=as))+|b/.exec('aaab');

Is the behaviour of that any different from the behaviour of these:

    /(a\1)+|b/.exec('aaab');
    /(?P<as>a\1)+|b/.exec('aaab');

Since in this case, unless I've misunderstood the proposal, (?P=as) is
just an alias for \1.

On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> In the mean time, there's obviously room for a good implementation to
> warn about ineffectual submatches like these.  Changing the semantics
> to an error is not very appealing, on the one hand I doubt it will
> break a lot of sites, on the other hand I doubt it will fix many
> either.

I don't believe there is just about any live code that actually uses
backreferences within the capturing match they refer to. If there
were, I don't think a behaviour as divergent from the other browsers
as that of JavaScriptCore would have been able to fly under the radar.

What I was really asking for here is: Which behaviour should I
implement. From the user perspective, I think it's best if a
backreference to the containing submatch (independent of whether it's
a named backreference or an indexed one) is a failure, like in
JavaScriptCore. The reason is that it's simple to understand using the
following reasoning based on preexpanding the backreference:

- The pattern (\1) matches zero characters plus an expansion of
itself, which is zero characters. Thus that's equal to
((?:){Infinity}), which is still zero characters.
- However, (a\1) matches one character plus an expansion of itself
(exactly which character is irrelevant, which should be obvious soon),
which is one character plus an expansion of itself, which is two
characters... ad inifinitum. Thus that is equal to (a{Infinity}) for
which there cannot possibly exist a matching sting, since strings are
finite.

In other ways, a backreference to the containing capture is logically
just plain nonsense.

>From an implementor perspective, it doesn't matter much whether this
is a silent or loud failure, I think. Making it silent would just mean
we can ignore that whole code path anyway.

I agree making it a syntax error is probably not the best idea, since
that could kill an entire script, even if it's never even used.
Emitting a warning is probably a good idea, though.

Second best, from a user perspective, is to not reset the captures,
like in JScript. This fits well into the following iterative reasoning
about what the pattern does:

- The pattern (\1)+ is equivalent to ((?:)) first repetition, a match
for the first capture which is the empty string. The capturing
submatch is the new value for the first capture.
- The pattern (\1)+ is equivalent to (?:)((?:)) second repetition, a
match for first repetition which is the empty string plus zero
characters plus the first capture which is the empty string. The
capturing submatch is the new value for the first capture.

- The pattern (a\1)+ is equivalent to (a(?:)) first repetition, which
matches 'a' plus the first capture which is the empty string. The
capturing submatch is the new value for the first capture.
- The pattern (a\1)+ is equivalent to (?:a)(a(?:a)) second repetition,
which matches the first repetition which is 'a' plus 'a' plus the
first capture which is 'a'. The capturing submatch is the new value
for the first capture.
- The pattern (a\1)+ is equivalent to (?:aaa)(a(?:aa)) third
repetition, which matches the second repetition which is 'aaa' plus
'a' plus the first capture which is 'a'a. The capturing submatch is
the new value for the first capture.

I'm sure you recognise the pattern:)

>From an implementor perspective, I hardly think not having to reset
part of the capture array for each repetition is bad.

How does it work with capturing submatches within the submatch that is repeated?

    /(?:(a)(b)?)+/.exec('aaba');

JavaScriptCore and JScript: ['aaba','a','b']
SpiderMonkey and futhark: ['aaba','a','']

    /(?:(\2|a)(b)?)+/.exec('abba');

JavaScriptCore: ['abba','a','b']
JScript: ['ab','a','b']
SpiderMonkey and futhark: ['abba','a','']

ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the
correct results would be  ['ab','a','b'] since the \2 capture would be
set to undefined before the second repetition. Thus futhark and
SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy
if it turned out I'm wrong here, however.)

I repeat: I think ES3 (and SpiderMonkey and futhark) has taken the
wrong choice here. It doesn't make sense to throw away submatches
after they have been captured, and it's not what regex users expect.
And a reasonable handling of a backreference to a containing submatch
is that it's nonsense since it logically can't match anything. But
even if they're not deemed nonsense, they should refer to last
repetition's capture, not be reset with each repetition.

So I guess what I'm asking you is if you think the ES3 algorithm is
appropriate, and if a change to a more sensible interpretation could
be done in ES4. Since we have three totally different interpretations
in browser hosted engines, I bet there's precious little code out
there  that relies on the ES3 behaviour, or for that matter the
particular implementation in any of the browser hosted engines.
-- 
David "liorean" Andersson

# Lars T Hansen (18 years ago)

It is true that named submatches are just aliases for numbered submatches. My argument was only that programmers are vastly less likely to nest a named backreference inside the capturing submatch of that name than they are when the backreference is a number, because it will be much more obvious what's going on when you can see both the name of the submatch and the name of the back reference and their respective positions.

My answer to you is obviously that you should implement the ES3 behavior (as should JScript and JavaScriptCore). The current behavior is well-defined; it's not a hardship for anyone; the incompatibilities among the engines are probably not a big deal (thus the incompatible engines can be changed so that they conform); thus there is no compelling reason to change the spec either.

/(?:(\2|a)(b)?)+/.exec('abba');
JavaScriptCore: ['abba','a','b'] JScript: ['ab','a','b'] SpiderMonkey and futhark: ['abba','a','']

ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.)

FWIW, the ES4 RI agrees with Mozilla and Opera.

Trying to trace the matching above, we have: ...+ (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so ...+ starts backtracking [step 1 of the continuation for RepeatMatcher] so (\2|a) succeeds, consuming "a" and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "ab" with captures [a,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "b" with captures [,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so + starts backtracking so (\2|a) succeeds, consuming "a" and (b)? succeeds without consuming input so (?:...) succeeds consuming "a" with captures [a,] so ...+ succeeds, matching "abba" with final capture array [a,]

Sorry for the lack of rigor in the above, but I believe that's correct.

It is true that named submatches are just aliases for numbered
submatches.  My argument was only that programmers are vastly less
likely to nest a named backreference inside the capturing submatch of
that name than they are when the backreference is a number, because it
will be much more obvious what's going on when you can see both the
name of the submatch and the name of the back reference and their
respective positions.

My answer to you is obviously that you should implement the ES3
behavior (as should JScript and JavaScriptCore).  The current behavior
is well-defined; it's not a hardship for anyone; the incompatibilities
among the engines are probably not a big deal (thus the incompatible
engines can be changed so that they conform); thus there is no
compelling reason to change the spec either.

>     /(?:(\2|a)(b)?)+/.exec('abba');
>
> JavaScriptCore: ['abba','a','b']
> JScript: ['ab','a','b']
> SpiderMonkey and futhark: ['abba','a','']
>
> ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the
> correct results would be  ['ab','a','b'] since the \2 capture would be
> set to undefined before the second repetition. Thus futhark and
> SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy
> if it turned out I'm wrong here, however.)

FWIW, the ES4 RI agrees with Mozilla and Opera.

Trying to trace the matching above, we have:
  ...+
   (?:...)
    (\2|a) succeeds without consuming input
    (b)? succeeds without consuming input
   so (?:...) succeds without consuming input
  so ...+ starts backtracking [step 1 of the continuation for RepeatMatcher]
    so (\2|a) succeeds, consuming "a"
    and (b)? succeeds, consuming "b"
   so (?:...) succeeds consuming "ab" with captures [a,b]
 and ...+ repeats:
   (?:...)
    (\2|a) succeeds without consuming input
    and (b)? succeeds, consuming "b"
   so (?:...) succeeds consuming "b" with captures [,b]
 and ...+ repeats:
   (?:...)
    (\2|a) succeeds without consuming input
    (b)? succeeds without consuming input
   so (?:...) succeds without consuming input
  so + starts backtracking
    so (\2|a) succeeds, consuming "a"
    and (b)? succeeds without consuming input
   so (?:...) succeeds consuming "a" with captures [a,]
 so ...+ succeeds, matching "abba" with final capture array [a,]

Sorry for the lack of rigor in the above, but I believe that's correct.

--lars

On 9/13/07, liorean <liorean at gmail.com> wrote:
> Sorry for delaying my response on this. Wanted to get some public
> opinions on it:
> <uri:http://web-graphics.com/2007/09/05/a-quick-js-quizz-for-anybody-who-think-they-know-regex/>
>
> Sadly I only got responses from two developers, but they both seem to
> agree with me (and I was careful to not state my opinion there)...
>
>
> > On 9/2/07, liorean <liorean at gmail.com> wrote:
> > >     var
> > >         re=/(a\1)+|b/,
> > >         a=[],
> > >         key,
> > >         aab=re.exec('aab')
> > >         aaab=re.exec('aaab');
> > >     for(key in aab)
> > >          a.push('aab["'+key+'"]: '+aab[key]);
> > >     for(key in aaab)
> > >          a.push('aaab["'+key+'"]: '+aaab[key]);
> > >     a.join('\r\n');
>
> On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> > From the spec it's pretty clear (to me anyhow) that SpiderMonkey and
> > Opera are correct here (and the ES4 reference implementation, whose
> > RegEx engine was modelled directly on the spec, agrees).  The
> > reasoning is that (a) the **undefined** value in the captures array
> > matches any input without advancing the input pointer [15.10.2.9 step
> > 8 substep 3], (b) every element in the captures array starts out as
> > **undefined** [15.10.2.2 step 2 substep 4], (c) every element in the
> > captures array affected by a particular set of capturing parens is
> > reset to **undefined** every time those parens are entered [15.10.2.5
> > case "RepeatMatcher" step 4], and (d) the captures array is only
> > updated after the capturing parens are exited during matching
> > [15.10.2.8 case "( Disjunction )" step 3 substep 1 subsubstep 5].
> > Therefore the pattern above is always the same as /(a)+|b/.  Therefore
> > the results are what they are.
>
> Seems I've been looking in entirely the wrong places. Yes, that seems correct.
>
> > On 9/2/07, liorean <liorean at gmail.com> wrote:
> > > It seems to me that, looking at the situation from the perspective of
> > > a user of regex, that there are two ways of tackling this problem that
> > > makes sense, and they are hinged on whether a captured submatch is
> > > considered to be undefined or the empty string by default.
>
> On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> > In ES3 they are clearly **undefined** by default.
>
> This was, as stated, looking from the perspective of a *user* or
> regex, not an implementer. The ES3 spec is actually the solution that
> makes least sense from the user perspective, if you ask me. I'd even
> call it a weird choice, given the alternatives. To throw out a
> submatch before trying to match it the next time instead of just
> replacing the submatch value after the next match is unintuitive
> behaviour.
>
> > On 9/2/07, liorean <liorean at gmail.com> wrote:
> > > If the capture is considered to be undefined by default, then the ES3
> > > solution makes most sense... except for one thing: a capturing
> > > submatch containing a backreference to the containing capture means a
> > > guaranteed failure to match in that path, and is pretty easy to detect
> > > at compile time.
>
> On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> > The way I understand the spec it means a guaranteed success.  Still
> > pretty silly, to be sure.
>
> This is what I was misunderstanding - that the behaviour for when the
> capture was undefined actually meant a match for the empty string.
>
> > On 9/2/07, liorean <liorean at gmail.com> wrote:
> > > Throwing a syntax error at compile time seems more
> > > appropriate since I hardly think developers want their regex to spend
> > > time in constructs that cannot possibly match. Alternatively, engines
> > > can simply throw out that entire path at compile time and just set all
> > > the contained captures to undefined - behaviour would still be
> > > identical to what ES3 specifies. I.e. it's a choice between loud or
> > > silent failure.
>
> On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> > Or a vacuous and silent success, as in this case.
>
> Which makes less sense than either the JavaScriptCore or the JScript
> behaviour, IMO.
>
> On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> > I'm not sure what problem you're trying to solve, but my hope is that
> > once ES4 is widely implemented programmers will switch to named
> > submatches to avoid the problem with numeric backreferences entirely.
>
> The problem is the same whether it's named or not, I think. Consider:
>
>     /(?P<as>a(?P=as))+|b/.exec('aaab');
>
> Is the behaviour of that any different from the behaviour of these:
>
>     /(a\1)+|b/.exec('aaab');
>     /(?P<as>a\1)+|b/.exec('aaab');
>
> Since in this case, unless I've misunderstood the proposal, (?P=as) is
> just an alias for \1.
>
> On 03/09/07, Lars T Hansen <lth at acm.org> wrote:
> > In the mean time, there's obviously room for a good implementation to
> > warn about ineffectual submatches like these.  Changing the semantics
> > to an error is not very appealing, on the one hand I doubt it will
> > break a lot of sites, on the other hand I doubt it will fix many
> > either.
>
> I don't believe there is just about any live code that actually uses
> backreferences within the capturing match they refer to. If there
> were, I don't think a behaviour as divergent from the other browsers
> as that of JavaScriptCore would have been able to fly under the radar.
>
>
>
> What I was really asking for here is: Which behaviour should I
> implement. From the user perspective, I think it's best if a
> backreference to the containing submatch (independent of whether it's
> a named backreference or an indexed one) is a failure, like in
> JavaScriptCore. The reason is that it's simple to understand using the
> following reasoning based on preexpanding the backreference:
>
> - The pattern (\1) matches zero characters plus an expansion of
> itself, which is zero characters. Thus that's equal to
> ((?:){Infinity}), which is still zero characters.
> - However, (a\1) matches one character plus an expansion of itself
> (exactly which character is irrelevant, which should be obvious soon),
> which is one character plus an expansion of itself, which is two
> characters... ad inifinitum. Thus that is equal to (a{Infinity}) for
> which there cannot possibly exist a matching sting, since strings are
> finite.
>
> In other ways, a backreference to the containing capture is logically
> just plain nonsense.
>
> >From an implementor perspective, it doesn't matter much whether this
> is a silent or loud failure, I think. Making it silent would just mean
> we can ignore that whole code path anyway.
>
> I agree making it a syntax error is probably not the best idea, since
> that could kill an entire script, even if it's never even used.
> Emitting a warning is probably a good idea, though.
>
>
>
> Second best, from a user perspective, is to not reset the captures,
> like in JScript. This fits well into the following iterative reasoning
> about what the pattern does:
>
> - The pattern (\1)+ is equivalent to ((?:)) first repetition, a match
> for the first capture which is the empty string. The capturing
> submatch is the new value for the first capture.
> - The pattern (\1)+ is equivalent to (?:)((?:)) second repetition, a
> match for first repetition which is the empty string plus zero
> characters plus the first capture which is the empty string. The
> capturing submatch is the new value for the first capture.
>
> - The pattern (a\1)+ is equivalent to (a(?:)) first repetition, which
> matches 'a' plus the first capture which is the empty string. The
> capturing submatch is the new value for the first capture.
> - The pattern (a\1)+ is equivalent to (?:a)(a(?:a)) second repetition,
> which matches the first repetition which is 'a' plus 'a' plus the
> first capture which is 'a'. The capturing submatch is the new value
> for the first capture.
> - The pattern (a\1)+ is equivalent to (?:aaa)(a(?:aa)) third
> repetition, which matches the second repetition which is 'aaa' plus
> 'a' plus the first capture which is 'a'a. The capturing submatch is
> the new value for the first capture.
>
> I'm sure you recognise the pattern:)
>
>
>
> >From an implementor perspective, I hardly think not having to reset
> part of the capture array for each repetition is bad.
>
>
>
>
> How does it work with capturing submatches within the submatch that is repeated?
>
>     /(?:(a)(b)?)+/.exec('aaba');
>
> JavaScriptCore and JScript: ['aaba','a','b']
> SpiderMonkey and futhark: ['aaba','a','']
>
>
>     /(?:(\2|a)(b)?)+/.exec('abba');
>
> JavaScriptCore: ['abba','a','b']
> JScript: ['ab','a','b']
> SpiderMonkey and futhark: ['abba','a','']
>
> ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the
> correct results would be  ['ab','a','b'] since the \2 capture would be
> set to undefined before the second repetition. Thus futhark and
> SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy
> if it turned out I'm wrong here, however.)
>
>
>
>
> I repeat: I think ES3 (and SpiderMonkey and futhark) has taken the
> wrong choice here. It doesn't make sense to throw away submatches
> after they have been captured, and it's not what regex users expect.
> And a reasonable handling of a backreference to a containing submatch
> is that it's nonsense since it logically can't match anything. But
> even if they're not deemed nonsense, they should refer to last
> repetition's capture, not be reset with each repetition.
>
>
>
>
>
> So I guess what I'm asking you is if you think the ES3 algorithm is
> appropriate, and if a change to a more sensible interpretation could
> be done in ES4. Since we have three totally different interpretations
> in browser hosted engines, I bet there's precious little code out
> there  that relies on the ES3 behaviour, or for that matter the
> particular implementation in any of the browser hosted engines.
> --
> David "liorean" Andersson
> _______________________________________________
> Es4-discuss mailing list
> Es4-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es4-discuss
>

# liorean (18 years ago)

On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:

My answer to you is obviously that you should implement the ES3 behavior (as should JScript and JavaScriptCore).

Of course I'll implement the ES3 behaviour if ES4 makes no change to it, I just think it would be better to change the behaviour to something that actually matches developer expectations instead.

The current behavior is well-defined; it's not a hardship for anyone; the incompatibilities among the engines are probably not a big deal (thus the incompatible engines can be changed so that they conform); thus there is no compelling reason to change the spec either.

That the spec doesn't match expectations and that there are behaviours that do make sense to replace it with, coupled with the fact there seems to be no obvious compatibility problem with changing it (otherwise JScript and JavaScriptCore surely would have been changed to match the ES3 behaviour), makes no compelling reason?

ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the correct results would be ['ab','a','b'] since the \2 capture would be set to undefined before the second repetition. Thus futhark and SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy if it turned out I'm wrong here, however.)

FWIW, the ES4 RI agrees with Mozilla and Opera.

Trying to trace the matching above, we have: ...+ (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so ...+ starts backtracking [step 1 of the continuation for RepeatMatcher] so (\2|a) succeeds, consuming "a" and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "ab" with captures [a,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input and (b)? succeeds, consuming "b" so (?:...) succeeds consuming "b" with captures [,b] and ...+ repeats: (?:...) (\2|a) succeeds without consuming input (b)? succeeds without consuming input so (?:...) succeds without consuming input so + starts backtracking so (\2|a) succeeds, consuming "a" and (b)? succeeds without consuming input so (?:...) succeeds consuming "a" with captures [a,] so ...+ succeeds, matching "abba" with final capture array [a,]

Sorry for the lack of rigor in the above, but I believe that's correct.

Ah, the "undefined really means a match with the empty string" thing again. I should have known...

On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
> My answer to you is obviously that you should implement the ES3
> behavior (as should JScript and JavaScriptCore).

Of course I'll implement the ES3 behaviour if ES4 makes no change to
it, I just think it would be better to change the behaviour to
something that actually matches developer expectations instead.

> The current behavior
> is well-defined; it's not a hardship for anyone; the incompatibilities
> among the engines are probably not a big deal (thus the incompatible
> engines can be changed so that they conform); thus there is no
> compelling reason to change the spec either.

That the spec doesn't match expectations and that there are behaviours
that do make sense to replace it with, coupled with the fact there
seems to be no obvious compatibility problem with changing it
(otherwise JScript and JavaScriptCore surely would have been changed
to match the ES3 behaviour), makes no compelling reason?

 > > ES3: If I understand 15.10.2.5 Term, fourth algorithm, point 4, the
> > correct results would be  ['ab','a','b'] since the \2 capture would be
> > set to undefined before the second repetition. Thus futhark and
> > SpiderMonkey fail to implement this part of ES3 Regex. (I'd be happy
> > if it turned out I'm wrong here, however.)
>
> FWIW, the ES4 RI agrees with Mozilla and Opera.
>
> Trying to trace the matching above, we have:
>   ...+
>    (?:...)
>     (\2|a) succeeds without consuming input
>     (b)? succeeds without consuming input
>    so (?:...) succeds without consuming input
>   so ...+ starts backtracking [step 1 of the continuation for RepeatMatcher]
>     so (\2|a) succeeds, consuming "a"
>     and (b)? succeeds, consuming "b"
>    so (?:...) succeeds consuming "ab" with captures [a,b]
>  and ...+ repeats:
>    (?:...)
>     (\2|a) succeeds without consuming input
>     and (b)? succeeds, consuming "b"
>    so (?:...) succeeds consuming "b" with captures [,b]
>  and ...+ repeats:
>    (?:...)
>     (\2|a) succeeds without consuming input
>     (b)? succeeds without consuming input
>    so (?:...) succeds without consuming input
>   so + starts backtracking
>     so (\2|a) succeeds, consuming "a"
>     and (b)? succeeds without consuming input
>    so (?:...) succeeds consuming "a" with captures [a,]
>  so ...+ succeeds, matching "abba" with final capture array [a,]
>
> Sorry for the lack of rigor in the above, but I believe that's correct.

Ah, the "undefined really means a match with the empty string" thing
again. I should have known...
-- 
David "liorean" Andersson

# Lars T Hansen (18 years ago)

On 9/13/07, liorean <liorean at gmail.com> wrote:

On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:

The current behavior is well-defined; it's not a hardship for anyone; the incompatibilities among the engines are probably not a big deal (thus the incompatible engines can be changed so that they conform); thus there is no compelling reason to change the spec either.

That the spec doesn't match expectations and that there are behaviours that do make sense to replace it with, coupled with the fact there seems to be no obvious compatibility problem with changing it (otherwise JScript and JavaScriptCore surely would have been changed to match the ES3 behaviour), makes no compelling reason?

I hope I'm not being overly flip when I say that it is the spec that circumscribes the set of expectations you are allowed to have. And the spec is entirely clear here, ie what it says is not in question, even if it's not always easy to find out what it says. A somewhat determined developer who needs to rely on the behavior can discover what behavior is expected (though he obviously can't trust the implementations to get it right).

It's your contention that developers will be surprised by the current behavior. What can I say? Edge cases will always surprise somebody. I don't feel surprised that captures are thrown away when a repeat matcher repeats, even if you do.

Warnings are pointless on the web, and the language has no warnings now. If we're going to do something about backreferences inside their own captures, it would have to be to outlaw them and require a syntax error.

I doubt we're inclined to make any incompatible changes to the matching algorithm, and (again) I don't really see the point.

(It is said a lawyer is a person who's not appaled by something if it can be shown to be the consequence of a law. I guess I'm a (language) lawyer on this one :-)

On 9/13/07, liorean <liorean at gmail.com> wrote:
> On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
> > The current behavior
> > is well-defined; it's not a hardship for anyone; the incompatibilities
> > among the engines are probably not a big deal (thus the incompatible
> > engines can be changed so that they conform); thus there is no
> > compelling reason to change the spec either.
>
> That the spec doesn't match expectations and that there are behaviours
> that do make sense to replace it with, coupled with the fact there
> seems to be no obvious compatibility problem with changing it
> (otherwise JScript and JavaScriptCore surely would have been changed
> to match the ES3 behaviour), makes no compelling reason?

I hope I'm not being overly flip when I say that it is the spec that
circumscribes the set of expectations you are allowed to have.  And
the spec is entirely clear here, ie what it says is not in question,
even if it's not always easy to find out what it says.  A somewhat
determined developer who needs to rely on the behavior can discover
what behavior is expected (though he obviously can't trust the
implementations to get it right).

It's your contention that developers will be surprised by the current
behavior.  What can I say? Edge cases will always surprise somebody.
I don't feel surprised that captures are thrown away when a repeat
matcher repeats, even if you do.

Warnings are pointless on the web, and the language has no warnings
now.  If we're going to do something about backreferences inside their
own captures, it would have to be to outlaw them and require a syntax
error.

I doubt we're inclined to make any incompatible changes to the
matching algorithm, and (again) I don't really see the point.

(It is said a lawyer is a person who's not appaled by something if it
can be shown to be the consequence of a law.  I guess I'm a (language)
lawyer on this one :-)

--lars

# Brendan Eich (18 years ago)

I am not looking to make trouble here, believe me, but I want to
point out two things that could help David's case:

JS regexps were modeled by me (to lwall's horror ;-) on Perl
regexen. Here's what perl (5.8.8) does:

$ perl "aaab" =~ /(a\1)+|b/; print "$& ($1)\n"; b ()

It's no surprise JavaScriptCore agrees, since it is based on PCRE.
Tamarin is too -- do you know what it does?

IE JScript does not agree with any of Perl/JavaScriptCore or ES3
and conformant implementations. That does not mean it should prevail
for this edge case. But it does suggest we could change to match
Perl, and match David's two-person (three counting him; perhaps four
if I count ;-) developer sample.

I am not looking to make trouble here, believe me, but I want to  
point out two things that could help David's case:

1. JS regexps were modeled by me (to lwall's horror ;-) on Perl  
regexen. Here's what perl (5.8.8) does:

$ perl
"aaab" =~ /(a\1)+|b/;
print "$& ($1)\n";
b ()

It's no surprise JavaScriptCore agrees, since it is based on PCRE.  
Tamarin is too -- do you know what it does?

2. IE JScript does not agree with any of Perl/JavaScriptCore or ES3  
and conformant implementations. That does not mean it should prevail  
for this edge case. But it does suggest we could change to match  
Perl, and match David's two-person (three counting him; perhaps four  
if I count ;-) developer sample.

/be

# Lars T Hansen (18 years ago)

On 9/14/07, Brendan Eich <brendan at mozilla.org> wrote:

I am not looking to make trouble here, believe me, but I want to point out two things that could help David's case:

JS regexps were modeled by me (to lwall's horror ;-) on Perl regexen. Here's what perl (5.8.8) does:

$ perl "aaab" =~ /(a\1)+|b/; print "$& ($1)\n"; b ()

It's no surprise JavaScriptCore agrees, since it is based on PCRE. Tamarin is too -- do you know what it does?

IE JScript does not agree with any of Perl/JavaScriptCore or ES3 and conformant implementations. That does not mean it should prevail for this edge case. But it does suggest we could change to match Perl, and match David's two-person (three counting him; perhaps four if I count ;-) developer sample.

I oppose this on merely Perl compatibility grounds -- ES regexes have since evolved through other influences -- but sure, we'll give it a fair hearing. I propose that everyone involved wait until the draft library spec is released (soon, if all goes well) and that specific proposals to changes in regexp semantics are made in terms of that draft spec. (It should be somewhat easier to deal with than the ES3 spec would be.)

On 9/14/07, Brendan Eich <brendan at mozilla.org> wrote:
> I am not looking to make trouble here, believe me, but I want to
> point out two things that could help David's case:
>
> 1. JS regexps were modeled by me (to lwall's horror ;-) on Perl
> regexen. Here's what perl (5.8.8) does:
>
> $ perl
> "aaab" =~ /(a\1)+|b/;
> print "$& ($1)\n";
> b ()
>
> It's no surprise JavaScriptCore agrees, since it is based on PCRE.
> Tamarin is too -- do you know what it does?
>
> 2. IE JScript does not agree with any of Perl/JavaScriptCore or ES3
> and conformant implementations. That does not mean it should prevail
> for this edge case. But it does suggest we could change to match
> Perl, and match David's two-person (three counting him; perhaps four
> if I count ;-) developer sample.

I oppose this on merely Perl compatibility grounds -- ES regexes have
since evolved through other influences -- but sure, we'll give it a
fair hearing.  I propose that everyone involved wait until the draft
library spec is released (soon, if all goes well) and that specific
proposals to changes in regexp semantics are made in terms of that
draft spec.  (It should be somewhat easier to deal with than the ES3
spec would be.)

--lars

# liorean (18 years ago)

On 9/13/07, liorean <liorean at gmail.com> wrote:

That the spec doesn't match expectations and that there are behaviours that do make sense to replace it with, coupled with the fact there seems to be no obvious compatibility problem with changing it (otherwise JScript and JavaScriptCore surely would have been changed to match the ES3 behaviour), makes no compelling reason?

On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:

I hope I'm not being overly flip when I say that it is the spec that circumscribes the set of expectations you are allowed to have. And the spec is entirely clear here, ie what it says is not in question, even if it's not always easy to find out what it says. A somewhat determined developer who needs to rely on the behavior can discover what behavior is expected (though he obviously can't trust the implementations to get it right).

Expectations are never formed from the spec, and that is especially true of ECMAScript since the spec is quite hard to read for laymen. Expectations are formed from a logical application of the perceived concepts underlying the language features. Regex are a special case here as well, since most regex articles and tutorials are not particularly language specific, they are typically only specific to the generalised flavour (the Perl flavour and POSIX flavour, respectively). In fact, for a long time the only JavaScript specific regex article that a fast search could find that was the one I wrote for eVolt in 2002.

It's your contention that developers will be surprised by the current behavior. What can I say? Edge cases will always surprise somebody. I don't feel surprised that captures are thrown away when a repeat matcher repeats, even if you do.

Maybe you don't. But almost all regex implementations that implement capturing submatches that are specifically implementing the ES3 algorithm agree with me.

Warnings are pointless on the web, and the language has no warnings now. If we're going to do something about backreferences inside their own captures, it would have to be to outlaw them and require a syntax error.

Warnings are pretty much useless today, yes. But you can always hope for better debugging and development tools.

I doubt we're inclined to make any incompatible changes to the matching algorithm, and (again) I don't really see the point.

That's why I stressed that JScript and JavaScriptCore, and according to Brendan above also Tamarin, fail to implement this according to ES3. It's not an incompatible change - it can't be if the two most common engines are incompatible with the spec. (I gather that'd be JScript and Tamarin.) That pretty much guarantees that there's no code relying on the spec behaviour.

On 14/09/2007, Lars T Hansen <lth at acm.org> wrote:

I oppose this on merely Perl compatibility grounds -- ES regexes have since evolved through other influences -- but sure, we'll give it a fair hearing.

I'm glad to hear that it'll get a fair hearing. Just throw some logs on the fire however, these all agree on backreferences to failed submatches failing to match anything-including-the-empty-string:

JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE, POSIX BRE

(according to uri:http://www.regular-expressions.info/refflavors.html,

I haven't actually tested in all those)

You can of course argue that "failed submatches fail to match" is not the same as "not-yet-captured submatches fail to match" - that would require some testing, and the only one that I have accessible personally right now is Perl.

On 14/09/2007, Garrett Smith <dhtmlkitchen at gmail.com> wrote:

You're post on ES4 list would make for an excellent tutorial on RegExps.

I've been asked to do RegExp tutorials before...

uri:http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/

I think a lot of web devs, including myself, don't have such deep knowledge of RegExps. The differences in script engines show that it's more than web devs who misunderstand the spec. You're examples, including the ones on your blog, indicate that RegExps in js have some counter-intuitive behavior.

Yes. I've been meaning to flesh that out in another blog entry. Probably will have to wait until next week however.

I think it would be excellent material for an article on developer.mozilla.org. Something to consider, at least.

Ah. The problem of choosing where to place your articles... in fact, I'd probably prefer to place any future articles on dev.opera.com (I think Opera is seriously under appreciated by developers, so anything to help...). Or on eVolt, since I like that place.

I think I can sum up the change I think is appropriate by these things:

undefined should be a failure to match instead of a match to the empty string
captures should only be set to undefined in two cases - when the regex matching is started, and if inside a negative lookahead

> On 9/13/07, liorean <liorean at gmail.com> wrote:
> > That the spec doesn't match expectations and that there are behaviours
> > that do make sense to replace it with, coupled with the fact there
> > seems to be no obvious compatibility problem with changing it
> > (otherwise JScript and JavaScriptCore surely would have been changed
> > to match the ES3 behaviour), makes no compelling reason?

On 13/09/2007, Lars T Hansen <lth at acm.org> wrote:
> I hope I'm not being overly flip when I say that it is the spec that
> circumscribes the set of expectations you are allowed to have.  And
> the spec is entirely clear here, ie what it says is not in question,
> even if it's not always easy to find out what it says.  A somewhat
> determined developer who needs to rely on the behavior can discover
> what behavior is expected (though he obviously can't trust the
> implementations to get it right).

Expectations are never formed from the spec, and that is especially
true of ECMAScript since the spec is quite hard to read for laymen.
Expectations are formed from a logical application of the perceived
concepts underlying the language features. Regex are a special case
here as well, since most regex articles and tutorials are not
particularly language specific, they are typically only specific to
the generalised flavour (the Perl flavour and POSIX flavour,
respectively). In fact, for a long time the only JavaScript specific
regex article that a fast search could find that was the one I wrote
for eVolt in 2002.

> It's your contention that developers will be surprised by the current
> behavior.  What can I say? Edge cases will always surprise somebody.
> I don't feel surprised that captures are thrown away when a repeat
> matcher repeats, even if you do.

Maybe you don't. But almost all regex implementations that implement
capturing submatches that are specifically implementing the ES3
algorithm agree with me.

> Warnings are pointless on the web, and the language has no warnings
> now.  If we're going to do something about backreferences inside their
> own captures, it would have to be to outlaw them and require a syntax
> error.

Warnings are pretty much useless today, yes. But you can always hope
for better debugging and development tools.

> I doubt we're inclined to make any incompatible changes to the
> matching algorithm, and (again) I don't really see the point.

That's why I stressed that JScript and JavaScriptCore, and according
to Brendan above also Tamarin, fail to implement this according to
ES3. It's not an incompatible change - it can't be if the two most
common engines are incompatible with the spec. (I gather that'd be
JScript and Tamarin.)  That pretty much guarantees that there's no
code relying on the spec behaviour.

On 14/09/2007, Lars T Hansen <lth at acm.org> wrote:
> I oppose this on merely Perl compatibility grounds -- ES regexes have
> since evolved through other influences -- but sure, we'll give it a
> fair hearing.

I'm glad to hear that it'll get a fair hearing.
Just throw some logs on the fire however, these all agree on
backreferences to failed submatches failing to match
anything-including-the-empty-string:

JGsoft, .NET, Java, Perl, PCRE, Python, Ruby, Tcl ARE, POSIX BRE

(according to <uri:http://www.regular-expressions.info/refflavors.html>,
I haven't actually tested in all those)

You can of course argue that "failed submatches fail to match" is not
the same as "not-yet-captured submatches fail to match" - that would
require some testing, and the only one that I have accessible
personally right now is Perl.

On 14/09/2007, Garrett Smith <dhtmlkitchen at gmail.com> wrote:
> You're post on ES4 list would make for an excellent tutorial on RegExps.

I've been asked to do RegExp tutorials before...

<uri:http://www.evolt.org/article/Regular_Expressions_in_JavaScript/17/36435/>

> I think a lot of web devs, including myself, don't have such deep
> knowledge of RegExps. The differences in script engines show that it's
> more than web devs who misunderstand the spec. You're examples,
> including the ones on your blog, indicate that RegExps in js have some
> counter-intuitive behavior.

Yes. I've been meaning to flesh that out in another blog entry.
Probably will have to wait until next week however.

> I think it would be excellent material for an article on
> developer.mozilla.org. Something to consider, at least.

Ah. The problem of choosing where to place your articles... in fact,
I'd probably prefer to place any future articles on dev.opera.com (I
think Opera is seriously under appreciated by developers, so anything
to help...). Or on eVolt, since I like that place.

I think I can sum up the change I think is appropriate by these things:
- undefined should be a failure to match instead of a match to the empty string
- captures should only be set to undefined in two cases - when the
regex matching is started, and if inside a negative lookahead
-- 
David "liorean" Andersson

# liorean (18 years ago)

On 14/09/2007, liorean <liorean at gmail.com> wrote:

Maybe you don't. But almost all regex implementations that implement capturing submatches that are specifically implementing the ES3 algorithm agree with me.

s/that are specifically/that are not specifically/

On 14/09/2007, liorean <liorean at gmail.com> wrote:
> Maybe you don't. But almost all regex implementations that implement
> capturing submatches that are specifically implementing the ES3
> algorithm agree with me.

s/that are specifically/that are not specifically/
-- 
David "liorean" Andersson

# Brendan Eich (18 years ago)

On Sep 14, 2007, at 12:42 PM, liorean wrote:

On 14/09/2007, liorean <liorean at gmail.com> wrote:

Maybe you don't. But almost all regex implementations that implement capturing submatches that are specifically implementing the ES3 algorithm agree with me.

s/that are specifically/that are not specifically/

"Almost all" excludes IE JScript, right? Those implementations are
JavaScriptCore, based on PCRE, and what else?

Again, it's the Perl connection.

On Sep 14, 2007, at 12:42 PM, liorean wrote:

> On 14/09/2007, liorean <liorean at gmail.com> wrote:
>> Maybe you don't. But almost all regex implementations that implement
>> capturing submatches that are specifically implementing the ES3
>> algorithm agree with me.
>
> s/that are specifically/that are not specifically/

"Almost all" excludes IE JScript, right? Those implementations are  
JavaScriptCore, based on PCRE, and what else?

Again, it's the Perl connection.

/be

# liorean (18 years ago)

On 14/09/2007, Brendan Eich <brendan at mozilla.org> wrote:

"Almost all" excludes IE JScript, right?

Well, it doesn't agree with me on what is the right approach, but it does agree with me that the ES3 approach is wrong.

Those implementations are JavaScriptCore, based on PCRE, and what else?

The only implementations that I know of that specifically implement the ES3 algorithms are those in SpiderMonkey, linear_b and futhark (I have no idea about SEE, Rhino, and a lot of other fringe JS engines however). And the ES4 refimpl I guess.

I only said "almost all" because I haven't tested each and every one. It could be possible that the only ones behaving according to the ES3 algorithm at all are indeed those written to match the ES3 algorithms. It's also possible that there are those that do match the ES3 algorithms but are not written specifically for them - I just don't know of any such implementation.

JavaScriptCore uses PCRE, so of course it's not written specifically for the ES3 algorithms.

The JScript one is shared with at least VBScript, and I wouldn't be surprised if it was shared with VB, VBA and other COM environments as well. I don't think it's written specifically for the ES3 algorithms. It's peculiar, and I think it doesn't agree with anybody else on a few edge cases besides those I've brought up.

Again, it's the Perl connection.

Not only. Both Perl and those POSIX flavour regex that implement capturing submatches...

On 14/09/2007, Brendan Eich <brendan at mozilla.org> wrote:
> "Almost all" excludes IE JScript, right?

Well, it doesn't agree with me on what is the right approach, but it
does agree with me that the ES3 approach is wrong.

> Those implementations are
> JavaScriptCore, based on PCRE, and what else?

The only implementations that I know of that specifically implement
the ES3 algorithms are those in SpiderMonkey, linear_b and futhark (I
have no idea about SEE, Rhino, and a lot of other fringe JS engines
however). And the ES4 refimpl I guess.

I only said "almost all" because I haven't tested each and every one.
It could be possible that the only ones behaving according to the ES3
algorithm *at all* are indeed those written to match the ES3
algorithms. It's also possible that there are those that do match the
ES3 algorithms but are not written specifically for them - I just
don't know of any such implementation.

JavaScriptCore uses PCRE, so of course it's not written specifically
for the ES3 algorithms.

The JScript one is shared with at least VBScript, and I wouldn't be
surprised if it was shared with VB, VBA and other COM environments as
well. I don't think it's written specifically for the ES3 algorithms.
It's peculiar, and I think it doesn't agree with anybody else on a few
edge cases besides those I've brought up.

> Again, it's the Perl connection.

Not only. Both Perl and those POSIX flavour regex that implement
capturing submatches...
-- 
David "liorean" Andersson

# StevenLevithan (18 years ago)

liorean wrote:

I think I can sum up the change I think is appropriate by these things:

undefined should be a failure to match instead of a match to the empty string

captures should only be set to undefined in two cases - when the regex matching is started, and if inside a negative lookahead

I am in complete agreement with liorean. The ES3 handling on these points is non-intuitive, non-compatible with other regex libraries, far less useful than the alternative (I could give countless examples of where the Perl-style handling has real-world practicality), and creates future compatibility issues if ECMAScript were to implement certain features from Perl or other Perl-derivative flavors such as capturing-group-based conditionals.

Lars T Hansen wrote:

I hope I'm not being overly flip when I say that it is the spec that circumscribes the set of expectations you are allowed to have. And the spec is entirely clear here, ie what it says is not in question, even if it's not always easy to find out what it says. A somewhat determined developer who needs to rely on the behavior can discover what behavior is expected (though he obviously can't trust the implementations to get it right).

I consider ES3's handling of capturing group participation and so forth (summed up by liorean) to be bugs in the spec (perhaps I'm being overly flip, but these are my biggest gripes with ES3 regexes). The fact that I consider them bugs which might be fixed in future versions means that I cannot rely on the behavior even though I understand what the spec prescribes.

liorean wrote:
> 
>> I think I can sum up the change I think is appropriate by these things:
>> - undefined should be a failure to match instead of a match to the empty
>> string
>> - captures should only be set to undefined in two cases - when the
>>   regex matching is started, and if inside a negative lookahead
> 

I am in complete agreement with liorean. The ES3 handling on these points is
non-intuitive, non-compatible with other regex libraries, *far* less useful
than the alternative (I could give countless examples of where the
Perl-style handling has real-world practicality), and creates future
compatibility issues if ECMAScript were to implement certain features from
Perl or other Perl-derivative flavors such as capturing-group-based
conditionals.

Lars T Hansen wrote:
> 
>> I hope I'm not being overly flip when I say that it is the spec that
>> circumscribes the set of expectations you are allowed to have.  And
>> the spec is entirely clear here, ie what it says is not in question,
>> even if it's not always easy to find out what it says.  A somewhat
>> determined developer who needs to rely on the behavior can discover
>> what behavior is expected (though he obviously can't trust the
>> implementations to get it right).
> 

I consider ES3's handling of capturing group participation and so forth
(summed up by liorean) to be bugs in the spec (perhaps I'm being overly
flip, but these are my biggest gripes with ES3 regexes). The fact that I
consider them bugs which might be fixed in future versions means that I
cannot rely on the behavior even though I understand what the spec
prescribes.

-- 
View this message in context: http://www.nabble.com/Regex%3A-How-should-backreferences-contained-in-the-capturing-match-they-reference-to-work--tf4367560.html#a13946131
Sent from the Mozilla - ECMAScript 4 discussion mailing list archive at Nabble.com.

# StevenLevithan (18 years ago)

StevenLevithan wrote:

liorean wrote:

I think I can sum up the change I think is appropriate by these things:

undefined should be a failure to match instead of a match to the empty string

captures should only be set to undefined in two cases - when the regex matching is started, and if inside a negative lookahead

I am in complete agreement with liorean. The ES3 handling on these points is non-intuitive, non-compatible with other regex libraries, far less useful than the alternative (I could give countless examples of where the Perl-style handling has real-world practicality), and creates future compatibility issues if ECMAScript were to implement certain features from Perl or other Perl-derivative flavors such as capturing-group-based conditionals.

To support my last comment with a couple of the "countless examples", I can point to my blog:

If backreferences to non-participating capturing groups resulted in failure rather than a match of the empty string, it would be possible to mimic conditionals as shown at blog.stevenlevithan.com/archives/mimic-conditionals, blog.stevenlevithan.com/archives/mimic-conditionals . That page presents some generalized patterns, but simpler cases can often be taken advantage of.
If captures were only set to undefined when the regex matching starts and after a negative lookahead which contains captures, it would be possible to do things like shown at blog.stevenlevithan.com/archives/multi-attr-capture, blog.stevenlevithan.com/archives/multi-attr-capture .

Those are particular things I've posted about in the past, but I run into these spec bugs ;-) on a regular basis. I have never run into a case where I wished for the ES3 handling of these issues.

StevenLevithan wrote:
> 
> 
> liorean wrote:
>> 
>>> I think I can sum up the change I think is appropriate by these things:
>>> - undefined should be a failure to match instead of a match to the empty
>>> string
>>> - captures should only be set to undefined in two cases - when the
>>>   regex matching is started, and if inside a negative lookahead
>> 
> 
> I am in complete agreement with liorean. The ES3 handling on these points
> is non-intuitive, non-compatible with other regex libraries, *far* less
> useful than the alternative (I could give countless examples of where the
> Perl-style handling has real-world practicality), and creates future
> compatibility issues if ECMAScript were to implement certain features from
> Perl or other Perl-derivative flavors such as capturing-group-based
> conditionals.
> 

To support my last comment with a couple of the "countless examples", I can
point to my blog:

- If backreferences to non-participating capturing groups resulted in
failure rather than a match of the empty string, it would be possible to
mimic conditionals as shown at 
http://blog.stevenlevithan.com/archives/mimic-conditionals
http://blog.stevenlevithan.com/archives/mimic-conditionals  . That page
presents some generalized patterns, but simpler cases can often be taken
advantage of.

- If captures were only set to undefined when the regex matching starts and
after a negative lookahead which contains captures, it would be possible to
do things like shown at 
http://blog.stevenlevithan.com/archives/multi-attr-capture
http://blog.stevenlevithan.com/archives/multi-attr-capture  .

Those are particular things I've posted about in the past, but I run into
these spec bugs ;-) on a regular basis. I have never run into a case where I
wished for the ES3 handling of these issues.
-- 
View this message in context: http://www.nabble.com/Regex%3A-How-should-backreferences-contained-in-the-capturing-match-they-reference-to-work--tf4367560.html#a13946533
Sent from the Mozilla - ECMAScript 4 discussion mailing list archive at Nabble.com.

# liorean (18 years ago)

Tried to sum up the issue a little more coherently and well articulated in a blog post: uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/

Doesn't really say anything I haven't said somewhere in this thread already, though.

Tried to sum up the issue a little more coherently and well
articulated in a blog post:
<uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/>

Doesn't really say anything I haven't said somewhere in this thread
already, though.
-- 
David "liorean" Andersson

# StevenLevithan (18 years ago)

liorean wrote:

Tried to sum up the issue a little more coherently and well articulated in a blog post: uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/

And I've posted a follow-up blog.stevenlevithan.com/archives/es3-regexes-broken here (which in truth is more of a long rant).

BTW, here's another variation on the use of Perl-style group participation handling that I posted on a regex forum yesterday: regexadvice.com/forums/permalink/25764/36926/ShowThread.aspx#36926 Reg Expression for matching [multiple required attributes] (that regex also uses an atomic group, but that's not imperative in this case and can be emulated even in ES3 by using capturing groups in lookahead, if necessary). The Perl-style handling is intuitive and ripe for creative exploitation. The ES3-style handling is essentially useless. But I'll stop beating this horse now...

liorean wrote:
> 
> Tried to sum up the issue a little more coherently and well
> articulated in a blog post:
> <uri:http://web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/>
> 

And I've posted a follow-up 
http://blog.stevenlevithan.com/archives/es3-regexes-broken here  (which in
truth is more of a long rant).

BTW, here's another variation on the use of Perl-style group participation
handling that I posted on a regex forum yesterday: 
http://regexadvice.com/forums/permalink/25764/36926/ShowThread.aspx#36926
Reg Expression for matching [multiple required attributes]  (that regex also
uses an atomic group, but that's not imperative in this case and can be
emulated even in ES3 by using capturing groups in lookahead, if necessary).
The Perl-style handling is intuitive and ripe for creative exploitation. The
ES3-style handling is essentially useless. But I'll stop beating this horse
now...
-- 
View this message in context: http://www.nabble.com/Regex%3A-How-should-backreferences-contained-in-the-capturing-match-they-reference-to-work--tf4367560.html#a13964984
Sent from the Mozilla - ECMAScript 4 discussion mailing list archive at Nabble.com.

# StevenLevithan (18 years ago)

StevenLevithan wrote:

And I've posted a follow-up blog.stevenlevithan.com/archives/es3-regexes-broken here (which in truth is more of a long rant).

At the risk of spamming the mailing list, I'll note that I've just toned down a lot of the previous, undue negativity towards the draft ES4 regex spec in that post. I'm not really a hateful ingrate. :)

StevenLevithan wrote:
> 
> And I've posted a follow-up 
> http://blog.stevenlevithan.com/archives/es3-regexes-broken here  (which in
> truth is more of a long rant).
> 

At the risk of spamming the mailing list, I'll note that I've just toned
down a lot of the previous, undue negativity towards the draft ES4 regex
spec in that post. I'm not really a hateful ingrate. :)
-- 
View this message in context: http://www.nabble.com/Regex%3A-How-should-backreferences-contained-in-the-capturing-match-they-reference-to-work--tf4367560.html#a13984745
Sent from the Mozilla - ECMAScript 4 discussion mailing list archive at Nabble.com.