RegExp lookbehind

# Lea Verou (13 years ago)

After all these years, ES regular expressions still don't support lookbehind. It turns out this was discussed in late 2010 1 and was heading towards being accepted, but the discussion just stalled. So, consider this a bump of that thread.

After all these years, ES regular expressions still don't support 
lookbehind. It turns out this was discussed in late 2010 [1] and was 
heading towards being accepted, but the discussion just stalled. So, 
consider this a bump of that thread.

[1]: http://www.mail-archive.com/[email protected]/msg05205.html

# Felipe Gasper (13 years ago)

Bump++. :)

Bump++. :)

-F

On 3/16/12 3:56 PM, Lea Verou wrote:
> After all these years, ES regular expressions still don't support
> lookbehind. It turns out this was discussed in late 2010 [1] and was
> heading towards being accepted, but the discussion just stalled. So,
> consider this a bump of that thread.
>
> [1]: http://www.mail-archive.com/[email protected]/msg05205.html
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss

# Brendan Eich (13 years ago)

We'll get this strawman:

strawman:steve_levithan_regexp_api_improvements

onto the agenda for the next TC39 meeting. I had thought we'd approved lookbehind. I will do what I can to get it into ES6.

We'll get this strawman:

http://wiki.ecmascript.org/doku.php?id=strawman:steve_levithan_regexp_api_improvements

onto the agenda for the next TC39 meeting. I had thought we'd approved 
lookbehind. I will do what I can to get it into ES6.

/be

Felipe Gasper wrote:
> Bump++. :)
>
> -F
>
> On 3/16/12 3:56 PM, Lea Verou wrote:
>> After all these years, ES regular expressions still don't support
>> lookbehind. It turns out this was discussed in late 2010 [1] and was
>> heading towards being accepted, but the discussion just stalled. So,
>> consider this a bump of that thread.
>>
>> [1]: http://www.mail-archive.com/[email protected]/msg05205.html
>> _______________________________________________
>> es-discuss mailing list
>> es-discuss at mozilla.org
>> https://mail.mozilla.org/listinfo/es-discuss
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>

# Steven L. (13 years ago)

Brendan Eich wrote:

We'll get this strawman:

strawman:steve_levithan_regexp_api_improvements

onto the agenda for the next TC39 meeting.

Neat-o. I mentioned in the blog post referenced there [1] that I'd be doing a separate write-up with my opinionated take or what new RegExp features should be added, rather than changed. Unfortunately I never followed through, but IMO the most pressing additions are named capture and backreferences, atomic groups, lookbehind, the /x flag, the /s flag, and a /u flag that lets \d \D \w \W \b \B follow Unicode. There are plenty of other useful features that could be considered, but I'll wait at least until there is some reaction to the above.

There are of course details to hash out for named capture, lookbehind, and the /x flag before they can be spec-ed. The /s flag is the least important of the above--I've included it merely because of its convenience over e.g. [\s\S] and the fact that its absence is unique among modern regex flavors and a common portability hindrance.

Brendan Eich wrote:

I had thought we'd approved lookbehind. I will do what I can to get it into ES6.

I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). Specifically, he stated "Lookbehind support is promoted to required normative."

[1] blog.stevenlevithan.com/archives/fixing-javascript-regexp [2] esdiscuss/2011-May/014748

--Steven Levithan

Brendan Eich wrote:
> We'll get this strawman:
>
> http://wiki.ecmascript.org/doku.php?id=strawman:steve_levithan_regexp_api_improvements
>
> onto the agenda for the next TC39 meeting.

Neat-o. I mentioned in the blog post referenced there [1] that I'd be doing 
a separate write-up with my opinionated take or what new RegExp features 
should be added, rather than changed. Unfortunately I never followed 
through, but IMO the most pressing additions are named capture and 
backreferences, atomic groups, lookbehind, the /x flag, the /s flag, and a 
/u flag that lets \d \D \w \W \b \B follow Unicode. There are plenty of 
other useful features that could be considered, but I'll wait at least until 
there is some reaction to the above.

There are of course details to hash out for named capture, lookbehind, and 
the /x flag before they can be spec-ed. The /s flag is the least important 
of the above--I've included it merely because of its convenience over e.g. 
[\s\S] and the fact that its absence is unique among modern regex flavors 
and a common portability hindrance.

Brendan Eich wrote:
> I had thought we'd approved lookbehind. I will do what I can to get it 
> into ES6.

I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). 
Specifically, he stated "Lookbehind support is promoted to required 
normative."

[1] http://blog.stevenlevithan.com/archives/fixing-javascript-regexp
[2] https://mail.mozilla.org/pipermail/es-discuss/2011-May/014748.html

--Steven Levithan

# Lasse Reichstein (13 years ago)

On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com> wrote:

I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). Specifically, he stated "Lookbehind support is promoted to required normative."

Are there any plans for which kind of look-behind to allow? Literal only, fixed-length, repetitions of fixed-length, unrestricted or something else?

The fallback based RegExp specification doesn't lend itself easily to unrestricted regexps (unless they are matched backwards instead of forwards, but I think that's going to be confusing).

On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com> wrote:
> I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes").
> Specifically, he stated "Lookbehind support is promoted to required
> normative."

Are there any plans for which kind of look-behind to allow? Literal
only, fixed-length, repetitions of fixed-length, unrestricted or
something else?

The fallback based RegExp specification doesn't lend itself easily to
unrestricted regexps (unless they are matched backwards instead of
forwards, but I think that's going to be confusing).

/L

# Andrea Giammarchi (13 years ago)

I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?

I would simply apply same logic we have already for the look ahead ... or
you think that would cause problems?

On Sat, Mar 17, 2012 at 1:07 PM, Lasse Reichstein <
reichsteinatwork at gmail.com> wrote:

> On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com>
> wrote:
> > I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting
> notes").
> > Specifically, he stated "Lookbehind support is promoted to required
> > normative."
>
> Are there any plans for which kind of look-behind to allow? Literal
> only, fixed-length, repetitions of fixed-length, unrestricted or
> something else?
>
> The fallback based RegExp specification doesn't lend itself easily to
> unrestricted regexps (unless they are matched backwards instead of
> forwards, but I think that's going to be confusing).
>
> /L
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20120317/ef176779/attachment.html>

# Steven L. (13 years ago)

Andrea Giammarchi wrote:

I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?

As has been discussed previously, it is nontrivial to implement infinite-length lookbehind efficiently (where you’re not just testing /(?:lookbehind)$/ against the lookbehind start position). For this reason, .NET is the only major regex flavor that supports it. It’s able to do so by taking advantage of its Right-to-Left Mode [1], the actual semantics of which are a big question mark. Sometimes it matches intuitive developer expectations (e.g., \d+ matches all of “123” rather than just the trailing “3”), and sometimes it doesn’t. Infinite-length lookbehind would of course be ideal and doesn’t have to cause problems. However, it certainly could cause problems if it is poorly spec-ed and/or poorly implemented. --Steven Levithan [1]: msdn.microsoft.com/en-us/library/yd1hzczs.aspx#RightToLeft

On Sat, Mar 17, 2012 at 1:07 PM, Lasse Reichstein <reichsteinatwork at gmail.com> wrote:

On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com> wrote:

I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). Specifically, he stated "Lookbehind support is promoted to required normative."

Are there any plans for which kind of look-behind to allow? Literal only, fixed-length, repetitions of fixed-length, unrestricted or something else?

The fallback based RegExp specification doesn't lend itself easily to unrestricted regexps (unless they are matched backwards instead of forwards, but I think that's going to be confusing).

es-discuss mailing list es-discuss at mozilla.org, mail.mozilla.org/listinfo/es

Andrea Giammarchi wrote:
> I would simply apply same logic we have already for the
> look ahead ... or you think that would cause problems?
As has been discussed previously, it is nontrivial to implement infinite-length lookbehind efficiently (where you’re not just testing /(?:lookbehind)$/ against the lookbehind start position). For this reason, .NET is the only major regex flavor that supports it. It’s able to do so by taking advantage of its Right-to-Left Mode [1], the actual semantics of which are a big question mark. Sometimes it matches intuitive developer expectations (e.g., \d+ matches all of “123” rather than just the trailing “3”), and sometimes it doesn’t.
Infinite-length lookbehind would of course be ideal and doesn’t *have* to cause problems. However, it certainly could cause problems if it is poorly spec-ed and/or poorly implemented.
--Steven Levithan
[1]: http://msdn.microsoft.com/en-us/library/yd1hzczs.aspx#RightToLeft



 
On Sat, Mar 17, 2012 at 1:07 PM, Lasse Reichstein <reichsteinatwork at gmail.com> wrote:

  On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com> wrote:
  > I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes").
  > Specifically, he stated "Lookbehind support is promoted to required
  > normative."


  Are there any plans for which kind of look-behind to allow? Literal
  only, fixed-length, repetitions of fixed-length, unrestricted or
  something else?

  The fallback based RegExp specification doesn't lend itself easily to
  unrestricted regexps (unless they are matched backwards instead of
  forwards, but I think that's going to be confusing).

  /L

  _______________________________________________
  es-discuss mailing list
  es-discuss at mozilla.org
  https://mail.mozilla.org/listinfo/es-discuss

-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20120317/b7bd1240/attachment-0001.html>

# Lasse Reichstein (13 years ago)

On Sat, Mar 17, 2012 at 2:28 PM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?

I'm not sure it even makes sense.

ES RegExps are backtracking based, and it makes a difference in which order alternatives are tried. Greedy matching is defined in terms of number of repetitions, not length of the match. All of these are defined in a way that assumes left-to-right matching.

Example: Take the RegExp /(?<((?:aa|aaa)+))b/ where (?< ... ) delimits the look-behind. and try matching it on the string "xaaaaaaaaab". Then tell me how many a's are captured by the capturing group, and why :)

The most "intuitive" interpretation would be a reverse implementation of the normal matching algorithm, i.e., "backwards matching", but that would likely duplicate the entire RegExp semantics (or parameterize it by a direction).

Any attempt to use the normal (forward) semantics and then try to find an earlier point to start it at is likely to be either flawed or effectively unpredictable to users.

And you will probably never achieve that /(<re>)$/ and /(?<(re))$/

always capture the same substring :)

On Sat, Mar 17, 2012 at 2:28 PM, Andrea Giammarchi
<andrea.giammarchi at gmail.com> wrote:
> I would simply apply same logic we have already for the look ahead ... or
> you think that would cause problems?

I'm not sure it even makes sense.

ES RegExps are backtracking based, and it makes a difference in which
order alternatives are tried. Greedy matching is defined in terms of
number of repetitions, not length of the match. All of these are
defined in a way that assumes left-to-right matching.

Example:
 Take the RegExp  /(?<((?:aa|aaa)+))b/  where (?< ... ) delimits the
look-behind.
 and try matching it on the string "xaaaaaaaaab".
 Then tell me how many a's are captured by the capturing group, and why :)

The most "intuitive" interpretation would be a reverse implementation
of the normal matching algorithm, i.e., "backwards matching", but that
would likely duplicate the entire RegExp semantics (or parameterize it
by a direction).

Any attempt to use the normal (forward) semantics and then try to find
an earlier point to start it at is likely to be either flawed or
effectively unpredictable to users.

And you will probably never achieve that /(<re>)$/ and /(?<(re))$/
always capture the same substring :)

/L

# Steven L. (13 years ago)

Lasse Reichstein wrote:

I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?

I'm not sure it even makes sense.

ES RegExps are backtracking based, and it makes a difference in which order alternatives are tried. Greedy matching is defined in terms of number of repetitions, not length of the match. All of these are defined in a way that assumes left-to-right matching.

Example: Take the RegExp /(?<((?:aa|aaa)+))b/ where (?< ... ) delimits the look-behind. and try matching it on the string "xaaaaaaaaab". Then tell me how many a's are captured by the capturing group, and why :)

The most "intuitive" interpretation would be a reverse implementation of the normal matching algorithm, i.e., "backwards matching", but that would likely duplicate the entire RegExp semantics (or parameterize it by a direction).

Any attempt to use the normal (forward) semantics and then try to find an earlier point to start it at is likely to be either flawed or effectively unpredictable to users.

Technically, you're right. They're different. But they can appear exactly the same by implementing lookbehind as a zero-length assertion of (?:lookbehind)$ matched against the lookbehind's left context, starting from the very start of the subject string. Although people thinking about implementation might come to think of some other approach as more intuitive, from my experience every single plain-old-developer unconcerned about implementation thinks of the semantics I just described as intuitive. It is also how every single implementation of lookbehind that I know of actually works.

The reason that all major regex flavors except .NET don't support lookbehind is because it's inefficient to re-search from the very beginning of an arbitrarily long string. That's why they support fixed- or finit-length lookbehind only--if they can determine the maximum distance backward they need to search forward from, they can step back only that many characters. In practice, at least for finite- rather than fixed-length lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.

The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It magically follows the plain-old-developer's intuitive expectation while working backword rather than from the start of the string. Unfortunately, how it actually works is fairly mysterious. Although it works fairly reliably, as I previously mentioned it can occasionally be a bit buggy/weird.

And you will probably never achieve that /(<re>)$/ and /(?<(re))$/ always capture the same substring :)

Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string

in every implementation of lookbehind that I know of.

--Steven Levithan

Lasse Reichstein wrote:

>> I would simply apply same logic we have already for the look ahead ... or
>> you think that would cause problems?
>
> I'm not sure it even makes sense.
>
> ES RegExps are backtracking based, and it makes a difference in which
> order alternatives are tried. Greedy matching is defined in terms of
> number of repetitions, not length of the match. All of these are
> defined in a way that assumes left-to-right matching.
>
> Example:
>  Take the RegExp  /(?<((?:aa|aaa)+))b/  where (?< ... ) delimits the
> look-behind.
>  and try matching it on the string "xaaaaaaaaab".
>  Then tell me how many a's are captured by the capturing group, and why :)
>
> The most "intuitive" interpretation would be a reverse implementation
> of the normal matching algorithm, i.e., "backwards matching", but that
> would likely duplicate the entire RegExp semantics (or parameterize it
> by a direction).
>
> Any attempt to use the normal (forward) semantics and then try to find
> an earlier point to start it at is likely to be either flawed or
> effectively unpredictable to users.

Technically, you're right. They're different. But they can appear exactly 
the same by implementing lookbehind as a zero-length assertion of 
(?:lookbehind)$ matched against the lookbehind's left context, starting from 
the very start of the subject string. Although people thinking about 
implementation might come to think of some other approach as more intuitive, 
from my experience every single plain-old-developer unconcerned about 
implementation thinks of the semantics I just described as intuitive. It is 
also how every single implementation of lookbehind that I know of actually 
works.

The reason that all major regex flavors except .NET don't support lookbehind 
is because it's inefficient to re-search from the very beginning of an 
arbitrarily long string. That's why they support fixed- or finit-length 
lookbehind only--if they can determine the maximum distance backward they 
need to search forward from, they can step back only that many characters. 
In practice, at least for finite- rather than fixed-length lookbehind, this 
attempt to avoid far-back searches is kind of silly--e.g., Java lets you use 
a quantifier like {0,100000} within lookbehind.

The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It 
magically follows the plain-old-developer's intuitive expectation while 
working backword rather than from the start of the string. Unfortunately, 
how it actually works is fairly mysterious. Although it works fairly 
reliably, as I previously mentioned it can occasionally be a bit 
buggy/weird.

> And you will probably never achieve that /(<re>)$/ and /(?<(re))$/
> always capture the same substring :)

Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string 
in every implementation of lookbehind that I know of.

--Steven Levithan

# Steven L. (13 years ago)

Steven Levithan wrote:

The reason that all major regex flavors except .NET don't support lookbehind is because [...]

That's a typo. I meant to write "... don't support infinite-length lookbehind".

--Steven Levithan

Steven Levithan wrote:

> The reason that all major regex flavors except .NET don't support 
> lookbehind is because [...]

That's a typo. I meant to write "... don't support *infinite-length* 
lookbehind".

--Steven Levithan

# Erik Corry (13 years ago)

2012/3/18 Steven L. <steves_list at hotmail.com>:

Lasse Reichstein wrote:

I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?

I'm not sure it even makes sense.

ES RegExps are backtracking based, and it makes a difference in which order alternatives are tried. Greedy matching is defined in terms of number of repetitions, not length of the match. All of these are defined in a way that assumes left-to-right matching.

Example: Take the RegExp /(?<((?:aa|aaa)+))b/ where (?< ... ) delimits the look-behind. and try matching it on the string "xaaaaaaaaab". Then tell me how many a's are captured by the capturing group, and why :)

The most "intuitive" interpretation would be a reverse implementation of the normal matching algorithm, i.e., "backwards matching", but that would likely duplicate the entire RegExp semantics (or parameterize it by a direction).

Any attempt to use the normal (forward) semantics and then try to find an earlier point to start it at is likely to be either flawed or effectively unpredictable to users.

Technically, you're right. They're different. But they can appear exactly the same by implementing lookbehind as a zero-length assertion of (?:lookbehind)$ matched against the lookbehind's left context, starting from the very start of the subject string. Although people thinking about implementation might come to think of some other approach as more intuitive, from my experience every single plain-old-developer unconcerned about implementation thinks of the semantics I just described as intuitive. It is also how every single implementation of lookbehind that I know of actually works.

The reason that all major regex flavors except .NET don't support lookbehind is because it's inefficient to re-search from the very beginning of an arbitrarily long string. That's why they support fixed- or finit-length lookbehind only--if they can determine the maximum distance backward they need to search forward from, they can step back only that many characters.

No wonder that look-behinds have a reputation for poor performance if this is how it's done.

In practice, at least for finite- rather than fixed-length lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.

At least this provides something to point the user at and say "look, this is why you have bad performance".

The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It magically follows the plain-old-developer's intuitive expectation while working backword rather than from the start of the string. Unfortunately, how it actually works is fairly mysterious. Although it works fairly reliably, as I previously mentioned it can occasionally be a bit buggy/weird.

And you will probably never achieve that /(<re>)$/ and /(?<(re))$/ always capture the same substring :)

Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string in every implementation of lookbehind that I know of.

Really? I don't have a .NET implementation handy, but I'll make a prediction based on the description of its algorithm. Lets look at the following. (replace the '+'s with {1,1000} for the sake of the non-.NET regexps, I'm not going to type that ugliness here):

/(x(.+?))$/ /(?<(x(.+?))$/

Give these regexps the input: "x foo x bar". The first should match "x foo x bar" and the second will match "x bar" as the non-greedy quantifier will stop when it finds the nearest x from the end. But in a regexp engine with the "start at the earliest point and search forwards" kludge they will return the same.

Sadly this means that if we settle on the non-.NET way to implement look-behind it will be user-visible, so that implementations will not have the option of using the efficient .NET algorithm. Also we will never be able to get rid of the limitation on infinite length lookbehind without losing backwards compatibility.

2012/3/18 Steven L. <steves_list at hotmail.com>:
> Lasse Reichstein wrote:
>
>>> I would simply apply same logic we have already for the look ahead ... or
>>> you think that would cause problems?
>>
>>
>> I'm not sure it even makes sense.
>>
>> ES RegExps are backtracking based, and it makes a difference in which
>> order alternatives are tried. Greedy matching is defined in terms of
>> number of repetitions, not length of the match. All of these are
>> defined in a way that assumes left-to-right matching.
>>
>> Example:
>>  Take the RegExp  /(?<((?:aa|aaa)+))b/  where (?< ... ) delimits the
>> look-behind.
>>  and try matching it on the string "xaaaaaaaaab".
>>  Then tell me how many a's are captured by the capturing group, and why :)
>>
>> The most "intuitive" interpretation would be a reverse implementation
>> of the normal matching algorithm, i.e., "backwards matching", but that
>> would likely duplicate the entire RegExp semantics (or parameterize it
>> by a direction).
>>
>> Any attempt to use the normal (forward) semantics and then try to find
>> an earlier point to start it at is likely to be either flawed or
>> effectively unpredictable to users.
>
>
> Technically, you're right. They're different. But they can appear exactly
> the same by implementing lookbehind as a zero-length assertion of
> (?:lookbehind)$ matched against the lookbehind's left context, starting from
> the very start of the subject string. Although people thinking about
> implementation might come to think of some other approach as more intuitive,
> from my experience every single plain-old-developer unconcerned about
> implementation thinks of the semantics I just described as intuitive. It is
> also how every single implementation of lookbehind that I know of actually
> works.
>
> The reason that all major regex flavors except .NET don't support lookbehind
> is because it's inefficient to re-search from the very beginning of an
> arbitrarily long string. That's why they support fixed- or finit-length
> lookbehind only--if they can determine the maximum distance backward they
> need to search forward from, they can step back only that many characters.

No wonder that look-behinds have a reputation for poor performance if
this is how it's done.

> In practice, at least for finite- rather than fixed-length lookbehind, this
> attempt to avoid far-back searches is kind of silly--e.g., Java lets you use
> a quantifier like {0,100000} within lookbehind.

At least this provides something to point the user at and say "look,
this is why you have bad performance".

> The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It
> magically follows the plain-old-developer's intuitive expectation while
> working backword rather than from the start of the string. Unfortunately,
> how it actually works is fairly mysterious. Although it works fairly
> reliably, as I previously mentioned it can occasionally be a bit
> buggy/weird.
>
>
>> And you will probably never achieve that /(<re>)$/ and /(?<(re))$/
>> always capture the same substring :)
>
>
> Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string
> in every implementation of lookbehind that I know of.

Really?  I don't have a .NET implementation handy, but I'll make a
prediction based on the description of its algorithm.  Lets look at
the following.  (replace the '+'s with {1,1000} for the sake of the
non-.NET regexps, I'm not going to type that ugliness here):

/(x(.+?))$/
/(?<(x(.+?))$/

Give these regexps the input: "x foo x bar".  The first should match
"x foo x bar" and the second will match "x bar" as the non-greedy
quantifier will stop when it finds the nearest x from the end.  But in
a regexp engine with the "start at the earliest point and search
forwards" kludge they will return the same.

Sadly this means that if we settle on the non-.NET way to implement
look-behind it will be user-visible, so that implementations will not
have the option of using the efficient .NET algorithm.  Also we will
never be able to get rid of the limitation on infinite length
lookbehind without losing backwards compatibility.

-- 
Erik Corry

# Steven L. (13 years ago)

Erik Corry wrote:

Steven Levithan wrote:

In practice, at least for [Java-style finite-length] lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.

At least this provides something to point the user at and say "look, this is why you have bad performance".

Fair point.

Erik Corry wrote:

Really? I don't have a .NET implementation handy, but I'll make a prediction based on the description of its algorithm. Lets look at the following. (replace the '+'s with {1,1000} for the sake of the non-.NET regexps, I'm not going to type that ugliness here):

/(x(.+?))$/ /(?<(x(.+?))$/

Give these regexps the input: "x foo x bar". The first should match "x foo x bar" and the second will match "x bar" as the non-greedy quantifier will stop when it finds the nearest x from the end. But in a regexp engine with the "start at the earliest point and search forwards" kludge they will return the same.

You're right for .NET. But I must add: I wasn't really considering the nongreedy case (which should have been obvious!) in any of my previous descriptions of lookbehind behavior or developer expectations for it. Boo, me! :-( I've just done some testing that also shows I've been wrong/oversimplifying/overgeneralizing in my descriptions of lookbehind behavior and implementation across multiple regex flavors (even in the greedy case).

Let me give some real results. I'll leave out comparisons with /(nonlookbehind)$/ since I think everyone interested in this discussion knows what those results should be.

/(?<=(\d+))$/ with "123": * .NET: $1 == "123". * Java: [Unsupported] * PCRE: [Unsupported]

/(?<=(\d{1,3}))$/ with "123": * .NET: $1 == "123". * Java: $1 == "3". * PCRE: [Unsupported]

/(?<=(\d{1,3}?))$/ with "123": * .NET: $1 == "3". * Java: $1 == "3". * PCRE: [Unsupported]

/(?<=(\d{1,3})(\d{1,3}?))$/ with "123": * .NET: $1 == "12", $2 == "3". * Java: $1 == "2", $2 == "3". * PCRE: [Unsupported]

/(?<=(\d{1,3}?)(\d{1,3}))$/ with "123": * .NET: $1 == "1", $2 == "23". * Java: $1 == "2", $2 == "3". * PCRE: [Unsupported]

/(?<=(\d)|(\d\d)|(\d\d\d))$/ with "123": * .NET: $1 == "3". [$2 and $3 don't participate.] * Java: $1 == "3". [$2 and $3 don't participate.] * PCRE: $1 == "3". [$2 and $3 don't participate.]

/(?<=(\d\d\d)|(\d\d)|(\d))$/ with "123": * .NET: $1 == "123". [$2 and $3 don't participate.] * Java: $3 == "3". [$1 and $2 don't participate.] * PCRE: $1 == "123". [$2 and $3 don't participate.]

Notes:

There is more to test (e.g., nested lookbehind, supported by .NET, Java, and PCRE), but this should demonstrate the basic semantics used by the three major implementations that support capturing within variable-length lookbehind.
Oniguruma (used by default in Ruby 1.9) supports /(?<=a|bc)/ but not /(?<=ab?)/ or /(?<=a(?:b|cd))/. It also doesn't support capturing groups in lookbehind.
Perl, Python, and Boost.Regex support fixed-length lookbehind only: (?<=abc) or (?<=a|b).
Ruby 1.8, Tcl, XML Schema/XPath, RE2, and ERE/BRE do not support lookbehind at all.
I haven't tested ICU Regular Expressions, but their docs say they support finite-length lookbehind (without specifics).

IMO, .NET's lookbehind is the flagship/ideal, both in its support for all regex syntax and in it's greedy/nongreedy behavior.

This is probably as good a place as any to note that, should the committee feel compelled to reconsider (or augment) lookbehind support, there is another simpler and more efficient idea that works similar to lookbehind at the start of a pattern: \K ("keep"). \K was invented by Jeff Pinyan (see 1) and made it's way into Perl 5.10, PCRE 7.2, and Boost.Regex (ver?).

\K effectively resets the match start wherever it is encountered, causing any previously matched characters not to be included in the final match (e.g., /foo\Kbar/ matches "foobar", but reports that it matched "bar"). This avoids lookbehind-esque concerns about infinite/finite/fixed length. Also note that \K doesn't interfere with the setting of backreferences (e.g., /(foo)\Kbar/ sets $1 to "foo"), and can be used within alternatives (e.g., /foo|bar\Kbaz/ matches "foo" or "barbaz", but reports matching "foo" or "baz") or repeated groups, etc.

\K has a couple issues of its own, though:

* Perl docs say \K's behavior in lookaround is "not well defined". In

PCRE, \K is acted upon within positive lookaround but ignored in negative lookaround.

* \K in ES would open the possibility of zero-length matches that change

the value of RegExp.prototype.lastIndex. This isn't a problem on its own, but it can effect how accompanying code must be written. It could also break code unwise to \K that accepts arbitrary regexes from an outside source.

-- Steven Levithan

Erik Corry wrote:
> Steven Levithan wrote:
>> In practice, at least for [Java-style finite-length] lookbehind, this
>> attempt to avoid far-back searches is kind of silly--e.g., Java lets you 
>> use
>> a quantifier like {0,100000} within lookbehind.
>
> At least this provides something to point the user at and say "look,
> this is why you have bad performance".

Fair point.

Erik Corry wrote:
> Really?  I don't have a .NET implementation handy, but I'll make a
> prediction based on the description of its algorithm.  Lets look at
> the following.  (replace the '+'s with {1,1000} for the sake of the
> non-.NET regexps, I'm not going to type that ugliness here):
>
> /(x(.+?))$/
> /(?<(x(.+?))$/
>
> Give these regexps the input: "x foo x bar".  The first should match
> "x foo x bar" and the second will match "x bar" as the non-greedy
> quantifier will stop when it finds the nearest x from the end.  But in
> a regexp engine with the "start at the earliest point and search
> forwards" kludge they will return the same.

You're right for .NET. But I must add: I wasn't really considering the 
nongreedy case (which should have been obvious!) in any of my previous 
descriptions of lookbehind behavior or developer expectations for it. Boo, 
me! :-( I've just done some testing that also shows I've been 
wrong/oversimplifying/overgeneralizing in my descriptions of lookbehind 
behavior and implementation across multiple regex flavors (even in the 
greedy case).

Let me give some real results. I'll leave out comparisons with 
/(nonlookbehind)$/ since I think everyone interested in this discussion 
knows what those results should be.

/(?<=(\d+))$/ with "123":
    * .NET: $1 == "123".
    * Java: [Unsupported]
    * PCRE: [Unsupported]

/(?<=(\d{1,3}))$/ with "123":
    * .NET: $1 == "123".
    * Java: $1 == "3".
    * PCRE: [Unsupported]

/(?<=(\d{1,3}?))$/ with "123":
    * .NET: $1 == "3".
    * Java: $1 == "3".
    * PCRE: [Unsupported]

/(?<=(\d{1,3})(\d{1,3}?))$/ with "123":
    * .NET: $1 == "12", $2 == "3".
    * Java: $1 == "2", $2 == "3".
    * PCRE: [Unsupported]

/(?<=(\d{1,3}?)(\d{1,3}))$/ with "123":
    * .NET: $1 == "1", $2 == "23".
    * Java: $1 == "2", $2 == "3".
    * PCRE: [Unsupported]

/(?<=(\d)|(\d\d)|(\d\d\d))$/ with "123":
    * .NET: $1 == "3". [$2 and $3 don't participate.]
    * Java: $1 == "3". [$2 and $3 don't participate.]
    * PCRE: $1 == "3". [$2 and $3 don't participate.]

/(?<=(\d\d\d)|(\d\d)|(\d))$/ with "123":
    * .NET: $1 == "123". [$2 and $3 don't participate.]
    * Java: $3 == "3". [$1 and $2 don't participate.]
    * PCRE: $1 == "123". [$2 and $3 don't participate.]

Notes:

* There is more to test (e.g., nested lookbehind, supported by .NET, Java, 
and PCRE), but this should demonstrate the basic semantics used by the three 
major implementations that support capturing within variable-length 
lookbehind.

* Oniguruma (used by default in Ruby 1.9) supports /(?<=a|bc)/ but not 
/(?<=ab?)/ or /(?<=a(?:b|cd))/. It also doesn't support capturing groups in 
lookbehind.

* Perl, Python, and Boost.Regex support fixed-length lookbehind only: 
(?<=abc) or (?<=a|b).

* Ruby 1.8, Tcl, XML Schema/XPath, RE2, and ERE/BRE do not support 
lookbehind at all.

* I haven't tested ICU Regular Expressions, but their docs say they support 
finite-length lookbehind (without specifics).

IMO, .NET's lookbehind is the flagship/ideal, both in its support for all 
regex syntax and in it's greedy/nongreedy behavior.

This is probably as good a place as any to note that, should the committee 
feel compelled to reconsider (or augment) lookbehind support, there is 
another simpler and more efficient idea that works similar to lookbehind at 
the start of a pattern: \K ("keep"). \K was invented by Jeff Pinyan (see 
[1]) and made it's way into Perl 5.10, PCRE 7.2, and Boost.Regex (ver?).

\K effectively resets the match start wherever it is encountered, causing 
any previously matched characters not to be included in the final match 
(e.g., /foo\Kbar/ matches "foobar", but reports that it matched "bar"). This 
avoids lookbehind-esque concerns about infinite/finite/fixed length. Also 
note that \K doesn't interfere with the setting of backreferences (e.g., 
/(foo)\Kbar/ sets $1 to "foo"), and can be used within alternatives (e.g., 
/foo|bar\Kbaz/ matches "foo" or "barbaz", but reports matching "foo" or 
"baz") or repeated groups, etc.

\K has a couple issues of its own, though:

    * Perl docs say \K's behavior in lookaround is "not well defined". In 
PCRE, \K is acted upon within positive lookaround but ignored in negative 
lookaround.

    * \K in ES would open the possibility of zero-length matches that change 
the value of RegExp.prototype.lastIndex. This isn't a problem on its own, 
but it can effect how accompanying code must be written. It could also break 
code unwise to \K that accepts arbitrary regexes from an outside source.

-- Steven Levithan

[1]: http://search.cpan.org/~pinyan/Regexp-Keep-0.02/Keep.pm

# Steven L. (13 years ago)

Note: I've posted the tests from my previous email at stevenlevithan.com/regex/tests/lookbehind.html

-- Steven Levithan

Note: I've posted the tests from my previous email at 
http://stevenlevithan.com/regex/tests/lookbehind.html

-- Steven Levithan