RegExp lookbehind
Bump++. :)
We'll get this strawman:
strawman:steve_levithan_regexp_api_improvements
onto the agenda for the next TC39 meeting. I had thought we'd approved lookbehind. I will do what I can to get it into ES6.
Brendan Eich wrote:
We'll get this strawman:
strawman:steve_levithan_regexp_api_improvements
onto the agenda for the next TC39 meeting.
Neat-o. I mentioned in the blog post referenced there [1] that I'd be doing a separate write-up with my opinionated take or what new RegExp features should be added, rather than changed. Unfortunately I never followed through, but IMO the most pressing additions are named capture and backreferences, atomic groups, lookbehind, the /x flag, the /s flag, and a /u flag that lets \d \D \w \W \b \B follow Unicode. There are plenty of other useful features that could be considered, but I'll wait at least until there is some reaction to the above.
There are of course details to hash out for named capture, lookbehind, and the /x flag before they can be spec-ed. The /s flag is the least important of the above--I've included it merely because of its convenience over e.g. [\s\S] and the fact that its absence is unique among modern regex flavors and a common portability hindrance.
Brendan Eich wrote:
I had thought we'd approved lookbehind. I will do what I can to get it into ES6.
I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). Specifically, he stated "Lookbehind support is promoted to required normative."
[1] blog.stevenlevithan.com/archives/fixing-javascript-regexp [2] esdiscuss/2011-May/014748
--Steven Levithan
On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com> wrote:
I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). Specifically, he stated "Lookbehind support is promoted to required normative."
Are there any plans for which kind of look-behind to allow? Literal only, fixed-length, repetitions of fixed-length, unrestricted or something else?
The fallback based RegExp specification doesn't lend itself easily to unrestricted regexps (unless they are matched backwards instead of forwards, but I think that's going to be confusing).
I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?
Andrea Giammarchi wrote:
I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?
As has been discussed previously, it is nontrivial to implement infinite-length lookbehind efficiently (where you’re not just testing /(?:lookbehind)$/ against the lookbehind start position). For this reason, .NET is the only major regex flavor that supports it. It’s able to do so by taking advantage of its Right-to-Left Mode [1], the actual semantics of which are a big question mark. Sometimes it matches intuitive developer expectations (e.g., \d+ matches all of “123” rather than just the trailing “3”), and sometimes it doesn’t. Infinite-length lookbehind would of course be ideal and doesn’t have to cause problems. However, it certainly could cause problems if it is poorly spec-ed and/or poorly implemented. --Steven Levithan [1]: msdn.microsoft.com/en-us/library/yd1hzczs.aspx#RightToLeft
On Sat, Mar 17, 2012 at 1:07 PM, Lasse Reichstein <reichsteinatwork at gmail.com> wrote:
On Sat, Mar 17, 2012 at 9:00 AM, Steven L. <steves_list at hotmail.com> wrote:
I thought so, too. See [2] from Waldemar ("May 24-26 rough meeting notes"). Specifically, he stated "Lookbehind support is promoted to required normative."
Are there any plans for which kind of look-behind to allow? Literal only, fixed-length, repetitions of fixed-length, unrestricted or something else?
The fallback based RegExp specification doesn't lend itself easily to unrestricted regexps (unless they are matched backwards instead of forwards, but I think that's going to be confusing).
/L
es-discuss mailing list es-discuss at mozilla.org, mail.mozilla.org/listinfo/es
On Sat, Mar 17, 2012 at 2:28 PM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:
I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?
I'm not sure it even makes sense.
ES RegExps are backtracking based, and it makes a difference in which order alternatives are tried. Greedy matching is defined in terms of number of repetitions, not length of the match. All of these are defined in a way that assumes left-to-right matching.
Example: Take the RegExp /(?<((?:aa|aaa)+))b/ where (?< ... ) delimits the look-behind. and try matching it on the string "xaaaaaaaaab". Then tell me how many a's are captured by the capturing group, and why :)
The most "intuitive" interpretation would be a reverse implementation of the normal matching algorithm, i.e., "backwards matching", but that would likely duplicate the entire RegExp semantics (or parameterize it by a direction).
Any attempt to use the normal (forward) semantics and then try to find an earlier point to start it at is likely to be either flawed or effectively unpredictable to users.
And you will probably never achieve that /(<re>)$/ and /(?<(re))$/
always capture the same substring :)
Lasse Reichstein wrote:
I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?
I'm not sure it even makes sense.
ES RegExps are backtracking based, and it makes a difference in which order alternatives are tried. Greedy matching is defined in terms of number of repetitions, not length of the match. All of these are defined in a way that assumes left-to-right matching.
Example: Take the RegExp /(?<((?:aa|aaa)+))b/ where (?< ... ) delimits the look-behind. and try matching it on the string "xaaaaaaaaab". Then tell me how many a's are captured by the capturing group, and why :)
The most "intuitive" interpretation would be a reverse implementation of the normal matching algorithm, i.e., "backwards matching", but that would likely duplicate the entire RegExp semantics (or parameterize it by a direction).
Any attempt to use the normal (forward) semantics and then try to find an earlier point to start it at is likely to be either flawed or effectively unpredictable to users.
Technically, you're right. They're different. But they can appear exactly the same by implementing lookbehind as a zero-length assertion of (?:lookbehind)$ matched against the lookbehind's left context, starting from the very start of the subject string. Although people thinking about implementation might come to think of some other approach as more intuitive, from my experience every single plain-old-developer unconcerned about implementation thinks of the semantics I just described as intuitive. It is also how every single implementation of lookbehind that I know of actually works.
The reason that all major regex flavors except .NET don't support lookbehind is because it's inefficient to re-search from the very beginning of an arbitrarily long string. That's why they support fixed- or finit-length lookbehind only--if they can determine the maximum distance backward they need to search forward from, they can step back only that many characters. In practice, at least for finite- rather than fixed-length lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.
The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It magically follows the plain-old-developer's intuitive expectation while working backword rather than from the start of the string. Unfortunately, how it actually works is fairly mysterious. Although it works fairly reliably, as I previously mentioned it can occasionally be a bit buggy/weird.
And you will probably never achieve that /(<re>)$/ and /(?<(re))$/ always capture the same substring :)
Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string
in every implementation of lookbehind that I know of.
--Steven Levithan
Steven Levithan wrote:
The reason that all major regex flavors except .NET don't support lookbehind is because [...]
That's a typo. I meant to write "... don't support infinite-length lookbehind".
--Steven Levithan
2012/3/18 Steven L. <steves_list at hotmail.com>:
Lasse Reichstein wrote:
I would simply apply same logic we have already for the look ahead ... or you think that would cause problems?
I'm not sure it even makes sense.
ES RegExps are backtracking based, and it makes a difference in which order alternatives are tried. Greedy matching is defined in terms of number of repetitions, not length of the match. All of these are defined in a way that assumes left-to-right matching.
Example: Take the RegExp /(?<((?:aa|aaa)+))b/ where (?< ... ) delimits the look-behind. and try matching it on the string "xaaaaaaaaab". Then tell me how many a's are captured by the capturing group, and why :)
The most "intuitive" interpretation would be a reverse implementation of the normal matching algorithm, i.e., "backwards matching", but that would likely duplicate the entire RegExp semantics (or parameterize it by a direction).
Any attempt to use the normal (forward) semantics and then try to find an earlier point to start it at is likely to be either flawed or effectively unpredictable to users.
Technically, you're right. They're different. But they can appear exactly the same by implementing lookbehind as a zero-length assertion of (?:lookbehind)$ matched against the lookbehind's left context, starting from the very start of the subject string. Although people thinking about implementation might come to think of some other approach as more intuitive, from my experience every single plain-old-developer unconcerned about implementation thinks of the semantics I just described as intuitive. It is also how every single implementation of lookbehind that I know of actually works.
The reason that all major regex flavors except .NET don't support lookbehind is because it's inefficient to re-search from the very beginning of an arbitrarily long string. That's why they support fixed- or finit-length lookbehind only--if they can determine the maximum distance backward they need to search forward from, they can step back only that many characters.
No wonder that look-behinds have a reputation for poor performance if this is how it's done.
In practice, at least for finite- rather than fixed-length lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.
At least this provides something to point the user at and say "look, this is why you have bad performance".
The Right-to-Left Mode that powers .NET's lookbehind is pretty neat. It magically follows the plain-old-developer's intuitive expectation while working backword rather than from the start of the string. Unfortunately, how it actually works is fairly mysterious. Although it works fairly reliably, as I previously mentioned it can occasionally be a bit buggy/weird.
And you will probably never achieve that /(<re>)$/ and /(?<(re))$/ always capture the same substring :)
Apart from potential bugs, (<re>)$ and (?<=(<re>))$ capture the same string in every implementation of lookbehind that I know of.
Really? I don't have a .NET implementation handy, but I'll make a prediction based on the description of its algorithm. Lets look at the following. (replace the '+'s with {1,1000} for the sake of the non-.NET regexps, I'm not going to type that ugliness here):
/(x(.+?))$/ /(?<(x(.+?))$/
Give these regexps the input: "x foo x bar". The first should match "x foo x bar" and the second will match "x bar" as the non-greedy quantifier will stop when it finds the nearest x from the end. But in a regexp engine with the "start at the earliest point and search forwards" kludge they will return the same.
Sadly this means that if we settle on the non-.NET way to implement look-behind it will be user-visible, so that implementations will not have the option of using the efficient .NET algorithm. Also we will never be able to get rid of the limitation on infinite length lookbehind without losing backwards compatibility.
Erik Corry wrote:
Steven Levithan wrote:
In practice, at least for [Java-style finite-length] lookbehind, this attempt to avoid far-back searches is kind of silly--e.g., Java lets you use a quantifier like {0,100000} within lookbehind.
At least this provides something to point the user at and say "look, this is why you have bad performance".
Fair point.
Erik Corry wrote:
Really? I don't have a .NET implementation handy, but I'll make a prediction based on the description of its algorithm. Lets look at the following. (replace the '+'s with {1,1000} for the sake of the non-.NET regexps, I'm not going to type that ugliness here):
/(x(.+?))$/ /(?<(x(.+?))$/
Give these regexps the input: "x foo x bar". The first should match "x foo x bar" and the second will match "x bar" as the non-greedy quantifier will stop when it finds the nearest x from the end. But in a regexp engine with the "start at the earliest point and search forwards" kludge they will return the same.
You're right for .NET. But I must add: I wasn't really considering the nongreedy case (which should have been obvious!) in any of my previous descriptions of lookbehind behavior or developer expectations for it. Boo, me! :-( I've just done some testing that also shows I've been wrong/oversimplifying/overgeneralizing in my descriptions of lookbehind behavior and implementation across multiple regex flavors (even in the greedy case).
Let me give some real results. I'll leave out comparisons with /(nonlookbehind)$/ since I think everyone interested in this discussion knows what those results should be.
/(?<=(\d+))$/ with "123": * .NET: $1 == "123". * Java: [Unsupported] * PCRE: [Unsupported]
/(?<=(\d{1,3}))$/ with "123": * .NET: $1 == "123". * Java: $1 == "3". * PCRE: [Unsupported]
/(?<=(\d{1,3}?))$/ with "123": * .NET: $1 == "3". * Java: $1 == "3". * PCRE: [Unsupported]
/(?<=(\d{1,3})(\d{1,3}?))$/ with "123": * .NET: $1 == "12", $2 == "3". * Java: $1 == "2", $2 == "3". * PCRE: [Unsupported]
/(?<=(\d{1,3}?)(\d{1,3}))$/ with "123": * .NET: $1 == "1", $2 == "23". * Java: $1 == "2", $2 == "3". * PCRE: [Unsupported]
/(?<=(\d)|(\d\d)|(\d\d\d))$/ with "123": * .NET: $1 == "3". [$2 and $3 don't participate.] * Java: $1 == "3". [$2 and $3 don't participate.] * PCRE: $1 == "3". [$2 and $3 don't participate.]
/(?<=(\d\d\d)|(\d\d)|(\d))$/ with "123": * .NET: $1 == "123". [$2 and $3 don't participate.] * Java: $3 == "3". [$1 and $2 don't participate.] * PCRE: $1 == "123". [$2 and $3 don't participate.]
Notes:
-
There is more to test (e.g., nested lookbehind, supported by .NET, Java, and PCRE), but this should demonstrate the basic semantics used by the three major implementations that support capturing within variable-length lookbehind.
-
Oniguruma (used by default in Ruby 1.9) supports /(?<=a|bc)/ but not /(?<=ab?)/ or /(?<=a(?:b|cd))/. It also doesn't support capturing groups in lookbehind.
-
Perl, Python, and Boost.Regex support fixed-length lookbehind only: (?<=abc) or (?<=a|b).
-
Ruby 1.8, Tcl, XML Schema/XPath, RE2, and ERE/BRE do not support lookbehind at all.
-
I haven't tested ICU Regular Expressions, but their docs say they support finite-length lookbehind (without specifics).
IMO, .NET's lookbehind is the flagship/ideal, both in its support for all regex syntax and in it's greedy/nongreedy behavior.
This is probably as good a place as any to note that, should the committee feel compelled to reconsider (or augment) lookbehind support, there is another simpler and more efficient idea that works similar to lookbehind at the start of a pattern: \K ("keep"). \K was invented by Jeff Pinyan (see 1) and made it's way into Perl 5.10, PCRE 7.2, and Boost.Regex (ver?).
\K effectively resets the match start wherever it is encountered, causing any previously matched characters not to be included in the final match (e.g., /foo\Kbar/ matches "foobar", but reports that it matched "bar"). This avoids lookbehind-esque concerns about infinite/finite/fixed length. Also note that \K doesn't interfere with the setting of backreferences (e.g., /(foo)\Kbar/ sets $1 to "foo"), and can be used within alternatives (e.g., /foo|bar\Kbaz/ matches "foo" or "barbaz", but reports matching "foo" or "baz") or repeated groups, etc.
\K has a couple issues of its own, though:
* Perl docs say \K's behavior in lookaround is "not well defined". In
PCRE, \K is acted upon within positive lookaround but ignored in negative lookaround.
* \K in ES would open the possibility of zero-length matches that change
the value of RegExp.prototype.lastIndex. This isn't a problem on its own, but it can effect how accompanying code must be written. It could also break code unwise to \K that accepts arbitrary regexes from an outside source.
-- Steven Levithan
Note: I've posted the tests from my previous email at stevenlevithan.com/regex/tests/lookbehind.html
-- Steven Levithan
After all these years, ES regular expressions still don't support lookbehind. It turns out this was discussed in late 2010 1 and was heading towards being accepted, but the discussion just stalled. So, consider this a bump of that thread.