Proposal for exact matching and matching at a position in RegExp

# Andy Chu (14 years ago)

Here is a very simple proposal. If I can get access to the wiki I could copy it in, but for now it's here:

andychu.net/ecmascript/RegExp-Enhancements.html

Comments appreciated.

thanks, Andy

# Andy Chu (14 years ago)

(The original message was held up in spam moderation for awhile)

Here is an addendum, after it was pointed out to me that this issue has come up before:

andychu.net/ecmascript/RegExp-Enhancements-2.html

Basically the proposal is to add parameters which can override the internal state of the RegExp.

thanks, Andy

# Andy Chu (14 years ago)

On Wed, Jan 27, 2010 at 10:03 PM, Andy Chu <andy at chubot.org> wrote:

(The original message was held up in spam moderation for awhile)

Here is an addendum, after it was pointed out to me that this issue has come up before:

andychu.net/ecmascript/RegExp-Enhancements-2.html

Basically the proposal is to add parameters which can override the internal state of the RegExp.

Does anyone have any comments on this?

Can I put it in a place where it will be considered for the next ECMAScript? The overall idea seems relatively uncontroversial since it was already implemented by Mozilla (for the exact same reason). I have proposed a specific API enhancement too.

thanks, Andy

# Erik Corry (14 years ago)

2010/2/9 Andy Chu <andy at chubot.org>:

On Wed, Jan 27, 2010 at 10:03 PM, Andy Chu <andy at chubot.org> wrote:

(The original message was held up in spam moderation for awhile)

Here is an addendum, after it was pointed out to me that this issue has come up before:

andychu.net/ecmascript/RegExp-Enhancements-2.html

Basically the proposal is to add parameters which can override the internal state of the RegExp.

Does anyone have any comments on this?

I think it would be nice to have this feature. For example, the Windscorpion XML test would run much faster with such an option.

However, expanding the language by adding extra parameters to existing functions is annoying because it means you can't test for the presence or absence of the feature with a simple if:

if (RegExp.funkyNewFunction) { ... }

I think that using the length property on the function is going to be unreliable given the existing variation in those values.

The way we do regexps in V8, the 'y' would be part of the regexp, and we would have to recompile the regexp to handle it with or without 'y'. In other words, in our implementation we have an implicit ".*?(" at the start of the regexp that indicates that there is a non-greedy match-anything before the 0th capture parenthesis. This is a pretty fast and clean approach. I think having something that matched 'the point where searching started' would be more flexible than the global Y flag. Then you could use it like '^' as one part of an alternation. Perl uses \G for the same concept and I think following that would be fine. To test for the presence of the feature you could use

!/\Gx/.test("Gx")

which returns true if the feature is implemented.

# Kam Kasravi (14 years ago)

I had mentioned to Brendan at the last TC39 that you had ported Narcissus and had observed the cost of match. There is a new global regexp parameter called 'y' which prevents the copy from occurring. I've taken a look at Narcissus and made some edits adding y to match. Both Brendan and Bob Clary have advised on how to get the changes into Mercurial on Mozilla if you are not a developer within Mozilla. I can forward you these emails and we could review respective changes if you would like and it's OK with Mozilla. The Mozilla bug id is bugzilla.mozilla.org/show_bug.cgi?id=542621

thx kam

# Andy Chu (14 years ago)

However, expanding the language by adding extra parameters to existing functions is annoying because it means you can't test for the presence or absence of the feature with a simple if:

if (RegExp.funkyNewFunction) { ... }

That's true, I would want to be able to test for the feature. I think you could do it like this:

if (new RegExp("a", "y").test("----a")) { ... use y }

Although perhaps something easier like RegExp.sticky or RegExp.feature('sticky') would be better.
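One caveat with the check above: when /y is supported, the match has to start at index 0, so `new RegExp("a", "y").test("----a")` returns false rather than true, and an engine that follows ES3/ES5 strictly rejects the unknown flag with a SyntaxError. A minimal detection sketch along those lines, assuming a hypothetical helper name `supportsSticky` (not part of any proposal in this thread):

```javascript
// Hypothetical helper: reports whether the engine honours the /y flag.
function supportsSticky() {
  try {
    // With /y the match must begin at lastIndex (0 here), so the "a" at
    // index 4 must NOT be found. An engine that silently ignored the flag
    // would find it and return true from test().
    return new RegExp("a", "y").test("----a") === false;
  } catch (e) {
    // Engines without /y reject the flag when the RegExp is constructed.
    return false;
  }
}

const stickyAvailable = supportsSticky();
```

The try/catch keeps the check usable as a plain `if (supportsSticky()) { ... }` on engines that predate the flag.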

I think that using the length property on the function is going to be unreliable given the existing variation in those values.

Oh I didn't even know this existed. I see that new RegExp("").exec.length == 1. Is this the argument length -- standard or nonstandard?

There is still the "pos" argument to detect too. I don't know if there is some way of detecting it, or whether you could just assume that you have "pos" whenever you have "y". Although it is syntactic sugar for .lastIndex, I think the argument makes a lot more sense for the reasons in my article (not stomping on it between matches).

Maybe RegExp.feature("pos").

The way we do regexps in V8, the 'y' would be part of the regexp, and we would have to recompile the regexp to handle it with or without 'y'. In other words, in our implementation we have an implicit ".*?(" at the start of the regexp that indicates that there is a non-greedy match-anything before the 0th capture parenthesis. This is a pretty fast and clean approach. I think having something that matched 'the point where searching started' would be more flexible than the global Y flag. Then you could use it like '^' as one part of an alternation. Perl uses \G for the same concept and I think following that would be fine. To test for the presence of the feature you could use

!/\Gx/.test("Gx")

So I wouldn't be opposed to having this, but I'm not sure about the compatibility issues. Also I vaguely recall someone saying that ".*?" can trigger a slower code path than .search() in some regexp engines. As in, it's not trivial to get the performance right.

thanks, Andy

# Andy Chu (14 years ago)

On Tue, Feb 9, 2010 at 7:34 AM, Kam Kasravi <kamkasravi at yahoo.com> wrote:

Hi Andy, I had mentioned to Brendan at the last TC39 that you had ported Narcissus and had observed the cost of match. There is a new global regexp parameter called 'y' which prevents the copy from occurring. I've taken a look at Narcissus and made some edits adding y to match. Both Brendan and Bob Clary have advised on how to get the changes into Mercurial on Mozilla if you are not a developer within Mozilla. I can forward you these emails and we could review respective changes if you would like and it's OK with Mozilla. The Mozilla bug id is bugzilla.mozilla.org/show_bug.cgi?id=542621

Thanks for mentioning that. Is Narcissus being used for something now? I had to make a ton of changes to get it to work on non-Mozilla engines (v8, JScript). It got stuck on this performance issue due to lack of /y. But adding this would introduce another nonstandard feature.

I think the solution depends on whether the mainline wants to be standard ECMAScript or not. For non-Mozilla engines, the solution without /y is a bit of a pain. What I would do is change the tokenizer to work line-by-line rather than on the entire code string. You will still have quadratic behavior within a line, but if lines are less than 100 characters, it won't blow up with a large program size. This would solve it for Mozilla too of course, so I might just go with that since I want it to be standard ECMAScript.
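A rough sketch of the line-by-line idea, under stated assumptions: the token grammar here is a made-up toy (not Narcissus's real one), the names `TOKEN`, `tokenizeLine` and `tokenizeByLine` are hypothetical, and tokens that span lines (block comments, for instance) are not handled. The point is only that each `slice` copies at most one line, so the cost no longer grows with the square of the program size.

```javascript
// Toy grammar: a number, an identifier, a run of whitespace, or any other
// single character. One of the alternatives matches any non-empty input.
const TOKEN = /^(?:\d+|[A-Za-z_]\w*|\s+|[^\sA-Za-z_\d])/;

// Tokenize one line by repeatedly matching at the front and slicing.
// The repeated copies are bounded by the line length, not the file length.
function tokenizeLine(line) {
  const tokens = [];
  let rest = line;
  while (rest.length > 0) {
    const m = TOKEN.exec(rest);
    if (!/^\s/.test(m[0])) tokens.push(m[0]); // drop whitespace tokens
    rest = rest.slice(m[0].length);
  }
  return tokens;
}

// Tokenize a whole program one line at a time.
function tokenizeByLine(src) {
  return src.split("\n").flatMap(tokenizeLine);
}

const sampleTokens = tokenizeByLine("a = 1\nb = a + 22");
```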

thanks, Andy

# Brendan Eich (14 years ago)

On Feb 9, 2010, at 9:50 AM, Andy Chu wrote:

However, expanding the language by adding extra parameters to
existing functions is annoying because it means you can't test for the
presence or absence of the feature with a simple if:

if (RegExp.funkyNewFunction) { ... }

That's true, I would want to be able to test for the feature. I think you could do it like this:

if (new RegExp("a", "y").test("----a")) { ... use y }

Although perhaps something easier like RegExp.sticky or RegExp.feature('sticky') would be better.

SpiderMonkey (and possibly Rhino, I haven't checked) does reflect the
y flag as the sticky property:

js> re = /hi/y
/hi/y
js> re.sticky
true
js> re = RegExp("hi", "y")
/hi/y
js> re.sticky
true
js> re2 = /bye/
/bye/
js> re2.sticky
false

I see no need to invent different and in-regexp syntax.

# Steve L. (14 years ago)

andychu.net/ecmascript/RegExp-Enhancements-2.html

Basically the proposal is to add parameters which can override the internal state of the RegExp.

Does anyone have any comments on this?

Can I put it in a place where it will be considered for the next ECMAScript? The overall idea seems relatively uncontroversial since it was already implemented by Mozilla (for the exact same reason). I have proposed a specific API enhancement too.

I do not believe it was implemented for "the exact same reason." It seems you are merely looking for a way to match exactly at a given character position, and you correctly note that /y is not an elegant solution for this problem. However, although /y can be used to solve this problem, my understanding is that it was designed to work similarly to the \G regex token from Perl/PCRE/Java/.NET/etc. while tying in nicely with the lastIndex property. An important feature of /y (and \G from other regex flavors) is that, with global regexes (compiled with /g), each successive match must start where the last match ended. This is a very useful feature for writing some types of simple parsers, etc. And in the process of smartly solving this problem, you get an inelegant solution to your problem as a side effect, free of charge.
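To make the "each successive match must start where the last match ended" behavior concrete, here is a small sketch that runs the same pattern once with /g and once with /y. The helper name `collectMatches` and the sample input are assumptions made for illustration.

```javascript
// Hypothetical helper: collect every match by calling exec() repeatedly.
function collectMatches(re, str) {
  if (!re.global && !re.sticky) {
    throw new TypeError("expected a regex with the g or y flag");
  }
  re.lastIndex = 0;
  const found = [];
  let m;
  while ((m = re.exec(str)) !== null) {
    found.push(m[0]);
    if (m[0].length === 0) re.lastIndex++; // guard against empty matches
  }
  return found;
}

const input = "12,34;56";

// /g may skip over the ";" to find the next match further along.
const withGlobal = collectMatches(/\d+,?/g, input);

// /y requires each match to start exactly at lastIndex, so it stops at ";".
const withSticky = collectMatches(/\d+,?/y, input);
```

With /g the result is the three numbers; with /y the scan ends after "34" because the character at the resume position is ";".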

Steven Levithan blog.stevenlevithan.com

# Steve L. (14 years ago)

Outside of es-discuss, Brendan Eich asked for my thoughts on the merits of \G vs. /y (intrinsically and in light of backward compatibility). I sent the following reply, which he thought would be useful to forward to the list....

I have no preference between /y and \G. When I first saw /y proposed for ES4, I felt it needlessly reinvented the wheel given that \G had already been implemented pretty widely. On the other hand, the fact that \G reaches out of the search pattern to read a property of a regex or string feels a bit too much like magic to me, and implementing it as a flag (/y) seems less weird. An argument in favor of \G is that it's more versatile than /y since it can be used anywhere in a regex pattern (e.g., at the start of an alternation option), not just as the leading element.

Note that \G works a bit differently across implementations. In some cases it matches the start position of the current match (PCRE, Ruby), and elsewhere it matches the end position of the previous match (Perl, Java, .NET). Of course, this distinction only matters after a zero-length match (since that increments the start position of the next search).

Perl has extra functionality around \G that makes it more useful. Specifically, the fact that the location associated with \G is an attribute of target strings (pos()) means that multiple regexes with \G can match against a string in turn and they'll each pick up where the others left off. Combine this with Perl's /c modifier (which prevents failed matches from resetting the \G location) and you can run multiple regexes with \G and /c against a string and advance only when there's a match. Here's a crappy example:

while ($html !~ /\G$/gc) {
    if ($html =~ /\G[^<&]+/gc) { ... }
    elsif ($html =~ /\G<(\w+)[^>]+>/gc) { ... }
    elsif ($html =~ /\G&#?\w+;/gc) { ... }
}
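For comparison, a rough JavaScript counterpart of the Perl loop, assuming an engine with /y. Because lastIndex lives on each regex rather than on the target string, the shared position has to be tracked by hand in a local variable. The function name `scanHtml` and the token shape are hypothetical, and the tag pattern uses `*` instead of `+` so that attribute-less tags also match; closing tags are deliberately left out of this toy.

```javascript
// Several sticky regexes take turns at one shared position in the string.
function scanHtml(html) {
  const rules = [
    ["text", /[^<&]+/y],
    ["tag", /<(\w+)[^>]*>/y],
    ["entity", /&#?\w+;/y],
  ];
  const tokens = [];
  let pos = 0; // plays the role of Perl's pos($html)
  while (pos < html.length) {
    let hit = null;
    for (const [type, re] of rules) {
      re.lastIndex = pos; // every rule starts where the last match ended
      const m = re.exec(html);
      if (m) {
        hit = { type: type, text: m[0] };
        pos = re.lastIndex; // advance only on a successful match
        break;
      }
    }
    if (hit === null) throw new SyntaxError("no rule matches at index " + pos);
    tokens.push(hit);
  }
  return tokens;
}

const htmlTokens = scanHtml("a&amp;<b>c");
```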

Sorry for the tangent, but I thought it might be helpful to describe how \G is used elsewhere.

Steven Levithan blog.stevenlevithan.com


From: "Steve L." <steves_list at hotmail.com>
Sent: Wednesday, February 10, 2010 10:46 AM
To: "Andy Chu" <andy at chubot.org>; "es-discuss" <es-discuss at mozilla.org>
Subject: Re: Proposal for exact matching and matching at a position in RegExp

# Andy Chu (14 years ago)

On Thu, Feb 11, 2010 at 10:24 PM, Steve L. <steves_list at hotmail.com> wrote:

Outside of es-discuss, Brendan Eich asked for my thoughts on the merits of \G vs. /y (intrinsically and in light of backward compatibility). I sent the following reply, which he thought would be useful to forward to the list....

I have no preference between /y and \G. When I first saw /y proposed for ES4, I felt it needlessly reinvented the wheel given that \G had already been implemented pretty widely. On the other hand, the fact that \G reaches out of the search pattern to read a property of a regex or string feels a bit too much like magic to me, and implementing it as a flag (/y) seems less weird. An argument in favor of \G is that it's more versatile than /y since it can be used anywhere in a regex pattern (e.g., at the start of an alternation option), not just as the leading element.

Agree that \G breaks some logical barrier. I like to have a mental model of the implementation internals, and \G breaks that a bit.

If compatibility with Mozilla is not an issue, I actually prefer Python's approach of .search() vs. .match(). It's not a part of the regex; it's not a property of the regex; it's how you apply the regex to a string. Just like you can apply the same regex with .split() or .exec() or .replace(). They're orthogonal issues in my mind.

Though as mentioned, gracefully upgrading with ES3-5 is an issue, so I could only think of .exec() and .execLeft() for a left-anchored match.

One thing I didn't bring up is that Python actually has an "endpos" argument. You do regex.search(s, 10, 20), and it will stop at position 20. I couldn't think of a real use case for this. But if anyone can think of one, that might be a consideration and sway things in favor of separate methods.

Andy

# Andy Chu (14 years ago)

One thing I didn't bring up is that Python actually has an "endpos" argument. You do regex.search(s, 10, 20), and it will stop at position 20. I couldn't think of a real use case for this. But if anyone can think of one, that might be a consideration and sway things in favor of separate methods.

Found some real usage of endpos:

www.google.com/codesearch?q=.search(.%3F%2C[^)]%2C+lang%3Apy&hl=en

if doubledash.search(rawdata, i+4, res.start(0)):
    self.syntax_error("`--' inside comment")

doubledash is the regex '--'.
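A sketch of how the same bounded search could be emulated in JavaScript today, assuming an ES2015+ engine (for the `flags` property). The helper name `searchBetween` is hypothetical. It searches a copy of the regex so the caller's lastIndex is left alone, and it limits the end by slicing, which does copy the string.

```javascript
// Hypothetical helper: search str for re, starting at pos and never
// looking at or beyond endPos.
function searchBetween(re, str, pos, endPos) {
  // Work on a global copy so that lastIndex is honoured as the start
  // position and the caller's regex state is not modified.
  const flags = re.flags.replace(/[gy]/g, "") + "g";
  const probe = new RegExp(re.source, flags);
  probe.lastIndex = pos;
  return probe.exec(str.slice(0, endPos));
}

// Mirror of the comment check: look for "--" after the opening "<!--"
// (index 4) and before the closing "-->" (which starts at index 12).
const rawdata = "<!-- a -- b -->";
const bad = searchBetween(/--/, rawdata, 4, 12);
```

Here `bad` is a match at index 7, the stray "--" inside the comment body.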

Andy

# Erik Corry (14 years ago)

2010/2/12 Andy Chu <andy at chubot.org>:

On Thu, Feb 11, 2010 at 10:24 PM, Steve L. <steves_list at hotmail.com> wrote:

Outside of es-discuss, Brendan Eich asked for my thoughts on the merits of \G vs. /y (intrinsically and in light of backward compatibility). I sent the following reply, which he thought would be useful to forward to the list....

I have no preference between /y and \G. When I first saw /y proposed for ES4, I felt it needlessly reinvented the wheel given that \G had already been implemented pretty widely. On the other hand, the fact that \G reaches out of the search pattern to read a property of a regex or string feels a bit too much like magic to me, and implementing it as a flag (/y) seems less weird. An argument in favor of \G is that it's more versatile than /y since it can be used anywhere in a regex pattern (e.g., at the start of an alternation option), not just as the leading element.

Agree that \G breaks some logical barrier.  I like to have a mental model of the implementation internals, and \G breaks that a bit.

\G is more flexible and it is rather similar to ^ conceptually.

The mental model happens to be out of sync with how regexps are implemented. The implicit .*? at the start of a regexp is actually the fastest way to implement since you are using the fast internal search mechanisms that the regexp engine has rather than an external loop that repeatedly asks "does it match here?".
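The equivalence described here can be illustrated in user-level JavaScript: an unanchored search for a pattern finds the same leftmost match as an anchored match of a lazy "anything" prefix followed by the pattern in a capture group. This is only a sketch of the idea, not how V8 is actually built. The helper name `searchViaPrefix` is hypothetical, `[\s\S]` stands in for "any character including newlines", and the pattern is assumed to contain no backreferences (wrapping it in a group would renumber them).

```javascript
// Hypothetical helper: emulate an unanchored search with an explicit
// non-greedy prefix in front of the pattern.
function searchViaPrefix(source, str) {
  const anchored = new RegExp("^[\\s\\S]*?(" + source + ")");
  const m = anchored.exec(str);
  if (m === null) return null;
  // m[0] is prefix + match and m[1] is the match itself, so the
  // difference in length is the index where the match starts.
  return { index: m[0].length - m[1].length, text: m[1] };
}

const viaPrefix = searchViaPrefix("\\d+", "ab123c");
const native = /\d+/.exec("ab123c");
```

Both approaches report the match "123" at index 2. Dropping the prefix and keeping only the anchor gives the left-anchored behavior that /y provides at lastIndex.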

Certainly if the /y variant is adopted then V8 will implement it as if it were specified with \G. I.e., there would be two different regexps behind the scenes, one with and one without /y. This is similar to what would happen if you could specify /i at match time instead of compile time.

# Andy Chu (14 years ago)

\G is more flexible and it is rather similar to ^ conceptually.

The mental model happens to be out of sync with how regexps are implemented.  The implicit .*? at the start of a regexp is actually the fastest way to implement since you are using the fast internal search mechanisms that the regexp engine has rather than an external loop that repeatedly asks "does it match here?".

OK I see what you mean.

Actually the one I really care about is "pos". That one is obviously a property of the match, and not the RegExp. I can't really think of a real practical case where you would want a left-anchored match and a search on the same RegExp.

So now that I know "/y" is already part of the RegExp in Mozilla, and semantically (and implementation-wise) it is equivalent to removing the leading .*?, it wouldn't bother me if it stayed that way. But 'pos' should be an argument. Obviously you will want multiple "pos" arguments on the same RegExp.

(I think my vague recollection was the opposite of what I said -- that a leading .*? is faster, while .search() could be slower. It would make sense because the naive way to do .search() as you said is to try matching a character at a time.)

Andy

# Steve L. (14 years ago)

2010/2/12 Andy Chu <andy at chubot.org>:

Agree that \G breaks some logical barrier. I like to have a mental model of the implementation internals, and \G breaks that a bit.

If compatibility with Mozilla is not an issue, I actually prefer Python's approach of .search() vs. .match(). It's not a part of the regex; it's not a property of the regex; it's how you apply the regex to a string.

I think a lot of people (myself included) use a similar mental model, although it doesn't quite match the implementation details. But even according to this model, avoiding /y wouldn't keep ES regex flags pure as mere pattern attributes since ES already crossed that bridge with /g (and no one complains about it).

Though as mentioned, gracefully upgrading with ES3-5 is an issue, so I could only think of .exec() and .execLeft() for a left-anchored match.

Unlike execLeft, /y or \G would be useful not only with regex.exec/test, but also with string.match/replace/split; and all this in backwards-compatible fashion (unlike adding new arguments or methods). The name execLeft might also be misleading since I imagine it would cause global regexes to be lastIndex-anchored rather than left-anchored.

One thing I didn't bring up is that Python actually has an "endpos" argument. You do regex.search(s, 10, 20), and it will stop at position 20. I couldn't think of a real use case for this. But if anyone can think of one, that might be a consideration and sway things in favor of separate methods.

Are you proposing that execLeft would have different arguments than exec? I would not be a fan of that idea. Regarding the pos and endPos arguments you've proposed:

  • An endPos argument might be useful with the replace and match methods when using global regexes, but it would probably have limited use-cases with exec and makes no sense with execLeft (assuming endPos is the last position where a match can start). If, rather, it's the position at or before where the last match must end, it would offer nothing more than a possible slight performance improvement vs. regex.execLeft(str.slice(0, endPos)). Searching a sliced copy of the string would actually have the potential to perform better if it prevents having to complete a slow match attempt that goes beyond endPos before being discarded upon completion.

  • A pos argument (which you mentioned elsewhere) is unnecessary and confusing for test/exec/execLeft since the lastIndex property already provides the same functionality (albeit with the requirement that it's used with a global regex). I wish exec, test, replace, and match all offered a pos argument and that the lastIndex property didn't exist, but that ship has sailed.

I too like Python's regex methods, but I don't think that what Python does well here can easily be dropped on top of ES5.

Steven Levithan blog.stevenlevithan.com

# Steve L. (14 years ago)

2010/2/12 Erik Corry <erik.corry at gmail.com>:

Agree that \G breaks some logical barrier. I like to have a mental model of the implementation internals, and \G breaks that a bit.

\G is more flexible and it is rather similar to ^ conceptually.

The mental model happens to be out of sync with how regexps are implemented. The implicit .*? at the start of a regexp is actually the fastest way to implement since you are using the fast internal search mechanisms that the regexp engine has rather than an external loop that repeatedly asks "does it match here?".

\G and ^ become more dissimilar conceptually when \G is not used as the leading element. With or without /m, ^ can be thought of as a one-character lookbehind that doesn't need to know the current search position index. With something like /\d|\G\w/ or the contrived /(?=(x?))\1\G\w/ (equivalent to /(?>x?)\G\w/ or /\G[^\Wx]/), however, the regex pattern must be aware of its start position within the target string. I think this sort of mental model is commonly used, and it's because \G is harder to sync with this model that I'm comfortable with introducing /y even though it ignores the prior art and greater flexibility of \G. But then, compatibility with other regex flavors is probably good enough reason to choose \G over /y.

# Brendan Eich (14 years ago)

On Feb 13, 2010, at 12:14 AM, Steve L. wrote:

\G and ^ become more dissimilar conceptually when \G is not used as
the leading element. With or without /m, ^ can be thought of as a
one-character lookbehind that doesn't need to know the current
search position index. With something like /\d|\G\w/ or the
contrived /(?=(x?))\1\G\w/ (equivalent to /(?>x?)\G\w/ or /\G[^\Wx]/),
however, the regex pattern must be aware of its start
position within the target string. I think this sort of mental model
is commonly used, and it's because \G is harder to sync with this
model that I'm comfortable with introducing /y even though it
ignores the prior art and greater flexibility of \G. But then,
compatibility with other regex flavors is probably good enough
reason to choose \G over /y.

Not clear, since ES3 deviated from Perl and will not reconverge. The
committee is not going to standardize any lastIndex or pos mapping per
target string and regex pair, I am pretty sure.

By starting from Perl 4 (my fault, it was 1997), then upgrading to
Perl 5 with differences due to the need to specify what was codified
quirkily only in perl source code, while also trying only in places to
deal with Unicode well, we have painted ourselves into a corner.

Perl 6 of course breaks compatibility utterly, and good for it.
Perhaps with enough string- and byte-buffer performance, the JS
library ecosystem can promulgate a new class, call it RegEx, which has
the best consensus API, and built-in RegExp can try to fade away.

Anyway, when adding /y, we reckoned it was better to keep being
different than to try \G and invite complaints that we weren't the
same as the "genetic parent", Perl. Mutation is a bitch, and we're
quite the red-headed stepchild now ;-).

# Steve L. (14 years ago)

On Feb 14, 2010, at 1:40 AM, Brendan Eich wrote:

compatibility with other regex flavors is probably good enough reason to choose \G over /y.

Not clear, since ES3 deviated from Perl and will not reconverge. The committee is not going to standardize any lastIndex or pos mapping per target string and regex pair, I am pretty sure.

That seems a reasonable position for the committee to take. To be clear, are you suggesting that \G be ruled out, but leaving the door open for standardizing /y?

Perhaps with enough string- and byte-buffer performance, the JS library ecosystem can promulgate a new class, call it RegEx, which has the best consensus API, and built-in RegExp can try to fade away.

That would be quite interesting to see, but since there are just a few, mostly tolerable design issues with the existing ES RegExp API, I can only envision a switch to a JS library over built-in RegExp being justified if the library included pretty much all the features of ES regexes and added significant new features including better Unicode support. In the near term, the resulting file size would pretty much assure low adoption in browser-land.

Since I don't have much hope for a reasonable alternative to RegExp, I think it's important to continue looking at RegExp improvements for future ES versions. I have a laundry list of desired changes and additions, but the three things I'd like to see first are /x, /y, and atomic groups. But now I'm off topic again, so I'll just leave it there. :)

Steven Levithan blog.stevenlevithan.com

# Mike Samuel (14 years ago)

2010/2/15 Steve L. <steves_list at hotmail.com>:

On Feb 14, 2010, at 1:40 AM, Brendan Eich wrote:

compatibility with other regex flavors is probably good enough reason to choose \G over /y.

Not clear, since ES3 deviated from Perl and will not reconverge. The committee is not going to standardize any lastIndex or pos mapping per target string and regex pair, I am pretty sure.

That seems a reasonable position for the committee to take. To be clear, are you suggesting that \G be ruled out, but leaving the door open for standardizing /y?

Perhaps with enough string- and byte-buffer performance, the JS  library ecosystem can promulgate a new class, call it RegEx, which has  the best consensus API, and built-in RegExp can try to fade away.

That would be quite interesting to see, but since there are just a few, mostly tolerable design issues with the existing ES RegExp API, I can only envision a switch to a JS library over built-in RegExp being justified if the library included pretty much all the features of ES regexes and added significant new features including better Unicode support. In the near term, the resulting file size would pretty much assure low adoption in browser-land.

If one of the features that you're concerned about is a convenient literal syntax, then the quasis strawman could fill that gap.

# Brendan Eich (14 years ago)

On Feb 14, 2010, at 6:04 AM, Steve L. wrote:

On Feb 14, 2010, at 1:40 AM, Brendan Eich wrote:

compatibility with other regex flavors is probably good enough
reason to choose \G over /y.

Not clear, since ES3 deviated from Perl and will not reconverge.
The committee is not going to standardize any lastIndex or pos
mapping per target string and regex pair, I am pretty sure.

That seems a reasonable position for the committee to take. To be
clear, are you suggesting that \G be ruled out, but leaving the door
open for standardizing /y?

Yes, I think \G without full Perl compatibility is less desirable
than /y -- but I would want some solution here, and it should be on
the Harmony agenda.

Perhaps we need you to seed that agenda. Start a new thread to remind
us of your 95, or just 5 (I hope ;-), theses nailed to the ES3
cathedral door.

Perhaps with enough string- and byte-buffer performance, the JS
library ecosystem can promulgate a new class, call it RegEx, which
has the best consensus API, and built-in RegExp can try to fade
away.

That would be quite interesting to see, but since there are just a
few, mostly tolerable design issues with the existing ES RegExp API,
I can only envision a switch to a JS library over built-in RegExp
being justified if the library included pretty much all the features
of ES regexes and added significant new features including better
Unicode support. In the near term, the resulting file size would
pretty much assure low adoption in browser-land.

Yes, it's a long term hope, or possibly dream.

Since I don't have much hope for a reasonable alternative to RegExp,
I think it's important to continue looking at RegExp improvements
for future ES versions. I have a laundry list of desired changes and
additions, but the three things I'd like to see first are /x, /y,
and atomic groups. But now I'm off topic again, so I'll just leave
it there. :)

To fuel this fire (separate thread is fine, all your theses at once),
/x already caused ES4 headaches which Waldemar argued are insuperable,
to do with semicolon insertion.

# Andy Chu (14 years ago)

(belated followup)

I think a lot of people (myself included) use a similar mental model, although it doesn't quite match the implementation details. But even according to this model, avoiding /y wouldn't keep ES regex flags pure as mere pattern attributes since ES already crossed that bridge with /g (and no one complains about it).

So now that Erik Corry pointed out that /y as a compilation option matches the implementation, as mentioned I don't mind leaving it out of the method parameters. The "pos" is the one that still should be a method argument.

Can we summarize this discussion on a wiki page somewhere? I think it more or less came to a conclusion : )

Though as mentioned, gracefully upgrading with ES3-5 is an issue, so I could only think of .exec() and .execLeft() for a left-anchored match.

Unlike execLeft, /y or \G would be useful not only with regex.exec/test, but also with string.match/replace/split; and all this in backwards-compatible fashion (unlike adding new arguments or methods). The name execLeft might also be misleading since I imagine it would cause global regexes to be lastIndex-anchored rather than left-anchored.

Yeah true. /y works for me. I wasn't really proposing execLeft anymore, just considering its advantages.

Regarding the pos and endPos arguments you've proposed:

  • An endPos argument might be useful with the replace and match methods when using global regexes, but it would probably have limited use-cases with exec and makes no sense with execLeft (assuming endPos is the last position where a match can start). If, rather, it's the position at or before where the last match must end, it would offer nothing more than a possible slight performance improvement vs. regex.execLeft(str.slice(0, endPos)). Searching a sliced copy of the string would actually have the potential to perform better if it prevents having to complete a slow match attempt that goes beyond endPos before being discarded upon completion.

The pos and endpos arguments would be synonymous with slicing the string, but more efficient. In fact the Python docs say explicitly that they're equivalent.

I think so far everyone agrees with the 'pos' optimization. I'm not going to push for endpos but I thought I'd mention it.

  • A pos argument (which you mentioned elsewhere) is unnecessary and confusing for test/exec/execLeft since the lastIndex property already provides the same functionality (albeit with the requirement that it's used with a global regex). I wish exec, test, replace, and match all offered a pos argument and that the lastIndex property didn't exist, but that ship has sailed.

Well, the proposal is that pos, if passed, overrides lastIndex. Is there something wrong with that?

The point is that the same RegExp() instance can be used in concurrent contexts without stomping on lastIndex.

I too like Python's regex methods, but I don't think that what Python does well here can easily be dropped on top of ES5.

Well I'm only proposing a very small subset. And I think JS has already admitted its influence by Python. Putting \G in because Perl uses it is a bad reason, since following Perl to Perl 5 regexps would be an atrocity : ) Unless someone is seriously arguing for \G, I think it should be ruled out. As you mentioned, it's more complicated to implement since \G doesn't have to be at the beginning.

thanks, Andy

# Andy Chu (14 years ago)

Not clear, since ES3 deviated from Perl and will not reconverge. The committee is not going to standardize any lastIndex or pos mapping per target string and regex pair, I am pretty sure.

I'm glad we're not following Perl to Perl 5 regexps.

Not clear on the "pos per pair" statement. All that I'm proposing is the following logic for .exec() and .test():

First try the parameter pos to find the start position.
If pos is unspecified, use the .lastIndex member of the RegExp instance.
If both are undefined, use 0.

The only thing different is the first line. Is there something wrong with that?
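The fallback order described above can be approximated in today's JavaScript with a wrapper. This is a sketch under assumptions: `execAt` is a hypothetical name, it requires the g flag so that lastIndex is honored, and restoring lastIndex after an explicit pos is one possible design choice, not something the proposal states.

```javascript
// Hypothetical helper: exec with an optional explicit start position.
function execAt(re, str, pos) {
  if (!re.global) {
    throw new TypeError("execAt: regex needs the g flag so lastIndex is honored");
  }
  var saved = re.lastIndex;          // a RegExp's lastIndex defaults to 0
  var explicit = pos !== undefined;
  re.lastIndex = explicit ? pos : saved;
  var m = re.exec(str);
  if (explicit) {
    re.lastIndex = saved;            // assumption: explicit pos leaves state untouched
  }
  return m;
}

var word = /\w+/g;
var text = "alpha beta gamma";
var a = execAt(word, text, 6);       // starts at index 6 and finds "beta"
var b = execAt(word, text);          // no pos: uses lastIndex (still 0), finds "alpha"
```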

andychu.net/ecmascript/RegExp-Enhancements-2.html

Anyway, when adding /y, we reckoned it was better to keep being different than to try \G and invite complaints that we weren't the same as the "genetic parent", Perl. Mutation is a bitch, and we're quite the red-headed stepchild now ;-).

Even Perl 6 has let go of Perl 5 regexps, so I don't think anyone will complain. I definitely won't.

Andy

# Steve L. (14 years ago)

On Feb 17, 2010, at 12:45 AM, Brendan Eich wrote:

I think \G without full Perl compatibility is less desirable than /y -- but I would want some solution here, and it should be on the Harmony agenda.

Prior to and after this discussion (I wavered a bit in the middle), I'm inclined to agree with you on /y vs. \G. And I'd be very happy to see a solution on the Harmony agenda, too.

One more thing regarding /y that just came to mind is that if at some future point ECMAScript were to add support for embedded mode modifiers (e.g., (?im), (?-im), (?im:...), etc.), which are widely supported elsewhere and fairly useful, (?y) would not make sense and would probably need to be an error. No big deal though since the same would be true for (?g).

Perhaps we need you to seed that agenda. Start a new thread to remind us of your 95, or just 5 (I hope ;-), theses nailed to the ES3 cathedral door.

Heh. :-) I've posted half of a response at blog.stevenlevithan.com/archives/fixing-javascript-regexp , and within the next couple weeks I'll try to follow up on es-discuss with a write-up that excludes the less realistic change proposals from that page and adds suggested new features (including /y). I'm very interested in which proposals from that page you think are most likely to gain any traction, and which might not be worth raising for serious consideration.

To fuel this fire (separate thread is fine, all your theses at once), /x already caused ES4 headaches which Waldemar argued are insuperable, to do with semicolon insertion.

Bummer. Do you have a pointer to the related discussion?

# Steve L. (14 years ago)

On February 23, 2010 8:50 AM, Andy Chu wrote:

So now that Erik Corry pointed out that /y as a compilation option matches the implementation, as mentioned I don't mind leaving it out of the method parameters. The "pos" is the one that still should be a method argument. [...] /y works for me. I wasn't really proposing execLeft anymore, just considering its advantages.

If we agree on /y, is the remainder of your proposal to simply add a pos (and possibly endPos) argument to the exec and test methods? I'd be all for that if the lastIndex property was also deprecated. I've argued for the same thing at blog.stevenlevithan.com/archives/fixing-javascript-regexp

The pos and endpos arguments would be synonymous with slicing the string, but more efficient. In fact the Python docs say explicitly that they're equivalent.

I think so far everyone agrees with the 'pos' optimization. I'm not going to push for endpos but I thought I'd mention it.

pos is useful beyond optimization--e.g., it would let you easily iterate over regex matches similar to how lastIndex can already be used. endPos is not nearly as useful. I'm not opposed to it, but I think any pushback you'd receive about adding marginally-useful arguments to existing methods would be warranted (i.e., like you, I won't push for it).
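To make the iteration use case concrete, here is a sketch of a match loop that keeps the position in a local variable. Since exec has no pos parameter, the loop assigns lastIndex before each call as a stand-in; `allMatches` is a hypothetical helper name and it assumes the regex has the g flag.

```javascript
// Hypothetical helper: collect all matches, tracking the position locally.
function allMatches(re, str) {
  var out = [];
  var pos = 0;
  var m;
  while (pos <= str.length) {
    re.lastIndex = pos;              // stand-in for a hypothetical exec(str, pos)
    m = re.exec(str);
    if (m === null) break;
    out.push(m[0]);
    // Step past empty matches so the loop always advances.
    pos = m.index + (m[0].length || 1);
  }
  re.lastIndex = 0;                  // leave the shared instance in a known state
  return out;
}

var numbers = allMatches(/\d+/g, "a1 b22 c333");
```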

Well, the proposal is that pos, if passed, overrides lastIndex. Is there something wrong with that?

The point is that the same RegExp() instance can be used in concurrent contexts without the stomping on lastIndex.

Agreed--that's how pos should work, if lastIndex is deprecated. But I think having two mechanisms (pos and lastIndex) for setting the search start position is a bad idea. If a pos argument exists, you'd expect pos to be 0 when it's not specified. Having lastIndex around to sometimes screw up this expectation is confusing and will probably cause latent bugs. There are good reasons to deprecate lastIndex (you've mentioned one here, and another is that it only works with global regexes), so I think you should make this part of your proposal. Deprecating lastIndex should be a means toward the end of removing it altogether, after which pos would become more intuitive.

I think JS has already admitted its influence by Python.

ES's RegExp has probably zero Python influence. Python's re has plenty of Perl influence.

Putting \G in because Perl uses it is a bad reason, since following Perl to Perl 5 regexps would be an atrocity : ) Unless someone is seriously arguing for \G, I think it should be ruled out. As you mentioned, it's more complicated to implement since \G doesn't have to be at the beginning.

After this recent back and forth (or as a consequence of it), it seems \G has little if any support on es-discuss (I agree on ruling it out). But why do you say following Perl 5 regexes would be an atrocity? Perl/PCRE are still the leaders that other regex packages mostly follow, and IMO they constitute the state of the art for modern regex packages (although you could maybe point to Tcl as a counterargument). No one here is arguing for the more out-there Perl regex features like embedded code or backtracking control verbs, but there are plenty of useful Perl regex extensions that ES could benefit from, many of which have already been picked up by Java/.NET/Oniguruma/etc. Of course, such features should be considered on a case by case basis, but I have a bias towards regex library compatibility and not inventing new syntax/flags to address problems that have already been solved. Developers like their regexes to be portable, and a common gripe about regexes is that they're not more portable than they are.

I plan on writing up a list of features from other regex libraries (mostly Perl) that I think would be useful to consider for ES. It seems, though, that some people simply don't like the idea of making regular expressions--terse and supposedly-unreadable as they are--even more powerful. (Not referring to you; this is just something I see fairly often.) I do not like the idea of excluding useful features for the sake of supposed purity or saving people who don't understand backtracking from themselves.

--Steven Levithan

# Andy Chu (14 years ago)

If we agree on /y, is the remainder of your proposal to simply add a pos (and possibly endPos) argument to the exec and test methods? I'd be all for that if the lastIndex property was also deprecated. I've argued for the same thing at blog.stevenlevithan.com/archives/fixing-javascript-regexp

Yes, that's all.

Well, the proposal is that pos, if passed, overrides lastIndex.  Is there something wrong with that?

The point is that the same RegExp() instance can be used in concurrent contexts without the stomping on lastIndex.

Agreed--that's how pos should work, if lastIndex is deprecated. But I think having two mechanisms (pos and lastIndex) for setting the search start position is a bad idea. If a pos argument exists, you'd expect pos to be 0 when it's not specified. Having lastIndex around to sometimes screw up this expectation is confusing and will probably cause latent bugs. There are good reasons to deprecate lastIndex (you've mentioned one here, and another is that it only works with global regexes), so I think you should make this part of your proposal. Deprecating lastIndex should be a means toward the end of removing it altogether, after which pos would become more intuitive.

I guess it depends on how much ES6+ want to diverge from ES3/5.

I don't really think the lastIndex property needs to be deprecated for the pos argument. I think we are talking about literally a quarter of a line of code, e.g. in exec()/test():

function exec(s, pos) { var positionToStartFrom = pos || this.lastIndex || 0; this.lastIndex = ... call regex engine ... }

There are APIs designed from the "ground up" this way. The method argument overrides internal state.
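One detail worth noting about the one-liner above: `pos || this.lastIndex` treats an explicit pos of 0 as "not passed", because 0 is falsy. The sketch below contrasts the two ways of resolving the start position; both function names are made up for illustration.

```javascript
// Resolution as written in the one-liner: 0 falls through to lastIndex.
function startWithOr(pos, lastIndex) {
  return pos || lastIndex || 0;
}

// Resolution that only falls back when pos is really absent.
function startExplicit(pos, lastIndex) {
  return pos !== undefined ? pos : (lastIndex || 0);
}

var viaOr = startWithOr(0, 5);          // 5: the explicit 0 is ignored
var viaCheck = startExplicit(0, 5);     // 0: the explicit 0 wins
var fallback = startExplicit(undefined, 5);  // 5: falls back to lastIndex
```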

I can see your point about how it might be confusing if you omit 'pos'. But I guess I just don't think it's that big a deal; it's a judgement call. People already know the old behavior, and there will be code on the Web that uses it until the end of time. They just have to know one new thing.

I like your idea on the blog post of cleaning up the global flag. The current /g behavior makes for a very un-orthogonal API.

Putting \G in because Perl uses it is a bad reason, since following Perl to Perl 5 regexps would be an atrocity : )  Unless someone is seriously arguing for \G, I think it should be ruled out.  As you mentioned, it's more complicated to implement since \G doesn't have to be at the beginning.

After this recent back and forth (or as a consequence of it), it seems \G has little if any support on es-discuss (I agree on ruling it out).  But why do you say following Perl 5 regexes would be an atrocity? Perl/PCRE are still the leaders that other regex packages mostly follow, and IMO they constitute the state of the art for modern regex packages (although you could maybe point to Tcl as a counterargument). No one here is arguing for the more out-there Perl regex features like embedded code or backtracking control verbs, but there are plenty of useful Perl regex extensions that ES could benefit from, many of which have already been picked up by

All I'm saying is that adding \G to be consistent with Perl is not a good reason. And no one is really arguing for \G so it's a moot point.

I agree that Perl has useful stuff that JavaScript doesn't have. But it's a slippery slope because even Perl 6 has admitted that Perl 5 regexes got out of hand. (dev.perl.org/perl6/doc/design/apo/A05.html, "First, let me enumerate some of the things that are wrong with current regex culture.")

Java/.NET/Oniguruma/etc.  Of course, such features should be considered on a case by case basis, but I have a bias towards regex library compatibility and not inventing new syntax/flags to address problems that have already been solved. Developers like their regexes to be portable, and a common gripe about regexes is that they're not more portable than they are.

Consistency with Perl should be considered but IMHO it's not a strong consideration -- if Perl 5 has solved it one way, it doesn't mean it's the best way. It's worth looking at where Perl 6 breaks with Perl 5. It's not possible to copy Perl 6 because of compatibility, but it is a good sign that a Perl 5 solution was not satisfactory in the long term.

Agree that we should take it on a case-by-case basis. In this case it sounds like /y, which has nothing to do with Perl, is good. (If I were to nitpick I would say /y for "sticky" is silly, I would call it /a or /n for "anchored")

I plan on writing up a list of features from other regex libraries (mostly Perl) that I think would be useful to consider for ES. It seems, though, that some people simply don't like the idea of making regular expressions--terse and supposedly-unreadable as they are--even more powerful. (Not referring to you; this is just something I see fairly often.) I do not like the idea of excluding useful features for the sake of supposed purity or saving people who don't understand backtracking from themselves.

It's a hard problem, because adding power and keeping compatibility and keeping sane syntax are all at odds with each other. The people who want to keep regexes simple have a point.

I think focusing on use cases is the important thing. Certainly the tokenization use case has been run into multiple times, thus /y is justified.
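For readers unfamiliar with the tokenization use case, here is a minimal sketch of a lexer built on /y. It assumes an engine that supports the sticky flag (a Mozilla extension at the time of this thread, later standardized in ES2015). The rule names and token format are invented for the example.

```javascript
// Minimal sticky-flag tokenizer; rule names and output format are illustrative.
function tokenize(input) {
  var rules = [
    ["number", /\d+/y],
    ["ident", /[A-Za-z_]\w*/y],
    ["space", /\s+/y],
    ["punct", /[-+*\/=()]/y]
  ];
  var tokens = [];
  var pos = 0;
  while (pos < input.length) {
    var matched = false;
    for (var i = 0; i < rules.length; i++) {
      var re = rules[i][1];
      re.lastIndex = pos;            // /y only tries a match exactly at pos
      var m = re.exec(input);
      if (m !== null) {
        if (rules[i][0] !== "space") {
          tokens.push(rules[i][0] + ":" + m[0]);
        }
        pos = re.lastIndex;          // exec advanced lastIndex past the token
        matched = true;
        break;
      }
    }
    if (!matched) {
      throw new SyntaxError("Unexpected character at index " + pos);
    }
  }
  return tokens;
}

var toks = tokenize("x = 42+y");
```

Without /y, each rule would either need a `^` applied to a sliced copy of the input, or a global regex plus a check that `m.index` equals the current position.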

Personally I haven't "wanted" for anything in Python's regexps (actually the only thing is to capture repeated groups, like ((?P<foo>\d+),)* could capture a comma separated list of integers, but don't know if Perl even has that). I know Perl has a ton more stuff, but I am biased toward just writing procedural code combined with regexes. Certainly all the inline code stuff seems like a huge mess to me.
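The repeated-group limitation applies to JavaScript as well: a capturing group inside a repetition keeps only its last iteration. A small sketch, using an unnamed group since named capture was not available in ES5, along with the usual procedural workaround:

```javascript
// The group runs three times, but only the last iteration is kept.
var whole = /^(?:(\d+),)*$/.exec("1,2,3,");
var lastOnly = whole[1];                    // "3"

// Procedural workaround: collect each item with a global match instead.
var items = "1,2,3,".match(/\d+(?=,)/g);    // every integer followed by a comma
```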

I'll look at your other proposals in more detail. As mentioned I think the /g part is good.

Andy

# Andy Chu (14 years ago)

Heh. :-) I've posted half of a response at blog.stevenlevithan.com/archives/fixing-javascript-regexp , and within the next couple weeks I'll try to follow up on es-discuss with a write-up that excludes the less realistic change proposals from that page and adds suggested new features (including /y). I'm very interested in which proposals from that page you think are most likely to gain any traction, and which might not be worth raising for serious consideration.

I looked it over again. All of it seems good to me, except:

  1. The 2 things about backreferences. I haven't really run into these personally so I guess I'm agnostic. But the arguments about why they're not bad for compatibility aren't super compelling. I guess I question whether enough people have run into it to justify the potential breakage.

  2. The personal preference stuff I did not really evaluate, but my bias would be toward keeping compatibility.

(If it's really a goal to create an entirely new RegEx (no p) class, those things could be addressed there. Although I think that proposal is problematic too since it is a burden on implementers to have 1.5 regex implementations.)

Andy

# Steve L. (14 years ago)

Chopping up your replies a bit...

On March 03, 2010 10:28 AM, Andy Chu wrote:

I agree that Perl has useful stuff that JavaScript doesn't have. But it's a slippery slope because even Perl 6 has admitted that Perl 5 regexes got out of hand. [...] It's a hard problem, because adding power and keeping compatibility and keeping sane syntax are all at odds with each other. [...] I think focusing on use cases is the important thing.

Yes, I tend to agree with this, along with your other related points, generally. But I believe Perl 6's radical redesign (or any other dramatic RegExp changes that would be similarly disruptive) has already been ruled out for ES. So we're stuck with RegExp (which isn't all that bad really--at least it has cross-language familiarity going for it), and therefore I'd rather see it improve than stagnate and perpetuate its problems. The great thing is, if you want to add new features, other regex flavors offer an abundance of options to consider.

Agree that we should take it on a case-by-case basis. In this case it sounds like /y, which has nothing to do with Perl, is good. (If I were to nitpick I would say /y for "sticky" is silly, I would call it /a or /n for "anchored")

I believe the y actually comes from "yylex". ES4 proposals had originally called for the corresponding property to be named anchored or nosearch before changing it to sticky to avoid overloading the term anchored (which is commonly used to describe regexes starting with ^ or ending with $) and to better fit with the letter y. If the name changed to anchored, I'd prefer /a over /n. Flag n (as in "(?n)") is already used by .NET for the ExplicitCapture option, which is a feature that might be worth considering for ES6+. I don't mind "sticky", personally.
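The distinction between "anchored" in the `^` sense and the behavior of /y can be seen in a short sketch. It assumes an engine with sticky-flag support; the strings are arbitrary.

```javascript
var s = "foo bar";

// "^" anchors to the start of the input, regardless of lastIndex.
var caret = /^bar/g;
caret.lastIndex = 4;
var c = caret.exec(s);        // null

// /y anchors the attempt to lastIndex and never scans forward.
var sticky = /bar/y;
sticky.lastIndex = 4;
var y = sticky.exec(s);       // matches "bar" at index 4
sticky.lastIndex = 3;
var miss = sticky.exec(s);    // null: index 3 is a space
```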

Heh. :-) I've posted half of a response at blog.stevenlevithan.com/archives/fixing-javascript-regexp , and within the next couple weeks I'll try to follow up on es-discuss with a write-up that excludes the less realistic change proposals from that page and adds suggested new features (including /y). I'm very interested in which proposals from that page you think are most likely to gain any traction, and which might not be worth raising for serious consideration.

I looked it over again. All of it seems good to me, except:

  1. The 2 things about backreferences. I haven't really run into these personally so I guess I'm agnostic. But the arguments about why they're not bad for compatibility aren't super compelling. I guess I question whether enough people have run into it to justify the potential breakage.

Although I definitely appreciate the feedback, I disagree about both the compatibility impact (it would be nice to have some hard data/evidence about this, of course) and the value of fixing ES's backreference specification bugs.

(If it's really a goal to create an entirely new RegEx (no p) class, those things could be addressed there. Although I think that proposal is problematic too since it is a burden on implementers to have 1.5 regex implementations.)

Regarding "RegEx", I'm pretty certain Brendan was talking about a hypothetical new library (name unimportant) that the JavaScript community might create in the future. I don't think anyone has suggested adding a second regular expression class to the ES spec itself (and anyone who wanted to do so without getting ignored or laughed at had better have some damn compelling justification). Perhaps a new library is a long-term way to address issues with RegExp, but if the problems can be fixed in RegExp itself, obviously that would be for the best (assuming the changes don't cause significant compatibility problems).

--Steven Levithan

# Brendan Eich (14 years ago)

On Mar 3, 2010, at 11:44 PM, Steve L. wrote:

(If it's really a goal to create an entirely new RegEx (no p) class, those things could be addressed there. Although I think that proposal is problematic too since it is a burden on implementers to have 1.5 regex implementations.)

Regarding "RegEx", I'm pretty certain Brendan was talking about a hypothetical new library (name unimportant) that the JavaScript community might create in the future. I don't think anyone has suggested adding a second regular expression class to the ES spec itself

Right. I was explicit about hypothesizing an ecosystem solution, independent of the standard and predicated on fast JS engines. This is not meant to preempt your great work making RegExp incrementally better.

We (by which I mean "you" with edit access from me ;-) should turn your blog post into one or more strawman:proposals entries at .

# Andy Chu (14 years ago)

Regarding "RegEx", I'm pretty certain Brendan was talking about a hypothetical new library (name unimportant) that the JavaScript community might create in the future. I don't think anyone has suggested adding a

I worked on something like this: code.google.com/p/json-pattern . It is more a rethinking of regular expressions than just fixing existing bugs.

It is somewhat dormant now; I planned to port it to JavaScript (from Python) but unfortunately haven't gotten around to it. The lack of standard /y definitely blocks it. I guess I will have to just use the proprietary /y extension for now.

My belief is that regular expressions are hobbled by their syntax. If they didn't have such bad syntax (^ means either negation or the start of a string; you sometimes negate with ^ and sometimes negate with capitalization, the whole (? nightmare, etc.), then people would write large, useful and fast regexes and no one would bat an eye.

Highlights:

  • You can capture an entire (recursive) JSON structure with named and repeated elements (a generalization of named capture). JavaScript currently just allows you to capture individual numbered values.
  • There are extensible filters (pipes) for converting values. You can capture \d+ to the number 3 rather than the string "3"; you can write a filter to convert "3-2-2009" to a Date() instance, etc.
  • Pattern reuse / composition (nicer than Perl's)
  • More readable and more consistent syntax (I wasn't completely happy with where I ended up, but I have some unimplemented improvements)
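The filter idea in the list above can be loosely imitated in plain JavaScript. The sketch below is not json-pattern's syntax or API; `capture` is a hypothetical helper that applies an optional conversion function to each numbered group.

```javascript
// Hypothetical helper: run a regex and convert each capture with a filter.
function capture(re, str, filters) {
  var m = re.exec(str);
  if (m === null) return null;
  var out = [];
  for (var i = 1; i < m.length; i++) {
    var f = filters[i - 1];
    out.push(f ? f(m[i]) : m[i]);    // no filter: keep the raw string
  }
  return out;
}

var r = capture(/(\d+) (\w+)/, "3 apples", [Number]);
// r[0] is the number 3 rather than the string "3"; r[1] stays "apples"
```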

Here is an example of a single expression that parses 'ls -al' output: chubot.org/json-pattern/test-cases/testLs.html .

It's interesting that regular expression syntax in computers has been evolving continuously for over 40 years now. It's probably older than any language still in common use except perhaps FORTRAN.

Andy

# Steve L. (14 years ago)

On March 04, 2010 11:40 AM, Andy Chu wrote:

Regarding "RegEx", I'm pretty certain Brendan was talking about a hypothetical new library (name unimportant) that the JavaScript community might create in the future. I don't think anyone has suggested adding a

I worked on something like this: code.google.com/p/json-pattern . It is more a rethinking of regular expressions than just fixing existing bugs.

Thanks for the link. I'm always interested in projects like this.

My belief is that regular expressions are hobbled by their syntax. If they didn't have such bad syntax (^ means either negation or the start of a string; you sometimes negate with ^ and sometimes negate with capitalization, the whole (? nightmare, etc.), then people would write large, useful and fast regexes and no one would bat an eye.

Syntax is a part of it. But also, backtracking. Despite backtracking being a big part of what makes regexes so expressive and powerful (and therefore contributing to their popularity), truly understanding backtracking is complicated for many people at first and it's easy for backtracking to get out of hand or cause unexpected results if you're not careful. I think the effects of backtracking are at least as big a reason for some people's reservations about regexes as is the syntax.

People do write large, useful, fast (and relatively readable) regexes. ES's lack of /x hampers this, to a significant extent. As do some missing features like named capture and atomic groups/possessive quantifiers. The "large" part is kind of beside the point, though--it would often be appropriate to split a regex into smaller parts regardless of what their syntax was.
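In the absence of /x, one common way to keep a larger pattern readable is to assemble it from named string parts with the RegExp constructor. A minimal sketch, with an arbitrary date pattern chosen purely for illustration:

```javascript
// Each part is a plain string; backslashes must be doubled inside strings.
var year = "(\\d{4})";
var month = "(0[1-9]|1[0-2])";
var day = "(0[1-9]|[12]\\d|3[01])";

var isoDate = new RegExp("^" + year + "-" + month + "-" + day + "$");

var d = isoDate.exec("2010-03-04");   // d[1] "2010", d[2] "03", d[3] "04"
var rejected = isoDate.test("2010-13-04");  // false: 13 is not a valid month
```

The cost of this approach is the doubled escaping and the loss of regex literal syntax, which is part of why free-spacing mode is attractive.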

Highlights:

  • You can capture an entire (recursive) JSON structure with named and repeated elements (a generalization of named capture). JavaScript currently just allows you to capture individual numbered values.

Possibly of interest: xregexp.com That's my library where I play with some regex ideas and hack support for new syntax/flags into existing RegExps. It needs some cleanup and changes, and I wouldn't recommend actually using it in production applications, but it adds comprehensive named capture support to existing RegExps. There's also some shitty recursive matching support via a plugin.

  • There are extensible filters (pipes) for converting values. You can capture \d+ to the number 3 rather than the string "3"; you can write a filter to convert "3-2-2009" to a Date() instance, etc.
  • Pattern reuse / composition (nicer than Perl's)
  • More readable and more consistent syntax (I wasn't completely happy with where I ended up, but I have some unimplemented improvements)

Again, thanks for the link. I'll try to look over it sometime.

# Steve L. (14 years ago)

On March 04, 2010 11:07 AM, Brendan Eich wrote:

Right. I was explicit about hypothesizing an ecosystem solution, independent of the standard and predicated on fast JS engines. This is not meant to preempt your great work making RegExp incrementally better.

We (by which I mean "you" with edit access from me ;-) should turn your blog post into one or more strawman:proposals entries at

Coolness. I can't focus on this at the moment, but when I have some time I'll follow up with you about it. Thanks.

--Steven Levithan