RegExps that don't modify global state?

# Domenic Denicola (10 years ago)

I had a conversation with Jaswanth at JSConf EU that revealed that RegExps cannot be used in parallel JS because they modify global state, i.e. RegExp.$0 and friends.

We were thinking it would be nice to find some way of getting rid of this wart. One idea would be to bundle the don't-modify-global-state behavior with the /u flag. Another would be to introduce a new flag to opt-out. The former is a bit more attractive since people will probably want to use /u all the time anyway. I imagine there might be other possibilities others can think of.

I also noticed today that the static RegExp properties are not specced, which seems at odds with our new mandate to at least Annex B-ify the required-for-web-compat stuff.

# Allen Wirfs-Brock (10 years ago)

On Sep 16, 2014, at 11:16 AM, Domenic Denicola wrote:

I had a conversation with Jaswanth at JSConf EU that revealed that RegExps cannot be used in parallel JS because they modify global state, i.e. RegExp.$0 and friends.

We were thinking it would be nice to find some way of getting rid of this wart. One idea would be to bundle the don't-modify-global-state behavior with the /u flag. Another would be to introduce a new flag to opt-out. The former is a bit more attractive since people will probably want to use /u all the time anyway. I imagine there might be other possibilities others can think of.

I also noticed today that the static RegExp properties are not specced, which seems at odds with our new mandate to at least Annex B-ify the required-for-web-compat stuff.

Yes, they should be in Annex B. But that means that somebody needs to write a spec. that defines their behavior.

We could then add that extension to clause 16.1 as being forbidden for RegExps created with the /u flag.

# Andrea Giammarchi (10 years ago)

I personally find the re.test(str) case a good reason to keep RegExp.$1 and friends available, instead of having to test and then separately grab a match (redundant, slower, etc.).

As mentioned already /u will be used by default as soon as supported; having this implicit opt-out feels very wrong to me since /u meaning is completely different.

Moreover, AFAIK JavaScript is single-threaded per event loop, so I don't see how conflicts are possible if parallel execution is performed elsewhere, where globals will also (will they?) be a part, as every sandbox/iframe/worker has worked until now.

I'd personally +1 an explicit opt-out and indifferently accept a re-opt as a modifier such as /us, where s would mean stateful (or any other char would do, as long as RegExp.prototype.test won't lose its purpose and power).

P.S. there's no such thing as RegExp.$0, but RegExp['$&'] will provide the (probably) intended result.

P.S. to know more about RegExp and these properties, my old slides from the BerlinJS event should do: webreflection.blogspot.co.uk/2012/02/berlin-js-regexp-slides.html
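
For readers who have not met these statics, here is a minimal sketch of what a successful match leaves behind. It assumes an engine that implements the legacy static properties (browsers and Node.js do); the sample regex and string are made up for illustration.

```javascript
// Legacy static properties on the RegExp constructor.
// Assumption: the engine implements them (e.g. V8, SpiderMonkey).
const matched = /b(c)/.test("abcd");

const wholeMatch = RegExp["$&"];    // the last matched substring
const sameThing = RegExp.lastMatch; // long-form alias of $&
const firstGroup = RegExp.$1;       // first capture group of the last match
const noSuchThing = RegExp.$0;      // there is no $0 static property

console.log(matched, wholeMatch, sameThing, firstGroup, noSuchThing);
```

The values are read immediately after the match, which is the only point at which they are reliable.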

# Alex Kocharin (10 years ago)

An HTML attachment was scrubbed... URL: esdiscuss/attachments/20140917/4ae6be75/attachment-0001

# Jussi Kalliokoski (10 years ago)

On Wed, Sep 17, 2014 at 3:21 AM, Alex Kocharin <alex at kocharin.ru> wrote:

What's the advantage of re.test(str); RegExp.$1 over let m=re.match(str); m[1]?

Nothing. However, with control structures it removes a lot of awkwardness:

  • if ( /foo:(\d+)/.test(str) && parseInt(RegExp.$1, 10) > 15 ) { ...
  • if ( /name: (\w+)/.test(str) ) { var name = RegExp.$1; ...

I personally find this functionality very useful and would be saddened if /u, which I want to use, all of a sudden broke this feature. Say what you mean. Unicode flag disabling features to enable parallelism is another footnote for WTFJS.

# Steve Fink (10 years ago)

On 09/16/2014 10:13 PM, Jussi Kalliokoski wrote:

On Wed, Sep 17, 2014 at 3:21 AM, Alex Kocharin <alex at kocharin.ru> wrote:

What's the advantage of `re.test(str); RegExp.$1` over `let m=re.match(str); m[1]`?

Nothing. However, with control structures it removes a lot of awkwardness:

  • if ( /foo:(\d+)/.test(str) && parseInt(RegExp.$1, 10) > 15 ) { ...
  • if ( /name: (\w+)/.test(str) ) { var name = RegExp.$1; ...

Is

if ((m = /foo:(\d+)/.exec(str)) && parseInt(m[1], 10) > 15) { ... }

so bad? JS assignment is an expression; make use of it. Much better than Python's refusal to allow such a thing, requiring indentation trees of doom or hacky workarounds when you just want to case-match a string against a couple of regexes.

The global state is bad, and you don't need turns or parallelism to be bitten by it.

function f(s) {
  if (/foo(\d+)/.test(s)) {
    print("Found in " + formatted(s));
    return RegExp.$1; // Oops! formatted() does a match internally.
  }
}

Global variables are bad. They halfway made sense in Perl, but even the Perl folks wish they'd been lexical all along.
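
A runnable sketch of the hazard described above, assuming an engine that implements the legacy `RegExp.$1` static. The `formatted` helper and the function names `viaStatics` and `viaLocal` are hypothetical; the only point is that any intervening match overwrites the global.

```javascript
// Hypothetical helper that happens to run its own regex internally.
function formatted(s) {
  const m = /^(\w)/.exec(s);
  return m ? m[1].toUpperCase() + s.slice(1) : s;
}

// Reads the global after an intervening call, so it sees formatted()'s capture.
function viaStatics(s) {
  if (/foo(\d+)/.test(s)) {
    formatted(s);       // overwrites RegExp.$1
    return RegExp.$1;
  }
  return null;
}

// Keeps the match result in a local variable, unaffected by other matches.
function viaLocal(s) {
  const m = /foo(\d+)/.exec(s);
  if (m) {
    formatted(s);
    return m[1];
  }
  return null;
}

console.log(viaStatics("foo42"), viaLocal("foo42"));
```

With the input `"foo42"`, the first function returns the capture from the helper's regex rather than the digits, while the second returns the digits.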

# Brendan Eich (10 years ago)

Jussi Kalliokoski wrote:

Unicode flag disabling features to enable parallelism is another footnote for WTFJS.

A bit overdone, but I agree on this point.

A separate flag per regexp, and/or a way to opt-out of RegExp.$foo altogether, seem better.

# Jussi Kalliokoski (10 years ago)

On Wed, Sep 17, 2014 at 8:35 AM, Steve Fink <sphink at gmail.com> wrote:

On 09/16/2014 10:13 PM, Jussi Kalliokoski wrote:

On Wed, Sep 17, 2014 at 3:21 AM, Alex Kocharin <alex at kocharin.ru> wrote:

What's the advantage of re.test(str); RegExp.$1 over let m=re.match(str); m[1]?

Nothing. However, with control structures it removes a lot of awkwardness:

  • if ( /foo:(\d+)/.test(str) && parseInt(RegExp.$1, 10) > 15 ) { ...
  • if ( /name: (\w+)/.test(str) ) { var name = RegExp.$1; ...

Is

if ((m = /foo:(\d+)/.exec(str)) && parseInt(m[1], 10) > 15) { ... }

so bad? JS assignment is an expression; make use of it. Much better than Python's refusal to allow such a thing, requiring indentation trees of doom or hacky workarounds when you just want to case-match a string against a couple of regexes.

Looks pretty confusing, and my linter agrees (assignment in an if statement is most likely a bug). Also that doesn't do the same thing, it assigns to global m, unless you var it before the if(), so more noise, especially when this if() is an else if() in a set of if() statements.

But this boils down to taste in linter rules and other bias for what is pretty and what is not, which is not a very interesting discussion. My main point was that the /u flag shouldn't disable this feature as a side effect.

The global state is bad, and you don't need turns or parallelism to be bitten by it.

function f(s) {
  if (/foo(\d+)/.test(s)) {
    print("Found in " + formatted(s));
    return RegExp.$1; // Oops! formatted() does a match internally.
  }
}

Global variables are bad. They halfway made sense in Perl, but even the Perl folks wish they'd been lexical all along.

No argument here, I have no use for the RegExp.$ things being global. I'd much rather have them lexical, e.g. if ( /name: (\w+)/.test(str) ) { let name = $1; ... but that ship's sailed and even if we wanted to introduce that now as a part of the "disable global state modification" flag (which would be awesome), it would have a lot of things that need to be thought through to make it happen and I doubt anyone's willing to champion that effort.

# Andrea Giammarchi (10 years ago)

On Wed, Sep 17, 2014 at 6:35 AM, Steve Fink <sphink at gmail.com> wrote:

Is

if ((m = /foo:(\d+)/.exec(str)) && parseInt(m[1], 10) > 15) { ... }

so bad?

well, you just polluted the global scope "by accident", creating (maybe) a hybrid Array with properties attached, just to access index 1, where the same array needs to be evaluated as truthy in the initial if

VS

using a boolean value for an if statement and eventually access directly RegExp.$1

which one would you pick?

The global state is bad, and you don't need turns or parallelism to be bitten by it.

function f(s) {
  if (/foo(\d+)/.test(s)) {
    print("Found in " + formatted(s));
    return RegExp.$1; // Oops! formatted() does a match internally.
  }
}

I don't think that's a real-world code example and I've personally never done anything like that .. the case for .test() or String.prototype.search is to instantly access RegExp later on.

This does not play well with generators or with function calls in the middle, but it does not have to; it's straightforward for its use case, which has worked until now and for 12+ years.

However, this summarizes even better my thoughts on proposing such change through /u

Unicode flag disabling features to enable parallelism is another footnote for WTFJS.

new language features shouldn't be abused to sneakily drop well known functionalities, regardless what I smoke.

# Andrea Giammarchi (10 years ago)

you could obtain that inside a with statement :P

with(RegExp) {
  if (/r(\d)d(\d)/i.test('R2D2')) {
    '22' === $1 + $2;
  }
}

# Allen Wirfs-Brock (10 years ago)

On Sep 16, 2014, at 11:22 PM, Brendan Eich wrote:

Jussi Kalliokoski wrote:

Unicode flag disabling features to enable parallelism is another footnote for WTFJS.

A bit overdone, but I agree on this point.

A separate flag per regexp, and/or a way to opt-out of RegExp.$foo altogether, seem better.

Speaking strictly from a standards perspective, it seems that we are getting a bit ahead of ourselves.

The $ properties of RegExp and the behavior of setting them to reflect the most recent exec are not part of any edition of ECMA-262. It seems like they should be in Annex B, but apparently nobody has ever proposed that and/or offered a specification for them.

Getting them into Annex B sounds like the first step.

Since it is Annex B that defines them, Annex B could then also define a flag to eliminate them. But then a program that asserted that it didn't want to use that Annex B feature would be dependent upon the presence of Annex B support.

Annex B contains many changes to RegExp from the base standard. Perhaps a better way to approach this is to have a standard (not Annex B) regexp flag (perhaps 's', for "standard" or "strict") that means that this regexp should be strictly applied using only the ES standard semantics without any Annex B or other extensions.

# Mathias Bynens (10 years ago)

On Tue, Sep 16, 2014 at 8:16 PM, Domenic Denicola <domenic at domenicdenicola.com> wrote:

I also noticed today that the static RegExp properties are not specced, which seems at odds with our new mandate to at least Annex B-ify the required-for-web-compat stuff.

As a general note to people looking to spec some Annex B stuff, javascript.spec.whatwg.org is a good place to start. Many such things are listed there, but still lack a proper spec definition. Case in point: javascript.spec.whatwg.org/#regexp.$n

# C. Scott Ananian (10 years ago)

On Wed, Sep 17, 2014 at 11:39 AM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Annex B contains many changes to RegExp from the base standard. Perhaps a better way to approach this is to have a standard (not Annex B) regexp flag (perhaps 's', for "standard" or "strict") that means that this regexp should be strictly applied using only the ES standard semantics without any Annex B or other extensions.

s is already in use in the wider regex world, of course. The perl precedent seems to be to use d for this -- it originally meant "Default" as in, "default semantics pre-perl 5.14", but was soon nicknamed "dodgy" or "d@!#". Seems like a good model to follow.

Not sure we can get away with making this opt-in, though. It would be nice if we could.

# Andrea Giammarchi (10 years ago)

FWIW that 's' flag would work for me, but about not being specd, those properties are described already here: people.mozilla.org/~jorendorff/es6-draft.html#sec-getreplacesubstitution

and these are a long time de-facto standard. "somebody" wrote already about them a while ago: msdn.microsoft.com/en-us/library/ie/9dthzd08(v=vs.94).aspx

# Allen Wirfs-Brock (10 years ago)

On Sep 17, 2014, at 8:59 AM, Andrea Giammarchi wrote:

FWIW that 's' flag would work for me, but about not being specd, those properties are described already here: people.mozilla.org/~jorendorff/es6-draft.html#sec-getreplacesubstitution

that doesn't say anything about RegExp properties, their attributes, when they're set, etc. It just defines how to interpret a replacement string template.

and these are a long time de-facto standard. "somebody" wrote already about them a while ago: msdn.microsoft.com/en-us/library/ie/9dthzd08(v=vs.94).aspx

Not spec-worthy material. Plus somebody needs to do an interoperable semantics intersection analysis across the major browsers.

# Brendan Eich (10 years ago)

Allen Wirfs-Brock wrote:

A separate flag per regexp, and/or a way to opt-out of RegExp.$foo altogether, seem better.

Speaking strictly from a standards perspective, it seems that we are getting a bit ahead of ourselves.

Sure, but this is es-discuss, it runs ahead of standardization by miles. I just wanted to reinforce Jussi's plea not to overload 'u', s'all -- ok?

# Andrea Giammarchi (10 years ago)

I meant "nothing new to learn", it doesn't say anything but it describes them already so when you say "not in spec" I think "well, not as property", even if every engine has them, but already in specs somehow.

Hence a reason to not break de-facto legacy and think about an opt out instead

# Viktor Mukhachev (10 years ago)

lastIndex also prevents usage of one instance in parallel... I know that JavaScript does not support parallel execution, but the code is not so beautiful...
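
A small illustration of the per-instance state in question: a regex with the `g` flag remembers where its last match ended, so the same call on the same input can give different answers. The regex and input are arbitrary.

```javascript
const re = /a/g;

const first = re.test("a");           // match found at index 0
const indexAfterFirst = re.lastIndex; // the search position was advanced
const second = re.test("a");          // search resumes past the end and fails
const indexAfterSecond = re.lastIndex; // a failed match resets the position
const third = re.test("a");           // back to matching again

console.log(first, indexAfterFirst, second, indexAfterSecond, third);
```

Any code that shares one `g` or `y` regex instance between callers has to account for this.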

the idea to "deprecate" lastIndex was proposed 4 years ago:

see blog.stevenlevithan.com/archives/fixing-javascript-regexp, esdiscuss.org/topic/proposal

# John Lenz (10 years ago)

I wanted to mention that the global state of regexes prevents the elimination of otherwise dead code by optimizers, which is unfortunate. I personally have fixed a number of bugs where a regex overwrote the global state. It is a refactoring hazard.

# Andrea Giammarchi (10 years ago)

code that does not instantly check results through RegExp access is basically dead anyway, since it is unreliable or potentially broken later on, but a flag to opt out will make this a no-problem, right?

That code is legacy that should work as "impossible to mark as dead" anyway, not sure changing this would help those cases.

# Andy Earnshaw (10 years ago)

I've found the functionality useful on occasion too, but I've also seen it misused. I could see the extra flag being used by libraries everywhere. Intl.js, for instance, runs a "restoring" regular expression to restore the state of the RegExp object after some internal regexes are used (it does this in order to pass many tests in the test262/intl402 suite) – a simple flag to disable the behaviour would have been great at the time.

On Wed, Sep 17, 2014 at 6:13 AM, Jussi Kalliokoski < jussi.kalliokoski at gmail.com> wrote:

I personally find this functionality very useful and would be saddened if /u, which I want to use, all of a sudden broke this feature.

What about properly speccing the functionality on the RegExp instance itself? e.g. on a matches or lastMatches property:

var re = /foo:(\d+)/;
if (re.test(str) && parseInt(re.lastMatches[1], 10) > 15) {
    ...
}

This would be a way of avoiding using assignment in the conditional expression if you don't want to disable that option in your linter.
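
This idea can be approximated in user code today. The sketch below is only an illustration, not a proposed spec: `TrackingRegExp` is a hypothetical subclass name, and it relies on ES2015 semantics where `RegExp.prototype.test` calls the instance's `exec` method.

```javascript
// Hypothetical subclass that records the most recent match on the instance
// instead of relying on the RegExp statics.
class TrackingRegExp extends RegExp {
  exec(str) {
    const match = super.exec(str);
    this.lastMatches = match; // null when the match fails
    return match;
  }
}

const re = new TrackingRegExp("foo:(\\d+)");
const str = "foo:42";

let value = null;
if (re.test(str) && parseInt(re.lastMatches[1], 10) > 15) {
  value = parseInt(re.lastMatches[1], 10);
}
console.log(value);
```

The state still lives on a shared object, so two callers using the same instance can still interfere with each other; it is only scoped more narrowly than a global.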

# Claude Pache (10 years ago)

Le 16 sept. 2014 à 20:16, Domenic Denicola <domenic at domenicdenicola.com> a écrit :

I had a conversation with Jaswanth at JSConf EU that revealed that RegExps cannot be used in parallel JS because they modify global state, i.e. RegExp.$0 and friends.

We were thinking it would be nice to find some way of getting rid of this wart. One idea would be to bundle the don't-modify-global-state behavior with the /u flag. Another would be to introduce a new flag to opt-out. The former is a bit more attractive since people will probably want to use /u all the time anyway. I imagine there might be other possibilities others can think of.

Another idea is to define a variant of the RegExp.prototype.exec() method, that does the Right Thing (doesn't read/write stuff on the RegExp instance, nor on the RegExp global, nor I don't know where):

RegExp.prototype.run (str, params):

    Do what is currently specified for `RegExp.prototype.exec`, except that:
    
        * global, sticky and lastIndex properties are read and written on `params` instead of `this`
        * implementations are not allowed to extend that method in order to mess with `RegExp`, etc.

All other (legacy) methods are rewritten in terms of RE.p.run (in the current ES6 draft, they are mostly written in terms of RE.p.exec). For example:

RegExp.prototype.exec (str):

    1. Check that `this` is a RegExp.
    2. Let `result = this.run(str, this)`.
    3. Populate `RegExp.$1`, etc.
    4. Return `result`.


RegExp.prototype.split (str): — (somewhat simplified for expository purpose)

    1. Check that `this` is a RegExp.
    2. Coerce `str` to a string.
    3. Let `params = { lastIndex: 0, global: true, sticky: false }`.
    4. Do a series of calls to `this.run(str, params)` in order to find where to split the string.
    5. Return the split string.

More interestingly, it is now possible to write brand new methods based on RegExp.prototype.run, that are not handicapped with legacy stuff. In particular, note that the following RE.p.iterate generator is not confused by unexpected changes of this.lastIndex, because it uses a locally scoped version of lastIndex instead:

RegExp.prototype.iterate = function* (str) {
    if (!IsRegExp(this))
        throw new TypeError
    str = ToString(str)
    let params = { lastIndex: 0, global: true, __proto__: this }
    let previousLastIndex = 0
    let result
    while ((result = this.run(str, params)) !== null) {
        yield result
        if (params.lastIndex <= previousLastIndex)
            params.lastIndex = previousLastIndex + 1
        previousLastIndex = params.lastIndex
    }
}

String.prototype.replaceAll = function(rx, replacement) {
    var input = ToString(this)
    var result = ''
    var pos = 0
    for (let match of rx.iterate(input)) {
        // ... left as nontrivial exercise to the reader ...
    }
    return result
}
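
The `run`/`iterate` pair above can be approximated in present-day JavaScript for experimentation. The following is a rough sketch under stated assumptions: `run` and `iterate` are written as standalone functions rather than prototype methods, and `run` is emulated with a throwaway clone of the regex, so it still goes through `exec` and does not avoid the `RegExp` statics. It only demonstrates keeping `lastIndex` on `params` instead of on the regex.

```javascript
// Emulation of the proposed run(str, params): match state lives on `params`.
function run(re, str, params) {
  // Drop the original g/y flags and apply the ones requested in params.
  const base = [...re.flags].filter((f) => f !== "g" && f !== "y").join("");
  const extra = (params.global ? "g" : "") + (params.sticky ? "y" : "");
  const clone = new RegExp(re.source, base + extra);
  clone.lastIndex = params.lastIndex || 0;
  const result = clone.exec(str);
  if (params.global || params.sticky) {
    params.lastIndex = clone.lastIndex; // reset to 0 by exec on failure
  }
  return result;
}

// Same loop as the iterate generator above, using a local params object.
function* iterate(re, str) {
  const params = { lastIndex: 0, global: true, sticky: false };
  let previousLastIndex = 0;
  let result;
  while ((result = run(re, str, params)) !== null) {
    yield result;
    if (params.lastIndex <= previousLastIndex) {
      params.lastIndex = previousLastIndex + 1; // step past empty matches
    }
    previousLastIndex = params.lastIndex;
  }
}

const digits = /\d+/;
const found = [...iterate(digits, "a1b22c333")].map((m) => m[0]);
console.log(found, digits.lastIndex);
```

The regex passed in is never mutated, so the same instance can be iterated from several places without the callers interfering through `lastIndex`.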

# C. Scott Ananian (10 years ago)

On Tue, Sep 23, 2014 at 10:04 AM, Claude Pache <claude.pache at gmail.com> wrote:

Another idea is to define a variant of the RegExp.prototype.exec() method, that does the Right Thing (doesn't read/write stuff on the RegExp instance, nor on the RegExp global, nor I don't know where):

RegExp.prototype.run (str, params):

    Do what is currently specified for `RegExp.prototype.exec`, except that:

        * global, sticky and lastIndex properties are read and written on `params` instead of `this`
        * implementations are not allowed to extend that method in order to mess with `RegExp`, etc.

All other (legacy) methods are rewritten in terms of RE.p.run (in the current ES6 draft, they are mostly written in terms of RE.p.exec).

+1. Deprecating the side-effects of the legacy methods might be tricky though.

# Andrea Giammarchi (10 years ago)

specially when you yield around, who knows what happens ... but yes, this looks like the best option to me too so +1

# Claude Pache (10 years ago)

Le 23 sept. 2014 à 16:35, C. Scott Ananian <ecmascript at cscott.net> a écrit :

On Tue, Sep 23, 2014 at 10:04 AM, Claude Pache <claude.pache at gmail.com> wrote:

Another idea is to define a variant of the RegExp.prototype.exec() method, that does the Right Thing (doesn't read/write stuff on the RegExp instance, nor on the RegExp global, nor I don't know where):

RegExp.prototype.run (str, params):

   Do what is currently specified for `RegExp.prototype.exec`, except that:

       * global, sticky and lastIndex properties are read and written on `params` instead of `this`
       * implementations are not allowed to extend that method in order to mess with `RegExp`, etc.

All other (legacy) methods are rewritten in terms of RE.p.run (in the current ES6 draft, they are mostly written in terms of RE.p.exec).

+1. Deprecating the side-effects of the legacy methods might be tricky though. --scott

I think that "deprecating" is a word that the web hardly understands. For instance, in the spec, String.prototype.blink (yuck!) isn't marked as "deprecated", but as "mandatory for web browsers". In our case, .exec(), .match(), etc. will still need to support some legacy stuff. Brand new methods will not.

# Viktor Mukhachev (10 years ago)

Another idea is to define a variant of the RegExp.prototype.exec() method, that does the Right Thing (doesn't read/write stuff on the RegExp instance, nor on the RegExp global, nor I don't know where):

RegExp.prototype.run (str, params):

Do what is currently specified for RegExp.prototype.exec, except that:

   * global, sticky and lastIndex properties are read and written on `params` instead of `this`
   * implementations are not allowed to extend that method in order to mess with `RegExp`, etc.

All other (legacy) methods are rewritten in terms of RE.p.run (in the current ES6 draft, they are mostly written in terms of RE.p.exec).

RegExp.prototype.exec returns an array with extra properties (input, index); maybe it is better to return something else for run, a frozen value object with 0, 1, ... keys for example. What do you think?

Actually, index and "input" are not very interesting, as input is a string passed to exec and index = string.indexOf(match[0], lastIndex), right?

# C. Scott Ananian (10 years ago)

On Wed, Sep 24, 2014 at 2:36 AM, Viktor Mukhachev <esdiscusser at yandex.ru> wrote:

RegExp.prototype.exec returns an array with extra properties (input, index); maybe it is better to return something else for run, a frozen value object with 0, 1, ... keys for example. What do you think?

For ES6 this can return a proper subclass of Array.

Actually, index and "input" are not very interesting, as input is a string passed to exec and index = string.indexOf(match[0], lastIndex), right?

Not in the presence of lookahead assertions.
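
A concrete case, with a made-up regex and input, where the reported `index` differs from what `indexOf` would find:

```javascript
const input = "ab bc";
const match = /b(?=c)/.exec(input);

// The lookahead rejects the first "b" (followed by a space), so the match
// is the second "b". The matched text alone does not record that.
const reportedIndex = match.index;
const guessedIndex = input.indexOf(match[0], 0);

console.log(match[0], reportedIndex, guessedIndex);
```

The matched text is just `"b"`, and searching for that text finds the earlier occurrence that the assertion had ruled out.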

# Brendan Eich (10 years ago)

C. Scott Ananian wrote:

On Wed, Sep 24, 2014 at 2:36 AM, Viktor Mukhachev<esdiscusser at yandex.ru> wrote:

RegExp.prototype.exec returns an array with extra properties (input, index); maybe it is better to return something else for run, a frozen value object with 0, 1, ... keys for example. What do you think?

For ES6 this can return a proper subclass of Array.

Adding a new method still leaves the old ones around and developers will have problems until some later date. Doesn't mean we shouldn't, but in the mean time the API surface grows. Is it worth it compared to the alternatives?

  1. Support configurable as an attribute on the magic, not-yet-specified RegExp statics, so they can be deleted. SES (Caja) wants this, others could use it, libraries could do it at startup.

  2. Add a RegExp instance flag (don't overload u) that disables updating the RegExp statics.

TC39 wants both of these, per today's meeting.

Actually, index and "input" are not very interesting, as input is a string passed to exec and index = string.indexOf(match[0], lastIndex), right?

Not in the presence of lookahead assertions.

Good point!