Letting RegExp method return something iterable?

# Axel Rauschmayer (8 years ago)

At the moment, the following two methods abuse regular expressions as iterators (if the /g flag is set):

  • RegExp.prototype.test()
  • RegExp.prototype.exec()

Would it make sense to create similar methods that return something iterable, so that for-of can iterate over the result?

# Axel Rauschmayer (8 years ago)

Well, obviously it doesn’t make much sense to do that for test(), but it would be great to have for exec().

# Axel Rauschmayer (8 years ago)

An example to make things clearer (thanks for the suggestion, Domenic):

console.log(extractTagNamesES5('<a> and <b> or <c>'));  // [ 'a', 'b', 'c' ]
 
// If exec() is invoked on a regular expression whose flag /g is set
// then the regular expression is abused as an iterator:
// Its property `lastIndex` tracks how far along the iteration is
// and must be reset. It also means that the regular expression can’t be frozen.
var regexES5 = /<(.*?)>/g;
function extractTagNamesES5(str) {
    regexES5.lastIndex = 0;  // to be safe
    var results = [];
    while (true) {
        var match = regexES5.exec(str);
        if (!match) break;
        results.push(match[1]);
    }
    return results;
}
 
// If we had a method `execMultiple()` that returns an iterable,
// the above code would become simpler in ES6.
const REGEX_ES6 = /<(.*?)>/;  // no need to set flag /g
function extractTagNamesES6a(str) {
    let results = [];
    for (let match of REGEX_ES6.execMultiple(str)) {
         results.push(match[1]);
    }
    return results;
}
 
// Even shorter:
function extractTagNamesES6b(str) {
    return Array.from(REGEX_ES6.execMultiple(str), x => x[1]);
}
 
// Shorter yet:
function extractTagNamesES6c(str) {
    return [ for (x of REGEX_ES6.execMultiple(str)) x[1] ];
}

gist.github.com/rauschma/6330265
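
A minimal sketch of how the proposed method could be approximated with a generator. `execMultiple` is only the name suggested in this post, written here as a standalone helper (an assumption) rather than a real RegExp method. The helper clones the regex so the caller's `lastIndex` is never touched, and it advances by one code unit after an empty match, which is a simplification.

```javascript
// Hypothetical helper: `execMultiple` is the name proposed in the post,
// not an existing RegExp method.
function* execMultiple(regex, str) {
    // Work on a private copy so the caller's regex (and its lastIndex) stays untouched.
    const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
    const copy = new RegExp(regex.source, flags);
    let match;
    while ((match = copy.exec(str)) !== null) {
        // Step past empty matches, otherwise exec() would keep matching at the same position.
        if (match[0] === '') copy.lastIndex++;
        yield match;
    }
}

const REGEX = /<(.*?)>/;  // no need to set flag /g
function extractTagNames(str) {
    return Array.from(execMultiple(REGEX, str), m => m[1]);
}

console.log(extractTagNames('<a> and <b> or <c>'));  // [ 'a', 'b', 'c' ]
```

Because the iteration state lives in the generator's private copy, the shared regex could even be frozen.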

# Domenic Denicola (8 years ago)

This is really nice. lastIndex silliness in ES5 has bitten me quite a few times, and the example code shows how much better this would be. I hope someone on TC39 wants to champion this!
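
For readers who have not run into it, a short illustration of the kind of `lastIndex` surprise being referred to. The strings and regexes are made up for the example; the behaviour shown is that of standard `test()` and `exec()` on a regex with the /g flag.

```javascript
const re = /b/g;

// test() with /g advances lastIndex, so identical calls give different answers.
console.log(re.test('abc'));   // true  (lastIndex is now 2)
console.log(re.test('abc'));   // false (search started at index 2; lastIndex is reset to 0)

// Leaving an exec() loop early strands lastIndex in the middle of the string.
const tag = /<(.*?)>/g;
const first = tag.exec('<a> and <b>');   // matches '<a>', lastIndex is now 3
const again = tag.exec('<c>');           // search starts at index 3 of '<c>', so no match
console.log(first[1], again);            // a null
```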

# Andrea Giammarchi (8 years ago)

you don't need to reset the lastIndex to zero if you don't break the loop before unless you are sharing that regexp with some other part of code you don't control.

What I am saying is that the example is very wrong as it is since there's no way to have an unsafe regexES5 behavior in there.

Moreover, if anyone uses the flag g to test() something is wrong unless those properties you are complaining about or somebody might find silly are actually used in a clever way that do not require the creation of Arrays and garbage and track the position of each complex operation allowing incremental parsers based on substrings to keep going and do a fast job without bothering too much RAM and/or GC.

Long story short, I don't see any real use/case or any concrete advantage with those examples so please make it more clear what's the problem you are trying to solve and how these methods will concretely make our life easier ^_^

So far, and for what I can tell there, if you really need that Array you can go probably faster simply abusing replace.

var re = /<(.*?)>/g; // also bad regexp for tags
                            // it grabs attributes too
function addMatch($0, $1) {
  this.push($1);
}
function createMatches(str) {
  var matches = [];
  str.replace(re, addMatch.bind(matches));
  return matches;
}

Above trick also scales more if you need to push more than a match in the RegExp.

Although we might need a way to do similar thing without abusing other methods but I keep guessing when this is so needed that should be in core (1 line operation as your last one is ... anyone that needs that could implement it without problems ;-))

# Erik Arvidsson (8 years ago)

On Mon, Aug 26, 2013 at 1:05 PM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

Long story short, I don't see any real use/case or any concrete advantage with those examples so please make it more clear what's the problem you are trying to solve and how these methods will concretely make our life easier ^_^

Having a way to get an iterable over all the matches is highly useful. It is common to use such a construct in Python.

for m in re.finditer(r"\w+", "aa b ccc"):
  print(m.group(0))

for (let m of 'aa b ccc'.matchAll(/(\w+)/)) {
  print(m[0]);
}

# Andrea Giammarchi (8 years ago)

Is it very useful because you wrote for instead of while ?

while (m = re.exec(str))
  console.log(m[0])
;

I don't really see any concrete advantage, sorry, but maybe it's me not liking at all this iterable all the things new trend.

# Brendan Eich (8 years ago)

Andrea Giammarchi wrote:

Is it very useful because you wrote for instead of while ?

while (m = re.exec(str))
  console.log(m[0])
;

It is, for two reasons:

  1. in JS only for can have a let or var binding in the head.

  2. the utility extends to all for-of variations: array comprehensions, generator expressions.

# Andrea Giammarchi (8 years ago)

{let m; while(m = re.exec(str)) {
  // ... no, really
}}

I don't get the need of this but if this is the trend then String#split needs an iterable too (no!)

# Forbes Lindesay (8 years ago)

String#split already is iterable because it returns an array. What it isn't is lazy.

To be equivalent to the for code, the let needs to go inside the body of the while, not outside. This neatly demonstrates the key point:

  • as it stands, writing this kind of code tends to be bug prone (i.e. people get it wrong in confusing ways)
  • it would be less bug prone if there was just a method that returned an iterable. That could be an Array, rather than a lazy collection.

# Brendan Eich (8 years ago)

Forbes Lindesay wrote:

String#split already is iterable because it returns an array. What it isn't is lazy.

To be equivalent to the for code, the let needs to go inside the body of the while, not outside. This neatly demonstrates the key point:

  • as it stands, writing this kind of code tends to be bug prone (i.e. people get it wrong in confusing ways)
  • it would be less bug prone if there was just a method that returned an iterable. That could be an Array, rather than a lazy collection.

Spot on.

Note the Algol family includes languages that allow bindings in if (), while (), etc. heads as well as for () -- C++ for example. We talked about extending JS this way but for unclear reasons deferred.
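
The scoping difference under discussion becomes visible once the loop body creates closures. The following is a sketch with made-up data: the first loop declares the binding outside the `while`, the second re-declares it inside the body, which is what a `let` in a for-of head provides automatically.

```javascript
const str = 'a1 b2 c3';

// Binding declared outside the loop, as in `{ let m; while (m = re.exec(str)) { ... } }`:
// every callback closes over the same `m`, which is null once the loop has ended.
const shared = [];
{
    const re = /\w\d/g;
    let m;
    while ((m = re.exec(str)) !== null) {
        shared.push(() => m);
    }
}
console.log(shared.map(f => f()));          // [ null, null, null ]

// Binding declared inside the body: each pass gets its own `m`.
const perIteration = [];
{
    const re = /\w\d/g;
    let next;
    while ((next = re.exec(str)) !== null) {
        const m = next;                     // fresh binding on every pass
        perIteration.push(() => m[0]);
    }
}
console.log(perIteration.map(f => f()));    // [ 'a1', 'b2', 'c3' ]
```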

# Andrea Giammarchi (8 years ago)

to be really honest, most people will get it wrong regardless since thanks to JSLint and friends they are used to declaring everything on top and they probably forgot that for accepts var declarations.

I've never ever needed this syntax and I don't see the miracle. I am sure somebody one day will use that but I am just worried that resources will be wasted to add methods nobody needed that much 'till now instead of focusing on things that we cannot even write in one line of code as Axel did.

Just my 2 cents on this topic, no hard feelings on the proposal itself.

# Andrea Giammarchi (8 years ago)

one lazy hilarious thought on that though ...

On Mon, Aug 26, 2013 at 5:30 PM, Forbes Lindesay <forbes at lindesay.co.uk> wrote:

String#split already is iterable because it returns an array. What it isn't is lazy.

it's straight forward to make String#split(re) lazy using the lastIndex indeed:


function* lazySplit(str, re) {
  for (var
    i = 0;
    re.test(str);
    i = re.lastIndex
  )
    yield str.slice(
      i, re.lastIndex - RegExp.lastMatch.length
    )
  ;
  yield str.slice(i);
}
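
The same idea can be sketched without the legacy static `RegExp.lastMatch`, using `exec()` and `match.index` on a private copy of the regex. `lazySplitByExec` is a hypothetical name, and skipping empty matches is a simplification that differs from what `String.prototype.split` does with such patterns.

```javascript
// Hypothetical variant of lazySplit that avoids RegExp.lastMatch.
function* lazySplitByExec(str, re) {
    // Private copy with /g forced on, so the caller's lastIndex is not involved.
    const flags = re.flags.includes('g') ? re.flags : re.flags + 'g';
    const copy = new RegExp(re.source, flags);
    let start = 0;
    let match;
    while ((match = copy.exec(str)) !== null) {
        if (match[0] === '') {       // skip empty matches instead of looping forever
            copy.lastIndex++;
            continue;
        }
        yield str.slice(start, match.index);
        start = copy.lastIndex;
    }
    yield str.slice(start);          // remainder after the last separator
}

console.log([...lazySplitByExec('a.b.c', /\./g)]);  // [ 'a', 'b', 'c' ]
```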

# Claude Pache (8 years ago)

Le 27 août 2013 à 01:23, Brendan Eich <brendan at mozilla.com> a écrit :

It is, for two reasons:

  1. in JS only for can have a let or var binding in the head.

  2. the utility extends to all for-of variations: array comprehensions, generator expressions.

There is a third reason. The syntax:

for (let m of re.execAll(str) {
	// ...
}

has the clear advantage to express the intention of the programmer, and nothing more. It does not require good knowledge of the details of the language to understand what happens.

Indeed, when I read while(m = re.exec(str)), I really have to analyse the following additional points:

  • = is not a typo for == (here, some annotation would be useful);
  • RegExp#exec returns a falsy value if and only if there is no more match;
  • re has its global flag set, and its .lastIndex property has not been disturbed.

All these tricks are unrelated to the intention of the programmer, and are just distracting points, especially for any reader that use only occasionally RegExp#exec with the global flag set.

In summary, citing 1: "Don’t be clever, don’t make me think."

# Andrea Giammarchi (8 years ago)

sure you know everything as soon as you read of ... right ? How objectives are your points ? If you know JS that while looks very simple, IMO

# Brendan Eich (8 years ago)

On Aug 27, 2013, at 9:42 AM, Andrea Giammarchi <andrea.giammarchi at gmail.com> wrote:

sure you know everything as soon as you read of ... right ?

Wrong. The nested assignment is idiomatic in C but not good for everyone (see gcc's warning when not parenthesized in such contexts) due to == and = being so close as to make typo and n00b hazards.

Furthermore, the exogenous binding / hoisting problem is objectively greater cognitive load and bug habitat.

How objectives are your points ? If you know JS that while looks very simple, IMO

Please learn when to fold a losing argument :-|.

# Andrea Giammarchi (8 years ago)

let me rephrase ... I've no idea what this code does if not a syntax error (and for different reasons)

for (let m of re.execAll(str) {

what is of ... will let mark that variable as local ? what is returned and what will be m ?

I need to know these things ... this has nothing to do with "Don’t be clever, don’t make me think." a point which also I don't understand (I have to think about such statement ... I don't demand Ocaml language to be C like 'cause I don't get it)

Anyway, I've already commented my point of view.

# Andrea Giammarchi (8 years ago)

losing argument ... as if assignment within condition has been a real problem except for JSLint ... uhm, I don't think so but I am off this conversation. Already said my point, feel free to (as usual) disagree ^_^

# Claude Pache (8 years ago)

Le 27 août 2013 à 18:48, Andrea Giammarchi <andrea.giammarchi at gmail.com> a écrit :

let me rephrase ... I've no idea what this code does if not a syntax error (and for different reasons)

for (let m of re.execAll(str) {

what is of ... will let mark that variable as local ? what is returned and what will be m ?

I need to know these things ... this has nothing to do with "Don’t be clever, don’t make me think." a point which also I don't understand (I have to think about such statement ... I don't demand Ocaml language to be C like 'cause I don't get it)

Trying to reexplain: for (let m of re.execAll(str)) is a direct, one-to-one translation of the meaning of the programmer into (the expected) EcmaScript 6. But with while (m = re.exec(str)), you exploit some secondary fact about the value of re.exec(str) (falsy iff when over) which is unrelated to the object of the code. (And yes, you have to learn the complete syntax of for/of in order to understand the code, but it is unrelated to the point.)

# Forbes Lindesay (8 years ago)

Right, my impression is that most of us are in agreement that it would be extremely useful to have a simple way to loop over the list of matches for a regular expression and do something with each one. I don't see why @andrea doesn't see this need (maybe it's not something he's found need to do recently).

I think to move on, it would be useful to consider whether the method should return an array (which would be iterable and also have methods like .map built in) or a custom, lazy iterable (which might be better for efficiency if that laziness were useful, but will have disadvantages like lacking the array prototype methods and presumably failing if you try to loop over it twice).

I'm guessing that code like:

var matches = /foo/.execMultipleLazy('str')
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}

Would go wrong somehow whereas:

var matches = /foo/.execMultipleGreedy('str')
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}

Would work fine?

# Tab Atkins Jr. (8 years ago)

On Wed, Aug 28, 2013 at 2:12 AM, Forbes Lindesay <forbes at lindesay.co.uk> wrote:

Right, my impression is that most of us are in agreement that it would be extremely useful to have a simple way to loop over the list of matches for a regular expression and do something with each one. I don't see why @andrea doesn't see this need (maybe it's not something he's found need to do recently).

I think to move on, it would be useful to consider whether the method should return an array (which would be iterable and also have methods like .map built in) or a custom, lazy iterable (which might be better for efficiency if that laziness were useful, but will have disadvantages like lacking the array prototype methods and presumably failing if you try to loop over it twice).

I'm guessing that code like:

var matches = /foo/.execMultipleLazy('str')
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}

Would go wrong somehow whereas:

var matches = /foo/.execMultipleGreedy('str')
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}
for (let match of matches) {
  //do something
}

Yes. This is a standard Python issue - if you want to make sure you can loop over something twice, regardless of whether it's an array or an iterator, just pass it through list() first.

Similarly, in JS you'd just pass it through Array.from() first.
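
A small sketch of that behaviour, using a generator as a stand-in for the hypothetical `execMultipleLazy` (the helper name `lazyMatches` is an assumption for this example). The lazy result can be consumed once; the array produced by `Array.from()` can be looped over repeatedly.

```javascript
// Stand-in for a lazy match iterator; `lazyMatches` is a hypothetical helper.
function* lazyMatches(regex, str) {
    const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
    const copy = new RegExp(regex.source, flags);
    let m;
    while ((m = copy.exec(str)) !== null) {
        if (m[0] === '') copy.lastIndex++;   // avoid getting stuck on empty matches
        yield m;
    }
}

const matches = lazyMatches(/o+/, 'foo boo');
let firstPass = 0, secondPass = 0;
for (const m of matches) firstPass++;
for (const m of matches) secondPass++;      // the generator is already exhausted
console.log(firstPass, secondPass);         // 2 0

const reusable = Array.from(lazyMatches(/o+/, 'foo boo'));
let thirdPass = 0, fourthPass = 0;
for (const m of reusable) thirdPass++;
for (const m of reusable) fourthPass++;     // arrays can be iterated again
console.log(thirdPass, fourthPass);         // 2 2
```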

# Forbes Lindesay (8 years ago)

So you're in favor of returning the Iterable and then having people use Array.from if they need an array?

# Axel Rauschmayer (8 years ago)

I'm guessing that code like:

var matches = /foo/.execMultipleLazy('str')
for (let match of matches) {
 //do something
}
for (let match of matches) {
 //do something
}
for (let match of matches) {
 //do something
}

Would go wrong somehow

Yes. This is a standard Python issue - if you want to make sure you can loop over something twice, regardless of whether it's an array or an iterator, just pass it through list() first.

Similarly, in JS you'd just pass it through Array.from() first.

Additional option:

var arrayOfMatches = [ .../foo/.execMultipleLazy('str') ]

# Andrea Giammarchi (8 years ago)

On Wed, Aug 28, 2013 at 2:12 AM, Forbes Lindesay <forbes at lindesay.co.uk> wrote:

a simple way to loop over the list of matches for a regular expression

it's about 10 years or more we have that .. so to make my very personal statement clear:

I've got 99 problems in JS, make everything an iterator ain't one

Specially when the need comes out of an example and code that does not fully understand or use what's available since ever in JS, specially for something that has never been a real-world problem (but please tell me how better your app would have been otherwise)

for(let m of lazySplit('a.b.c', /\./g)) taking my example and prioritize something else since generators give already us the ability to do that? ^_^

Or maybe not, fine for me.

# Forbes Lindesay (8 years ago)

Right, I don't care whether it's lazy. I only care that it exists. Nobody's crying out for a lazy version of string split (at least not yet anyway). I have had the issue of needing to loop over all the matches that a regular expression has. It is a common, recurring issue that many developers face.

Let's move on from whether it should exist (clearly it should) and stick to whether it should be an array, or lazy. Does anyone have a strong opinion either way? The fact that all our regular expression iteration thus far has been lazy to me suggests this probably should be too, but maybe it would be simpler if it returned an array. I really hope someone will chime in on this.

# Brendan Eich (8 years ago)

Forbes Lindesay wrote:

Let’s move on from whether it should exist (clearly it should)

Does String.prototype.match not count?

and stick to whether it should be an array, or lazy. Does anyone have a strong opinion either way? The fact that all our regular expression iteration thus far has been lazy to me suggests this probably should be too, but maybe it would be simpler if it returned an array. I really hope someone will chime in on this.

The fact that s.match(/re/g) returns the array of all matches (with captures) sucks some of the oxygen away from any /re/g.execAll(s) proposal.

But String.prototype.match has perlish hair (e.g., those capture groups showing up in the result array). Perhaps we do want execAll (with a better name) just to break down the composite perl4-era legacy into compositional methods.

# Ron Buckton (8 years ago)

The advantage of a lazy execAll, is the ability to break out of the for..of loop without the need to continue to traverse the input string looking for matches. This is the same advantage that the while(m = re.exec()) has going for it. You can always be greedy by using Array.from or an array comprehension if execAll is lazy, but you are back to using a while loop if execAll is greedy and you want lazy matching, which limits its usefulness in some scenarios.
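
To make the cost difference concrete, here is a sketch that counts how often `exec()` runs. `lazyExecAll` and the counter are hypothetical names introduced only for this illustration.

```javascript
let execCalls = 0;

// Hypothetical lazy execAll that counts how many times exec() is invoked.
function* lazyExecAll(regex, str) {
    const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
    const copy = new RegExp(regex.source, flags);
    while (true) {
        execCalls++;
        const m = copy.exec(str);
        if (m === null) return;
        if (m[0] === '') copy.lastIndex++;   // avoid getting stuck on empty matches
        yield m;
    }
}

const text = 'x1 x2 x3 x4 x5';

// Lazy consumption: stop as soon as the interesting match has been seen.
for (const m of lazyExecAll(/x\d/, text)) {
    if (m[0] === 'x2') break;
}
const callsAfterBreak = execCalls;
console.log(callsAfterBreak);               // 2

// Eager consumption on top of the lazy iterator: everything is scanned.
execCalls = 0;
const all = Array.from(lazyExecAll(/x\d/, text));
console.log(all.length, execCalls);         // 5 6
```

The eager run needs one extra `exec()` call to discover that no further match exists.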

Ron


# Brendan Eich (8 years ago)

Ron Buckton wrote:

The advantage of a lazy execAll, is the ability to break out of the for..of loop without the need to continue to traverse the input string looking for matches. This is the same advantage that the while(m = re.exec()) has going for it. You can always be greedy by using Array.from or an array comprehension if execAll is lazy, but you are back to using a while loop if execAll is greedy and you want lazy matching, which limits its usefulness in some scenarios.

Good point -- on top of the quasi-redundancy of an eager execAll viz. String.prototype.match, I think this makes a good case for a lazy execAll -- with a much better name.

Candidates: r.iterate(s), r.iterateOver(s), r.execIterator(s) (blech!). Suggest some!

# Axel Rauschmayer (8 years ago)

The fact that s.match(/re/g) returns the array of all matches (with captures) sucks some of the oxygen away from any /re/g.execAll(s) proposal.

But String.prototype.match has perlish hair (e.g., those capture groups showing up in the result array).

Really? AFAICT, only the complete matches (group 0) are returned.

# Axel Rauschmayer (8 years ago)

[...] I think this makes a good case for a lazy execAll -- with a much better name.

Candidates: r.iterate(s), r.iterateOver(s), r.execIterator(s) (blech!). Suggest some!

I think “exec” should be in the name, to indicate that the new method is a version of exec().

Ideas:

  • execMulti()
  • execIter()

execAll() may not be that bad. It’s not pretty, but it’s fairly easy to guess what it does (if one knows what the normal exec() does).

# Brendan Eich (8 years ago)

Axel Rauschmayer wrote:

Really? AFAICT, only the complete matches (group 0) are returned.

Sorry, of course you are right -- how soon I forget -- the subgroups show up only in each exec result array, but are dropped from the match result.

So hair on the other side of the coin, if you will. A naive iterator that calls exec would return, e.g., ["ab", "b"] for the first iteration given r and s as follows:

js> r = /.(.)/g
/.(.)/g
js> s = 'abcdefgh'
"abcdefgh"
js> a = s.match(r)
["ab", "cd", "ef", "gh"]
js> b = r.exec(s)
["ab", "b"]

Is this what the programmer wants? If not, String.prototype.match stands ready, and again takes away motivation for an eager execAll.

But programmers wanting exec with submatches could use a lazy form:

js> r.lastIndex = 0
0
js> RegExp.prototype.execAll = function (s) { let m; while (m = this.exec(s)) yield m; }
(function (s) { let m; while (m = this.exec(s)) yield m; })
js> c = [m for (m of r.execAll(s))]
[["ab", "b"], ["cd", "d"], ["ef", "f"], ["gh", "h"]]
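
The transcript above uses the generator and array-comprehension syntax of the SpiderMonkey shell of the time. A sketch of the same experiment in the syntax that was later standardized, with a standalone `execAll` function (an assumption, chosen to avoid patching `RegExp.prototype`) and spread in place of the comprehension. Like the original, it relies on `r` having the /g flag and `lastIndex` being 0.

```javascript
// Naive lazy execAll in standardized syntax; without /g on `r` this would never terminate.
function* execAll(r, s) {
    let m;
    while ((m = r.exec(s)) !== null) yield m;
}

const r = /.(.)/g;
const s = 'abcdefgh';

// match() with /g keeps only the complete matches.
const a = s.match(r);
console.log(a);   // [ 'ab', 'cd', 'ef', 'gh' ]

// The lazy form keeps the submatches; [...m] copies each match into a plain array.
const c = [...execAll(r, s)].map(m => [...m]);
console.log(c);   // [ [ 'ab', 'b' ], [ 'cd', 'd' ], [ 'ef', 'f' ], [ 'gh', 'h' ] ]
```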

# Brendan Eich (8 years ago)

Axel Rauschmayer wrote:

– execIter()

Not bad, I think better than execAll, which does not connote return of an iterator, but which does perversely suggest returning an array of all exec results.

execAll() may not be that bad. It’s not pretty, but it’s fairly easy to guess what it does (if one knows what the normal exec() does).

(If only!)

# Axel Rauschmayer (8 years ago)

I agree that execAll() is not a 100% winner, more like a clean-up of a quirky corner. But exec() in “multi” mode has a surprising amount of pitfalls:

  • /g flag must be set
  • lastIndex must be 0
  • can’t inline the regex, because it is needed as a pseudo-iterator (more of an anti-pattern, anyway, but still)
  • side effects via lastIndex may be a problem

All of these would go away with an execAll(). The thing I’m not sure about is how frequently exec() is used that way. String.prototype.match() does indeed cover a lot of use cases. So does String.prototype.replace().

# Brendan Eich (8 years ago)

Axel Rauschmayer wrote:

  • /g flag must be set
  • lastIndex must be 0
  • can’t inline the regex, because it is needed as a pseudo-iterator (more of an anti-pattern, anyway, but still)
  • side effects via lastIndex may be a problem

Anything we do of the execAll/execIter kind had better be immune to the awful Perl4-infused "mutable lastIndex state but only if global" kind. Compositionality required.

The design decision to face is what to do when a global regexp is used. Throw, or ignore its lastIndex?

# Dean Landolt (8 years ago)

On Thu, Aug 29, 2013 at 4:13 AM, Brendan Eich <brendan at mozilla.com> wrote:

The design decision to face is what to do when a global regexp is used. Throw, or ignore its lastIndex?

I'd hate to see it throw. Ignoring lastIndex seems friendlier, especially if it were called execAll. It probably shouldn't be called execIter considering exec is already an iterator (even if a bit crazy).

I'd love to be able to send a specific index to the generator, which would be completely equivalent to RegExp.prototype.exec without the lastIndex smell.

# Oliver Hunt (8 years ago)

On Aug 29, 2013, at 1:13 AM, Brendan Eich <brendan at mozilla.com> wrote:

The design decision to face is what to do when a global regexp is used. Throw, or ignore its lastIndex?

I would favor ignoring lastIndex rather than throwing, but to be sure can you clarify what you mean by global regexp?

If we're talking /.../g, then my feeling is that the /g should be ignored -- if you're wanting a regexp iterator for a string (or whatever) I would think that the API would imply that all regexps were intended to be "global".

If we're talking about multiple concurrent iterators with the same regexp/string then it should definitely be ignored :D

Erm.

I'm not sure if that's coherent, but the TLDR is that I favor ignoring all the old side state warts (i would not have iterators update the magic $ properties, etc)
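
One way to read the "ignore the old state" option, as a sketch only: each iterator works on its own copy of the regex, so a stale `lastIndex` on the original is neither consulted nor modified, and concurrent iterators do not interfere. `execAllIsolated` is a hypothetical name, not a proposed API.

```javascript
// Hypothetical iterator that ignores both the /g flag and lastIndex of its argument.
function* execAllIsolated(regex, str) {
    const flags = regex.flags.includes('g') ? regex.flags : regex.flags + 'g';
    const copy = new RegExp(regex.source, flags);   // fresh copy, lastIndex === 0
    let m;
    while ((m = copy.exec(str)) !== null) {
        if (m[0] === '') copy.lastIndex++;          // avoid getting stuck on empty matches
        yield m;
    }
}

const shared = /\d/g;
shared.lastIndex = 2;                     // stale state left behind by other code

// Two concurrent iterators over the same regex and string.
const a = execAllIsolated(shared, '123');
const b = execAllIsolated(shared, '123');
console.log(a.next().value[0], b.next().value[0]);  // 1 1
console.log(a.next().value[0], b.next().value[0]);  // 2 2
console.log(shared.lastIndex);                      // 2 (untouched)
```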

# Andrea Giammarchi (8 years ago)

then you are probably looking for something like this?

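String.prototype.matchAll = function (re) {
  // Clone the regex with /g forced on so the caller's lastIndex is never touched.
  var copy = new RegExp(
      re.source,
      "g" +
      (re.ignoreCase ? "i" : "") +
      (re.multiline ? "m" : "")
    ),
    a = [],
    m;
  while ((m = copy.exec(this))) {
    // Step past empty matches; otherwise exec() would never advance.
    if (m[0] === "") copy.lastIndex++;
    a.push(m);
  }
  return a;
};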

// example
'abcdefgh'.matchAll(/.(.)/g);

[
  ["ab", "b"],
  ["cd", "d"],
  ["ef", "f"],
  ["gh", "h"]
]

# Brendan Eich (8 years ago)

Dean Landolt wrote:

I'd hate to see it throw. Ignoring lastIndex seems friendlier, especially if it were called execAll. It probably shouldn't be called execIter considering exec is already an iterator (even if a bit crazy).

'exec' is not an iterator in any well-defined ES6 sense.

Yes, ok -- I blew it by trying to emulate Perl 4. The jargon there was "list" vs "scalar" context, not "iterator".

I'd love to be able to send a specific index to the generator, which would be completely equivalent to RegExp.prototype.exec without the lastIndex smell.

Why an index? Rarely have I seen anyone assign other than constant 0 to lastIndex.

# Brendan Eich (8 years ago)

Oliver Hunt wrote:

I would favor ignoring lastIndex rather than throwing, but to be sure can you clarify what you mean by global regexp?

One created with the 'g' flag, either literally (/re/g) or via the constructor (new RegExp(src, 'g')).

If we're talking /.../g, then my feeling is that the /g should be ignored -- if you're wanting a regexp iterator for a string (or whatever) I would think that the API would imply that all regexps were intended to be "global".

Agreed, if we don't just throw from execAll on a global regexp ;-).

If we're talking about multiple concurrent iterators with the same regexp/string then it should definitely be ignored :D

IOW, in general, 'g' should be ignored by new APIs.

Erm.

I'm not sure if that's coherent, but the TLDR is that I favor ignoring all the old side state warts (i would not have iterators update the magic $ properties, etc)

Yes, agreed. The devil is in the details.

# 森建 (4 years ago)

It seems that some people agreed on adding RegExp#execAll to ECMAScript 4 years ago. What happened to it after that?

topic: esdiscuss.org/topic/letting-regexp-method-return-something-iterable
implementation: www.npmjs.com/package/regexp.execall

# Oriol _ (4 years ago)

There is a String#matchAll proposal in stage 1.

tc39/String.prototype.matchAll https://github.com/tc39/String.prototype.matchAll
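
For orientation, a short example of `String.prototype.matchAll` as it was eventually standardized in ES2020, run on the regex and string used earlier in this thread. It returns a lazy iterator, iterates on an internal copy of the regex, and rejects a regex that lacks the /g flag.

```javascript
const re = /.(.)/g;

// Each item is an exec()-style match array, including capture groups.
const pairs = Array.from('abcdefgh'.matchAll(re), m => [m[0], m[1]]);
console.log(pairs);          // [ [ 'ab', 'b' ], [ 'cd', 'd' ], [ 'ef', 'f' ], [ 'gh', 'h' ] ]
console.log(re.lastIndex);   // 0: the original regex is not advanced

// A regex without /g is rejected with a TypeError rather than treated as global.
let rejected = false;
try {
    'abc'.matchAll(/b/);
} catch (e) {
    rejected = e instanceof TypeError;
}
console.log(rejected);       // true
```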

Oriol

# 森建 (4 years ago)

@Oriol Thanks for your reply!