Regex on substrings

# Peter van der Zee (14 years ago)

A problem I faced recently is the inability to apply regular expressions to a substring of a string without explicitly taking the substring first. So I'm wondering how much trouble it would be to extend the RegExp api to this...

RegExp.prototype.test = function(string[, start=0[, stop=string.length]]){ }; RegExp.prototype.exec = function(string[, start=0[, stop=string.length]]){ };

The regular expression would only be applied to that part of the string. It'd be (almost) the same as regex.test(string.substring(start, stop)), except the substringing is handled internally.
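
For illustration, the substring equivalence can be sketched in today's JavaScript. The names `testRange` and `execRange` are hypothetical helpers, not part of any proposal or engine, and shifting `index` back to full-string coordinates is an assumption about what a caller would likely want. The sketch assumes the regex has no `g` or `y` flag, since `lastIndex` would otherwise affect the result.

```javascript
// Hypothetical helpers emulating the proposed test/exec(string, start, stop).
// They still allocate a substring; the proposal's point is to avoid that.
function testRange(regex, string, start = 0, stop = string.length) {
  return regex.test(string.substring(start, stop));
}

function execRange(regex, string, start = 0, stop = string.length) {
  const match = regex.exec(string.substring(start, stop));
  if (match !== null) {
    // Assumption: report the match position relative to the full string.
    match.index += start;
  }
  return match;
}

// ^ and $ anchor to the substring boundaries, as with substring().
const inRange = testRange(/^b$/, "abc", 1, 2);  // true
const wholeString = testRange(/^b$/, "abc");    // false
const found = execRange(/c/, "abcabc", 3);      // found.index is 5
```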

I can't think of any backward compatibility issues for this change.

My use case is that I have a set of words I want to find in a certain input string, but almost always starting at pos>0. Right now I have to take the substring of the longest possible match (or the remainder of the input) and check the results of exec/match to see the length of the match, if any. The alternative is to compile a regular expression that "skips" the first n characters.

Optionally, it might be handy to have .test return a number, indicating the length of the (first) match. If zero, there was no match. This would however break with scripts that explicitly check for === false.

It seems to me like applying regular expressions to substrings could be optimized internally much better (using pointers) than having to do a substring in ES every time. This should speed up the parsing of input, for instance.

# Dave Fugate (14 years ago)

Sounds like an excellent suggestion to me. I'm aware of some implementation tests this would end up breaking, but no 'real world' code per se.

My best,

# Mike Samuel (14 years ago)

2011/6/2 Peter van der Zee <ecma at qfox.nl>:

A problem I faced recently is the inability to apply regular expressions to a substring of a string without explicitly taking the substring first. So I'm wondering how much trouble it would be to extend the RegExp api to this...

RegExp.prototype.test = function(string[, start=0[, stop=string.length]]){ }; RegExp.prototype.exec = function(string[, start=0[, stop=string.length]]){ };

What do the ^ and $ assertions in non-multiline mode mean when start and stop are not [0, string.length)? Do they match at the beginning and end of the substring?

Optionally, it might be handy to have .test return a number, indicating the length of the (first) match. If zero, there was no match. This would however break with scripts that explicitly check for === false.

Using zero to indicate that there is no match will conflate zero length matches with no match. Consider

/\b/.test("a-b") === true
# Peter van der Zee (14 years ago)

On Thu, Jun 2, 2011 at 7:17 PM, Mike Samuel <mikesamuel at gmail.com> wrote:

2011/6/2 Peter van der Zee <ecma at qfox.nl>:

A problem I faced recently is the inability to apply regular expressions to a substring of a string without explicitly taking the substring first. So I'm wondering how much trouble it would be to extend the RegExp api to this...

RegExp.prototype.test = function(string[, start=0[, stop=string.length]]){ }; RegExp.prototype.exec = function(string[, start=0[, stop=string.length]]){ };

What do the ^ and $ assertions in non-multiline mode mean when start and stop are not [0, string.length)? Do they match at the beginning and end of the substring?

They mean exactly the same as they would do for regex.test(string.substring(start, stop))

Optionally, it might be handy to have .test return a number, indicating the length of the (first) match. If zero, there was no match. This would however break with scripts that explicitly check for === false.

Using zero to indicate that there is no match will conflate zero length matches with no match. Consider

/\b/.test("a-b") === true

That's a good point. I suppose it could return false, but that wouldn't be very backcompat for these cases. It could return -1, but that would definitely be backcompat breaking.
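
The distinction can already be made with exec, which returns null on failure and a match array (possibly holding an empty string) on success. A minimal sketch follows. The name `matchLength` and the -1 sentinel are assumptions for illustration only, not a proposed API.

```javascript
// Hypothetical helper: length of the first match, or -1 when nothing matches.
// Assumes a regex without the g or y flag, so lastIndex plays no role.
function matchLength(regex, string) {
  const match = regex.exec(string);
  return match === null ? -1 : match[0].length;
}

const zeroLength = matchLength(/\b/, "a-b");  // 0: matched, but the match is empty
const noMatch = matchLength(/\b/, "---");     // -1: no word boundary anywhere
const threeLong = matchLength(/a+/, "caaat"); // 3
```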

Well, it would have been nice. But if the tax on returning match length for .test is too big, I don't feel that strongly about it myself.

# Brendan Eich (14 years ago)

Note that

harmony:regexp_y_flag

has been promoted to harmony:proposals status (see harmony:proposals), so it is very likely to be in ES.next.

With this flag on a regexp, you don't need extra arguments for a suite of methods. Instead you set the regexp's lastIndex property (or let it start at 0 and then be updated by successive matches -- the y flag causes updates to lastIndex in the same way that the g flag does).
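
As a rough sketch of that usage, assuming an engine that implements the y (sticky) flag as described above; the pattern and input are made up for illustration:

```javascript
// Sticky regex: a match must begin exactly at lastIndex, with no searching ahead.
const word = /foo|bar/y;
const input = "xxfoobar";

word.lastIndex = 2;               // start matching at offset 2, no substring needed
const first = word.exec(input);   // matches "foo"; lastIndex advances to 5
const second = word.exec(input);  // matches "bar"; lastIndex advances to 8

word.lastIndex = 1;               // offset 1 holds "x", so neither word starts here
const miss = word.test(input);    // false; a failed match resets lastIndex to 0
```

Because nothing is sliced off the input, `first.index` and `second.index` are offsets into the full string.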

As for matching $ at the end, that is a less common use-case not served by the y flag. Usually you are lexing through a string and want to avoid an anchoring search, and of course avoid taking tail slices at each lastIndex. But the lexer munches according to a regular grammar and knows when to stop. It is atypical to want $ to match exactly N characters from the anchor point.