Regex on substrings

# Peter van der Zee (14 years ago)

A problem I faced recently is the inability to apply regular
expressions to a substring of a string without explicitly taking the
substring first. So I'm wondering how much trouble it would be to
extend the RegExp API like this:

    RegExp.prototype.test = function(string[, start=0[, stop=string.length]]){ };
    RegExp.prototype.exec = function(string[, start=0[, stop=string.length]]){ };

The regular expression would only be applied to that part of the
string. It'd be (almost) the same as
regex.test(string.substring(start, stop)), except the substringing is
handled internally.

I can't think of any backward compatibility issues for this change.

My use case is that I have a set of words I want to find in a certain
input string, but almost always starting at pos>0. Right now I have to
take the substring of the longest possible match (or the remainder of
the input) and check the results of exec/match to see the length of the
match, if any. The alternative is to compile a regular expression that
"skips" the first n characters.

Optionally, it might be handy to have .test return a number,
indicating the length of the (first) match. If zero, there was no
match. This would however break with scripts that explicitly check for
=== false.

It seems to me like applying regular expressions to substrings could
be optimized internally much better (using pointers) than having to do
a substring in ES every time. This should speed up the parsing of
input, for instance.

- peter
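
For illustration, the workaround described above might look roughly like the sketch below. The helper name `execRange` and the sample input are assumptions made for this example; they are not part of any standard API or of the proposal itself. The helper takes the substring explicitly, which is exactly the step the proposal would move into the engine.

```javascript
// Hypothetical helper (not a standard API): emulates the proposed
// exec(string, start, stop) form by taking the substring explicitly.
function execRange(regex, string, start = 0, stop = string.length) {
  const match = regex.exec(string.substring(start, stop));
  if (match !== null) {
    // Report the match position relative to the full string.
    match.index += start;
  }
  return match;
}

const input = "let answer = 42;";
const m = execRange(/[a-z]+/, input, 4, 10);
console.log(m[0], m.index); // answer 4
```

The match length the message asks about is available as `m[0].length` whenever `m` is not `null`.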

# Dave Fugate (14 years ago)

Sounds like an excellent suggestion to me. I'm aware of some implementation tests this would end up breaking, but no 'real world' code per se.

My best,

Dave

-----Original Message-----
From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Peter van der Zee
Sent: Thursday, June 02, 2011 4:30 AM
To: es-discuss
Subject: Regex on substrings

_______________________________________________
es-discuss mailing list
es-discuss at mozilla.org
https://mail.mozilla.org/listinfo/es-discuss

# Mike Samuel (14 years ago)

2011/6/2 Peter van der Zee <ecma at qfox.nl>:
> A problem I faced recently is the inability to apply regular
> expressions to a substring of a string without explicitly taking the
> substring first. So I'm wondering how much trouble it would be to
> extend the RegExp api to this...
>
> RegExp.prototype.test = function(string[, start=0[, stop=string.length]]){ };
> RegExp.prototype.exec = function(string[, start=0[, stop=string.length]]){ };

What do the ^ and $ assertions in non-multiline mode mean when start
and stop are not [0, string.length)?
Do they match at the beginning and end of the substring?

> Optionally, it might be handy to have .test return a number,
> indicating the length of the (first) match. If zero, there was no
> match. This would however break with scripts that explicitly check for
> === false.

Using zero to indicate that there is no match will conflate zero
length matches with no match.
Consider

    /\b/.test("a-b") === true
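
A small sketch of the ambiguity, using only the existing `exec` API: a successful match can have length zero, so a numeric return value of 0 could not distinguish it from a failed match. The variable names are chosen for this example only.

```javascript
// /\b/ matches the empty string at the word boundary before "a".
const boundary = /\b/.exec("a-b");
console.log(boundary !== null);  // true
console.log(boundary[0].length); // 0

// A genuine failure is reported as null, not as a length.
const miss = /x/.exec("a-b");
console.log(miss); // null
```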

# Peter van der Zee (14 years ago)

On Thu, Jun 2, 2011 at 7:17 PM, Mike Samuel <mikesamuel at gmail.com> wrote:
> 2011/6/2 Peter van der Zee <ecma at qfox.nl>:
>> A problem I faced recently is the inability to apply regular
>> expressions to a substring of a string without explicitly taking the
>> substring first. So I'm wondering how much trouble it would be to
>> extend the RegExp api to this...
>>
>> RegExp.prototype.test = function(string[, start=0[, stop=string.length]]){ };
>> RegExp.prototype.exec = function(string[, start=0[, stop=string.length]]){ };
>
> What do the ^ and $ assertions in non-multiline mode mean when start
> and stop are not [0, string.length)?
> Do they match at the beginning and end of the substring?

They mean exactly the same as they would do for
regex.test(string.substring(start, stop))

>
>> Optionally, it might be handy to have .test return a number,
>> indicating the length of the (first) match. If zero, there was no
>> match. This would however break with scripts that explicitly check for
>> === false.
>
> Using zero to indicate that there is no match will conflate zero
> length matches with no match.
> Consider
>
>    /\b/.test("a-b") === true

That's a good point. I suppose it could return false, but that wouldn't
be very backcompat for these cases. It could return -1 but that would
definitely be backcompat breaking.

Well, it would have been nice. But if the tax on returning match
length for .test is too big I'm not feeling that strongly about it
myself.

- peter
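
To make the intended anchor semantics concrete, here is a minimal sketch using today's `substring` call. Under the proposal as described, a ranged call would be expected to behave like the second test; the sample string and variable names are assumptions for illustration.

```javascript
const text = "abc";

// Against the whole string, ^ and $ refer to its real start and end.
const whole = /^b$/.test(text);
console.log(whole); // false

// Against the substring [1, 2), the same anchors refer to the
// boundaries of the slice "b".
const sliced = /^b$/.test(text.substring(1, 2));
console.log(sliced); // true
```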

# Brendan Eich (14 years ago)

Note that

http://wiki.ecmascript.org/doku.php?id=harmony:regexp_y_flag

has been promoted to harmony:proposals status (see http://wiki.ecmascript.org/doku.php?id=harmony:proposals), so very likely to be in ES.next.

With this flag on a regexp, you don't need extra arguments for a suite of methods. Instead you set the regexp's lastIndex property (or let it start at 0 and then be updated by successive matches -- the y flag causes updates to lastIndex in the same way that the g flag does).

As for matching $ at the end, that is a less common use-case not served by the y flag. Usually you are lexing through a string and want to avoid an anchoring search, and of course avoid taking tail slices at each lastIndex. But the lexer munches according to a regular grammar and knows when to stop. It is atypical to want $ to match exactly N characters from the anchor point.

/be
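
A brief sketch of the `lastIndex` approach, written against the `y` (sticky) flag as it later shipped in ES2015. The sample string and variable names are assumptions for this example. A sticky regexp matches only at `lastIndex`, so no tail slice of the input is needed and no search ahead takes place.

```javascript
const source = "foo 123 bar";
const word = /[a-z]+/y;

// Anchor the match at index 8 without slicing the string.
word.lastIndex = 8;
const hit = word.exec(source);
console.log(hit[0], hit.index); // bar 8

// As with the g flag, a successful match advances lastIndex.
const endIndex = word.lastIndex;
console.log(endIndex); // 11

// At index 3 there is a space; a sticky regexp does not scan forward.
word.lastIndex = 3;
const miss = word.exec(source);
console.log(miss); // null

// A failed sticky match resets lastIndex to 0.
const resetIndex = word.lastIndex;
console.log(resetIndex); // 0
```

This covers the start position only. There is no counterpart for the proposed stop argument, which matches the remark above about `$`.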
