Regexp APIs and capturing group positions

# Mark Macdonald (13 years ago)

In ES 5.1, the regular expression APIs do not expose the index at which a capturing group matched. The RegExp.prototype.exec(string) function returns an Array giving (among other things) the text matched by capturing groups, but does not give the *positions* of the captured text within the input string.

For example, consider this code using the current regex APIs:

    var match = /(fox).*(dog)/.exec("The quick brown fox jumps over the lazy dog");
    match[1]; // "fox"
    match[2]; // "dog"

We want to get this:

    "fox" at index 16
    "dog" at index 40

But there is no way to obtain the indices 16, 40 from the match object (or any other API I'm aware of). This makes it hard to write something like a regex coach, which takes an arbitrary regular expression and input string, and outputs a highlighted version of the input string showing where the capturing groups matched.
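To make the gap concrete, here is a minimal sketch of what a tool can do today. The helper `naiveGroupStart` is hypothetical (it is not part of any API); it guesses a group's position by searching for the group's text, which happens to work for the fox/dog example but fails as soon as the captured text occurs more than once.

```javascript
// Hypothetical helper: guesses where group n starts by searching for its
// text from the start of the overall match. This is NOT reliable.
function naiveGroupStart(input, match, n) {
  return input.indexOf(match[n], match.index);
}

var input = "The quick brown fox jumps over the lazy dog";
var match = /(fox).*(dog)/.exec(input);
var wholeStart = match.index;                    // 16: only the whole match has an index
var foxGuess = naiveGroupStart(input, match, 1); // 16, correct for this input

// Counterexample: group 2 actually matches the last "o" (index 6), but the
// search finds the first "o" at or after match.index instead.
var input2 = "foo boo";
var match2 = /(o).*(o)/.exec(input2);
var wrongGuess = naiveGroupStart(input2, match2, 2); // 1, but the true start is 6
```

The second case is the kind of input that fools a search-based regex tester.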

Proposal: When RegExp.prototype.exec(string) returns a nonnull value, the returned object shall have a property named "captures", which is an Array. The value of captures[n] is the index at which the n'th capturing group's match begins. As usual, groups are numbered from 1. The captures array does not have a "0" property (it would always be equal to the "index" property of the match object, and thus redundant).

Proposed code:

    var match = /(fox).*(dog)/.exec("The quick brown fox jumps over the lazy dog");
    match.captures[1]; // 16
    match.captures[2]; // 40

This (combined with the group text from the match object) gives you enough information to enumerate the captured regions of the input string.
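As a rough illustration of that last point, the sketch below enumerates captured regions from a match plus a captures array. Since no engine provides `match.captures`, the array is filled in by hand with the values from the example; `capturedRegions` is a hypothetical helper name, and skipping unmatched groups is an assumption the proposal does not spell out.

```javascript
// Hypothetical helper: combines group text (from exec) with start indices
// (from the proposed, not yet existing, captures array).
function capturedRegions(match, captures) {
  var regions = [];
  for (var n = 1; n < match.length; n++) {
    if (match[n] === undefined) {
      continue; // assumption: groups that did not participate are skipped
    }
    regions.push({
      group: n,
      start: captures[n],
      end: captures[n] + match[n].length, // exclusive end index
      text: match[n]
    });
  }
  return regions;
}

var input = "The quick brown fox jumps over the lazy dog";
var match = /(fox).*(dog)/.exec(input);

// Stand-in for the proposed match.captures; indices start at 1, no "0" entry.
var captures = [];
captures[1] = 16;
captures[2] = 40;

var regions = capturedRegions(match, captures);
// regions[0]: { group: 1, start: 16, end: 19, text: "fox" }
// regions[1]: { group: 2, start: 40, end: 43, text: "dog" }
```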

Prior art: Java's java.util.regex.Matcher.start() [1], Python's re.MatchObject.start() [2].

Comments, suggestions?

Mark

[1] docs.oracle.com/javase/1.5.0/docs/api/java/util/regex/Matcher.html#start(int)
[2] docs.python.org/library/re.html#match

# Steven Levithan (13 years ago)

Definite +1 on adding some way to determine capturing group match start positions.

Mark Macdonald wrote:

This makes it hard to write something like a regex coach, which takes an arbitrary regular expression and input string, and outputs a highlighted version of the input string showing where the capturing groups matched.

It makes it impossible to do accurately. There are a few crude approaches I’ve pursued in the past to make something like this work with handcrafted regexes in some cases only. E.g., you can use str.replace() to insert markers before and after backreferences, and then check for the position of your markers after the fact. But that certainly won’t work with any arbitrary regex fed in via a regex tester. Captured subpatterns might not even appear within the text of the match, due to lookahead.
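A minimal sketch of the marker technique, under strong assumptions: the regex is handcrafted so that every part of the match sits in its own consecutive capturing group, and the marker characters (chosen arbitrarily here) never occur in the input. The function name `markerPositions` is hypothetical.

```javascript
// Arbitrary marker characters, assumed never to appear in the input.
var START = "\u0001";
var END = "\u0002";

// Returns the start indices of groups 1 and 3 ("fox" and "dog").
// Only works because the regex below captures every part of the match,
// so the replacement string can rebuild the match with markers inserted.
function markerPositions(input) {
  var re = /(fox)(.*)(dog)/;
  var marked = input.replace(re, START + "$1" + END + "$2" + START + "$3" + END);
  var positions = [];
  var seen = 0; // markers encountered so far
  for (var i = 0; i < marked.length; i++) {
    var ch = marked.charAt(i);
    if (ch === START) {
      positions.push(i - seen); // subtract earlier markers to recover the original index
      seen++;
    } else if (ch === END) {
      seen++;
    }
  }
  return positions;
}

var found = markerPositions("The quick brown fox jumps over the lazy dog");
// found is [16, 40]
```

This cannot be generalized to arbitrary patterns: nested groups, uncaptured text between groups, and groups inside lookahead all break the assumption that the replacement can rebuild the match.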

There are several JavaScript regex testers that try to report backreference positions, but they are incredibly easy to fool. See, e.g., leaverou.github.com/regexplained (plus LeaVerou/regexplained#7) and www.gethifi.com/tools/regex.

As for the proposed implementation, I have a few concerns:

  1. The main issue I see is that the proposal doesn’t provide a clean way to support named backreferences, should a future version of ES add named capturing groups. Future ES might want to share the proposed captures array or object for providing named backreferences, as well as their match positions. xregexp.com/syntax/named_capture_comparison shows where named backreferences are stored in various regex flavors (usually accessible via a method named group() or groups(), although XRegExp stores named backreference properties directly on the result array).

  2. Keep in mind that, since str.match(nonglobalregex) is an alias of regex.exec(str), anything added to regex.exec() should also be added to the nonglobal str.match() overload.

  3. IMO, the name captures is misleading, given the specific proposal, since it seems to suggest that it stores the backreferences themselves, rather than their start positions.

  4. I dislike the idea of excluding backreference zero (i.e., the entire match) from any result array.

  5. The proposal does not mention what should happen when trying to access the start position of a nonparticipating capturing group. Presumably, the value should be null or undefined.
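Points 2 and 5 can be checked against current behavior. The sketch below only uses existing ES 5.1 APIs; it shows that a nonglobal str.match() returns the same kind of result as exec(), and that a nonparticipating group's text is already reported as undefined, which suggests (as an assumption, not something the proposal states) that its start position could be undefined too.

```javascript
var re = /(a)|(b)/; // nonglobal; group 1 does not participate when "b" matches

var viaExec = re.exec("b");
var viaMatch = "b".match(re);

// Both calls produce the same array contents and the same index property.
var sameGroups = viaExec.length === viaMatch.length &&
  viaExec[0] === viaMatch[0] &&
  viaExec[1] === viaMatch[1] &&
  viaExec[2] === viaMatch[2];
var sameIndex = viaExec.index === viaMatch.index;

// The nonparticipating group is reported as undefined, not as "".
var group1 = viaExec[1]; // undefined
var group2 = viaExec[2]; // "b"
```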

Thanks for mentioning the prior art of java.util.regex.Matcher.start() and Python's re.MatchObject.start(). They offer an alternative design that might be better, assuming that adding a start() method to the array returned by exec() is an acceptable solution. If future ES adopts named capture, it would be easy to change such a method to accept strings in addition to integers. (For whatever reason [probably just omission], Java 7’s Matcher.start() doesn’t accept strings, even though Java 7 supports named capture.)
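A sketch of what such a start() method might look like from the caller's side. Everything here is hypothetical: `addStartMethod` is an illustrative helper, the `positions` array stands in for data only the engine could supply, and the `names` map imitates named groups, which ES 5.1 does not have.

```javascript
// Hypothetical: attaches a start() method to an exec() result.
// positions[n] is the start index of group n (supplied by hand here);
// names maps a group name to its number (illustrative only).
function addStartMethod(match, positions, names) {
  names = names || {};
  match.start = function (group) {
    if (group === undefined) {
      return match.index; // no argument: start of the whole match
    }
    var n = typeof group === "string" ? names[group] : group;
    return n === 0 ? match.index : positions[n];
  };
  return match;
}

var input = "The quick brown fox jumps over the lazy dog";
var positions = [];
positions[1] = 16;
positions[2] = 40;

var m = addStartMethod(/(fox).*(dog)/.exec(input), positions, { pet: 2 });
var whole = m.start();        // 16, same as m.index
var second = m.start(2);      // 40
var byName = m.start("pet");  // 40, looked up through the hypothetical names map
var missing = m.start(3);     // undefined: no such group
```

Accepting either an integer or a string in one method is what would let the same API extend to named capture later.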

Another potential way to do this might be to change the string primitives in the exec() result array to String objects that can store properties. Then you could do something like /.(.)/.exec('foo')[1].start === 1.
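The following sketch shows why String objects could carry a position while primitives cannot, and what the change would cost. The `start` property is attached by hand here; nothing in ES 5.1 sets it.

```javascript
// A String object can hold extra properties; a primitive string cannot.
var wrapped = new String("o");
wrapped.start = 1; // set by hand, standing in for what exec() might provide

var kind = typeof wrapped;         // "object", not "string"
var looseEqual = wrapped == "o";   // true: the object is converted to a primitive
var strictEqual = wrapped === "o"; // false: an object is never === a primitive
var len = wrapped.length;          // 1: string behavior is otherwise preserved
var text = wrapped + "!";          // "o!": concatenation yields a primitive again
```

Existing code that tests results with typeof or === would see different answers, which is a compatibility cost of this design.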

--Steven Levithan

In reply to: Mark Macdonald, Thursday, July 12, 2012 2:26 PM, es-discuss at mozilla.org, Subject: Regexp APIs and capturing group positions