Look-behind proposal

# Nozomu Katō (9 years ago)

I submit a proposal for adding the look-behind assertions to RegExp.


Abstract

Basically, the evaluation for the look-behind assertion is performed by "rewind the target sequence and do the same thing as the look-ahead assertion":

  1. When compiling a regular expression pattern, count the total number of PatternCharacter and . and \ AtomEscape and CharacterClass in Disjunction per look-behind assertion. Each Atom may be followed by {n}, but may not be followed by the other quantifiers.
  2. When a look-behind assertion evaluated, rewind the target sequence by that number and evaluate as if the look-ahead assertion from that point.

-- Static Semantics: Early Errors (21.2.1.1 in ECMAScript Specifition 6)

Assertion :: ( ? < = Disjunction ) Assertion :: ( ? < ! Disjunction )

It is a Syntax Error if Disjunction contains Quantifier :: QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.

When Disjunction is separated by the | regular expression, if the number of the sequence of code points which the left Alternative matches and the number of the sequence of code points which the right Disjunction matches are different, then it is a Syntax Error.

Assertion (21.2.2.6 in ECMAScript Specifition 6)

The production Assertion :: ( ? < = Disjunction ) evaluates as follows:

  1. Evaluate Disjunction to obtain a Matcher m.
  2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
  3. Let n be the exact number of the sequence of code points which Disjunction matches [NOTE].
  4. Let xe be x's endIndex.
  5. If xe < n, return failure.
  6. Let xcap be x's captures List.
  7. Let y be the State (xe-n, xcap).
  8. Let d be a Continuation that always returns its State argument as a successful MatchResult.
  9. Call m(y, d) and let r be its result.
  10. If r is failure, return failure.
  11. Let z be r's State.
  12. Let zcap be z's captures List.
  13. Let a be the State (xe, zcap).
  14. Call c(a) and return its result.

The production Assertion :: ( ? < ! Disjunction ) evaluates as follows:

  1. Evaluate Disjunction to obtain a Matcher m.
  2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
  3. Let n be the exact number of the sequence of code points which Disjunction matches [NOTE].
  4. Let xe be x's endIndex.
  5. If xe < n, then call c(x) and return its result.
  6. Let xcap be x's captures List.
  7. Let y be the State (xe-n, xcap).
  8. Let d be a Continuation that always returns its State argument as a successful MatchResult.
  9. Call m(y, d) and let r be its result.
  10. If r is not failure, return failure.
  11. Call c(x) and return its result.

NOTE: n is computed by counting and summing up the following Atoms in the Disjunction: a. Atom :: PatternCharacter b. Atom :: . c. Atom :: \ AtomEscape d. Atom :: CharacterClass

If an Atom is followed by a "QuantifierPrefix :: { DecimalDigits }", then that Atom is counted as the number of DecimalDigits instead of one.

If the Disjunction contains an "Atom :: ( Disjunction )" or "Atom :: ( ? : Disjunction )" that is followed by a "QuantifierPrefix :: { DecimalDigits }", then the number of code points which that inner-Disjunction matches is multiplied by the number of DecimalDigits.


If it is not necessarily important to strictly follow the syntax of Perl 5, then adopting the following syntax may make implemetation simple:

(?<{n}= Disjunction) (?<{n}! Disjunction)

where n in {} is the number of code points by which the target sequence is rewound when a look-behind assertion is evaluated. In the case of this syntax, it is not required to count Atoms in Disjunction and the number to rewind the sequence of code points is taken from n directly.

, Nozomu

# Claude Pache (9 years ago)

Old thread of interest:

esdiscuss.org/topic/regexp-lookbehind, esdiscuss.org/topic/regexp-lookbehind

I've also personally thought quite thoroughly on how to add look-behind assertions in ES RegExps.

Your approach imposes that the exact length of a string matched by the sub-pattern ( ? < = Disjunction ) is statically determinable This is, indeed, the original semantics of Perl 5.

Another more powerful approach (which I prefer), is the one followed by .NET: the sub-pattern ( ? < = Disjunction ) is tentatively matched backwards. The advantage is that it does not impose any restriction on the subpattern, while remaining as efficient as possible (in theory, about as efficient as an equivalent look-ahead assertion). The main drawback is that it may be somewhat confusing for some types of complicated subpatterns, especially for those that contain backreferences. (Also it is more mind-bending to spec.)

That said, I think there are easier missing features to be added, that are supported by most other RegExp flavours. I'm thinking of:

  • the s, or .dotall flag: the dot . matches every character, including newlines;
  • true support of Unicode, namely: escape sequences such as \Lu for uppercase letter, or \X for grapheme cluster.
# Jason Orendorff (9 years ago)

On Mon, May 18, 2015 at 8:50 AM, Nozomu Katō <noz.ka at akenotsuki.com> wrote:

I submit a proposal for adding the look-behind assertions to RegExp.

Fantastic! I'm not a TC39 member but I've been hoping this would happen for some time.

I have often thought that the major obstacle was that no one has done this work. I hope I was not wrong!

It is a Syntax Error if Disjunction contains Quantifier :: QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.

Backreferences must be ruled out, too.

  1. Let n be the exact number of the sequence of code points which Disjunction matches [NOTE].

I don't think a NOTE is strong enough for spec purposes, so you can improve the proposal by adding a "Static Semantics:" section that formally specifies this.

The spec uses attribute grammars for dozens of things like this, where some piece of information has to be determined from the parse tree. See "Static Semantics: ElisionWidth" for a simple example.

# Jason Orendorff (9 years ago)

On Mon, May 18, 2015 at 1:51 PM, Claude Pache <claude.pache at gmail.com> wrote:

Another more powerful approach (which I prefer), [...]

As long as we prohibit matching groups within lookbehind assertions, I think the approach you prefer is a strict superset of what's proposed. So there's no opportunity cost to taking the proposal as-is: the .NET semantics could be compatibly added later.

I imagine TC39 gives some preference to proposals that exist in a concrete form. So if you prefer the .NET approach enough to do anything about it, now's a great time to write some spec text.

That said, I think there are easier missing features to be added, that are supported by most other RegExp flavours. I'm thinking of:

  • the s, or .dotall flag: the dot . matches every character, including newlines;
  • true support of Unicode, namely: escape sequences such as \Lu for uppercase letter, or \X for grapheme cluster.

Why not submit proposals for these?

# Allen Wirfs-Brock (9 years ago)

On May 18, 2015, at 2:20 PM, Jason Orendorff wrote:

On Mon, May 18, 2015 at 8:50 AM, Nozomu Katō <noz.ka at akenotsuki.com> wrote:

I submit a proposal for adding the look-behind assertions to RegExp.

Fantastic! I'm not a TC39 member but I've been hoping this would happen for some time.

I have often thought that the major obstacle was that no one has done this work. I hope I was not wrong!

You're exactly right, we've been waiting for somebody (anybody) to create a technically complete proposal

It is a Syntax Error if Disjunction contains Quantifier :: QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.

Backreferences must be ruled out, too.

  1. Let n be the exact number of the sequence of code points which Disjunction matches [NOTE].

I don't think a NOTE is strong enough for spec purposes, so you can improve the proposal by adding a "Static Semantics:" section that formally specifies this.

The spec uses attribute grammars for dozens of things like this, where some piece of information has to be determined from the parse tree. See "Static Semantics: ElisionWidth" for a simple example.

The proposal also needs to follow the conventions and terminology of the ES6 RegExp spec. In particular, the RegExp pattern matching algorithms use the term "character" with a specific meaning rather than using "code unit" or "code point" (see people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics ). It does this so that the same algorithms can be use to describe the matching semantic of both "normal" and "unicode" patterns. Any extensions also will need t work with both kinds of patterns.

# Claude Pache (9 years ago)

Le 18 mai 2015 à 23:53, Jason Orendorff <jason.orendorff at gmail.com> a écrit :

On Mon, May 18, 2015 at 1:51 PM, Claude Pache <claude.pache at gmail.com> wrote:

Another more powerful approach (which I prefer), [...]

As long as we prohibit matching groups within lookbehind assertions, I think the approach you prefer is a strict superset of what's proposed. So there's no opportunity cost to taking the proposal as-is: the .NET semantics could be compatibly added later.

I imagine TC39 gives some preference to proposals that exist in a concrete form. So if you prefer the .NET approach enough to do anything about it, now's a great time to write some spec text.

That said, I think there are easier missing features to be added, that are supported by most other RegExp flavours. I'm thinking of:

  • the s, or .dotall flag: the dot . matches every character, including newlines;
  • true support of Unicode, namely: escape sequences such as \Lu for uppercase letter, or \X for grapheme cluster.

Why not submit proposals for these?

You are right, but procrastination and other activities... But I'll try to write some more concrete proposals for those ideas.

# Nozomu Katō (9 years ago)

Allen Wirfs-Brock wrote on Mon, 18 May 2015, at 15:08:13 -0700:

On May 18, 2015, at 2:20 PM, Jason Orendorff wrote:

On Mon, May 18, 2015 at 8:50 AM, Nozomu Katō wrote:

It is a Syntax Error if Disjunction contains Quantifier :: QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.

Backreferences must be ruled out, too.

Yes, I forgot to mention backreferences in my proposal. I have modified my proposal to reflect this.

  1. Let n be the exact number of the sequence of code points which Disjunction matches [NOTE].

I don't think a NOTE is strong enough for spec purposes, so you can improve the proposal by adding a "Static Semantics:" section that formally specifies this.

The spec uses attribute grammars for dozens of things like this, where some piece of information has to be determined from the parse tree. See "Static Semantics: ElisionWidth" for a simple example.

Thank you. This sample is helpful. I was unsure how to describe this information.

The proposal also needs to follow the conventions and terminology of the ES6 RegExp spec. In particular, the RegExp pattern matching algorithms use the term "character" with a specific meaning rather than using "code unit" or "code point" (see people.mozilla.org/~jorendorff/es6-draft.html#sec-pattern-semantics ). It does this so that the same algorithms can be use to describe the matching semantic of both "normal" and "unicode" patterns. Any extensions also will need t work with both kinds of patterns.

Based on these comments, I re-submit a modified proposal:


Abstract

Basically, the evaluation for the look-behind assertion is performed by "rewind the target sequence and do the same thing as the look-ahead assertion":

  1. When compiling a regular expression pattern, count the total number of PatternCharacter and . and \ AtomEscape and CharacterClass in Disjunction per look-behind assertion. Each Atom may be followed by {n}, but may not be followed by the other quantifiers.
  2. When a look-behind assertion evaluated, rewind the target sequence by that number and evaluate as if the look-ahead assertion from that point.

-- Static Semantics: Early Errors

Assertion :: ( ? < = Disjunction ) Assertion :: ( ? < ! Disjunction )

It is a Syntax Error if Disjunction contains Quantifier :: QuantifierPrefix except QuantifierPrefix :: { DecimalDigits }.

It is a Syntax Error if Disjunction contains AtomEscape :: DecimalEscape.

When Disjunction is separated by the | regular expression, if the number of the sequence of characters which the left Alternative matches and the number of the sequence of characters which the right Disjunction matches are different, then it is a Syntax Error.

Static Semantics: DisjunctionSize

  1. return the total sum of the following Atoms in the Disjunction: a. Atom :: PatternCharacter b. Atom :: . c. Atom :: \ AtomEscape except AtomEscape :: DecimalEscape d. Atom :: CharacterClass

    If an Atom is followed by a "QuantifierPrefix :: { DecimalDigits }", then that Atom is counted as the number of DecimalDigits instead of one. If the Disjunction contains an "Atom :: ( Disjunction )" or "Atom :: ( ? : Disjunction )" that is followed by a "QuantifierPrefix :: { DecimalDigits }", then the total number of the sequence of Atoms in that inner-Disjunction is multiplied by the number of DecimalDigits.

Assertion

The production Assertion :: ( ? < = Disjunction ) evaluates as follows:

  1. Evaluate Disjunction to obtain a Matcher m.
  2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
  3. Let n be the result of DisjunctionSize for Disjunction.
  4. Let xe be x's endIndex.
  5. If xe < n, return failure.
  6. Let xcap be x's captures List.
  7. Let y be the State (xe-n, xcap).
  8. Let d be a Continuation that always returns its State argument as a successful MatchResult.
  9. Call m(y, d) and let r be its result.
  10. If r is failure, return failure.
  11. Let z be r's State.
  12. Let zcap be z's captures List.
  13. Let a be the State (xe, zcap).
  14. Call c(a) and return its result.

The production Assertion :: ( ? < ! Disjunction ) evaluates as follows:

  1. Evaluate Disjunction to obtain a Matcher m.
  2. Return an internal Matcher closure that takes two arguments, a State x and a Continuation c, and performs the following steps:
  3. Let n be the result of DisjunctionSize for Disjunction.
  4. Let xe be x's endIndex.
  5. If xe < n, then call c(x) and return its result.
  6. Let xcap be x's captures List.
  7. Let y be the State (xe-n, xcap).
  8. Let d be a Continuation that always returns its State argument as a successful MatchResult.
  9. Call m(y, d) and let r be its result.
  10. If r is not failure, return failure.
  11. Call c(x) and return its result.

Nozomu

# Nozomu Katō (9 years ago)
  • the s, or .dotall flag: the dot . matches every character, including newlines;
  • true support of Unicode, namely: escape sequences such as \Lu for uppercase letter, or \X for grapheme cluster.

Why not submit proposals for these?

You are right, but procrastination and other activities... But I'll try to write some more concrete proposals for those ideas.

Unlike Perl, there is no special treatment for ] that appears just after [ or [^ in RegExp of ECMAScript. Thanks to this, [^] can be used for what matches every character including newlines.

# Nozomu Katō (9 years ago)

Sorry, I accidentally posted an unfinished e-mail. I was about to add the link of a html version of my proposal to my previous post: www.akenotsuki.com/misc/srell/lookbehind_proposal.html

Nozomu

# Nozomu Katō (9 years ago)

I revised my proposal for lookbehinds: www.akenotsuki.com/misc/srell/lookbehind_proposal.html

I rewrote the paragraphs about how to count Atoms in Disjunction, to clarify that the recursive call is used for nested Disjunctions.

, Nozomu

# Sebastian Zartner (9 years ago)

Thank you for picking this up again! I asked for adding look-behinds back in 2013[1], though I didn't find the time to come up with an algorithm and read into writing this down for the specification. So let's hope this discussion will result in something this time.

Sebastian

[1] esdiscuss/2013-September/033837

# Nozomu Katō (9 years ago)

Sebastian Zartner wrote on Wed, 3 Jun 2015, at 13:44:06 +0200:

Thank you for picking this up again! I asked for adding look-behinds back in 2013[1], though I didn't find the time to come up with an algorithm and read into writing this down for the specification. So let's hope this discussion will result in something this time.

Yes, I also really hope so. Although I read old discussions about look-behinds before posting my contribution, I am still uncertain whether additional concrete actions are required for look-behind support to go forward. Anyhow, I am looking forward to seeing new Meeting Notes.

Nozomu

# Allen Wirfs-Brock (9 years ago)

In order to keep this moving forward, you need to recruit a TC39 member who will work with you and act as the champion for the proposal within TC39.

# Nozomu Katō (9 years ago)

Thank you for the clarification.

Unfortunately, I do not know at all who is a TC39 member. I wonder if there is a member who will work for this proposal.

Even in the case that no one is found this time, I will leave the html version of the proposal as it is at that url. If anyone later come to have an interest in that proposal please take it and modify (if needed) freely.

Nozomu

# Sebastian Zartner (9 years ago)

So the question is, how to recruit a TC39 member? Through this list? Is it possible to become a member as a private person?

Sebastian

# Brendan Eich (9 years ago)

I will do it if no one else steps up by 10pm tonight PDT.

Thank you to Nozomu Katō for writing it up!

# Sebastian Zartner (9 years ago)

Thank you Brendan for picking this up! Nozomu, Brendan, please let me know if I can be of any help.

Sebastian

# Matthew Robb (9 years ago)

This thread is a cool example of why this list is still do important. +1 to all of it!

# Nozomu Katō (9 years ago)

On 8 June 2015 at 18:47, Brendan Eich wrote:

I will do it if no one else steps up by 10pm tonight PDT.

I thank you very much indeed for taking this proposal on. I am very relieved!

Sebastian Zartner wrote on Tue, 9 Jun 2015, at 10:20:00 +0200:

Nozomu, Brendan, please let me know if I can be of any help.

Thank you for your kindness. Any action that could make the proposal move forward is a great help to me.

This is the first time for me to see how a proposal becomes part of the ECMAScript standard. If/whenever I do not a required action please urge me to do it.

, Nozomu

# Allen Wirfs-Brock (9 years ago)

On Jun 9, 2015, at 6:18 AM, Nozomu Katō wrote:

On 8 June 2015 at 18:47, Brendan Eich wrote:

I will do it if no one else steps up by 10pm tonight PDT.

I thank you very much indeed for taking this proposal on. I am very relieved!

Nozomu,

Also, please register as an Ecma TC39 non-member contributor by following the steps described at tc39/ecma262#contributing-new-proposals

Thanks,

# Nozomu Katō (9 years ago)

Allen Wirfs-Brock wrote on Tue, 9 Jun 2015, at 07:27:36 -0700:

Nozomu,

Also, please register as an Ecma TC39 non-member contributor by following the steps described at tc39/ecma262#contributing-new-proposals

I just did. Thank you for telling me that.

Nozomu