Look-behind proposal in trouble

# Nozomu Katō (9 years ago)

Apparently my proposal for adding the look-behind assertions to RegExp has been in trouble. I would like to ask anyone for help.

The following story is what I know about the proposal after my previous post:

I created a pull request for the proposal in July and sent an email to Brendan Eich asking if I can put his name as a champion: tc39/ecma262#48

I have not received a reply to my email, but I received a notification email in September that replying to the pull request, the proposal was moved to stage 0. Today, however, I just noticed that the proposal had been dropped from stage 0, stating "RegExp lookbehind has no champion". tc39/ecma262/commits/master/stage0.md (Oct 4, 2015)

I am uncertain about what happened. Does this mean that Brendan Eich is no longer a champion or did not take a champion on from the beginning or ...?

# Brian Terlson (9 years ago)

Brendan has indeed discovered he doesn't have time to champion the proposal through TC39, so I removed it while I searched for a new champion. Good news on that front - I have found one! Gorkem Yakin works on the Chakra team and is available to help move this proposal forward. I will also help out where I can. I've added the proposal back to the stage 0 list!

# Nozomu Katō (9 years ago)

I thank you very much indeed for your email and bringing really good news! I thought that my proposal might not be able to move forward anymore.

I am also thankful that you searched for a new champion and Gorkem undertakes this proposal!

# Sebastian Zartner (9 years ago)

Brian, where can people get the information about the reasons of such decisions (besides asking) and more generally about the processes behind the ES development?

I was following Nozomu's proposal closely, though to me it looked like the progress on this just died out.

Non-the-less, great to hear that new champions could be found!

# Erik Corry (9 years ago)

Your proposal for look-behind relies on being able to count the match length of the look-behind in order to step back that far. This presupposes that atoms like . and character classes have a fixed length.

However, with the /u flag, the . and some character classes can be either 1 or two code units. This means you don't know how far to step back. This needs to be fixed in a way that is not incompatible with the "correct" .NET way of doing things.

Eg matching /a.(?<!x..)/u against "xa😹" (x, a, cat-face-with-tears-of-joy, which is a surrogate pair). The back reference has an apparent width of 3, so we step back 3 code units, but that hits the 'a', not the 'x' and so the back reference fails to spot the 'x'.

# Erik Corry (9 years ago)

(edit made to previous post)

# Claude Pache (9 years ago)

This should not be a problem: With the /u flag, you work with code points, not code units. In particular, the . matches always a sequence (of code points with /u, or code units otherwise) of length 1.

# Erik Corry (9 years ago)

The proposal needs to be clarified to explain that you are stepping back a number of code points, not units. This implies that you are inspecting the input string as you step backwards. Also it should be explained what to do if there are unpaired surrogates in the input string and inside the lookbehind expression source.

I think the proposal would benefit from a pointer to an implementation or two. Of course the implementations should also fully support /u.

# Claude Pache (9 years ago)

Le 7 oct. 2015 à 11:16, Erik Corry <erik.corry at gmail.com> a écrit :

The proposal needs to be clarified to explain that you are stepping back a number of code points, not units. This implies that you are inspecting the input string as you step backwards. Also it should be explained what to do if there are unpaired surrogates in the input string and inside the lookbehind expression source.

Looking at the proposal, there is a Note section (recently added) clarifying that point if needed.

The way of counting, and the meaning of the words "character", "code point" and "code unit", are the same as in ES2015; there is really nothing new here. See 2 for details. If anything needs to be clarified e.g. regarding unpaired surrogates, it is not specific to lookbehind, but applies to the whole regexp semantics.

# Nozomu Katō (9 years ago)

What Claude mentioned is already part of the specification: "Input is a List consisting of all of the characters" and "Each character is either a code unit or a code point, depending upon the kind of pattern involved" (21.2.2.1).

But I added the Note section to the page of my proposal for clarification two days ago because I was asked a similar question.

Incidentally, in the initial version of the proposal I used the term "code point" but later changed it to "character" since Allen pointed out: esdiscuss/2015-May/042922

# Brian Terlson (9 years ago)

From: Sebastian Zartner [mailto:sebastianzartner at gmail.com]

where can people get the information about the reasons of such decisions (besides asking) and more generally about the processes behind the ES development?

You can follow the tc39/ecma262 github repository for updates on proposals. It also contains information about our process.

# Erik Corry (9 years ago)

I made an implementation of .NET-style variable length lookbehinds. It's not in a JS engine, but it's in a very simple (and very slow) ES5-compatible regexp engine that is used in the tiny Dart implementation named Fletch.

No unicode issues arise since this engine does not support /u, but I don't expect any issues since it's not trying to second-guess the length of the string matched by an expression.

Needs a lot more tests, but it seems to work OK and was surprisingly simple to do. Basically:

  • All steps in the input string are reversed, so if you would step forwards you step backwards.
  • Check for start of string instead of end of string.
  • Test against the character to the left of the cursor instead of to the right.
  • The parts of the Alternative (see the regexp grammar in the standard) are code-generated in reverse order.

Code is here: codereview.chromium.org/1398033002

# Nozomu Katō (9 years ago)

Me too; I have once implemented lookbehind assertions by using this way in SRELL, my C++ template library whose engine is compatible with RegExp of ECMAScript but whose class design is compatible with std::regex of C++ [1].

However, later I removed the code for such lookbehinds and adopted Perl5 style lookbehinds instead. The core reasons are:

  1. Right-to-left matchers are used only in lookbehind assertions;
  2. Nevertheless, these cannot share code with normal (left-to-right) matchers and need their own optimization processes.

Thus, I came to feel that what I can get and what I have to do are unbalanced.

In my understanding, features that are available in .NET style lookbehinds but are not so and even cannot be emulated in Perl5 style lookbehinds are 1) the use of the backreference and 2) the use of the quantifiers other than {n}. The others can be emulated in some way.

For example, the positive multiple-length lookbehind (?<=ab|cde) can be substituted by (?:(?<=ab)|(?<=cde)). The substitution of the negative multiple-length lookbehind is more simple, only to write assertions in succession; for example, (?<!ab|cde) can be written as (?<!ab)(?<!cde).

I guess that oniguruma supports expressions like (?<=ab|cde) by doing such substitutions inside the library, but just my guess.

So, I came to feel that Perl5 style lookbehinds are balanced. It may not be best, though. In fact, the current implementation for lookbehinds in my library is far simple; it shares code with lookaheads. If the count to rewind is 0 then it means lookahead, otherwise (if equal to or more than 1) it means lookbehind.

If we would introduce .NET style lookbehinds into RegExp of ECMAScript, it would need someone who writes right-to-left versions of the most parts of the definitions under 21.2 of the specification.

Nozomu

[1] www.akenotsuki.com/misc/srell/en

# Claude Pache (9 years ago)

Note that full-featured lookbehind assertions (à la .NET) is not the only case where backward matching is useful. Consider for instance, the following simple method:

String.prototype.trimRight = function () {
    return this.replace(/\s+$/u, '') 
}

That implementation would be more efficient if we could instruct the regexp to be applied backwards.

# Erik Corry (9 years ago)

I'm not convinced that the current proposal is easier to implement than the real thing. Take a look at the patch, it's trivial.

The lack of variable length lookbehind is a big annoyance in most languages. Search for the term and you'll find lots of frustrated perl users.

On the other hand I don't think adding variable length lookbehind to the spec makes it any easier to optimize /.+$/.

# Nozomu Katō (9 years ago)

Since there was a comment about Perl5 style vs .NET style when I first posted my proposal to es-discuss, too, I just wanted to explain about the background of my proposal. I proposed Perl5 compatible lookbehinds because I thought it was relatively simple to implement. Moreover, I am not confident that I can write a lookbehind proposal based on .NET implementation, in the manner used in the ECMAScript specification.

As Jason Orendorff wrote before, the lookbehind supported by .NET is a strict superset of what I have proposed. So, if you or someone else submits another lookbehind proposal based on .NET and it supersedes my proposal in a later version of ECMAScript, from the point of view of users, that would look like just an enhancement.

My hope is that lookbehind assertions are certainly supported in RegExp of ECMAScript in the near future.

# Nozomu Katō (9 years ago)

(edit incorporated into above post)

# Waldemar Horwat (9 years ago)

On 10/09/2015 15:07, Nozomu Katō wrote:

As Jason Orendorff wrote before, the lookbehind supported by .NET is a strict superset of what I have proposed. So, if you or someone else submits another lookbehind proposal based on .NET and it supersedes my proposal in a later version of ECMAScript, from the point of view of users, that would look like just an enhancement.

It's not a superset. Captures would match differently.

# Nozomu Katō (9 years ago)

Hmm... I am getting confused. For now I wait for a decision of TC39 on my proposal.

# Erik Corry (9 years ago)

On Sat, Oct 10, 2015 at 12:47 AM, Waldemar Horwat <waldemar at google.com> wrote:

It's not a superset. Captures would match differently.

Can you elaborate? How would they be different?

# Erik Corry (9 years ago)

Just for the lulz I ran the tests I could find from perl5 (which I think is very similar to the proposal here) and the captures were identical when using .Net-style reverse capturing. It's not a huge number of tests, though.

# Waldemar Horwat (9 years ago)

On 10/10/2015 03:48, Erik Corry wrote:

Can you elaborate? How would they be different?

If you have a capture inside a loop (controlled, say, by {n}), one of the proposals would capture the first instance, while the other proposal would capture the last instance.

# Erik Corry (9 years ago)

Yes, that makes sense.

This could be fixed by removing {n} loops from positive lookbehinds. Or by doing the .NET-style back-references immediately.

# Waldemar Horwat (9 years ago)

On 10/13/2015 02:18, Erik Corry wrote:

Yes, that makes sense.

This could be fixed by removing {n} loops from positive lookbehinds. Or by doing the .NET-style back-references immediately.

I think it would be cleanest to do the full reverse-order matching (what I think you're calling .NET-style) from the start.

 Waldemar
# Nozomu Katoo (9 years ago)

Erik Corry wrote on Tue, 13 Oct 2015 at 11:18:48 +0200:

Yes, that makes sense.

This could be fixed by removing {n} loops from positive lookbehinds. Or by doing the .NET-style back-references immediately.

Personally, I am reluctant to remove any feature from the current proposal intentionally for a future proposal that it is uncertain whether it really comes or not. It might end up only making lookbehinds of ECMAScript and ones of Perl 5 incompatible.

On 10/10/2015 03:48, Erik Corry wrote:

On Sat, Oct 10, 2015 at 12:47 AM, Waldemar Horwat wrote:

It's not a superset.  Captures would match differently.

Can you elaborate? How would they be different?

If you have a capture inside a loop (controlled, say, by {n}), one of the proposals would capture the first instance, while the other proposal would capture the last instance.

I was missing that point. I just confirmed that

perl -e "$a = 'abcdef'; $a =~ /(?<=.(.){2}.)./; print $1;"

returned 'c' whereas .NET returned 'b'. Implementation based on my proposal would return the same result as Perl 5.

By the way, at one point in this thread, I moved some email addresses from To to Cc when sending my reply. But somehow several of them had disappeared from the Cc field in the delivered email while they all remain in a copy in my sent-email folder. I apologize to those who received disconnected emails in this thread.

, Nozomu

# Erik Corry (8 years ago)

I made a playground where you can try out regexps with lookbehind.

dartpad.dartlang.org/8feea83c01ab767acdf1

# Erik Corry (8 years ago)

And here's a similar playground for .Net, not by me:

www.regexplanet.com/advanced/dotnet/index.html

# Yang Guo (8 years ago)

The experimental implementation [0] in V8 landed a few days ago and is included in the latest Canary build (49.0.2568.0 and later). You can test it after enabling it with the command line flag --js-flags="--harmony-regexp-lookbehind". Apparently the port to SpiderMonkey is already underway [1].

This implementation supports variable length lookbehind similar to .NET's semantics. It does so by emitting code to read backwards inside the lookbehind. The size of the change without platform ports and tests is about 600 lines.

[0] codereview.chromium.org/1418963009 [1] bugzilla.mozilla.org/show_bug.cgi?id=1225665