Backward running version look-behinds (Re: Look-behind proposal in trouble)

# Nozomu Katō (9 years ago)

On Thu, 19 Nov 2015 09:04:41 +0100, Yang Guo wrote:

This implementation supports variable length lookbehind similar to .NET's semantics. It does so by emitting code to read backwards inside the lookbehind. The size of the change without platform ports and tests is about 600 lines.

When regular expressions inside a look-behind are evaluated backwards, the automaton will not have a chance to know what has been captured in $1 in /(?<=(.)$1)/, since the automaton reaches $1 prior to (.).

This can be fixed by defining that a backreference referring to a capture inside the same look-behind has to be put leftwards contrary to the usual rule, but saying so would impose reading regular expressions backwards or prefetching forward expressions inside a look-behind upon the parser, to know if the corresponding capturing parentheses exist.

I am not sure if it is easy or there is a simpler solution, anyhow, if ECMAScript supports .NET-like look-behinds this matter would need to be fixed.

Nozomu

# Claude Pache (9 years ago)

Le 19 nov. 2015 à 20:17, Nozomu Katō <noz.ka at akenotsuki.com> a écrit :

On Thu, 19 Nov 2015 09:04:41 +0100, Yang Guo wrote:

This implementation supports variable length lookbehind similar to .NET's semantics. It does so by emitting code to read backwards inside the lookbehind. The size of the change without platform ports and tests is about 600 lines.

When regular expressions inside a look-behind are evaluated backwards, the automaton will not have a chance to know what has been captured in $1 in /(?<=(.)$1)/, since the automaton reaches $1 prior to (.).

This can be fixed by defining that a backreference referring to a capture inside the same look-behind has to be put leftwards contrary to the usual rule, but saying so would impose reading regular expressions backwards or prefetching forward expressions inside a look-behind upon the parser, to know if the corresponding capturing parentheses exist.

I am not sure if it is easy or there is a simpler solution, anyhow, if ECMAScript supports .NET-like look-behinds this matter would need to be fixed.

Nozomu

There is always the coward solution: forbid backreferences inside lookbehinds.

If we are going to ask programmers to write /(?<\1(.))/, it may be a good idea to prevent footguns like /(?<(.)\1)/ by throwing a syntax error (for it can be statically determined that it will not work as intended).

# Yang Guo (9 years ago)

FWIW this is precisely how it works in .NET as well (reverse order of capture and back reference). While it might be surprising, it makes sense with the backward read direction in mind.

I'm not fond of the idea of throwing syntax error if the back reference is to the right of the capture inside a lookbehind. /\1(.)/ and /(.\1)/ are both perfectly valid regexps as well, with the semantic that these back references always match, as their corresponding captures are empty at that point.

Yang