RegExp pet peeves (was: should calling RegExp constructor as function without arguments throw?)
Lasse R.H. Nielsen wrote:
The only difference between an Atom and an Assertion is that the former can have a quantifier attached. There is absolutely no reason to put a quantifier on a look-ahead, and look-aheads are zero-width matches just like all assertions, so they would fit much better as assertions. Changing the grammar to make look-aheads actual assertions wouldn't even require implementations to change. It would just change quantified look-aheads from being standard to being an extension.
There's no benefit to the end user from making this non-backwards compatible change. In fact, if ES were to change the behavior of backreferences to non-participating groups to make them fail rather than match the empty string (which I think would be very positive, and this has been discussed here previously), (?=(x)?) would become a potentially useful construct, and (?=(x))? would be a valid alternative syntax.
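For readers who want to see the current behavior referred to above, here is a minimal sketch, assuming an ES-conformant engine: a backreference to a group that did not participate in the match succeeds by matching the empty string. The pattern and variable names are illustrative only and do not come from the thread.

```javascript
// Group 1 is optional, so it does not participate when the input has no "a".
const optionalGroup = /^(a)?\1b$/;

// \1 refers to a non-participating group and matches the empty string,
// so the overall match succeeds.
const emptyBackref = optionalGroup.exec("b");
console.log(emptyBackref !== null); // true
console.log(emptyBackref[1]);       // undefined

// When group 1 does participate, \1 must repeat its text.
console.log(optionalGroup.test("aab")); // true
console.log(optionalGroup.test("ab"));  // false
```

Under the change described above, the first match would presumably fail instead of succeeding.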
Lasse R.H. Nielsen wrote:
The problem with back-references is that the requirement prevents a one-pass parser, because you need to scan the entire regexp to know whether a decimal escape is valid. Well, actually it wouldn't be a problem if you didn't want to be compatible with all the current implementations that treat invalid decimal escapes as octal escapes - so you need to know whether a given decimal sequence is a valid back-reference in order to parse it as octal if it isn't valid.
I should look over the precise spec rules regarding this since I don't remember if/when \1 would/could be treated as an escaped literal "1" (and other related issues that I'm currently hazy on), but as you noted, the spec does not call for it to match octal character index 1. Forward references such as (\2two|(one))+ would actually be useful (matching "oneonetwo", and I believe this is what happens with .NET, Java, Perl, PCRE, and Ruby regexes) if ES didn't call for resetting the backreference value when repeating the group (an ES regex pet peeve of my own).
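A rough sketch of how the forward-reference example behaves in an ES engine, assuming the capture reset on each repetition and the empty match for non-participating groups described in this thread. The anchors and variable names are my additions; the behavior of the other flavors listed is as stated in the message and is not checked here.

```javascript
// Anchored version of the (\2two|(one))+ example from the message.
const forwardRef = /^(\2two|(one))+$/;

// The whole string matches, but group 2 is reset at the start of the last
// iteration, so \2 matched the empty string rather than "one".
const resetMatch = forwardRef.exec("oneonetwo");
console.log(resetMatch[0]); // "oneonetwo"
console.log(resetMatch[1]); // "two"
console.log(resetMatch[2]); // undefined

// Because \2 matches the empty string here, "two" matches with no "one" at all.
console.log(forwardRef.test("two")); // true
```

So the ES match on "oneonetwo" succeeds for a different reason than a flavor that keeps the previous iteration's capture would.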
I'm not proposing any ES changes in this email. I'm just saying that if ES regex changes are in order, I'd be proposing different ones from Lasse.
Steven Levithan Baghdad, Iraq stevenlevithan.com
From: brendan at mozilla.com
To: atwork at infimum.dk
Subject: Re: RegExp pet peeves (was: should calling RegExp constructor as function without arguments throw?)
Date: Wed, 14 Jan 2009 14:56:56 -0800
CC: es-discuss at mozilla.org

This is really a separate thread -- please change the subject accordingly.

See also past messages here, which linked to
web-graphics.com/2007/11/26/ecmascript-3-regular-expressions-a-specification-that-doesnt-make-sense/
http://blog.stevenlevithan.com/archives/npcg-javascript

If you want access, I will add you to bugs.ecmascript.org so you can file tickets on your peeves.

/be

On Jan 14, 2009, at 2:01 PM, Lasse R.H. Nielsen wrote:
On Wed, 14 Jan 2009 14:13:13 +0100, Hallvord R. M. Steen <hallvord at opera.com> wrote:
Apologies if this has already been covered, I tried googling but found only tangentially related stuff about "/regexp/()" syntax.
There are a few parts of the regexp syntax that wouldn't mind a look-over.
My two primary peeves are that look-aheads are Atoms, not Assertions, and that back-references to captures occurring later in the source are valid.
The only difference between an Atom and an Assertion is that the former can have a quantifier attached. There is absolutely no reason to put a quantifier on a look-ahead, and look-aheads are zero-width matches just like all assertions, so they would fit much better as assertions. Changing the grammar to make look-aheads actual assertions wouldn't even require implementations to change. It would just change quantified look-aheads from being standard to being an extension, like so many other things in regexps already are. (The feature was only added to JSC recently - I'm guessing nobody had needed it).
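To illustrate the point about quantified look-aheads, a small sketch, assuming an engine that accepts a quantifier on a look-ahead in ordinary (non-Unicode-mode) patterns. The patterns are illustrative and not taken from the thread.

```javascript
// A plain look-ahead is an assertion: it must hold for the match to succeed.
const plainLookahead = /^(?=b)a/;
console.log(plainLookahead.test("a")); // false

// With a quantifier that allows zero repetitions, the look-ahead can simply
// be skipped, so it no longer constrains the match.
const quantifiedLookahead = /^(?=b)*a/;
console.log(quantifiedLookahead.test("a")); // true
```

The quantified form parses, but the quantifier only makes the assertion optional.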
The problem with back-references is that the requirement prevents a one-pass parser, because you need to scan the entire regexp to know whether a decimal escape is valid. Well, actually it wouldn't be a problem if you didn't want to be compatible with all the current implementations that treat invalid decimal escapes as octal escapes - so you need to know whether a given decimal sequence is a valid back-reference in order to parse it as octal if it isn't valid. At least IE6 actually limits the valid back-references to the captures that were started previous to the back-reference in the source. That's a reasonable approach from a parsing perspective (I'd be happy if that was what was required), but really you only need to be able to reference captures that can be completed at the point where they occur, i.e., where both the start and end parentheses of the capture being referenced occur prior to the back-reference in the source.
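The parsing problem can be seen in a short sketch, assuming an engine that follows the web-compatible behavior described above (an out-of-range decimal escape is read as an octal escape). The same `\1` means two different things depending on whether a capture group appears anywhere in the pattern, including after the escape. Variable names are illustrative.

```javascript
// No capture groups at all: \1 is not a valid back-reference, so it is
// treated as an octal escape for the character U+0001.
const octalFallback = /^\1$/;
console.log(octalFallback.test("\u0001")); // true
console.log(octalFallback.test("1"));      // false

// A group appears later in the source: \1 is now a back-reference. The group
// has not captured anything when \1 is reached, so it matches the empty string.
const laterGroup = /^\1(a)$/;
console.log(laterGroup.test("a"));       // true
console.log(laterGroup.test("\u0001a")); // false
```

A parser cannot pick between the two readings of `\1` until it has counted every capture group in the pattern.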