RegExp syntax suggestion: allow CharacterClassEscape in CharacterRange
On Wed, 08 Dec 2010 21:43:06 +0100, Gavin Barraclough
<barraclough at apple.com> wrote:
According to the ES5 spec a regular expression such as /[\w-_]/ should
generate a syntax error. Unfortunately there appears to be a
significant quantity of existing code that will break if this behavior
is implemented (I have been experimenting with bringing WebKit's RegExp
implementation into closer conformance to the spec), and looking at
other implementations it appears common for this error to be ignored.
It's far from the only extension to RegExp syntax that is common to most implementations. In fact, the extensions are both extensive and consistent across browsers. A quick check through the possible syntax errors show the following:
// Invalid ControlEscape/IdentityEscape character treated as literal. /\z/; // Invalid escape, same as /z/ // Incomplete/Invalid ControlEscape treated as either "\c" or "c" /\c/; // same as /c/ or /\c/ /\c2/; // same as /c2/ or /\c2/ // Incomplete HexEscapeSequence escape treated as either "\x" or "x". /\x/; // incomplete x-escape /\x1/; // incomplete x-escape /\x1z/; // incomplete x-escape // Incomplete UnicodeEscapeSequence escape treated as either "\u" or "u". /\u/; // incomplete u-escape /\uz/; // incomplete u-escape /\u1/; // incomplete u-escape /\u1z/; // incomplete u-escape /\u12/; // incomplete u-escape /\u12z/; // incomplete u-escape /\u123/; // incomplete u-escape /\u123z/; // incomplete u-escape // Bad quantifier range: /x{z/; // same as /x{z/ /x{1z/; // same as /x{1z/ /x{1,z/; // same as /x{1,z/ /x{1,2z/; // same as /x{1,2z/ /x{10000,20000z/; // same as /x{10000,20000z/ // Notice: It needs arbitrary lookahead to determine the invalidity, // except Mozilla that limits the numbers.
// Zero-initialized Octal escapes. /\012/; // same as /\x0a/
// Nonexisting back-references treated as octal escapes: /\5/; // same as /\x05/
// Invalid PatternCharacter accepted unescaped /]/; /{/; /}/;
// Bad escapes also inside CharacterClass. /[\z]/; /[\c]/; /[\c2]/; /[\x]/; /[\x1]/; /[\x1z]/; /[\u]/; /[\uz]/; /[\u1]/; /[\u1z]/; /[\u12]/; /[\u12z]/; /[\u123]/; /[\u123z]/; /[\012]/; /[\5]/; // And in addition: /[\B]/; /()()[\2]/; // Valid backreference should be invalid.
None of these RegExps cause a syntax error in any of the current "top-5"
browsers,
even though they are (AFAICS) invalid syntax.
Most of the RegExps treat a malformed (start of a multi-character) escape
sequence
as a simple identity escape or octal escape, and extends identity escapes
to all characters
that doesn't already have another meaning (ControlEscape,
CharacterClassEscape or
one of c, x, u, or b, and B outside a CharacterClass).
To match the current behavior, IdentityEscape shouldn't exclude all of
IdentifierPart,
but only the characters that already mean something else.
Allowing /\c2/ to match "c2", but requiring /\CB/ to match "\x02" seems
like it would
be better explained in prose than in the BNF.
...
I'd like to propose a minimal change to hopefully allow implementations
to come into line with the spec, without breaking the web. I'd suggest
changing the first step of CharacterRange to instead read:
- If A does not contain exactly one character or B does not contain
exactly one character then create a CharSet AB containing the union of
the CharSets A and B, and return the union of CharSet AB and the CharSet
containing the one character -.
I think this matches the current actual behavior of all the browsers, and
is
short and understandable.
/Lasse R.H. Nielsen
Thank you for the comprehensive list!
In all these cases we should perhaps first attempt to determine whether these could just be considered spec compliance issues that ought to be fixed in implementations. If retaining compatibility with existing code is going to require continued support for syntax beyond that currently covered by the spec, then it would seem beneficial to try to add this. I'd be interested in discussing some proposals to address the issues you've listed.
On Dec 9, 2010, at 12:33 PM, Lasse Reichstein wrote:
// Invalid ControlEscape/IdentityEscape character treated as literal. /\z/; // Invalid escape, same as /z/
To match the current behavior, IdentityEscape shouldn't exclude all of IdentifierPart, but only the characters that already mean something else.
Agreed - I expect this change will be necessary to retain compatibility with existing code. String literals define all source characters not used in escape sequences - bar decimal digits - to be valid identity escapes (NonEscapeCharacter), so it would only seem consistent to do the same for RegExps. A minimal but slightly messy fix would be:
IdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor CharacterClassEscape nor any of: b B c x u
ClassEscape :: DecimalEscape b B CharacterEscape CharacterClassEscape
Adding semantics such that ClassEscape :: B generates the character set { B }. A slightly cleaner solution may be to restructure this bit of BNF to give ClassEscapes and AtomEscapes separate copies of IdentityEscape such that special handling for /[\B]/ is not necessary:
AtomEscape :: DecimalEscape CharacterEscape CharacterClassEscape AtomIdentityEscape
AtomIdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor CharacterClassEscape nor any of: b B c x u
CharacterEscape :: ControlEscape c ControlLetter HexEscapeSequence UnicodeEscapeSequence
ClassEscape :: DecimalEscape b CharacterEscape CharacterClassEscape ClassIdentityEscape
ClassIdentityEscape :: SourceCharacter but not DecimalDigit nor ControlEscape nor CharacterClassEscape nor any of: b c x u
Even if we did not need to liberalize IdentityEscape to this extent, the current definition in the spec seems a little unworkable, particularly since it appears to prohibit /$/. Defining IdentityEscape in terms of IdentifierPart also appears to have the consequence of requiring users to remember which set of punctuation falls within the UnicodeConnectorPunctuation set to correctly determine which punctuation marks they can validly escape this way! I would suggest a still restrictive, but more practical, definition for IdentityEscape would be:
IdentityEscape :: SourceCharacter but not ControlLetter or DecimalDigit
// Incomplete/Invalid ControlEscape treated as either "\c" or "c" /\c/; // same as /c/ or /\c/ /\c2/; // same as /c2/ or /\c2/ // Incomplete HexEscapeSequence escape treated as either "\x" or "x". /\x/; // incomplete x-escape /\x1/; // incomplete x-escape /\x1z/; // incomplete x-escape // Incomplete UnicodeEscapeSequence escape treated as either "\u" or "u". /\u/; // incomplete u-escape /\uz/; // incomplete u-escape /\u1/; // incomplete u-escape /\u1z/; // incomplete u-escape /\u12/; // incomplete u-escape /\u12z/; // incomplete u-escape /\u123/; // incomplete u-escape /\u123z/; // incomplete u-escape
Most of the RegExps treat a malformed (start of a multi-character) escape sequence as a simple identity escape or octal escape, and extends identity escapes to all characters that doesn't already have another meaning (ControlEscape, CharacterClassEscape or one of c, x, u, or b, and B outside a CharacterClass).
Agreed - though I think it would be preferable to try generating the syntax error in these cases, if this is feasible.
Given that these invalid sequences are currently being handled with some inconsistency across browsers, this might a good hint that we can treat this as a compliance bug that we can just fix in implementations. Hex and unicode escapes are also present in string literals, and malformed escapes in string literals should also generate a syntax error (browser behaviour appears inconsistent here), and it would seem preferable to retain consistency between string & RegExp literals.
If we do need to be permissive in these cases, then I guess this requires up to four characters of lookahead. I believe invalid escapes - if not a syntax error - should be handled as per IdentityEscapes (i.e. the backslash is not retained - /\c/ as /c/, not /\c/).
Allowing /\c2/ to match "c2", but requiring /\CB/ to match "\x02" seems like it would be better explained in prose than in the BNF.
Well, if we're going to allow /\c2/ to match "c2" then it seems we should also allow /\c/ to match "c", and I think that will require changes to the BNF. Also, a prose explanation may be a little trickier than it appears at first blush, when you take quantifiers into account (if one trivially were to write semantics that would generate a nested pair of matchers for /\c2/ (to first match for { c }, then {2}), then I think this would result in /\c2*/ being handled as /(?:c2)*/ when the quantifier is applied - I'm sure there would be ways to avoid this, but would probably make a prose approach more complex). And this might not actually be too bad to deal with in BNF anyway - I believe this ought to work?:
CharacterEscape :: ControlEscape c [lookahead ∉ ControlLetter] c ControlLetter HexEscapeSequence UnicodeEscapeSequence IdentityEscape
Then we would just need to add semantics to define the result of c [lookahead ∉ ControlLetter] to be the character set { c }.
// Bad quantifier range: /x{z/; // same as /x{z/ /x{1z/; // same as /x{1z/ /x{1,z/; // same as /x{1,z/ /x{1,2z/; // same as /x{1,2z/ /x{10000,20000z/; // same as /x{10000,20000z/ // Notice: It needs arbitrary lookahead to determine the invalidity, // except Mozilla that limits the numbers.
Permitting open braces that don't form valid range quantifiers (per current browser implementations) leads to a couple of odd quirks:
- it seems inconsistent, given that syntax errors are generated for unclosed brackets and parentheses.
- it means that a correctly formed quantifier without an atom will throw a syntax error, but a malformed one will not. E.g. /{0}/ throws a syntax error, /{0{/ does not.
If we can't change implementations to adopt spec behavior here, then one possible simple solution (assuming it is web compatible) might be to add the following pattern to the 'Atom' rule:
Atom :: PatternCharacter . \ AtomEscape CharacterClass ( Disjunction ) ( ? : Disjunction ) { [lookahead ∉ DecimalDigit]
This would allow an unclosed brace in many situations, without requiring arbitrary lookahead, and would not be ambiguous with range quantifiers (since these always require the open brace to be followed by a decimal digit). The drawback I see with this approach would be that it might appear arbitrary to accept /{a/ but reject /{1/. This said, the spec already is more prone to generate syntax errors around decimal digits in literals, for example escaped decimal digits being a syntax error in string literals and RegExp character classes, so perhaps this would be okay.
// Zero-initialized Octal escapes. /\012/; // same as /\x0a/
I've only tested this one on a couple of browsers, but those tested permit any number of zeros, followed by any octal sequence with a decimal value up to 255. Here's a possible set of grammar rules that I think might be okay:
DecimalEscape :: Zeros OptionalLatin1OctalValue NonZeroDigit DecimalDigits
Zeros :: 0 Zeros 0
OptionalLatin1OctalValue :: [lookahead ∉ OctalDigit] OneToThree [lookahead ∉ OctalDigit] FourToSeven [lookahead ∉ OctalDigit] OneToThree OctalDigit [lookahead ∉ OctalDigit] FourToSeven OctalDigit OneToThree OctalDigit OctalDigit
OneToThree :: one of 1 2 3
FourToSeven :: one of 4 5 6 7
(Alternatively this syntax could be tightened up a little to restrict octal values to a single leading zero, if implementations can do so too).
// Nonexisting back-references treated as octal escapes: /\5/; // same as /\x05/
I really hope we can remove this 'feature' from implementations. I'm going to hold off for now on attempting to propose grammar. This one does look a little complex to write BNF for, I guess we can attampt a prose solution, though again this will run into problems in the case of quantifiers.
(OOI there is some inconsistency in how this is handled in browsers - for example, is /\777/ equivalent to /\x3f/ or /\x3f7/).
// Invalid PatternCharacter accepted unescaped /]/; /{/; /}/;
'{' is addressed above, fixing the spec to permit ']' and '}' would be easy, and would mean modifying this rule:
PatternCharacter :: SourceCharacter but not any of: ^ $ \ . * + ? ( ) [ ] { } |
However it would introduce something of an inconsistency if we permit unmatched ']' and '}" but not ')'.
- One possibility would be to relax both the spec and implementations, and allow all types of close brackets - ']' and '}" and ')' (though I think permitting ')' might require a little ugly duplication in the BNF).
- Depending on the resolution to bad quantifier ranges, above, another option might be to permit an unmatched close brace, but start generating a syntax error for unmatched close brackets (and retain the error for unmatched closed parentheses) - i.e. if we do permit unmatched open brace, then perhaps we should also be permitting unmatched close brace.
// Bad escapes also inside CharacterClass. /[\z]/; /[\c]/; /[\c2]/; /[\x]/; /[\x1]/; /[\x1z]/; /[\u]/; /[\uz]/; /[\u1]/; /[\u1z]/; /[\u12]/; /[\u12z]/; /[\u123]/; /[\u123z]/; /[\012]/; /[\5]/; // And in addition: /[\B]/; /()()[\2]/; // Valid backreference should be invalid.
In the case of hex and unicode escapes, I believe handling of ClassEscapes matches that for AtomEscapes, and this should remain the case.
I'm not sure about other browsers, but I am aware that Safari and Firefox both extend the syntax of control letters in character classes to include DecimalDigits and '_', so /[\c2]/ is accepted and is regarded as a control letter, unlike /\c2/ (which neither generates a syntax error, nor is treated as a control letter) - I believe that is both browsers /\c2/ is equivalent to /c2/ but /[\c2]/ is equivalent to /[\x12]/ (or /\x12/). Hopefully this extension can simply be removed – having incompatible syntax here between ClassEscapes and AtomEscapes seems unhelpful.
Since most browsers seem to permit decimal characters as an IdentityEscape, though these are permitted in string literals, so it would be a shame to add support if not strictly necessary. Additionally implementations permit octal escapes in ClassEscapes - we should keep this in sync with whatever decision we make for octal in AtomEscapes. Assuming some of the additional grammar proposed above (the BNF for octal, and splitting IdentityEscape for AtomEscape and ClassEscape) the following may be a workable solution if we need to support these features:
ClassEscape :: Zeros OptionalLatin1OctalValue b CharacterEscape CharacterClassEscape ClassIdentityEscape
ClassIdentityEscape :: SourceCharacter but not ControlEscape nor CharacterClassEscape nor any of: 0 b c x u
I'd be keen to hear your thoughts on all this. I'll continue investigating what parts of the extended syntax are truly necessary from a compatibility perspective, and expect I should start moving this conversation onto a wiki page at some point.
cheers, G.
p.s. One unrelated missing syntax error I've noticed is on quantified parenthetical assertions – browsers seem to permit /(?=)*/, which should be a syntax error.
According to the ES5 spec a regular expression such as /[\w-_]/ should generate a syntax error. Unfortunately there appears to be a significant quantity of existing code that will break if this behavior is implemented (I have been experimenting with bringing WebKit's RegExp implementation into closer conformance to the spec), and looking at other implementations it appears common for this error to be ignored.
The parsing of this expression matches a single NonemptyClassRanges of the form "ClassAtom - ClassAtom", where the first ClassAtom is a CharacterClassEscape and the second a SourceCharacter. Per section 15.10.2.15 of the spec this calls CharacterRange, resulting in this syntax error:
I'd like to propose a minimal change to hopefully allow implementations to come into line with the spec, without breaking the web. I'd suggest changing the first step of CharacterRange to instead read:
This is roughly equivalent to implicitly escaping the hyphen in any invalid range*, so /[\w-]/ is treated as /[\w-]/. I believe this change would have a low impact on the spec, that it should be feasible for implementors to easily adopt this behavior, and that this should commonly be compatible with existing code that is currently not spec compliant.
many thanks, Gavin
[ * However this is not exactly equivalent to treating the hyphen in an invalid ranges as having being escaped. Consider /[\d-a-z]/. Escaping the hyphen in the invalid range would give the expression /[\d-a-z]/, in which case a-z would be matched as a CharacterRange. This would arguably be a more intuitive interpretation of the expression, but changing the language to match this would require a more intrusive change to the grammar, which I'm assuming would not be desirable. ]