invalid escape sequences

# Mike Samuel (15 years ago)

During the last meeting, the semantics of "\z" came up.  Specifically,
what does \ followed by a character not in the set with a specified
escape expand to?

From 7.8.4 StringLiteral

    "
    EscapeSequence :: CharacterEscapeSequence
    "

leads to

    "
    CharacterEscapeSequence :: ...
        NonEscapeCharacter

    NonEscapeCharacter :: SourceCharacter but not one of EscapeCharacter or LineTerminator
    "

and the semantics of NonEscapeCharacter is given thus

    "
    The CV of CharacterEscapeSequence :: NonEscapeCharacter is the CV of the NonEscapeCharacter.
    "

so are the following assertions true?

(1)

The only SourceCharacter sequences that do not match (
DoubleStringCharacter | SingleStringCharacter ) applied zero or more
times are those containing a LineTerminator preceded by an even number
of backslashes, "u" preceded by an odd number of backslashes and not
followed by 4 valid hex digits, "x" preceded by an odd number of
backslashes and not followed by 2 valid hex digits, or a decimal digit
preceded by an odd number of backslashes.
I.e. /(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[0-9])/
tests whether a sequence of SourceCharacters matches zero or more (
DoubleStringCharacter | SingleStringCharacter ).
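
A minimal sketch of how the pattern in (1) behaves on a few literal bodies. It assumes the pattern is written with its outer group closed so that it parses, and that a match means the body contains one of the sequences (1) calls invalid. The helper name hasInvalidEscape and the sample list are illustrative only, not anything from the spec.

```javascript
// Pattern from assertion (1), with the outer group closed so it parses.
// A match means the body contains a sequence that (1) says cannot be
// part of a valid string literal.
var INVALID_BODY = /(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[0-9])/;

// Hypothetical helper: `body` is the text between the quotes.
function hasInvalidEscape(body) {
  return INVALID_BODY.test(body);
}

// Each sample is the raw body, so "\\z" here is backslash followed by z.
var samples = ["abc", "\\z", "\\\\u", "\\u0041", "\\u12", "\\x4", "\\8", "a\nb"];
for (var i = 0; i < samples.length; ++i) {
  console.log(JSON.stringify(samples[i]) + " : " + hasInvalidEscape(samples[i]));
}
```

Under this pattern only the last four samples are flagged. "\\z" passes as a NonEscapeCharacter, and an escaped backslash followed by "u" is not treated as the start of a unicode escape.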

(2)

The B.1.2 additional octal syntax, quoted below, does change the
validity of the test above.
    "
    OctalEscapeSequence ::
        OctalDigit [lookahead not in DecimalDigit]
        ZeroToThree OctalDigit [lookahead not in DecimalDigit]
        FourToSeven OctalDigit
        ZeroToThree OctalDigit OctalDigit
    "

NonEscapeCharacter excludes DecimalDigit through EscapeCharacter
but OctalEscapeSequence allows [0-7].  So under B.1.2,
/(?:^|[^\\])(?:\\\\)*([\r\n\u2028\u2029]|\\u(?![0-9A-Fa-f]{4})|\\x(?![0-9A-Fa-f]{2})|\\[89]|\\[0-3][0-7]?(?![89])|\\[4-7](?![89]))/
tests whether a sequence of SourceCharacters matches zero or more (
DoubleStringCharacter | SingleStringCharacter ).



I did some empirical testing to see what is actually allowed by
running the below in a variety of browsers in the squarefree shell.

var notStringLiterals = [ "\r", "\\u", "\\x", "\\8", "\\28", "\\228",
"\\3778", "\\478", "\\778" ];
for (var i = 0; i < notStringLiterals.length; ++i) {
  var result;
  try {
    result = eval('"' + notStringLiterals[i] + '"');
  } catch (ex) {
    result = "ERROR";
  }
  print(JSON.stringify(notStringLiterals[i]) + " : " + JSON.stringify(result));
}

All are invalid absent B.1.2 if the assertions above are true.  With
B.1.2, "\3778", "\478", and "\778" are valid.
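
To make the B.1.2 claim concrete, here is a rough sketch that applies the four OctalEscapeSequence alternatives, lookahead restrictions included, to the text following a backslash. The helper name matchOctalEscape is hypothetical, and [0-9] stands in for DecimalDigit.

```javascript
// The four B.1.2 alternatives, longest first.
// (?![0-9]) models "[lookahead not in DecimalDigit]".
var OCTAL_ESCAPE = /^(?:[0-3][0-7][0-7]|[4-7][0-7]|[0-3][0-7](?![0-9])|[0-7](?![0-9]))/;

// Hypothetical helper: `afterBackslash` is the text following a backslash.
// Returns the digits consumed by OctalEscapeSequence, or null when none
// of the four alternatives applies.
function matchOctalEscape(afterBackslash) {
  var m = OCTAL_ESCAPE.exec(afterBackslash);
  return m ? m[0] : null;
}

var inputs = ["8", "28", "228", "3778", "478", "778"];
for (var i = 0; i < inputs.length; ++i) {
  console.log("\\" + inputs[i] + " : " + matchOctalEscape(inputs[i]));
}
```

Read this way, "\8", "\28", and "\228" match no alternative, while "\3778", "\478", and "\778" consume "377", "47", and "77" respectively and leave the trailing "8" as an ordinary character.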

I'm having trouble running IE today, but on other browsers, in
alphabetical order:

Chrome
"\r" : "ERROR"
"\\u" : "u"
"\\x" : "x"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"


FF3
"\u000d" : "ERROR"
"\\u" : "u"
"\\x" : "x"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"


Safari
"\r" : "ERROR"
"\\u" : "u"
"\\x" : "x"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"


So at least 3 different interpreter strains evaluate "\u" to "u", "\x"
to "x", and "\8" to "8", and don't care whether there is a decimal digit
after an octal escape sequence.  All reject unescaped newlines in
string literals.
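
The octal results above are consistent with a decoder that takes the longest run of octal digits whose value fits in 0 to 255 and ignores the B.1.2 lookahead restrictions. A small sketch under that assumption follows. The helper name decodeLegacyOctal is made up for illustration, and this is a guess at the behaviour from the outputs, not a description of any engine's actual code.

```javascript
// Longest-match octal escape: up to three digits when the first is 0-3,
// up to two digits when the first is 4-7. No lookahead check afterwards.
var GREEDY_OCTAL = /^(?:[0-3][0-7]{0,2}|[4-7][0-7]?)/;

// Hypothetical helper: `afterBackslash` is the text following a backslash.
// Returns the decoded character plus the untouched remainder, or null
// when the text does not start with an octal digit.
function decodeLegacyOctal(afterBackslash) {
  var m = GREEDY_OCTAL.exec(afterBackslash);
  if (!m) return null;
  var ch = String.fromCharCode(parseInt(m[0], 8));
  return ch + afterBackslash.slice(m[0].length);
}

var inputs = ["28", "228", "3778", "478", "778"];
for (var i = 0; i < inputs.length; ++i) {
  console.log(JSON.stringify("\\" + inputs[i]) + " : " +
              JSON.stringify(decodeLegacyOctal(inputs[i])));
}
```

For these five inputs the sketch reproduces the values reported above: octal 377 is 255 ("ÿ"), 47 is 39 ("'"), 77 is 63 ("?"), and 2 and 22 are U+0002 and U+0012.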


I would like to be able to specify quasiliteral literal part decoding
in terms of the SV defined in 7.8.4.  If user code is going to have
decoded literal parts available when they validly decode, but at least
have access to the raw literal parts otherwise, then it would be good
for them to be consistently available across interpreters.  Would it
be worthwhile having the SV and CV in 7.8.4 specify the decoding of
some SourceCharacter sequences that can't actually reach the SV or CV
via the StringLiteral production?
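
One possible shape for "decoded when valid, raw otherwise", as a sketch only. The function name decodeLiteralPart and the property names raw and cooked are hypothetical. The sketch follows the 7.8.4 escapes without B.1.2, accepts "\0" when no digit follows (as 7.8.4 does), and gives up on the decoded form as soon as it meets an escape with no defined CV.

```javascript
// SingleEscapeCharacter values from 7.8.4.
var SINGLE = { b: "\b", t: "\t", n: "\n", v: "\v", f: "\f", r: "\r",
               '"': '"', "'": "'", "\\": "\\" };

// Hypothetical helper: returns { raw, cooked }, where cooked is undefined
// when the part contains an escape that 7.8.4 does not define.
function decodeLiteralPart(raw) {
  var cooked = "";
  for (var i = 0; i < raw.length; ++i) {
    var ch = raw.charAt(i);
    if (ch !== "\\") { cooked += ch; continue; }
    var next = raw.charAt(++i);   // "" if the backslash is the last char
    var rest = raw.slice(i + 1);
    var hex;
    if (next === "u" && (hex = /^[0-9A-Fa-f]{4}/.exec(rest))) {
      cooked += String.fromCharCode(parseInt(hex[0], 16));
      i += 4;
    } else if (next === "x" && (hex = /^[0-9A-Fa-f]{2}/.exec(rest))) {
      cooked += String.fromCharCode(parseInt(hex[0], 16));
      i += 2;
    } else if (next === "0" && !/^[0-9]/.test(rest)) {
      cooked += "\0";             // \0 not followed by a decimal digit
    } else if (Object.prototype.hasOwnProperty.call(SINGLE, next)) {
      cooked += SINGLE[next];
    } else if (next === "" || /[ux0-9]/.test(next)) {
      return { raw: raw, cooked: undefined };  // no defined CV
    } else if (/[\r\n\u2028\u2029]/.test(next)) {
      // LineContinuation contributes nothing; swallow CR LF as one unit.
      if (next === "\r" && raw.charAt(i + 1) === "\n") ++i;
    } else {
      cooked += next;             // NonEscapeCharacter, e.g. \z
    }
  }
  return { raw: raw, cooked: cooked };
}

console.log(JSON.stringify(decodeLiteralPart("a\\u0041b")));
console.log(JSON.stringify(decodeLiteralPart("\\u")));
```

With this shape, user code can test for cooked === undefined and fall back to raw, which stays available either way.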

# Mike Samuel (15 years ago)

2011/5/31 Mike Samuel <mikesamuel at gmail.com>:
> I'm having trouble running IE today, but on other browsers, in
> alphabetical order:

IE 7 loves me but apparently hates \u.

"\r" : "ERROR"
"\\u" : "ERROR"
"\\x" : "ERROR"
"\\8" : "8"
"\\28" : "\u00028"
"\\228" : "\u00128"
"\\3778" : "ÿ8"
"\\478" : "'8"
"\\778" : "?8"

# Dave Fugate (15 years ago)

Results for IE9 ("IE9 standards" mode) given Mike's snippet:
	"\r" : "ERROR"
	"\\u" : "ERROR"
	"\\x" : "ERROR"
	"\\8" : "8"
	"\\28" : "\u00028"
	"\\228" : "\u00128"
	"\\3778" : "ÿ8"
	"\\478" : "'8"
	"\\778" : "?8"

My best,

Dave
