this vs thi\u0073

# Mike Samuel (13 years ago)

I was mucking around with some tests of how interpreters deal with identifiers whose decoded value is the same as a reserved keyword.

Many interpreters seem to diverge wildly from the spec and from each other w.r.t. encoded versions of the keyword "this".

I don't see any relationship between the identifier "this" and ThisBinding in the spec, so I believe the below tests should all pass. Am I missing something?

The testcases are available at google-caja.googlecode.com/svn/trunk/doc/html/identifier-keyword-confusion.html and are reproduced below for easy quoting:

  assertTrue("this" === "thi\u0073");

  (function () {
    var thi\u0073 = 42;
    assertFalse(this === 42);
    assertTrue(thi\u0073 === 42);
  }());

  (function (thi\u0073) {
    assertFalse(this === 42);
    assertTrue(thi\u0073 === 42);
  }(42));

  (function thi\u0073() {
    assertFalse("function" === typeof this);
    assertTrue("function" === typeof thi\u0073);
  }());

  var called = false, bodyReached = false;
  var \u0069\u0066 = function (x) { assertFalse(x); called = true; };

  assertTrue("if" === "\u0069\u0066");

  \u0069\u0066(false)
  {
    bodyReached = true;
  }
  assertTrue(called);
  assertTrue(bodyReached);

# Oliver Hunt (13 years ago)

Without looking at the spec but just based on what I know our lexer (JSC) does, we won't consider any identifier with escaped characters to be a keyword, so yes you could make identifiers that technically match language keywords, but you'd never be able to use them directly (without escaping).

# Mike Samuel (13 years ago)

2011/6/20 Oliver Hunt <oliver at apple.com>:

Without looking at the spec but just based on what I know our lexer (JSC) does, we won't consider any identifier with escaped characters to be a keyword, so yes you could make identifiers that technically match language keywords, but you'd never be able to use them directly (without escaping).

You can due to global object aliasing.

  var thi\u0073 = 42;
  alert([this.this === 42,       // Look ma, no escaping!
         this["this"] === 42]);  // Look ma, still no escaping!

works in Firefox.

The first is legal because MemberExpression is defined as

  MemberExpression . IdentifierName

not

  MemberExpression . Identifier

and it is only Identifier that is not allowed to be a reserved word:

  Identifier :: IdentifierName but not ReservedWord
  IdentifierName :: IdentifierStart | IdentifierName IdentifierPart
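As a minimal sketch of the IdentifierName point, the snippet below uses reserved words as property names with no escaping at all. The object `obj` and its values are hypothetical and only illustrate the grammar; it assumes an ES5-or-later engine.

```javascript
// Reserved words are allowed wherever the grammar asks for IdentifierName:
// after a dot and as an object-literal property name (ES5 and later).
var obj = { if: 1, this: 2 };   // hypothetical object, for illustration only

var viaDot = obj.if + obj.this;             // dot access with reserved words
var viaBracket = obj["if"] + obj["this"];   // the same properties via strings

console.log(viaDot === viaBracket);
```

This does not depend on the global-object aliasing above, so it sidesteps the question of whether `var thi\u0073` is accepted in the first place.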

# Luke Hoban (13 years ago)

My read of the spec is that thi\u0073 is a ReservedWord and should not be allowed as an Identifier. So the following part of the examples quoted below should be an early error:

var thi\u0073 = 42;

The text in 7.6 seems to address this with:

"All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters."

Luke

# Allen Wirfs-Brock (13 years ago)

On Jun 21, 2011, at 12:50 AM, Luke Hoban wrote:

My read of the spec is that thi\u0073 is a ReservedWord and should not be allowed as an Identifier. So the following part of the examples quoted below should be an early error:

var thi\u0073 = 42;

The text in 7.6 seems to address this with:

"All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters."

Luke

+1

I can't dig into the spec. right now but the section Luke quoted is the one I was going to try to dig for.

# Geoffrey Sneddon (13 years ago)

On 21/06/11 00:50, Luke Hoban wrote:

My read of the spec is that thi\u0073 is a ReservedWord and should not be allowed as an Identifier. So the following part of the examples quoted below should be an early error:

var thi\u0073 = 42;

The text in 7.6 seems to address this with:

"All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters."

There's also wiki.whatwg.org/wiki/Web_ECMAScript#Identifiers

which states (after saying "this is very rough"):

"Identifiers containing escape sequences are not equivalent to fully unescaped identifiers in the case that, after fully unescaping identifier, it is a ReservedWord. In particular it is possible to create Identifiers that unescape to a reserved word so long as at least one character is fully escaped. Subsequent use of such identifiers must also have at least one character escaped (otherwise the reserved word will be used instead) but it need not be the same character(s) as that originally used to create the identifier."

# Mike Samuel (13 years ago)

2011/6/20 Luke Hoban <lukeh at microsoft.com>:

My read of the spec is that thi\u0073 is a ReservedWord and should not be allowed as an Identifier. So the following part of the examples quoted below should be an early error:

var thi\u0073 = 42;

The text in 7.6 seems to address this with:

"All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters."

I don't think this means what you think it means.

I think this means that the identifier foo is the same identifier as f\u006fo.

But in

if (false) alert(1)

"if" is not an identifier, whereas in

\u0069f(false) alert(1)

\u0069f is an identifier which would be the same as the identifier "if" if "if" could appear as an identifier unescaped.

Since "this" is not an identifier but appears explicitly in the text, that led me to assume that "thi\u0073" should behave differently from "this".

Every browser I have tested treats the identifier "i\u0066" distinctly from the keyword "if". Every browser I have tested treats the identifier "thi\u0073" distinctly from the keyword "this" in at least some cases, but only FF passes all the tests I've come up with.
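The uncontroversial half of this reading, that foo and f\u006fo name the same identifier, can be checked directly. This is only a sketch; the variable name `sameBinding` is made up for the example, and it deliberately avoids reserved words so it should parse everywhere.

```javascript
// The escape contributes the character "o", so this declares the variable "foo".
var f\u006fo = 42;

// Escaped and unescaped spellings refer to one and the same binding.
var sameBinding = (foo === 42) && (f\u006fo === fo\u006f);

console.log(sameBinding);
```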

# Allen Wirfs-Brock (13 years ago)

ES5.1: 7.6.1 Reserved Words

A reserved word is an IdentifierName that cannot be used as an Identifier.

7.6 Identifier Names and Identifiers

Identifier Names are tokens that are interpreted according to the grammar given in the “Identifiers” section of chapter 5 of the Unicode standard, with some small modifications. An Identifier is an IdentifierName that is not a ReservedWord (see 7.6.1). The Unicode identifier grammar is based on both normative and informative character categories specified by the Unicode Standard. The characters in the specified categories in version 3.0 of the Unicode standard must be treated as in those categories by all conforming ECMAScript implementations.

This standard specifies specific character additions: The dollar sign ($) and the underscore (_) are permitted anywhere in an IdentifierName.

Unicode escape sequences are also permitted in an IdentifierName, where they contribute a single character to the IdentifierName, as computed by the CV of the UnicodeEscapeSequence (see 7.8.4). The \ preceding the UnicodeEscapeSequence does not contribute a character to the IdentifierName. A UnicodeEscapeSequence cannot be used to put a character into an IdentifierName that would otherwise be illegal. In other words, if a \ UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's CV, the result must still be a valid IdentifierName that has the exact same sequence of characters as the original IdentifierName. All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters.

The red text would seem to say that \u0069f and if are the same reserved word.

This may not match implementations but it is what the spec. says.

ES3 didn't distinguish IdentifierName from Identifier but from a quick scan of the ES3 language I don't see that the spec. is any different in this regard.

Also, given the pervasive substitution of Unicode escape sequences I don't see why they shouldn't be legal in reserved words.

# Mike Samuel (13 years ago)

2011/6/21 Allen Wirfs-Brock <allen at wirfs-brock.com>:

ES5.1:

7.6.1   Reserved Words

A reserved word is an IdentifierName that cannot be used as an Identifier.

7.6    Identifier Names and Identifiers

Identifier Names are tokens that are interpreted according to the grammar given in the “Identifiers” section of chapter 5 of the Unicode standard, with some small modifications. An Identifier is an IdentifierName that is not a ReservedWord (see 7.6.1). The Unicode identifier grammar is based on both normative and informative character categories specified by the Unicode Standard. The characters in the specified categories in version 3.0 of the Unicode standard must be treated as in those categories by all conforming ECMAScript implementations.

This standard specifies specific character additions: The dollar sign ($) and the underscore (_) are permitted anywhere in an IdentifierName.

Unicode escape sequences are also permitted in an IdentifierName, where they contribute a single character to the IdentifierName, as computed by the CV of the UnicodeEscapeSequence (see 7.8.4). The \ preceding the UnicodeEscapeSequence does not contribute a character to the IdentifierName. A UnicodeEscapeSequence cannot be used to put a character into an IdentifierName that would otherwise be illegal. In other words, if a \ UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's CV, the result must still be a valid IdentifierName that has the exact same sequence of characters as the original IdentifierName. All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters.

The red text would seem to say that \u0069f and if are the same reserved word. This may not match implementations but it is what the spec. says. ES3 didn't distinguish IdentifierName from Identifier but from a quick scan of the ES3 language I don't see that the spec. is any different in this regard. Also, given the pervasive substitution of Unicode escape sequences I don't see why they shouldn't be legal in reserved words.

"An Identifier is an IdentifierName that is not a ReservedWord (see 7.6.1)" seems to be a non-normative reference to the normative 7.6 production (the "see 7.6.1" is just a reference to the definition of ReservedWord)

Identifier :: IdentifierName (but not ReservedWord)

Your interpretation assumes that the "but not ReservedWord" language in the Identifier production applies after the identifier has been decoded but that is not at all clear to me.

From the section you quoted, "Identifier Names are tokens", i.e. sequences of SourceCharacters, so the token "\u0069f" is clearly distinct from the token "if".

Since the normative "but not ReservedWord" appears in a lexical token grammar it should apply at the token level before any interpretation happens.

Assuming your interpretation is correct though, \u0069f may be an IdentifierName corresponding to a reserved keyword, but since IfStatement is defined in terms of

IfStatement :
     if ( Expression ) Statement else Statement
     if ( Expression ) Statement

where "if" appears literally instead of any reference to an IdentifierName whose decoded value is "if", would you agree that

\u0069f(false) alert(1);

is not a valid EcmaScript program? It should definitely not be interpreted as an EcmaScript program containing an IfStatement.

In that case, we still have at least 3 (haven't tested IE) of 4 major browsers agreeing that the illegal EcmaScript program

this.\u0069\u0066 = function () { alert("called \u0069\u0066"); };

\u0069\u0066(false)
alert(1);

should be interpreted as a call via the reference "if" followed by a call via the reference "alert".

# Mike Samuel (13 years ago)

2011/6/21 Mike Samuel <mikesamuel at gmail.com>:

In that case, we still have at least 3 (haven't tested IE) of 4 major browsers agreeing that the illegal EcmaScript program

this.\u0069\u0066 = function () { alert("called \u0069\u0066"); };

\u0069\u0066(false)
alert(1);

should be interpreted as a call via the reference "if" followed by a call via the reference "alert".

I tested on IE 7 and IE 8 and they both reject the above.

# Brendan Eich (13 years ago)

I think some engines just have a bug to fix here, nothing more. :-/

# Mike Samuel (13 years ago)

2011/6/21 Brendan Eich <brendan at mozilla.com>:

I think some engines just have a bug to fix here, nothing more. :-/

I think the spec could be clearer as to whether "but not ReservedWord" applies before or after the IdentifierName is decoded.

I'm happy to file bugs if people tell me what the bug is.

Is it

(A) Failing to treat a reserved word with escaped characters as a syntax error where an Identifier is expected.

var \u0069\u0066

should be a syntax error.

(B) Failing to treat an identifier whose decoded IdentifierName is a reserved word as distinct from the keyword.

eval("var thi\\u0073; this !== thi\\u0073")

should be true.
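One way to see which of (A) or (B) a given engine implements is to probe its parser. The sketch below is an assumption-laden helper, not part of the original test page: `parses` and `probes` are hypothetical names, and it relies on `new Function` throwing a SyntaxError for source the engine rejects. Results will differ between engines and versions, so no expected output is given.

```javascript
// Returns true when the engine accepts `source` as a function body,
// false when parsing throws a SyntaxError.
function parses(source) {
  try {
    new Function(source);   // parses only; the body is never called
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;                // anything else is unexpected, so rethrow
  }
}

// Backslashes are doubled so the probed source contains a real \uXXXX escape.
var probes = {
  "var \\u0069\\u0066": parses("var \\u0069\\u0066;"),
  "var thi\\u0073": parses("var thi\\u0073;"),
  "\\u0069f (false) {}": parses("\\u0069f (false) {}")
};

console.log(probes);
```

An engine following (A) would report false for the first two probes; an engine following (B) would report true and then has to keep the identifier distinct from the keyword.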

# Lasse Reichstein (13 years ago)

On Tue, Jun 21, 2011 at 7:23 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

ES5.1: 7.6.1 Reserved Words

A reserved word is an IdentifierName that cannot be used as an Identifier.

Descriptive.


7.6 Identifier Names and Identifiers

Identifier Names are tokens that are interpreted according to the grammar given in the “Identifiers” section of chapter 5 of the Unicode standard, with some small modifications. An Identifier is an IdentifierName that is not a ReservedWord (see 7.6.1).

...

Unicode escape sequences are also permitted in an IdentifierName, where they contribute a single character to the IdentifierName, as computed by the CV of the UnicodeEscapeSequence (see 7.8.4). The \ preceding the UnicodeEscapeSequence does not contribute a character to the IdentifierName. A UnicodeEscapeSequence cannot be used to put a character into an IdentifierName that would otherwise be illegal. In other words, if a \ UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's CV, the result must still be a valid IdentifierName that has the exact same sequence of characters as the original IdentifierName.

No problem here, as i\u0066 is definitely an IdentifierName (so is "if").

ES3 didn't distinguish IdentifierName from Identifier but from a quick scan of the ES3 language I don't see that the spec. is any different in this regard.

Except that there was no syntactic category containing i\u0066 in ES3, but there is in ES5 (IdentifierName).

Also, given the pervasive substitution of Unicode escape sequences I don't see why they shouldn't be legal in reserved words.

I used to think so too, mainly based on reading ES3, but after reading this thread a few times, I can now argue either way.

Identifier ::= IdentifierName but not ReservedWord

Since i\u0066 is an IdentifierName (it's valid to write o.i\u0066 and {i\u0066: 42}), and it's not a ReservedWord (there are no escapes in the ReservedWord definition), it must be an Identifier.

In ES3, the resulting character sequence had to be a valid Identifier. That meant that i\u0066 would not qualify, because "if" was not a valid identifier. In ES5, as quoted above, it just has to be a valid IdentifierName, which "if" is.

On the other hand, every other use of IdentifierName refers to its interpreted value, after escape replacement, so that would mean that i\u0066 is not an Identifier, because its value as an IdentifierName is "if", which is a ReservedWord. That also means that there is no syntactic category for i\u0066 except IdentifierName. It's neither an Identifier, nor a keyword (because escapes are not allowed in keywords).

So it all still boils down to whether the "but not" check is performed before or after interpretation of the input character sequence, i.e., before or after replacing escape sequences.
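The two readings can be put side by side in a small sketch. Everything here is illustrative: `RESERVED` is only a subset of the ES5 reserved words, the function names are made up, and both checks assume the input has already been recognised as a valid IdentifierName.

```javascript
// Illustrative subset of the ES5 reserved words, not the full list.
var RESERVED = ["if", "this", "var", "function", "else", "typeof"];

// Replace each \uXXXX escape with the character it denotes.
function decode(raw) {
  return raw.replace(/\\u([0-9a-fA-F]{4})/g, function (_, hex) {
    return String.fromCharCode(parseInt(hex, 16));
  });
}

// Reading 1: "but not ReservedWord" is checked against the raw token text.
function isIdentifierBeforeDecoding(raw) {
  return RESERVED.indexOf(raw) === -1;
}

// Reading 2: the check is applied to the decoded IdentifierName.
function isIdentifierAfterDecoding(raw) {
  return RESERVED.indexOf(decode(raw)) === -1;
}

var token = "i\\u0066";   // the raw source text i\u0066

console.log(isIdentifierBeforeDecoding(token)); // true: raw text is not "if"
console.log(isIdentifierAfterDecoding(token));  // false: decoded text is "if"
```

Under the first reading i\u0066 is an Identifier; under the second it is an IdentifierName that is neither an Identifier nor a keyword.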