Unicode non-character being treated as space on Firefox/Chrome

# Gareth Heyes (7 years ago)

Not sure if this is a bug or not. The non-character U+FFFE is being treated as a space even though it's not defined as one. Edge and Safari treat it as an invalid character.

�alert�(1)�

In case the characters get mangled:

eval("alert"+String.fromCharCode(65534)+"(1)");
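
A slightly more defensive version of the same probe, as a minimal sketch: it reports whether the engine parses U+FFFE in that position instead of relying on a dialog appearing. The names `NONCHAR` and `acceptsNoncharacter` are hypothetical, and `Number` is swapped in for `alert` only so the snippet also runs outside a browser. The result depends on the engine, so no expected output is given.

```javascript
// U+FFFE, the same code point as String.fromCharCode(65534) above.
const NONCHAR = String.fromCharCode(0xFFFE);

// "Number" stands in for "alert" so the probe works outside a browser.
const source = "Number" + NONCHAR + "(1)";

// Hypothetical helper: true if the engine parses `src`, false on SyntaxError.
function acceptsNoncharacter(src) {
  try {
    // new Function parses the body; the resulting function is never called.
    new Function(src);
    return true; // parsed, so U+FFFE was tolerated (e.g. treated like white space)
  } catch (e) {
    if (e instanceof SyntaxError) return false; // rejected as an invalid character
    throw e; // anything else (e.g. a CSP restriction) is not what we are testing
  }
}

console.log(
  acceptsNoncharacter(source)
    ? "accepted: U+FFFE was tolerated by the parser"
    : "rejected: SyntaxError"
);
```

Running this in each browser console should make the difference between engines visible without depending on how the page or mail client encodes the character.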

# Michał Wadas (7 years ago)

I believe the Unicode specification makes this undefined behaviour.

In effect, noncharacters can be thought of as application-internal private-use code points. Unlike the private-use characters discussed in Section 16.5, Private-Use Characters, which are assigned characters and which are intended for use in open interchange, subject to interpretation by private agreement, noncharacters are permanently reserved (unassigned) and have no interpretation whatsoever outside of their possible application-internal private uses

www.unicode.org/versions/Unicode6.0.0/ch16.pdf
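
For reference, the set of noncharacters is small and fixed: U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF), 66 in total. A minimal sketch of a check for them follows; the function name `isNoncharacter` is hypothetical and not part of any standard API.

```javascript
// Hypothetical helper: true if `cp` is one of the 66 Unicode noncharacters.
function isNoncharacter(cp) {
  // Only integers in the Unicode code space can be code points at all.
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) return false;

  // The contiguous block of 32 noncharacters in the BMP.
  if (cp >= 0xFDD0 && cp <= 0xFDEF) return true;

  // The last two code points of every plane end in FFFE or FFFF.
  return (cp & 0xFFFE) === 0xFFFE;
}

console.log(isNoncharacter(0xFFFE)); // the code point from the original report
console.log(isNoncharacter(0xFEFF)); // the byte order mark, for comparison
```

Note that U+FEFF, the byte order mark, is a regular assigned character and is not in this set; only its byte-swapped form U+FFFE is.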

# Mark S. Miller (7 years ago)

What is the relevant EcmaScript standards text that would delegate to this? Even if Unicode implies an undefined case, EcmaScript should not. If EcmaScript behavior for such cases is undefined, we should define it.

# Gareth Heyes (7 years ago)

On 25 May 2017 at 14:04, Mark S. Miller <erights at google.com> wrote:

What is the relevant EcmaScript standards text that would delegate to this? Even if Unicode implies an undefined case, EcmaScript should not. If EcmaScript behavior for such cases is undefined, we should define it.

Looking at the spec, it seems undefined. 0xFFFE isn't defined as a whitespace character, which is probably why we see different behaviour in different browsers.

# Domenic Denicola (7 years ago)

We should probably move this to a GitHub issue then, so ES can have clarity on it.

If it helps, I am pretty sure (although I should double-check) that HTML treats such noncharacters as conformance errors (i.e. external tools like validators will warn you about them), but does not let them impact the processing model; they are passed through as-is.

# Allen Wirfs-Brock (7 years ago)

clause 10.1:

ECMAScript code is expressed using Unicode. ECMAScript source text is a sequence of code points. All Unicode code point values from U+0000 to U+10FFFF, including surrogate code points, may occur in source text where permitted by the ECMAScript grammars.

tc39.github.io/ecma262/#sec-white-space defines exactly which specific code points are treated as WhiteSpace by the ECMAScript grammar. It does not include unassigned code points in the set of valid WhiteSpace.
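
As an illustration, the WhiteSpace production can be approximated as a membership test. This is a sketch under assumptions: the explicit code points are transcribed from the table in that section, the "any other Space_Separator" case is delegated to the engine's own Unicode data through the `\p{Zs}` property escape (which requires an engine with Unicode property escape support), and the names `EXPLICIT_WHITE_SPACE`, `SPACE_SEPARATOR`, and `isEcmaScriptWhiteSpace` are hypothetical.

```javascript
// Code points listed explicitly in the WhiteSpace table:
// TAB, VT, FF, SP, NBSP, and ZWNBSP (U+FEFF).
const EXPLICIT_WHITE_SPACE = new Set([0x0009, 0x000B, 0x000C, 0x0020, 0x00A0, 0xFEFF]);

// <USP>: any other code point in general category Space_Separator (Zs).
const SPACE_SEPARATOR = /^\p{Zs}$/u;

// Hypothetical helper: true if `cp` would match the WhiteSpace production.
function isEcmaScriptWhiteSpace(cp) {
  return EXPLICIT_WHITE_SPACE.has(cp) || SPACE_SEPARATOR.test(String.fromCodePoint(cp));
}

console.log(isEcmaScriptWhiteSpace(0xFEFF)); // true: listed explicitly as ZWNBSP
console.log(isEcmaScriptWhiteSpace(0xFFFE)); // false: neither listed nor in Zs
```

Line terminators such as U+000A and U+2028 are covered by a separate production and are deliberately excluded here.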

# Mark S. Miller (7 years ago)

Allen, I'm very glad to hear that it is unambiguous after all.

Gareth, could you file bugs against the non-conforming browsers? Thanks for finding this!

# Gareth Heyes (7 years ago)

On 25 May 2017 at 17:02, Mark S. Miller <erights at google.com> wrote:

Allen, I'm very glad to hear that it is unambiguous after all.

Gareth, could you file bugs against the non-conforming browsers? Thanks for finding this!

Yeah sure I'll file the bugs now.