BOM inside tokens

# Igor Bukanov (17 years ago)

The currently proposed rule for byte-order-mark (BOM) characters in ES4 sources is to replace them with whitespace outside of tokens. But what exactly are the tokens in a case like -<bom>-?

AFAICS it would be treated as - -, turning a case like -<bom>-a; into - -a; rather than the --a; that current ES3 implementations would see.

, Igor
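
The difference Igor describes can be illustrated with a toy scanner. This is only a sketch under stated assumptions: the function names (`tokenize`, `scanStripping`, `scanAsWhitespace`) are hypothetical, and the scanner recognizes just `--`, `-`, `;` and identifiers, not the full ES3 or ES4 lexical grammar.

```javascript
// Toy scanner for illustration only; not taken from any real engine.
const BOM = "\uFEFF";

function tokenize(src) {
  const tokens = [];
  let i = 0;
  while (i < src.length) {
    const ch = src[i];
    // Whitespace (here: space or BOM) separates tokens and is discarded.
    if (ch === " " || ch === BOM) { i++; continue; }
    // Longest match first, so "--" wins over "-" when both characters are adjacent.
    if (src.startsWith("--", i)) { tokens.push("--"); i += 2; continue; }
    if (ch === "-" || ch === ";") { tokens.push(ch); i++; continue; }
    const m = /^[A-Za-z_$][\w$]*/.exec(src.slice(i));
    if (!m) throw new SyntaxError("unexpected character at index " + i);
    tokens.push(m[0]);
    i += m[0].length;
  }
  return tokens;
}

// ES3-style behavior as described in the thread: drop the BOM before scanning.
function scanStripping(src) {
  return tokenize(src.split(BOM).join(""));
}

// Proposed behavior: the BOM is whitespace, so it separates the two minus signs.
function scanAsWhitespace(src) {
  return tokenize(src);
}

const source = "-" + BOM + "-a;";
console.log(scanStripping(source));    // tokens: --  a  ;
console.log(scanAsWhitespace(source)); // tokens: -  -  a  ;
```

Under stripping the input is a pre-decrement of `a`; under the whitespace rule it is a double negation that leaves `a` unchanged.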

# Ash Berlin (17 years ago)

On 15 Jul 2008, at 18:22, Igor Bukanov wrote:

The currently proposed rule for byte-order-mark (BOM) characters in ES4 sources is to replace them with whitespace outside of tokens. But what exactly are the tokens in a case like -<bom>-?

AFAICS it would be treated as - -, turning a case like -<bom>-a; into - -a; rather than the --a; that current ES3 implementations would see.

, Igor

Hmmm. According to the UnicodeCheck app on my Mac (and thus to one version or another of the Unicode spec), a BOM (U+FEFF) is 'ZERO WIDTH NO-BREAK SPACE'

NamesList:
= BYTE ORDER MARK (BOM), ZWNBSP
• may be used to detect byte order by contrast with the noncharacter code point FFFE
• use as an indication of non-breaking is deprecated; see 2060 instead
→ (zero width space - 200B)
→ (word joiner - 2060)
→ (<not a character> - FFFE)
• Designated in Unicode 1.1

I'd say that a BOM should be treated just like any ordinary whitespace char, namely that it should be invalid in spaces, and beyond that, why is any conversion needed, since it's a valid Unicode character...

# Ash Berlin (17 years ago)

On 15 Jul 2008, at 18:39, Ash Berlin wrote:

On 15 Jul 2008, at 18:22, Igor Bukanov wrote:

The currently proposed rule for byte-order-mark (BOM) characters in ES4 sources is to replace them with whitespace outside of tokens. But what exactly are the tokens in a case like -<bom>-?

AFAICS it would be treated as - -, turning a case like -<bom>-a; into - -a; rather than the --a; that current ES3 implementations would see.

, Igor

Hmmm. According to the UnicodeCheck app on my Mac (and thus to one version or another of the Unicode spec), a BOM (U+FEFF) is 'ZERO WIDTH NO-BREAK SPACE'

NamesList:
= BYTE ORDER MARK (BOM), ZWNBSP
• may be used to detect byte order by contrast with the noncharacter code point FFFE
• use as an indication of non-breaking is deprecated; see 2060 instead
→ (zero width space - 200B)
→ (word joiner - 2060)
→ (<not a character> - FFFE)
• Designated in Unicode 1.1

I'd say that a BOM should be treated just like any ordinary whitespace char, namely that it should be invalid in spaces, and beyond that, why is any conversion needed, since it's a valid Unicode character...

Correction: that should read "invalid in identifiers", not "invalid in spaces".

# Igor Bukanov (17 years ago)

2008/7/15 Ash Berlin <ash_es4 at firemirror.com>:

I'd say that a BOM should be treated just like any ordinary whitespace char, namely that it should be invalid in spaces, and beyond that, why is any conversion needed, since it's a valid Unicode character...

The problem comes from current ES3 implementations, which strip BOMs from the source, and from web pages that place BOMs in arbitrary places in JS sources. So the question is: should ES4 be at least partially compatible with the existing code?

igor

# Mark Miller (17 years ago)

On Tue, Jul 15, 2008 at 11:00 AM, Igor Bukanov <igor at mir2.org> wrote:

2008/7/15 Ash Berlin <ash_es4 at firemirror.com>:

I'd say that a BOM should be treated just like any ordinary whitespace char, namely that it should be invalid in spaces, and beyond that, why is any conversion needed, since it's a valid Unicode character...

The problem comes from current ES3 implementations, which strip BOMs from the source, and from web pages that place BOMs in arbitrary places in JS sources. So the question is: should ES4 be at least partially compatible with the existing code?

As we've found with the ES3-specified stripping of Cf characters, the main effect of such transparent stripping of characters is to help attackers slip XSS attacks past defensive filters. ES3.1 agrees with ES4 that BOMs and Cfs should be treated as whitespace rather than stripped.
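
A minimal sketch of the filter-evasion problem Mark describes, assuming a hypothetical defensive filter (`filterAllows`) that simply rejects source mentioning `eval`. The filter, the payload, and the helper names are illustrative assumptions, not code from any real sanitizer or engine.

```javascript
const ZWJ = "\u200D"; // ZERO WIDTH JOINER, Unicode general category Cf

// Hypothetical naive filter: allows source only if the word `eval` is absent.
function filterAllows(src) {
  return !/\beval\b/.test(src);
}

// ES3-style preprocessing: remove every format-control (Cf) character.
function stripCf(src) {
  return src.replace(/\p{Cf}/gu, "");
}

// Whitespace rule: every Cf character behaves like an ordinary space.
function cfToSpace(src) {
  return src.replace(/\p{Cf}/gu, " ");
}

const payload = "ev" + ZWJ + "al(userInput)";

console.log(filterAllows(payload)); // true: the filter sees no `eval`
console.log(stripCf(payload));      // eval(userInput)
console.log(cfToSpace(payload));    // ev al(userInput)
```

With stripping, the text the filter approved turns into a call to `eval` once the engine processes it. With the whitespace rule, the same text scans as two unrelated identifiers, so the filter and the engine agree on what the source contains.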

# Igor Bukanov (17 years ago)

2008/7/15 Mark Miller <erights at gmail.com>:

As we've found with the ES3-specified stripping of Cf characters, the main effect of such transparent stripping of characters is to help attackers slip XSS attacks past defensive filters. ES3.1 agrees with ES4 that BOMs and Cfs should be treated as whitespace rather than stripped.

But this means that it will silently change the semantics of +<bom-or-cf>+ from ++ into + +. From the security point of view it would be better to treat such cases as syntax errors. A possible rule could be to allow BOM/Cf only in string/regexp literals, or where such a character follows or precedes a non-zero-width whitespace character.
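
One way to read the proposed rule is as a pre-scan check. The sketch below is an assumption-laden illustration: `firstIllegalFormatChar` is a hypothetical name, only quoted string literals are tracked (regexp literals and comments are ignored for brevity), and the set of "non-zero-width" whitespace is a small hand-picked subset.

```javascript
const FORMAT_CHAR = /\p{Cf}/u;                 // BOM and other Cf characters
const VISIBLE_SPACE = /[\t\v\f \u00A0\r\n]/;   // assumed subset of non-zero-width whitespace

// Returns the index of the first BOM/Cf character that the rule would reject,
// or -1 if every such character is inside a string literal or next to visible whitespace.
function firstIllegalFormatChar(src) {
  let quote = null; // the quote character of the string literal we are inside, if any
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (quote !== null) {
      if (ch === "\\") i++;                 // skip the escaped character
      else if (ch === quote) quote = null;  // closing quote
      continue;
    }
    if (ch === '"' || ch === "'") { quote = ch; continue; }
    if (FORMAT_CHAR.test(ch)) {
      const prev = src[i - 1] ?? "";
      const next = src[i + 1] ?? "";
      if (!VISIBLE_SPACE.test(prev) && !VISIBLE_SPACE.test(next)) return i;
    }
  }
  return -1;
}

console.log(firstIllegalFormatChar("+\uFEFF+a"));    // 1: BOM squeezed between two operators
console.log(firstIllegalFormatChar("+ \uFEFF+a"));   // -1: BOM follows a real space
console.log(firstIllegalFormatChar('"x\uFEFFy"'));   // -1: BOM inside a string literal
```

A scanner applying this rule would raise a syntax error at the returned index instead of silently choosing between `++` and `+ +`.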

# Mark S. Miller (17 years ago)

On Tue, Jul 15, 2008 at 11:27 AM, Igor Bukanov <igor at mir2.org> wrote:

2008/7/15 Mark Miller <erights at gmail.com>:

As we've found with the ES3-specified stripping of Cf characters, the main effect of such transparent stripping of characters is to help attackers slip XSS attacks past defensive filters. ES3.1 agrees with ES4 that BOMs and Cfs should be treated as whitespace rather than stripped.

Speaking only for myself, yes, I'd be even happier with the syntax error. I have proposed such harsh treatment before but various objections were raised. In any case, again speaking only for myself, I'm happy with any solution that repairs the security holes created by stripping and avoids introducing new holes.

# Igor Bukanov (17 years ago)

It seems the current IE7/IE8 behavior is to allow Cf only in string and regexp literals and to allow BOM only in string/regexp literals or at the beginning of the source; see bugzilla.mozilla.org/show_bug.cgi?id=430740#c32 . This is very reasonable.

Igor

# Waldemar Horwat (17 years ago)

Igor Bukanov wrote:

It seems the current IE7/IE8 behavior is to allow Cf only in string and regexp literals and to allow BOM only in string/regexp literals or at the beginning of the source,

Precisely what does "in string and regexp literals" mean? The exact interpretation of this phrase is the core source of the aforementioned security holes.

Folks have exploited putting special characters right after a backslash to break out of whitelisted literals and execute arbitrary code from JSON; a few months ago I demonstrated such an attack. Regular expressions offer even more opportunities for this kind of mischief.
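
The general shape of a backslash-based breakout can be sketched as follows. This is not Waldemar's actual attack, which the thread does not reproduce; it is a hypothetical illustration of how a validator that keeps Cf characters and an engine that strips them can disagree about where a string literal ends. The helper `literalEnd` and the `attack()` call are made up for the example.

```javascript
const ZWJ = "\u200D"; // a Cf character placed right after a backslash

// Returns the index just past the closing quote of the double-quoted
// literal that starts at index 0. A backslash escapes the next character.
function literalEnd(src) {
  if (src[0] !== '"') throw new SyntaxError("expected opening quote");
  for (let i = 1; i < src.length; i++) {
    if (src[i] === "\\") { i++; continue; } // skip the escaped character
    if (src[i] === '"') return i + 1;
  }
  throw new SyntaxError("unterminated string literal");
}

// Characters: "  \  ZWJ  \  "  ;  attack();  //  "
const source = '"\\' + ZWJ + '\\"; attack(); //"';

// Validator view (no stripping): backslash+ZWJ is one escape, backslash+quote
// is another, so the literal runs to the very last quote.
console.log(literalEnd(source) === source.length); // true

// Engine view (Cf stripped first): the two backslashes now pair up, the quote
// closes the literal, and the rest is executable code.
const stripped = source.replace(/\p{Cf}/gu, "");
console.log(stripped.slice(literalEnd(stripped))); // ; attack(); //"
```

The validator concludes the whole input is one harmless literal, while the stripping engine sees a short literal followed by a statement. This is why the exact meaning of "in string and regexp literals" matters.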

Waldemar

# Brendan Eich (17 years ago)

Latest news in the bug:

bugzilla.mozilla.org/show_bug.cgi?id=430740#c42

Igor wrote:

"So MSIE simply treats BOM as whitespace for the purpose of parsing. This suggests the following fix for the bug in SM (SpiderMonkey): treat BOM as one of the Unicode whitespace characters in the scanner, avoiding any character skipping or patching."

So no security issues with stripping. Another triumph of de-facto standard over de-jure.

Pratap got this into ES3.1 drafts already.
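
The scanner rule quoted above amounts to adding U+FEFF to the whitespace predicate. The sketch below shows a hypothetical `isWhiteSpace` helper covering only a few code points (a real scanner would also include the Unicode space-separator category), followed by a check against the host engine, assuming it implements ECMAScript 5 or later, where U+FEFF is part of the WhiteSpace production.

```javascript
const BOM = "\uFEFF";

// Hypothetical scanner predicate; deliberately incomplete.
function isWhiteSpace(ch) {
  return ch === "\t" || ch === "\v" || ch === "\f" ||
         ch === " " || ch === "\u00A0" || ch === BOM;
}

console.log(isWhiteSpace(BOM)); // true

// Ask the host engine how it scans -<bom>-a. If the BOM is whitespace this is
// the double negation - -a; if it were stripped it would be the decrement --a.
const viaEngine = new Function("a", "return -" + BOM + "-a;");
console.log(viaEngine(1)); // 1 when the BOM separates the two minus signs
```

A stripping implementation would instead evaluate `--a` and return 0 for the same input, which is the ambiguity the thread started with.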