BOM inside tokens

# Igor Bukanov (17 years ago)

The currently proposed rule for byte-order-mark (BOM) characters in
ES4 sources is to replace them with whitespace outside of tokens. But
what exactly are the tokens in a case like -<bom>-?

AFAICS it would be treated as - -, turning cases like:
  -<bom>-a;
into
  - -a;
versus
  --a;
which is what current ES3 implementations produce.

Regards, Igor
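
The difference described above can be reproduced with a toy tokenizer. This is only a sketch: `stripBom`, `bomToSpace` and `tokenize` are hypothetical helpers, and the regex recognises just the few tokens in the example, not the full ES3 or ES4 lexical grammar.

```javascript
const BOM = "\uFEFF";

// ES3-era behaviour as described in this thread: drop the BOM entirely.
function stripBom(src) {
  return src.split(BOM).join("");
}

// Proposed ES4 rule as described in this thread: the BOM acts as whitespace.
function bomToSpace(src) {
  return src.split(BOM).join(" ");
}

// Toy tokenizer (assumption): longest match on "--", then "-",
// simple identifiers, and ";". Whitespace is skipped.
function tokenize(src) {
  return src.match(/--|-|[A-Za-z_$][\w$]*|;/g) || [];
}

const source = "-" + BOM + "-a;";
console.log(tokenize(stripBom(source)));   // [ '--', 'a', ';' ]
console.log(tokenize(bomToSpace(source))); // [ '-', '-', 'a', ';' ]
```

The same input therefore reads as a pre-decrement under stripping and as a double negation under the whitespace rule.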

# Ash Berlin (17 years ago)

On 15 Jul 2008, at 18:22, Igor Bukanov wrote:

> The currently proposed rule for byte-order-mark (BOM) characters in
> ES4 sources is to replace them with whitespace outside of tokens. But
> what exactly are the tokens in a case like -<bom>-?
>
> AFAICS it would be treated as - -, turning cases like:
>  -<bom>-a;
> into
>  - -a;
> versus
>  --a;
> which is what current ES3 implementations produce.
>
> Regards, Igor
> _

Hmmm. According to the UnicodeCheck app on my Mac (and thus to one version
or other of the Unicode spec), a BOM (U+FEFF) is 'ZERO WIDTH NO-BREAK
SPACE'

•	NamesList:
		= BYTE ORDER MARK (BOM), ZWNBSP
		• may be used to detect byte order by contrast with the noncharacter code point FFFE
		• use as an indication of non-breaking is deprecated; see 2060 instead
		→ (zero width space - 200B)
		→ (word joiner - 2060)
		→ (<not a character> - FFFE)
•	Designated in Unicode 1.1

I'd say that a BOM should be treated just like any ordinary whitespace
char - namely that it should be invalid in spaces, and beyond that why is
any conversion needed, since it's a valid Unicode character...

-ash
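
The NamesList remark about detecting byte order "by contrast with the noncharacter code point FFFE" can be sketched as below. `detectUtf16ByteOrder` is a hypothetical helper, and the sketch assumes the input is already known to be UTF-16; it does not try to tell UTF-16 apart from other encodings.

```javascript
// Inspect the first two bytes of a buffer assumed to hold UTF-16 text.
// U+FEFF serialises as FE FF in big-endian order and FF FE in little-endian
// order; read the wrong way round it would be U+FFFE, a noncharacter.
function detectUtf16ByteOrder(bytes) {
  if (bytes.length >= 2) {
    if (bytes[0] === 0xfe && bytes[1] === 0xff) return "big-endian";
    if (bytes[0] === 0xff && bytes[1] === 0xfe) return "little-endian";
  }
  return "unknown"; // no BOM present
}

console.log(detectUtf16ByteOrder(Uint8Array.of(0xfe, 0xff, 0x00, 0x2d))); // big-endian
console.log(detectUtf16ByteOrder(Uint8Array.of(0xff, 0xfe, 0x2d, 0x00))); // little-endian
console.log(detectUtf16ByteOrder(Uint8Array.of(0x00, 0x2d)));             // unknown
```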

# Ash Berlin (17 years ago)

On 15 Jul 2008, at 18:39, Ash Berlin wrote:

>
> On 15 Jul 2008, at 18:22, Igor Bukanov wrote:
>
>> The currently proposed rule for byte-order-mark (BOM) characters in
>> ES4 sources is to replace them with whitespace outside of tokens. But
>> what exactly are the tokens in a case like -<bom>-?
>>
>> AFAICS it would be treated as - -, turning cases like:
>> -<bom>-a;
>> into
>> - -a;
>> versus
>> --a;
>> which is what current ES3 implementations produce.
>>
>> Regards, Igor
>> _
>
> Hmmm. According to the UnicodeCheck app on my Mac (and thus to one version
> or other of the Unicode spec), a BOM (U+FEFF) is 'ZERO WIDTH NO-BREAK
> SPACE'
>
> •	NamesList:
> 		= BYTE ORDER MARK (BOM), ZWNBSP
> 		• may be used to detect byte order by contrast with the
> noncharacter code point FFFE
> 		• use as an indication of non-breaking is deprecated; see 2060
> instead
> 		→ (zero width space - 200B)
> 		→ (word joiner - 2060)
> 		→ (<not a character> - FFFE)
> •	Designated in Unicode 1.1
>
> I'd say that a BOM should be treated just like any ordinary whitespace
> char - namely that it should be invalid in spaces, and beyond that why is
> any conversion needed, since it's a valid Unicode character...
>

Invalid in *identifiers*

# Igor Bukanov (17 years ago)

2008/7/15 Ash Berlin <ash_es4 at firemirror.com>:
>
> I'd say that a BOM should be treated just like any ordinary whitespace
> char - namely that it should be invalid in spaces, and beyond that why is
> any conversion needed, since it's a valid Unicode character...

The problem comes from current ES3 implementations, which strip BOMs
from sources, and from web pages that place BOMs in arbitrary places in
JS sources. So the question is: should ES4 be at least partially
compatible with the current code?

igor

# Mark Miller (17 years ago)

On Tue, Jul 15, 2008 at 11:00 AM, Igor Bukanov <igor at mir2.org> wrote:

> 2008/7/15 Ash Berlin <ash_es4 at firemirror.com>:
> >
> > I'd say that a BOM should be treated just like any ordinary whitespace
> > char - namely that it should be invalid in spaces, and beyond that why is
> > any conversion needed, since it's a valid Unicode character...
>
> The problem comes from the current ES3 implementations that strip BOM
> from the sources and web pages placing BOM in arbitrary places in JS
> sources. So the question is should ES4 at least partially be
> compatible with the current code?
>

As we've found with the ES3-specified stripping of Cf characters, the main
effect of such transparent stripping of characters is to help attackers slip
XSS attacks past defensive filters. ES3.1 agrees with ES4 that BOMs and Cfs
should be treated as whitespace rather than stripped.

-- 
Text by me above is hereby placed in the public domain

   Cheers,
   --MarkM
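
The filter-evasion concern can be illustrated with a small sketch. Everything here is hypothetical: `naiveFilterAllows` stands in for a defensive filter that bans the word `eval`, and `stripFormatChars` imitates an engine that silently drops format characters, with the Unicode property escape `\p{Cf}` assumed as an approximation of the characters involved (U+FEFF is in that category).

```javascript
// Imitate transparent stripping of format (Cf) characters before lexing.
function stripFormatChars(src) {
  return src.replace(/\p{Cf}/gu, "");
}

// Hypothetical defensive filter: allow the source only if it never says "eval".
function naiveFilterAllows(src) {
  return !src.includes("eval");
}

// The BOM splits the identifier, so the filter does not see "eval".
const evalPayload = "ev\uFEFFal('1+1')";

console.log(naiveFilterAllows(evalPayload));                   // true
console.log(stripFormatChars(evalPayload));                    // eval('1+1')
console.log(naiveFilterAllows(stripFormatChars(evalPayload))); // false
```

The filter and the engine disagree about what the text says, which is the gap an attacker needs. Treating the character as whitespace instead would leave two separate identifiers, `ev` and `al`.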

# Igor Bukanov (17 years ago)

2008/7/15 Mark Miller <erights at gmail.com>:
> As we've found with the ES3-specified stripping of Cf characters, the main
> effect of such transparent stripping of characters is to help attackers slip
> XSS attacks past defensive filters. ES3.1 agrees with ES4 that BOMs and Cfs
> should be treated as whitespace rather than stripped.

But this means that it will silently change the semantics of
+<bom-or-cf>+ from ++ into + +. From the security point of view it
would be better to treat such cases as syntax errors. A possible rule
could be to allow BOM/Cf only in string/regexp literals or if such a
character follows/precedes a non-zero-width whitespace character.
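
A rough sketch of how such a rule might be checked follows. It is a simplification under several assumptions: `findIllegalFormatChar` is a hypothetical name, only single- and double-quoted string literals are recognised (regexp literals and comments are not handled), `\p{Cf}` is taken as the set of BOM/Cf characters, and the source is scanned one UTF-16 code unit at a time.

```javascript
const FORMAT_CHAR = /\p{Cf}/u;       // Unicode "Format" category, includes U+FEFF
const PLAIN_SPACE = /[ \t\n\r\v\f]/; // simplified set of non-zero-width whitespace

// Returns the index of the first format character that is neither inside a
// string literal nor adjacent to ordinary whitespace, or -1 if there is none.
function findIllegalFormatChar(src) {
  let quote = null; // current string delimiter, or null when outside a string
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (quote !== null) {
      if (ch === "\\") i++;                 // skip the escaped character
      else if (ch === quote) quote = null;  // end of the string literal
      continue;
    }
    if (ch === '"' || ch === "'") {
      quote = ch;
      continue;
    }
    if (FORMAT_CHAR.test(ch)) {
      const prev = src[i - 1] || "";
      const next = src[i + 1] || "";
      if (!PLAIN_SPACE.test(prev) && !PLAIN_SPACE.test(next)) return i;
    }
  }
  return -1;
}

console.log(findIllegalFormatChar("+\uFEFF+a"));   // 1  (would be a syntax error)
console.log(findIllegalFormatChar("+ \uFEFF+a"));  // -1 (next to a real space)
console.log(findIllegalFormatChar('"a\uFEFFb"'));  // -1 (inside a string literal)
```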

# Mark S. Miller (17 years ago)

On Tue, Jul 15, 2008 at 11:27 AM, Igor Bukanov <igor at mir2.org> wrote:

> 2008/7/15 Mark Miller <erights at gmail.com>:
> > As we've found with the ES3-specified stripping of Cf characters, the main
> > effect of such transparent stripping of characters is to help attackers slip
> > XSS attacks past defensive filters. ES3.1 agrees with ES4 that BOMs and Cfs
> > should be treated as whitespace rather than stripped.
>

Speaking only for myself, yes, I'd be even happier with the syntax error. I
have proposed such harsh treatment before but various objections were
raised. In any case, again speaking only for myself, I'm happy with any
solution that repairs the security holes created by stripping and avoids
introducing new holes.

-- 
Cheers,
--MarkM

# Igor Bukanov (17 years ago)

It seems the current IE7/IE8 behavior is to allow Cf only in string
and regexp literals and to allow BOM only in string/regexp literals or
at the beginning of the source, see
https://bugzilla.mozilla.org/show_bug.cgi?id=430740#c32 . This is very
reasonable.

Igor

# Waldemar Horwat (17 years ago)

Igor Bukanov wrote:
> It seems the current IE7/IE8 behavior is to allow Cf only in string
> and regexp literals and to allow BOM only in string/regexps or at the
> beginning of the source,

Precisely what does "in string and regexp literals" mean?  The exact interpretation of this phrase is the core source of the aforementioned security holes.

Folks have exploited putting special characters right after a backslash to break out of whitelisted literals and execute arbitrary code from JSON; a few months ago I demonstrated such an attack.  Regular expressions offer even more opportunities for this kind of mischief.

    Waldemar
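
The backslash problem can be illustrated with a toy example. This is not the attack referred to above, only a sketch of the same kind of mismatch: `isSingleStringLiteral` is a hypothetical whitelist check, `stripCf` imitates an engine that drops format characters before lexing, and `\p{Cf}` is assumed as the set of stripped characters.

```javascript
// Imitate transparent stripping of format (Cf) characters before lexing.
const stripCf = (src) => src.replace(/\p{Cf}/gu, "");

// Hypothetical whitelist: accept the text only if it is exactly one
// double-quoted string literal, where backslash + any character is an escape.
function isSingleStringLiteral(src) {
  return /^"(?:[^"\\]|\\[\s\S])*"$/.test(src);
}

// Characters: "  \  U+FEFF  \  "  ,alert(1)//  "
// Before stripping, the first backslash escapes the BOM and the second one
// escapes the quote, so the whole text looks like a single harmless string.
const literalPayload = '"\\\uFEFF\\",alert(1)//"';

console.log(isSingleStringLiteral(literalPayload));          // true
console.log(stripCf(literalPayload));                        // "\\",alert(1)//"
console.log(isSingleStringLiteral(stripCf(literalPayload))); // false
```

After stripping, the two backslashes pair up as an escaped backslash, the string ends early, and `alert(1)` sits outside any literal, followed by a line comment that hides the trailing quote.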

# Brendan Eich (17 years ago)