Am I paranoid enough?

# David-Sarah Hopwood (16 years ago)

Suppose that S is a Unicode string in which each character matches
ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
not containing ("&" followed by a character not matching AmpFollower).
S encodes a syntactically correct ES3 or ES3.1 source text chosen by
an attacker.

  ValidChar :: one of
    '\u0009' '\u000A' '\u000D' // TAB, LF, CR
    [\u0020-\u007E]
    [\u00A0-\u00AC]
    [\u00AE-\u05FF]
    [\u0604-\u06DC]
    [\u06DE-\u070E]
    [\u0710-\u17B3]
    [\u17B6-\u200A]
    [\u2010-\u2027]
    [\u202F-\u205F]
    [\u2070-\uD7FF]
    [\uE000-\uFDCF]
    [\uFDF0-\uFEFE]
    [\uFF00-\uFFEF]

  AmpFollower :: one of
    '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
    '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
    // single quote, backslash, space, TAB, LF, CR

(ValidChar excludes format control characters, and some other
characters known to be mishandled by browsers. AmpFollower is
intended to exclude characters that can start an entity reference.)
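
As a rough illustration, the restrictions on S as literally worded above could be checked like this. It is only a sketch: the names `VALID_CHARS`, `AMP_FOLLOWER`, `FORBIDDEN` and `violations` are made up for the example, and a trailing '&' (which has no following character) is assumed to pass.

```python
import re

# Character class transcribed from the ValidChar production above.
VALID_CHARS = re.compile(
    r"[\u0009\u000A\u000D"
    r"\u0020-\u007E"
    r"\u00A0-\u00AC"
    r"\u00AE-\u05FF"
    r"\u0604-\u06DC"
    r"\u06DE-\u070E"
    r"\u0710-\u17B3"
    r"\u17B6-\u200A"
    r"\u2010-\u2027"
    r"\u202F-\u205F"
    r"\u2070-\uD7FF"
    r"\uE000-\uFDCF"
    r"\uFDF0-\uFEFE"
    r"\uFF00-\uFFEF]*"
)

# Characters from the AmpFollower production above.
AMP_FOLLOWER = set('=(+-!~"/0123456789' + "'\\ \t\n\r")

# Subsequences that S must not contain.
FORBIDDEN = ("<!", "</", "]]>")


def violations(s):
    """Return a list of reasons why s breaks the rules as literally stated."""
    problems = []
    if not VALID_CHARS.fullmatch(s):
        problems.append("character outside ValidChar")
    for seq in FORBIDDEN:
        if seq in s:
            problems.append("contains " + seq)
    for i, ch in enumerate(s):
        # Assumption: a '&' at the very end has no follower and is not flagged.
        if ch == "&" and i + 1 < len(s) and s[i + 1] not in AMP_FOLLOWER:
            problems.append("'&' followed by non-AmpFollower at index %d" % i)
            break
    return problems
```

Read this way, `violations("a = b & 1;")` is empty, while `violations("a && b")` is not, because the second '&' is not itself an AmpFollower.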

S is inserted between "<script>" and "</script>" in a place where a
<script> tag is allowed in an otherwise valid HTML document, or
between "<script><![CDATA[" and "]]></script>" in a place where a
<script> tag is allowed in an otherwise valid XHTML document.
The HTML or XHTML document starts with a correct <!DOCTYPE or
<?xml declaration respectively, and is encoded as well-formed
UTF-8.
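
To make the two embeddings concrete, here is a minimal sketch of the wrapping step. `embed_html` and `embed_xhtml` are hypothetical helper names, and the guard only repeats the three forbidden subsequences; it is not a substitute for the full set of restrictions on S.

```python
# Subsequences that S must not contain, per the restrictions above.
FORBIDDEN = ("<!", "</", "]]>")


def _guard(s):
    # Refuse input that could close the script element or CDATA section early.
    for seq in FORBIDDEN:
        if seq in s:
            raise ValueError("S contains forbidden subsequence " + repr(seq))


def embed_html(s):
    """Wrap S for insertion into an HTML document."""
    _guard(s)
    return "<script>" + s + "</script>"


def embed_xhtml(s):
    """Wrap S for insertion into an XHTML document, using a CDATA section."""
    _guard(s)
    return "<script><![CDATA[" + s + "]]></script>"
```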


Are these restrictions sufficient to ensure that the embedded
script is interpreted as it would have been if referenced from
an external file, foiling any attempts of browsers to collude
with the attacker in misparsing it?

Are some of the restrictions unnecessary?

-- 
David-Sarah Hopwood ⚥

# David-Sarah Hopwood (16 years ago)

No, I'm not paranoid enough yet. It's not sufficient only to say
that the HTML is encoded as UTF-8 (see below).

David-Sarah Hopwood wrote:
[...]
> The HTML or XHTML document starts with a correct <!DOCTYPE or
> <?xml declaration respectively,

I meant, the document starts with <!DOCTYPE HTML> in the case
of HTML, or <?xml version="1.0"?><!DOCTYPE HTML> in the case of
XHTML.

(This will also put the parser into sane^H^H^H^Hstandards mode.)

> and is encoded as well-formed UTF-8.

The document must also start with a UTF-8 BOM, *and* must not
contain a META directive that changes the charset, *and* in the
case of HTML, must either be retrieved from a local file or over
HTTP with the header "Content-Type: text/html; charset=UTF-8".
This is because the method of determining the encoding is chosen
based on the phase of the moon.
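
A sketch of how the HTML case might be serialized under these conditions follows. `serialize_html` is a hypothetical helper, the META check is a crude pattern match rather than a real HTML parse, and actually sending the returned header is left to whatever server is in use.

```python
import codecs
import re

# Header value quoted in the message above.
CONTENT_TYPE = "text/html; charset=UTF-8"

# Crude heuristic (assumption): any <meta ...> tag that mentions "charset".
META_CHARSET = re.compile(r"<meta\b[^>]*charset", re.IGNORECASE)


def serialize_html(document):
    """Return (headers, body) for an HTML document meeting the conditions."""
    if not document.startswith("<!DOCTYPE HTML>"):
        raise ValueError("document must start with <!DOCTYPE HTML>")
    if META_CHARSET.search(document):
        raise ValueError("document must not contain a META charset directive")
    # str.encode is strict by default, so lone surrogates raise an error
    # and the output is always well-formed UTF-8, here prefixed with a BOM.
    body = codecs.BOM_UTF8 + document.encode("utf-8")
    headers = {"Content-Type": CONTENT_TYPE}
    return headers, body
```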

Any other problems?

-- 
David-Sarah Hopwood ⚥

# Waldemar Horwat (16 years ago)

What are you trying to do?  Exclude all scripts that use the && operator?

    Waldemar


David-Sarah Hopwood wrote:
> Suppose that S is a Unicode string in which each character matches
> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
> not containing ("&" followed by a character not matching AmpFollower).
> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
> an attacker.
> 
>   ValidChar :: one of
>     '\u0009' '\u000A' '\u000D' // TAB, LF, CR
>     [\u0020-\u007E]
>     [\u00A0-\u00AC]
>     [\u00AE-\u05FF]
>     [\u0604-\u06DC]
>     [\u06DE-\u070E]
>     [\u0710-\u17B3]
>     [\u17B6-\u200A]
>     [\u2010-\u2027]
>     [\u202F-\u205F]
>     [\u2070-\uD7FF]
>     [\uE000-\uFDCF]
>     [\uFDF0-\uFEFE]
>     [\uFF00-\uFFEF]
> 
>   AmpFollower :: one of
>     '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
>     '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
>     // single quote, backslash, space, TAB, LF, CR
> 
> (ValidChar excludes format control characters, and some other
> characters known to be mishandled by browsers. AmpFollower is
> intended to exclude characters that can start an entity reference.)
> 
> S is inserted between "<script>" and "</script>" in a place where a
> <script> tag is allowed in an otherwise valid HTML document, or
> between "<script><![CDATA[" and "]]></script>" in a place where a
> <script> tag is allowed in an otherwise valid XHTML document.
> The HTML or XHTML document starts with a correct <!DOCTYPE or
> <?xml declaration respectively, and is encoded as well-formed
> UTF-8.
> 
> 
> Are these restrictions sufficient to ensure that the embedded
> script is interpreted as it would have been if referenced from
> an external file, foiling any attempts of browsers to collude
> with the attacker in misparsing it?
> 
> Are some of the restrictions unnecessary?
>

# David-Sarah Hopwood (16 years ago)

Waldemar Horwat wrote:
> What are you trying to do?  Exclude all scripts that use the && operator?

Oops, I failed to describe the intended restrictions correctly.
Any sequence of consecutive '&' would be allowed if followed by AmpFollower.
But at least this tells me you were paying attention ;-)
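
The corrected rule could be sketched as below. `ampersands_ok` is a hypothetical name, and rejecting a run of '&' at the very end of S is an assumption (a trailing run is not followed by any AmpFollower).

```python
import re

# Characters from the AmpFollower production.
AMP_FOLLOWER = set('=(+-!~"/0123456789' + "'\\ \t\n\r")


def ampersands_ok(s):
    """True if every maximal run of '&' in s is followed by an AmpFollower."""
    for match in re.finditer(r"&+", s):
        end = match.end()
        # Assumption: a run at the end of s has no follower and is rejected.
        if end == len(s) or s[end] not in AMP_FOLLOWER:
            return False
    return True
```

With this reading, `ampersands_ok("a && b")` is true, while `ampersands_ok("a&amp;b")` is false because the '&' is followed by a letter.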

> David-Sarah Hopwood wrote:
>> Suppose that S is a Unicode string in which each character matches
>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
>> not containing ("&" followed by a character not matching AmpFollower).

-- 
David-Sarah Hopwood