[caplet] Am I paranoid enough?

# Mike Samuel (17 years ago)

2009/2/16 David-Sarah Hopwood <david.hopwood at industrial-designers.co.uk>
>
> Suppose that S is a Unicode string in which each character matches
> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
> not containing ("&" followed by a character not matching AmpFollower).
> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
> an attacker.
>
> ValidChar :: one of
> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
> [\u0020-\u007E]
> [\u00A0-\u00AC]
> [\u00AE-\u05FF]
> [\u0604-\u06DC]
> [\u06DE-\u070E]
> [\u0710-\u17B3]
> [\u17B6-\u200A]
> [\u2010-\u2027]
> [\u202F-\u205F]
> [\u2070-\uD7FF]

So no surrogates?

> [\uE000-\uFDCF]
> [\uFDF0-\uFEFE]
> [\uFF00-\uFFEF]

Why include FFEF?

> AmpFollower :: one of
> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
> // single quote, backslash, space, TAB, LF, CR
>
> (ValidChar excludes format control characters, and some other
> characters known to be mishandled by browsers. AmpFollower is
> intended to exclude characters that can start an entity reference.)
>
> S is inserted between "<script>" and "</script>" in a place where a
> <script> tag is allowed in an otherwise valid HTML document, or
> between "<script><![CDATA[" and "]]></script>" in a place where a
> <script> tag is allowed in an otherwise valid XHTML document.
> The HTML or XHTML document starts with a correct <!DOCTYPE or
> <?xml declaration respectively, and is encoded as well-formed
> UTF-8.
>
> Are these restrictions sufficient to ensure that the embedded
> script is interpreted as it would have been if referenced from
> an external file, foiling any attempts of browsers to collude
> with the attacker in misparsing it?
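
The quoted restrictions can be sketched as a filter. This is only an illustration, assuming a literal reading of the ValidChar and AmpFollower productions above; the names `VALID_CHARS`, `FORBIDDEN`, `BAD_AMP` and `is_acceptable` are hypothetical, and the sketch does not check that S is syntactically correct ES3.

```python
import re

# Ranges transcribed from the quoted ValidChar production.
# Surrogates and everything above U+FFEF fall outside every range.
VALID_CHARS = re.compile(
    r"[\u0009\u000A\u000D"
    r"\u0020-\u007E\u00A0-\u00AC\u00AE-\u05FF\u0604-\u06DC"
    r"\u06DE-\u070E\u0710-\u17B3\u17B6-\u200A\u2010-\u2027"
    r"\u202F-\u205F\u2070-\uD7FF"
    r"\uE000-\uFDCF\uFDF0-\uFEFE\uFF00-\uFFEF]*"
)

# Subsequences that must not appear anywhere in S.
FORBIDDEN = ("<!", "</", "]]>")

# "&" followed by a character that is NOT an AmpFollower.
# Assumption: a trailing "&" is followed by no character of S, so the
# literal rule does not reject it here.
BAD_AMP = re.compile(r"""&[^=(+\-!~"/0-9'\\ \t\n\r]""")


def is_acceptable(s: str) -> bool:
    """Return True if s satisfies the character-level restrictions."""
    if not VALID_CHARS.fullmatch(s):
        return False
    if any(seq in s for seq in FORBIDDEN):
        return False
    if BAD_AMP.search(s):
        return False
    return True


if __name__ == "__main__":
    for sample in ("var x = 1 & 2;", "x = '</script>';", "a &b", "a && b"):
        print(repr(sample), is_acceptable(sample))
```

Read literally, the rule also rejects `a && b`, because `&` is not itself listed in AmpFollower.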

You may still be subject to encoding attacks. I'm sure there are
valid scripts that look like UTF-7, so if the script appears in the
first 1024 bytes, you might need to make sure it's preceded by a <meta>
element specifying an encoding, and/or use the XML prologue form that
specifies an encoding.
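
To make the UTF-7 concern concrete, here is a small sketch using Python's built-in `utf-7` codec as a stand-in for a browser that sniffs the page as UTF-7. The helper name `decode_as_utf7` and the sample script are made up for illustration; real browser sniffing behaviour is not modelled.

```python
def decode_as_utf7(ascii_source: str) -> str:
    """Return the text a decoder would see if these bytes were read as UTF-7."""
    return ascii_source.encode("ascii").decode("utf-7")


# A valid script made only of printable ASCII. It contains none of
# "<!", "</", "]]>" and no "&", so the character-level rules accept it.
script = 'var s = "+ADw-/script+AD4-";'

if __name__ == "__main__":
    print(script)
    # "+ADw-" and "+AD4-" are the UTF-7 encodings of "<" and ">".
    print(decode_as_utf7(script))
```

Under a UTF-7 interpretation the string literal turns into a closing script tag, which is why the encoding has to be pinned down before the script text is reached.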

> Are some of the restrictions unnecessary?
>
> --
> David-Sarah Hopwood ⚥

# David-Sarah Hopwood (17 years ago)

Mike Samuel wrote:
> 2009/2/16 David-Sarah Hopwood <david.hopwood at industrial-designers.co.uk>
>> Suppose that S is a Unicode string in which each character matches
>> ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
>> not containing ("&" followed by a character not matching AmpFollower).
>> S encodes a syntactically correct ES3 or ES3.1 source text chosen by
>> an attacker.
>>
>> ValidChar :: one of
>> '\u0009' '\u000A' '\u000D' // TAB, LF, CR
>> [\u0020-\u007E]
>> [\u00A0-\u00AC]
>> [\u00AE-\u05FF]
>> [\u0604-\u06DC]
>> [\u06DE-\u070E]
>> [\u0710-\u17B3]
>> [\u17B6-\u200A]
>> [\u2010-\u2027]
>> [\u202F-\u205F]
>> [\u2070-\uD7FF]
> 
> So no surrogates?

Correct. They're not characters (or even "noncharacters").

>> [\uE000-\uFDCF]
>> [\uFDF0-\uFEFE]
>> [\uFF00-\uFFEF]
> 
> Why include FFEF?

It's unassigned, and there's no particular reason to exclude it.
(\uFFF0-\uFFF8 are also unassigned, but that's an area reserved
for "special" characters.)
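
The points about surrogates, unassigned code points and format controls can be checked against the Unicode character database. A minimal sketch using Python's `unicodedata` module follows; the helper names `general_category` and `encodable_as_utf8` are hypothetical, and the reported categories depend on the Unicode version bundled with the interpreter.

```python
import unicodedata


def general_category(code_point: int) -> str:
    """Return the Unicode general category, e.g. 'Cs', 'Cn' or 'Cf'."""
    return unicodedata.category(chr(code_point))


def encodable_as_utf8(code_point: int) -> bool:
    """Return True if the code point can appear in well-formed UTF-8."""
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    # Surrogates (Cs), unassigned code points (Cn), format controls (Cf).
    for cp in (0xD800, 0xDFFF, 0xFFEF, 0xFFF0, 0x00AD, 0x200B, 0xFEFF):
        print(f"U+{cp:04X}", general_category(cp), encodable_as_utf8(cp))
```

Surrogates report category `Cs` and cannot be encoded as well-formed UTF-8 at all, so a document that really is well-formed UTF-8 cannot contain them regardless of ValidChar.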

>> AmpFollower :: one of
>> '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
>> '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'
>> // single quote, backslash, space, TAB, LF, CR
>>
>> (ValidChar excludes format control characters, and some other
>> characters known to be mishandled by browsers. AmpFollower is
>> intended to exclude characters that can start an entity reference.)
>>
>> S is inserted between "<script>" and "</script>" in a place where a
>> <script> tag is allowed in an otherwise valid HTML document, or
>> between "<script><![CDATA[" and "]]></script>" in a place where a
>> <script> tag is allowed in an otherwise valid XHTML document.
>> The HTML or XHTML document starts with a correct <!DOCTYPE or
>> <?xml declaration respectively, and is encoded as well-formed
>> UTF-8.
>>
>> Are these restrictions sufficient to ensure that the embedded
>> script is interpreted as it would have been if referenced from
>> an external file, foiling any attempts of browsers to collude
>> with the attacker in misparsing it?
> 
> You may still be subject to encoding attacks.  I'm sure there are
> valid scripts that look like UTF-7, so if the script appears in the
> first 1024B, you might need to make sure it's preceded by a <meta>
> element specifying an encoding, and/or use the XML prologue form that
> specifies an encoding.

Right; I covered that in a follow-up. Is including a UTF-8 BOM at the
start sufficient for all browsers (that is, are there any browsers
in which a <meta> tag or other content sniffing can override an
explicit initial UTF-8 BOM, in either HTML or XHTML)?

HTML5 section 8.2.2.1 seems to indicate that "if the transport layer
specifies an encoding" (i.e. presumably the charset specified in
a Content-Type header), then that should override a BOM. That's
irritating, because it means that you have to assume that the server
gets the Content-Type right, *as well as* including a BOM for the
browsers in which Content-Type doesn't override sniffing
(Internet Explorer, at least), and for the case where the document
is read from a local file.
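
As a sketch of the "both signals" approach described above: emit a UTF-8 BOM and also declare the same charset in Content-Type, then check that the two agree. The function names (`serve_html`, `transport_charset`, `encoding_signals_agree`) are hypothetical and no particular server framework is assumed.

```python
import codecs
from email.message import Message
from typing import Dict, Optional, Tuple


def serve_html(document: str) -> Tuple[Dict[str, str], bytes]:
    """Build headers and body that both declare UTF-8."""
    headers = {"Content-Type": "text/html; charset=utf-8"}
    body = codecs.BOM_UTF8 + document.encode("utf-8")
    return headers, body


def transport_charset(headers: Dict[str, str]) -> Optional[str]:
    """Extract the charset parameter of Content-Type, lowercased."""
    msg = Message()
    msg["Content-Type"] = headers.get("Content-Type", "")
    return msg.get_content_charset()


def encoding_signals_agree(headers: Dict[str, str], body: bytes) -> bool:
    """True only if the body starts with a UTF-8 BOM and the header says utf-8."""
    return body.startswith(codecs.BOM_UTF8) and transport_charset(headers) == "utf-8"


if __name__ == "__main__":
    headers, body = serve_html("<!DOCTYPE html><title>t</title>")
    print(encoding_signals_agree(headers, body))
    wrong = {"Content-Type": "text/html; charset=iso-8859-1"}
    print(encoding_signals_agree(wrong, body))
```

A check like this only catches a mismatch on the producing side; it says nothing about how a given browser resolves a conflict between the header and the BOM.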

-- 
David-Sarah Hopwood ⚥