Am I paranoid enough?
No, I'm not paranoid enough yet. It's not sufficient only to say that the HTML is encoded as UTF-8 (see below).
David-Sarah Hopwood wrote: [...]
The HTML or XHTML document starts with a correct <!DOCTYPE or <?xml declaration respectively,
I meant, the document starts with <!DOCTYPE HTML> in the case
of HTML, or <?xml version="1.0"?><!DOCTYPE HTML> in the case of
XHTML.
(This will also put the parser into sane^H^H^H^Hstandards mode.)
and is encoded as well-formed UTF-8.
The document must also start with a UTF-8 BOM and must not contain a META directive that changes the charset. In the case of HTML, it must additionally either be retrieved from a local file or be served over HTTP with the header "Content-Type: text/html; charset=UTF-8". This is because the method of determining the encoding is chosen based on the phase of the moon.
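As a rough illustration of the encoding conditions just listed, here is a minimal Python sketch. The function name encoding_conditions_hold and its parameters are hypothetical. The META test is a crude textual scan rather than a real HTML parse, and it conservatively rejects any META tag that mentions a charset at all, which is stricter than the condition stated above. Passing content_type=None is assumed to mean "retrieved from a local file".

```python
import codecs
import re
from typing import Optional

# Crude, conservative scan: any <meta ...> tag that mentions "charset".
# (Assumption: a real checker would parse the HTML instead.)
META_CHARSET = re.compile(r"<meta\b[^>]*charset", re.IGNORECASE)


def encoding_conditions_hold(doc: bytes, content_type: Optional[str] = None) -> bool:
    """Check the BOM, UTF-8 well-formedness, META and Content-Type conditions."""
    # The document must start with a UTF-8 BOM.
    if not doc.startswith(codecs.BOM_UTF8):
        return False
    # Strict decoding fails on ill-formed UTF-8.
    try:
        text = doc.decode("utf-8")
    except UnicodeDecodeError:
        return False
    # No META directive touching the charset (conservative reading).
    if META_CHARSET.search(text):
        return False
    # If served over HTTP, the header value must name UTF-8.
    if content_type is not None:
        normalized = content_type.replace(" ", "").lower()
        return normalized == "text/html;charset=utf-8"
    return True
```

The sketch only checks the stated preconditions; it makes no claim about what any particular browser does when they fail.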
Any other problems?
What are you trying to do? Exclude all scripts that use the && operator?
Waldemar
Waldemar Horwat wrote:
What are you trying to do? Exclude all scripts that use the && operator?
Oops, I failed to describe the intended restrictions correctly. Any sequence of consecutive '&' would be allowed if followed by AmpFollower. But at least this tells me you were paying attention ;-)
Suppose that S is a Unicode string in which each character matches ValidChar below, not containing the subsequences "<!", "</" or "]]>", and
not containing ("&" followed by a character not matching AmpFollower). S encodes a syntactically correct ES3 or ES3.1 source text chosen by an attacker.
ValidChar :: one of
    '\u0009' '\u000A' '\u000D'   // TAB, LF, CR
    [\u0020-\u007E] [\u00A0-\u00AC] [\u00AE-\u05FF] [\u0604-\u06DC]
    [\u06DE-\u070E] [\u0710-\u17B3] [\u17B6-\u200A] [\u2010-\u2027]
    [\u202F-\u205F] [\u2070-\uD7FF] [\uE000-\uFDCF] [\uFDF0-\uFEFE]
    [\uFF00-\uFFEF]

AmpFollower :: one of
    '=' '(' '+' '-' '!' '~' '"' '/' [0-9]
    '\u0027' '\u005C' '\u0020' '\u0009' '\u000A' '\u000D'   // single quote, backslash, space, TAB, LF, CR
(ValidChar excludes format control characters, and some other characters known to be mishandled by browsers. AmpFollower is intended to exclude characters that can start an entity reference.)
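To make the character-level restrictions concrete, here is a minimal Python sketch of a checker for S. All names in it (VALID_RANGES, satisfies_restrictions, and so on) are hypothetical. It follows the corrected reading given in the reply above, under which a run of consecutive '&' is allowed if it is followed by an AmpFollower. Rejecting a '&' at the very end of S is an extra assumption of the sketch, not something the thread states. It does not check that S is syntactically correct ES3 or ES3.1.

```python
# ValidChar as inclusive code point ranges, transcribed from the grammar above.
VALID_RANGES = [
    (0x0009, 0x000A), (0x000D, 0x000D),            # TAB, LF, CR
    (0x0020, 0x007E), (0x00A0, 0x00AC), (0x00AE, 0x05FF),
    (0x0604, 0x06DC), (0x06DE, 0x070E), (0x0710, 0x17B3),
    (0x17B6, 0x200A), (0x2010, 0x2027), (0x202F, 0x205F),
    (0x2070, 0xD7FF), (0xE000, 0xFDCF), (0xFDF0, 0xFEFE),
    (0xFF00, 0xFFEF),
]

# AmpFollower: = ( + - ! ~ " / digits ' backslash space TAB LF CR
AMP_FOLLOWERS = set("=(+-!~\"/0123456789'\\ \t\n\r")

FORBIDDEN = ("<!", "</", "]]>")


def is_valid_char(ch: str) -> bool:
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in VALID_RANGES)


def has_bad_ampersand(s: str) -> bool:
    """True if some '&' is followed by neither '&' nor an AmpFollower."""
    for i, ch in enumerate(s):
        if ch != "&":
            continue
        if i + 1 == len(s):
            return True  # assumption: a trailing '&' is rejected
        nxt = s[i + 1]
        if nxt != "&" and nxt not in AMP_FOLLOWERS:
            return True
    return False


def satisfies_restrictions(s: str) -> bool:
    return (
        all(is_valid_char(ch) for ch in s)
        and not any(seq in s for seq in FORBIDDEN)
        and not has_bad_ampersand(s)
    )
```

Under this sketch, "if (a && b) { f(); }" passes, while "x = a &b;" fails because '&' is followed by a letter that could start an entity reference.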
S is inserted between "<script>" and "</script>" in a place where a <script> tag is allowed in an otherwise valid HTML document, or
between "<script><![CDATA[" and "]]></script>" in a place where a <script> tag is allowed in an otherwise valid XHTML document.
The HTML or XHTML document starts with a correct <!DOCTYPE or <?xml declaration respectively, and is encoded as well-formed UTF-8.
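The two embedding forms described above can be sketched in Python as follows. The function names are hypothetical, and the check shown covers only the three forbidden subsequences, which are the ones that could otherwise terminate or escape the enclosing element. The remaining restrictions on S are assumed to be checked elsewhere.

```python
FORBIDDEN = ("<!", "</", "]]>")


def _reject_forbidden(s: str) -> None:
    # These subsequences could close the <script> element or the CDATA
    # section early, or open a markup declaration, so S must not contain them.
    for seq in FORBIDDEN:
        if seq in s:
            raise ValueError("S contains forbidden subsequence " + repr(seq))


def embed_in_html(s: str) -> str:
    """Wrap S for insertion where a <script> tag is allowed in HTML."""
    _reject_forbidden(s)
    return "<script>" + s + "</script>"


def embed_in_xhtml(s: str) -> str:
    """Wrap S for insertion where a <script> tag is allowed in XHTML."""
    _reject_forbidden(s)
    return "<script><![CDATA[" + s + "]]></script>"
```

Note that perfectly ordinary script text such as "a[b[0]]>c" contains "]]>" and is therefore rejected by this sketch.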
Are these restrictions sufficient to ensure that the embedded script is interpreted as it would have been if referenced from an external file, foiling any attempts of browsers to collude with the attacker in misparsing it?
Are some of the restrictions unnecessary?