JSON.stringify </script>

# Michał Wadas (3 years ago)

Idea: require implementations to stringify "</script>" as "<\uxxxxscript>".

Benefits: remove XSS vulnerability when injecting JSON as content of <script> tag (quite common antipattern).

Backward compatible: yes, unless binary equality is required and this string is used.

# Mike Samuel (3 years ago)

I think defining an easy way to produce embeddable JSON is a great idea, but it's not quite that simple.

OWASP/json-sanitizer#output captures some requirements that I came up with for embedding JSON in HTML:

""" The output is well-formed JSON as defined by RFC 4627. The output satisfies these additional properties:

  • The output will not contain the substring (case-insensitively) "</script" so can be embedded inside an HTML script element without further encoding.
  • The output will not contain the substring "]]>" so can be embedded inside an XML CDATA section without further encoding.
  • The output is a valid Javascript expression, so can be parsed by Javascript's eval builtin (after being wrapped in parentheses) or by JSON.parse. Specifically, the output will not contain any string literals with embedded JS newlines (U+2028 Line separator or U+2029 Paragraph separator).
  • The output contains only valid Unicode scalar values (no isolated UTF-16 surrogates) that are allowed in XML unescaped. """

These apply equally well to RFC 7159 IIUC. The latter few constraints are required to allow embedding of JSON in HTML in a foreign content context ( www.w3.org/TR/html5/syntax.html#cdata-sections ).

Those rules are sufficient to allow embedding in HTML without breaking token boundaries in the embedding language.

To preserve semantics when embedding in HTML you also need to escape '&'. To prevent exfiltration via external entities in SVG & other XML variants, you should probably also escape '%'.
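A minimal sketch of a serializer that meets most of the requirements above, assuming it is acceptable to escape every '<', '>' and '&' rather than only the hazardous substrings. The name stringifyEmbeddable is hypothetical and not part of any standard or of OWASP/json-sanitizer. It relies on the fact that in JSON text these characters can only occur inside string literals, so replacing them with \uXXXX escapes does not change the parsed value. It does not address '%'.

```javascript
// Hypothetical helper (an assumption for illustration, not a standard API).
// Serialize to JSON, then replace characters that are hazardous when the
// text is embedded in HTML, XML CDATA, or evaluated as JS with \uXXXX escapes.
function stringifyEmbeddable(value) {
  const json = JSON.stringify(value);
  if (json === undefined) return undefined; // e.g. value is a function
  return json.replace(/[<>&\u2028\u2029]/g, (ch) =>
    "\\u" + ch.charCodeAt(0).toString(16).padStart(4, "0"));
}

const payload = { html: "</script><!--", cdata: "]]>", sep: "\u2028" };
const embedded = stringifyEmbeddable(payload);
```

Escaping every '<' also covers case variants such as "</SCRIPT", and JSON.parse still recovers the original value from the escaped text.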

# Alexander Jones (3 years ago)

That's awful. As you say, it's an antipattern, no further effort should be spent on this. JSON produced by JavaScript has far more general uses than slapping directly into a script tag unencoded, so no-one else should have to see this. Also, there are many other producers of JSON than JavaScript.

Instead, use XHTML and CDATA (which has a straightforward encoding mechanism that doesn't ruin the parseability of the code or affect it in any way) if you really want to pull stunts like this.

Alex

# Michał Wadas (3 years ago)

Actually, CDATA suffers from the same issue with the string "]]>". Mike Samuel has a very strong point here.

And by saying "it's an antipattern, don't do this" we will not make old vulnerable code go away. And we have a very good way to stop people from shooting themselves in the foot, for free.


# Alexander Jones (3 years ago)

Embedding a JSON literal into HTML involves first encoding to JSON then encoding that into HTML. Two stages which must not be confused. The 'encoding into HTML' part is best done in XHTML with CDATA, and the encoding method is taken care of by whichever XML-generating library you're using. If you hint it to use CDATA for such a text node, or if for any other reason it chooses to use CDATA, rather than merely converting every < to &lt;, etc., then it will (or should) "escape" ]]> as ]]]]><![CDATA[> or whatever equivalent. See en.wikipedia.org/wiki/CDATA#Nesting for more info. Crucially, this works for encoding ANY text data into a text node in an XML document, not just JSON.
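The CDATA escaping step described above can be sketched as follows. The function name toCdata is hypothetical; a real XML-generating library would normally do this internally.

```javascript
// Hypothetical helper (an assumption for illustration). The only sequence
// that cannot appear inside a CDATA section is "]]>", so each occurrence is
// split across two adjacent sections: "]]" closes one, ">" opens the next.
function toCdata(text) {
  return "<![CDATA[" + text.replace(/\]\]>/g, "]]]]><![CDATA[>") + "]]>";
}

const wrapped = toCdata('var s = "]]>";');
```

An XML parser concatenates the adjacent sections, so the resulting text node holds the original string unchanged.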

Having the specified JSON algorithm in ECMAScript deal with concerns of embedding into legacy, non XML-based HTML (oh yes, I totally went there! ;) ) is a classic layer violation, which I would guarantee offends 99 out of 100 experienced programmers' sensibilities. :)

Aside, I'll repeat again that this would be largely ineffective - a lot of JSON that might be dumbly pasted into a text stream of HTML would be generated by implementations other than that specified by ECMAScript.

Hope this clears it up

Alex

# Kris Siegel (3 years ago)

ECMAScript, while highly used in web browsers, should really not care about HTML constructs. That's where WHATWG and W3C come in. I suggest this type of feature should come from one of those groups, not ECMA.

# Mike Samuel (3 years ago)

I agree it's subideal which is why I work to address problems like this in template systems but ad-hoc string concatenation happens and embeddable sub-languages provide defense-in-depth without sacrificing correctness.

CDATA sections solve no problems because they cannot contain any string that has "]]>" as a substring so you still have to s/]]>/]]]]><![CDATA[>/g.

# Alexander Jones (3 years ago)

They do solve the problem. You encode your entire JS before pasting it, encoding ]]> and nothing more, and the XML document's text node contains the unadulterated text, which the JS parser also sees. It's perfect layer isolation. Ye olde HTML can't do that because there is no escaping mechanism for </script> that actually allows the JS parser to see the text (code) content unmodified.

Viva la <xhtml:revolución /> ;)

# Mike Samuel (3 years ago)

Without CDATA you have to encode script bodies properly. With CDATA you have to encode script bodies properly. What problem did CDATA solve?

# Alexander Jones (3 years ago)

In XHTML, CDATA allows a 'more' verbatim spelling of text node content. But the end token has to be escaped, as discussed. Despite this escaping, the text node can contain arbitrary strings.

In XHTML, you can achieve the same effect without CDATA, just by escaping XML entities. Again, and crucially, the text node can contain arbitrary strings.

In HTML without CDATA, using HTML entities within the script tag is wrong specifically because they are not interpreted. The text node in the HTML document CANNOT contain arbitrary strings, and there is no further decode step before the JS parser hits your code, so you're forced to take other measures to ensure that </script> does not appear in your code. There are a few places this can appear, only one of which is embedded in string literals, so the method of avoiding this is actually sensitive to the context and not practical to specify.

I hope you can appreciate how ridiculous this problem is for HTML - I don't believe CDATA support in HTML 5 can solve this due to forward compatibility - which is why it's an antipattern. Just don't do it, or use XHTML. It's not cool to hate on XML anymore. ;)

Alex

# Simon Pieters (3 years ago)

On Wed, 28 Sep 2016 19:06:31 +0200, Michał Wadas <michalwadas at gmail.com> wrote:

Idea: require implementations to stringify "</script>" as "<\uxxxxscript>".

Benefits: remove XSS vulnerability when injecting JSON as content of <script> tag (quite common antipattern).

Backward compatible: yes, unless binary equality is required and this string is used.

You would also need to escape "<!--" and "<script" for HTML. See html.spec.whatwg.org/multipage/scripting.html#restrictions
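A rough sketch of a conservative check for the three sequences mentioned here, assuming it is acceptable to flag any occurrence rather than implement the exact balancing rules in the linked HTML restrictions. The function name hasScriptHazard is hypothetical.

```javascript
// Hypothetical helper (an assumption for illustration). Returns true if the
// text contains "<!--", "<script" or "</script" in any letter case. This is
// stricter than the HTML rules, which permit some balanced uses.
function hasScriptHazard(text) {
  return /<!--|<\/?script/i.test(text);
}

const risky = JSON.stringify({ a: "</SCRIPT>", b: "<!--" });
const riskyFlagged = hasScriptHazard(risky);
```

Text that fails this check would need its '<' characters escaped (for JSON, as \u003c) before being placed inside a script element.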

# Mike Samuel (3 years ago)

On Thu, Sep 29, 2016 at 2:09 AM, Alexander Jones <alex at weej.com> wrote:

In XHTML, CDATA allows a 'more' verbatim spelling of text node content. But the end token has to be escaped, as discussed. Despite this escaping, the text node can contain arbitrary strings.

In XHTML, you can achieve the same effect without CDATA, just by escaping XML entities. Again, and crucially, the text node can contain arbitrary strings.

So, <script><![CDATA[...]]></script> has a complete escaping process,

whereas, since CDATA sections were taken out of HTML foreign element content disallowing <svg><script><![CDATA[...]]></script></svg>

HTML does not, so to figure out how to embed

alert("</script>"); if (a < /script>/.exec(myString)) ...

you have to do scripting language specific analysis.

Is that about right?

In HTML without CDATA, using HTML entities within the script tag is wrong specifically because they are not interpreted. The text node in the HTML document CANNOT contain arbitrary strings, and there is no further decode step before the JS parser hits your code, so you're forced to take other measures to ensure that </script> does not appear in your code. There are a few places this can appear, only one of which is embedded in string literals, so the method of avoiding this is actually sensitive to the context and not practical to specify.

I hope you can appreciate how ridiculous this problem is for HTML - I don't believe CDATA support in HTML 5 can solve this due to forward compatibility - which is why it's an antipattern. Just don't do it, or use XHTML. It's not cool to hate on XML anymore. ;)

Yes. I've written hardened DOM tree serializers. I appreciate these problems. No-one is hating on XML.

We're talking about JSON serializers. Every JSON serializer produces a subset of the output language. Choices about that sublanguage affect how easy/hard it is to use that serializer with other tools.

That "if everyone wrote software with property P, we would not have problem Q" is a great argument that we should prefer stacks with property P, but does not mean we should not take the prevalence of problem Q into account when designing elements of software stacks. You seem to actually be arguing that we should not do our best to prevent problem Q by other means, but real systems need defense-in-depth.

So I concede your point about CDATA sections but don't see that these arguments about antipatterns and the benefits of XHTML are all that relevant.

# Oriol Bugzilla (3 years ago)

ECMAScript, while highly used in web browsers, should really not care about HTML constructs. That's where WHATWG and W3C come in. I suggest this type of feature should come from one of those groups, not ECMA.

That applies to escaping things like </script> or ]]>, and I agree. But as Mike Samuel mentioned, JSON strings containing U+2028 or U+2029 are not valid JS expressions. I think it would make sense for JSON.stringify to escape these.

# Mike Samuel (3 years ago)

On Thu, Sep 29, 2016 at 8:45 AM, Oriol Bugzilla <oriol-bugzilla at hotmail.com> wrote:

ECMAScript, while highly used in web browsers, should really not care about HTML constructs. That's where WHATWG and W3C come in. I suggest this type of feature should come from one of those groups, not ECMA.

That applies to escaping things like </script> or ]]>, and I agree. But as Mike Samuel mentioned, JSON strings containing U+2028 or U+2029 are not valid JS expressions. I think it would make sense for JSON.stringify to escape these.

What is it that you're saying is not in TC-39's bailiwick?

Is it that w3c/whatwg should define what constitutes "embeddable JSON"?

Or is it that if it's worth defining a function that produces embeddable JSON from an EcmaScript object, that w3c/whatwg should include that in some set of EcmaScript APIs that it defines?

If you agree with my earlier claim """ We're talking about JSON serializers. Every serializer produces a subset of the output language. Choices about that sublanguage affect how easy/hard it is to use that serializer with other tools. """ then it seems that TC-39 might take embeddability into account when crafting the subset of JSON that JSON.stringify produces.

# Alexander Jones (3 years ago)

Maybe we should just make U+2028 and U+2029 valid in JS then? What other productions in JSON are invalid syntax in JS?

# Mike Samuel (3 years ago)

On Thu, Sep 29, 2016 at 9:25 AM, Alexander Jones <alex at weej.com> wrote:

Maybe we should just make U+2028 and U+2029 valid in JS then? What other productions in JSON are invalid syntax in JS?

I don't think any other productions in JSON are invalid syntax in an Expression context.

JSON places no limit on size of numeric literals, and other languages ban unrepresentably large ones, but IIRC ES does not.

Obviously if you start parsing JSON in a statement context, you run into problems where a JSON object with one or more properties is an invalid BlockStatement and the ExpressionStatement production is not reached because of the negative lookahead.
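The statement-context problem can be illustrated with a small sketch. It uses eval only to show the parsing difference; for untrusted input JSON.parse is the appropriate tool. The variable names are illustrative.

```javascript
const text = '{"a": 1, "b": 2}';

// Parsed as a statement, the leading "{" starts a block, and "a": is not a
// valid statement inside it, so parsing fails.
let statementError = null;
try {
  (0, eval)(text);
} catch (err) {
  statementError = err;
}

// Wrapped in parentheses, the same text is parsed as an expression and
// evaluates to an object literal.
const value = (0, eval)("(" + text + ")");
```

The indirect call (0, eval) runs the text in global scope, which keeps the sketch independent of surrounding local variables.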

# Mike Samuel (3 years ago)

On Wed, Sep 28, 2016 at 10:06 AM, Michał Wadas <michalwadas at gmail.com> wrote:

Idea: require implementations to stringify "</script>" as "<\uxxxxscript>".

Benefits: remove XSS vulnerability when injecting JSON as content of <script> tag (quite common antipattern).

Backward compatible: yes, unless binary equality is required and this string is used.

TL;DR: I'm against this.

I've pushed back against a number of threads, so I want to avoid leaving the impression that I support this proposal.

I think this is a bad idea, so let me try to pull together the various threads and address them in one place.

Should EcmaScript or any other standards body define "embeddable JSON"?

No. The main argument for this feature is to make it easier to write more secure code, and to transparently make existing code more secure.

Standards bodies move too slowly. Library code can roll-out quickly in response to zero-days or emerging threats, but standards cannot.

For example, client-side templates using mustaches ( goo.gl/eztprF ) are an emerging threat.

There has been a poor history of this, even with JSON. Crock's RFC 4627 said """ A JSON text can be safely passed into JavaScript's eval() function (which compiles and executes a string) if all the characters not enclosed in strings are in the set of characters that form JSON tokens. This can be quickly determined in JavaScript with two regular expressions and calls to the test and replace methods.

  var my_JSON_object = !(/[^,:{}\[\]0-9.\-+Eaeflnr-u \n\r\t]/.test(
         text.replace(/"(\\.|[^"\\])*"/g, ''))) &&
     eval('(' + text + ')');

""" which is not in the latest JSON RFC because it was found to be false in a dozen ways before RFC 7158 (obsoleted) removed that language.

The only way to deal with emerging threats is to have a quickly patchable system. Patching serializers causes spurious test failures, the broken-hearts problem: assertTrue("I <3 u", serializeHtml("I <3 u")). I suspect that the best we will ever be able to do re emerging-threats is to allow those who care about security to patch and fix tests and ignore the maintenance cost to unmaintained projects.

Is there any value in embeddable sanitizers?

I think embeddable serializers can provide defense-in-depth against faults in code that composes network messages which is why I wrote OWASP/json-sanitizer to do just that.

Is this backwards compatible?

No. JSON strings are used as keys in persisted tables because we have de-facto defined a canonical subset of JSON.

This kind of thing can be discouraged by randomizing, the way Java is doing with built-in map implementations in Java 9, which also helps avoid broken-hearts problems. Java is a large-API language so can provide umpteen variants of x in a way that wouldn't fit well in ES, and providing an alternate API loses a lot of the benefit of the original proposal.

Are embeddable serializers an anti-pattern?

No. The anti-pattern is that trustworthy and untrustworthy content are mixed using naive string concatenation to produce a trusted output.

Even if the real anti-pattern were not endemic within distributed systems, composing trustworthy network messages is hard and embeddable serializers provide useful defense-in-depth for message composing code.

Is XHTML more easily secured than HTML?

Yes. XML is much more easily statically analyzed, and mistaken assumptions in a serializer much more frequently manifest as parse failures so fail safe more often. When the embedding language fails safe, the whole is more secure than if you have an embedded language that fails safe in an embedding language which does not, as is the case with JSON in HTML.

This is why, when I write an HTML sanitizer or hardened DOM serializer, I try to make the output the intersection of HTML & vanilla XML+namespaces. (This prevents use of CDATA sections, incidentally, so serializers have included JS rewriters.)

At the risk of FUD though, XHTML-specific parsing branches might be simpler but have been much less heavily tested and fuzzed, so it might actually be easier to craft a buffer overflow to take over the renderer for an origin that serves XHTML than one that serves HTML exclusively.

The security of XHTML is not relevant though, because XHTML isn't used.

To anyone who is passionate about the benefits of making HTML more XML-like, I would be happy to help with a proposal to the content-security-policy team or similar body to add a switch that says that the parsing should halt as soon as it is realized that the content is not syntactically valid XML to get the fail-safe benefits of XML.

# Mark S. Miller (3 years ago)

On Thu, Sep 29, 2016 at 9:25 AM, Alexander Jones <alex at weej.com> wrote:

Maybe we should just make U+2028 and U+2029 valid in JS then? What other productions in JSON are invalid syntax in JS?

IIRC, Doug Crockford, possibly Mike Samuel, and I (and perhaps others) advocated such a change to EcmaScript back during the transition from ES3 to ES3.1/ES5. ES differed enough between platforms in other ways that, some of us felt, it would have been worth the experiment to see if we could get away with it -- without breaking the web. We were not able to convince people to engage in that experiment then. Such an experiment would be much more expensive now, with a much lower probability of success, and with a lower payoff. I don't see it happening.

On Thursday, 29 September 2016, Mike Samuel <mikesamuel at gmail.com> wrote:

On Thu, Sep 29, 2016 at 8:45 AM, Oriol Bugzilla <oriol-bugzilla at hotmail.com> wrote:

ECMAScript, while highly used in web browsers, should really not care about HTML constructs. That's where WHATWG and W3C come in. I suggest this type of feature should come from one of those groups, not ECMA.

That applies to escaping things like </script> or ]]>, and I agree. But as Mike Samuel mentioned, JSON strings containing U+2028 or U+2029 are not valid JS expressions. I think it would make sense for JSON.stringify to escape these.

What is it that you're saying is not in TC-39's bailiwick?

Is it that w3c/whatwg should define what constitutes "embeddable JSON"?

Or is it that if it's worth defining a function that produces embeddable JSON from an EcmaScript object, that w3c/whatwg should include that in some set of EcmaScript APIs that it defines?

If you agree with my earlier claim """ We're talking about JSON serializers. Every serializer produces a subset of the output language. Choices about that sublanguage affect how easy/hard it is to use that serializer with other tools. """ then it seems that TC-39 might take embeddability into account when crafting the subset of JSON that JSON.stringify produces.

I agree that this issue belongs with TC39 much more than it belongs anywhere else. TC39's steering of JS is certainly influenced by how JS gets used in web browsers. When an issue touches both JS and browser specific concerns, it can often be unclear whose "jurisdiction" it belongs in. This one is not unclear. It should be treated as a language issue by TC39.