[Json] BOMs

# Bjoern Hoehrmann (11 years ago)
  • Tatu Saloranta wrote:

Dominant Java implementations support UTF-16 with BOM, either directly or through Java's Reader implementations that handle BOMs. The string concatenation case seems irrelevant, since BOMs are not included in the in-memory representation anyway, as opposed to byte stream serialization.
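A quick JavaScript sketch of that point (TextDecoder here stands in for Java's BOM-aware Readers; by default it drops a leading UTF-8 BOM at the byte-to-text boundary):

const bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x7B, 0x7D]); // UTF-8 BOM + "{}"
const text = new TextDecoder('utf-8').decode(bytes); // BOM stripped during decoding
console.log(text.length); // 2 — the in-memory string is just "{}", no U+FEFF
console.log(JSON.parse(text)); // {} — concatenating such strings can never introduce a BOM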

HTTP implementations cannot correctly determine whether an entity body is text in a single character encoding and, if so, what that encoding is; accordingly, the dominant API deals in byte[] arrays, not text Strings. Furthermore, many programming languages default to byte[] arrays for string literals. That often combines into forms of

byte[] json = sprintf('{"x": %s, "y": %s}', GET(...), GET(...));

which works fine if all three byte[] arrays are UTF-8 encoded and use no Unicode signature, which is the case 99% of the time.
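A hedged JavaScript sketch of the remaining 1% (all names illustrative): if one of the spliced-in payloads happens to carry a UTF-8 BOM, the BOM lands mid-document and the result is no longer well-formed JSON.

const enc = new TextEncoder();
const payload = new Uint8Array([0xEF, 0xBB, 0xBF, ...enc.encode('1')]); // a fetched value with a BOM
const json = new Uint8Array([...enc.encode('{"x": '), ...payload, ...enc.encode(', "y": 2}')]);
JSON.parse(new TextDecoder().decode(json)); // SyntaxError — U+FEFF is not JSON whitespace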

# Allen Wirfs-Brock (11 years ago)

On Nov 19, 2013, at 3:09 AM, Martin J. Dürst wrote:

... As for JSON, it doesn't have the problem of legacy encodings. JSON by definition is encoded in an Unicode encoding form, and it's easy to distinguish these because of the restrictions on character sequences in JSON. And this can be done without a BOM (or with a BOM).
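To make the "easy to distinguish" claim concrete, here is a rough sketch of the four-octet detection scheme RFC 4627 describes (assuming, as that RFC's grammar guaranteed, that a JSON text begins with two ASCII characters):

function detectEncoding(b) { // b: a Uint8Array with at least four octets
  if (b[0] === 0x00 && b[1] === 0x00 && b[2] === 0x00) return 'UTF-32BE'; // 00 00 00 xx
  if (b[0] === 0x00) return 'UTF-16BE'; // 00 xx 00 xx
  if (b[1] === 0x00 && b[2] === 0x00 && b[3] === 0x00) return 'UTF-32LE'; // xx 00 00 00
  if (b[1] === 0x00) return 'UTF-16LE'; // xx 00 xx 00
  return 'UTF-8'; // xx xx xx xx
}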

What's most important now is to know what receivers actually accept. We are not in a design phase, we are just updating the definition of JSON and making sure we fix problems if there are problems, but we have to use the installed base for the main guidance, not other protocols or formats.

There can be no doubt that the most widely deployed JSON parsers are those that are built into the browser JavaScript implementations. The ECMAScript 5 specification for JSON.parse that they implement says BOM is an illegal character. But what do the browsers actually implement? This:

//Firefox 25 scratchpad execution: JSON.parse('\ufeff {"abc": 0} ') /* Exception: JSON.parse: unexpected character @Scratchpad/1:1 */

//Safari 5.1.9 JS console JSON.parse('\ufeff {"abc": 0} ') message: "JSON Parse error: Unrecognized token '?'"

//Chrome 31 JS console JSON.parse('\ufeff {"abc": 0} ') SyntaxError: Unexpected token  message: "Unexpected token "

Unfortunately, I don't have access to IE right now, but the trend is clear.
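The tests above collapse into a single feature check that runs in any engine (a sketch consolidating the observations, not one of the original tests):

function parserAcceptsLeadingBOM() {
  try { JSON.parse('\ufeff {"abc": 0} '); return true; }
  catch (e) { return false; } // false in every engine tested above, as ES5 requires
}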

# Allen Wirfs-Brock (11 years ago)

On Nov 19, 2013, at 10:18 PM, Martin J. Dürst wrote:

Hello Henry, others,

On 2013/11/20 3:55, Henry S. Thompson wrote:

Allen Wirfs-Brock writes:

There can be no doubt that the most widely deployed JSON parsers are those that are built into the browser JavaScript implementations. The ECMAScript 5 specification for JSON.parse that they implement says BOM is an illegal character. But what do the browsers actually implement? This:

No, try e.g. jsonviewer.stack.hu [1] (works in Chrome, Safari, Opera, not in IE or Firefox)

In Firefox, I got some garbled characters, in particular a question mark for each of the two bytes of the BOM and one question mark for the e-acute. Because of the type of the errors, I strongly suspect it is related to what we are trying to investigate, and so I don't think this can be taken as evidence one way or another.

or feed [2] to www.jsoneditoronline.org (use Open/Url) (works in Chrome, IE, Firefox; ran out of time to test more).

The fact that some libraries or Web sites accept a BOM for JSON isn't proof that all (well, let's say the majority) accept a BOM.

Just to be clear about this. My tests directly tested the JavaScript built-in JSON parsers with respect to BOM support in three major browsers. The tests directly invoked the built-in JSON.parse functions and directly passed them source strings that were explicitly constructed to contain a BOM code point. This was done to ensure that all transport layers (and any transcodings they might perform) were bypassed and that we were actually testing the real built-in JSON.parse functions.

Neither of the sites referenced above performs a comparable test. They take user-inputted text, which is then passed through who knows what layers of browser and application preprocessing, and then they present something derived from that original user input to a JSON parser. In both cases the actual parser does not appear to be the built-in JavaScript JSON.parse function that I was testing.

jsonviewer.stack.hu uses Ext.util.JSON.decode, whose documentation describes it as a "Modified version of Douglas Crockford's json.js". In other words, it is not the built-in JSON.parse function.

www.jsoneditoronline.org uses a library called JSONLint in preference to the built-in JSON.parse function. It does not conform to the ECMAScript 5 JSON.parse specification.

So testing using either of these sites says nothing relevant about my observations concerning BOM handling by the most widely deployed JSON parsers (the ones that are built into browser JavaScript implementations).

# Allen Wirfs-Brock (11 years ago)

On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:

On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Just to be clear about this. My tests directly tested the JavaScript built-in JSON parsers with respect to BOM support in three major browsers. The tests directly invoked the built-in JSON.parse functions and directly passed them source strings that were explicitly constructed to contain a BOM code point. This was done to ensure that all transport layers (and any transcodings they might perform) were bypassed and that we were actually testing the real built-in JSON.parse functions.

It would be surprising if JSON.parse() accepted a BOM, since it doesn't take bytes as input.

ECMAScript's JSON.parse accepts an ECMAScript string value as its input. ECMAScript strings are sequences of 16-bit values. JSON.parse (and most other ECMAScript functions) interprets those values as Unicode code units. The value U+FEFF can appear at any position within a string. When defining a string as an ECMAScript literal, a sequence like \ufeff is an escape sequence that means: place the code unit value 0xFEFF into the string at this position in the sequence. Also note that the actual strings passed below to JSON.parse contain the actual code point value U+FEFF, not the escape sequence that was used to express it. To include the actual escape sequence characters in the string, it would have to be expressed as '\\ufeff'.

JSON.parse('\ufeff ["XYZ"]'); //note outer quotes delimit an ECMAScript string, the inner quotes are a JSON string.

throws a runtime SyntaxError exception because the JSON grammar does not allow U+FEFF to appear at that position.

JSON.parse('["\ufeffXYZ"]');

operates without error and returns an Array containing a single ECMAScript string of four code units. This works because the JSON grammar allows any code unit except " and \ and the ASCII control characters to appear literally in a JSON string.
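A small verification of that second case (consistent with the description above):

const result = JSON.parse('["\ufeffXYZ"]');
console.log(result[0].length); // 4 — U+FEFF, X, Y, Z
console.log(result[0].charCodeAt(0).toString(16)); // "feff" — the code unit survives as ordinary data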

However, XHR's responseType = "json" exercises browsers in a way where the input is bytes from the network. From the perspective of JSON support in XHR, lists.w3.org/Archives/Public/www-tag/2013Nov/0149.html (which didn't reach the es-discuss part of this thread previously) applies.

Right, JSON use via XHR is a different usage scenario, and that probably involves encoding and decoding steps. It has very little to do with the JSON syntax as defined in ECMA-404. It's all about how the bits that represent a string are interchanged, not the eventual semantic processing of the string (i.e., processing by JSON.parse or some other JSON parser).
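A sketch of that layering (illustrative, approximating what responseType = "json" does internally): the bytes are decoded to a string first, where a leading BOM is consumed, and only then does a JSON parser see the text.

const raw = new Uint8Array([0xEF, 0xBB, 0xBF, ...new TextEncoder().encode('{"abc": 0}')]);
const decoded = new TextDecoder('utf-8').decode(raw); // BOM consumed by the decoding step
console.log(JSON.parse(decoded).abc); // 0 — JSON.parse itself never saw a BOM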

# Bjoern Hoehrmann (11 years ago)
  • Allen Wirfs-Brock wrote:

On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:

On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Just to be clear about this. My tests directly tested the JavaScript built-in JSON parsers with respect to BOM support in three major browsers. The tests directly invoked the built-in JSON.parse functions and directly passed them source strings that were explicitly constructed to contain a BOM code point.

It would be surprising if JSON.parse() accepted a BOM, since it doesn't take bytes as input.

ECMAScript's JSON.parse accepts an ECMAScript string value as its input. ECMAScript strings are sequences of 16-bit values. JSON.parse (and most other ECMAScript functions) interprets those values as Unicode code units. The value U+FEFF can appear at any position within a string. When defining a string as an ECMAScript literal, a sequence like \ufeff is an escape sequence that means: place the code unit value 0xFEFF into the string at this position in the sequence. Also note that the actual strings passed below to JSON.parse contain the actual code point value U+FEFF, not the escape sequence that was used to express it. To include the actual escape sequence characters in the string, it would have to be expressed as '\\ufeff'.

A byte order mark indicates the order of bytes in a sequence of bytes. An ECMAScript string is not a sequence of bytes and therefore it cannot have a byte order mark inside it. Your test is not for BOM support but for an egregious semantic error in the implementation of JSON.parse.

shadowregistry.org/js/misc/#t2ea25a961255bb1202da9497a1942e09

That is a similar test. It makes Firefox see UTF-8 BOMs in ECMAScript strings. Firefox is not supposed to look for UTF-8 BOMs in ECMAScript strings, because ECMAScript strings are not sequences of bytes at that level of reasoning.
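The distinction can be shown directly (a sketch of conforming behavior; the code units below are the UTF-8 BOM bytes reinterpreted as three characters, not a BOM):

JSON.parse('\u00EF\u00BB\u00BF{"abc": 0}'); // must throw SyntaxError in a conforming engine
// Skipping those three code units as if they formed a BOM is the semantic error described above.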

Is there any chance, by the way, to change JSON.stringify so it does not output strings that cannot be encoded using UTF-8? Specifically,

JSON.stringify(JSON.parse('"\uD800"'))

would need to escape the surrogate instead of emitting it literally.

# Mathias Bynens (11 years ago)

On 21 Nov 2013, at 09:41, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

Is there any chance, by the way, to change JSON.stringify so it does not output strings that cannot be encoded using UTF-8? Specifically,

JSON.stringify(JSON.parse('"\uD800"'))

would need to escape the surrogate instead of emitting it literally.

Previous discussion: esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content

# Bjoern Hoehrmann (11 years ago)
  • John Cowan wrote:

Bjoern Hoehrmann scripsit:

Is there any chance, by the way, to change JSON.stringify so it does not output strings that cannot be encoded using UTF-8? Specifically,

JSON.stringify(JSON.parse('"\uD800"'))

would need to escape the surrogate instead of emitting it literally.

No, there isn't. We've been down this road repeatedly. People can and do use JSON strings to encode arbitrary sequences of unsigned 16-bit integers.

The output of JSON.stringify("\uD800") contains no backslash character; if you call utf8_encode(JSON.stringify("\uD800")), you get an exception, because UTF-8 cannot encode the lone surrogate, and utf8_encode does not know it could encode it as \uD800 without loss of information. If JSON.stringify produced an escape sequence instead, there would be no problem passing the output to utf8_encode.
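To illustrate with a built-in that does UTF-8-encode its input (encodeURIComponent standing in for the hypothetical utf8_encode above):

const out = JSON.stringify('\uD800'); // in engines of that era: a quoted literal lone surrogate, no backslash
try {
  encodeURIComponent(out); // throws — a lone surrogate cannot be UTF-8 encoded
} catch (e) {
  console.log(e.name); // "URIError"
}
// ES2019's well-formed JSON.stringify later adopted exactly this fix, emitting the escape "\ud800" instead.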

# Bjoern Hoehrmann (11 years ago)
  • Matt Miller (mamille2) wrote:

There does not appear to be any consensus on explicitly allowing or disallowing a Byte Order Mark (BOM). Neither RFC 4627 nor the current draft mentions BOM anywhere, and the modus operandi of the JSON Working Group has been to leave text unchanged unless there was wide support.

To be clear, that means application/json entities that start with a byte sequence that matches U+FEFF encoded in UTF-8/16/32 are malformed, because the ABNF does not allow U+FEFF at that position (and interpreting such a sequence as anything other than ordinary character data requires explicit specification). I do think an informational note saying as much could be useful.
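Concretely, "a byte sequence that matches U+FEFF" means one of these prefixes (a sketch; the function name is illustrative):

function leadingBOMEncoding(b) { // b: the first octets of the entity body
  if (b[0] === 0xEF && b[1] === 0xBB && b[2] === 0xBF) return 'UTF-8';
  if (b[0] === 0x00 && b[1] === 0x00 && b[2] === 0xFE && b[3] === 0xFF) return 'UTF-32BE';
  if (b[0] === 0xFF && b[1] === 0xFE && b[2] === 0x00 && b[3] === 0x00) return 'UTF-32LE'; // test before UTF-16LE
  if (b[0] === 0xFE && b[1] === 0xFF) return 'UTF-16BE';
  if (b[0] === 0xFF && b[1] === 0xFE) return 'UTF-16LE';
  return null; // no BOM — the only well-formed option under this reading of the ABNF
}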