[Json] BOMs

# Bjoern Hoehrmann (11 years ago)
  • Tatu Saloranta wrote:

Dominant Java implementations support UTF-16 with BOM, either directly or through Java's Reader implementations that handle BOMs. The string concatenation case seems irrelevant, since BOMs are not included in the in-memory representation anyway, as opposed to byte stream serialization.
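A quick JavaScript sketch of that point (TextDecoder here stands in for Java's BOM-aware Readers; by default it drops a leading UTF-8 BOM at the byte-to-text boundary):

const bytes = new Uint8Array([0xEF, 0xBB, 0xBF, 0x7B, 0x7D]); // UTF-8 BOM + "{}"
const text = new TextDecoder('utf-8').decode(bytes); // BOM stripped during decoding
console.log(text.length); // 2 — the in-memory string is just "{}", no U+FEFF
console.log(JSON.parse(text)); // {} — concatenating such strings can never introduce a BOM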

HTTP implementations cannot correctly determine whether an entity body is text in a single character encoding and, if so, what that encoding is; accordingly, the dominant API deals in byte[] arrays, not text Strings. Furthermore, many programming languages default to byte[] arrays for string literals. That often combines into forms of

byte[] json = sprintf('{"x": %s, "y": %s}', GET(...), GET(...));

which works fine if all three byte[] arrays are UTF-8 encoded and use no Unicode signature, which is the case 99% of the time.
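A hedged JavaScript sketch of the remaining 1% (all names illustrative): if one of the spliced-in payloads happens to carry a UTF-8 BOM, the BOM lands mid-document and the result is no longer well-formed JSON.

const enc = new TextEncoder();
const payload = new Uint8Array([0xEF, 0xBB, 0xBF, ...enc.encode('1')]); // a fetched value with a BOM
const json = new Uint8Array([...enc.encode('{"x": '), ...payload, ...enc.encode(', "y": 2}')]);
JSON.parse(new TextDecoder().decode(json)); // SyntaxError — U+FEFF is not JSON whitespace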

# Allen Wirfs-Brock (11 years ago)

On Nov 19, 2013, at 3:09 AM, Martin J. Dürst wrote:

... As for JSON, it doesn't have the problem of legacy encodings. JSON by definition is encoded in an Unicode encoding form, and it's easy to distinguish these because of the restrictions on character sequences in JSON. And this can be done without a BOM (or with a BOM).
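To make the "easy to distinguish" claim concrete, here is a rough sketch of the four-octet detection scheme RFC 4627 describes (assuming, as that RFC's grammar guaranteed, that a JSON text begins with two ASCII characters):

function detectEncoding(b) { // b: a Uint8Array with at least four octets
  if (b[0] === 0x00 && b[1] === 0x00 && b[2] === 0x00) return 'UTF-32BE'; // 00 00 00 xx
  if (b[0] === 0x00) return 'UTF-16BE'; // 00 xx 00 xx
  if (b[1] === 0x00 && b[2] === 0x00 && b[3] === 0x00) return 'UTF-32LE'; // xx 00 00 00
  if (b[1] === 0x00) return 'UTF-16LE'; // xx 00 xx 00
  return 'UTF-8'; // xx xx xx xx
}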

What's most important now is to know what receivers actually accept. We are not in a design phase, we are just updating the definition of JSON and making sure we fix problems if there are problems, but we have to use the installed base for the main guidance, not other protocols or formats.

There can be no doubt that the most widely deployed JSON parsers are those that are built into the browser JavaScript implementations. The ECMAScript 5 specification for JSON.parse that they implement says BOM is an illegal character. But what do the browsers actually implement? This:

//Firefox 25 scratchpad execution: JSON.parse('\ufeff {"abc": 0} ') /* Exception: JSON.parse: unexpected character @Scratchpad/1:1 */

//Safari 5.1.9 JS console JSON.parse('\ufeff {"abc": 0} ') message: "JSON Parse error: Unrecognized token '?'"

//Chrome 31 JS console JSON.parse('\ufeff {"abc": 0} ') SyntaxError: Unexpected token  message: "Unexpected token "

Unfortunately, I don't have access to IE right now, but the trend is clear.
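The tests above collapse into a single feature check that runs in any engine (a sketch consolidating the observations, not one of the original tests):

function parserAcceptsLeadingBOM() {
  try { JSON.parse('\ufeff {"abc": 0} '); return true; }
  catch (e) { return false; } // false in every engine tested above, as ES5 requires
}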

# Allen Wirfs-Brock (11 years ago)

On Nov 19, 2013, at 10:18 PM, Martin J. Dürst wrote:

Hello Henry, others,

On 2013/11/20 3:55, Henry S. Thompson wrote:

Allen Wirfs-Brock writes:

There can be no doubt that the most widely deployed JSON parsers are those that are built into the browser JavaScript implementations. The ECMAScript 5 specification for JSON.parse that they implement says BOM is an illegal character. But what do the browsers actually implement? This:

No, try e.g. jsonviewer.stack.hu [1] (works in Chrome, Safari, Opera, not in IE or Firefox)

In Firefox, I got some garbled characters, in particular a question mark for each of the two bytes of the BOM and one question mark for the e-acute. Because of the type of the errors, I strongly suspect it is related to what we are trying to investigate, and so I don't think this can be taken as evidence one way or another.

or feed [2] to www.jsoneditoronline.org (use Open/Url) (works in Chrome, IE, Firefox; ran out of time to test more).

The fact that some libraries or Web sites accept a BOM for JSON isn't proof that all (well, let's say the majority) accept a BOM.

Just to be clear about this. My tests directly tested the JavaScript built-in JSON parsers with respect to BOM support in three major browsers. The tests directly invoked the built-in JSON.parse functions and directly passed them source strings that were explicitly constructed to contain a BOM code point. This was done to ensure that all transport layers (and any transcodings they might perform) were bypassed and that we were actually testing the real built-in JSON.parse functions.

Neither of the sites referenced above performs a comparable test. They take user-inputted text, which is then passed through who knows what layers of browser and application preprocessing, and then they present something derived from that original user input to a JSON parser. In both cases the actual parser does not appear to be the built-in JavaScript JSON.parse function that I was testing.

jsonviewer.stack.hu uses Ext.util.JSON.decode, whose documentation describes it as a "Modified version of Douglas Crockford's json.js". In other words, it is not the built-in JSON.parse function.

www.jsoneditoronline.org uses a library called JSONLint in preference to the built-in JSON.parse function. It does not conform to the ECMAScript 5 JSON.parse specification.

So testing using either of these sites says nothing relevant about my observations concerning BOM handling by the most widely deployed JSON parsers (the ones that are built into browser JavaScript implementations).

# Allen Wirfs-Brock (11 years ago)

On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:

On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Just to be clear about this. My tests directly tested the JavaScript built-in JSON parsers with respect to BOM support in three major browsers. The tests directly invoked the built-in JSON.parse functions and directly passed them source strings that were explicitly constructed to contain a BOM code point. This was done to ensure that all transport layers (and any transcodings they might perform) were bypassed and that we were actually testing the real built-in JSON.parse functions.

It would be surprising if JSON.parse() accepted a BOM, since it doesn't take bytes as input.

ECMAScript's JSON.parse accepts an ECMAScript string value as its input. ECMAScript strings are sequences of 16-bit values. JSON.parse (and most other ECMAScript functions) interprets those values as Unicode code units. The value U+FEFF can appear at any position within a string. When defining a string as an ECMAScript literal, a sequence like \ufeff is an escape sequence that means: place the code unit value 0xFEFF into the string at this position in the sequence. Also note that the actual strings passed below to JSON.parse contain the actual code point value U+FEFF, not the escape sequence that was used to express it. To include the actual escape sequence characters in the string, it would have to be expressed as '\\ufeff'.

JSON.parse('\ufeff ["XYZ"]'); //note outer quotes delimit an ECMAScript string, the inner quotes are a JSON string.

throws a runtime SyntaxError exception because the JSON grammar does not allow U+FEFF to appear at that position.

JSON.parse('["\ufeffXYZ"]');

operates without error and returns an Array containing a single ECMAScript string of four code units. This works because the JSON grammar allows any code unit except " and \ and the ASCII control characters to appear literally in a JSON string.
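A small verification of that second case (consistent with the description above):

const result = JSON.parse('["\ufeffXYZ"]');
console.log(result[0].length); // 4 — U+FEFF, X, Y, Z
console.log(result[0].charCodeAt(0).toString(16)); // "feff" — the code unit survives as ordinary data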

However, XHR's responseType = "json" exercises browsers in a way where the input is bytes from the network. From the perspective of JSON support in XHR, lists.w3.org/Archives/Public/www-tag/2013Nov/0149.html (which didn't reach the es-discuss part of this thread previously) applies.

Right, JSON use via XHR is a different usage scenario, and that probably involves encoding and decoding steps. It has very little to do with the JSON syntax as defined in ECMA-404. It's all about how the bits that represent a string are interchanged, not the eventual semantic processing of the string (i.e., processing by JSON.parse or some other JSON parser).
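A sketch of that layering (illustrative, approximating what responseType = "json" does internally): the bytes are decoded to a string first, where a leading BOM is consumed, and only then does a JSON parser see the text.

const raw = new Uint8Array([0xEF, 0xBB, 0xBF, ...new TextEncoder().encode('{"abc": 0}')]);
const decoded = new TextDecoder('utf-8').decode(raw); // BOM consumed by the decoding step
console.log(JSON.parse(decoded).abc); // 0 — JSON.parse itself never saw a BOM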

# Bjoern Hoehrmann (11 years ago)
  • Allen Wirfs-Brock wrote:

On Nov 21, 2013, at 5:28 AM, Henri Sivonen wrote:

On Thu, Nov 21, 2013 at 7:53 AM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Just to be clear about this. My tests directly tested the JavaScript built-in JSON parsers with respect to BOM support in three major browsers. The tests directly invoked the built-in JSON.parse functions and directly passed them source strings that were explicitly constructed to contain a BOM code point.

It would be surprising if JSON.parse() accepted a BOM, since it doesn't take bytes as input.

ECMAScript's JSON.parse accepts an ECMAScript string value as its input. ECMAScript strings are sequences of 16-bit values. JSON.parse (and most other ECMAScript functions) interprets those values as Unicode code units. The value U+FEFF can appear at any position within a string. When defining a string as an ECMAScript literal, a sequence like \ufeff is an escape sequence that means: place the code unit value 0xFEFF into the string at this position in the sequence. Also note that the actual strings passed below to JSON.parse contain the actual code point value U+FEFF, not the escape sequence that was used to express it. To include the actual escape sequence characters in the string, it would have to be expressed as '\\ufeff'.

A byte order mark indicates the order of bytes in a sequence of bytes. An ECMAScript string is not a sequence of bytes and therefore it cannot have a byte order mark inside it. Your test is not for BOM support but for an egregious semantic error in the implementation of JSON.parse.

shadowregistry.org/js/misc/#t2ea25a961255bb1202da9497a1942e09

That is a similar test. It makes Firefox see UTF-8 BOMs in ECMAScript strings. Firefox is not supposed to look for UTF-8 BOMs in ECMAScript strings, because ECMAScript strings are not sequences of bytes at that level of reasoning.
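The distinction can be shown directly (a sketch of conforming behavior; the code units below are the UTF-8 BOM bytes reinterpreted as three characters, not a BOM):

JSON.parse('\u00EF\u00BB\u00BF{"abc": 0}'); // must throw SyntaxError in a conforming engine
// Skipping those three code units as if they formed a BOM is the semantic error described above.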

Is there any chance, by the way, to change JSON.stringify so it does not output strings that cannot be encoded using UTF-8? Specifically,

JSON.stringify(JSON.parse('"\uD800"'))

would need to escape the surrogate instead of emitting it literally.

# Mathias Bynens (11 years ago)

On 21 Nov 2013, at 09:41, Bjoern Hoehrmann <derhoermi at gmx.net> wrote:

Is there any chance, by the way, to change JSON.stringify so it does not output strings that cannot be encoded using UTF-8? Specifically,

JSON.stringify(JSON.parse('"\uD800"'))

would need to escape the surrogate instead of emitting it literally.

Previous discussion: esdiscuss.org/topic/code-points-vs-unicode-scalar-values#content

# Bjoern Hoehrmann (11 years ago)
  • John Cowan wrote:

Bjoern Hoehrmann scripsit:

Is there any chance, by the way, to change JSON.stringify so it does not output strings that cannot be encoded using UTF-8? Specifically,

JSON.stringify(JSON.parse('"\uD800"'))

would need to escape the surrogate instead of emitting it literally.

No, there isn't. We've been down this road repeatedly. People can and do use JSON strings to encode arbitrary sequences of unsigned 16-bit integers.

The output of JSON.stringify("\uD800") contains no backslash character; if you call utf8_encode(JSON.stringify("\uD800")), you get an exception, because UTF-8 cannot encode the lone surrogate, and utf8_encode does not know it could encode it as \uD800 without loss of information. If JSON.stringify produced an escape sequence instead, there would be no problem passing the output to utf8_encode.
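To illustrate with a built-in that does UTF-8-encode its input (encodeURIComponent standing in for the hypothetical utf8_encode above):

const out = JSON.stringify('\uD800'); // in engines of that era: a quoted literal lone surrogate, no backslash
try {
  encodeURIComponent(out); // throws — a lone surrogate cannot be UTF-8 encoded
} catch (e) {
  console.log(e.name); // "URIError"
}
// ES2019's well-formed JSON.stringify later adopted exactly this fix, emitting the escape "\ud800" instead.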

# Bjoern Hoehrmann (11 years ago)
  • Matt Miller (mamille2) wrote:

There does not appear to be any consensus on explicitly allowing or disallowing a Byte Order Mark (BOM). Neither RFC 4627 nor the current draft mentions BOM anywhere, and the modus operandi of the JSON Working Group has been to leave text unchanged unless there was wide support.

To be clear, that means application/json entities that start with a byte sequence that matches U+FEFF encoded in UTF-8/16/32 are malformed, because the ABNF does not allow U+FEFF at that position (and interpreting such a sequence as anything other than ordinary character data requires explicit specification). I do think an informational note saying as much could be useful.
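Concretely, "a byte sequence that matches U+FEFF" means one of these prefixes (a sketch; the function name is illustrative):

function leadingBOMEncoding(b) { // b: the first octets of the entity body
  if (b[0] === 0xEF && b[1] === 0xBB && b[2] === 0xBF) return 'UTF-8';
  if (b[0] === 0x00 && b[1] === 0x00 && b[2] === 0xFE && b[3] === 0xFF) return 'UTF-32BE';
  if (b[0] === 0xFF && b[1] === 0xFE && b[2] === 0x00 && b[3] === 0x00) return 'UTF-32LE'; // test before UTF-16LE
  if (b[0] === 0xFE && b[1] === 0xFF) return 'UTF-16BE';
  if (b[0] === 0xFF && b[1] === 0xFE) return 'UTF-16LE';
  return null; // no BOM — the only well-formed option under this reading of the ABNF
}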