BOMs

# Bjoern Hoehrmann (12 years ago)

Martin J. Dürst wrote:

> As for what to say about whether to accept BOMs or not, I'd really want to know what the various existing parsers do. If they accept BOMs, then we can say they should accept BOMs. If they don't accept BOMs, then we should say that they don't.

Unicode signatures are not useful for application/json resources and are likely to break existing and future code; it is not at all uncommon to construct JSON text by concatenating, say, string literals with some web service response without passing the data through a JSON parser. And as RFC 4627 makes no mention of them, there is little reason to think that implementations tolerate them.
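To make the concatenation scenario concrete, here is a minimal sketch in Python 3. The response text, the surrounding object, and the helper name `try_parse` are all hypothetical; the only point is that a U+FEFF carried along inside concatenated text ends up in the middle of the JSON, where it is not valid whitespace.

```python
import json

# Hypothetical web service response that starts with a Unicode signature (U+FEFF).
service_response = "\ufeff[1, 2, 3]"

# JSON text assembled by plain string concatenation, without parsing the response first.
document = '{"status": "ok", "items": ' + service_response + "}"


def try_parse(text):
    """Return the parsed value, or None if the parser rejects the text."""
    try:
        return json.loads(text)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None


# The signature now sits where a JSON value is expected, so parsing fails.
print(try_parse(document))
```

Whoever consumes `document` sees the failure, although the code that introduced the signature may be several steps upstream.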

Perl's JSON module gives me

malformed JSON string, neither array, object, number, string or atom, at character offset 0 (before "\x{ef}\x{bb}\x{bf}[]")

Python's json module gives me

ValueError: No JSON object could be decoded

Go's "encoding/json" package gives me

invalid character 'ï' looking for beginning of value

http://shadowregistry.org/js/misc/#t2ea25a961255bb1202da9497a1942e09 is another example of the kind of bugs that await us if we were to specify the use of Unicode signatures for JSON. Essentially:

new DOMParser().parseFromString("\uBBEF\u3CBF\u7979\u3E2F","text/xml")

Now, U+BBEF U+3CBF U+7979 U+3E2F is not an XML document, but Firefox and Internet Explorer treat it as if it were equivalent to "<yy/>".
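The byte-level coincidence behind that string can be checked with a short Python 3 sketch. It assumes the four code units are serialized as little-endian UTF-16; why the browsers end up reinterpreting those bytes is not something this sketch establishes.

```python
import codecs

# The four UTF-16 code units from the DOMParser example.
text = "\uBBEF\u3CBF\u7979\u3E2F"

# Assumption: the string is serialized as little-endian UTF-16.
raw = text.encode("utf-16-le")
print(" ".join(f"{b:02x}" for b in raw))  # ef bb bf 3c 79 79 2f 3e

# The first three bytes are exactly the UTF-8 signature.
print(raw.startswith(codecs.BOM_UTF8))  # True

# Reading the same bytes as UTF-8 with signature stripping yields markup.
reinterpreted = raw.decode("utf-8-sig")
print(reinterpreted)  # <yy/>
```

A consumer that sniffs for a signature can therefore reinterpret text that was never meant to carry one.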

-- 
Björn Höhrmann · mailto:bjoern at hoehrmann.de · http://bjoern.hoehrmann.de
Am Badedeich 7 · Telefon: +49(0)160/4415681 · http://www.bjoernsworld.de
25899 Dagebüll · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

# Bjoern Hoehrmann (12 years ago)

Henry S. Thompson wrote:

> I'm curious to know what level you're invoking the parser at. As implied by my previous post about the Python 'requests' package, it handles application/json resources by stripping any initial BOM it finds -- you can try this with
>
>   import requests
>   r = requests.get("http://www.ltg.ed.ac.uk/ov-test/b16le.json")
>   r.json()

The Perl code was

  perl -MJSON -MEncode -e "my $s = encode_utf8(chr 0xFEFF) . '[]'; JSON->new->decode($s)"

The Python code was

  import json
  json.loads(u"\uFEFF[]".encode('utf-8'))

The Go code was

  package main

  import (
      "encoding/json"
      "fmt"
  )

  func main() {
      // U+FEFF followed by an empty array; Go string literals are UTF-8 encoded.
      r := "\uFEFF[]"

      var f interface{}
      err := json.Unmarshal([]byte(r), &f)

      fmt.Println(err)
  }

In other words, each test passes a UTF-8 encoded byte string to the byte-string parsing part of the JSON implementation. RFC 4627 is the only specification for the application/json on-the-wire format, and it does not mention anything about Unicode signatures. Looking for certain byte sequences at the beginning and treating them as a Unicode signature is the same as looking for `/* ... */` and treating it as a comment.
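The difference between the two levels can be sketched in Python 3 as follows. Both function names are hypothetical, and `parse_tolerant` only illustrates the general idea of stripping a signature before parsing; it is not a description of what any particular library does internally.

```python
import codecs
import json


def parse_strict(data: bytes):
    """Decode as plain UTF-8 and parse; a leading U+FEFF reaches the parser."""
    return json.loads(data.decode("utf-8"))


def parse_tolerant(data: bytes):
    """Drop a leading UTF-8 signature, if present, before parsing."""
    if data.startswith(codecs.BOM_UTF8):
        data = data[len(codecs.BOM_UTF8):]
    return json.loads(data.decode("utf-8"))


payload = codecs.BOM_UTF8 + b"[]"

try:
    parse_strict(payload)
except ValueError as exc:
    print("strict parser rejected the input:", exc)

print(parse_tolerant(payload))  # []
```

Whether the extra step in `parse_tolerant` belongs in a conforming consumer is exactly the question under discussion; the sketch only shows how small the step is and that it happens outside the JSON grammar.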
