[Json] Fwd: two comments on JSON, ECMA-404, 1st edition / October 2013

# Allen Wirfs-Brock (12 years ago)

Peter, it would be ideal if you submitted these issues to bugs.ecmascript.org as tickets against ECMA-404. That way you would be copied on any discussion regarding them that takes place via the bug tracking system.

Regardless, I'll make some preliminary responses below.

Allen

On Dec 10, 2013, at 2:14 PM, Patel-Schneider, Peter wrote:

Thanks for forwarding my message to this group.

However, I was hoping for a discussion on what to do about these bugs (or complaints that my analysis is not correct), not how to subscribe to es-discuss. :-)

peter

On Dec 10, 2013, at 1:22 AM, Carsten Bormann <cabo at tzi.org> wrote:

FYI for those who haven’t surrendered into subscribing to es-discuss yet:

Begin forwarded message:

From: "Patel-Schneider, Peter" <Peter.Patel-Schneider at nuance.com> Subject: two comments on JSON, ECMA-404, 1st edition / October 2013 Date: 10 Dec 2013 06:00:34 +0100 To: "es-discuss at mozilla.org" <es-discuss at mozilla.org>

1/ According to ECMA-404, 1st edition / October 2013, a JSON text is a sequence of Unicode code points. The code points that can appear in a JSON text include all code points except the control characters (the text says U+0000 to U+001F but the syntax diagram just says control character, which in Unicode 6.3 also includes U+007F to U+009F). Therefore, the code point sequence <0022, DEAD, 0022> is a valid JSON text.

Strictly speaking, ECMA-404 only allows control characters (that are not also whitespace characters) to appear as part of a JSON string value, where any code point can appear; they can't occur anywhere else within a JSON text. The normative text that precedes the Syntax Diagram defines "control character" as U+0000 to U+001F, so that must be the meaning that applies to the diagram. It should be pretty clear to any reader that a local definition takes precedence over any other. There is always room to improve the clarity of requirements, so a bug asking to make this clearer is a fine request. In reality, I don't think anyone reading clause 9 of ECMA-404 would actually come to the conclusion that U+007F is excluded from a JSON string value. The text is explicit and it does not say that.
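
As a concrete illustration, here is a minimal sketch using ECMAScript's JSON.parse, whose JSON grammar likewise treats only U+0000 through U+001F as the control characters that must be escaped inside a string:

```js
// U+007F (DELETE) is not excluded from a JSON string value:
console.log(JSON.parse('"\u007F"').codePointAt(0)); // 127

// An unescaped U+0001, by contrast, is rejected:
try {
  JSON.parse('"\u0001"');
} catch (e) {
  console.log(e.name); // "SyntaxError"
}
```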

However, this code point sequence cannot be represented in UTF-8, UTF-16, or UTF-32, as it is not a sequence of Unicode scalar values, and Unicode encoding forms are only defined on Unicode scalar values.

The code point sequence <0022, DEAD, 0022> is indeed, and intentionally, a valid JSON text according to ECMA-404. JSON parsers do not exclusively operate upon sequences of Unicode scalar values. From its earliest days, JSON has been parsed using input sources (such as programming language string abstractions) that can actually encode code points that are not Unicode scalar values. (For example, JavaScript strings UTF-16-encode supplementary characters but also allow unpaired surrogates.) The intent of ECMA-404 is to provide a grammar that is usable in such implementations as well as implementations that only deal with well-formed UTF encodings.
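
A minimal sketch of what this looks like in practice, assuming a JavaScript engine's built-in JSON.parse:

```js
// JavaScript strings are sequences of UTF-16 code units, so an unpaired
// surrogate is a perfectly legal string element even though it is not a
// Unicode scalar value.
const lone = "\uDEAD";                         // a single, unpaired low surrogate
console.log(lone.length);                      // 1
console.log(lone.charCodeAt(0).toString(16));  // "dead"

// A JSON text built around it can still be parsed, because the parser
// operates on code points/code units, not on well-formed UTF encodings.
console.log(JSON.parse('"\uDEAD"') === lone);  // true
```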

Just because the grammar defines how to parse JSON texts that include code points that are not Unicode scalar values, that doesn't mean that all implementations must be able to process such JSON texts. Implementations may impose restrictions on the forms of input they can process, and, as is alluded to in the Introduction to ECMA-404, other standards can impose encoding restrictions that apply in specific circumstances. For example, 4627bis is certainly free to both reference ECMA-404 and to say that application/json text must only contain Unicode scalar values.

2/ The unescaping of strings in JSON is ill-defined as there are quoted JSON strings that are the escaped version of two different sequences of Unicode code points. For example both <D834, DD1E> and <1D11E> can be represented as "\uD834\uDD1E".

As I said above, clarity can always be improved. However, ECMA-404 clause 9 says "To escape a code point that is not in the Basic Multilingual Plane, the character is represented as a twelve character [bug: should be "code point"] sequence, encoding the UTF-16 surrogate pair." This seems pretty clear: if you see "\uD834\uDD1E" as a JSON string value, it must be interpreted as a JSON string containing the single character U+1D11E. If you wish to express a JSON string value containing the two code points U+D834 and U+DD1E, then you must directly express at least one of them without using an escape sequence.
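
A minimal sketch of this interpretation, assuming a conforming JSON.parse (the backslashes in the source literal below are escaped, so the parser receives the actual twelve-character escape sequence):

```js
const value = JSON.parse('"\\uD834\\uDD1E"');

// The surrogate-pair escape denotes the single code point U+1D11E
// (MUSICAL SYMBOL G CLEF), not the two code points U+D834 and U+DD1E.
console.log(value === "\u{1D11E}");              // true
console.log(value.codePointAt(0).toString(16));  // "1d11e"
console.log([...value].length);                  // 1 code point (2 code units)
```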

Note that this is all independent of any transcoding that might take place prior or subsequent to JSON parsing. For example, the ECMAScript 6 draft specifies that an expression like: JSON.parse(' "\uD834\uDD1E" '); would first parse the argument to this function call as an ECMAScript string literal, producing a string value in which the code point U+1D11E appears as its two-code-unit UTF-16 encoding. It then specifies that JSON.parse, prior to actual parsing, must interpret that ECMAScript string value as UTF-16 (possibly with unpaired surrogates) and then process it as if it had been transcoded into an equivalent sequence of unencoded Unicode code points. So JSON.parse must (logically) process <U+0022, U+1D11E, U+0022>, which it should recognize as a JSON string value consisting of one code point. The output of JSON.parse is the corresponding ECMAScript string value, so that single code point is returned as an ECMAScript string of length 2 that contains the two code units U+D834 and U+DD1E.
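
A minimal sketch of that walk-through, assuming an ES6-conformant JSON.parse:

```js
// The argument is an ECMAScript string literal, so its \uD834\uDD1E escapes
// are decoded before JSON.parse ever runs: JSON.parse logically processes
// <U+0022, U+1D11E, U+0022> (plus surrounding whitespace) and recognizes a
// one-code-point JSON string value.
const result = JSON.parse(' "\uD834\uDD1E" ');

console.log(result.length);                      // 2  (UTF-16 code units)
console.log(result.charCodeAt(0).toString(16));  // "d834"
console.log(result.charCodeAt(1).toString(16));  // "dd1e"
console.log(result.codePointAt(0).toString(16)); // "1d11e"
```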

# Allen Wirfs-Brock (12 years ago)

On Dec 11, 2013, at 12:10 PM, Joe Hildebrand (jhildebr) wrote:

On 12/10/13 5:29 PM, "Allen Wirfs-Brock" <allen at wirfs-brock.com> wrote:

The code point sequence <0022, DEAD, 0022> is indeed, and intentionally, a valid JSON text according to ECMA-404. JSON parsers do not exclusively operate upon sequences of Unicode scalar values. From its earliest days, JSON has been parsed using input sources (such as programming language string abstractions) that can actually encode code points that are not Unicode scalar values. (For example, JavaScript strings UTF-16-encode supplementary characters but also allow unpaired surrogates.) The intent of ECMA-404 is to provide a grammar that is usable in such implementations as well as implementations that only deal with well-formed UTF encodings.

As I've said on the other thread, <0022, DEAD, 0022> may be valid in the programming model that ECMAscript uses, but when transmitted on the wire according to 4627bis, it would need to get rendered as "\uDEAD" in order to be encoded as interoperable UTF8. CESU-8 has never been a valid encoding scheme for 4627, and we'd have to have pretty strong consensus to add it.

So, 4627bis can specify that application/json documents can only contain Unicode scalar value code points. That means that an application/json JSON text is a subset of all possible JSON texts defined by ECMA-404. What's wrong with that?

# Allen Wirfs-Brock (12 years ago)

On Dec 11, 2013, at 2:07 PM, Joe Hildebrand (jhildebr) wrote:

On 12/11/13 1:54 PM, "Allen Wirfs-Brock" <allen at wirfs-brock.com> wrote:

So, 4627bis can specify that application/json documents can only contain Unicode scalar value code points.

Well, it says: "JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32". It's not legal to encode any code point between U+D800 and U+DFFF in UTF-8 or UTF-32, and it's invalid UTF-16 to have them paired incorrectly. So this restriction is implicit.
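
A minimal sketch of that restriction, assuming the WHATWG TextEncoder (available in browsers and Node.js), which converts to Unicode scalar values before producing UTF-8:

```js
const enc = new TextEncoder();

// A lone surrogate cannot survive conversion to UTF-8; it is replaced with
// U+FFFD during the conversion to Unicode scalar values.
console.log(Array.from(enc.encode("\uDEAD"), b => b.toString(16)));
// ["ef", "bf", "bd"]  -- the UTF-8 bytes of U+FFFD, not of U+DEAD

// A correctly paired surrogate, by contrast, encodes as the four-byte UTF-8
// form of the supplementary code point U+1D11E.
console.log(Array.from(enc.encode("\uD834\uDD1E"), b => b.toString(16)));
// ["f0", "9d", "84", "9e"]
```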

That means that an application/json JSON text is a subset of all possible JSON texts defined by ECMA-404. What's wrong with that?

Nothing, perhaps. I do think it would be nice to give a little more guidance about serializing and parsing ECMA-404 to and from the transmission form.

Separation of concerns. application/json is a transmission form but not the only possible one so it seems appropriate for it to specify its specific restrictions. Also, there are local uses of JSON that don't require serialization to a transmission form.

Note that this is not very different from the distinction between RFC 4329 (Scripting Media Types) and ECMA-262 (although 4329 is quite out of date WRT the corresponding language specifications).