JSON.canonicalize()
See wiki.laptop.org/go/Canonical_JSON -- you should probably at least mention unicode normalization of strings. You probably should also specify a validator: it doesn't matter if you emit canonical JSON if you can tweak the hash of the value by feeding non-canonical JSON as an input.
On 2018-03-16 08:52, C. Scott Ananian wrote:
See wiki.laptop.org/go/Canonical_JSON -- you should probably at least mention unicode normalization of strings.
Yes, I could add that unicode normalization of strings is out of scope for this specification.
You probably should also specify a validator: it doesn't matter if you emit canonical JSON if you can tweak the hash of the value by feeding non-canonical JSON as an input.
Pardon me, but I don't understand what you are writing here.
Hash functions' only "raison d'être" is providing collision-safe checksums.
thanx, Anders
Canonical JSON is often used to imply a security property: two JSON blobs with identical contents are expected to have identical canonical JSON forms (and thus identical hashed values).
However, unicode normalization allows multiple representations of "the same" string, which defeats this security property. Depending on your implementation language and use, a string with precomposed accents could compare equal to a string with separated accents, even though the canonical JSON or hash differed. In an extreme case (with a weak hash function, say MD5), this can be used to break security by re-encoding all strings in multiple variants until a collision is found. This is just a slight variant on the fact that JSON allows multiple ways to encode a character using escape sequences. You've already taken the trouble to disambiguate this case; security-conscious applications should take care to perform unicode normalization as well, for the same reason.
Similarly, if you don't offer a verifier to ensure that the input is in "canonical JSON" format, then an attacker can try to create collisions by violating the rules of canonical JSON format, whether by using different escape sequences, adding whitespace, etc. This can be used to make JSON which is "the same" appear "different", violating the intent of the canonicalization. Any security application of canonical JSON will require a strict mode for JSON.parse() as well as a strict mode for JSON.stringify().
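A minimal sketch of the normalization hazard (illustrative only, not code from the proposal): two visually identical strings in different Unicode normalization forms already stringify, and therefore hash, differently.

    const nfc = '\u00e9';   // "é" as one precomposed code point
    const nfd = 'e\u0301';  // "é" as "e" plus a combining acute accent
    nfc === nfd;                                    // false
    nfc.normalize('NFC') === nfd.normalize('NFC');  // true
    JSON.stringify(nfc) === JSON.stringify(nfd);    // false, so any hash over the output differs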
On Fri, Mar 16, 2018 at 11:38 AM, C. Scott Ananian <ecmascript at cscott.net>
wrote:
Canonical JSON is often used to imply a security property: two JSON blobs with identical contents are expected to have identical canonical JSON forms (and thus identical hashed values).
What does "identical contents" mean in the context of numbers? JSON intentionally avoids specifying any precision for numbers.
JSON.stringify(1/3) === '0.3333333333333333'
What would happen with JSON from systems that allow higher precision? I.e., what would (JSON.canonicalize(JSON.stringify(1/3) + '3')) produce?
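Presumably (an assumption about a Number-based implementation, not spec text) a canonicalizer that parses numbers into IEEE 754 doubles would silently drop the extra digit:

    const wire = JSON.stringify(1/3) + '3';  // '0.33333333333333333' (17 threes)
    JSON.stringify(JSON.parse(wire));        // '0.3333333333333333'  (16 threes, last digit lost)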
However, unicode normalization allows multiple representations of "the same" string, which defeats this security property. Depending on your implementation language
We shouldn't normalize unicode in strings that contain packed binary data. JSON strings are strings of UTF-16 code-units, not Unicode scalar values and any system that assumes the latter will break often.
and use, a string with precomposed accents could compare equal to a string with separated accents, even though the canonical JSON or hash differed. In an extreme case (with a weak hash function, say MD5), this can be used to break security by re-encoding all strings in multiple variants until a collision is found. This is just a slight variant on the fact that JSON allows multiple ways to encode a character using escape sequences. You've already taken the trouble to disambiguate this case; security-conscious applications should take care to perform unicode normalization as well, for the same reason.
Similarly, if you don't offer a verifier to ensure that the input is in "canonical JSON" format, then an attacker can try to create collisions by violating the rules of canonical JSON format, whether by using different escape sequences, adding whitespace, etc. This can be used to make JSON which is "the same" appear "different", violating the intent of the canonicalization. Any security application of canonical JSON will require a strict mode for JSON.parse() as well as a strict mode for JSON.stringify().
Given the dodginess of "identical" w.r.t. non-integral numbers, shouldn't endpoints be re-canonicalizing before hashing anyway? Why would one want to ship the canonical form over the wire if it loses precision?
--scott
On Fri, Mar 16, 2018 at 4:48 AM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 08:52, C. Scott Ananian wrote:
See wiki.laptop.org/go/Canonical_JSON -- you should probably at least mention unicode normalization of strings.
Yes, I could add that unicode normalization of strings is out of scope for this specification.
You probably should also specify a validator: it doesn't matter if you
emit canonical JSON if you can tweak the hash of the value by feeding non-canonical JSON as an input.
Pardon me, but I don't understand what you are writing here.
Hash functions' only "raison d'être" is providing collision-safe checksums.
thanx, Anders
--scott
On Fri, Mar 16, 2018 at 3:16 AM, Anders Rundgren < anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote:
Dear List, Here is a proposal that I would be very happy getting feedback on
since it builds on ES but is not (at all) limited to ES.
The request is for a complement to the ES "JSON" object called
canonicalize() which would have identical parameters to the existing stringify() method.
Why should canonicalize take a replacer? Hasn't replacement already happened?
On Fri, Mar 16, 2018 at 12:23 PM, Mike Samuel <mikesamuel at gmail.com> wrote:
On Fri, Mar 16, 2018 at 11:38 AM, C. Scott Ananian <ecmascript at cscott.net> wrote:
Canonical JSON is often used to imply a security property: two JSON blobs with identical contents are expected to have identical canonical JSON forms (and thus identical hashed values).
What does "identical contents" mean in the context of numbers? JSON intentionally avoids specifying any precision for numbers.
JSON.stringify(1/3) === '0.3333333333333333'
What would happen with JSON from systems that allow higher precision? I.e., what would (JSON.canonicalize(JSON.stringify(1/3) + '3')) produce?
However, unicode normalization allows multiple representations of "the
same" string, which defeats this security property. Depending on your implementation language
We shouldn't normalize unicode in strings that contain packed binary data. JSON strings are strings of UTF-16 code-units, not Unicode scalar values and any system that assumes the latter will break often.
Both of these points are made on the URL I originally cited: wiki.laptop.org/go/Canonical_JSON
On 2018-03-16 16:38, C. Scott Ananian wrote:
Canonical JSON is often used to imply a security property: two JSON blobs with identical contents are expected to have identical canonical JSON forms (and thus identical hashed values).
Right.
However, unicode normalization allows multiple representations of "the same" string, which defeats this security property.
This is an aspect that I believe belongs to the "application" level. This specification is only about "on the wire" format.
Rationale: if this was a part of the SPECIFICATION it would either be ignored (=useless) or be a showstopper (=dead) due to complexity.
If applications using the received data want to address this issue they can for example call msdn.microsoft.com/en-us/library/windows/desktop/dd318671(v=vs.85).aspx and reject if they want.
Or always normalize: developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String/normalize
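A rough sketch of such application-level normalization (an illustration, not part of the proposed specification): recursively NFC-normalize all string keys and values before serializing or hashing.

    function normalizeStrings(value) {
        if (typeof value === 'string') {
            return value.normalize('NFC');
        }
        if (Array.isArray(value)) {
            return value.map(normalizeStrings);
        }
        if (value !== null && typeof value === 'object') {
            const result = {};
            for (const key of Object.keys(value)) {
                result[key.normalize('NFC')] = normalizeStrings(value[key]);
            }
            return result;
        }
        return value;
    }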
Depending on your implementation language and use, a string with precomposed accents could compare equal to a string with separated accents, even though the canonical JSON or hash differed.
I don't want to go there for the reasons mentioned.
In an extreme case (with a weak hash function, say MD5), this can be used to break security by re-encoding all strings in multiple variants until a collision is found. This is just a slight variant on the fact that JSON allows multiple ways to encode a character using escape sequences. You've already taken the trouble to disambiguate this case; security-conscious applications should take care to perform unicode normalization as well, for the same reason.
If you are able to break the hash function all bets are off anyway because then you can presumably change any part of the object and it would still appear authentic.
Escape normalization: If you don't do this normalization, signatures would typically break and that's not really a "security" (=attacker) problem; it is rather a "nuisance" of the same caliber as a server not responding.
Similarly, if you don't offer a verifier to ensure that the input is in "canonical JSON" format, then an attacker can try to create collisions by violating the rules of canonical JSON format, whether by using different escape sequences, adding whitespace, etc. This can be used to make JSON which is "the same" appear "different", violating the intent of the canonicalization.
Again, if the hash function is broken, there's nothing to do except maybe cry :-(
This a Unicode problem, not a cryptographic problem.
Any security application of canonical JSON will require a strict mode for JSON.parse() as well as a strict mode for JSON.stringify().
Indeed, you ALWAYS must verify that indata conforms to the agreed conventions.
Anyway, feel free pushing a different JSON canonicalization scheme!
Here is another: gibson042.github.io/canonicaljson-spec It claims that you should support "lone surrogates" (invalid Unicode) which for example JDK doesn't. I don't go there either.
Anders
On Fri, Mar 16, 2018 at 12:44 PM, C. Scott Ananian <ecmascript at cscott.net>
wrote:
On Fri, Mar 16, 2018 at 12:23 PM, Mike Samuel <mikesamuel at gmail.com> wrote:
On Fri, Mar 16, 2018 at 11:38 AM, C. Scott Ananian <ecmascript at cscott.net
wrote:
Canonical JSON is often used to imply a security property: two JSON blobs with identical contents are expected to have identical canonical JSON forms (and thus identical hashed values).
What does "identical contents" mean in the context of numbers? JSON intentionally avoids specifying any precision for numbers.
JSON.stringify(1/3) === '0.3333333333333333'
What would happen with JSON from systems that allow higher precision? I.e., what would (JSON.canonicalize(JSON.stringify(1/3) + '3')) produce?
However, unicode normalization allows multiple representations of "the
same" string, which defeats this security property. Depending on your implementation language
We shouldn't normalize unicode in strings that contain packed binary data. JSON strings are strings of UTF-16 code-units, not Unicode scalar values and any system that assumes the latter will break often.
Both of these points are made on the URL I originally cited: wiki.laptop.org/go/Canonical_JSON
Thanks, I see """ Floating point numbers are not allowed in canonical JSON. Neither are leading zeros or "minus 0" for integers. """ which answers my question.
I also see """ A previous version of this specification required strings to be valid unicode, and relied on JSON's \u escape. This was abandoned as it doesn't allow representing arbitrary binary data in a string, and it doesn't preserve the identity of non-canonical unicode strings. """ which addresses my question.
I also see """ It is suggested that unicode strings be represented as the UTF-8 encoding of unicode Normalization Form C www.unicode.org/reports/tr15 (UAX #15). However, arbitrary content may be represented as a string: it is not guaranteed that string contents can be meaningfully parsed as UTF-8. """ which seems to be mixing concerns about the wire format used to encode JSON as octets and NFC which would apply to the text of the JSON string.
If that confusion is cleaned up, then it seems a fine subset of JSON to ship over the wire with a JSON content-type.
It is entirely unsuitable to embedding in HTML or XML though. IIUC, with an implementation based on this
JSON.canonicalize(JSON.stringify("</script>")) === "</script>"
&&
JSON.canonicalize(JSON.stringify("]]>")) === "]]>"
The output of JSON.canonicalize would also not be in the subset of JSON that is also a subset of JavaScript's PrimaryExpression.
JSON.canonicalize(JSON.stringify("\u2028\u2029")) === "\u2028\u2029"
It also is not suitable for use internally within systems that internally use cstrings.
JSON.canonicalize(JSON.stringify("\u0000")) === "\u0000"
On 2018-03-16 18:04, Mike Samuel wrote:
It is entirely unsuitable to embedding in HTML or XML though. IIUC, with an implementation based on this
JSON.canonicalize(JSON.stringify("</script>")) ===
"</script>"
&& JSON.canonicalize(JSON.stringify("]]>")) ==="]]>"
I don't know what you are trying to prove here :-)
The output of JSON.canonicalize would also not be in the subset of JSON that is also a subset of JavaScript's PrimaryExpression.
JSON.canonicalize(JSON.stringify("\u2028\u2029")) ===
"\u2028\u2029"
It also is not suitable for use internally within systems that internally use cstrings.
JSON.canonicalize(JSON.stringify("\u0000")) ===
"\u0000"
JSON.canonicalize() would be [almost] identical to JSON.stringify()
JSON.canonicalize(JSON.parse('"\u2028\u2029"')) === '"\u2028\u2029"' // Returns true
"Emulator":
var canonicalize = function(object) {
var buffer = '';
serialize(object);
return buffer;
function serialize(object) {
if (object !== null && typeof object === 'object') {
if (Array.isArray(object)) {
buffer += '[';
let next = false;
object.forEach((element) => {
if (next) {
buffer += ',';
}
next = true;
serialize(element);
});
buffer += ']';
} else {
buffer += '{';
let next = false;
Object.keys(object).sort().forEach((property) => {
if (next) {
buffer += ',';
}
next = true;
buffer += JSON.stringify(property);
buffer += ':';
serialize(object[property]);
});
buffer += '}';
}
} else {
buffer += JSON.stringify(object);
}
}
};
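For example (assuming the emulator above), property insertion order no longer affects the output:

    canonicalize({ b: 2, a: 1 });   // '{"a":1,"b":2}'
    canonicalize({ a: 1, b: 2 });   // '{"a":1,"b":2}'
    canonicalize([10, null, "x"]);  // '[10,null,"x"]'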
On Fri, Mar 16, 2018 at 1:30 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 18:04, Mike Samuel wrote:
It is entirely unsuitable to embedding in HTML or XML though.
IIUC, with an implementation based on this
JSON.canonicalize(JSON.stringify("</script>")) ===
"</script>"
&& JSON.canonicalize(JSON.stringify("]]>")) ==="]]>"
I don't know what you are trying to prove here :-)
Only that canonical JSON is useful in a very narrow context. It cannot be embedded in an HTML script tag. It cannot be embedded in an XML or HTML foreign content context without extra care. If it contains a string literal that embeds a NUL it cannot be embedded in XML period even if extra care is taken.
The output of JSON.canonicalize would also not be in the subset of JSON
that is also a subset of JavaScript's PrimaryExpression.
JSON.canonicalize(JSON.stringify("\u2028\u2029")) ===
"\u2028\u2029"
It also is not suitable for use internally within systems that internally use cstrings.
JSON.canonicalize(JSON.stringify("\u0000")) ===
"\u0000"
JSON.canonicalize() would be [almost] identical to JSON.stringify()
You're correct. Many JSON producers have a web-safe version, but the JavaScript builtin does not. My point is that JSON.canonicalize undoes those web-safety tweaks.
JSON.canonicalize(JSON.parse('"\u2028\u2029"')) === '"\u2028\u2029"' // Returns true
"Emulator":
var canonicalize = function(object) {
var buffer = ''; serialize(object);
I thought canonicalize took in a string of JSON and produced the same. Am I wrong? "Canonicalize" to my mind means a function that returns the canonical member of an equivalence class given any member from that same equivalence class, so is always 'a -> 'a.
return buffer;
function serialize(object) {
    if (object !== null && typeof object === 'object') {
JSON.stringify(new Date(0)) === '"1970-01-01T00:00:00.000Z"' because Date.prototype.toJSON exists.
If you operate as a JSON_string -> JSON_string function then you
can avoid this complexity.
if (Array.isArray(object)) {
    buffer += '[';
    let next = false;
    object.forEach((element) => {
        if (next) {
            buffer += ',';
        }
        next = true;
        serialize(element);
    });
    buffer += ']';
} else {
    buffer += '{';
    let next = false;
    Object.keys(object).sort().forEach((property) => {
        if (next) {
            buffer += ',';
        }
        next = true;
buffer += JSON.stringify(property);
I think you need a symbol check here. JSON.stringify(Symbol.for('foo')) === undefined
        buffer += ':';
        serialize(object[property]);
    });
    buffer += '}';
}
} else {
    buffer += JSON.stringify(object);
This fails to distinguish non-integral numbers from integral ones, and produces non-standard output when object === undefined. Again, not a problem if the input is required to be valid JSON.
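Concretely, with the emulator quoted above (hypothetical inputs, shown only to illustrate these edge cases):

    canonicalize(undefined);              // 'undefined' — not valid JSON
    canonicalize({ n: Symbol.for('x') }); // '{"n":undefined}' — invalid JSON, since JSON.stringify(Symbol.for('x')) is undefined
    canonicalize(new Date(0));            // '{}' — toJSON is ignored, unlike JSON.stringify

None of these cases can arise if the input is itself a string of valid JSON.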
On Fri, Mar 16, 2018 at 1:30 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 18:04, Mike Samuel wrote:
It is entirely unsuitable to embedding in HTML or XML though.
IIUC, with an implementation based on this
JSON.canonicalize(JSON.stringify("</script>")) ===
"</script>"
&& JSON.canonicalize(JSON.stringify("]]>")) ==="]]>"
I don't know what you are trying to prove here :-)
He wants to ship it as application/json and have it be safe if the browser happens to ignore the mime type and interpret it as HTML or XML, I believe. Mandatory encoding of < as an escape would make the output "safe" for such use. I'm not convinced this is in-scope, but it's an interesting case to consider when determining which characters ought to be escaped.
(I think he's writing JSON.canonicalize(JSON.stringify(...))
where he
means to write JSON.canonicalize(...)
, at least if I understand the
proposed API correctly.)
The output of JSON.canonicalize would also not be in the subset of JSON
that is also a subset of JavaScript's PrimaryExpression.
JSON.canonicalize(JSON.stringify("\u2028\u2029")) ===
"\u2028\u2029"
I'm not sure about this, but I think he's saying you can't just eval
the
canonical JSON output, because newlines appear literally, not escaped. I
believe I actually ran into some compatibility issues with this back when I
was playing around with canonical JSON as well; certain JSON parsers
wouldn't accept "JSON" with embedded literal newlines.
OTOH, I don't think anyone should be encouraged to eval JSON! As noted previously, there should be a strict parse function to go along with the strict serialize function.
It also is not suitable for use internally within systems that internally
use cstrings.
JSON.canonicalize(JSON.stringify("\u0000")) ===
"\u0000"
A literal NUL character is unrepresentable in a naive C implementation. You need to use pascal-style strings in your low-level implementation. This is an important consideration for non-JavaScript use. In my page I noted, "Because only two byte values are escaped, be aware that JSON-encoded data may contain embedded control characters and nulls." A similar warning is at least called for here.
On Fri, Mar 16, 2018 at 12:23 PM, Mike Samuel <mikesamuel at gmail.com> wrote: I also see """ It is suggested that unicode strings be represented as the UTF-8 encoding of unicode Normalization Form C www.unicode.org/reports/tr15 (UAX #15). However, arbitrary content may be represented as a string: it is not guaranteed that string contents can be meaningfully parsed as UTF-8. """ which seems to be mixing concerns about the wire format used to encode JSON as octets and NFC which would apply to the text of the JSON string.
Yes, it is rather unfortunate that we have only one datatype here and a bit of an impedance mismatch. JSON serialization is usually considered literally as a byte-stream, but JavaScript wants to parse those bytes as some encoding (usually UTF-8) of a UTF-16 string.
My suggestion is just to make this very plain in a SHOULD comment to the potentially implementor. If the underlying data is unicode string data, it SHOULD be represented as the UTF-8 encoding of unicode Normalization Form C (UAX #15). However, the consumer should be aware that the data may be binary bits and not interpretable as a valid UTF-8 string.
Re:
Escape normalization: If you don't do this normalization, signatures would typically break and that's not really a "security" (=attacker) problem; it is rather a "nuisance" of the same caliber as a server not responding.
Consider signatures for malware detection. If an attacker can trivially modify their (in this example) JSON-encoded payload so that it is still "canonical" and still passes whatever input verifier exists (so much easier if there is not strict parsing!), then they can bypass your signature-based detection system. That's a security problem.
Both sides must be true: equal hashes should mean equal content (to high probability) and unequal hashes should mean different content. Otherwise there is a security problem.
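As a sketch of that hashing use case (Node.js crypto, with "canonicalize" standing in for whatever canonical serializer is eventually agreed on):

    const crypto = require('crypto');
    function jsonHash(value) {
        // hash the canonical serialization, so equal content gives equal digests
        return crypto.createHash('sha256').update(canonicalize(value), 'utf8').digest('hex');
    }
    jsonHash({ a: 1, b: 2 }) === jsonHash({ b: 2, a: 1 });  // true — key order no longer matters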
And just to be clear: I'm all for standardizing a canonical JSON form. In addition to my 11-year-old attempt, there have been countless others, and still no standard. I just want us to learn from the previous attempts and try to make something at least as good as everything which has come before, especially in terms of the various non-obvious considerations which individual implementors have discovered the hard way over the years.
On 2018-03-16 18:46, Mike Samuel wrote:
On Fri, Mar 16, 2018 at 1:30 PM, Anders Rundgren <anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote:
On 2018-03-16 18:04, Mike Samuel wrote:
It is entirely unsuitable to embedding in HTML or XML though. IIUC, with an implementation based on this
JSON.canonicalize(JSON.stringify("</script>")) === `"</script>"`
&& JSON.canonicalize(JSON.stringify("]]>")) === `"]]>"`
I don't know what you are trying to prove here :-)
Only that canonical JSON is useful in a very narrow context. It cannot be embedded in an HTML script tag. It cannot be embedded in an XML or HTML foreign content context without extra care. If it contains a string literal that embeds a NUL it cannot be embedded in XML period even if extra care is taken.
If we stick to browsers, JSON.canonicalize() would presumably be used with WebCrypto, WebSocket etc.
Node.js is probably a more important target.
Related stuff: tools.ietf.org/id/draft-erdtman-jose-cleartext-jws-00.html JSON signatures without canonicalization.
The output of JSON.canonicalize would also not be in the subset of JSON that is also a subset of JavaScript's PrimaryExpression.
JSON.canonicalize(JSON.stringify("\u2028\u2029")) === `"\u2028\u2029"`
It also is not suitable for use internally within systems that internally use cstrings.
JSON.canonicalize(JSON.stringify("\u0000")) === `"\u0000"`
JSON.canonicalize() would be [almost] identical to JSON.stringify()
You're correct. Many JSON producers have a web-safe version, but the JavaScript builtin does not. My point is that JSON.canonicalize undoes those web-safety tweaks.
JSON.canonicalize(JSON.parse('"\u2028\u2029"')) === '"\u2028\u2029"' // Returns true
"Emulator":
var canonicalize = function(object) {
    var buffer = '';
    serialize(object);
I thought canonicalize took in a string of JSON and produced the same. Am I wrong?
Yes, it is just a variant of JSON.stringify().
"Canonicalize" to my mind means a function that returns the canonical member of an equivalence class given any member from that same equivalence class, so is always 'a -> 'a.
This is rather a canonicalizing serializer.
return buffer;
function serialize(object) {
    if (object !== null && typeof object === 'object') {
JSON.stringify(new Date(0)) === '"1970-01-01T00:00:00.000Z"' because Date.prototype.toJSON exists.
If you operate as a JSON_string -> JSON_string function then you can avoid this complexity.
if (Array.isArray(object)) {
    buffer += '[';
    let next = false;
    object.forEach((element) => {
        if (next) {
            buffer += ',';
        }
        next = true;
        serialize(element);
    });
    buffer += ']';
} else {
    buffer += '{';
    let next = false;
    Object.keys(object).sort().forEach((property) => {
        if (next) {
            buffer += ',';
        }
        next = true;
        buffer += JSON.stringify(property);
I think you need a symbol check here. JSON.stringify(Symbol.for('foo')) === undefined
        buffer += ':';
        serialize(object[property]);
    });
    buffer += '}';
}
} else {
    buffer += JSON.stringify(object);
This fails to distinguish non-integral numbers from integral ones, and produces non-standard output when object === undefined. Again, not a problem if the input is required to be valid JSON.
Well, a proper implementation would build on JSON.stringify() with property sorting as the only enhancement.
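A sketch of that approach (an assumed implementation, not text from the draft): rebuild the value with sorted keys and let JSON.stringify do everything else.

    function sortKeysDeep(value) {
        if (Array.isArray(value)) {
            return value.map(sortKeysDeep);
        }
        if (value !== null && typeof value === 'object') {
            const sorted = {};
            for (const key of Object.keys(value).sort()) {
                sorted[key] = sortKeysDeep(value[key]);
            }
            return sorted;
        }
        return value;
    }
    const canonicalStringify = (value) => JSON.stringify(sortKeysDeep(value));

One caveat: integer-like keys are enumerated in numeric order by ES regardless of insertion order, so for such keys the emitted order is numeric rather than lexicographic. It is still deterministic, but differs from the emulator above.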
On Fri, Mar 16, 2018 at 1:54 PM, C. Scott Ananian <ecmascript at cscott.net>
wrote:
And just to be clear: I'm all for standardizing a canonical JSON form. In addition to my 11-year-old attempt, there have been countless others, and still no standard. I just want us to learn from the previous attempts and try to make something at least as good as everything which has come before, especially in terms of the various non-obvious considerations which individual implementors have discovered the hard way over the years.
I think the hashing use case is an important one. At the risk of bikeshedding, "canonical" seems to overstate the usefulness. Many assume that the canonical form of something is usually the one you use in preference to any other equivalent.
If the integer-only restriction is relaxed (see below), then
- The proposed canonical form seems useful as an input to strong hash functions.
- It seems usable as a complete message body, but not preferable due to potential loss of precision.
- It seems usable but not preferable as a long-term storage format.
- It seems a source of additional risk when used in conjunction with other common web languages.
If that is correct, would people be averse to marketing this as "hashable JSON" instead of "canonical JSON?"
Numbers
There seem to be 3 main forks in the design space w.r.t. numbers. I'm sure cscott has thought of more, but I list them to make it clear why I think canonical JSON is not very useful as a wire/storage format.
- Integers only
  PROS: avoids floating point equality issues that have bedeviled many systems
  CONS: can support only a small portion of the JSON value space
  CONS: small loss of precision risk with integers encoded from Decimal values. For example, won't roundtrip Java BigDecimals.
- Any numbers with minimal changes: dropping + signs, normalizing zeros, using a fixed threshold for scientific notation.
  PROS: supports whole JSON value-space
  CONS: less useful for hashing
  CONS: risks loss of precision when decoders decide based on presence of decimal point whether to represent as double or int.
- Preserve textual representation.
  PROS: avoids loss of precision
  PROS: can support whole JSON value-space
  CONS: not very useful for hashing
It seems that there is a tradeoff between usefulness for hashing and the ability to support the whole JSON value-space.
Recommending this as a wire / storage format further complicates that tradeoff.
Regardless of which fork is chosen, there are some risks with the current design. For example, 1e100000 takes up some space in memory. This might allow timing attacks. Imagine an attacker can get Alice to embed 1e100000 or another number in her JSON. Alice sends that message to Bob over an encrypted channel. Bob converts the JSON to canonical JSON. If Bob refuses some JSON payloads over a threshold size or the time to process is noticeably different for 1e100000 vs 1e1 then the attacker can tell, via traffic analysis alone, when Alice communicates with Bob. We should avoid that in-memory blowup if possible.
On 2018-03-16 19:30, Mike Samuel wrote:
- Any numbers with minimal changes: dropping + signs, normalizing zeros, using a fixed threshold for scientific notation. PROS: supports whole JSON value-space CONS: less useful for hashing CONS: risks loss of precision when decoders decide based on presence of decimal point whether to represent as double or int.
Have you actually looked into the specification? cyberphone.github.io/doc/security/draft-rundgren-json-canonicalization-scheme.html#rfc.section.3.2.2 ES6 has all what it takes.
Anders
On Fri, Mar 16, 2018 at 2:43 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 19:30, Mike Samuel wrote:
- Any numbers with minimal changes: dropping + signs, normalizing zeros, using a fixed threshold for scientific notation. PROS: supports whole JSON value-space CONS: less useful for hashing CONS: risks loss of precision when decoders decide based on presence of decimal point whether to represent as double or int.
Have you actually looked into the specification? cyberphone.github.io/doc/security/draft-rundgren-json-canonicalization-scheme.html#rfc.section.3.2.2 ES6 has all what it takes.
Yes, but other notions of canonical equivalence have been mentioned here so reasons to prefer one to another seem in scope.
On 2018-03-16 19:51, Mike Samuel wrote:
On Fri, Mar 16, 2018 at 2:43 PM, Anders Rundgren <anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote:
On 2018-03-16 19:30, Mike Samuel wrote:
2. Any numbers with minimal changes: dropping + signs, normalizing zeros, using a fixed threshold for scientific notation.
   PROS: supports whole JSON value-space
   CONS: less useful for hashing
   CONS: risks loss of precision when decoders decide based on presence of decimal point whether to represent as double or int.
Have you actually looked into the specification? https://cyberphone.github.io/doc/security/draft-rundgren-json-canonicalization-scheme.html#rfc.section.3.2.2 ES6 has all what it takes.
Yes, but other notions of canonical equivalence have been mentioned here so reasons to prefer one to another seem in scope.
Availability beats perfection anytime. This is the VHS (if anybody remembers that old story) of canonicalization and I don't feel too bad about that :-)
Anders
On Fri, Mar 16, 2018 at 3:03 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 19:51, Mike Samuel wrote:
On Fri, Mar 16, 2018 at 2:43 PM, Anders Rundgren < anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote:
On 2018-03-16 19:30, Mike Samuel wrote: 2. Any numbers with minimal changes: dropping + signs,
normalizing zeros, using a fixed threshold for scientific notation. PROS: supports whole JSON value-space CONS: less useful for hashing CONS: risks loss of precision when decoders decide based on presence of decimal point whether to represent as double or int.
Have you actually looked into the specification? https://cyberphone.github.io/doc/security/draft-rundgren-json-canonicalization-scheme.html#rfc.section.3.2.2 ES6 has all what it takes.
Yes, but other notions of canonical equivalence have been mentioned here so reasons to prefer one to another seem in scope.
Availability beats perfection anytime. This is the VHS (if anybody remembers that old story) of canonicalization and I don't feel too bad about that :-)
Perhaps. Any thoughts on my question about the merits of "Hashable" vs "Canonical"?
I think the horse is out of the barn re hashable-vs-canonical. It has (independently) been invented and named canonical JSON many many times, starting 11 years ago.
gibson042.github.io/canonicaljson-spec, www.npmjs.com/package/another-json, www.npmjs.com/package/canonical-json, www.npmjs.com/package/keyify, www.npmjs.com/package/canonical-tent-json, www.npmjs.com/package/content-addressable-json, godoc.org/github.com/gibson042/canonicaljson-go, tools.ietf.org/html/draft-staykov-hu-json-canonical-form-00, keybase.io/docs/api/1.0/canonical_packings#json, tools.ietf.org/html/rfc7638#section-3.3, wiki.laptop.org/go/Canonical_JSON, mirkokiefer/canonical-json, davidchambers/CANON
"Content Addressable JSON" is a variant of your "hashable JSON" proposal, though. But the "canonicals" seem to vastly outnumber the "hashables".
My question for Anders is: do you actually plan to incorporate any feedback into changes to your proposal? Or were you really just looking for us to validate your work, not actually contribute to it?
On 2018-03-16 20:09, Mike Samuel wrote:
Availability beats perfection anytime. This is the VHS (if anybody remembers that old story) of canonicalization and I don't feel too bad about that :-)
Perhaps. Any thoughts on my question about the merits of "Hashable" vs "Canonical"?
No, there was so much noise here that I may need a more condensed description, if possible.
Anders
Though ECMAScript JSON.stringify may suffice for certain Javascript-centric use cases or otherwise restricted subsets thereof as addressed by JOSE, it is not suitable for producing canonical/hashable/etc. JSON, which requires a fully general solution such as [1]. Both its number serialization [2] and string serialization [3] specify aspects that harm compatibility (the former having arbitrary branches dependent upon the value of numbers, the latter being capable of producing invalid UTF-8 octet sequences that represent unpaired surrogate code points—unacceptable for exchange outside of a closed ecosystem [4]). JSON is a general language-agnostic interchange format, and ECMAScript JSON.stringify is not a JSON canonicalization solution.
[1]: gibson042.github.io/canonicaljson-spec
[2]: ecma-international.org/ecma-262/7.0/#sec-tostring-applied-to-the-number-type
[3]: ecma-international.org/ecma-262/7.0/#sec-quotejsonstring
[4]: tools.ietf.org/html/rfc8259#section-8.1
On Fri, Mar 16, 2018 at 3:23 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 20:09, Mike Samuel wrote:
Availability beats perfection anytime. This is the VHS (if anybody remembers that old story) of canonicalization and I don't feel too bad about that :-)
Perhaps. Any thoughts on my question about the merits of "Hashable" vs "Canonical"?
No, there was so much noise here that I may need a more condensed description, if possible.
In the email to which you responded "Have you actually looked ..." look for "If that is correct, Would people be averse to marketing this as "hashable JSON" instead of "canonical JSON?""
On 2018-03-16 20:24, Richard Gibson wrote:
Though ECMAScript JSON.stringify may suffice for certain Javascript-centric use cases or otherwise restricted subsets thereof as addressed by JOSE, it is not suitable for producing canonical/hashable/etc. JSON, which requires a fully general solution such as [1]. Both its number serialization [2] and string serialization [3] specify aspects that harm compatibility (the former having arbitrary branches dependent upon the value of numbers, the latter being capable of producing invalid UTF-8 octet sequences that represent unpaired surrogate code points—unacceptable for exchange outside of a closed ecosystem [4]). JSON is a general language-agnostic interchange format, and ECMAScript JSON.stringify is not a JSON canonicalization solution.
It effectively depends on your objectives.
#2 is not really a problem; you would typically not output canonicalized JSON, it is only used internally since there are no requirements that input is canonicalized.
#3 yes, if you create bad data you can [always] screw up. It sounds BTW as a bug which presumably gets fixed some day.
#4 If you are targeting Node.js, Browsers, OpenAPI, and all other platforms compatible with those, JSON.stringify() seems to suffice.
The JSON.canonicalize() method proposal was intended for the systems specified in #4.
Perfection is often the enemy of good.
Anders
On Fri, Mar 16, 2018 at 4:07 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
Perfection is often the enemy of good.
So, to be clear: you don't plan on actually incorporating any feedback into your proposal, since it's already "good"?
On Fri, Mar 16, 2018 at 4:34 PM, C. Scott Ananian <ecmascript at cscott.net>
wrote:
On Fri, Mar 16, 2018 at 4:07 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
Perfection is often the enemy of good.
So, to be clear: you don't plan on actually incorporating any feedback into your proposal, since it's already "good"?
To restate my main objections:
I think any proposal to offer an alternative stringify instead of a string->string transform is not very good and could be easily improved by rephrasing it as a string->string transform.
Also, presenting this as a better wire format I think is misleading since I think it has no advantages as a wire format over JSON.stringify's output, and recommending canonical JSON, except for the short duration needed to hash it creates more problems than it solves.
On 2018-03-16 21:41, Mike Samuel wrote:
On Fri, Mar 16, 2018 at 4:34 PM, C. Scott Ananian <ecmascript at cscott.net <mailto:ecmascript at cscott.net>> wrote:
On Fri, Mar 16, 2018 at 4:07 PM, Anders Rundgren <anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote: Perfection is often the enemy of good. So, to be clear: you don't plan on actually incorporating any feedback into your proposal, since it's already "good"?
I'm not going to incorporate Unicode Normalization because it is better addressed at the application level.
To restate my main objections:
I think any proposal to offer an alternative stringify instead of a string->string transform is not very good and could be easily improved by rephrasing it as a string->string transform.
Could you give a concrete example on that?
Also, presenting this as a better wire format I think is misleading
This was not my intention, I just expressed it poorly. It was rather mixed with my objection to Unicode Normalization.
since I think it has no advantages as a wire format over JSON.stringify's output,
Right, JSON.stringify() is much better for creating the external format since it honors "creation order".
and recommending canonical JSON, except for the short duration needed to hash it creates more problems than it solves.
Wrong, this is exactly what I had in mind. If the hashable/canonicalizable method works as described (does it not?) it solves the hashing problem.
Anders
On Fri, Mar 16, 2018 at 9:04 PM, Mike Samuel <mikesamuel at gmail.com> wrote:
The output of JSON.canonicalize would also not be in the subset of JSON that is also a subset of JavaScript's PrimaryExpression.
JSON.canonicalize(JSON.stringify("\u2028\u2029")) ===
"\u2028\u2029"
Soon U+2028 and U+2029 will no longer be edge cases. A Stage 3 proposal (currently shipping in Chrome) makes them valid in ECMAScript string literals, making JSON a strict subset of ECMAScript: tc39/proposal
My main feedback is that since this topic has been covered so many times in the past, any serious standardization proposal should include a section surveying existing "canonical JSON" standards and implementations and comparing the proposed standard with prior work. A standard should be a "best of breed" implementation, which adequately replaces existing work, not just another average implementation narrowly tailored to the proposer's own particular use cases.
I don't think Unicode Normalization should necessarily be a requirement of a canonical JSON standard. But any reasonable proposal should at least acknowledge the issues raised, as well as the issues of embedded nulls, HTML safety, and the other points that have been raised in this thread (and the many other points addressed by the dozen other "canonical JSON" implementations I linked to). If you're just going to say, "my proposal is good enough", well then mine is "good enough" too, and so are the other dozen, and none of them need to be the "official JavaScript canonical form". What's your compelling argument that your proposal is better than any of the other dozen? And why start the discussion on this list if you're not going to do anything with the information you learn?
On Fri, Mar 16, 2018, 4:58 PM Anders Rundgren <anders.rundgren.net at gmail.com>
wrote:
On 2018-03-16 21:41, Mike Samuel wrote:
On Fri, Mar 16, 2018 at 4:34 PM, C. Scott Ananian <ecmascript at cscott.net <mailto:ecmascript at cscott.net>> wrote:
On Fri, Mar 16, 2018 at 4:07 PM, Anders Rundgren <
anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote:
To restate my main objections:
I think any proposal to offer an alternative stringify instead of a string->string transform is not very good and could be easily improved by rephrasing it as a string->string transform.
Could you give a concrete example on that?
I've given three. As written, the proposal produces invalid or low quality output given (undefined, objects with toJSON methods, and symbols as either keys or values). These would not be problems for a real canonicalizer since none are present in a string of JSON.
In addition, two distant users of the canonicalizer who wish to check hashes need to agree on the ancillary arguments like the replacer if canonicalize takes the same arguments and actually uses them. They also need to agree on implementation details of toJSON methods which is a backward compatibility hazard.
If you did solve the toJSON problem by incorporating calls to that method you've now complicated cross-platform behavior. If you phrase in terms of string->string it is much easier to disentangle the definition of canonical JSON from JS and make it language agnostic.
Finally, your proposal is not the VHS of canonicalizers. That would be x=>JSON.stringify(JSON.parse(x)) since it's deployed and used.
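For reference, that already-deployed transform looks like this, and it normalizes whitespace and escape sequences but not member order (illustrative):

    const roundTrip = (text) => JSON.stringify(JSON.parse(text));
    roundTrip('{ "a" : 1 }');      // '{"a":1}'       — whitespace dropped
    roundTrip('{"a":"\\u0041"}');  // '{"a":"A"}'     — escape sequences normalized
    roundTrip('{"b":1,"a":2}');    // '{"b":1,"a":2}' — member order preserved, not sorted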
stepping aside from the security aspect, having your code-base’s json-files normalized with sorted-keys is good-housekeeping, especially when you want to sanely maintain ones >1mb in size (e.g. large swagger json-documentations) [1].
and you can easily operationalize your build-process / pre-commit-checks to auto-key-sort json-files with the following simple shell-function [2].
[1] kaizhu256/node-swgg-github-all/blob/2018.2.2/assets.swgg.swagger.json [2] kaizhu256/node-utility2/blob/2018.1.13/lib.utility2.sh#L1513
#!/bin/sh
# .bashrc
: '
# to install, copy-paste the shell-function shFileJsonNormalize below
# into your shell startup script (.bashrc, .profile, etc...)
# example shell-usage:
source ~/.bashrc
printf "{
\"version\": \"0.0.1\",
\"name\": \"my-app\",
\"aa\": {
\"zz\": 1,
\"yy\": {
\"xx\": 2,
\"ww\": 3
}
},
\"bb\": [
3,
2,
1,
null
]
}" > package.json
shFileJsonNormalize package.json
cat package.json
# key-sorted output:
{
"aa": {
"yy": {
"ww": 3,
"xx": 2
},
"zz": 1
},
"bb": [
3,
2,
1,
null
],
"name": "my-app",
"version": "0.0.1"
}
'
shFileJsonNormalize() {(set -e
# this shell-function will
# 1. read the json-data from $FILE
# 2. normalize the json-data
# 3. write the normalized json-data back to $FILE
FILE="$1"
node -e "
// <script>
/*jslint
bitwise: true,
browser: true,
maxerr: 8,
maxlen: 100,
node: true,
nomen: true,
regexp: true,
stupid: true
*/
'use strict';
var local;
local = {};
local.fs = require('fs');
local.jsonStringifyOrdered = function (jsonObj, replacer, space) {
/*
* this function will JSON.stringify the jsonObj,
* with object-keys sorted and circular-references removed
*/
var circularList, stringify, tmp;
stringify = function (jsonObj) {
/*
* this function will recursively JSON.stringify the jsonObj,
* with object-keys sorted and circular-references removed
*/
// if jsonObj is an object, then recurse its items with object-keys sorted
if (jsonObj &&
typeof jsonObj === 'object' &&
typeof jsonObj.toJSON !== 'function') {
// ignore circular-reference
if (circularList.indexOf(jsonObj) >= 0) {
return;
}
circularList.push(jsonObj);
// if jsonObj is an array, then recurse its jsonObjs
if (Array.isArray(jsonObj)) {
return '[' + jsonObj.map(function (jsonObj) {
// recurse
tmp = stringify(jsonObj);
return typeof tmp === 'string'
? tmp
: 'null';
}).join(',') + ']';
}
return '{' + Object.keys(jsonObj)
// sort object-keys
.sort()
.map(function (key) {
// recurse
tmp = stringify(jsonObj[key]);
if (typeof tmp === 'string') {
return JSON.stringify(key) + ':' + tmp;
}
})
.filter(function (jsonObj) {
return typeof jsonObj === 'string';
})
.join(',') + '}';
}
// else JSON.stringify as normal
return JSON.stringify(jsonObj);
};
circularList = [];
return JSON.stringify(typeof jsonObj === 'object' && jsonObj
// recurse
? JSON.parse(stringify(jsonObj))
: jsonObj, replacer, space);
};
local.fs.writeFileSync(process.argv[1], local.jsonStringifyOrdered(
JSON.parse(local.fs.readFileSync(process.argv[1], 'utf8')),
null,
4
) + '\n');
// </script>
" "$FILE"
)}
With files frequently that size, it might be worth considering whether you should use a custom format+validator* instead. It'd take a lot less memory, which could be helpful since the first row alone of this file takes about 4-5K in Firefox when deserialized - I verified this in the console (To be exact, 5032 the first time, 4128 the second, and 4416 the third). Also, a megabyte is a lot to send down the wire in Web terms.
* In this case, you'd need a validator that uses minimal perfect hashes and a compact binary data representation that doesn't rely on a concrete start/end. That would avoid the mess of constantly having to look things up in memory, while leaving your IR much smaller. Another item of note: JS strings are 16-bit, which is wasteful in memory for your entire object.
Isiah Meadows me at isiahmeadows.com
On Sunday, March 18, 2018, Anders Rundgren <anders.rundgren.net at gmail.com>
wrote:
On 2018-03-16 20:24, Richard Gibson wrote:
Though ECMAScript JSON.stringify may suffice for certain Javascript-centric use cases or otherwise restricted subsets thereof as addressed by JOSE, it is not suitable for producing canonical/hashable/etc. JSON, which requires a fully general solution such as [1]. Both its number serialization [2] and string serialization [3] specify aspects that harm compatibility (the former having arbitrary branches dependent upon the value of numbers, the latter being capable of producing invalid UTF-8 octet sequences that represent unpaired surrogate code points—unacceptable for exchange outside of a closed ecosystem [4]). JSON is a general language-agnostic interchange format, and ECMAScript JSON.stringify is not a JSON canonicalization solution.
[1]: gibson042.github.io/canonicaljson-spec
[2]: ecma-international.org/ecma-262/7.0/#sec-tostring-applied-to-the-number-type
[3]: ecma-international.org/ecma-262/7.0/#sec-quotejsonstring
[4]: tools.ietf.org/html/rfc8259#section-8.1
Richard, I may be wrong but AFAICT, our respective canonicalization schemes are in fact principally IDENTICAL.
In that they have the same goal, yes. In that they both achieve that goal, no. I'm not married to choices like exponential notation and uppercase escapes, but a JSON canonicalization scheme MUST cover all of JSON.
That the number serialization provided by JSON.stringify() is unacceptable, is not generally taken as a fact. I also think it looks a bit weird, but that's just a matter of esthetics. Compatibility is an entirely different issue.
I concede this point. The modified algorithm is sufficient, but note that a canonicalization scheme will remain static even if ECMAScript changes.
Sorting on Unicode Code Points is of course "technically 100% right" but
strictly put not necessary.
Certain scenarios call for different systems to independently generate equivalent data structures, and it is a necessary property of canonical serialization that it yields identical results for equivalent data structures. JSON does not specify significance of object member ordering, so member ordering does not distinguish otherwise equivalent objects, so canonicalization MUST specify member ordering that is deterministic with respect to all valid data.
Your claim about uppercase Unicode escapes is incorrect, there is no such
requirement:
tools.ietf.org/html/rfc8259#section-7
I don't recall ever making a claim about uppercase Unicode escapes, other than observing that it is the preferred form for examples in the JSON RFCs... what are you talking about?
On Sun, Mar 18, 2018 at 10:08 AM, Richard Gibson <richard.gibson at gmail.com>
wrote:
On Sunday, March 18, 2018, Anders Rundgren <anders.rundgren.net at gmail.com> wrote:
On 2018-03-16 20:24, Richard Gibson wrote:
Though ECMAScript JSON.stringify may suffice for certain Javascript-centric use cases or otherwise restricted subsets thereof as addressed by JOSE, it is not suitable for producing canonical/hashable/etc. JSON, which requires a fully general solution such as [1]. Both its number serialization [2] and string serialization [3] specify aspects that harm compatibility (the former having arbitrary branches dependent upon the value of numbers, the latter being capable of producing invalid UTF-8 octet sequences that represent unpaired surrogate code points—unacceptable for exchange outside of a closed ecosystem [4]). JSON is a general language-agnostic interchange format, and ECMAScript JSON.stringify is not a JSON canonicalization solution.
[1]: gibson042.github.io/canonicaljson-spec
[2]: ecma-international.org/ecma-262/7.0/#sec-tostring-applied-to-the-number-type
[3]: ecma-international.org/ecma-262/7.0/#sec-quotejsonstring
[4]: tools.ietf.org/html/rfc8259#section-8.1
Richard, I may be wrong but AFAICT, our respective canonicalization schemes are in fact principally IDENTICAL.
In that they have the same goal, yes. In that they both achieve that goal, no. I'm not married to choices like exponential notation and uppercase escapes, but a JSON canonicalization scheme MUST cover all of JSON.
That the number serialization provided by JSON.stringify() is unacceptable, is not generally taken as a fact. I also think it looks a bit weird, but that's just a matter of esthetics. Compatibility is an entirely different issue.
I concede this point. The modified algorithm is sufficient, but note that a canonicalization scheme will remain static even if ECMAScript changes.
Does this mean that the language below would need to be fixed at a specific version of Unicode or that we would need to cite a specific version for canonicalization but might allow a higher version for String.prototype.normalize and in future versions of the spec require it?
www.ecma-international.org/ecma-262/6.0/#sec-conformance """ A conforming implementation of ECMAScript must interpret source text input in conformance with the Unicode Standard, Version 5.1.0 or later """
and in ECMA 404 www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
""" For undated references, the latest edition of the referenced document (including any amendments) applies. ISO/IEC 10646, Information Technology – Universal Coded Character Set (UCS) The Unicode Consortium. The Unicode Standard www.unicode.org/versions/latest. """
Sorting on Unicode Code Points is of course "technically 100% right" but
strictly put not necessary.
Certain scenarios call for different systems to independently generate equivalent data structures, and it is a necessary property of canonical serialization that it yields identical results for equivalent data structures. JSON does not specify significance of object member ordering, so member ordering does not distinguish otherwise equivalent objects, so canonicalization MUST specify member ordering that is deterministic with respect to all valid data.
Code points include orphaned surrogates in a way that scalar values do not, right? So both "\uD800" and "\uD800\uDC00" are single codepoints. It seems like a strict prefix of a string should still sort before that string but prefix transitivity in general does not hold: "\uFFFF" < "\uD800\uDC00" && "\uFFFF" > "\uD800".
That shouldn't cause problems for hashability but I thought I'd raise it just in case.
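A small illustration of the two orderings (plain ES, nothing proposal-specific): default string comparison is by UTF-16 code unit, so it disagrees with code point order exactly in this supplementary-character case.

    '\uFFFF' < '\uD800\uDC00';                                // false — code unit 0xFFFF sorts after 0xD800
    '\uFFFF'.codePointAt(0) < '\uD800\uDC00'.codePointAt(0);  // true  — 0xFFFF < 0x10000
    '\uD800' < '\uFFFF';                                      // true  — a lone surrogate sorts low either way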
On 2018-03-18 15:08, Richard Gibson wrote:
On Sunday, March 18, 2018, Anders Rundgren <anders.rundgren.net at gmail.com <mailto:anders.rundgren.net at gmail.com>> wrote:
On 2018-03-16 20:24, Richard Gibson wrote:
Though ECMAScript JSON.stringify may suffice for certain Javascript-centric use cases or otherwise restricted subsets thereof as addressed by JOSE, it is not suitable for producing canonical/hashable/etc. JSON, which requires a fully general solution such as [1]. Both its number serialization [2] and string serialization [3] specify aspects that harm compatibility (the former having arbitrary branches dependent upon the value of numbers, the latter being capable of producing invalid UTF-8 octet sequences that represent unpaired surrogate code points—unacceptable for exchange outside of a closed ecosystem [4]). JSON is a general language-agnostic interchange format, and ECMAScript JSON.stringify is not a JSON canonicalization solution.
[1]: http://gibson042.github.io/canonicaljson-spec/
[2]: http://ecma-international.org/ecma-262/7.0/#sec-tostring-applied-to-the-number-type
[3]: http://ecma-international.org/ecma-262/7.0/#sec-quotejsonstring
[4]: https://tools.ietf.org/html/rfc8259#section-8.1
Richard, I may be wrong but AFAICT, our respective canonicalization schemes are in fact principally IDENTICAL.
In that they have the same goal, yes. In that they both achieve that goal, no. I'm not married to choices like exponential notation and uppercase escapes, but a JSON canonicalization scheme MUST cover all of JSON.
Here it gets interesting... What in JSON cannot be expressed through JS and JSON.stringify()?
That the number serialization provided by JSON.stringify() is unacceptable, is not generally taken as a fact. I also think it looks a bit weird, but that's just a matter of esthetics. Compatibility is an entirely different issue.
I concede this point. The modified algorithm is sufficient, but note that a canonicalization scheme will remain static even if ECMAScript changes.
Agreed.
Sorting on Unicode Code Points is of course "technically 100% right" but strictly put not necessary.
Certain scenarios call for different systems to independently generate equivalent data structures, and it is a necessary property of canonical serialization that it yields identical results for equivalent data structures. JSON does not specify significance of object member ordering, so member ordering does not distinguish otherwise equivalent objects, so canonicalization MUST specify member ordering that is deterministic with respect to all valid data.
Violently agree but do not understand (I guess I'm just dumb...) why (for example) sorting on UCS2/UTF-16 Code Units would not achieve the same goal (although the result would differ).
Your claim about uppercase Unicode escapes is incorrect, there is no such requirement: https://tools.ietf.org/html/rfc8259#section-7
I don't recall ever making a claim about uppercase Unicode escapes, other than observing that it is the preferred form for examples in the JSON RFCs... what are you talking about?
You're right, I found it in the gibson042.github.io/canonicaljson-spec/#changelog
Thanx, Anders
On Sun, Mar 18, 2018, 10:30 AM Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
Violently agree but do not understand (I guess I'm just dumb...) why (for example) sorting on UCS2/UTF-16 Code Units would not achieve the same goal (although the result would differ).
Because there are JavaScript strings which do not form valid UTF-16 code units. For example, the one-character string '\uD800'. On the input validation side, there are 8-bit strings which can not be decoded as UTF-8. A complete sorting spec needs to describe how these are to be handled. For example, something like WTF-8: simonsapin.github.io/wtf-8
JSON supports arbitrary precision numbers that can't be properly represented as 64 bit floats. This includes numbers like eg. 1e9999 or 1/1e9999.
On Sun, Mar 18, 2018 at 10:43 AM, C. Scott Ananian <ecmascript at cscott.net>
wrote:
On Sun, Mar 18, 2018, 10:30 AM Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
Violently agree but do not understand (I guess I'm just dumb...) why (for example) sorting on UCS2/UTF-16 Code Units would not achieve the same goal (although the result would differ).
Because there are JavaScript strings which do not form valid UTF-16 code units. For example, the one-character string '\uD800'. On the input validation side, there are 8-bit strings which can not be decoded as UTF-8. A complete sorting spec needs to describe how these are to be handled. For example, something like WTF-8: simonsapin.github.io/wtf-8/
Let's get terminology straight. "\uD800" is a valid string of UTF-16 code units. It is also a valid string of codepoints. It is not a valid string of scalar values.
www.unicode.org/glossary/#code_point : Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆.
www.unicode.org/glossary/#code_unit : The minimal bit combination that can represent a unit of encoded text for processing or interchange.
www.unicode.org/glossary/#unicode_scalar_value : Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive.
On Sun, Mar 18, 2018 at 10:47 AM, Michał Wadas <michalwadas at gmail.com>
wrote:
JSON supports arbitrary precision numbers that can't be properly represented as 64 bit floats. This includes numbers like eg. 1e9999 or 1/1e9999.
I posted this on the summary thread but not here.
gist.github.com/mikesamuel/20710f94a53e440691f04bf79bc3d756 is structured as a string-to-string transform, so it doesn't lose precision when round-tripping, e.g. Python bigints and Java BigDecimals.
It also avoids a space explosion for 1e9999 which might help blunt timing attacks as discussed earlier in this thread.
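To give a sense of scale (illustrative only), expanding 1e9999 into plain decimal digits produces a ten-thousand-character numeral, which a string-to-string transform never needs to materialize:

(10n ** 9999n).toString().length;  // 10000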
On 2018-03-18 15:47, Michał Wadas wrote:
JSON supports arbitrary precision numbers that can't be properly represented as 64 bit floats. This includes numbers like eg. 1e9999 or 1/1e9999.
RFC 7159: Since software that implements IEEE 754-2008 binary64 (double precision) numbers [IEEE754] is generally available and widely used, good interoperability can be achieved by implementations that expect no more precision or range than these provide, in the sense that implementations will approximate JSON numbers within the expected precision.
If interoperability is not an issue you are free to do whatever you feel useful. Targeting a 0.001% customer base with standards is something I gladly leave to others to cater for.
The de facto standard, featured in any number of applications, is putting unusual/binary/whatever stuff in text strings.
Anders
Interop with systems that use 64b ints is not a .001% issue.
On Sun, Mar 18, 2018 at 10:29 AM, Mike Samuel <mikesamuel at gmail.com> wrote:
Does this mean that the language below would need to be fixed at a specific version of Unicode or that we would need to cite a specific version for canonicalization but might allow a higher version for String.prototype.normalize and in future versions of the spec require it?
www.ecma-international.org/ecma-262/6.0/#sec-conformance """ A conforming implementation of ECMAScript must interpret source text input in conformance with the Unicode Standard, Version 5.1.0 or later """
and in ECMA 404 www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
""" For undated references, the latest edition of the referenced document (including any amendments) applies. ISO/IEC 10646, Information Technology – Universal Coded Character Set (UCS) The Unicode Consortium. The Unicode Standard www.unicode.org/versions/latest. """
I can't see why either would have to change. JSON canonicalization should produce a JSON text in UTF-8, using JSON escape sequences only for double quote, backslash, and ASCII control characters U+0000 through U+001F (which are not valid in JSON strings) and unpaired surrogates U+D800 through U+DFFF (which are not conforming UTF-8). The algorithm doesn't need to know whether any given code point has a UCS assignment.
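A minimal sketch of that escaping rule, assuming the input is an ECMAScript string and leaving the final UTF-8 encoding to a later step (not the normative algorithm of any proposal discussed here):

function escapeCanonicalJSONString(s) {
  let out = '"';
  for (let i = 0; i < s.length; i++) {
    const cu = s.charCodeAt(i);
    if (cu === 0x22) out += '\\"';                      // double quote
    else if (cu === 0x5C) out += '\\\\';                // backslash
    else if (cu <= 0x1F) out += '\\u' + cu.toString(16).padStart(4, '0');  // ASCII controls
    else if (cu >= 0xD800 && cu <= 0xDBFF && i + 1 < s.length &&
             s.charCodeAt(i + 1) >= 0xDC00 && s.charCodeAt(i + 1) <= 0xDFFF) {
      out += s[i] + s[i + 1]; i++;                      // paired surrogate: emit as-is
    } else if (cu >= 0xD800 && cu <= 0xDFFF) {
      out += '\\u' + cu.toString(16).padStart(4, '0');  // unpaired surrogate: escape
    } else {
      out += s[i];                                      // everything else "as is"
    }
  }
  return out + '"';
}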
Code points include orphaned surrogates in a way that scalar values do not,
right? So both "\uD800" and "\uD800\uDC00" are single codepoints. It seems like a strict prefix of a string should still sort before that string but prefix transitivity in general does not hold: "\uFFFF" < "\uD800\uDC00" && "\uFFFF" > "\uD800". That shouldn't cause problems for hashability but I thought I'd raise it just in case.
IMO, "\uD800\uDC00" should never be emitted because a proper canonicalization would be "𐀀" (character sequence U+0022 QUOTATION MARK, U+10000 LINEAR B SYLLABLE B008 A, U+0022 QUOTATION MARK; octet sequence 0x22, 0xF0, 0x90, 0x80, 0x80, 0x22).
As for sorting, using the represented code points makes sense to me, but is not the only option (e.g., another option is using the literal characters of the JSON text such that "Z" < """ < "\" < "\u0000" < "\u001F" < "\uD800" < "\uDC00" < "^" < "x" < "ä" < "가" < "A" < "🔥" < "🙃"). Any specification of a total deterministic ordering would suffice, it's just that some are less intuitive than others.
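A small sketch (illustrative, not from any proposal) of how two such total orders diverge once supplementary characters are involved:

const keys = ["\uFFFF", "\uD800\uDC00"];  // U+FFFF and U+10000 (one astral code point)
keys.slice().sort();
// ["\uD800\uDC00", "\uFFFF"]  -- UTF-16 code-unit order: 0xD800 < 0xFFFF
keys.slice().sort((a, b) => a.codePointAt(0) - b.codePointAt(0));
// ["\uFFFF", "\uD800\uDC00"]  -- code-point order: 0xFFFF < 0x10000
// (the comparator only looks at the first code point, which suffices for this demo)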
On Sun, Mar 18, 2018 at 10:30 AM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-18 15:08, Richard Gibson wrote:
In that they have the same goal, yes. In that they both achieve that goal, no. I'm not married to choices like exponential notation and uppercase escapes, but a JSON canonicalization scheme MUST cover all of JSON.
Here it gets interesting... What in JSON cannot be expressed through JS and JSON.stringify()?
JSON can express arbitrary numbers, but ECMAScript JSON.stringify is limited to those with an exact IEEE 754 binary64 representation.
And probably more importantly (though not a gap with respect to JSON specifically), it emits octet sequences that don't conform to UTF-8 when serializing unpaired surrogates.
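A quick check of the surrogate case (hedged: engines that implement ES2019's "well-formed JSON.stringify" escape the lone surrogate instead of emitting it literally):

JSON.stringify('\uD800');
// pre-ES2019: a string containing a literal unpaired surrogate (no valid UTF-8 encoding)
// ES2019 and later: the JS string '"\\ud800"', i.e. an escape sequence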
Certain scenarios call for different systems to independently generate
equivalent data structures, and it is a necessary property of canonical serialization that it yields identical results for equivalent data structures. JSON does not specify significance of object member ordering, so member ordering does not distinguish otherwise equivalent objects, so canonicalization MUST specify member ordering that is deterministic with respect to all valid data.
Violently agree but do not understand (I guess I'm just dumb...) why (for example) sorting on UCS2/UTF-16 Code Units would not achieve the same goal (although the result would differ).
Any specification of a total deterministic ordering would suffice. Relying upon 16-bit code units would impose a greater burden on systems that do not use such representations internally, but is not fundamentally broken.
On 2018-03-18 16:47, Mike Samuel wrote:
Interop with systems that use 64b ints is not a .001% issue.
Certainly not, but using "Number" for dealing with such data would never be considered by, for example, the IETF.
This discussion (at least from my point of view), is about creating stuff that fits into standards.
Anders
A definition of canonical that is not tied to JavaScript's current range of values would fit into more standards than the proposal as it stands.
On 2018-03-18 18:40, Mike Samuel wrote:
A definition of canonical that is not tied to JavaScript's current range of values would fit into more standards than the proposal as it stands.
Feel free to submit an Internet-Draft that addresses more generic Number handling. My guess is that it would be rejected due to [quite valid] interoperability concerns.
It would probably fall in the same category as "Fixing JSON" which has not happened either. www.tbray.org/ongoing/When/201x/2016/08/20/Fixing-JSON
Anders
I think you misunderstood the criticism. JSON does not have numeric precision limits. There are plenty of systems that use JSON that never involve JavaScript and which pack int64s.
On 2018-03-18 19:04, Mike Samuel wrote:
I think you misunderstood the criticism. JSON does not have numeric precision limits.
I think I understood that, yes.
There are plenty of systems that use JSON that never involve JavaScript and which pack int64s.
Sure, but if these systems use the "Number" type they belong to a proprietary world where disregarding recommendations and best practices is OK.
BTW, this is an ECMAScript mailing list; why push non-JS-compliant ideas here?
Anders
On Sun, Mar 18, 2018 at 2:18 PM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-18 19:04, Mike Samuel wrote:
I think you misunderstood the criticism. JSON does not have numeric precision limits.
I think I understood that, yes.
There are plenty of systems that use JSON that never
involve JavaScript and which pack int64s.
Sure, but if these systems use the "Number" type they belong to a proprietary world where disregarding recommendations and best practices is OK.
No. They are simply not following a SHOULD recommendation. I think you have a variance mismatch in your argument.
BTW, this is an ECMAScript mailing list; why push non-JS-compliant ideas here?
Let's review.
You asserted "This discussion (at least from my point of view), is about creating stuff that fits into standards."
I agreed and pointed out that not tying the definition to JavaScript's current value limitations would allow it to fit into standards that do not assume those limitations.
You leveled this criticism: "My guess is that it would be rejected due to [quite valid] interoperability concerns." Implicit in that is when one standard specifies that an input MUST have a property that conflicts with an output that a conforming implementation MAY or SHOULD produce then you have an interoperability concern.
But, you are trying to argue that your proposal is more interoperable because it works for fewer inputs in fewer contexts and, if it were ported to other languages, would reject JSON that is parseable without loss of precision in those languages. How you can say with a straight face that being non-runtime-agnostic makes a proposal more interoperable is beyond me.
Here's where variance comes in. MUST on output makes a standard more interoperable. MAY on input makes a standard more interoperable.
SHOULD and SHOULD NOT do not justify denying service. They are guidelines that should be followed absent a compelling reason -- specific rules trump the general.
Your proposal is less interoperable because you are quoting a SHOULD, interpreting it as MUST and saying inputs MUST fit into an IEEE 754 double without loss of precision.
This makes it strictly less interoperable than a proposal that does not have that constraint.
EcmaScript SHOULD encourage interoperability since it is often a glue language.
At the risk of getting meta-, TC39 SHOULD prefer library functions that provide service for arbitrary inputs in their range. TC39 SHOULD prefer library functions that MUST NOT, by virtue of their semantics, lose precision silently.
Your proposal fails to be more interoperable inasmuch as it reproduces JSON.stringify(JSON.parse('1e1000')) === 'null'
There is simply no need to convert a JSON string to JavaScript values in order to hash it. There is simply no need to specify this in terms of JavaScript values when a runtime agnostic implementation that takes a string and produces a string provides the same value.
This is all getting very tedious though. I and others have been trying to move towards consensus on what a hashable form of JSON should look like.
We've identified key areas including
- property ordering,
- number canonicalization,
- string normalization,
- whether the input should be a JS value or a string of JSON,
- and others
but, as in this case, you seem to be arguing both sides of a position to support your proposal when you could just say "yes, the proposal could be adjusted along this dimension and still provide what's required."
If you plan on putting a proposal before TC39, are you willing to move on any of these, or are you asking for a YES/NO vote on a proposal that is largely the same as what you've presented?
If the former, then acknowledge that there is a range of options and collect feedback instead of sticking to "the presently drafted one is good enough." If the latter, then I vote NO because I think the proposal in its current form is a poor solution to the problem.
That's not to say that you've done bad work. Most non-incremental stage 0 proposals are poor, and the process is designed to integrate the ideas of people in different specialties to turn poor solutions to interesting problems into robust solutions to a wider range of problems than originally envisioned.
On 2018-03-18 20:15, Mike Samuel wrote:
I and others have been trying to move towards consensus on what a hashable form of JSON should look like.
We've identified key areas including
- property ordering,
- number canonicalization,
- string normalization,
- whether the input should be a JS value or a string of JSON,
- and others
but, as in this case, you seem to be arguing both sides of a position to support your proposal when you could just say "yes, the proposal could be adjusted along this dimension and still provide what's required."
For good or for worse, my proposal is indeed about leveraging ES6's take on JSON including limitations, {bugs}, and all. I'm not backing away from that position because then things get way more complex and probably never even happen.
Extending [*] the range of "Number" is pretty much (in practical terms) the same thing as changing JSON itself.
"Number" is indeed mindless crap but it is what is.
OTOH, the "Number" problem was effectively solved some 10 years ago through putting stuff in "strings". Using JSON Schema or "Old School" strongly typed programmatic solutions of the kind I use, this actually works great.
Anders
*] The RFC gives you the right to do that but existing implementations do not.
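A hedged sketch of the "big numbers in strings" convention mentioned above (the property names and values here are invented for the example):

const text = '{"amount":"123456789012345678901234567890.10","currency":"EUR"}';
const obj = JSON.parse(text);                        // the big number stays a string
const cents = BigInt(obj.amount.replace('.', ''));   // exact, application-level decoding
// cents === 12345678901234567890123456789010n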
On Sun, Mar 18, 2018, 4:50 PM Anders Rundgren <anders.rundgren.net at gmail.com>
wrote:
On 2018-03-18 20:15, Mike Samuel wrote:
I and others have been trying to move towards consensus on what a hashable form of JSON should look like.
We've identified key areas including
- property ordering,
- number canonicalization,
- string normalization,
- whether the input should be a JS value or a string of JSON,
- and others
but, as in this case, you seem to be arguing both sides of a position to support your proposal when you could just say "yes, the proposal could be adjusted along this dimension and still provide what's required."
For good or for worse, my proposal is indeed about leveraging ES6's take on JSON including limitations, {bugs}, and all. I'm not backing from that position because then things get way more complex and probably never even happen.
Extending [*] the range of "Number" is pretty much (in practical terms) the same thing as changing JSON itself.
Your proposal is limiting Number; my alternative is not extending Number.
"Number" is indeed mindless crap but it is what is.
On 2018-03-18 21:53, Mike Samuel wrote:
For good or for worse, my proposal is indeed about leveraging ES6's take on JSON including limitations, {bugs}, and all. I'm not backing from that position because then things get way more complex and probably never even happen. Extending [*] the range of "Number" is pretty much (in practical terms) the same thing as changing JSON itself.
Your proposal is limiting Number; my alternative is not extending Number.
Quoting earlier messages from you:
"Your proposal is less interoperable because you are quoting a SHOULD, interpreting it as MUST and saying inputs MUST fit into an IEEE 754 double without loss of precision. This makes it strictly less interoperable than a proposal that does not have that constraint"
"JSON does not have numeric precision limits. There are plenty of systems that use JSON that never involve JavaScript and which pack int64s"
Well, it took a while figuring this out. No harm done. Nobody died.
I think we can safely put this thread to rest now; you want to fix a problem that was fixed more than 10 years back through other measures [*].
Thanx, Anders
*] Cryptography using JSON exchanges integers that are 256 bits long and more.
Business systems using JSON exchange long decimal numbers.
Scientific systems cramming 80-bit IEEE 754 into "Number" may exist, but then we are probably talking about research projects using forked/home-grown JSON software.
"Number" was never sufficient and will (IMO MUST) remain in its crippled form, at least if we stick to mainstream.
How does the transform you propose differ from?
JSON.canonicalize = (x) => JSON.stringify(
  x,
  (_, x) => {
    if (x && typeof x === 'object' && !Array.isArray(x)) {
      const sorted = {}
      for (let key of Object.getOwnPropertyNames(x).sort()) {
        sorted[key] = x[key]
      }
      return sorted
    }
    return x
  })
The proposal says "in lexical (alphabetical) order." If "lexical order" differs from the lexicographic order that sort uses, then the above could be adjusted to pass a comparator function.
Applied to your example input,
JSON.canonicalize({ "escaping": "\u20ac$\u000F\u000aA'\u0042\u0022\u005c\"/", "other": [null, true, false], "numbers": [1E30, 4.50, 6, 2e-3, 0.000000000000000000000000001] }) ===
String.raw{"escaping":"€$\u000f\nA'B\"\\\\\"/","numbers":[1e+30,4.5,6,0.002,1e-27],"other":[null,true,false]}
// proposed
{"escaping":"\u20ac$\u000f\nA'B"\\"/","numbers":[1e+30,4.5,6,0.002,1e-27],"other":[null,true,false]}
The canonicalized example from section 3.2.3 seems to conflict with the text of 3.2.2:
""" If the Unicode value is outside of the ASCII control character range, it MUST be serialized "as is" unless it is equivalent to 0x005c () or 0x0022 (") which MUST be serialized as \ and " respectively. """
So I think the "\u20ac" should actually be "€" and the implementation above matches your proposal.
On 2018-03-19 14:34, Mike Samuel wrote:
How does the transform you propose differ from?
JSON.canonicalize = (x) => JSON.stringify(
  x,
  (_, x) => {
    if (x && typeof x === 'object' && !Array.isArray(x)) {
      const sorted = {}
      for (let key of Object.getOwnPropertyNames(x).sort()) {
        sorted[key] = x[key]
      }
      return sorted
    }
    return x
  })
Probably not all. You are the JS guru, not me :-)
The proposal says "in lexical (alphabetical) order." If "lexical order" differs from the lexicographic order that sort uses, then the above could be adjusted to pass a comparator function.
I hope (and believe) that this is just a terminology problem.
Applied to your example input,
JSON.canonicalize({ "escaping": "\u20ac$\u000F\u000aA'\u0042\u0022\u005c\"/", "other": [null, true, false], "numbers": [1E30, 4.50, 6, 2e-3, 0.000000000000000000000000001] }) === String.raw
{"escaping":"€$\u000f\nA'B\"\\\\\"/","numbers":[1e+30,4.5,6,0.002,1e-27],"other":[null,true,false]}
// proposed {"escaping":"\u20ac$\u000f\nA'B"\\"/","numbers":[1e+30,4.5,6,0.002,1e-27],"other":[null,true,false]}The canonicalized example from section 3.2.3 seems to conflict with the text of 3.2.2:
If you look under the result you will find a pretty sad explanation:
"Note: \u20ac denotes the Euro character, which not
being ASCII, is currently not displayable in RFCs"
After 30 years with RFCs, we can still only use ASCII :-( :-(
Updates: github.com/cyberphone/json-canonicalization/blob/master/JSON.canonicalize.md, cyberphone.github.io/doc/security/browser-json-canonicalization.html
Anders
On Mon, Mar 19, 2018 at 9:53 AM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
On 2018-03-19 14:34, Mike Samuel wrote:
How does the transform you propose differ from?
JSON.canonicalize = (x) => JSON.stringify(
  x,
  (_, x) => {
    if (x && typeof x === 'object' && !Array.isArray(x)) {
      const sorted = {}
      for (let key of Object.getOwnPropertyNames(x).sort()) {
        sorted[key] = x[key]
      }
      return sorted
    }
    return x
  })
Probably not all. You are the JS guru, not me :-)
The proposal says "in lexical (alphabetical) order." If "lexical order" differs from the lexicographic order that sort uses, then the above could be adjusted to pass a comparator function.
I hope (and believe) that this is just a terminology problem.
I think you're right. www.ecma-international.org/ecma-262/6.0/#sec-sortcompare is where it's specified. After checking that no custom comparator is present:
- Let xString be ToString(x).
- ReturnIfAbrupt(xString).
- Let yString be ToString(y).
- ReturnIfAbrupt(yString).
- If xString < yString, return −1.
- If xString > yString, return 1.
- Return +0.
(<) and (>) do not themselves bring in any locale-specific collation rules.
They bottom out on www.ecma-international.org/ecma-262/6.0/#sec-abstract-relational-comparison
If both px and py are Strings, then
- If py is a prefix of px, return false. (A String value p is a prefix of String value q if q can be the result of concatenating p and some other String r. Note that any String is a prefix of itself, because r may be the empty String.)
- If px is a prefix of py, return true.
- Let k be the smallest nonnegative integer such that the code unit at index k within px is different from the code unit at index k within py. (There must be such a k, for neither String is a prefix of the other.)
- Let m be the integer that is the code unit value at index k within px.
- Let n be the integer that is the code unit value at index k within py.
- If m < n, return true. Otherwise, return false.
Those code unit values are UTF-16 code unit values per www.ecma-international.org/ecma-262/6.0/#sec-ecmascript-language-types-string-type
each element in the String is treated as a UTF-16 code unit value
As someone mentioned earlier in this thread, lexicographic string comparisons that use different code unit sizes can compute different results for the same semantic string value. Between UTF-8 and UTF-32 you should see no difference, but UTF-16 can differ from those given supplementary codepoints.
It might be worth making explicit that your lexical order is over UTF-16 strings if that's what you intend.
If you look under the result you will find a pretty sad explanation:
"Note: \u20ac denotes the Euro character, which not being ASCII, is currently not displayable in RFCs"
Cool.
After 30 years with RFCs, we can still only use ASCII :-( :-(
Updates: github.com/cyberphone/json-canonicalization/blob/master/JSON.canonicalize.md, cyberphone.github.io/doc/security/browser-json-canonicalization.html
If this can be implemented in a small amount of library code, what do you need from TC39?
On 2018-03-19 15:17, Mike Samuel wrote:
As someone mentioned earlier in this thread, lexicographic string comparisons that use different code unit sizes can compute different results for the same semantic string value. Between UTF-8 and UTF-32 you should see no difference, but UTF-16 can differ from those given supplementary codepoints.
It might be worth making explicit that your lexical order is over UTF-16 strings if that's what you intend.
Right, it is actually already in 3.2.3:
Property strings to be sorted depend on that strings are represented as arrays of 16-bit unsigned integers where each integer holds a single UCS2/UTF-16 [UNICODE] code unit. The sorting is based on pure value comparisons, independent of locale settings.
This maps "natively" to JS and Java. Probably to .NET as well. Other systems may need a specific comparator.
If this can be implemented in a small amount of library code, what do you need from TC39?
At this stage probably nothing; the BIG issue is the algorithm, which I took the liberty of airing in this forum. To date, all efforts at creating a JSON canonicalization standard have been shot down or abandoned.
Anders
On Mon, Mar 19, 2018 at 10:30 AM, Anders Rundgren < anders.rundgren.net at gmail.com> wrote:
Right, it is actually already in 3.2.3:
My apologies. I missed that.
Property strings to be sorted depend on that strings are represented
as arrays of 16-bit unsigned integers where each integer holds a single UCS2/UTF-16 [UNICODE] code unit. The sorting is based on pure value comparisons, independent of locale settings.
This maps "natively" to JS and Java. Probably to .NET as well. Other systems may need a specific comparator.
Yep. Off the top of my head: Go and Rust use UTF-8. Python3 is UTF-16, Python2 is usually UTF-16 but may be UTF-32 depending on sizeof(wchar) when compiling the interpreter. C++ as is its wont is all of them.
Like I said, I think the hashing use case is worthwhile.
JSON is UTF-8 ... As far as 16-bit code points go, there are still astral character pairs. Binary data should be encoded to avoid this, such as with base-64.
Dear List,
Here is a proposal that I would be very happy to get feedback on since it builds on ES but is not (at all) limited to ES.
The request is for a complement to the ES "JSON" object called canonicalize() which would have identical parameters to the existing stringify() method.
The JSON canonicalization scheme (including ES code for emulating it) is described in: cyberphone.github.io/doc/security/draft-rundgren-json-canonicalization-scheme.html
Current workspace: github.com/cyberphone/json-canonicalization
Thanx, Anders Rundgren