Should Decode accept U+FFFE or U+FFFF (and other Unicode non-characters)?

# Jeff Walden (15 years ago)

I was looking at how SpiderMonkey decodes URI-encoded strings, specifically to update it to reject overlong UTF-8 sequences per ES5 (a breaking change from ES3 that most would agree was necessary, not to mention that existing implementations were inconsistent about where they were loose and where they were strict). After SpiderMonkey made that change I noticed some non-standard extra behavior: U+FFFE and U+FFFF decode to the replacement character. ES5 doesn't say to do this -- the decode table categorizes only [0xD800, 0xDFFF] as invalid (when not in a surrogate pair) and resulting in a URIError. (The prose in Decode says "If Octets does not contain a valid UTF-8 encoding of a Unicode code point", which might, if you squinted, be read as saying that the "UTF-8 encoding" of U+FFFE isn't valid and therefore a URIError must be thrown.)
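To make the cases concrete, here's roughly how the inputs at issue behave at a JS shell; the annotations are illustrative, since actual results vary by engine and by how strictly it follows ES5:

```js
// Overlong encoding of "/" -- ES5 requires a URIError here:
decodeURIComponent("%C0%AF");     // throws URIError in a conforming engine

// Lone surrogate U+D800 -- invalid per the decode table, so URIError:
decodeURIComponent("%ED%A0%80");  // throws URIError

// U+FFFE and U+FFFF -- the cases in question:
decodeURIComponent("%EF%BF%BE");  // "\uFFFE" per a strict reading of ES5;
                                  // SpiderMonkey at the time returned "\uFFFD"
decodeURIComponent("%EF%BF%BF");  // "\uFFFF" vs. "\uFFFD", likewise
```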

U+FFFF is a Unicode noncharacter, never intended for open interchange, and U+FFFE conceivably could confuse Unicode decoders into decoding with the wrong endianness under the right circumstances. Theoretically, at least. Might it make sense to throw a URIError upon encountering them (and perhaps also the noncharacters [U+FDD0, U+FDEF], and maybe even the code points equal to 0xFFFE or 0xFFFF mod 0x10000 as well)?
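Spelled out as a predicate (purely hypothetical helper code, not proposed spec text), the full set being floated is:

```js
// Hypothetical predicate spelling out the ranges mentioned above;
// `cp` is a Unicode code point (0 .. 0x10FFFF).
function isNoncharacter(cp) {
  return (cp >= 0xFDD0 && cp <= 0xFDEF) ||  // the 32 noncharacters U+FDD0..U+FDEF
         (cp & 0xFFFE) === 0xFFFE;          // U+nFFFE and U+nFFFF in every plane
}
```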

# Jeff Walden (13 years ago)

Reraising this issue...

To briefly repeat: Decode, called by decodeURI{,Component}, says to reject %ab%cd%ef sequences whose octets "[do] not contain a valid UTF-8 encoding of a Unicode code point". It appears browsers interpret this requirement as: reject overlong UTF-8 sequences, and otherwise reject only unpaired or mispaired surrogate "code points". Is this exactly what ES5 requires? And if it is, should it be? Firefox has also treated otherwise-valid-looking encodings of U+FFFE and U+FFFF as specifying that the replacement character U+FFFD be used. And the rationale for rejecting U+FFF{E,F} also seems to apply to the non-character range [U+FDD0, U+FDEF] and U+xyFF{E,F}. Table 21 seems to say only malformed encodings and bad surrogates should be rejected, but "valid encoding of a code point" is arguably unclear.
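For orientation, these are the percent-encoded UTF-8 forms Decode would actually see for the code points in question (plain encodeURIComponent output):

```js
// The questioned code points, as the %-sequences Decode would consume:
encodeURIComponent("\uFFFE");        // "%EF%BF%BE"  (U+FFFE)
encodeURIComponent("\uFFFF");        // "%EF%BF%BF"  (U+FFFF)
encodeURIComponent("\uFDD0");        // "%EF%B7%90"  (start of [U+FDD0, U+FDEF])
encodeURIComponent("\uD83F\uDFFE");  // "%F0%9F%BF%BE"  (U+1FFFE, an instance of U+xyFF{E,F})
```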

At least one person interested in Firefox's decoding implementation argues that not rejecting or replacing U+FFF{E,F} is a "potential security vulnerability" because those code points (particularly U+FFFE) might confuse code into interpreting a sequence of code points with the wrong endianness. I find the argument unpersuasive and the potential harm too speculative (particularly as no other browser replaces or rejects U+FFF{E,F}). But the point's been raised, and it's at least somewhat plausible, so I'd like to see it conclusively addressed.

A last note: two test262 tests directly exercise the Decode algorithm and expect that these two characters decode to U+FFF{E,F}. (I think at a glance they might also allow throwing, though it's not clear to me that's intentional.)

- hg.ecmascript.org/tests/test262/file/b4690e1408ee/test/suite/sputnik_converted/15_Native/15.1_The_Global_Object/15.1.3_URI_Handling_Function_Properties/15.1.3.1_decodeURI/S15.1.3.1_A2.4_T1.js
- hg.ecmascript.org/tests/test262/file/b4690e1408ee/test/suite/sputnik_converted/15_Native/15.1_The_Global_Object/15.1.3_URI_Handling_Function_Properties/15.1.3.2_decodeURIComponent/S15.1.3.2_A2.4_T1.js
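Paraphrasing the relevant assertions (this is the shape of the check, not the verbatim test262 source):

```js
// Paraphrase of the S15.1.3.1_A2.4_T1-style assertions described above:
if (decodeURI("%EF%BF%BE") !== "\uFFFE") {
  throw new Error("decodeURI should pass U+FFFE through unchanged");
}
if (decodeURI("%EF%BF%BF") !== "\uFFFF") {
  throw new Error("decodeURI should pass U+FFFF through unchanged");
}
```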

# Allen Wirfs-Brock (13 years ago)

On Jul 14, 2011, at 10:38 PM, Jeff Walden wrote:

> Reraising this issue...
>
> To briefly repeat: Decode, called by decodeURI{,Component}, says to reject %ab%cd%ef sequences whose octets "[do] not contain a valid UTF-8 encoding of a Unicode code point". It appears browsers interpret this requirement as: reject overlong UTF-8 sequences, and otherwise reject only unpaired or mispaired surrogate "code points". Is this exactly what ES5 requires? And if it is, should it be? Firefox has also treated otherwise-valid-looking encodings of U+FFFE and U+FFFF as specifying that the replacement character U+FFFD be used. And the rationale for rejecting U+FFF{E,F} also seems to apply to the non-character range [U+FDD0, U+FDEF] and U+xyFF{E,F}. Table 21 seems to say only malformed encodings and bad surrogates should be rejected, but "valid encoding of a code point" is arguably unclear.

I haven't swapped back in my technical understanding of the subtleties of UTF-8 encodings yet today, so I'm not yet prepared to provide a technical response. But I think I can speak to the intent of the spec (or at least the ES5 version):

  1. These are legacy functions that have been in browser JS implementations at least since ES3 days. We didn't want to change them in any incompatible way.
  2. As with RegExp and other similar issues, browser reality (well, legacy browser reality, maybe not newcomers) is more important than what the spec. actually says. If browsers all do something different from the spec., then the spec. should be updated accordingly. However, for ES5 we didn't do any deep analysis of this browser reality, so we might have missed something.
  3. The intent is pretty clearly stated in the note in the last paragraph, which includes Table 21 (BTW, since the table is in a note it isn't normative). It essentially says to throw an exception when decoding anything that RFC 3629 says is not a valid UTF-8 encoding.
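For reference, RFC 3629's well-formedness rules boil down to a handful of byte-range patterns. A minimal sketch of such a check (illustrative only, not spec or implementation text) might look like this -- and note that the encodings of U+FFFE and U+FFFF *are* well-formed by this definition, which is exactly what's at issue:

```js
// Illustrative sketch: is `bytes` (an array of octets) a single well-formed
// UTF-8 sequence per RFC 3629? Rejects overlong forms, surrogates, and
// anything above U+10FFFF; U+FFFE / U+FFFF (EF BF BE / EF BF BF) pass.
function isWellFormedUTF8(bytes) {
  var b0 = bytes[0];
  if (bytes.length === 1) return b0 <= 0x7F;        // ASCII
  for (var i = 1; i < bytes.length; i++)            // every trailing byte is 10xxxxxx
    if ((bytes[i] & 0xC0) !== 0x80) return false;
  if (bytes.length === 2)
    return b0 >= 0xC2 && b0 <= 0xDF;                           // C0/C1 would be overlong
  if (bytes.length === 3)
    return (b0 === 0xE0 && bytes[1] >= 0xA0) ||                // exclude overlongs
           (b0 >= 0xE1 && b0 <= 0xEC) ||
           (b0 === 0xED && bytes[1] <= 0x9F) ||                // exclude surrogates
           (b0 >= 0xEE && b0 <= 0xEF);
  if (bytes.length === 4)
    return (b0 === 0xF0 && bytes[1] >= 0x90) ||                // exclude overlongs
           (b0 >= 0xF1 && b0 <= 0xF3) ||
           (b0 === 0xF4 && bytes[1] <= 0x8F);                  // cap at U+10FFFF
  return false;                                     // empty or too long
}
```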

I would prioritize #3 after #1 & #2. If there is consistent behavior in all major browsers that predate ES5, then that is the behavior that should be followed (and the spec. updated if necessary). If there is disagreement among those legacy browsers, then I would simply follow the ES5 spec. unless it does something that is contrary to RFC 3629. If it does, then we need to think about whether we have a spec. bug.

> At least one person interested in Firefox's decoding implementation argues that not rejecting or replacing U+FFF{E,F} is a "potential security vulnerability" because those code points (particularly U+FFFE) might confuse code into interpreting a sequence of code points with the wrong endianness. I find the argument unpersuasive and the potential harm too speculative (particularly as no other browser replaces or rejects U+FFF{E,F}). But the point's been raised, and it's at least somewhat plausible, so I'd like to see it conclusively addressed.

It's just a transformation from one JS string to another. It can't do anything that hand-written JS code couldn't do. How would this be any more of a problem than simply providing the code points that the bogus sequence would be incorrectly interpreted as? That said, #3 above does say that the intent is to reject anything that is not valid UTF-8.
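To put that point concretely (an illustrative aside; the decode result assumes an engine that, per the above, neither replaces nor rejects these code points):

```js
// The same code units are reachable without any URI function at all:
var viaDecode = decodeURIComponent("%EF%BF%BE"); // "\uFFFE" in engines that don't replace/reject
var byHand    = String.fromCharCode(0xFFFE);     // "\uFFFE", no decoding involved
var byLiteral = "\uFFFE";                        // likewise
```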