Need a champion? StringView strawman

# Allen Wirfs-Brock (11 years ago)

ecmascript#1557 is a request that StringView over ArrayBuffers be added to ES.

The current StringView proposal is a WebIDL based design and not particularly integrated into the ES6 Typed Array support, the ES6 Unicode support, or the post ES6 "Binary Data" work. It isn't clear to me exactly how much, if any, momentum this proposal has in any standards process outside of TC39. However, if something like it is going to emerge TC39 should be involved early to ensure that it is well integrated into ES.

It sounds to me like we need a TC39 champion (or perhaps an anti-champion) to shepherd this work in the context of the new TC39 development process. Any volunteers? I can help but have limited time available for this right now.

# Anne van Kesteren (11 years ago)

Where is this from?

Google and Mozilla have implemented the API from encoding.spec.whatwg.org as a means to get strings out of bytes (and bytes out of strings). It's not clear we need anything else.
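For readers unfamiliar with that API, a minimal round-trip sketch follows. It assumes an engine that implements the Encoding Standard as it reads today, where `TextEncoder` always emits UTF-8 and `TextDecoder` defaults to UTF-8; the variable names are illustrative only.

```javascript
// TextEncoder produces UTF-8; TextDecoder defaults to UTF-8.
const encoder = new TextEncoder();
const decoder = new TextDecoder(); // equivalent to new TextDecoder("utf-8")

// String -> bytes (a Uint8Array).
const bytes = encoder.encode("héllo");

// Bytes -> string.
const text = decoder.decode(bytes);

console.log(bytes.length); // 6, because "é" takes two bytes in UTF-8
console.log(text);         // "héllo"
```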

# Kenneth Russell (11 years ago)

There was some discussion about implementing StringView on the blink-dev mailing list in August 2013. My opinion was and is that the Encoding spec satisfies these use cases.

Adding a StringView to Typed Arrays would bring along all of the complexities of character set encoding and decoding to the Typed Array definitions. Typed Arrays were designed to be small, simple, and comprehensible enough that they would be easily implementable and optimizable. I believe that adding a StringView would contradict these goals.

# Allen Wirfs-Brock (11 years ago)

On Jan 10, 2014, at 10:26 AM, Anne van Kesteren wrote:

Where is this from?

Don't know, I'm just creating visibility of the ES bug/feature request.

Google and Mozilla have implemented the API from encoding.spec.whatwg.org as a means to get strings out of bytes (and bytes out of strings). It's not clear we need anything else.

Same base point applies to that proposal. If anybody wants this capability to be considered as a standard ES capability, it needs to have a champion within TC39. I note that (as would be expected) the whatwg encoding spec. is expressed in WebIDL terms (DOMStrings, etc.) and it isn't yet clear to me how well it integrates with ES, ES standard library conventions, or non-browser hosts. Perhaps it's fine to leave this as a web platform API, but support for character set encoding/decoding is a general-purpose capability and it might be reasonable to have a solution that isn't tied to a specific environment.

# Brendan Eich (11 years ago)

Kenneth Russell wrote:

Adding a StringView to Typed Arrays would bring along all of the complexities of character set encoding and decoding to the Typed Array definitions. Typed Arrays were designed to be small, simple, and comprehensible enough that they would be easily implementable and optimizable. I believe that adding a StringView would contradict these goals.

+ a lot.

What can we do to make sure this thing stays dead, if it is dead? Anne may know some weird W3C protocol trick. ;-)

# Dwayne (11 years ago)

I disagree. I think this should progress. It doesn't have to add any additional functionality to Typed Arrays. As it stands I would consider it a replacement for the purposes of TextEncoder and TextDecoder APIs. Currently the Mozilla TextDecoder Web API does not accept ASCII as a valid encoding option and defaults to UTF-8, if left unspecified.

# Boris Zbarsky (11 years ago)

On 1/10/14 3:47 PM, Dwayne wrote:

Currently the Mozilla TextDecoder Web API does not accept ASCII as a valid encoding option

I'm curious. What would you expect such an option to do? Byte-inflate like ISO-8859-1? Byte-inflate but throw on bytes with values > 127?

Act as a synonym for ISO-8859-9? Something else?
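The first two readings of an "ascii" option asked about here can be sketched as below. Both function names are hypothetical and neither is part of any API discussed in this thread; the third reading (a synonym for another single-byte encoding) is omitted.

```javascript
// Reading 1: byte-inflate, one UTF-16 code unit per byte
// (what a "real" ISO-8859-1 decoder does).
function byteInflate(bytes) {
  let out = "";
  for (const b of bytes) out += String.fromCharCode(b);
  return out;
}

// Reading 2: byte-inflate, but reject anything outside 7-bit ASCII.
function strictAscii(bytes) {
  for (const b of bytes) {
    if (b > 127) throw new RangeError(`byte 0x${b.toString(16)} is not ASCII`);
  }
  return byteInflate(bytes);
}

const sample = new Uint8Array([0x48, 0x69, 0xe9]); // "Hi" followed by 0xE9
console.log(byteInflate(sample)); // "Hié"
```

Calling `strictAscii(sample)` throws, because 0xE9 is above 127. The two readings agree only on input that is pure 7-bit ASCII.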

and defaults to UTF-8, if left unspecified.

Right, because it's meant for text, and for text UTF-8 is a pretty reasonable default nowadays.

# Brendan Eich (11 years ago)

Dwayne wrote:

Currently the Mozilla TextDecoder Web API does not accept ASCII as a valid encoding option and defaults to UTF-8, if left unspecified.

That's a feature.

The '90s are over, let's not go back.

Why do you want ASCII, and what do you do with it?

# Dwayne (11 years ago)

On Fri, Jan 10, 2014 at 3:14 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:

I'm curious. What would you expect such an option to do? Byte-inflate like ISO-8859-1? Byte-inflate but throw on bytes with values > 127? Act as a synonym for ISO-8859-9? Something else?

Exactly how StringView handles the option now. If I generate a random string using byte values then each char in that string should correspond to a single byte when specifying the ISO-8859-1. It doesn't really make sense to use UTF-8 for bytes when that data should be manipulated as bytes in the first place, as in the case of data that is represented as a string but needs to be handled as bytes.

bugzilla.mozilla.org/show_bug.cgi?id=957424

UTF-8 being the default is not the problem of course. Throwing an exception for ASCII is.

# Dwayne (11 years ago)

UDP Datagrams.

# Boris Zbarsky (11 years ago)

On 1/10/14 4:29 PM, Dwayne wrote:

Exactly how StringView handles the option now. If I generate a random string using byte values then each char in that string should correspond to a single byte when specifying the ISO-8859-1.

OK, so specify ISO-8859-1, if that's what you're really doing. Or are you saying that you just want "ascii" to be a synonym for "iso-8859-1" here? But it'd be a lie, because ASCII actually means something, and it means something different from ISO-8859-1.

But really, if you just have bytes, not text, why are you generating a string from those byte values at all? This is where a typed array would make more sense...

# Dwayne (11 years ago)

I mean char code points in the range 0-255, i.e. a byte. Use whatever terminology or name you prefer.

Primarily because of this bug -> Expose raw data on UDP socket messages:

bugzilla.mozilla.org/show_bug.cgi?id=952927

I generate a random string using code points that I eventually convert to bytes. Specifically, in the case of a 2- or 20-char/byte ID, I need to be able to use the entire 16-bit or 160-bit space, then send it as bytes and trust that the ID will be the same for both parties consistently. <-- To elaborate, I need to bencode this information before converting to bytes. I understand all of this could be worked around by just using String.charCodeAt or the synonymous String.codePointAt but why then have such a powerful API and disallow the aforementioned feature?

And why exactly have two separate APIs?
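The charCodeAt workaround mentioned in this post might look like the sketch below. `binaryStringToBytes` is a hypothetical helper, not an existing API. Note that `charCodeAt` and `codePointAt` return the same value for code units in the 0-255 range but differ for surrogate pairs, so the sketch uses `charCodeAt` and rejects anything that does not fit in a byte.

```javascript
// Convert a "binary string" (every code unit in 0-255) to bytes.
function binaryStringToBytes(str) {
  const bytes = new Uint8Array(str.length);
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit > 0xff) {
      throw new RangeError(`code unit at index ${i} does not fit in a byte`);
    }
    bytes[i] = unit;
  }
  return bytes;
}

const idBytes = binaryStringToBytes("\x00\x7f\x80\xff");
console.log(Array.from(idBytes)); // [0, 127, 128, 255]
```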

# Brendan Eich (11 years ago)

Dwayne wrote:

UDP Datagrams.

Use a Uint8Array and string decoding/encoding API. Browsers have to copy anyway, you're not "optimizing" by using the (soon to be dead, I hope) StringView.
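A rough sketch of that approach for a datagram payload follows. The layout (a 20-byte ID followed by UTF-8 text) and the names `ID_LENGTH`, `buildPayload`, and `parsePayload` are assumptions made for illustration; they do not come from any API in this thread.

```javascript
const ID_LENGTH = 20; // hypothetical: a 20-byte (160-bit) ID

// Build a payload: raw ID bytes followed by UTF-8 encoded text.
function buildPayload(id, text) {
  if (id.length !== ID_LENGTH) throw new RangeError("ID must be 20 bytes");
  const body = new TextEncoder().encode(text);
  const payload = new Uint8Array(ID_LENGTH + body.length);
  payload.set(id, 0);
  payload.set(body, ID_LENGTH);
  return payload;
}

// Split a payload back into the ID bytes and the decoded text.
function parsePayload(payload) {
  return {
    id: payload.slice(0, ID_LENGTH), // copy of the ID bytes
    text: new TextDecoder().decode(payload.subarray(ID_LENGTH)),
  };
}

// Every byte value 0-255 is usable in the ID; it never passes through a string.
const id = Uint8Array.from({ length: ID_LENGTH }, (_, i) => 236 + i);
const parsed = parsePayload(buildPayload(id, "ping"));
console.log(parsed.text); // "ping"
```

Only the textual part goes through the encoder and decoder; the ID stays as bytes from end to end, so no label such as "ascii" or "iso-8859-1" is involved.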

# Dwayne (11 years ago)

No joke. But as far as optimization goes I'm limited. You wrote the book so thanks for at least hearing me out. ;)

# Brendan Eich (11 years ago)

Dwayne wrote:

Primarily because of this bug -> Expose raw data on UDP socket messages: bugzilla.mozilla.org/show_bug.cgi?id=952927

Answering for bz: why do you need string-views or string-anythings to wrangle bytes in and out of a Uint8Array? Can you show some code?

# Dwayne (11 years ago)

Compensate the lack of rawData property --> Bug 952927

Buffer is a Uint8Array which has non-standard methods on its prototype using a WeakMap technique. -->

This module will be used with BitTorrent PWP as well so it's definitely necessary. DecipherCode/Firebit/blob/master/lib/dgram.js#L78

And here is a snippet covering the other reason(s): pastebin.mozilla.org/3986282

Thanks.

# Boris Zbarsky (11 years ago)

On 1/10/14 10:46 PM, Dwayne wrote:

Compensate the lack of rawData property --> Bug 952927

Sure, but that should be fixed by adding such a property in this case, no? The only reason this is using a string is because it's using a somewhat braindead IDL (way more braindead for purposes of JS sanity than WebIDL is) to expose C data structures pretty directly.

# Anne van Kesteren (11 years ago)

On Fri, Jan 10, 2014 at 7:00 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

[...] it might be reasonable to have a solution that isn't tied to a specific environment.

Agreed. I have argued the same for URL parsing url.spec.whatwg.org at some point.

As for the API in the Encoding Standard, I think the only strong tie to the DOM at this point is its usage of DOMException. I have a few times on this list tried to figure out what the right way forward is for exceptions within the web platform so that they still fit well within the ES universe, but none of those led anywhere satisfactory yet.

# Anne van Kesteren (11 years ago)

On Fri, Jan 10, 2014 at 9:40 PM, Boris Zbarsky <bzbarsky at mit.edu> wrote:

OK, so specify ISO-8859-1, if that's what you're really doing.

Note that iso-8859-1 maps to windows-1252. There is an open bug on exposing a label to the API that has the "real" iso-8859-1 behavior: www.w3.org/Bugs/Public/show_bug.cgi?id=23971
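The label mapping can be observed through the `encoding` attribute. This sketch assumes an engine whose label table matches the Encoding Standard as it reads today (in which "ascii" is also a label for windows-1252); implementations from the time of this thread may have behaved differently.

```javascript
// Both labels resolve to the same encoding under the standard's label table.
const latin1 = new TextDecoder("iso-8859-1");
const ascii = new TextDecoder("ascii");

console.log(latin1.encoding); // "windows-1252"
console.log(ascii.encoding);  // "windows-1252"
```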

# Brendan Eich (11 years ago)

I think based on bugs and bz's advice that Dwayne has been misled by bad old pre-WebIDL API in Gecko -- there's no reason to do any string-viewing here. Certainly not punning bytes as points in a character set encoding.

# Allen Wirfs-Brock (11 years ago)

On Jan 11, 2014, at 6:13 AM, Anne van Kesteren wrote:

As for the API in the Encoding Standard, I think the only strong tie to the DOM at this point is its usage of DOMException. I have a few times on this list tried to figure out what the right way forward is for exceptions within the web platform so that they still fit well within the ES universe, but none of those led anywhere satisfactory yet.

I don't see any occurrences of DOMException in encoding.spec.whatwg.org

It seems to be throwing TypeError for parameter validation issues which is what a TC39 spec. would generally do, except that for a few of those cases we might throw a RangeError instead.

There are a couple places where a string such as "EncodingError" is thrown. We'd never do that and would use either TypeError or RangeError.
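As a point of comparison, the sketch below shows the error types a script observes in engines that implement the Encoding Standard as it reads today. That is an assumption about the reader's engine, not a description of the draft under discussion in this thread.

```javascript
// An unknown label is rejected at construction time.
let labelError;
try {
  new TextDecoder("no-such-encoding");
} catch (e) {
  labelError = e;
}
console.log(labelError instanceof RangeError); // true

// With fatal: true, malformed input is rejected at decode time.
let decodeError;
try {
  new TextDecoder("utf-8", { fatal: true }).decode(new Uint8Array([0xff]));
} catch (e) {
  decodeError = e;
}
console.log(decodeError instanceof TypeError); // true
```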

The major Web platform dependency I see is the use of DOMString and associated attributes such as [EnsureUTF16]. Those shouldn't be there for a host environment independent spec.

Finally, it seems likely that the subclassing contract for TextEncoder/TextDecoder hasn't been thought through and I notice that the examples instantiate instances of them without using the new operator.

But overall, it shouldn't be hard to fix these things and make it completely independent of the web platform. It could drop quite nicely into the new TC39 process model if you wanted to go that route of standardization.

# Anne van Kesteren (11 years ago)

On Sat, Jan 11, 2014 at 5:27 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

There are a couple places where a string such as "EncodingError" is thrown. We'd never do that and would use either TypeError or RangeError.

If you follow the link for "throw", you'll find it's a DOMException.

The major Web platform dependency I see is the use of DOMString and associated attributes such as [EnsureUTF16]. Those shouldn't be there for a host environment independent spec.

Sure, that's easily mapped though. (The whole EnsureUTF16 thing is in need of fixing in IDL.)

But overall, it shouldn't be hard to fix these things and make it completely independent of the web platform. It could drop quite nicely into the new TC39 process model if you wanted to go that route of standardization.

Agreed. I don't really have the bandwidth at the moment to work on this though. I have fixed the examples:

whatwg/encoding/commit/da5d1426a4e7ff7c7fea6724957b2c70df09bce4

# Allen Wirfs-Brock (11 years ago)

Another nit: the definition of "ASCII whitespace" is different from the definition of whitespace used by String.prototype.trim. That means that an implementation of this spec. that was implemented using JS couldn't use S.p.trim to process labels as described in the spec.
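The mismatch can be illustrated with a short sketch. ASCII whitespace is the five code points U+0009, U+000A, U+000C, U+000D, and U+0020, while `trim` also strips characters such as U+00A0 NO-BREAK SPACE. `stripAsciiWhitespace` is a hypothetical helper, not a function defined by either spec.

```javascript
// ASCII whitespace: TAB, LF, FF, CR, SPACE (leading or trailing runs).
const ASCII_WS = /^[\t\n\f\r ]+|[\t\n\f\r ]+$/g;

// Hypothetical helper: strip only ASCII whitespace from both ends of a label.
function stripAsciiWhitespace(label) {
  return label.replace(ASCII_WS, "");
}

const label = "\u00a0utf-8\u00a0"; // wrapped in NO-BREAK SPACE

console.log(label.trim().length);               // 5: trim() removes U+00A0
console.log(stripAsciiWhitespace(label).length); // 7: U+00A0 is not ASCII whitespace
```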

# Brendan Eich (11 years ago)

This seems more than a nit!

# Anne van Kesteren (11 years ago)

You cannot use that method for CSS, HTML, HTTP, etc. either. For this API we could have a different definition of whitespace I suppose, but e.g. for <meta charset=...> I doubt we could do that without risking breakage (or at the HTML parser level, say).

# Allen Wirfs-Brock (11 years ago)

I'm only talking about this specification and what it takes to decouple it from web platform dependencies. In this case, ASCII whitespace seems to only be used in processing the label parameter passed to the TextDecoder and TextEncoder constructors. So, it isn't clear how CSS or anything else is relevant to that.

# Anne van Kesteren (11 years ago)

As I said, the algorithm used to get an encoding is also used by HTML, CSS et al. See e.g. dev.w3.org/csswg/css-syntax/#input-byte-stream and www.whatwg.org/specs/web-apps/current-work/multipage/parsing.html#determining

# Allen Wirfs-Brock (11 years ago)

So, just don't couple them. Do this in TC39 and apply a multiple app platform perspective.

# Stefano Gioffré (11 years ago)

Hi all! I'm the author of the API.

Stefano (alias "fusionchess")