typed array filling convenience AND performance

# Florian Bösch (9 years ago)

The use cases:

1) Filling with custom data

When writing WebGL or physics code, or many other things to do with large collections of data, it's not unusual to have to fill arrays with custom data of some kind.

someArray.set([
 x0, y0, 1, 0, 0,
 x1, y0, 0, 1, 0,
 x1, y1, 0, 0, 1,
   x0, y0, 1, 0, 0,
   x1, y1, 0, 1, 0,
   x0, y1, 0, 0, 1
 ], i);

2) Copying in data from another array

Some data resides in another array and needs to be copied in, a pattern emscripten uses frequently.

someArray.set(otherArray.subarray(srcOffset, srcOffset + srcSize), dstOffset)

3) Initializing an existing array with a repeated numerical value

For audio processing, physics and a range of other tasks it's important to initialize an array with the same data.

for(var i=0; i<size; i++){ someArray[i] = 0; }

The problem: Doing all of these things is slow and/or unsuitable for realtime code.

  1. someArray.set from a freshly built list is slow, both because set itself is slow and because constructing the list is slow. It's not realtime friendly because every call allocates a new list that later has to be GCed.
  2. someArray.set from a subarray is slow due to the construction of the new array view, and it's not realtime friendly because that view has to be GCed.
  3. Filling an array one element at a time is slow.

*The test:* jsperf.com/typed-array-fast-filling/4 (screenshot here: codeflow.org/pictures/typed-array-test.png, and attached)

The status quo:

The fastest way to fill an array with custom data across browsers is:

r[i] = x0;
r[i + 1] = y0;
r[i + 2] = 1;
r[i + 3] = 0;
r[i + 4] = 0;
r[i + 5] = x1;
r[i + 6] = y0;

*Things that are not faster:*

  • pushing to a list: ~93% slower
  • a helper function filling from a list: 57-70% slower
  • array.set: ~73% slower
  • a helper function filling from arguments: 65-93% slower
  • asm.js: 69-81% slower (even in firefox)

Suggestions:

  1. Browser engines should get a lot better at arguments handling, so that unsized argument lists can be iterated quickly by native code. Firefox is already pretty good at unboxing a specified argument list (Chrome not so much), but I think the test shows that there's ample room for improvement.
  2. someArray.memcpy: Add a method to typed arrays that can shuffle bytes from array A to array B like so: dst.memcpy(dstOffset, src, srcOffset, size). This is to avoid having to allocate an object to do the job.
  3. someArray.memset: Add a method to typed arrays that can initialize them with a value like so: dst.memset(dstOffset, value, size).
  4. someArray.argcpy: Add a (fast) method to typed arrays that can copy the arguments like so: dst.argcpy(dstOffset, 1, 2, 3, 4)
  5. Drastically improve the set method.

(Naming and semantics don't matter to me, as long as the methods do what's suggested efficiently, conveniently, and fast.)
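
For concreteness, here is a rough sketch of what the proposed methods could look like if self-hosted in plain JS. The names and argument orders follow the suggestions above; the implementations are hypothetical, and a native version would presumably elide the per-element bounds checks:

```javascript
// Hypothetical self-hosted sketches of the proposed methods.
function memcpy(dst, dstOffset, src, srcOffset, size) {
  for (var i = 0; i < size; i++) dst[dstOffset + i] = src[srcOffset + i];
}

function memset(dst, dstOffset, value, size) {
  for (var i = 0; i < size; i++) dst[dstOffset + i] = value;
}

function argcpy(dst, dstOffset) {
  for (var i = 2; i < arguments.length; i++) dst[dstOffset + i - 2] = arguments[i];
}

var a = new Float32Array(8);
memset(a, 0, 1, 8);     // fill all 8 elements with 1
argcpy(a, 0, 9, 8, 7);  // a is now [9, 8, 7, 1, 1, 1, 1, 1]

var b = new Float32Array(8);
memcpy(b, 4, a, 0, 4);  // b is now [0, 0, 0, 0, 9, 8, 7, 1]
console.log(a.join(), b.join());
```

The point of having these as built-ins rather than user-level loops is that none of the calls above allocate anything, so they are safe to use in realtime code.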

Consequence of failure to rectify:

Fast code will be unreadable and unmaintainable. Sophisticated, speed-critical code will not be written in ECMAScript. Emscripten and asm.js, with their hermetic nature, will crowd out ECMAScript-driven development. Other alternatives, such as on-GPU transform & feedback and compute shaders, will be preferred to solve the problem.

# Jussi Kalliokoski (9 years ago)

+1 - especially the lack of something like the proposed memset() has been a massive headache for me.

Semantics (I'm quite nitpicky about them)... I'd prefer the argument order for memcpy() to be (src, dstOffset, srcOffset, size), to be consistent with set(). As for memset(), I'd prefer (value, dstOffset, size), because so far I've actually never needed the offset and size. I'd also prefer memcpy to be named closer to set(), because it's basically the same thing with two extra arguments. I'm not sure what the performance implications would be of just adding the two extra parameters to set() as optional, so maybe someone smarter than me can chime in on whether it's better to add a separate method or to overload the existing set()?

It's worth noting that set() currently gets relatively faster the bigger the data is, and the last I tested (also confirmed by Jukka Jylänki), at around 4k (surprisingly many) elements it becomes faster than manually assigning values.
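
A harness along these lines is a quick way to confirm that both approaches produce the same result, and to time the crossover on your own machine; the function names here are made up, and the ~4k threshold will vary by engine and hardware:

```javascript
// Compare manual element-wise assignment against set() for copying n elements.
function copyManual(dst, src, n) {
  for (var i = 0; i < n; i++) dst[i] = src[i];
}

function copySet(dst, src, n) {
  dst.set(src.subarray(0, n));
}

var n = 4096;
var src = new Float32Array(n);
for (var i = 0; i < n; i++) src[i] = i;

var a = new Float32Array(n);
var b = new Float32Array(n);

console.time("manual"); copyManual(a, src, n); console.timeEnd("manual");
console.time("set");    copySet(b, src, n);    console.timeEnd("set");

var same = true;
for (var j = 0; j < n; j++) if (a[j] !== b[j]) same = false;
console.log(same); // true -- both produce identical contents
```

Note that a single timed call like this is only a rough indicator; a real measurement would warm the functions up and average over many runs.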

# Allen Wirfs-Brock (9 years ago)

Have you looked at the ES6 typed array constructor (people.mozilla.org/~jorendorff/es6-draft.html#sec-properties-of-the-%typedarray%-intrinsic-object) and instance methods (people.mozilla.org/~jorendorff/es6-draft.html#sec-properties-of-the-%typedarrayprototype%-object)? In particular, check out the from, copyWithin, and fill methods.

In general, TC39’s approach in ES6 is for typed arrays to simply be a variant of Array and all of the Array methods (except those that are not applicable to fixed length arrays) are supported for typed arrays. Similarly, if there are common array operations that are still missing we would want them to be added for both Array and typed arrays.
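
The ES6 methods mentioned above map onto the use cases fairly directly; for example:

```javascript
// %TypedArray%.prototype.fill and copyWithin from ES6.
var a = new Float32Array(8);
a.fill(1);              // use case 3: [1, 1, 1, 1, 1, 1, 1, 1]
a.fill(2, 0, 4);        // optional start/end: [2, 2, 2, 2, 1, 1, 1, 1]
a.copyWithin(4, 0, 4);  // copy a[0..4) over a[4..8): every element is now 2

// %TypedArray%.from builds a typed array from any iterable or array-like.
var b = Float32Array.from([1, 2, 3]);
console.log(a.join(), b.join());
```

Note that copyWithin only moves data within a single array; copying between two distinct typed arrays still goes through set().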

Performance optimization is an implementation issue. That's where you should apply performance pressure.

# Adrian Perez de Castro (9 years ago)

On Thu, 30 Oct 2014 09:29:36 +0100, Florian Bösch <pyalot at gmail.com> wrote:

The use cases:

[...]

3) Initializing an existing array with a repeated numerical value

For audio processing, physics and a range of other tasks it's important to initialize an array with the same data.

for(var i=0; i<size; i++){ someArray[i] = 0; }

For this use case there is %TypedArray%.prototype.fill(), see:

people.mozilla.org/~jorendorff/es6-draft.html#sec-%typedarray%.prototype.fill

JavaScript engines are expected to implement it at some point. For example I am implementing this in V8, along with other new typed array methods. The engines should be able to generate quite good code for uses of this function and/or provide optimized versions relying on knowledge of the underlying element type of the typed array they are applied to.

# Jussi Kalliokoski (9 years ago)

Oh, nice! Had completely missed fill() before. Hope to see this land soon. :)

# Florian Bösch (9 years ago)

On Thu, Oct 30, 2014 at 1:41 PM, Allen Wirfs-Brock <allen at wirfs-brock.com> wrote:

Performance optimization is an implementation issue. That's where you should apply performance pressure.

That's true, but if there's only bad APIs to do certain tasks, it doesn't help.

# Steve Fink (9 years ago)

On 10/30/2014 06:14 AM, Adrian Perez de Castro wrote:

For this use case there is %TypedArray%.prototype.fill(), see:

people.mozilla.org/~jorendorff/es6-draft.html#sec-%typedarray%.prototype.fill

JavaScript engines are expected to implement it at some point. For example I am implementing this in V8, along with other new typed array methods. The engines should be able to generate quite good code for uses of this function and/or provide optimized versions relying on knowledge of the underlying element type of the typed array they are applied to.

I implemented this for Firefox 2 years ago, but never landed it - bugzilla.mozilla.org/show_bug.cgi?id=730880

Now there is %TypedArray%.prototype.fill. But I've become generally skeptical about it as an answer to performance concerns. I would rather see engines hyperoptimize

for(var i=0; i<size; i++){ someArray[i] = 0; }

based on observed type information. Which is not to say that we wouldn't want to make TA#fill fast too, but the above seems more generally useful.

On a related note, I would like to have some way of getting the OS to decommit memory. See bugzilla.mozilla.org/show_bug.cgi?id=855669 (start reading at about comment 22) for our discussion and attempt at this, which looks like it mysteriously trailed off this last March. Theoretically, the above loop could also trigger a decommit, but I think it's too much to expect the engine to guess when that's going to be a good idea. On the other hand, from a spec POV it's unobservable behavior, which makes it weird.

# Filip Pizlo (9 years ago)

On Oct 30, 2014, at 10:44 AM, Steve Fink <sphink at gmail.com> wrote:

I would rather see engines hyperoptimize

for(var i=0; i<size; i++){ someArray[i] = 0; }

based on observed type information. Which is not to say that we wouldn't want to make TA#fill fast too, but the above seems more generally useful.

I agree with this philosophy. I would just point out that both in C and Java, having some way for the programmer to say “fill” (i.e. memset/bzero in C, and I forget what it is in Java) has survived despite the compilers being super mature, probably because idiom recognition on loops like this is too fallible in the general case. So, I’d like to see .fill() be part of the language.

Anyway, I just filed a bug on our end for this: bugs.webkit.org/show_bug.cgi?id=138218

# Florian Bösch (9 years ago)

On Thu, Oct 30, 2014 at 6:44 PM, Steve Fink <sphink at gmail.com> wrote:

Now there is %TypedArray%.prototype.fill. But I've become generally skeptical about it as an answer to performance concerns. I would rather see engines hyperoptimize

for(var i=0; i<size; i++){ someArray[i] = 0; }

based on observed type information. Which is not to say that we wouldn't want to make TA#fill fast too, but the above seems more generally useful.

While useful, it's not a substitute for a convenient and fast method. Also, I presented severe usability issues with that approach when you need more than one value (such as computed values). Those usability issues can be resolved by using set plus a new list for every assignment, which, unfortunately, is also quite slow: for a loop that runs a million times, it creates a million lists.
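
The allocation problem looks roughly like this in practice; one common workaround (sketched here with made-up names) is to reuse a preallocated scratch array instead of building a fresh list on every iteration:

```javascript
// Anti-pattern: a fresh list per iteration means a million short-lived
// allocations for a million iterations, all of which must later be GCed.
function fillNaive(dst, count) {
  for (var i = 0; i < count; i++) {
    dst.set([i, i + 1, i + 2], i * 3);
  }
}

// Workaround: one preallocated scratch array, reused on every iteration.
var scratch = new Float32Array(3);
function fillScratch(dst, count) {
  for (var i = 0; i < count; i++) {
    scratch[0] = i; scratch[1] = i + 1; scratch[2] = i + 2;
    dst.set(scratch, i * 3);
  }
}

var a = new Float32Array(9);
var b = new Float32Array(9);
fillNaive(a, 3);
fillScratch(b, 3);
console.log(a.join());              // 0,1,2,1,2,3,2,3,4
console.log(a.join() === b.join()); // true
```

The scratch-array version avoids the per-iteration garbage, but it still pays the overhead of a set() call per iteration, which is exactly why a fast built-in would help.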

# Katelyn Gadd (9 years ago)

I'd also like to chime in since this is a real problem for performance-sensitive JSIL code: The idea that JS runtimes will just optimize all our performance-sensitive loops is lovely. It also appears to be a fantasy, because we've been writing those loops for years and they're still slow. I did some extensive testing and experimentation with spidermonkey a year or so back, examining how it optimized various forms of a memcpy loop, and the results were pretty dire. V8 didn't seem to fare much better, though it is harder to tell because they don't trivially expose generated native code for functions.

In my testing, even best-case optimized memcpy/memset loops did not produce the kind of efficient code you want. There were always bounds checks and in some cases I saw other overhead. The set/copy definitely weren't vectorized. In the best case I saw bounds checks getting hoisted out of the loop body (great!)

I feel like at this point the claim that we don't need better copy/set APIs is the 'sufficiently smart compiler' trope all over again, though in this case perhaps it is wholly possible and it's just not happening because it isn't important enough to VM implementers.

TypedArray.prototype.fill is perfect for that scenario, so all we need is a copy. It's essential that the copy support the source array being distinct from the destination one (a previous proposal - did it go through? - required the source & destination arrays to be the same, presumably micro-optimizing for emscripten.)

In the short term these new methods could be self-hosted and would at least be no worse than existing hand-authored loops. If the runtime uses the equivalent of Spidermonkey's callsite cloning, the copy/fill loops would be specialized based on element type, which would increase performance. If the VM were to actually generate fully optimal copy/fill code, it would be a sizable improvement. At a bare minimum, having the 'best' copy/fill loop JS for a given runtime in the stdlib is much better than developers having to figure out the best loop type - especially if that type varies across browsers.

The performance of memcpy and memset is extremely important. Off the top of my head, I'm aware of a Google paper where they experimented with using instrumentation and specialized code to improve memcpy performance across some of their key applications and the results were (IMO) quite significant: static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40679.pdf

In some of their applications a significant chunk of time is spent just inside memcpy - 20% or more - with memset and memcmp adding up to a couple more percentage points. Any optimizations to these functions can pay dividends, obviously. For real-world applications being ported to JS, if memcpy and memset are already at a performance disadvantage vs the 'standard' ones used in native C, this will be more dramatic.

Related: bugzilla.mozilla.org/show_bug.cgi?id=862249

# Andreas Rossberg (9 years ago)

On 31 October 2014 16:40, Katelyn Gadd <kg at luminance.org> wrote:

I'd also like to chime in since this is a real problem for performance-sensitive JSIL code: The idea that JS runtimes will just optimize all our performance-sensitive loops is lovely. It also appears to be a fantasy, because we've been writing those loops for years and they're still slow. I did some extensive testing and experimentation with spidermonkey a year or so back, examining how it optimized various forms of a memcpy loop, and the results were pretty dire. V8 didn't seem to fare much better, though it is harder to tell because they don't trivially expose generated native code for functions.

Hm, just an aside, but I wonder what you mean by "trivially" here. It is only one flag away.

# Isiah Meadows (9 years ago)

From: Andreas Rossberg <rossberg at google.com>

Hm, just an aside, but I wonder what you mean by "trivially" here. It is only one flag away.

Try node --allow-natives-syntax, and take a look at the runtime-*.cc files at v8/v8-git-mirror/tree/master/src/runtime (grep RUNTIME_FUNCTION)

# Brendan Eich (9 years ago)

Steve Fink wrote:

On a related note, I would like to have some way of getting the OS to decommit memory. See bugzilla.mozilla.org/show_bug.cgi?id=855669 (start reading at about comment 22) for our discussion and attempt at this, which looks like it mysteriously trailed off this last March. Theoretically, the above loop could also trigger a decommit, but I think it's too much to expect the engine to guess when that's going to be a good idea. On the other hand, from a spec POV it's unobservable behavior, which makes it weird.

ArrayBuffer.transfer (gist.github.com/andhow/95fb9e49996615764eff) is an ES7 stage 0 proposal, needs to move to stage 1 soon. It enables decommitting memory.

# Steve Fink (9 years ago)

On 11/04/2014 11:08 AM, Brendan Eich wrote:

ArrayBuffer.transfer (gist.github.com/andhow/95fb9e49996615764eff) is an ES7 stage 0 proposal, needs to move to stage 1 soon. It enables decommitting memory.

I'm not sure we're talking about the same thing. I'm talking about what would be madvise(MADV_DONTNEED) on POSIX or VirtualAlloc(MEM_RESET) on Windows. Er... actually, I think it would be MEM_RESET followed by MEM_COMMIT to get the zero-filling while still releasing the physical pages.

Unless there's some tricky way of using ArrayBuffer.transfer to signal that memory can be decommitted, but I don't see it.

# Brendan Eich (9 years ago)

Steve Fink wrote:

I'm not sure we're talking about the same thing. I'm talking about what would be madvise(MADV_DONTNEED) on POSIX or VirtualAlloc(MEM_RESET) on Windows. Er... actually, I think it would be MEM_RESET followed by MEM_COMMIT to get the zero-filling while still releasing the physical pages.

Luke should weigh in. I had in mind the words "either truncated or zero-extended to be newByteLength" from

gist.github.com/andhow/95fb9e49996615764eff

# Luke Wagner (9 years ago)

ArrayBuffer.transfer provides a coarser granularity than the proposed ArrayBuffer.prototype.discard in bug 855669. Page-level madvise(MADV_DONTNEED) is the type of thing you'd want if you were implementing a malloc heap (jemalloc being an example). OTOH, ArrayBuffer.transfer also allows you to immediately release virtual address space, something discard doesn't, so I think these features are complementary.

# Brendan Eich (9 years ago)

Luke Wagner wrote:

the proposed ArrayBuffer.prototype.discard

I'll get this on the November Agenda.

# Isiah Meadows (9 years ago)

These two methods happen to fix one of the two reasons I haven't really dabbled in handwritten asm.js (the other being the lack of string handling).

# Jeff Walden (9 years ago)

On 10/31/2014 08:40 AM, Katelyn Gadd wrote:

In my testing, even best-case optimized memcpy/memset loops did not produce the kind of efficient code you want. There were always bounds checks and in some cases I saw other overhead. The set/copy definitely weren't vectorized. In the best case I saw bounds checks getting hoisted out of the loop body (great!)

It's my impression, across many compilers and languages, that completely correct range analysis (a prerequisite to eliminating many bounds checks) is very, very hard.

When people have used Csmith and similar on gcc and clang, they've found range analysis bugs (or issues in systems producing inputs to range analysis) with ease. These are seasoned compilers where one might think such issues had long since been eliminated.

People find similar issues fuzzing SpiderMonkey. Range analysis has been in SpiderMonkey awhile now, yet the incoming stream of range analysis issues continues unabated. We "trust" range analysis information for DCE purposes and constant folding because a wrong value/behavior is not immediately exploitable. But we've been very hesitant to trust it to eliminate bounds checks. The consequences there are memory corruption and arbitrary code execution. When the range analysis information is known buggy, it'd be folly to trust it to remove bounds checks.

I'd also note C/C++ and similar range analysis code usually has the advantage of working with stable, integer-xor-floating types. JS with its double type requires considerably more care to suss out an implicit integer subtype, when double operations can produce integers, integer operations can produce doubles, and negative zero constantly confounds. Range analysis for those languages is easier than it is for JS, and yet they're still buggy.

I feel like at this point the claim that we don't need better copy/set APIs is the 'sufficiently smart compiler' trope all over again, though in this case perhaps it is wholly possible and it's just not happening because it isn't important enough to VM implementers.

"not important enough" is true, but it requires a big-picture view. Your narrow focus is on typed arrays and native-equivalent code. There's practically a whole web of code dissimilar to that, that also needs optimizing. There are new standard features to implement. There are security issues to fix. There's low-level architecture to get right so that it's possible to build upon basic optimizations and eventually implement bigger ones (such as the vectorization ideas you mention here). We'll get to this eventually, but it has to compete with the other important concerns on the table.

Jeff

P.S. -- For what it's worth, I think we'll care a bit more about this soon. We may need to do bits of this to permit self-hosting of SpiderMonkey's typed array methods without losing performance, and we really want to make that change soon. Don't give up hope!