parallel arrays and sorting

# Norm Rubin (12 years ago)

In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed that Sorting using parallelArrays looks like a missing primitive operation.

Sorting is special because the good (high performance) algorithm on a gpu is radix sort, while the good (high performance) algorithm on the cpu is parallel merge sort, The other way around radix sort of cpu, or parallel merge sort of a gpu is very slow and often worse than serial implementations

Sadly it is well past a jit to take the code for one flavor of sort and transform it into the other, while it would be pretty simple for a run-time to pick a good sort, if only it knew that a sort was going on. Run times would not be required to do so here since they can always pick a slow sort.

I know this is a slippery road, since once you add another prim, adding more prims becomes ever easier but sorting seems pretty important. And array already has a sort, so adding a version that works on ParallelArrays does not seem so bad.

What do you guys think?

This email message is for the sole use of the intended recipient(s) and may contain confidential information. Any unauthorized review, use, disclosure or distribution is prohibited. If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.

In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed that Sorting using parallelArrays looks like a missing primitive operation.

What do you guys think?

-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130408/c4dd7e37/attachment.html>

# Till Schneidereit (12 years ago)

cc-ing dev-tech-js-engine-rivertrail at lists.mozilla.org, where the authors of Mozilla's implementation will read it.

cc-ing dev-tech-js-engine-rivertrail at lists.mozilla.org, where the authors
of Mozilla's implementation will read it.


On Mon, Apr 8, 2013 at 4:45 PM, Norm Rubin <nrubin at nvidia.com> wrote:

> In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I
> noticed  that Sorting using parallelArrays looks like a missing primitive
>  operation.****
>
> ** **
>
> Sorting is  special because the good (high performance) algorithm on a gpu
> is radix sort, while the good (high performance) algorithm on the cpu is
> parallel merge sort,  The other way around radix sort of cpu, or parallel
> merge sort of a gpu is very slow and often worse than serial implementations
> ****
>
> ** **
>
> Sadly it is well past a jit to take the code for one flavor of sort and
> transform it into the other,  while  it would be pretty simple for a
> run-time to pick a good sort, if only it knew that a sort was going on.
> Run times would not be required to do  so  here since they can always pick
> a slow sort. ****
>
> ** **
>
> I know this is a slippery road, since once you add another prim, adding
> more prims becomes ever easier  but sorting seems pretty important.   And
> array already has a sort, so adding a version that works on ParallelArrays
> does not seem so bad.****
>
> ** **
>
> What do you guys think?****
>
> ** **
>
> ** **
>  ------------------------------
>  This email message is for the sole use of the intended recipient(s) and
> may contain confidential information.  Any unauthorized review, use,
> disclosure or distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message.
>  ------------------------------
>
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130409/b8c26810/attachment.html>

# Herhut, Stephan A (12 years ago)

Rick is travelling, so let me chime in.

We have discussed this back and forth but have not come to a conclusion. Generally, we agree that adding a sort primitive makes a lot of sense, in particular as the Array object in JavaScript already has a sort method. Also, as you too mentioned, implementing an efficient sort as a library function without knowing the details of the parallel hardware used is difficult, to say the least. So sort ticks all the boxes to become a primitive.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript's existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort. An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

The tradeoff between these two approaches, and probably other designs, is hard to judge without knowing what sort is used for. So we decided to wait for some good use cases before deciding on a specific design.

Sorry, no answers only further questions. Stephan

From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin Sent: Monday, April 08, 2013 7:46 AM To: es-discuss at mozilla.org Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed that Sorting using parallelArrays looks like a missing primitive operation.

What do you guys think?

Rick is travelling, so let me chime in.

We have discussed this back and forth but have not come to a conclusion. Generally, we agree that adding a sort primitive makes a lot of sense, in particular as the Array object in JavaScript already has a sort method. Also, as you too mentioned, implementing an efficient sort as a library function without knowing the details of the parallel hardware used is difficult, to say the least. So sort ticks all the boxes to become a primitive.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript's existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort.
An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

The tradeoff between these two approaches, and probably other designs, is hard to judge without knowing what sort is used for. So we decided to wait for some good use cases before deciding on a specific design.

Sorry, no answers only further questions.
  Stephan

From: es-discuss-bounces at mozilla.org [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin
Sent: Monday, April 08, 2013 7:46 AM
To: es-discuss at mozilla.org
Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I noticed  that Sorting using parallelArrays looks like a missing primitive  operation.

Sorting is  special because the good (high performance) algorithm on a gpu is radix sort, while the good (high performance) algorithm on the cpu is parallel merge sort,  The other way around radix sort of cpu, or parallel merge sort of a gpu is very slow and often worse than serial implementations

Sadly it is well past a jit to take the code for one flavor of sort and transform it into the other,  while  it would be pretty simple for a run-time to pick a good sort, if only it knew that a sort was going on.  Run times would not be required to do  so  here since they can always pick a slow sort.

I know this is a slippery road, since once you add another prim, adding more prims becomes ever easier  but sorting seems pretty important.   And array already has a sort, so adding a version that works on ParallelArrays does not seem so bad.

What do you guys think?


________________________________
This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130411/346e52b1/attachment-0001.html>

# Norm Rubin (12 years ago)

Good set of questions about sort, thanks for looking at this carefully. You guys have been great about responding to my emails.

One sorting application that I would expect is physics particle simulation, where the code might sort objects based on distance from the viewer. This could be used so that far away objects get rendered smaller.

Another might be computing a histogram over a visual image - sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok to me- even if implementations can only optimize for some comparisons

One interesting question you brought up as part of sorting by key is pretty tricky.

In general parallel data arrays would like to be stored in transposed order from classic cpu styles. We usually call this array of structures or structures of arrays Do we sort an array of objects each with multiple fields stored one object after another or do we sort multiple arrays, with all the values of field1 followed by all the values of field2 etc

I'd hope we allow implementations to reorder the data within a parallelArray as they like without needing to expose the layout to developers. Thrust, on the other hand, pushes the layout to the developer and thus has a significant set of transpose routines.

Just for a point of reference there are 8 versions of sort in the thrust library. As you can see, Thrust never aimed to be a minimal set it just gained operations as applications appeared. AMD has a similar library called bolt which uses the same interface. Two different parallel libraries with the same routines does suggest that this set is enough to do useful work.

Internally thrust notices that the compare is < on primitive types and uses radix sort (on the gpu).

Thurst includes:

Stable and unstable forms
Ascending only or with a comparison function
Single array or a pair of keys and values

 Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

 Sort (array)- with a comparison function

 Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

// an example of key sorting
#include <thrust/sort.h>
  const int N = 6;
  int    keys[N] = {  1,   4,   2,   8,   5,   7};
  char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
  thrust::sort_by_key(keys, keys + N, values, thrust::greater<int>());
  // keys is now   {  8,   7,   5,   4,   2,   1}
  // values is now {'d', 'f', 'e', 'b', 'c', 'a'}

sort.h: docs.thrust.googlecode.com/hg/sort_8h.html
thrust::sort_by_key: docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24
thrust::greater: docs.thrust.googlecode.com/hg/structthrust_1_1greater.html

 Sort by key with a  comparison function

5-8) the stable forms

Based on the thrust primitives, one parallel array sort might be

Sort an array of objects Besides the array the arguments might be

a flag indicating if the sort must be stable
an optional user compare function that defaulted to < on primitive types

Good set of questions about sort,  thanks for looking at this carefully.  You guys have  been great about responding to my emails.



One sorting application that I would expect is physics particle simulation, where the code  might sort objects based on distance from the viewer. This could be used so that  far away objects get rendered smaller.

Another might be  computing a histogram  over a visual image - sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok  to me- even if implementations can only optimize for  some  comparisons

One  interesting  question you brought up as  part of sorting by key is pretty tricky.

In general parallel data arrays  would like to be stored in transposed  order from classic cpu styles.
We usually call this array of structures or structures of arrays
Do we sort  an array of objects each with multiple fields stored  one object after another or do we sort multiple arrays,  with all the values of field1 followed by all the values of field2 etc

I'd hope we allow  implementations to reorder the data within a parallelArray as they like without needing to expose the layout to developers.  Thrust, on the other hand, pushes the layout  to the developer and thus has a significant set of transpose routines.

Just for a point of reference there are  8  versions of sort in the thrust library. As  you can see, Thrust never aimed to be a minimal set it just gained operations as applications appeared. AMD has a similar library called bolt which uses the same interface.
Two different parallel  libraries with the same routines does suggest that this set is enough  to do useful work.


Internally thrust notices  that the compare is  <  on primitive  types and  uses radix sort (on the gpu).

Thurst includes:
  Stable and unstable forms
Ascending only or with a comparison function
Single array or a pair of keys and values


1)      Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

2)      Sort (array)- with a comparison function

3)      Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

// an example of key sorting
#include <thrust/sort.h<http://docs.thrust.googlecode.com/hg/sort_8h.html>>
  const int N = 6;
  int    keys[N] = {  1,   4,   2,   8,   5,   7};
  char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
  thrust::sort_by_key<http://docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24>(keys, keys + N, values, thrust::greater<int><http://docs.thrust.googlecode.com/hg/structthrust_1_1greater.html>());
  // keys is now   {  8,   7,   5,   4,   2,   1}
  // values is now {'d', 'f', 'e', 'b', 'c', 'a'}




4)      Sort by key with a  comparison function

5-8 the stable forms


Based on the thrust primitives, one  parallel array sort  might be

Sort an array of objects
Besides the array the arguments might be
- a flag indicating if the sort must be  stable
- an optional user compare function that defaulted to < on primitive types


From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com]
Sent: Thursday, April 11, 2013 1:09 PM
To: Norm Rubin; es-discuss at mozilla.org
Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

We have discussed this back and forth but have not come to a conclusion. Generally, we agree that adding a sort primitive makes a lot of sense, in particular as the Array object in JavaScript already has a sort method. Also, as you too mentioned, implementing an efficient sort as a library function without knowing the details of the parallel hardware used is difficult, to say the least. So sort ticks all the boxes to become a primitive.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript's existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort.
An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

The tradeoff between these two approaches, and probably other designs, is hard to judge without knowing what sort is used for. So we decided to wait for some good use cases before deciding on a specific design.

Sorry, no answers only further questions.
  Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin
Sent: Monday, April 08, 2013 7:46 AM
To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I noticed  that Sorting using parallelArrays looks like a missing primitive  operation.

Sorting is  special because the good (high performance) algorithm on a gpu is radix sort, while the good (high performance) algorithm on the cpu is parallel merge sort,  The other way around radix sort of cpu, or parallel merge sort of a gpu is very slow and often worse than serial implementations

Sadly it is well past a jit to take the code for one flavor of sort and transform it into the other,  while  it would be pretty simple for a run-time to pick a good sort, if only it knew that a sort was going on.  Run times would not be required to do  so  here since they can always pick a slow sort.

I know this is a slippery road, since once you add another prim, adding more prims becomes ever easier  but sorting seems pretty important.   And array already has a sort, so adding a version that works on ParallelArrays does not seem so bad.

What do you guys think?


________________________________
This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130411/0e828cee/attachment-0001.html>

# Kevin Gadd (12 years ago)

One simple-ish real world example of a use case that can benefit from a parallel sort:

It's common in games and other realtime rendering scenarios to want to sort your queue of rendering operations by various attributes, in order to minimize the number of hardware state changes and allow you to batch up operations.

The proposed compromise of providing a function that maps the objects being sorted to 'sort keys' seems a suitable choice for this kind of scenario - the draw calls can have a sort key implicitly (like described in that blog post), or you can trivially manufacture one (that's what I do). This is definitely a use case where only being able to sort primitives would not be adequate. However, a sort primitive that operates on a pair of arrays would also suffice - Norm calls this 'sort by key', and the .NET BCL exposes this as an overload of Array.Sort. It's quite useful and it seems like it might be feasible to expose this, even if only the key-sorting occurs in parallel (and the value sorting can't because the values are ordinary JS objects).

One simple-ish real world example of a use case that can benefit from a
parallel sort:

It's common in games and other realtime rendering scenarios to want to sort
your queue of rendering operations by various attributes, in order to
minimize the number of hardware state changes and allow you to batch up
operations.

http://realtimecollisiondetection.net/blog/?p=86 is an article that
describes an approach used to generate 64-bit 'sort keys' for drawing
operations that was used in some current-gen console games. At present, I
use a similar technique to sort drawing operations in my buffers, and more
importantly, I do all the sorting and preparation operations in parallel.
For complex scenes I could imagine it being quite common to want to
leverage parallel machinery (or even GPU hardware) to sort the thousands of
individual drawing operations in a given scene before clustering them up
into larger hardware operations.

The proposed compromise of providing a function that maps the objects being
sorted to 'sort keys' seems a suitable choice for this kind of scenario -
the draw calls can have a sort key implicitly (like described in that blog
post), or you can trivially manufacture one (that's what I do). This is
definitely a use case where only being able to sort primitives would not be
adequate. However, a sort primitive that operates on a pair of arrays would
also suffice - Norm calls this 'sort by key', and the .NET BCL exposes this
as an overload of Array.Sort. It's quite useful and it seems like it might
be feasible to expose this, even if only the key-sorting occurs in parallel
(and the value sorting can't because the values are ordinary JS objects).


On Thu, Apr 11, 2013 at 11:35 AM, Norm Rubin <nrubin at nvidia.com> wrote:

> Good set of questions about sort,  thanks for looking at this carefully.
> You guys have  been great about responding to my emails.****
>
> ** **
>
> ** **
>
> ** **
>
> One sorting application that I would expect is physics particle
> simulation, where the code  might sort objects based on distance from the
> viewer. This could be used so that  far away objects get rendered smaller.
> ****
>
> ** **
>
> Another might be  computing a histogram  over a visual image – sort the
> data first so nearby values will go into the same bucket, without needing
> atomic synchronization.****
>
> ** **
>
> Sorting with an optional compare seems ok  to me- even if implementations
> can only optimize for  some  comparisons ****
>
> ** **
>
> One  interesting  question you brought up as  part of sorting by key is
> pretty tricky.****
>
> ** **
>
> In general parallel data arrays  would like to be stored in transposed
>  order from classic cpu styles.****
>
> We usually call this array of structures or structures of arrays****
>
> Do we sort  an array of objects each with multiple fields stored  one
> object after another or do we sort multiple arrays,  with all the values of
> field1 followed by all the values of field2 etc ****
>
> ** **
>
> I’d hope we allow  implementations to reorder the data within a
> parallelArray as they like without needing to expose the layout to
> developers.  Thrust, on the other hand, pushes the layout  to the developer
> and thus has a significant set of transpose routines.****
>
> ** **
>
> Just for a point of reference there are  8  versions of sort in the thrust
> library. As  you can see, Thrust never aimed to be a minimal set it just
> gained operations as applications appeared. AMD has a similar library
> called bolt which uses the same interface.****
>
> Two different parallel  libraries with the same routines does suggest that
> this set is enough  to do useful work.****
>
> ** **
>
> ** **
>
> Internally thrust notices  that the compare is  <  on primitive  types and
>  uses radix sort (on the gpu).****
>
> ** **
>
> Thurst includes:****
>
>   Stable and unstable forms****
>
> Ascending only or with a comparison function****
>
> Single array or a pair of keys and values ****
>
> ** **
>
> **1)      **Sort(array)  - sorts the array into ascending order, not
> guaranteed to be stable****
>
> **2)      **Sort (array)– with a comparison function****
>
> **3)      **Sort by key (keys, values) returns both the reordered keys
> and  the reordered values, sorted by the array keys. This one is kind of
> confusing so here is an example****
>
> ** **
>
> // an example of key sorting****
>
> #include <thrust/sort.h<http://docs.thrust.googlecode.com/hg/sort_8h.html>
> >****
>
>   const int N = 6;****
>
>   int    keys[N] = {  1,   4,   2,   8,   5,   7};****
>
>   char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};****
>
>   thrust::sort_by_key<http://docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24>(keys,
> keys + N, values, thrust::greater<int><http://docs.thrust.googlecode.com/hg/structthrust_1_1greater.html>
> ());****
>
>   // keys is now   {  8,   7,   5,   4,   2,   1}****
>
>   // values is now {'d', 'f', 'e', 'b', 'c', 'a'}****
>
> ** **
>
> ** **
>
> **4)      **Sort by key with a  comparison function****
>
> ** **
>
> 5-8 the stable forms ****
>
> ** **
>
> ** **
>
> Based on the thrust primitives, one  parallel array sort  might be ****
>
> ** **
>
> Sort an array of objects ****
>
> Besides the array the arguments might be ****
>
> - a flag indicating if the sort must be  stable****
>
> - an optional user compare function that defaulted to < on primitive types
> ****
>
> ** **
>
> ** **
>
> *From:* Herhut, Stephan A [mailto:stephan.a.herhut at intel.com]
> *Sent:* Thursday, April 11, 2013 1:09 PM
> *To:* Norm Rubin; es-discuss at mozilla.org
> *Subject:* RE: parallel arrays and sorting****
>
> ** **
>
> Rick is travelling, so let me chime in.****
>
> ** **
>
> We have discussed this back and forth but have not come to a conclusion.
> Generally, we agree that adding a sort primitive makes a lot of sense, in
> particular as the Array object in JavaScript already has a sort method.
> Also, as you too mentioned, implementing an efficient sort as a library
> function without knowing the details of the parallel hardware used is
> difficult, to say the least. So sort ticks all the boxes to become a
> primitive.****
>
> ** **
>
> The other, and arguably more difficult question, is what a sort method
> should look like. If we take JavaScript’s existing Array.sort, the sort
> method would get an (optional) comparator function. However, using a
> comparator would preclude the use of radix sort. ****
>
> An alternative would be to implement sorting of primitive types only. This
> brings back more choice in sort algorithms but limits use. For such a
> design, we considered a function as optional argument to sort that, given a
> value from the ParallelArray to be sorted, returns a key used for
> comparison, which again needs to be a primitive. This would at least enable
> sorting of objects by a field and, at some runtime cost, sorting of general
> data. ****
>
> ** **
>
> The tradeoff between these two approaches, and probably other designs, is
> hard to judge without knowing what sort is used for. So we decided to wait
> for some good use cases before deciding on a specific design.****
>
> ** **
>
> Sorry, no answers only further questions.****
>
>   Stephan****
>
> ** **
>
> *From:* es-discuss-bounces at mozilla.org [
> mailto:es-discuss-bounces at mozilla.org <es-discuss-bounces at mozilla.org>] *On
> Behalf Of *Norm Rubin
> *Sent:* Monday, April 08, 2013 7:46 AM
> *To:* es-discuss at mozilla.org
> *Subject:* parallel arrays and sorting****
>
> ** **
>
> In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I
> noticed  that Sorting using parallelArrays looks like a missing primitive
>  operation.****
>
> ** **
>
> Sorting is  special because the good (high performance) algorithm on a gpu
> is radix sort, while the good (high performance) algorithm on the cpu is
> parallel merge sort,  The other way around radix sort of cpu, or parallel
> merge sort of a gpu is very slow and often worse than serial implementations
> ****
>
> ** **
>
> Sadly it is well past a jit to take the code for one flavor of sort and
> transform it into the other,  while  it would be pretty simple for a
> run-time to pick a good sort, if only it knew that a sort was going on.
> Run times would not be required to do  so  here since they can always pick
> a slow sort. ****
>
> ** **
>
> I know this is a slippery road, since once you add another prim, adding
> more prims becomes ever easier  but sorting seems pretty important.   And
> array already has a sort, so adding a version that works on ParallelArrays
> does not seem so bad.****
>
> ** **
>
> What do you guys think?****
>
> ** **
>
> ** **
> ------------------------------
>
> This email message is for the sole use of the intended recipient(s) and
> may contain confidential information.  Any unauthorized review, use,
> disclosure or distribution is prohibited.  If you are not the intended
> recipient, please contact the sender by reply email and destroy all copies
> of the original message. ****
> ------------------------------
>
> _______________________________________________
> es-discuss mailing list
> es-discuss at mozilla.org
> https://mail.mozilla.org/listinfo/es-discuss
>
>


-- 
-kg
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130411/404b9b2e/attachment-0001.html>

# Herhut, Stephan A (12 years ago)

Thanks for the overview of thrust. I had completely missed the issue of stability before. This needs careful consideration and I have not yet made up my mind. I understand the desire to have unstable sort for performance reasons, but it also introduces a form of non-determinism.

Histogram can also be expressed using scatter in out API. Assuming some array A of color values, you can do

A.map((v) => 1).scatter(A.map((v) => bucket(v)), 0, (a,b) => a+b, numberOfBuckets)

So, essentially create a vector of 1's and then scatter those to your buckets based on their bucket value, using addition as the conflict resolution. Sorting the scatter indices indeed makes a lot of sense and an implementation would be free to first sort the vector of scatter indices to reduce synchronization cost. I also thought about an explicit version directly using sort and some other primitives of our API but I could not come up with a formulation that would benefit from sorting.

The array of structures (AoS) vs. structures of arrays (SoA) transformation is problematic in JavaScript. In C like languages one has a lot of structural information, in particular the fields and types of a structure are statically known and homogeneous across an array. In JavaScript, on the other hand, objects can be used like structures but in general there may be a lot more dynamism: Arrays may contain objects with varying labels, properties may have different types and, probably worst of all, the programmer can add new properties to objects at any time. So even though a structure may start out as a homogeneous array, this can change during its lifetime. Analyzing these data structures and tracking the 'anomalies' seems not straight forward to me and I am not convinced that this can be implemented efficiently.

However, there is also good news. The binary data proposal (see harmony:binary_data) brings C like data types, including arrays of structures, to JavaScript and we have proposed an extension to Parallel JavaScript that makes use of binary data as storage format (see strawman:data_parallelism#support_for_binary_data). In such a setting, doing automatic AoS to SoA transformations is possible and our programming model does not preclude this.

Stephan

From: Norm Rubin [mailto:nrubin at nvidia.com] Sent: Thursday, April 11, 2013 11:36 AM To: Herhut, Stephan A; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Cc: dev-tech-js-engine-rivertrail at lists.mozilla.org<mailto:dev-tech-js-engine-rivertrail at lists.mozilla.org>

Subject: RE: parallel arrays and sorting

Good set of questions about sort, thanks for looking at this carefully. You guys have been great about responding to my emails.

Another might be computing a histogram over a visual image - sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok to me- even if implementations can only optimize for some comparisons

One interesting question you brought up as part of sorting by key is pretty tricky.

Internally thrust notices that the compare is < on primitive types and uses radix sort (on the gpu).

Thurst includes: Stable and unstable forms Ascending only or with a comparison function Single array or a pair of keys and values

 Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

 Sort (array)- with a comparison function

 Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

// an example of key sorting #include <thrust/sort.hdocs.thrust.googlecode.com/hg/sort_8h.html> const int N = 6; int keys[N] = { 1, 4, 2, 8, 5, 7}; char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'}; thrust::sort_by_keydocs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24(keys, keys + N, values, thrust::greater<int>docs.thrust.googlecode.com/hg/structthrust_1_1greater.html()); // keys is now { 8, 7, 5, 4, 2, 1} // values is now {'d', 'f', 'e', 'b', 'c', 'a'}

 Sort by key with a  comparison function

5-8 the stable forms

Based on the thrust primitives, one parallel array sort might be

Sort an array of objects Besides the array the arguments might be

a flag indicating if the sort must be stable
an optional user compare function that defaulted to < on primitive types

From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com] Sent: Thursday, April 11, 2013 1:09 PM To: Norm Rubin; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

Sorry, no answers only further questions. Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin

Sent: Monday, April 08, 2013 7:46 AM To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed that Sorting using parallelArrays looks like a missing primitive operation.

What do you guys think?

Thanks for the overview of thrust. I had completely missed the issue of stability before. This needs careful consideration and I have not yet made up my mind. I understand the desire to have unstable sort for performance reasons, but it also introduces a form of non-determinism.

Histogram can also be expressed using scatter in out API. Assuming some array A of color values, you can do

A.map((v) => 1).scatter(A.map((v) => bucket(v)), 0, (a,b) => a+b, numberOfBuckets)

So, essentially create a vector of 1's and then scatter those to your buckets based on their bucket value, using addition as the conflict resolution. Sorting the scatter indices indeed makes a lot of sense and an implementation would be free to first sort the vector of scatter indices to reduce synchronization cost. I also thought about an explicit version directly using sort and some other primitives of our API but I could not come up with a formulation that would benefit from sorting.

The array of structures (AoS) vs. structures of arrays (SoA) transformation is problematic in JavaScript. In C like languages one has a lot of structural information, in particular the fields and types of a structure are statically known and homogeneous across an array. In JavaScript, on the other hand, objects can be used like structures but in general there may be a lot more dynamism: Arrays may contain objects with varying labels, properties may have different types and, probably worst of all, the programmer can add new properties to objects at any time. So even though a structure may start out as a homogeneous array, this can change during its lifetime. Analyzing these data structures and tracking the 'anomalies' seems not straight forward to me and I am not convinced that this can be implemented efficiently.

However, there is also good news. The binary data proposal (see http://wiki.ecmascript.org/doku.php?id=harmony:binary_data) brings C like data types, including arrays of structures, to JavaScript and we have proposed an extension to Parallel JavaScript that makes use of binary data as storage format (see http://wiki.ecmascript.org/doku.php?id=strawman:data_parallelism#support_for_binary_data). In such a setting, doing automatic AoS to SoA transformations is possible and our programming model does not preclude this.

Stephan

From: Norm Rubin [mailto:nrubin at nvidia.com]
Sent: Thursday, April 11, 2013 11:36 AM
To: Herhut, Stephan A; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Cc: dev-tech-js-engine-rivertrail at lists.mozilla.org<mailto:dev-tech-js-engine-rivertrail at lists.mozilla.org>
Subject: RE: parallel arrays and sorting

Good set of questions about sort,  thanks for looking at this carefully.  You guys have  been great about responding to my emails.



One sorting application that I would expect is physics particle simulation, where the code  might sort objects based on distance from the viewer. This could be used so that  far away objects get rendered smaller.

Another might be  computing a histogram  over a visual image - sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok  to me- even if implementations can only optimize for  some  comparisons

One  interesting  question you brought up as  part of sorting by key is pretty tricky.

In general parallel data arrays  would like to be stored in transposed  order from classic cpu styles.
We usually call this array of structures or structures of arrays
Do we sort  an array of objects each with multiple fields stored  one object after another or do we sort multiple arrays,  with all the values of field1 followed by all the values of field2 etc

I'd hope we allow  implementations to reorder the data within a parallelArray as they like without needing to expose the layout to developers.  Thrust, on the other hand, pushes the layout  to the developer and thus has a significant set of transpose routines.

Just for a point of reference there are  8  versions of sort in the thrust library. As  you can see, Thrust never aimed to be a minimal set it just gained operations as applications appeared. AMD has a similar library called bolt which uses the same interface.
Two different parallel  libraries with the same routines does suggest that this set is enough  to do useful work.


Internally thrust notices  that the compare is  <  on primitive  types and  uses radix sort (on the gpu).

Thurst includes:
  Stable and unstable forms
Ascending only or with a comparison function
Single array or a pair of keys and values


1)      Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

2)      Sort (array)- with a comparison function

3)      Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

// an example of key sorting
#include <thrust/sort.h<http://docs.thrust.googlecode.com/hg/sort_8h.html>>
  const int N = 6;
  int    keys[N] = {  1,   4,   2,   8,   5,   7};
  char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
  thrust::sort_by_key<http://docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24>(keys, keys + N, values, thrust::greater<int><http://docs.thrust.googlecode.com/hg/structthrust_1_1greater.html>());
  // keys is now   {  8,   7,   5,   4,   2,   1}
  // values is now {'d', 'f', 'e', 'b', 'c', 'a'}




4)      Sort by key with a  comparison function

5-8 the stable forms


Based on the thrust primitives, one  parallel array sort  might be

Sort an array of objects
Besides the array the arguments might be
- a flag indicating if the sort must be  stable
- an optional user compare function that defaulted to < on primitive types


From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com]
Sent: Thursday, April 11, 2013 1:09 PM
To: Norm Rubin; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

We have discussed this back and forth but have not come to a conclusion. Generally, we agree that adding a sort primitive makes a lot of sense, in particular as the Array object in JavaScript already has a sort method. Also, as you too mentioned, implementing an efficient sort as a library function without knowing the details of the parallel hardware used is difficult, to say the least. So sort ticks all the boxes to become a primitive.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript's existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort.
An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

The tradeoff between these two approaches, and probably other designs, is hard to judge without knowing what sort is used for. So we decided to wait for some good use cases before deciding on a specific design.

Sorry, no answers only further questions.
  Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin
Sent: Monday, April 08, 2013 7:46 AM
To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I noticed  that Sorting using parallelArrays looks like a missing primitive  operation.

Sorting is  special because the good (high performance) algorithm on a gpu is radix sort, while the good (high performance) algorithm on the cpu is parallel merge sort,  The other way around radix sort of cpu, or parallel merge sort of a gpu is very slow and often worse than serial implementations

Sadly it is well past a jit to take the code for one flavor of sort and transform it into the other,  while  it would be pretty simple for a run-time to pick a good sort, if only it knew that a sort was going on.  Run times would not be required to do  so  here since they can always pick a slow sort.

I know this is a slippery road, since once you add another prim, adding more prims becomes ever easier  but sorting seems pretty important.   And array already has a sort, so adding a version that works on ParallelArrays does not seem so bad.

What do you guys think?


________________________________
This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130419/9c991a3a/attachment-0001.html>

# Herhut, Stephan A (12 years ago)

Thanks, this indeed is an interesting use case.

Previously, I thought that using a function to map objects to a sort key is at least as expressive as having two arrays, as a naïve implementation of the mapping function could just be to read the appropriate value from the key array (assuming the mapping function would know the index). However, such a solution does not generate a sorted key array and thus the correlation between the sorted objects and their keys is lost.

After reading the blog post you linked, I am still not sure whether that actually is an issue. It seems for a rendering pipeline only the sorted objects are of importance.

So, potential runtime efficiency issues aside, do you know a use case where one also needs the array of sorted keys?

Stephan

From: Kevin Gadd [mailto:kevin.gadd at gmail.com] Sent: Thursday, April 11, 2013 11:43 AM To: Norm Rubin Cc: Herhut, Stephan A; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>; dev-tech-js-engine-rivertrail at lists.mozilla.org<mailto:dev-tech-js-engine-rivertrail at lists.mozilla.org>

Subject: Re: parallel arrays and sorting

One simple-ish real world example of a use case that can benefit from a parallel sort: It's common in games and other realtime rendering scenarios to want to sort your queue of rendering operations by various attributes, in order to minimize the number of hardware state changes and allow you to batch up operations.

realtimecollisiondetection.net/blog/?p=86 is an article that describes an approach used to generate 64-bit 'sort keys' for drawing operations that was used in some current-gen console games. At present, I use a similar technique to sort drawing operations in my buffers, and more importantly, I do all the sorting and preparation operations in parallel. For complex scenes I could imagine it being quite common to want to leverage parallel machinery (or even GPU hardware) to sort the thousands of individual drawing operations in a given scene before clustering them up into larger hardware operations. The proposed compromise of providing a function that maps the objects being sorted to 'sort keys' seems a suitable choice for this kind of scenario - the draw calls can have a sort key implicitly (like described in that blog post), or you can trivially manufacture one (that's what I do). This is definitely a use case where only being able to sort primitives would not be adequate. However, a sort primitive that operates on a pair of arrays would also suffice - Norm calls this 'sort by key', and the .NET BCL exposes this as an overload of Array.Sort. It's quite useful and it seems like it might be feasible to expose this, even if only the key-sorting occurs in parallel (and the value sorting can't because the values are ordinary JS objects).

On Thu, Apr 11, 2013 at 11:35 AM, Norm Rubin <nrubin at nvidia.com<mailto:nrubin at nvidia.com>> wrote:

Good set of questions about sort, thanks for looking at this carefully. You guys have been great about responding to my emails.

Another might be computing a histogram over a visual image – sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok to me- even if implementations can only optimize for some comparisons

One interesting question you brought up as part of sorting by key is pretty tricky.

I’d hope we allow implementations to reorder the data within a parallelArray as they like without needing to expose the layout to developers. Thrust, on the other hand, pushes the layout to the developer and thus has a significant set of transpose routines.

Internally thrust notices that the compare is < on primitive types and uses radix sort (on the gpu).

Thurst includes: Stable and unstable forms Ascending only or with a comparison function Single array or a pair of keys and values

 Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

 Sort (array)– with a comparison function

 Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

 Sort by key with a  comparison function

5-8 the stable forms

Based on the thrust primitives, one parallel array sort might be

Sort an array of objects Besides the array the arguments might be

a flag indicating if the sort must be stable
an optional user compare function that defaulted to < on primitive types

From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com<mailto:stephan.a.herhut at intel.com>]

Sent: Thursday, April 11, 2013 1:09 PM To: Norm Rubin; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript’s existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort. An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

Sorry, no answers only further questions. Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin

Sent: Monday, April 08, 2013 7:46 AM To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed that Sorting using parallelArrays looks like a missing primitive operation.

What do you guys think?

Thanks, this indeed is an interesting use case.

Previously, I thought that using a function to map objects to a sort key is at least as expressive as having two arrays, as a naïve implementation of the mapping function could just be to read the appropriate value from the key array (assuming the mapping function would know the index). However, such a solution does not generate a sorted key array and thus the correlation between the sorted objects and their keys is lost.

After reading the blog post you linked, I am still not sure whether that actually is an issue. It seems for a rendering pipeline only the sorted objects are of importance.

So, potential runtime efficiency issues aside, do you know a use case where one also needs the array of sorted keys?

Stephan

From: Kevin Gadd [mailto:kevin.gadd at gmail.com]
Sent: Thursday, April 11, 2013 11:43 AM
To: Norm Rubin
Cc: Herhut, Stephan A; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>; dev-tech-js-engine-rivertrail at lists.mozilla.org<mailto:dev-tech-js-engine-rivertrail at lists.mozilla.org>
Subject: Re: parallel arrays and sorting

One simple-ish real world example of a use case that can benefit from a parallel sort:
It's common in games and other realtime rendering scenarios to want to sort your queue of rendering operations by various attributes, in order to minimize the number of hardware state changes and allow you to batch up operations.

http://realtimecollisiondetection.net/blog/?p=86 is an article that describes an approach used to generate 64-bit 'sort keys' for drawing operations that was used in some current-gen console games. At present, I use a similar technique to sort drawing operations in my buffers, and more importantly, I do all the sorting and preparation operations in parallel. For complex scenes I could imagine it being quite common to want to leverage parallel machinery (or even GPU hardware) to sort the thousands of individual drawing operations in a given scene before clustering them up into larger hardware operations.
The proposed compromise of providing a function that maps the objects being sorted to 'sort keys' seems a suitable choice for this kind of scenario - the draw calls can have a sort key implicitly (like described in that blog post), or you can trivially manufacture one (that's what I do). This is definitely a use case where only being able to sort primitives would not be adequate. However, a sort primitive that operates on a pair of arrays would also suffice - Norm calls this 'sort by key', and the .NET BCL exposes this as an overload of Array.Sort. It's quite useful and it seems like it might be feasible to expose this, even if only the key-sorting occurs in parallel (and the value sorting can't because the values are ordinary JS objects).

On Thu, Apr 11, 2013 at 11:35 AM, Norm Rubin <nrubin at nvidia.com<mailto:nrubin at nvidia.com>> wrote:
Good set of questions about sort,  thanks for looking at this carefully.  You guys have  been great about responding to my emails.



One sorting application that I would expect is physics particle simulation, where the code  might sort objects based on distance from the viewer. This could be used so that  far away objects get rendered smaller.

Another might be  computing a histogram  over a visual image – sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok  to me- even if implementations can only optimize for  some  comparisons

One  interesting  question you brought up as  part of sorting by key is pretty tricky.

In general parallel data arrays  would like to be stored in transposed  order from classic cpu styles.
We usually call this array of structures or structures of arrays
Do we sort  an array of objects each with multiple fields stored  one object after another or do we sort multiple arrays,  with all the values of field1 followed by all the values of field2 etc

I’d hope we allow  implementations to reorder the data within a parallelArray as they like without needing to expose the layout to developers.  Thrust, on the other hand, pushes the layout  to the developer and thus has a significant set of transpose routines.

Just for a point of reference there are  8  versions of sort in the thrust library. As  you can see, Thrust never aimed to be a minimal set it just gained operations as applications appeared. AMD has a similar library called bolt which uses the same interface.
Two different parallel  libraries with the same routines does suggest that this set is enough  to do useful work.


Internally thrust notices  that the compare is  <  on primitive  types and  uses radix sort (on the gpu).

Thurst includes:
  Stable and unstable forms
Ascending only or with a comparison function
Single array or a pair of keys and values


1)      Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

2)      Sort (array)– with a comparison function

3)      Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

// an example of key sorting
#include <thrust/sort.h<http://docs.thrust.googlecode.com/hg/sort_8h.html>>
  const int N = 6;
  int    keys[N] = {  1,   4,   2,   8,   5,   7};
  char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
  thrust::sort_by_key<http://docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24>(keys, keys + N, values, thrust::greater<int><http://docs.thrust.googlecode.com/hg/structthrust_1_1greater.html>());
  // keys is now   {  8,   7,   5,   4,   2,   1}
  // values is now {'d', 'f', 'e', 'b', 'c', 'a'}




4)      Sort by key with a  comparison function

5-8 the stable forms


Based on the thrust primitives, one  parallel array sort  might be

Sort an array of objects
Besides the array the arguments might be
- a flag indicating if the sort must be  stable
- an optional user compare function that defaulted to < on primitive types


From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com<mailto:stephan.a.herhut at intel.com>]
Sent: Thursday, April 11, 2013 1:09 PM
To: Norm Rubin; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

We have discussed this back and forth but have not come to a conclusion. Generally, we agree that adding a sort primitive makes a lot of sense, in particular as the Array object in JavaScript already has a sort method. Also, as you too mentioned, implementing an efficient sort as a library function without knowing the details of the parallel hardware used is difficult, to say the least. So sort ticks all the boxes to become a primitive.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript’s existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort.
An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

The tradeoff between these two approaches, and probably other designs, is hard to judge without knowing what sort is used for. So we decided to wait for some good use cases before deciding on a specific design.

Sorry, no answers only further questions.
  Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin
Sent: Monday, April 08, 2013 7:46 AM
To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I noticed  that Sorting using parallelArrays looks like a missing primitive  operation.

Sorting is  special because the good (high performance) algorithm on a gpu is radix sort, while the good (high performance) algorithm on the cpu is parallel merge sort,  The other way around radix sort of cpu, or parallel merge sort of a gpu is very slow and often worse than serial implementations

Sadly it is well past a jit to take the code for one flavor of sort and transform it into the other,  while  it would be pretty simple for a run-time to pick a good sort, if only it knew that a sort was going on.  Run times would not be required to do  so  here since they can always pick a slow sort.

I know this is a slippery road, since once you add another prim, adding more prims becomes ever easier  but sorting seems pretty important.   And array already has a sort, so adding a version that works on ParallelArrays does not seem so bad.

What do you guys think?


________________________________
This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
________________________________

_______________________________________________
es-discuss mailing list
es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
https://mail.mozilla.org/listinfo/es-discuss



--
-kg
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130419/78b52e7f/attachment-0001.html>

# Norm Rubin (12 years ago)

are looking like a good collection of examples for experimenting with ParallelJS

Here are some histogram options

 dense histograms where some bins have a zero entry and

  sparse histograms where only the non-zero bins are stored

Sometimes you know the number of bins beforehand but in general, you need to look at the data to find the min and max

For example given the data set [2 1 0 0 2 2 1 1 1 1 4] contains 2 zeros, 5 ones, and 3 twos and 1 four, is[2 5 3 0 1] using the dense method and[(0,2), (1,5), (2,3), (4,1)] using the sparse method. Since there are no threes, the sparse histogram representation does not contain a bin for that value

One way to store the sparse histogram is as two separate arrays one of keys and one of bin counts [0 1 2 4] and [2 5 3 1] (The soa format )

If the number of bins is relatively small compared to the input size, then binary search based dense might be the best performing while if the number of bins is near the input size then the sparse version might be better

If I just add prims, here are two ways to calculate histograms

Dense Sort the data Find the number of bins, equal to max value plus one Allocate the space to hold the histogram, and fill it with 0,1,2,3.. max value Initially set each histogram bin to upperbound Upper bound is a useful prim Given a sorted array, and a set of values use binary search to find the last position in the sorted array where each data value could be inserted, all the binary search ops happen in parallel (I'm not sure if there is a nice way to express this op in ParallelArrays)

n At this point we have a vector of positions the pair hist[i], hist[i+1] are the indexes in the sorted data that go into bin i In parallel do all the adds, no need for atomics

Your alternative - find the max Allocate the bins Scatter each element to the bin number, handle collisions by an add, works but I suspect it will be a lot slower, because each add has to be atomic and will need synchronization

Another alternative for dense histograms would require knowing a lot of machine details If you knew you were on a cpu not a gpu, you could form a partial histogram per core and then serially combine the results but on a gpu with massive parallelism, the space for all the partial histograms would be problematic so you would need to lay this out carefully - like share some of the partial histograms over blocks of threads

Sparse Sort the data Need to find the number of bins, which is the number of unique values Reduce using sum ( Map if data [i] = data[j] then 0, else 1) - counts the number of values that are different from the prior Allocate two arrays keys and counts Reduce by key - this is a generalization of reduce, Reduce each element with the same key returning the set of keys and the reduced results which sets up both arrays, (another tricky prim to express)

From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com] Sent: Friday, April 19, 2013 12:45 AM To: Norm Rubin; es-discuss at mozilla.org Cc: dev-tech-js-engine-rivertrail at lists.mozilla.org Subject: RE: parallel arrays and sorting

Histogram can also be expressed using scatter in out API. Assuming some array A of color values, you can do

A.map((v) => 1).scatter(A.map((v) => bucket(v)), 0, (a,b) => a+b, numberOfBuckets)

Stephan

From: Norm Rubin [mailto:nrubin at nvidia.com] Sent: Thursday, April 11, 2013 11:36 AM To: Herhut, Stephan A; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Cc: dev-tech-js-engine-rivertrail at lists.mozilla.org<mailto:dev-tech-js-engine-rivertrail at lists.mozilla.org>

Subject: RE: parallel arrays and sorting

Good set of questions about sort, thanks for looking at this carefully. You guys have been great about responding to my emails.

Another might be computing a histogram over a visual image - sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok to me- even if implementations can only optimize for some comparisons

One interesting question you brought up as part of sorting by key is pretty tricky.

Internally thrust notices that the compare is < on primitive types and uses radix sort (on the gpu).

Thurst includes: Stable and unstable forms Ascending only or with a comparison function Single array or a pair of keys and values

 Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

 Sort (array)- with a comparison function

 Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

 Sort by key with a  comparison function

5-8 the stable forms

Based on the thrust primitives, one parallel array sort might be

Sort an array of objects Besides the array the arguments might be

a flag indicating if the sort must be stable
an optional user compare function that defaulted to < on primitive types

From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com] Sent: Thursday, April 11, 2013 1:09 PM To: Norm Rubin; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

Sorry, no answers only further questions. Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin

Sent: Monday, April 08, 2013 7:46 AM To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>

Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library, I noticed that Sorting using parallelArrays looks like a missing primitive operation.

What do you guys think?

Histograms are looking like a good collection of  examples for experimenting with ParallelJS

Here are some histogram options

1)      dense histograms where some bins have a zero entry and

2)       sparse histograms where only the non-zero bins are stored

Sometimes you know the number of bins beforehand but in general, you need to look at the data to find the min and max


For example given the data set    [2 1 0 0 2 2 1 1 1 1 4] contains 2 zeros, 5 ones, and 3 twos and 1 four, is[2 5 3 0 1]
using the dense method and[(0,2), (1,5), (2,3), (4,1)] using the sparse method. Since there are no threes, the sparse histogram representation does not contain a bin for that value

One way to store the sparse histogram is as two separate arrays one of keys and one of bin counts
[0 1 2 4] and [2 5 3 1]  (The soa format )

If the number of bins is relatively small compared to the input size, then binary search based dense might be the best performing  while if the number of bins is near  the input size then the sparse version might be better

If I just add prims,  here are two ways to calculate histograms

Dense
  Sort the data
  Find the number of bins,  equal to max value plus one
  Allocate the space to hold the histogram, and fill it with 0,1,2,3.. max value
  Initially set each histogram bin  to  upperbound
       Upper bound is  a  useful prim
            Given a sorted array, and a set of values use binary search to find the last position in the sorted array where each data value could be inserted, all the binary search ops happen in parallel
(I'm not sure if there is a nice way to express this op in ParallelArrays)

n  At this point we have a vector of positions  the pair hist[i],  hist[i+1] are the indexes in the sorted data that go into bin i
In parallel do all the adds, no need for atomics

Your  alternative -
find the max
Allocate the bins
  Scatter  each element to the bin number,  handle collisions  by an  add,  works but I suspect it will be a lot slower, because each  add has to be atomic and will need synchronization

Another alternative for dense histograms would require knowing a lot of machine details
If you knew you were on a cpu not a gpu,  you could form a partial histogram per core and then serially combine the results but on a gpu with massive parallelism, the space for all the partial histograms would be problematic so you would need to lay this out carefully - like share some of the partial histograms over blocks of threads


Sparse
  Sort the data
  Need to find the number of bins, which is the number of unique values
     Reduce using sum ( Map  if data [i] = data[j] then 0, else 1) - counts the number of values that are different from the prior
Allocate two arrays keys and counts
  Reduce by key - this is a generalization of reduce,  Reduce each element with the same key returning the set of keys and the reduced  results which sets up both arrays,  (another tricky prim to express)



























From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com]
Sent: Friday, April 19, 2013 12:45 AM
To: Norm Rubin; es-discuss at mozilla.org
Cc: dev-tech-js-engine-rivertrail at lists.mozilla.org
Subject: RE: parallel arrays and sorting

Thanks for the overview of thrust. I had completely missed the issue of stability before. This needs careful consideration and I have not yet made up my mind. I understand the desire to have unstable sort for performance reasons, but it also introduces a form of non-determinism.

Histogram can also be expressed using scatter in out API. Assuming some array A of color values, you can do

A.map((v) => 1).scatter(A.map((v) => bucket(v)), 0, (a,b) => a+b, numberOfBuckets)

So, essentially create a vector of 1's and then scatter those to your buckets based on their bucket value, using addition as the conflict resolution. Sorting the scatter indices indeed makes a lot of sense and an implementation would be free to first sort the vector of scatter indices to reduce synchronization cost. I also thought about an explicit version directly using sort and some other primitives of our API but I could not come up with a formulation that would benefit from sorting.

The array of structures (AoS) vs. structures of arrays (SoA) transformation is problematic in JavaScript. In C like languages one has a lot of structural information, in particular the fields and types of a structure are statically known and homogeneous across an array. In JavaScript, on the other hand, objects can be used like structures but in general there may be a lot more dynamism: Arrays may contain objects with varying labels, properties may have different types and, probably worst of all, the programmer can add new properties to objects at any time. So even though a structure may start out as a homogeneous array, this can change during its lifetime. Analyzing these data structures and tracking the 'anomalies' seems not straight forward to me and I am not convinced that this can be implemented efficiently.

However, there is also good news. The binary data proposal (see http://wiki.ecmascript.org/doku.php?id=harmony:binary_data) brings C like data types, including arrays of structures, to JavaScript and we have proposed an extension to Parallel JavaScript that makes use of binary data as storage format (see http://wiki.ecmascript.org/doku.php?id=strawman:data_parallelism#support_for_binary_data). In such a setting, doing automatic AoS to SoA transformations is possible and our programming model does not preclude this.

Stephan

From: Norm Rubin [mailto:nrubin at nvidia.com]
Sent: Thursday, April 11, 2013 11:36 AM
To: Herhut, Stephan A; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Cc: dev-tech-js-engine-rivertrail at lists.mozilla.org<mailto:dev-tech-js-engine-rivertrail at lists.mozilla.org>
Subject: RE: parallel arrays and sorting

Good set of questions about sort,  thanks for looking at this carefully.  You guys have  been great about responding to my emails.



One sorting application that I would expect is physics particle simulation, where the code  might sort objects based on distance from the viewer. This could be used so that  far away objects get rendered smaller.

Another might be  computing a histogram  over a visual image - sort the data first so nearby values will go into the same bucket, without needing atomic synchronization.

Sorting with an optional compare seems ok  to me- even if implementations can only optimize for  some  comparisons

One  interesting  question you brought up as  part of sorting by key is pretty tricky.

In general parallel data arrays  would like to be stored in transposed  order from classic cpu styles.
We usually call this array of structures or structures of arrays
Do we sort  an array of objects each with multiple fields stored  one object after another or do we sort multiple arrays,  with all the values of field1 followed by all the values of field2 etc

I'd hope we allow  implementations to reorder the data within a parallelArray as they like without needing to expose the layout to developers.  Thrust, on the other hand, pushes the layout  to the developer and thus has a significant set of transpose routines.

Just for a point of reference there are  8  versions of sort in the thrust library. As  you can see, Thrust never aimed to be a minimal set it just gained operations as applications appeared. AMD has a similar library called bolt which uses the same interface.
Two different parallel  libraries with the same routines does suggest that this set is enough  to do useful work.


Internally thrust notices  that the compare is  <  on primitive  types and  uses radix sort (on the gpu).

Thurst includes:
  Stable and unstable forms
Ascending only or with a comparison function
Single array or a pair of keys and values


1)      Sort(array)  - sorts the array into ascending order, not guaranteed to be stable

2)      Sort (array)- with a comparison function

3)      Sort by key (keys, values) returns both the reordered keys and  the reordered values, sorted by the array keys. This one is kind of confusing so here is an example

// an example of key sorting
#include <thrust/sort.h<http://docs.thrust.googlecode.com/hg/sort_8h.html>>
  const int N = 6;
  int    keys[N] = {  1,   4,   2,   8,   5,   7};
  char values[N] = {'a', 'b', 'c', 'd', 'e', 'f'};
  thrust::sort_by_key<http://docs.thrust.googlecode.com/hg/group__sorting.html#ga2bb765aeef19f6a04ca8b8ba11efff24>(keys, keys + N, values, thrust::greater<int><http://docs.thrust.googlecode.com/hg/structthrust_1_1greater.html>());
  // keys is now   {  8,   7,   5,   4,   2,   1}
  // values is now {'d', 'f', 'e', 'b', 'c', 'a'}




4)      Sort by key with a  comparison function

5-8 the stable forms


Based on the thrust primitives, one  parallel array sort  might be

Sort an array of objects
Besides the array the arguments might be
- a flag indicating if the sort must be  stable
- an optional user compare function that defaulted to < on primitive types


From: Herhut, Stephan A [mailto:stephan.a.herhut at intel.com]
Sent: Thursday, April 11, 2013 1:09 PM
To: Norm Rubin; es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: RE: parallel arrays and sorting

Rick is travelling, so let me chime in.

We have discussed this back and forth but have not come to a conclusion. Generally, we agree that adding a sort primitive makes a lot of sense, in particular as the Array object in JavaScript already has a sort method. Also, as you too mentioned, implementing an efficient sort as a library function without knowing the details of the parallel hardware used is difficult, to say the least. So sort ticks all the boxes to become a primitive.

The other, and arguably more difficult question, is what a sort method should look like. If we take JavaScript's existing Array.sort, the sort method would get an (optional) comparator function. However, using a comparator would preclude the use of radix sort.
An alternative would be to implement sorting of primitive types only. This brings back more choice in sort algorithms but limits use. For such a design, we considered a function as optional argument to sort that, given a value from the ParallelArray to be sorted, returns a key used for comparison, which again needs to be a primitive. This would at least enable sorting of objects by a field and, at some runtime cost, sorting of general data.

The tradeoff between these two approaches, and probably other designs, is hard to judge without knowing what sort is used for. So we decided to wait for some good use cases before deciding on a specific design.

Sorry, no answers only further questions.
  Stephan

From: es-discuss-bounces at mozilla.org<mailto:es-discuss-bounces at mozilla.org> [mailto:es-discuss-bounces at mozilla.org] On Behalf Of Norm Rubin
Sent: Monday, April 08, 2013 7:46 AM
To: es-discuss at mozilla.org<mailto:es-discuss at mozilla.org>
Subject: parallel arrays and sorting

In comparing ParallelArrays (rivertrail) to the cuda thrust library,  I noticed  that Sorting using parallelArrays looks like a missing primitive  operation.

Sorting is  special because the good (high performance) algorithm on a gpu is radix sort, while the good (high performance) algorithm on the cpu is parallel merge sort,  The other way around radix sort of cpu, or parallel merge sort of a gpu is very slow and often worse than serial implementations

Sadly it is well past a jit to take the code for one flavor of sort and transform it into the other,  while  it would be pretty simple for a run-time to pick a good sort, if only it knew that a sort was going on.  Run times would not be required to do  so  here since they can always pick a slow sort.

I know this is a slippery road, since once you add another prim, adding more prims becomes ever easier  but sorting seems pretty important.   And array already has a sort, so adding a version that works on ParallelArrays does not seem so bad.

What do you guys think?


________________________________
This email message is for the sole use of the intended recipient(s) and may contain confidential information.  Any unauthorized review, use, disclosure or distribution is prohibited.  If you are not the intended recipient, please contact the sender by reply email and destroy all copies of the original message.
________________________________
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/es-discuss/attachments/20130419/88e35183/attachment-0001.html>