encapsulated hash codes
On Nov 4, 2009, at 2:49 PM, Allen Wirfs-Brock wrote:
A straw man proposal for Harmony encapsulated hash codes has been
posted to the Wiki athttp://wiki.ecmascript.org/doku.php?id=strawman:encapsulated_hashcodes
-
Wouldn't it be simpler to have a single Object.hash() function that
returns a uniformly distributed number in the 0..2^32 range, and leave
the mod operation up to code using the hash code? Making a new
function that (essentially) encapsulates the hash operation and the
mod does not seem to pull its weight in API complexity. -
What about non-objects? It would be nice if strings, numbers, and
perhaps other primitive types could hash by value rather than by
identity, so you could reasonably mix them as hash keys with objects. -
Security considerations: it should be recommended that the hash
code, if computed from the address of the object, is not reversible.
(Of course, with a copying collector you probably can't compute the
hash code from the address).
, Maciej
See below
From: Maciej Stachowiak [mailto:mjs at apple.com] Sent: Wednesday, November 04, 2009 4:50 PM To: Allen Wirfs-Brock Cc: es-discuss Subject: Re: encapsulated hash codes
On Nov 4, 2009, at 2:49 PM, Allen Wirfs-Brock wrote:
A straw man proposal for Harmony encapsulated hash codes has been posted to the Wiki at.strawman:encapsulated_hashcodes
-
Wouldn't it be simpler to have a single Object.hash() function that returns a uniformly distributed number in the 0..2^32 range, and leave the mod operation up to code using the hash code? Making a new function that (essentially) encapsulates the hash operation and the mod does not seem to pull its weight in API complexity.
Yes, it would and that is the classic approach (perhaps excluding the uniformity of distribution) used by most languages. As an implementer I would be fairly happy with that approach. However, some of our members have significant concerns about the possibility of hash codes being used as a covert communications canner. This proposal tries to avoid that problem by giving each client of identity hashes (most likely each hash table) their own hash function which yields a distinct sequences of hash values (for the same sequence of objects). Since the functions aren't shared the sequence of values they generate can be used as a channel.
Requiring uniform distribution over a 32-bit range may not be practical. If a client only wants hashes in the 0.255 range the obvious function of a 32-bit value (mod 256) ignores most of the bits that actually distinguish the values.
The individual functions probably do a bit more than a simple mod. Typically, the seed that is used to generate per object hash values (for example, the allocation address of an object) has fewer than 32 variable bits (for example, only the size range of the allocation nursery). If the hash function know range of hash values it is expected to generate it can more effectively use the actual variable bits to generate a good distribution.
- What about non-objects? It would be nice if strings, numbers, and perhaps other primitive types could hash by value rather than by identity, so you could reasonably mix them as hash keys with objects.
Sure this can be added, but isn't an essential primitive. Hash values can be easily generated today for all the built-in value types. But object identity hash requires some sort of primitive support.
-
Security considerations: it should be recommended that the hash code, if computed from the address of the object, is not reversible. (Of course, with a copying collector you probably can't compute the hash code from the address).
Yes, that would be a good requirement. In practice, it's usually the case anyway. Actually, the address is a good seed hash value even for copying collectors. There are tricks that are known by lisp/smalltalk/etc. implementers for efficiently doing this. Happy to share... Ideally, identity hashes cost nothing to compute and require no additional per object storage. The ideal isn't obtainable, but you can get pretty close.
, Maciej
On Nov 4, 2009, at 6:43 PM, Allen Wirfs-Brock wrote:
See below
From: Maciej Stachowiak [mailto:mjs at apple.com] Sent: Wednesday, November 04, 2009 4:50 PM To: Allen Wirfs-Brock Cc: es-discuss Subject: Re: encapsulated hash codes
On Nov 4, 2009, at 2:49 PM, Allen Wirfs-Brock wrote:
A straw man proposal for Harmony encapsulated hash codes has been
posted to the Wiki at.
Wouldn't it be simpler to have a single Object.hash()
function that returns a uniformly distributed number in the 0..2^32
range, and leave the mod operation up to code using the hash code?
Making a new function that (essentially) encapsulates the hash
operation and the mod does not seem to pull its weight in API
complexity.Yes, it would and that is the classic approach (perhaps excluding
the uniformity of distribution) used by most languages. As an
implementer I would be fairly happy with that approach. However,
some of our members have significant concerns about the possibility
of hash codes being used as a covert communications canner. This
proposal tries to avoid that problem by giving each client of
identity hashes (most likely each hash table) their own hash
function which yields a distinct sequences of hash values (for the
same sequence of objects). Since the functions aren’t shared the
sequence of values they generate can be used as a channel.
How can you use it as a covert channel? If you can only produce a hash
code given a value, you need an overt communication channel to
transfer the value to use a hash function for covert communication. Is
there a more complicated scenario that I'm not thinking of?
Requiring uniform distribution over a 32-bit range may not be
practical. If a client only wants hashes in the 0.255 range the
obvious function of a 32-bit value (mod 256) ignores most of the
bits that actually distinguish the values.
Thats ok. For a good quality hash function, this will be distributed
just as uniformly as a hash that produces values 0..255. Uniform
distribution over a 32-bit range is quite practical, there are many
well-lnown algorithms.
The individual functions probably do a bit more than a simple mod.
Typically, the seed that is used to generate per object hash values
(for example, the allocation address of an object) has fewer than 32
variable bits (for example, only the size range of the allocation
nursery). If the hash function know range of hash values it is
expected to generate it can more effectively use the actual variable
bits to generate a good distribution.
I don't believe that is the case. With a good hash function, changing
any one bit in the input will on average change half the bits in the
output (and similar properties for correlations of multiple bits).
Thus, the low 28 bits of 32-bit hash code on values with only (say)
28 bits of entropy will be just as good as the best 28-bit hash code
you could make.
See here for some references: burtleburtle.net/bob/hash/#lookup
- What about non-objects? It would be nice if strings, numbers, and
perhaps other primitive types could hash by value rather than by
identity, so you could reasonably mix them as hash keys with objects.Sure this can be added, but isn’t an essential primitive. Hash
values can be easily generated today for all the built-in value
types. But object identity hash requires some sort of primitive
support.
True, but that way is much less convenient and likely to be less
performant.
Security considerations: it should be recommended that the
hash code, if computed from the address of the object, is not
reversible. (Of course, with a copying collector you probably can't
compute the hash code from the address). Yes, that would be a good requirement. In practice, it’s usually
the case anyway.
Actually, many popular non-cryptographic hash functions are fully or
partially reversible. It's a desirable property that certain passes of
computing the hash function are reversible. To guarantee what I
suggested you'd have to combine an unpredictable salt with the object
address or use a cryptographic hash (which is more expensive) or do
something more tricky.
Actually, the address is a good seed hash value even for copying
collectors. There are tricks that are known by lisp/smalltalk/etc.
implementers for efficiently doing this. Happy to share… Ideally,
identity hashes cost nothing to compute and require no additional
per object storage. The ideal isn’t obtainable, but you can get
pretty close.
I would be interested to hear bout these tricks.
, Maciej
A straw man proposal for Harmony encapsulated hash codes has been posted to the Wiki at strawman:encapsulated_hashcodes