Hash Function HSH 11/13
|
URI: |
http://herbert.gandraxa.com/herbert/hsh.asp |
|
Link template: |
<a href="http://herbert.gandraxa.com/herbert/hsh.asp">Hash Function HSH 11/13</a> |
|
Link symbols: |
|
Hash Function HSH 11/13
|
URI: |
http://herbert.gandraxa.com/herbert/hsh.asp |
|
Link template: |
<a href="http://herbert.gandraxa.com/herbert/hsh.asp">Hash Function HSH 11/13</a> |
|
Link symbols: |
|
Table of Contents
Home »
Linoleum Source Code
» Hash Function HSH 11/13
Source Code Library: The implementation in the cross-platform Assembler Linoleum.
Source Code Test Program: Test program using the library.Download as a WinZip file see below.
This is proposing a new Hash Function, named HSH 11/13 due to the algorithm's characteristical bit rotations. The
HSH library contains an implementation in the
Linoleum Programming Language, a cross-platform Assembler.
Porting to any language should provide no problems, as it consists of rather trivial operations: see pseudo-code below. If you do so, I appreciate a mail with a link to your published code. Porting requires, that your programming language supports the rotate-left [1] instruction.
[1] Machinecode mnemonic ROL (Rotate Left) on Intel 80x86, Motorola 680x0, Motorola 68300. ROTL (Rotate Long) on DEC VAX.
Dictionary of Algorithms and Data Structures, from Paul E. Black (
NIST)
Bob Jenkins, author of Lookup3, with a lot of material about hash functions.
Bret Mulvey has an excellent 11 pages article about hash functions.
Julienne Walker has a very good article about the Art of Hashing, suiting especially non-Mathematicians.The algorithm works on memory chunks of 32 bits (sometimes called words, or units). The key to be hashed must be in successive units. If the key consists of an "odd" number of bytes which don't fill the last word (unit) completely, you must ensure that the unused 1 to 3 bytes are zeroed and do not contain any garbage from recent operations.
In Pseudocode, the algorithm does the following:
state := initvalue
result := 0
Do for all keyunit:=1...n containing parts of the key
{
result := result xor keyunit
Do precision times
{
state := state left-rotate 11 bits
result := result left-rotate 13 bits
result := result xor state
result := result left-rotate (state mod 32) bits
state := state left-rotate (result mod 32) bits
}
}
Note: If you use this on a 32-bit Intel CPU, left-rotate is mod 32 anyway, so you simply can code A left-rotate B instead of A left-rotate (B mod 32).
| keyunit | Unit = "word" = 32 bit = 4 bytes. keyunit is a vector of n units, each holding a part (1...32 bits) of the key to be hashed |
| state, result | A piece of RAM (register, variable) of unit size, i.e. 32 bit each. state is needed only while the algorithm calculates the hash: it does not preserve any internal state and is completely local. The same is true for result with the exception, that its final value (the hash) is returned to the caller. |
| precision | A measure for the required precision of the distribution. As a rule of thumb, use the number of bits of your input symbols minus 1. E.g.: If your input symbols are organized in bytes, use 8-1=7. If you process 32-bit words, use 32-1=31 etc. The number of symbols in your key does not influence this value. Don't use less than 7, even if you process smaller symbol widths. |
| initvalue | A constant. The original implementation uses the hexadecimal value 0x40490FDB (pi in IEEE-754 format, single precision, big endian). Other good values include 0xC0C0C0C0, 0x03030303, and also 1. Bad values are 0 and 0xFFFFFFFF. All of the test data given below base on the value 0x40490FDB. |
The space complexity is 0: no variables are used to store intermediate values of any kind. The 2 constants can be hard-coded, and all operations can be executed directly in CPU registers.
Performance is reasonably fast especially for keys coming in bytes (ASCII- or ANSI-strings). Processing data in units (32 bit words) takes approx. 4 times as long.
In assembly, above pseudo code can be coded such, that, when processing bytestreams (sequential ASCII characters), 7*6+4=46 machine instructions are executed per processed unit of 32 bits, or 11½ machine instructions per processed byte (character).
E.g.: A key like "Yvonne", when being processed as ASCII characters, occupies 2 units [1] and thus requires 2*46=92 machine instructions to generate a hash value.
[1] Big Endian: [ 'Y' 'v' 'o' 'n' ][ 'n' 'e' \0 \0 ]; Little Endian: [ 'n' 'o' 'v' 'Y' ][ \0 \0 'e' 'n' ]
The following ZIP file contains both the library's source (HSH.txt) as well as a test program (HSH Test.txt).
Consider the two ASCII strings "Yvonne" and "Herbert". When reading in from a file into 32 bit units of a Little Endian machine, the strings would be held in RAM as follows:
| Text | Unit 0 | Unit 1 |
| Yvonne\0 | 6E 6F 76 59 | xx 00 65 6E |
| Herbert\0 | 62 72 65 48 | 00 74 72 65 |
Converting them into an 8-bit bitstream with zero padding after the last character would turn the units into:
| Unit 0 | Unit 1 | |
| "Yvonne" | 59 76 6F 6E | 6E 65 00 00 |
| "Herbert" | 48 65 72 62 | 65 72 74 00 |
Because we process ASCII keys, we set the precision to 8-1=7 when calling the hash function. The number of units to be processed can be calculated with (NumCharacters + 3) \ 4.
The calculated 32-bit hash values for above bitstreams are:
| "Yvonne" | 92 3F 2D B7 |
| "Herbert" | 22 51 0D DC |
To see, how well the hash function distributes, we don't create more values manually. Instead, we process a file containing many different words.
The test file for this test series was the
Berkeley Word List containing 173,528 words in English language (our keys). The file is 1,923,517 bytes large. Each word of this list was extracted and fed to the hash generator as if it was in Big Endian, i.e. MSB first.
10 bits of the generated keys were used, the remaining 22 bits discarded. Since we can map only 1,024 values in 10 bits, hash collisions must occur. We would expect an average 173,528 / 1,024 = 169.46 keys to result in the same hash, if we could provide a perfect hash function. Since we usually don't know the number of keys, we can not create a 1:1 mapping function. However, the requirement is in any case, to remain as close to the expectancy as possible.
The number of hash collisions were counted, i.e. the number of keys producing the same hash value. Each of the hash values 0...1023 ideally would be occupied by 169.46 keys.
For aforementioned word list, each word converted to a Big Endian bitstream with 8-bit/character, the proposed algorithm distributes as follows:
Fig. 1: Distribution of hash values for test series 1. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (169.46 keys).
This test series consists of 4 individual tests:
To see, if the hash function distributes well also for clusters, i.e. for values being very close to each other, a 32 bit unit is used, which simply is incremented by 1, beginning with 0. Note, that the precision must be set to 32-1=31 here when calling the hash function. In this test series, the number of units to be processed is 1 for each key.
The first 10 hash values are given here to facilitate checking your own implementation (all values in hexadecimal):
| Key | Hash |
| 00 00 00 00 | 8A F5 70 B4 |
| 00 00 00 01 | 70 1E C6 F5 |
| 00 00 00 02 | 03 E8 E9 44 |
| 00 00 00 03 | D0 85 7B 72 |
| 00 00 00 04 | 7F 80 60 1F |
| 00 00 00 05 | F4 2B F7 DD |
| 00 00 00 06 | 19 82 72 04 |
| 00 00 00 07 | 42 37 C5 F9 |
| 00 00 00 08 | 76 46 98 28 |
| 00 00 00 09 | 8D 8C A3 BA |
The counts are dense clusters in the low bits. The created hash values were mapped onto a 10 bit range from 0...1,023. 4 tests were run in this test series:
Again, 10 bits of the generated keys were used, the remaining 22 bits discarded, for each of the tests (a) to (d) once the 10 LSBs were used and the other time the 10 MSBs.
Fig. 2: Distribution of hash values for test series 2a. Test (a) for 10 MSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000 keys).
Fig. 3: Distribution of hash values for test series 2d. Test (d) for 10 MSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000,000 keys).
For the third test series a number of pseudo random values (algorithm Mersenne Twister 19937 as implemented in the
RNG library) were provided, all consisting of a single unit of 32 bits width. The hash function's arguments are the same as with test series 2.
The first 1,000 numbers of the pseudo random number generator are available from the
University Hiroshima.
As was the case with the former test series, all hash values were mapped onto a 10 bit range from 0...1,023. The four tests of this series:
And as with previous tests, 10 bits of the generated keys were used, the remaining 22 bits discarded, for each of the tests (a) to (d) once the 10 LSBs were used and the other time the 10 MSBs.
Fig. 4: Distribution of hash values for test series 3a. Test (a) for 10 MSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000 keys).
Fig. 5: Distribution of hash values for test series 3d. Test (d) for 10 LSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000,000 keys).
The bits are equivalent in their quality. Using the 10 LSBs (Least Significant Bits) or the 10 MSBs (Most Significant Bits) by shifting right 22 bits, we obtain similar distributions (see table below).
Static filler bits don't affect the quality. The test series 1 was run for keys having a varying number of characters of 8 bit width, although the text actually consists of pure ASCII characters. This means, that the most significant bit of each input character was 0. When the input was converted into a bitstream of 7 bits each, thus eliminating the leading 0, the distribution did not change significantly. In test 2a, the 10 MSBs all were 0 for all 1,024,000 fed values. Similarly, tests 2b to 2d have leading zero bits in a decreasing number. In test series 3, no static bits exist.
Quality vs quantity. As can be seen from test series 2 and 3, each additional order of magnitude provides about the same variance:mean-ratio. Consequently, with each additional magnitude the relative standard deviation decreases rapidly (by sqrt(10) per magnitude), tending towards 0.
| Series | Number of Keys | Num Symbols | Symbol Width | Map Range | Exp. Collisions | Prec | Variance | Std Dev | RSD |
| 1a | 173,528 | Variable | 8 bit | 10 LSBs | 169.46 | 7 | 212 | 169.46±14.56 | 8.59% |
| 1b | 173,528 | Variable | 8 bit | 10 MSBs | 169.46 | 7 | 226 | 169.46±15.03 | 8.87% |
| 1c | 173,528 | Variable | 7 bit | 10 LSBs | 169.46 | 7 | 211 | 169.46±14.53 | 8.57% |
| 1d | 173,528 | Variable | 7 bit | 10 MSBs | 169.46 | 7 | 196 | 169.46±14.00 | 8.26% |
| 2a | 1,024,000 | 1 | 32 bit | 10 LSBs | 1,000 | 31 | 1,176 | 1,000±34.29 | 3.43% |
| 2a | 1,024,000 | 1 | 32 bit | 10 MSBs | 1,000 | 31 | 1,078 | 1,000±32.83 | 3.28% |
| 2b | 10,240,000 | 1 | 32 bit | 10 LSBs | 10,000 | 31 | 10,895 | 10,000±104.38 | 1.04% |
| 2b | 10,240,000 | 1 | 32 bit | 10 MSBs | 10,000 | 31 | 10,646 | 10,000±103.18 | 1.03% |
| 2c | 102,400,000 | 1 | 32 bit | 10 LSBs | 100,000 | 31 | 112,546 | 100,000±335.48 | 0.34% |
| 2c | 102,400,000 | 1 | 32 bit | 10 MSBs | 100,000 | 31 | 113,981 | 100,000±337.61 | 0.34% |
| 2d | 1,024,000,000 | 1 | 32 bit | 10 LSBs | 1,000,000 | 31 | 1,166,177 | 1,000,000±1079.90 | 0.11% |
| 2d | 1,024,000,000 | 1 | 32 bit | 10 MSBs | 1,000,000 | 31 | 1,147,042 | 1,000,000±1071.00 | 0.11% |
| 3a | 1,024,000 | 1 | 32 bit | 10 LSBs | 1,000 | 31 | 979 | 1,000±31.29 | 3.13% |
| 3a | 1,024,000 | 1 | 32 bit | 10 MSBs | 1,000 | 31 | 990 | 1,000±31.46 | 3.15% |
| 3b | 10,240,000 | 1 | 32 bit | 10 LSBs | 10,000 | 31 | 9,489 | 10,000±97.41 | 0.97% |
| 3b | 10,240,000 | 1 | 32 bit | 10 MSBs | 10,000 | 31 | 10,457 | 10,000±102.26 | 1.02% |
| 3c | 102,400,000 | 1 | 32 bit | 10 LSBs | 100,000 | 31 | 97,826 | 100,000±312.77 | 0.31% |
| 3c | 102,400,000 | 1 | 32 bit | 10 MSBs | 100,000 | 31 | 98,868 | 100,000±314.43 | 0.31% |
| 3d | 1,024,000,000 | 1 | 32 bit | 10 LSBs | 1,000,000 | 31 | 1,265,855 | 1,000,000±1,125.10 | 0.11% |
| 3d | 1,024,000,000 | 1 | 32 bit | 10 MSBs | 1,000,000 | 31 | 1,298,419 | 1,000,000±1,139.48 | 0.11% |
HSH 11/13 satisfies the strict avalanche criterion. Probably the best short definition of the criterion is [1].
[1] "A function is said to satisfy the strict avalanche criterion if, whenever a single input bit is complemented (toggled between 0 and 1), each of the output bits should change with a probability of one half for an arbitrary selection of the remaining input bits." -- Bret Mulvey. More about the strict avalanche criterion on his site (see
links above).
The following explains some more in-depth, how the avalanche criterion was tested for the 32 bit key 0x00000000 (just 1 unit):
The hash value for the key 0x00000000 is generated: the returned hash value is 0x8AF570B4. In bynary, this is 10001010111101010111000010110100.
Now all 32 bits of the original key (0x00000000) are toggled one after the other, that is, a 0 becomes a 1 and a 1 would become a zero (though in this example we have no 1's in the key). This returns 32 modified keys, from 0x00000001 for the toggled bit number 0, to 0x80000000 for the toggled bit number 31.
For each of the modified keys, a new hash is calculated. For example does 0x00000001 provide the hash 0x701EC6F5, or in binary 01110000000111101100011011110101. Each of the new hash value's bits is now individually compared with the original hash value's bits. Either a bit at the new hash value's positions 31...0 changed, or it remained the same. When it changed, then a counter for that individual bit is incremented, otherwise the counter remains as is. This means, that we have an individual counter for each of the bit positions 31 to 0. The following table's 3rd row shows, which bits changed ("x") and cause the corresponding counter to be incremented, and which bits remained unchanged ("."), thus not affecting its counter:
| Original key's hash | 10001010111101010111000010110100 |
| Modified key's hash | 01110000000111101100011011110101 |
| Changed bits "x" | xxxxx.x.xxx.x.xxx.xx.xx..x.....x |
This causes counters 31, 30, 29, 28, 27, 25, 23 etc. to be incremented, but not 26, 24, 20 etc.
The following table shows all the changes for the modifications of key 0x00000000:
(Column Mod is the modified bit, Hash is the hash value for the whole modified key, and columns 31 to 0 have an "x" if the hash value's bit changed in comparison to the original key's hash value bit.)
| Mod | Hash | 31 | 30 | 29 | 28 | 27 | 26 | 25 | 24 | 23 | 22 | 21 | 20 | 19 | 18 | 17 | 16 | 15 | 14 | 13 | 12 | 11 | 10 | 9 | 8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0 |
| 31 | 3BD35803 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 30 | 14D5BA0E | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 29 | C3A1C9E7 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 28 | 4E1CAB86 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||
| 27 | 8A3E1291 | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||||||
| 26 | 0FE363DE | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||||
| 25 | DEB00AAC | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||||
| 24 | 20B4E8B1 | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||||||
| 23 | 72114D4C | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||
| 22 | FA57AB9E | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 21 | A033C55E | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||
| 20 | C40FBB53 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||
| 19 | AD772ECF | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||
| 18 | 38B0D678 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 17 | 8C7A8F82 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||
| 16 | C9606CA7 | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||||
| 15 | 866AF261 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 14 | 51209E81 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||
| 13 | 90662D8C | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 12 | AC7DCA83 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 11 | 07FCCF2D | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||
| 10 | 8A37F1D4 | x | x | x | x | x | x | x | |||||||||||||||||||||||||
| 9 | C1E9FA7A | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 8 | 368AC6C3 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||
| 7 | 4789BCE9 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||
| 6 | 18F35B5A | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 5 | 88376C1D | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||||||
| 4 | 62C10D3F | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||
| 3 | 76469828 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||
| 2 | 7F80601F | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||
| 1 | 03E8E944 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||||||
| 0 | 701EC6F5 | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | x | |||||||||||||
| Counters: | 17 | 14 | 13 | 13 | 17 | 14 | 15 | 11 | 17 | 16 | 12 | 14 | 12 | 18 | 16 | 16 | 20 | 13 | 16 | 20 | 23 | 14 | 18 | 14 | 15 | 15 | 20 | 18 | 17 | 14 | 17 | 17 |
Because we manipulated 32 bits, and each bit must have an approx. chance of 0.5 to change in order to satisfy the strict avalanche criterion, in average we would expect a count of 16 for each of the 32 counters.
However, above were the counts for just one key; this can not really be regarded as representative. We are going to repeat this procedure now with some more keys.
These are the test results for the first 1M pseudo random numbers of the Mersenne Twister 19937 algorithm, 32 bits each.
Fig. 6: Deviation from mean for all 32 bits of the first 1M values.
| Affected bit | 1M random keys |
| Bit 31 | 16697465 |
| Bit 30 | 16692483 |
| Bit 29 | 16698023 |
| Bit 28 | 16693781 |
| Bit 27 | 16697553 |
| Bit 26 | 16698353 |
| Bit 25 | 16693012 |
| Bit 24 | 16697464 |
| Bit 23 | 16693497 |
| Bit 22 | 16695501 |
| Bit 21 | 16696010 |
| Bit 20 | 16693835 |
| Bit 19 | 16693142 |
| Bit 18 | 16690967 |
| Bit 17 | 16696933 |
| Bit 16 | 16694732 |
| Bit 15 | 16693283 |
| Bit 14 | 16693254 |
| Bit 13 | 16696063 |
| Bit 12 | 16693136 |
| Bit 11 | 16697684 |
| Bit 10 | 16696468 |
| Bit 9 | 16691867 |
| Bit 8 | 16692805 |
| Bit 7 | 16695621 |
| Bit 6 | 16689653 |
| Bit 5 | 16694045 |
| Bit 4 | 16691465 |
| Bit 3 | 16693177 |
| Bit 2 | 16692075 |
| Bit 1 | 16691613 |
| Bit 0 | 16694428 |
| Mean | 16694355.88 |
| Variance | 166611930 |
| Std Dev | 16694355.88±12907.82 |
| RSD | 0.08% |