Hash Function HSH 11/13


URI:

http://herbert.gandraxa.com/herbert/hsh.asp

Link template:   

<a href="http://herbert.gandraxa.com/herbert/hsh.asp">Hash Function HSH 11/13</a>


Link symbols:   

Local LinkOn current page | DocumentOn this site | External PageOn external site | WikipediaWikipedia article | Compressed ArchiveZIP archive | PDF documentPDF | E-MailE-Mail


Article

Organization

DocumentHome » DocumentLinoleum Source Code » Hash Function HSH 11/13

Download as a WinZip file see below.

Scope

This is proposing a new Hash Function, named HSH 11/13 due to the algorithm's characteristical bit rotations. The DocumentHSH library contains an implementation in the WikipediaLinoleum Programming Language, a cross-platform Assembler.

Porting to any language should provide no problems, as it consists of rather trivial operations: see pseudo-code below. If you do so, I appreciate a mail with a link to your published code. Porting requires, that your programming language supports the rotate-left [1] instruction.

[1] Machinecode mnemonic ROL (Rotate Left) on Intel 80x86, Motorola 680x0, Motorola 68300. ROTL (Rotate Long) on DEC VAX.

Author

DocumentHerbert Glarner

External Links


Outline

Keys

The algorithm works on memory chunks of 32 bits (sometimes called words, or units). The key to be hashed must be in successive units. If the key consists of an "odd" number of bytes which don't fill the last word (unit) completely, you must ensure that the unused 1 to 3 bytes are zeroed and do not contain any garbage from recent operations.

Pseudocode

In Pseudocode, the algorithm does the following:

state := initvalue
result := 0
Do for all keyunit:=1...n containing parts of the key
{
    result := result xor keyunit
    Do precision times
    {
        state := state left-rotate 11 bits
        result := result left-rotate 13 bits
        result := result xor state
        result := result left-rotate (state mod 32) bits
        state := state left-rotate (result mod 32) bits
    }
}

Note: If you use this on a 32-bit Intel CPU, left-rotate is mod 32 anyway, so you simply can code A left-rotate B instead of A left-rotate (B mod 32).

keyunit Unit = "word" = 32 bit = 4 bytes. keyunit is a vector of n units, each holding a part (1...32 bits) of the key to be hashed
state, result A piece of RAM (register, variable) of unit size, i.e. 32 bit each. state is needed only while the algorithm calculates the hash: it does not preserve any internal state and is completely local. The same is true for result with the exception, that its final value (the hash) is returned to the caller.
precision A measure for the required precision of the distribution. As a rule of thumb, use the number of bits of your input symbols minus 1. E.g.: If your input symbols are organized in bytes, use 8-1=7. If you process 32-bit words, use 32-1=31 etc. The number of symbols in your key does not influence this value. Don't use less than 7, even if you process smaller symbol widths.
initvalue A constant. The original implementation uses the hexadecimal value 0x40490FDB (pi in IEEE-754 format, single precision, big endian). Other good values include 0xC0C0C0C0, 0x03030303, and also 1. Bad values are 0 and 0xFFFFFFFF. All of the test data given below base on the value 0x40490FDB.

Space Complexity

The space complexity is 0: no variables are used to store intermediate values of any kind. The 2 constants can be hard-coded, and all operations can be executed directly in CPU registers.

Performance

Performance is reasonably fast especially for keys coming in bytes (ASCII- or ANSI-strings). Processing data in units (32 bit words) takes approx. 4 times as long.

In assembly, above pseudo code can be coded such, that, when processing bytestreams (sequential ASCII characters), 7*6+4=46 machine instructions are executed per processed unit of 32 bits, or 11½ machine instructions per processed byte (character).

E.g.: A key like "Yvonne", when being processed as ASCII characters, occupies 2 units [1] and thus requires 2*46=92 machine instructions to generate a hash value.

[1] Big Endian: [ 'Y' 'v' 'o' 'n' ][ 'n' 'e' \0 \0 ]; Little Endian: [ 'n' 'o' 'v' 'Y' ][ \0 \0 'e' 'n' ]


Download

The following ZIP file contains both the library's source (HSH.txt) as well as a test program (HSH Test.txt).

Zipped ArchiveHSH.zip


Test Layouts

Series 1 (Text)

Consider the two ASCII strings "Yvonne" and "Herbert". When reading in from a file into 32 bit units of a Little Endian machine, the strings would be held in RAM as follows:

Text Unit 0 Unit 1
Yvonne\0 6E 6F 76 59 xx 00 65 6E
Herbert\0 62 72 65 48 00 74 72 65

Converting them into an 8-bit bitstream with zero padding after the last character would turn the units into:

Unit 0 Unit 1
"Yvonne" 59 76 6F 6E 6E 65 00 00
"Herbert" 48 65 72 62 65 72 74 00

Because we process ASCII keys, we set the precision to 8-1=7 when calling the hash function. The number of units to be processed can be calculated with (NumCharacters + 3) \ 4.

The calculated 32-bit hash values for above bitstreams are:

"Yvonne" 92 3F 2D B7
"Herbert" 22 51 0D DC

To see, how well the hash function distributes, we don't create more values manually. Instead, we process a file containing many different words.

The test file for this test series was the External SiteBerkeley Word List containing 173,528 words in English language (our keys). The file is 1,923,517 bytes large. Each word of this list was extracted and fed to the hash generator as if it was in Big Endian, i.e. MSB first.

10 bits of the generated keys were used, the remaining 22 bits discarded. Since we can map only 1,024 values in 10 bits, hash collisions must occur. We would expect an average 173,528 / 1,024 = 169.46 keys to result in the same hash, if we could provide a perfect hash function. Since we usually don't know the number of keys, we can not create a 1:1 mapping function. However, the requirement is in any case, to remain as close to the expectancy as possible.

The number of hash collisions were counted, i.e. the number of keys producing the same hash value. Each of the hash values 0...1023 ideally would be occupied by 169.46 keys.

For aforementioned word list, each word converted to a Big Endian bitstream with 8-bit/character, the proposed algorithm distributes as follows:

HSH Distribution Test Series 1

Fig. 1: Distribution of hash values for test series 1. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (169.46 keys).

This test series consists of 4 individual tests:

Series 2 (Clusters)

To see, if the hash function distributes well also for clusters, i.e. for values being very close to each other, a 32 bit unit is used, which simply is incremented by 1, beginning with 0. Note, that the precision must be set to 32-1=31 here when calling the hash function. In this test series, the number of units to be processed is 1 for each key.

The first 10 hash values are given here to facilitate checking your own implementation (all values in hexadecimal):

Key Hash
00 00 00 00 8A F5 70 B4
00 00 00 01 70 1E C6 F5
00 00 00 02 03 E8 E9 44
00 00 00 03 D0 85 7B 72
00 00 00 04 7F 80 60 1F
00 00 00 05 F4 2B F7 DD
00 00 00 06 19 82 72 04
00 00 00 07 42 37 C5 F9
00 00 00 08 76 46 98 28
00 00 00 09 8D 8C A3 BA

The counts are dense clusters in the low bits. The created hash values were mapped onto a 10 bit range from 0...1,023. 4 tests were run in this test series:

Again, 10 bits of the generated keys were used, the remaining 22 bits discarded, for each of the tests (a) to (d) once the 10 LSBs were used and the other time the 10 MSBs.

HSH Distribution Test Series 2 (a)

Fig. 2: Distribution of hash values for test series 2a. Test (a) for 10 MSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000 keys).

HSH Distribution Test Series 2 (d)

Fig. 3: Distribution of hash values for test series 2d. Test (d) for 10 MSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000,000 keys).

Series 3 (Random)

For the third test series a number of pseudo random values (algorithm Mersenne Twister 19937 as implemented in the DocumentRNG library) were provided, all consisting of a single unit of 32 bits width. The hash function's arguments are the same as with test series 2.

The first 1,000 numbers of the pseudo random number generator are available from the External SiteUniversity Hiroshima.

As was the case with the former test series, all hash values were mapped onto a 10 bit range from 0...1,023. The four tests of this series:

And as with previous tests, 10 bits of the generated keys were used, the remaining 22 bits discarded, for each of the tests (a) to (d) once the 10 LSBs were used and the other time the 10 MSBs.

Fig. 4: Distribution of hash values for test series 3a. Test (a) for 10 MSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000 keys).


Fig. 5: Distribution of hash values for test series 3d. Test (d) for 10 LSBs. X axis: from left to right the hash values 0...1023. Y axis: from top to bottom the number of keys providing the hash value X. Blue horizontal Line: Expectancy value (1,000,000 keys).


Hash Quality

Distribution

The bits are equivalent in their quality. Using the 10 LSBs (Least Significant Bits) or the 10 MSBs (Most Significant Bits) by shifting right 22 bits, we obtain similar distributions (see table below).

Static filler bits don't affect the quality. The test series 1 was run for keys having a varying number of characters of 8 bit width, although the text actually consists of pure ASCII characters. This means, that the most significant bit of each input character was 0. When the input was converted into a bitstream of 7 bits each, thus eliminating the leading 0, the distribution did not change significantly. In test 2a, the 10 MSBs all were 0 for all 1,024,000 fed values. Similarly, tests 2b to 2d have leading zero bits in a decreasing number. In test series 3, no static bits exist.

Quality vs quantity. As can be seen from test series 2 and 3, each additional order of magnitude provides about the same variance:mean-ratio. Consequently, with each additional magnitude the relative standard deviation decreases rapidly (by sqrt(10) per magnitude), tending towards 0.

Series Number of Keys Num Symbols Symbol Width Map Range Exp. Collisions Prec Variance Std Dev RSD
1a 173,528 Variable 8 bit 10 LSBs 169.46 7 212 169.46±14.56 8.59%
1b 173,528 Variable 8 bit 10 MSBs 169.46 7 226 169.46±15.03 8.87%
1c 173,528 Variable 7 bit 10 LSBs 169.46 7 211 169.46±14.53 8.57%
1d 173,528 Variable 7 bit 10 MSBs 169.46 7 196 169.46±14.00 8.26%
2a 1,024,000 1 32 bit 10 LSBs 1,000 31 1,176 1,000±34.29 3.43%
2a 1,024,000 1 32 bit 10 MSBs 1,000 31 1,078 1,000±32.83 3.28%
2b 10,240,000 1 32 bit 10 LSBs 10,000 31 10,895 10,000±104.38 1.04%
2b 10,240,000 1 32 bit 10 MSBs 10,000 31 10,646 10,000±103.18 1.03%
2c 102,400,000 1 32 bit 10 LSBs 100,000 31 112,546 100,000±335.48 0.34%
2c 102,400,000 1 32 bit 10 MSBs 100,000 31 113,981 100,000±337.61 0.34%
2d 1,024,000,000 1 32 bit 10 LSBs 1,000,000 31 1,166,177 1,000,000±1079.90 0.11%
2d 1,024,000,000 1 32 bit 10 MSBs 1,000,000 31 1,147,042 1,000,000±1071.00 0.11%
3a 1,024,000 1 32 bit 10 LSBs 1,000 31 979 1,000±31.29 3.13%
3a 1,024,000 1 32 bit 10 MSBs 1,000 31 990 1,000±31.46 3.15%
3b 10,240,000 1 32 bit 10 LSBs 10,000 31 9,489 10,000±97.41 0.97%
3b 10,240,000 1 32 bit 10 MSBs 10,000 31 10,457 10,000±102.26 1.02%
3c 102,400,000 1 32 bit 10 LSBs 100,000 31 97,826 100,000±312.77 0.31%
3c 102,400,000 1 32 bit 10 MSBs 100,000 31 98,868 100,000±314.43 0.31%
3d 1,024,000,000 1 32 bit 10 LSBs 1,000,000 31 1,265,855 1,000,000±1,125.10 0.11%
3d 1,024,000,000 1 32 bit 10 MSBs 1,000,000 31 1,298,419 1,000,000±1,139.48 0.11%

Avalanche Test

Outline

HSH 11/13 satisfies the strict avalanche criterion. Probably the best short definition of the criterion is [1].

[1] "A function is said to satisfy the strict avalanche criterion if, whenever a single input bit is complemented (toggled between 0 and 1), each of the output bits should change with a probability of one half for an arbitrary selection of the remaining input bits." -- Bret Mulvey. More about the strict avalanche criterion on his site (see Section links on this pagelinks above).

The following explains some more in-depth, how the avalanche criterion was tested for the 32 bit key 0x00000000 (just 1 unit):

The hash value for the key 0x00000000 is generated: the returned hash value is 0x8AF570B4. In bynary, this is 10001010111101010111000010110100.

Now all 32 bits of the original key (0x00000000) are toggled one after the other, that is, a 0 becomes a 1 and a 1 would become a zero (though in this example we have no 1's in the key). This returns 32 modified keys, from 0x00000001 for the toggled bit number 0, to 0x80000000 for the toggled bit number 31.

For each of the modified keys, a new hash is calculated. For example does 0x00000001 provide the hash 0x701EC6F5, or in binary 01110000000111101100011011110101. Each of the new hash value's bits is now individually compared with the original hash value's bits. Either a bit at the new hash value's positions 31...0 changed, or it remained the same. When it changed, then a counter for that individual bit is incremented, otherwise the counter remains as is. This means, that we have an individual counter for each of the bit positions 31 to 0. The following table's 3rd row shows, which bits changed ("x") and cause the corresponding counter to be incremented, and which bits remained unchanged ("."), thus not affecting its counter:

Original key's hash 10001010111101010111000010110100
Modified key's hash 01110000000111101100011011110101
Changed bits "x" xxxxx.x.xxx.x.xxx.xx.xx..x.....x

This causes counters 31, 30, 29, 28, 27, 25, 23 etc. to be incremented, but not 26, 24, 20 etc.

The following table shows all the changes for the modifications of key 0x00000000:

(Column Mod is the modified bit, Hash is the hash value for the whole modified key, and columns 31 to 0 have an "x" if the hash value's bit changed in comparison to the original key's hash value bit.)

Mod Hash 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
31 3BD35803 x x x x x x x x x x x x x x x
30 14D5BA0E x x x x x x x x x x x x x x x
29 C3A1C9E7 x x x x x x x x x x x x x x x
28 4E1CAB86 x x x x x x x x x x x x x x x x x
27 8A3E1291 x x x x x x x x x x x
26 0FE363DE x x x x x x x x x x x x x
25 DEB00AAC x x x x x x x x x x x x x
24 20B4E8B1 x x x x x x x x x x x
23 72114D4C x x x x x x x x x x x x x x x x x x x
22 FA57AB9E x x x x x x x x x x x x x x x
21 A033C55E x x x x x x x x x x x x x x x x x
20 C40FBB53 x x x x x x x x x x x x x x x x x x x x x
19 AD772ECF x x x x x x x x x x x x x x x x x
18 38B0D678 x x x x x x x x x x x x x x x
17 8C7A8F82 x x x x x x x x x x x x x x x x x x x
16 C9606CA7 x x x x x x x x x x x x x
15 866AF261 x x x x x x x x x x x x x x x
14 51209E81 x x x x x x x x x x x x x x x x x x x x x
13 90662D8C x x x x x x x x x x x x x x x
12 AC7DCA83 x x x x x x x x x x x x x x x
11 07FCCF2D x x x x x x x x x x x x x x x x x
10 8A37F1D4 x x x x x x x
9 C1E9FA7A x x x x x x x x x x x x x x x
8 368AC6C3 x x x x x x x x x x x x x x x x x x x x x x x
7 4789BCE9 x x x x x x x x x x x x x x x x x x x
6 18F35B5A x x x x x x x x x x x x x x x
5 88376C1D x x x x x x x x x x x
4 62C10D3F x x x x x x x x x x x x x x x x x
3 76469828 x x x x x x x x x x x x x x x x x x x
2 7F80601F x x x x x x x x x x x x x x x x x
1 03E8E944 x x x x x x x x x x x x x x x
0 701EC6F5 x x x x x x x x x x x x x x x x x x x
Counters: 17 14 13 13 17 14 15 11 17 16 12 14 12 18 16 16 20 13 16 20 23 14 18 14 15 15 20 18 17 14 17 17

Because we manipulated 32 bits, and each bit must have an approx. chance of 0.5 to change in order to satisfy the strict avalanche criterion, in average we would expect a count of 16 for each of the 32 counters.

However, above were the counts for just one key; this can not really be regarded as representative. We are going to repeat this procedure now with some more keys.

Test Results

These are the test results for the first 1M pseudo random numbers of the Mersenne Twister 19937 algorithm, 32 bits each.

Deviation from Mean

Fig. 6: Deviation from mean for all 32 bits of the first 1M values.

Affected bit 1M random keys
Bit 31 16697465
Bit 30 16692483
Bit 29 16698023
Bit 28 16693781
Bit 27 16697553
Bit 26 16698353
Bit 25 16693012
Bit 24 16697464
Bit 23 16693497
Bit 22 16695501
Bit 21 16696010
Bit 20 16693835
Bit 19 16693142
Bit 18 16690967
Bit 17 16696933
Bit 16 16694732
Bit 15 16693283
Bit 14 16693254
Bit 13 16696063
Bit 12 16693136
Bit 11 16697684
Bit 10 16696468
Bit 9 16691867
Bit 8 16692805
Bit 7 16695621
Bit 6 16689653
Bit 5 16694045
Bit 4 16691465
Bit 3 16693177
Bit 2 16692075
Bit 1 16691613
Bit 0 16694428
Mean 16694355.88
Variance 166611930
Std Dev 16694355.88±12907.82
RSD 0.08%