Dictionary (Radix Tree)
|
URI: |
http://herbert.gandraxa.com/herbert/dic.asp |
|
Link template: |
<a href="http://herbert.gandraxa.com/herbert/dic.asp">Dictionary (Radix Tree)</a> |
|
Link symbols: |
|
Dictionary (Radix Tree)
|
URI: |
http://herbert.gandraxa.com/herbert/dic.asp |
|
Link template: |
<a href="http://herbert.gandraxa.com/herbert/dic.asp">Dictionary (Radix Tree)</a> |
|
Link symbols: |
|
Table of Contents
Home »
Linoleum Source Code
» Dictionary (Radix Tree)
Source Code Library: The implementation in the cross-platform Assembler Linoleum.
Source Code Test Program Random Entries: Test program using the library, visualizing node quantity and memory usage (uses also the
RNG library to create random numbers for the length and content of keys as well as the length of random nodes).
Source Code Test Program Word Entries: Similar to above test, but operating on English words in ASCII rather than on random bitstreams.Next to the RNG library also the
BST library is used. Furthermore you need a wordlist file.
Source Code Test Sort File: This program is demonstrating a dictionary traversal with Get First/Get Next loops. It takes an unsorted wordlist and creates an output file with a sorted wordlist and additional information about the original location in the unsorted wordlist.Download as a WinZip file see below.
This DIC library implements a Dictionary with variable-length keys and variable-length user data ("payload" of an entry) in the
Linoleum Programming Language, a cross-platform Assembler.
Internally, the dictionary is kept as a binary radix tree with coalesced prefixes, in some literature also called PATRICIA trees (PATRICIA is an acronym standing for "Practical Algorithm to Retrieve Information Coded in Alphanumeric").
The main focus was on a very fast traversal in lexicographical order as realized in First/Next loops, regardless of the nature of keys and the lexicographical order the keys were inserted. All standard operations like adding an entry, removing an entry, finding an entry by key are supported.
The DSA library manages the dynamic memory requests, of which there are plenty: the radix tree is kept as tight as possible all the time, removing superfluous internal nodes by coalescing them if possible, and by creating nodes only when inavoidably needed. This approach ensures fastest possible location for any given leaf. User data is kept in separate memory chunks.
Dictionary of Algorithms and Data Structures, maintained by Paul E. Black (
NIST)
Radix Tree on WikipediaRadix Tree. Within the memory area dedicated to the dictionary and maintained by the DSA, the DIC library establishes a Radix Tree to keep the nodes of the provided entries such, that the tree integrity is never compromised. User data (the entries' payload) is stored separated from Radix Tree nodes.
Key. Keys of dictionary entries may be of any length between 1 bit and 1 Gigabit, provided that your system has the required RAM. Note, that the keys effectively do not need to have a byte- or word-boundary: this makes the library also useful for processing pure bitstreams.
Bitstreams. All keys are treated and must be provided as left-aligned bitstreams, which is not native to
Little Endian systems like Intel's. The differentiation between the different "endiannesses" is important, because the DIC library needs the key to be delivered such, that the most significant bit of the first unit is treated as the very first bit of the key. The
BST library (Bitstream Library) helps to deal with the conversion, if a such is needed: particularly is it able to convert a representation of a character set with any symbol width to an output set of any desired width with optionally converting between Small and Big Endian.
Data. The user data may consist of any desired length, and may vary in length for individual entries: the size of the payload ("objects") in fact is only limited by the amount of available RAM. As a measure for the user data serve units at 32 bits each. User data may be structured or unstructured.
Standard Operations. The operations you expect from a dictionary are all implemented: Inserting new nodes with a new (unique) key, retrieving and removing nodes by (existing) key. The Radix Tree is kept as tight as possible during all these operations.
Lexicographical Access. Get First/Get Next loops are supported: whenever a node was successfully retrieved by whatever method, Get Next will deliver the entry which is the next in lexicographical order. Shorter bitstreams are considered smaller than longer bitstreams, which is the usual approach in dictionaries: for example will "Herb" be delivered before "Herbert", "magic" before "magical" etc.
Branch Prediction. For direct access by a provided key, the DIC library features a software branch prediction: often used entries will be found more quickly than less often requested ones, because a hardware data prefetch issued at the right time will have the more frequent branch ready in the CPU's L1 cache before it actually may be needed without affecting normal throughput too much. This feature provides substantial performance gains especially with text based keys.
Linearization. To assist Get First/Get Next loops with multiple passes at fast speed, an existing tree can be re-arranged anytime in physical memory such, that cache misses are minimized. This process is called linearization and does not affect the logical tree structure. Get First/Get Next loops on linearized dictionaries can provide subsequent nodes in a speed of typically less than 250 CPU cycles per Entry (with caches providing small cache lines of 32 Bytes only, as is typical for Intel Celerons: average cycles may be far less for cache lines of 64 Bytes, which would be typical for e.g. Pentium 4 CPUs).
The following ZIP file contains both the library's source (DIC.txt) as well as several test programs as listed above.
On an Intel Celeron D 346 wth 3,066 MHz, running Windows 2000 Professional, the DIC library provided the times for processing wordlists [1] of 173,526 entries as per following tables. Note the effects of applied Linearization.
The times in the tables are given in milliseconds (ms) and are valid for 173,526 of those operations mentioned in the column's header. Note, that "Add Entry" and "Remove Entry" in average operate on the half list [2], whereas "Find Entry" as well as "Get First"/"Get Next" operate on the whole list.
There were 12 test series, for which all measured times are given in the rows labeled "1" to "12". The times vary because Windows feels that some tasks must be done exactly now; however, none of the services were shut down to artificially enhance throughput: thus the times given are "real life" times. The row labeled "Average" is the average time of the 12 test series.
"Entries per second" is an extrapolated value, and as such it must be treated with a certain caution, because performance greatly depends on the number of existing nodes: it is very well possible, that all operations perform significantly faster on an in average smaller corpus, and accordingly slower on a considerably bigger dictionary.
The columns labeled "With Linearize" repeat the former tests, but including an issued Linearize function just after filling up the whole dictionary with "Add Entry". The performance change when compared with the values given in the columns labeled "Without Linearize" can be seen in the row labeled "Effect of 'Linearize'". Note, that First/Next loops over the whole dictionary speed up quite tremendously by making most efficient use of the CPU's cache [3]: in fact, there is no difference anymore between presorted input and non-lexicographical input after linearization. This could be expected, as the linearizer was written with this intent. The average of 12.7 ms to access the whole dictionary in lexicographical order means for a 3.066 GHz CPU, that a mere 224 CPU cycles were needed to return an entry's variable-length key and the address of its user data record, including the tasks which Windows felt to do during that time [4].
Notes:
[1] The word lists are available from there:
sorted.txt (from Berkeley) and
shuffled.txt.
[2] This is because "Add Entry" has 0 entries to consider when inserting the first word, but must find it's logical position within a corpus of 173,525 existing entries when inserting the last word; similarly "Remove Entry" must find the first entry to be deleted among 173,526 still existing entries, whereas it has just to consider one single entry when removing the last entry.
[3] Also note, that "Remove Entry" may suffer from an applied linearization: this is due to the fact that linearization rearranges nodes to be kept together, but leaves user data where they originally were allocated, which is in the immediate physical neighborhood of the leaf to which they belong. Since removing an entry means also physically removing the memory allocated for user data, we have an increased chance for cache misses when comparing with the unlinearized arrangement. However, this applies to presorted input only, where the traversal speed is fast anyway: when the nodes are far apart, as is the case in feeding keys in non-lexicographical order, then linearization leads to a considerably faster traversal speed which ultimately outweighs the cache misses.
[4] The time of 224 CPU cycles includes memory transfers from physical RAM to the CPU's L1 cache. The tests were performed on an Intel Celeron Series D 346, which has a L1 cache line of only 32 Bytes. A standard Pentium 4 has a cache line of 64 Bytes, resulting in fewer transfer operations from physical RAM: thus it is more than likely that on a 3.066 GHz Pentium 4 the access speed will be even faster.
| "sorted.txt" | 173,526 Entries | Without Linearize | With Linearize | |||||||
| Test Series | Add Entry | Find Entry | First/Next | Remove Entry | Find Entry | First/Next | Remove Entry | |||
| 1 | 146 ms | 205 ms | 29.5 ms | 226 ms | 196 ms | 13.5 ms | 251 ms | |||
| 2 | 141 ms | 199 ms | 29.4 ms | 226 ms | 192 ms | 12.9 ms | 247 ms | |||
| 3 | 137 ms | 202 ms | 30.2 ms | 227 ms | 192 ms | 12.5 ms | 250 ms | |||
| 4 | 141 ms | 206 ms | 29.2 ms | 221 ms | 197 ms | 12.9 ms | 245 ms | |||
| 5 | 148 ms | 210 ms | 28.9 ms | 223 ms | 193 ms | 13.0 ms | 251 ms | |||
| 6 | 142 ms | 204 ms | 29.6 ms | 221 ms | 188 ms | 12.6 ms | 246 ms | |||
| 7 | 141 ms | 201 ms | 29.1 ms | 222 ms | 197 ms | 12.8 ms | 246 ms | |||
| 8 | 141 ms | 201 ms | 28.8 ms | 222 ms | 196 ms | 12.6 ms | 246 ms | |||
| 9 | 142 ms | 200 ms | 29.3 ms | 222 ms | 192 ms | 12.4 ms | 252 ms | |||
| 10 | 140 ms | 205 ms | 29.5 ms | 222 ms | 197 ms | 13.3 ms | 246 ms | |||
| 11 | 141 ms | 205 ms | 28.9 ms | 227 ms | 194 ms | 10.1 ms [1] | 245 ms | |||
| 12 | 141 ms | 201 ms | 35.1 ms | 223 ms | 196 ms | 12.9 ms | 247 ms | |||
| Average | 142 ms | 203 ms | 29.8 ms | 224 ms | 194 ms | 12.6 ms | 248 ms | |||
| Entries per second | 1,224,169 | 853,756 | 5,824,649 | 776,403 | 893,696 | 13,744,634 | 700,643 | |||
| Effect of "Linearize" | +5% | +136% | -10% | |||||||
[1] Windows had nothing else to do?
| "shuffled.txt" | 173,526 Entries | Without Linearize | With Linearize | |||||||
| Test Series | Add Entry | Find Entry | First/Next | Remove Entry | Find Entry | First/Next | Remove Entry | |||
| 1 | 600 ms | 671 ms | 54.2 ms | 911 ms | 475 ms | 14.0 ms | 835 ms | |||
| 2 | 596 ms | 670 ms | 58.2 ms | 907 ms | 485 ms | 12.6 ms | 831 ms | |||
| 3 | 600 ms | 675 ms | 53.1 ms | 905 ms | 469 ms | 13.0 ms | 860 ms | |||
| 4 | 598 ms | 679 ms | 58.0 ms | 908 ms | 470 ms | 12.3 ms | 838 ms | |||
| 5 | 598 ms | 670 ms | 53.7 ms | 915 ms | 468 ms | 12.0 ms | 836 ms | |||
| 6 | 596 ms | 676 ms | 54.2 ms | 908 ms | 469 ms | 12.7 ms | 834 ms | |||
| 7 | 598 ms | 672 ms | 58.6 ms | 906 ms | 474 ms | 12.8 ms | 841 ms | |||
| 8 | 596 ms | 676 ms | 52.8 ms | 906 ms | 471 ms | 13.0 ms | 843 ms | |||
| 9 | 592 ms | 681 ms | 53.3 ms | 907 ms | 472 ms | 14.0 ms | 837 ms | |||
| 10 | 605 ms | 676 ms | 53.9 ms | 909 ms | 473 ms | 12.6 ms | 838 ms | |||
| 11 | 591 ms | 675 ms | 53.3 ms | 908 ms | 470 ms | 12.5 ms | 836 ms | |||
| 12 | 598 ms | 676 ms | 53.9 ms | 909 ms | 468 ms | 12.6 ms | 834 ms | |||
| Average | 597 ms | 675 ms | 54.8 ms | 908 ms | 472 ms | 12.8 ms | 839 ms | |||
| Entries per second | 290,501 | 257,171 | 3,168,460 | 191,055 | 367,640 | 13,512,732 | 206,928 | |||
| Effect of "Linearize" | +43% | +326% | +8% | |||||||
The library was thoroughly tested in a long-time test with 6.75 billion random insertions and deletions (scenario I) and 3.9 billion random insertions and deletions (scenario II). No errors occured during these tests.
Fig. 1: The DIC test visualization in action (test program
Dic Test Random Entries). The two upper bars reflect existing leaves and nodes, the next one branch prediction quality (unsurprisingly 50% for random bitstreams), and the two lower bars measure memory properties (the cyan bar is fragmented memory in relation to the available space). The 4 rows with squares are a "graphical loop counter" and reflect the current loop iteration (yellow millions to magenta billions).
Fig. 2: Test program
Dic Test Word Entries. Note, that there are much fewer internal nodes per leaf when compared with the random keys from the previous test. Also note, how the branch predictor is able to work, reaching 69% eventhough the entries were not read one single time. When reading entries, the branch predictor is trained and will display considerably higher values.