Dictionary (Radix Tree) Source Code Test Program Sort File
|
|
URI:
|
http://herbert.gandraxa.com/herbert/dic_test_sort.asp
|
|
Link template:
|
<a href="http://herbert.gandraxa.com/herbert/dic_test_sort.asp">Dictionary (Radix Tree) Source Code Test Program Sort File</a>
|
|
|
Link symbols:
|
On current page |
On this site |
On external site |
Wikipedia article |
ZIP archive |
PDF |
E-Mail
|
Article
Organization
Home »
Linoleum Source Code »
Dictionary (Radix Tree) »
Source Code Test Program Sort File
Scope
This is the source code of the Dictionary (Radix Tree) Test Program to sort a file, testing the Dictionary (Radix Tree) library, implemented in the
Linoleum programming language.
Author
Herbert Glarner
Usage
This test program demonstrates the ability of the
Dictionary (Radix Tree) library to quickly perfoem Get First/Get Next loops, thereby sorting a file with shuffled words lexicographically.
Make sure you have the library named dic.txt in the same folder as this test program before assembling the latter.
To make use of this test, you also need a wordlist file with shuffled entries:
shuffled.zip (831 kB). Save this file as "shuffled.txt" in the same directory as this program.
Source Code
(Test application, using the Dictionary Routines "DIC.txt".)
(NOTE: Needs the file 'shuffled.txt' in the directory in which you use this
test program. It is available at http://herbert.wikispaces.com/DIC both as
plain vanilla ASCII text file or also zipped.)
(WARNING: Creates a file 'DICtest.txt' in the directory in which you use this
test program.)
(*****************************************************************************)
(Test Scenario:
1. Reads in the file 'shuffled.txt'
2. Adds the words into the dictionary, as userdata we store the line in which
the word was found.
3. Traverses the dictionary in a Get First/Get Next loop and outputs the words
in a file 'DICtest.txt', including the userdata, i.e. the line in which the
word was in the original file
4. Writing the file 'DICtest.txt' into the same dictionary as this file.
)
(*****************************************************************************)
"libraries"
(=========================================================================)
(Dictionary; the library which is tested with this test program.)
(-------------------------------------------------------------------------)
DIC;
(=========================================================================)
(=========================================================================)
(Bitstream Generator)
(-------------------------------------------------------------------------)
(Used to extract words from the wordlist. Since the wordlist is an ASCII
file and we read it into workspace units, we also need to convert from
Little Endian. This all can be done by the Bitstream Generator. Available
at http://herbert.wikispaces.com/BST)
(-------------------------------------------------------------------------)
BST;
(=========================================================================)
(*****************************************************************************)
"directors"
program name = { DIC Test };
unit = 32;
(*****************************************************************************)
"constants" (Convention: "DIC ...", UPPER CASE)
(=========================================================================)
(Wordlist)
(-------------------------------------------------------------------------)
(The testfile is a list of English plain ASCII words. It can be downloaded
from xxx. It is 1,923,517 bytes large
including CR/LF characters at the end of each word. It contains 173,528
different words.)
(-------------------------------------------------------------------------)
(The original file 'shuffled.txt' is 1 923 517 Bytes long. Four bytes each
occupy a unit, thus the final 'div 4'. Of the last unit only 1 byte is used:
it is the 0Ah byte of the newline sequence 0D0Ah after the last word. We
need some space to read in the file into one go, including that last byte.
The following 3 bytes are 00h each. Because this very space is also used to
create the output file, we make the space somewhat larger. 7 mtp 173528
/bytes/ are needed to additionally store a 'Tab/xxxxxx' sequence when we
ultimately use this space to output a sorted file with some userdata. The
'xxxxxx' are 1...6 digits to indicate the line where the word was in the
input file. And the 'minus n' are the missing digits in the lower numbers.
It's exact to the last byte, just trust me ;)
DIC TEST FILENAME SIZE = 7 mtp DIC TEST MAX ENTRIES
minus 99999 minus 9999 minus 999 minus 99 minus 9
plus 1 923 517 plus 3 div 4; (Space in units)
(-------------------------------------------------------------------------)
(Longest word size in units. Without really checking, I assume that the
longest word within the test file does not exceed 9 units = 36 ASCII
characters = 288 bits.)
DIC TEST MAX WORD SIZE = 9;
(=========================================================================)
(=========================================================================)
(Requested memory area for our Dictionary)
(-------------------------------------------------------------------------)
DIC TEST MAX ENTRIES = 173 528; (All words)
DIC TEST RAM SIZE = 88 mtp DIC TEST MAX ENTRIES;
(=========================================================================)
(*****************************************************************************)
"variables" (Convention: "dic ...", lower case)
(=========================================================================)
(Name of the file we want to process)
(-------------------------------------------------------------------------)
dic test filename = { shuffled.txt }; (From here we read)
dic test sortname = { DICtest.txt }; (This we create)
(=========================================================================)
(=========================================================================)
(Just a counter for the conversion of hexnumbers to ASCII digits)
(-------------------------------------------------------------------------)
dic test signif digits = 0; (Number of significant digits)
(=========================================================================)
(*****************************************************************************)
"workspace" (Convention: "dic ...", lower case)
(=========================================================================)
(Space to read in the whole wordlist)
(-------------------------------------------------------------------------)
dic test file = DIC TEST FILENAME SIZE;
(=========================================================================)
(=========================================================================)
(This is the 'object' vector. It contains just 1 unit and represents the
line of the file 'shuffled.txt' in which we originally found the particular
word.)
(-------------------------------------------------------------------------)
dic test object = 1;
(=========================================================================)
(=========================================================================)
(Into the following vector the actual key is stored when dealing with
the dictionary.)
(-------------------------------------------------------------------------)
dic test key = DIC TEST MAX WORD SIZE;
(=========================================================================)
(=========================================================================)
(Some space for creating a bitstream for individual words.)
(-------------------------------------------------------------------------)
dic test bitstream = DIC TEST MAX WORD SIZE;
(=========================================================================)
(*****************************************************************************)
"programme" (Convention: "Dic ...", Mixed case)
(*************************************************************************)
(I. SETTING UP THE TEST ENVIRONMENT)
(*************************************************************************)
(=========================================================================)
(Providing some RAM for the Dictionary.)
(-------------------------------------------------------------------------)
D = DIC TEST RAM SIZE; (Number of units)
=> Dic Create;
? ok -> Dic Test Dic Created;
A = 1; (Testing phase)
B = [dic error]; (What error occured)
-> Dic Test Stop;
"Dic Test Dic Created"
(=========================================================================)
(=========================================================================)
(Reading in the testfile. It contains 8 bit ASCII characters. There are
1,923,517 bytes in the file.)
(-------------------------------------------------------------------------)
(Check if file exists)
[File Name] = dic test filename;
[File Command] = TEST;
isocall;
? ok -> Dic Test File Found;
A = 2; (Testing phase)
[dic error] = 100000; (Pseudo Error Code)
-> Dic Test Stop;
"Dic Test File Found"
(-------------------------------------------------------------------------)
(Check if we may read the file)
? [File Status] + PERMIT TO READ -> Dic Test File Readable;
A = 2; (Testing phase)
[dic error] = 100001; (Pseudo Error Code)
-> Dic Test Stop;
"Dic Test File Readable"
(-------------------------------------------------------------------------)
(The testfile is read into RAM.)
[Block Pointer] = dic test file; (Buffer holding the whole file)
[File Command] = READ;
[File Position] = ZERO;
[Block Size] = DIC TEST FILENAME SIZE mtp BYTESPERUNIT;
isocall;
? ok -> Dic Test File Read;
A = 2; (Testing phase)
[dic error] = 100002; (Pseudo Error Code)
-> Dic Test Stop;
"Dic Test File Read"
(=========================================================================)
(*************************************************************************)
(II. FILLING THE DICTIONARY)
(*************************************************************************)
(=========================================================================)
(On Intels, the just read in file is in Little Endian. We need to convert
the file into Big Endian, such that the bytes can be read from 'left to
right'. Also we want individual words in their own slot of the reserved
workspace 'at dic test node mgmt'. There are DIC TEST MAX ENTRIES such
slots, numbered from 0 to DIC TEST MAX ENTRIES-1. Each slot has a size of
1+DIC TEST MAX WORD SIZE units. The first unit holds the word's size in
bits, which always is a multiple of 8, since we process bytes. The remaining
DIC TEST MAX WORD SIZE units hold the bitstream of this word, i.e. the
left-aligned bytes eradable from left to right.)
(-------------------------------------------------------------------------)
(Initialising a bitstream area. The resulting bitstream has 8 bits per
character, because most applications won't fiddle around with the bits.
However, as the input file I used contains plain vanilla ASCII characters,
a 7-bit bitstream could be used: if you want to test this out, set LSW to 7.)
A = 8; (PSW: Characters are in 8 bit units)
B = 8; (LSW: using 8 bit bitstreams)
C = 13; (End-of-symbols symbol is 'CR' = 0Dh = 13)
=> Bst Init;
? ok -> Dic Test Bst Created;
A = 3; (Testing phase)
[dic error] = 100000; (Pseudo Error Code)
-> Dic Test Stop;
"Dic Test Bst Created"
(-------------------------------------------------------------------------)
(Make a bitstream from the first word.)
E = dic test file; (Address to unit with start of 1st word)
D = dic test key; (Address of outstream buffer)
=> Bst Stream;
(We need a word counter to provide some user data: the line in which the
word was found first.)
E = 1; (First word was processed)
(-------------------------------------------------------------------------)
(Starting at address 'dic test key' there is a bitstream of A bits. We feed
it to the dictionary.)
-> Dic Test Add To Dic;
(-------------------------------------------------------------------------)
(Now loop through the whole file, processing word by word. Make a bitstream
from the next word.)
"Dic Test Stream Word"
C = 1; (Symbols to skip: We don't want the LF)
D = dic test key; (Address of output bitstream)
=> Bst Resume;
E + 1; (Next word was processed)
(-------------------------------------------------------------------------)
(Starting at address 'dic test key' there is a bitstream of A bits. We feed
it to the dictionary.)
"Dic Test Add To Dic"
[dic test object] = E; (User data = the line where the word is)
A = dic test key; (Address where the bitstream is)
B = [bst bits]; (Number of bits this word has)
C = dic test object; (Address of the payload)
D = 1; (Always just 1 unit in this test program)
=> Dic Add Entry;
? ok -> Dic Test Add To Dic Added; (Process next word)
A = 4; (Testing phase)
-> Dic Test Stop;
(-------------------------------------------------------------------------)
(Looping until all words processed.)
"Dic Test Add To Dic Added"
? E < DIC TEST MAX ENTRIES -> Dic Test Stream Word;
(=========================================================================)
(*************************************************************************)
(III. TRAVERSING THE DICTIONARY IN LEXICOGRAPHICAL ORDER)
(*************************************************************************)
(=========================================================================)
(The dictionary contains all 173,528 words of 'shuffled.txt' now: One
entry per word. We don't need the work area starting at 'dic test file'
anymore, so we re-use it to 'build' the output file. We simply transfer
the returned bitstream starting at address A to the next free BYTE position
of the work area starting at 'dic test file'. This might look somewhat
tricky, because we have to deal with Byte/Unit alignments, but is is a lot
faster than issuing 173,528 superfluous isocalls. Fortunately, we can make
use of the library's 'Dic Reduce Key' function. What is that? Well it shifts
bits out to the left of several adjacent units, and that comes in very handy
for our task. Check the code, if you are familiar with shifts and shift-
rotates. If not, better don't: it will give you a head-ache. Anyway: Register
D shall be the unit in which to write, E the left-aligned bit offset 0...31
of that unit.)
(-------------------------------------------------------------------------)
D = dic test file; (Start of workarea for sorted file)
E = 0; (Begin with 1st bit of that unit)
(=========================================================================)
(=========================================================================)
(A Get First/Get Next traversal returns them in alphabetical order, which
is what we want.)
(-------------------------------------------------------------------------)
A = dic test key; (Address where the bitstream has to go)
=> Dic Get First; (B: key length, C: user data address)
? ok -> Dic Test Got Word; (Process next word)
A = 5; (Testing phase)
-> Dic Test Stop;
"Dic Test Got Word"
(-------------------------------------------------------------------------)
(Composing a line to be written into the new file. Because this loop shall
mainly demonstrate Get First/Get Next dictionary traversal, this task is
done in a routine.)
=> Dic Test Write Line;
(-------------------------------------------------------------------------)
(Get the next word)
A = dic test key; (Address where the bitstream has to go)
=> Dic Get Next; (B: key length, C: user data address)
? ok -> Dic Test Got Word; (Process next word)
(Possibly an error, but if it is DIC ERR END OF DICTIONARY then we only
have no more words to process and thus all is fine and we go to write the
file.)
? [dic error] = DIC ERR END OF DICTIONARY -> Dic Test File Ready;
(Yes, this is a real error.)
A = 5; (Testing phase)
-> Dic Test Stop;
(=========================================================================)
(=========================================================================)
(IV. WRITING THE FILE)
(-------------------------------------------------------------------------)
"Dic Test File Ready"
(To output the new file, we need to convert what is now a bitstream into
Intel's Little Endian format.)
E = dic test object;
E - dic test file; (So many units has the file area)
D = dic test file; (And here it starts)
"Dic Test Add Convert"
A = [D]; (aabbccdd)
B = A; (aabbccdd)
A <@ 8; (bbccddaa)
B @> 8; (ddaabbcc)
A & 00FF00FFh; (00cc00aa)
B & FF00FF00h; (dd00bb00)
A | B; (ddccbbaa)
[D] = A; (Write back)
D + 1; (Next unit)
E ^ Dic Test Add Convert;
(-------------------------------------------------------------------------)
(Let's output the new sorted file)
[File Name] = dic test sortname; ('output.txt')
[File Command] = WRITE;
[File Position] = ZERO; (from the beginning of file)
[Block Pointer] = dic test file;
[Block Size] = DIC TEST FILENAME SIZE mtp 4; (in Bytes )
isocall;
(=========================================================================)
(*************************************************************************)
(IV. TERMINATING)
(*************************************************************************)
(=========================================================================)
(Address we jump if all was fine.)
(-------------------------------------------------------------------------)
"Dic Test Exit"
(Destroying the Dictionary. No input parameters required. No result
provided. No errors possible. Still needs to be called to return the
allocated memory to the OS.)
=> Dic Destroy;
(-------------------------------------------------------------------------)
(Output Some statistical data.)
A = 99; (A = 99 = all ok, otherwise failure)
show registers;
(=========================================================================)
(=========================================================================)
(Address we jump to if there is an error. In A is the testing phase, in
B is the error number.)
(-------------------------------------------------------------------------)
"Dic Test Stop"
show registers;
(=========================================================================)
(*****************************************************************************)
"Dic Test Write Line"
(=========================================================================)
(This routine creates a line of the output file. It is a routine to keep
the codeflow legible and to hide the ugly technical details of writing a
line into a file.)
(-------------------------------------------------------------------------)
(Move B bits from address starting at A to left-aligned bit offset E
starting at address D; append a TAB symbol to separate the word from the
user data; append the content of the user data unit C; and append a CR/LF
sequence to terminate the line. Adjust D and E to the new locations.)
(-------------------------------------------------------------------------)
(We have B bits to copy from [A] to [D]. B can be a large number > 32.
E has the number of already occupied bits in [D]. This is eactly what the
internal DIC routine 'Dic Modify Key' can handle perfectly.)
A -->; B -->; C -->; E -->;
C = B; (Replace so many bits after the initial B bits)
B = E; (Skip this number of bits, replace the following bits)
E = A; (Address of Input Buffer, bitstream is left aligned)
A = D; (Address of Output Buffer, bitstream is left aligned)
=> Dic Modify Key;
<-- E; <-- C; <-- B; <-- A;
(-------------------------------------------------------------------------)
(Adjust the address D and offset E. For this we calculate the bit offset
of the memory address.)
D < 5; D + E;
(To this the number of just inserted bits is added.)
D + B;
(D mod 32 is the new E, and D/32 the new D.)
E = D; E & 11111b; D > 5;
(-------------------------------------------------------------------------)
(The word was already inserted into the buffer at 'dic test file'. Now we
use the buffer 'dic test bitstream' to construct the rest of the line.)
A = dic test bitstream;
[A plus 0] = 09 000000h; (TAB char, rest clear for max 3 digits)
[A plus 1] = 00000000h; (Clear to possibly accept 3 digits+CR)
[A plus 2] = 00 000000h; (Clear to possibly accept LF)
B = 24;
(-------------------------------------------------------------------------)
(To Address A we want to output an ASCII string with decimal digits. The
address with the userdata is in C. Its first unit contains the number.)
(-------------------------------------------------------------------------)
D -->; E -->;
C = [C]; (This is the number we want to convert)
(Note that there are far better algorithms than this clumsy approach.)
E = 100 000; (Max possible digit for 173,528 words)
B = 16; (Shift value, skipping the TAB)
[dic test signif digits] = 0; (No significant digits yet)
"Dic Test Hex To Asc Next Digit"
D = C;
D / E;
(If already significant digits, then process also when 0)
? [dic test signif digits] > 0 -> Dic Test Hex To Asc Out;
? D = 0 -> Dic Test Hex To Asc No Out;
"Dic Test Hex To Asc Out"
[dic test signif digits] + 1; (One more significant digit)
(The digit D needs to be output as an ASCII character)
C % E; (Remove that digit)
D + 30h; (The digits ASCII code)
D < B; (Shift into position of unit)
[A] | D; (Insert into unit)
(Prepare to accept next digit)
B - 8; (To the right of the inserted digit)
? B >= 0 -> Dic Test Hex To Asc Has Space;
A + 1; (Starting at next unit)
B = 24; (At leftmost position)
"Dic Test Hex To Asc Has Space"
"Dic Test Hex To Asc No Out"
E / 10; (Next digit, if any)
? E > 0 -> Dic Test Hex To Asc Next Digit;
(-------------------------------------------------------------------------)
(Now we append a CR/LF sequence to skip to the next line. B tells us where
the sequence has to go in the actual unit [A]. B can be 24, 16, 8 or 0. If
it is 8 or larger, then the whole sequence has space in [A], if it is 0,
then only the CR fits in [A] and LF must go into [A plus 1].)
? B = 0 -> Dic Test Split Linefeed;
D = 0D0Ah; (CRLF)
B - 8; (We move 2 chars into position)
D < B; (Shift into position of unit)
[A] | D; (Insert into unit)
-> Dic Test Userdata Complete;
"Dic Test Split Linefeed"
[A] | 0Dh; (Only CR fits into unit)
[A plus 1] | 0A 000000h; (LF goes into next unit)
"Dic Test Userdata Complete"
<-- E; <-- D;
(-------------------------------------------------------------------------)
(Append all that data to the output file)
A = dic test bitstream; (Contains TAB, some digits, CR, LF)
B = [dic test signif digits]; (So many digits)
B + 3; (Plus 3 characters for TAB, CR, LF)
B < 3; (So many bits)
B -->; E -->;
C = B; (Replace so many bits after the initial B bits)
B = E; (Skip this number of bits, replace the following bits)
E = A; (Address of Input Buffer, bitstream is left aligned)
A = D; (Address of Output Buffer, bitstream is left aligned)
=> Dic Modify Key;
<-- E; <-- B;
(-------------------------------------------------------------------------)
(Adjust the address D and offset E. For this we calculate the bit offset
of the memory address.)
D < 5; D + E;
(To this the number of just inserted bits is added.)
D + B;
(D mod 32 is the new E, and D/32 the new D.)
E = D; E & 11111b; D > 5;
(-------------------------------------------------------------------------)
leave;
(=========================================================================)
(*****************************************************************************)
Linoleum Syntax Highlighting produced with LSH (© 2007 by Herbert Glarner)