Article
Organization
Home »
Linoleum Source Code »
Bitstream Generator »
Source Code Library
Scope
This is the source code of the Bitstream Generator library, implemented in the
Linoleum programming language.
Author
Herbert Glarner
Source Code
(Bit Streams "BST.txt" 1.1 - Copyright 2006 Herbert Glarner)
(*****************************************************************************)
(=========================================================================)
( Copyright Herbert Glarner [2006]
Mail: herbert.glarner@bluewin.ch
Free usage granted for all purposes, including commercial, provided
that above credits are given.
Suggestions for performance improvement welcome.
No liability for any damage usage may cause. )
(=========================================================================)
(=========================================================================)
(Converts input data into a bitstream. With respect to a bitstream, input
data is characterized by one or several of the following problems:
1. The symbols come not in expected order, due to being packed into 32 bit
words such, that the entirety of the packed symbols expose a right-to-
left order, whereas each symbol in itself still maintains a left-to-
right order.
2. The individual input symbols {letters, numbers, ...} are of an
unsuitable size, mainly by exposing an unwanted number of prefixed
Zero bits;
3. Symbols do not neccessarily need to respect 32-bit unit alignments nor
8-bit byte alignments. For example do 24-bits encodings respect byte
alignments, but not unit alignments. A 7 bit encoding respects neither
of the two.
4. We are faced with different conventions of reading files into the RAM,
i.e. we must distinguish between 'little endian' and 'big endian'.)
(-------------------------------------------------------------------------)
(INPUT DATA PROBLEMS:
Files on Harddisk essentially are bitstreams: Seen from the file, we don't
have any of above problems, with the exception of problem #2.
But as soon as we read such a file into units, all of aforementioned
problems may become manifest.
This is an example of a very trivial ANSI file being read into a Linoleum
Workarea starting at n. The file's ANSI characters require 8 bits each.
They are packed into consecutive 32 bit units, filling them from right-to-
left. Suppose, there exists an ANSI file with just the following content
{where '\0' marks a NUL byte, i.e. eight consecutive zero bits}:
'Yvonne\0'
PROBLEM 1.
Linoleum will read the bits into units as follows:
B3 B2 B1 B0 = Bytes in Unit
+-------------+ Within each unit, the symbols are
| n o v Y | Unit n not suitably ordered to be used as
+-------------+ a consective bitstream: they are
| \0 \0 e n | Unit n+1 right-to-left. But the bits in the
+-------------+ symbols are left-to-right.
Let's have a closer look at the first unit, this time at the bit level.
Byte 3 Byte 2 Byte 1 Byte 0
'n' 'o' 'v' 'Y'
+-----------+-----------+-----------+-----------+
| 0110 1110 | 0110 1111 | 0111 0110 | 0101 1001 | Unit n
+-----------+-----------+-----------+-----------+
^ ^ ^ ^
| | | |
+-----------+-----------+-----------+
|
The MSB is to the byte's left, i.e. each symbol's bitstream runs from
left-to-right.
PROBLEM 2.
If we were to build a compact bitstream, and knew, that the ANSI file did
not really contain ANSI characters but pure 7-bit ASCII characters, then we
would waste 1 bit per character {the MSB}. This seems not be that severe,
but in fact it still is 12.5% and sums up as the symbols become many,
especially if those ASCII symbols were to be used as bitstream-keys of
specialized data stuctures like 'Patricia Tries' or 'Suffix Tries'.
Byte 3 Byte 2 Byte 1 Byte 0
'n' 'o' 'v' 'Y'
+-----------+-----------+-----------+-----------+
| 0110 1110 | 0110 1111 | 0111 0110 | 0101 1001 | Unit n
+-----------+-----------+-----------+-----------+
^ ^ ^ ^
| | | |
+-----------+-----------+-----------+
|
The MSB is always 0 in 7-bit ASCII.
The problem is exposed to a far larger extent, when we look at the way on
how we save strings in Linoleum: Now we in fact use 7-bit ASCII characters
and store each character into an individual 32-bit ! unit.
Byte 3 Byte 2 Byte 1 Byte 0
+-----------+-----------+-----------+-----------+
'Y' | 0000 0000 | 0000 0000 | 0000 0000 | 0101 1001 | Unit n
+-----------+-----------+-----------+-----------+
'v' | 0000 0000 | 0000 0000 | 0000 0000 | 0111 0110 | Unit n+1
+-----------+-----------+-----------+-----------+
'o' | 0000 0000 | 0000 0000 | 0000 0000 | 0110 1111 | Unit n+2
+-----------+-----------+-----------+-----------+
'n' | 0000 0000 | 0000 0000 | 0000 0000 | 0110 1110 | Unit n+3
+-----------+-----------+-----------+-----------+
^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^^^^ ^
|||| |||| |||| |||| |||| |||| |
++++-++++---++++-++++---++++-++++---+
|
Here we waste 25 of 32 bits, or more than 78%.
The two aforementioned problems 1 and 2 can be immanent in a variety of
mixed cases. One would be the storage of each symbol within 16 bits, as is
often the case for several Microsoft-produced files, which use 16-bit-
'Unicode' {quoted, because real Unicode can not be represented with just
16 bits, there's only room for a subset of Unicode}. Again, we have the
right-to-left problem, and furthermore, if the content of such a file was
known to be ANSI-coded in reality, this would be a waste of 50% with
respect to a bitstream.
PROBLEM 3.
Another problem is the fact, that the input by no means is restricted to
symbols being 8, 16 or 32 bits wide. In fact, it is very well thinkable to
have a file which was output in 7-bit ASCII, mainly for the reason of
saving on file size. Such a file usually would not be readable with
standard commercial programs, but applications emitting them very well can
and do process their own output. This library is able to process them as
well.
Let's assume our mini-file from above being in 7-bit ASCII now, not in
8-bit ANSI anymore {this implies, that the final \0 represents a 7-bit NUL}.
'Yvonne\0'
The 7-bit symbol bitstream on file, from left to right, would be:
1011001 1110110 1101111 1101110 1101110 1100101 0000000
Y v o n n e \0
The bits would be stored in a file as 8-bit bytes as follows - still a
perfect bitstream:
10110011 11011011 01111110 11101101 11011001 01000000 00000000
Y v o n n e \0 Filler
But when read into Linoleum units, the content would be exposed pretty
cryptic:
+-----<-------+ +---------<---------+
| | | |
| +-------|------<----+ +-------|-----<-----+
| | | | | | | |
| +----n--+ +-----n-+ +------o+ Y-------v
| +-----------+-----------+-----------+-----------+
| | 1110 1101 | 0111 1110 | 1101 1011 | 1011 0011 | Unit n
| +-----------+-----------+-----------+-----------+
|
+-------------------->--------------------+
|
+---------<---------+ |
| | |
+-------<-----------+ +-------|-------<---+
| | | | | | |
+-------End +Filler-+ +-\0----+ +--e----+
+-----------+-----------+-----------+-----------+
| 0000 0000 | 0000 0000 | 0100 0000 | 1101 1001 | Unit n+1
+-----------+-----------+-----------+-----------+
It is obvious, that this is quite cumbersome to deal with. Not only do we
have a left-to-right bitstream in symbols and right-to-left bytes within
units, but also shifting symbol starts on byte level and symbols being
split between bytes and even units {the second 'n', for example}.
Is this merely academical? Well, 7-bit symbols are not streamed out often
these days with abundant external storage possibilities. However, 'odd'
bitstreams do very well exist. A more common example includes 24 bits
{3 byte coded bitmaps, one byte each for the color components Red, Green,
and Blue.} It also is common to output mass data gathered by realtime
systems as 4-bit BCD {Binary Coded Digits}. In any case, whether academical
or not, this library handles all the possibilities for completeness reasons,
including such awkward 7-bit encodings.
PROBLEM 4.
Different systems use different conventions when it comes to the order
in which the individual bytes of a multi-byte-symbol are represented in
RAM. There are systems which read the lowest byte before the higher bytes,
and other system which treat this vice-versa. These different approaches
are known as 'Endianness'. Note, that the Endianness relates to te
representation in RAM, not to the representation on the file!
The 32-bit symbols 'Y', 'v' in:
Big Endian:
In Linoleum Units:
B3 B2 B1 B0
+----+----+----+----+
| \0 | \0 | \0 | Y | Unit n
+----+----+----+----+
| \0 | \0 | \0 | v | Unit n+1
+----+----+----+----+
This system is also used when encoding normal strings in Linoleum,
as with 'String Variable = { Yvonne };'.
Little Endian:
In Linoleum Units:
B3 B2 B1 B0
+----+----+----+----+
| Y | \0 | \0 | \0 | Unit n
+----+----+----+----+
| v | \0 | \0 | \0 | Unit n+1
+----+----+----+----+
This is how a file's bitstream '\0\0\0Y' would be read into a Linoleum
unit on a Little Endian system like Intel or AMD.
Assuming, that Linoleum is mostly used on Intel PCs and compatibles like
AMD, Little Endian is the default.
If a Big Endian file needs to be processed, or if the keys originate from
pure Linoleum strings, then after Initialization with 'Bst Init' a special
flag needs to be set:
[bst status] | BST BIG ENDIAN
The library's routines 'Bst Stream' and 'Bst Resume' then will care for
the correct representation.)
(-------------------------------------------------------------------------)
(OUTPUT FORMAT:
This library reconstructs the original left-to-right bitstream. Whether
the input is in symbols of 32 or of 7 bits width, whether the machine uses
little or big endian, it will convert the symbols in a user-defined output
width and present the whole 'string' as a contiguous bitstream which can be
read from left to right.
Would the user choose an output width of 8 bit, then the resulting bit-
stream in /any/ case would contain:
Byte 3 Byte 2 Byte 1 Byte 0
+-----------+-----------+-----------+-----------+
| 0101 1001 | 0111 0110 | 0110 1111 | 0110 1110 | Unit n
+-----------+-----------+-----------+-----------+
Y v o n
Byte 3 Byte 2 Byte 1 Byte 0
+-----------+-----------+-----------+-----------+
| 0110 1110 | 0110 0101 | 0000 0000 | 0000 0000 | Unit n+1
+-----------+-----------+-----------+-----------+
n e \0 Filler
Would the user choose a 7 bit width:
Byte 3 Byte 2 Byte 1 Byte 0
+-----------+-----------+-----------+-----------+
| 1011 0011 | 1101 1011 | 0111 1110 | 1110 1101 | Unit n
+-----------+-----------+-----------+-----------+
Y v o n n
Byte 3 Byte 2 Byte 1 Byte 0
+-----------+-----------+-----------+-----------+
| 1101 1001 | 0100 0000 | 0000 0000 | 0000 0000 | Unit n+1
+-----------+-----------+-----------+-----------+
e \0 Filler
Note, that it would have been possible to make the whole bitstream right-
to-left, i.e. beginning at bit 0 of each unit. Since in the end the user
just needs to /know/ how the bits within a unit were streamed {here: left
to right, i.e. a unit's MSB is the first bit of the bitstream} the direc-
tion seems to be a secondary decision only. However, this is not the case.
There is quite a big disadvantage by creating a bitstream running from
right to left: we would need to revert each individual symbol's bit posi-
tions to maintain lexicographical integrity. Maintaining this integrity is
crucial, if bitstreams are to become keys of data structure implementation
relying on such. The fastest method I was able to think of to revert a
unit's bits still needs 24 Linoleum instructions, and that per symbol.
Making the stream run from left-to-right, though, there simply is no need
for such.)
(-------------------------------------------------------------------------)
(What is this all good for?
'Endless' bitstreams have their usage in many fields. Compression, pattern
recognition, image and sound processing, cryptography are among them, just
to name a few.
The actual reason for this library is some research about efficiently
storing alphanumeric information.)
(-------------------------------------------------------------------------)
(TERMINOLOGY:
Successfull processing requires knowledge about the Data formats. Within
this source, the following terms are used.
For the INPUT:
The 'PSW' {Physical Symbol Width}: It answers the question, how many bits
are needed to represent an individual symbol. ANSI files usually have a
PSW of 8. An example of 7 for an ASCII file was given above, though also
pure ASCII file usually are encoded with a PSW of 8. Some manufacturers
prefer symbol encoding in 2 bytes: use 16 for those. If you are processing
32-bit Unicode, use 32.
'Endianness': Assuming that Linoleum is mostly processed on PCs, the
default is 'Little Endian'. However, the library can process 'Big Endian':
to enforce such, set the flag BST BIG ENDIAN. Also set this flag if the
input symbols are coded in pure Linoleum strings with the {...} delimiters.
'Source': Address telling where the first input symbol is, left aligned in
the symbol width given in PSW.
'EOS' {End of Symbols}: This tells the library the symbol which, when
read in, terminates the conversion process. This in most cases is NUL,
i.e. a series of binary zeros, in the same width PSW as an ordinary symbol.
However, you are free to code whatever symbol here: sometimes a comma or
space suits the input better. Also frequently used are horizontal tab,
carriage return, line feed and more. Whatever is used: keep in mind, that
it in the end is an ordinary symbol, thus PSW bits wide.
For the OUTPUT:
The 'LSW' {Logical Symbol Width}: By no means are we bound to blindly
accept the input data's symbol width: we can change it to a smaller width,
thus enforcing a compression without any insignificant leading bits. This,
however, depends largely on what shall be done with the resulting bit-
stream. Should the bitstreams be input into your own string-processing
routine, such as a key into an implementation of a trie, then I suggest
that one uses the effective symbol size without wasting bits. If we know,
for example, that we in fact are processing 7 bit ASCII, it is recommmended
to set the LSW to 7, though: this will save a nice 12.5% on the resulting
bitstream. If, on the other hand, the data is to be output later on into a
file, then it in most cases would be unefficient to process them first as
7 bits and later on convert them once again into 8 bits: in this case
better use the expected standard widths of 8, 16 or 32 bits. As was said:
it depends on the processing task.
'Target': Address telling where the library may write the bitstream into.
It is your responsibility to provide enough space for all symbols, the
library just streams out bits into the space as long as the EOS symbol was
read.
NOTE: The library does /not/ write any EOS string after the last streamed
out symbol. Insted, the /number of bits/ starting with the MSB of the
specified target address is counted into [bst bits]. The effectively used
number of /units/ then can be calculated with '{[bst bits] + 31} > 5'.
It is guaranteed, that the LSB unused bits of the last used unit are 0.)
(=========================================================================)
(=========================================================================)
(Routines summary:
Name of routine Short description)
(-------------------------------------------------------------------------)
( PUBLIC ROUTINES:
Bst Init Telling about PSW, LSW and EOS
Bst Stream Converting units into a bitstream
Bst Resume Resuming after the last EOS symbol
INTERNAL ROUTINES:
Bst Reverse Bytes Converts a unit's 'DCBA' into 'ABCD'
Although 'Bst Reverse Bytes' is declared to be an internal routine, no
harm is done if called from the outside world for other purposes: it
just does its job in a register and leaves the other registers alone.
)
(=========================================================================)
(=========================================================================)
(Memory requirements:
13 variables = 52 bytes. When streaming, an area large enough to hold
the streamed out bits needs to be provided. If that area is too small,
anything beyond it will be overwritten, until an EOS is found: this
may crash your application.
)
(=========================================================================)
(=========================================================================)
(Version history:
07 Jul 2006: 1.0 Released.
11 Jul 2006: 1.1 Released. Outstream area is nulled now before being
used. New function 'Bst Resume" resumes streaming after
the last EOS symbol.
)
(=========================================================================)
(*****************************************************************************)
"libraries"
(*****************************************************************************)
"directors"
program name = { BST };
unit = 32;
(*****************************************************************************)
"constants" (Convention: "BST ...", UPPER CASE)
(=========================================================================)
(Error codes)
(-------------------------------------------------------------------------)
BST ERR NONE = 0; (No error)
(-------------------------------------------------------------------------)
(From 'Bst Init')
BST ERR INVALID PSW = 1; (Must be 1...32)
BST ERR INVALID LSW = 2; (Must be 1...PSW)
(=========================================================================)
(=========================================================================)
(Status flags)
(-------------------------------------------------------------------------)
(The variable [bst status] will hold some flags about the current status
of the library.)
BST STATUS INIT = 0000 0001h; (set when 'Bst Init' was successful)
BST BIG ENDIAN = 0000 0002h; (default = clear = little endian)
(Theoretically, the other bits within the [bst status] variable are
free to be used. However, usage is discouraged.)
(=========================================================================)
(*****************************************************************************)
"variables" (Convention: "bst ...", lower case)
(=========================================================================)
(Error condition as per BST ERR XXX constants.)
(-------------------------------------------------------------------------)
bst error = 0;
(=========================================================================)
(=========================================================================)
(Some flags about the current status of the library. The bits are given in
the BST STATUS XXX constants.)
(-------------------------------------------------------------------------)
bst status = 0;
(=========================================================================)
(=========================================================================)
(Saving the input and output formats as provided to 'Bst Init'.)
(-------------------------------------------------------------------------)
bst psw = 0; (Input: Physical Symbol Width)
bst lsw = 0; (Input/Output: Logical Symbol Width)
bst eos = 0; (Input: End-of-symbols symbol)
(=========================================================================)
(=========================================================================)
(The following takes a shiftable bitmask, masking PSW bits. At the start
of a bitstream, this is left-aligned, i.e. the MSB is set.)
(-------------------------------------------------------------------------)
bst pswm = 0;
(-------------------------------------------------------------------------)
(Similarly, there exists a mask for the LSW bits, only that it is always
right aligned.)
(-------------------------------------------------------------------------)
bst lswm = 0;
(=========================================================================)
(=========================================================================)
(Depending on the little or big endian scheme, read units might need to be
reversed. In any case does the following variable hold the actual unit such,
that its bits can be read from left to right.)
(-------------------------------------------------------------------------)
bst unit stream = 0;
(=========================================================================)
(=========================================================================)
(The lowest bit of the next symbol to be streamed out is held in this
variable. 'OOS' means 'Offset Out-Stream'. It starts with 32 for a new
outstream and is reduced by LSW prior to streaming a symbol out. After this
it may contain any value 0...31, depending on the LSW.)
(-------------------------------------------------------------------------)
bst oos = 0;
(=========================================================================)
(=========================================================================)
(The following variable counts the streamed out bits. Recall, that we
/always/ stream from left to right. This means, that the value of this
variable mod 32 is so many bits on the MSB side of the last unit, when 1
or larger. If mod 32 is 0, then thelast unit was completely consumed.)
(-------------------------------------------------------------------------)
bst bits = 0;
(=========================================================================)
(=========================================================================)
(In order to allow 'Bst Resume' to resume with the symbol following the
last found EOS symbol after a 'Bst Stream' or 'Bst Resume' call, the unit
in which it was found needs to be stored, and also the bit position of the
LSB of the EOS symbol.)
(-------------------------------------------------------------------------)
bst last eos unit = 0;
bst last eos bit = 0;
(-------------------------------------------------------------------------)
('Bst Resume' also allows to skip the first few symbols. This is of
interest when processing lists which are separated by two symbols, e.g.
CRLF, for which the first symbol, CR, has been declared to be the EOS
symbol: the second symbol LF then can be skipped rather than being inter-
preted as the first new symbol of the next stream. The number of symbols
to be skipped are passed to 'Bst Resume' via a register. It is saved into
this variable and decrements in here, as long as larger than 0.)
(-------------------------------------------------------------------------)
bst skip symbols = 0;
(=========================================================================)
(*****************************************************************************)
"workspace" (Convention: "bst ...", lower case)
(None)
(*****************************************************************************)
"programme" (Convention: "Bst ...", Mixed case)
(=========================================================================)
(This is a library only, it can not be run stand-alone. See "BST Test.txt"
for a test setup.)
(-------------------------------------------------------------------------)
end;
(=========================================================================)
(*****************************************************************************)
"Bst Init"
(=========================================================================)
(Initializes a bitstream environment by providing the neccessary input and
output format settings. This is used just once for a given input/output-
system. After this, use 'Bst Stream' for the first or only bitstream to be
created, then consecutive calls to 'Bst Stream' or 'Bst Resume' for new
inputs.)
(-------------------------------------------------------------------------)
(Input: A: Physical Symbol Width PSW of Input Data, in bits
B: Logical Symbol Width LSW of Input and Output data, in bits
C: End-of-symbols symbol, usually 0, but often also CR etc.
Output: -
NOTE: 1 <= A <= B.
Example: For 7-bit ASCII symbols coded in whole 32-bit units, pass:
A = 32; B = 7; if you want a bitstream as compact as possible
A = 32; B = 8; if the bitstream shall consists of easily
readable bytes
When reading ANSI or ASCII files, pass:
A = 8; B = 8; if the file is ANSI or ASCII coded and you
want the bitstream consist of bytes
A = 8; B = 7; if you know the file is pure ASCII and you want
the bitstream as compact as possible
A = 7; B = 7; if the file was streamed out in 7-bit ASCII,
which is quite rare
All registers unchanged.
Can cause errors. If fails, the error code is in [bst error].
)
(-------------------------------------------------------------------------)
C -->;
(-------------------------------------------------------------------------)
[bst psw] = A;
[bst lsw] = B;
[bst eos] = C;
(-------------------------------------------------------------------------)
(The PSW must be at least 1 bit and may not exceed 32 bits. Although, 1 bit
most likely does not make much sense as 1 of the 2 symbols must be declared
as the EOS symbol, leaving just the other as the only symbol. Anyhow, there
might be an application in which this makes sense, so this library does not
restrict it artificially just because I don't see it.)
? A <= 0 -> Bst Init Psw Invalid;
? A <= 32 -> Bst Init Psw Ok;
"Bst Init Psw Invalid"
[bst error] = BST ERR INVALID PSW;
-> Bst Init Exit Failed;
"Bst Init Psw Ok"
(-------------------------------------------------------------------------)
(The LSW must be at least 1 bit and may not be larger than the PSW.)
? B <= 0 -> Bst Init Lsw Invalid;
? B <= A -> Bst Init Lsw Ok;
"Bst Init Lsw Invalid"
[bst error] = BST ERR INVALID LSW;
-> Bst Init Exit Failed;
"Bst Init Lsw Ok"
(-------------------------------------------------------------------------)
(To extract a symbol from an input stream, we need a bitmask with so many
set bits as the PSW is wide. We need this bitmask /left/-aligned, i.e., the
masks MSBs are set, the LSBs cleared.)
(Note, that C < 32 would not work, as per Intel manual, only the 5 LSBs are
masked, and thus C < 32 equals C < 0. Because of this we need to test for a
PSW of 32 and treat it separately.)
C = 0;
? A = 32 -> Bst Init Psw is 32 bit;
(7bit 8bit 9bit 32bit )
C = 1; ( 1b 1b 1b )
C < A; (10000000b 100000000b 1000000000b )
"Bst Init Psw is 32 bit"
C -; (01111111b 011111111b 0111111111b FFFFFFFFh)
(Now we have a bitmask, but we need it left-aligned, i.e. the MSBs must be
set. Thus we shift left 32-PSW bits. To the same effect we can rotate PSW
bits to the right, though. In case of 32 bits, this rotates right 0 bits,
but in this case that provides the same result, i.e. no changes at all.)
C @> A; (FE000000h FF000000h FF800000h FFFFFFFFh)
[bst pswm] = C;
(-------------------------------------------------------------------------)
(To mask out a symbol, a right-aligned symbol mask of LSW bits is created.
It can be 32 bit as well, so we need to test for 32 bits individually once
again.)
C = 0;
? B = 32 -> Bst Init Lsw is 32 bit;
(7bit 8bit 9bit 32bit )
C = 1; ( 1b 1b 1b )
C < B; (10000000b 100000000b 1000000000b )
"Bst Init Lsw is 32 bit"
C -; (01111111b 011111111b 0111111111b FFFFFFFFh)
[bst lswm] = C; (always right-aligned)
(-------------------------------------------------------------------------)
(Everything was fine, streaming can begin now.)
[bst status] | BST STATUS INIT;
(-------------------------------------------------------------------------)
"Bst Init Exit OK"
<-- C;
[bst error] = BST ERR NONE;
end;
"Bst Init Exit Failed"
<-- C;
fail;
(=========================================================================)
(*****************************************************************************)
"Bst Stream"
(=========================================================================)
(Converts PSW wide symbols in units at E and following into a bitstream of
LSW wide symbols at unit D and following. Note, that the resulting bit-
stream /always/ is readable from left to right, i.e. MSB to LSB, over all
target units.)
(-------------------------------------------------------------------------)
(Input: E: 'Source', Address of input data
NOTE: The content /must/ be terminated by a PSW bits wide
EOS symbol. The routine continues to stream until such a
symbol was found.
D: 'Target', Address of output bitstream
**********************************************************
Version 1.1:
NOTE: The output area's units are NULLED before the bits
are streamed out. However, only as many units are nulled
as required for the bitstream: all units after the last
used one are left as they were, i.e. may contain garbage.
Check [bst bits] for the number of streamed bits.
{[bst bits] + 31} > 5 provides the number of used 32 bit
units.
**********************************************************
NOTE: If the area is not large enough, the output still
will continue to create the bitstream and overwrite what-
ever follows your too small area, most likely resulting in
a crash.
Output: [bst bits]: Number of streamed out bits, beginning at [D] in
its most significant bit.
All registers unchanged.
Can not cause errors. Exits with 'leave': Do not query 'ok' or 'failed'!
)
(-------------------------------------------------------------------------)
A -->; B -->; C -->; D -->; E -->;
(-------------------------------------------------------------------------)
(The offset of the out-stream is initialized with 32. This will be
reduced by LSW prior to each bitstream being streamed out into an address
at [D] and following.)
[bst oos] = 32; (No symbol ready yet to stream out)
[bst bits] = 0; (No bits streamed out yet)
(*************************************************************************)
(Version 1.1 clears the output area now and does not expect a previously
cleaned area. The first bitstream unit is nulled. As soon as we need
another units, it will be nulled as well. Don't expect the units beyond the
last used one to be nulled: it may contain any value.)
[D] = 0; (Clear the first bitstream unit)
(*************************************************************************)
(-------------------------------------------------------------------------)
"Bst Stream New Unit"
B = 32; (B will have lowest bit of symbol)
A = [bst pswm]; (Left-aligned PSW bitmask)
C = [E]; (Get the actual unit)
(If the unit is little endian, such as Intel, then we need to revert the
bytes within the unit so, that 'DCBA' becomes 'ABCD'. Little endian is the
default: If reading Big Endian files, you need to set an according flag
immediately /after/ the call to 'Bst Init', prior any call to 'Bst Stream'.
This is true also when processing Linoleum string variables. The needed
instruction is: '[bst status] | BST BIG ENDIAN'.)
? [bst status] + BST BIG ENDIAN -> Bst Stream New Unit Big Endian;
=> Bst Reverse Bytes; (In C, bytes 'DCBA' reversed to 'ABCD')
"Bst Stream New Unit Big Endian"
[bst unit stream] = C; (Units readable left to right)
(-------------------------------------------------------------------------)
"Bst Stream New Symbol"
C = [bst unit stream]; (The complete unit stream)
B - [bst psw]; (Lowest bit of symbol in unit stream)
C > B; (Right-align the symbol)
(-------------------------------------------------------------------------)
"Bst Stream Extract Symbol"
C & [bst lswm]; (Eliminate garbage in upper bits)
(If the extracted symbol is the EOS symbol, then we are done and leave the
stream-in loop.)
? C = [bst eos] -> Bst Stream Eos Read;
(-------------------------------------------------------------------------)
(*************************************************************************)
(Version 1.1: The new routine 'Bst Resume' allows to indicate a number of
prefix symbols which need not be streamed out, but skipped altogether. In
'Bst Resume', the number of symbols was stored into [bst skip symbols]. As
long as it is not 0, the symbol is not streamed out.)
? [bst skip symbols] = 0 -> Bst Stream Out;
(The actual symbol is /not/ streamed out.)
[bst skip symbols] -;
-> Bst Stream Resume;
"Bst Stream Out"
(*************************************************************************)
(In C's LSW least significant bits is the symbol. It can be streamed out
now into the target area at [D].)
? [bst oos] >= [bst lsw] -> Bst Stream Same Out Unit;
(Only part of the symbol in C, or even no part of it at all, fits into the
actual target address. We need the bits which fit in.)
? [bst oos] = 0 -> Bst Stream Next Unit;
C -->; (Still need the symbol)
C < [bst oos]; C > [bst lsw]; (Fitting-in part is right-aligned)
[D] | C; (Merge MSBs of symbol into target)
<-- C; (Restore complete original symbol)
"Bst Stream Next Unit"
D +; (Next output bits into next unit)
(*************************************************************************)
(Version 1.1 explicitely clears the output area now and does not expect a
previously nulled area anymore.)
[D] = 0; (Clear the new bitstream unit)
(*************************************************************************)
[bst oos] + 32; (This will be reduced as we go on)
"Bst Stream Same Out Unit"
[bst oos] - [bst lsw]; (Lowest bit to merge into target)
C < [bst oos]; (Shift symbol's bits into position)
[D] | C; (Merge LSBs of symbol into target)
[bst bits] + [bst lsw]; (Count the streamed out bits)
(-------------------------------------------------------------------------)
(*************************************************************************)
(Version 1.1: This label is jumped to from the routine 'Bst Resume' after
doing the required synchronization tasks to allow a resumed processing with
the symbol following the last EOS symbol, at an arbitrary position within
the current unit. It is also jumped to from within this routine, when we
skip one or more input symbols as per argument to 'Bst Resume'.)
"Bst Stream Resume"
(*************************************************************************)
(A has the originally left-aligned PSW bitmask. It is shifted into position
now for the next symbol of PSW bits. If the PSW is 32, then this would
result in a shift by 0, because shifts work modulo wordsize. This would be
wrong, but because then also B would be 0, we will fetch a new unit anyway,
and along with it B will be re-initialized. Because of this, no special
measures are needed to adjust A.)
A > [bst psw];
(If B, the lowest bit of the streamed-out symbol, is larger than or equal
to PSW, then there are more complete symbols in the actual unit. Get the
next one.)
? B >= [bst psw] -> Bst Stream New Symbol;
(If B is larger than 0, but smaller than PSW, then we have a symbol which
is distributed in two adjacent units. This happens with odd streams like
24- or 7-bit bitstreams and requires a special treatment to join the two
parts from the adjacent units into a single symbol.)
? B > 0 -> Bst Stream Split Symbol;
(When we arrive here, B is 0, i.e. we just have used up all bits of the
actual unit. We proceed one unit and process it as above.)
E +; (Next input unit)
-> Bst Stream New Unit;
(-------------------------------------------------------------------------)
(This handles a symbol which is distributed in two adjacent units. This
happens with odd streams like 24- or 7-bit bitstreams, but not with symbol
widths which divide 32 without any reminder, i.e. any symbol width which is
a power of 2. The actual unit's LSBs then contain the next symbol's MSBs,
and the next unit's MSBs the next symbol's LSBs.)
"Bst Stream Split Symbol"
B -->; D -->; (Need two additional registers)
C = [bst unit stream]; (Contains symbol's MSBs in its LSBs)
A = [bst psw]; A - B; (So many bits of symbol in next unit)
C < A; (Shift symbol's MSBs to final position)
(Now the next unit is needed. As before, depending on the endianness, its
bytes possibly must be reversed. We want that unit in another register than
C, because we still need C, as it contains the symbol's MSB already. We use
register B. However, since 'Bst Reverse Bytes' works on C, we need to save
it temporarily.)
C -->;
E +; (Address of next unit)
C = [E]; (Get the unit with the symbol's LSBs)
? [bst status] + BST BIG ENDIAN -> Bst Stream Split Unit Big Endian;
=> Bst Reverse Bytes; (In C, bytes 'DCBA' reversed to 'ABCD')
"Bst Stream Split Unit Big Endian"
[bst unit stream] = C; (Units readable left to right)
B = C; (Bring to B)
<-- C;
(In B's MSB are the symbol's LSB. The number of bits is in A. We right-
align these bits.)
B <@ A; (Symbol's LSBs in B's LSBs)
(All bits except the right-aligned A bits are garbage and need to be
masked out. The mask is created in D. A can not be 32, so no precautions
are needed: if A would contain 32, we would have no split symbols and would
not be here to process such.)
D = 1; D < A; D -; (Mask for the symbol's LSBs)
B & D; (Symbol's LSBs extracted now)
C | B; (C contains whole symbol in its LSBs)
(The PSW shift mask must be put back in place to resume processing with the
next symbol. It must be in A, and it must be shifted right as many bits as
we processed already from the new unit.)
B = A; (Processed bits of new unit)
A = [bst pswm]; (The left-aligned PSW bitmask)
A > B; (Shift in position for next symbol)
D <--; B <--; (Restore original registers)
(Also B, the lowest bit of the symbol being processed must be adjusted.)
B + 32; (The bit in the next unit)
B - [bst psw]; (LSB of symbol in next unit)
(Now C contains the right-aligned symbol in its PSW least significant bits.
The upper bits contain some garbage. Resuming at 'Bst Stream Extract Symbol'
will apply the LSW mask on C and then stream out the symbol.)
-> Bst Stream Extract Symbol;
(-------------------------------------------------------------------------)
"Bst Stream Eos Read"
(*************************************************************************)
(Version 1.1: In order to allow 'Bst Resume' to resume with the symbol
following the last found EOS symbol after a 'Bst Stream' or 'Bst Resume'
call, the unit in which it was found needs to be stored, and also the bit
position of the LSB of the EOS symbol.)
[bst last eos unit] = E; (Unit in which EOS was found)
[bst last eos bit] = B; (LSB of EOS symbol)
(*************************************************************************)
(Note, that bitstreams don't end with any special symbol. Instead, the
number of bits is counted into [bst bits]. However, if you for whatever
reason /need/ such an EOS symbol even in bistreams, then this would be the
right place to code a such. Careful: Depending on the LSW, your symbol may
require a split into two units. For the technique, see above code starting
below the label 'Bst Stream Out'.)
(-------------------------------------------------------------------------)
"Bst Stream Exit OK"
<-- E; <-- D; <-- C; <-- B; <-- A;
leave;
(=========================================================================)
(*****************************************************************************)
"Bst Resume"
(=========================================================================)
(Added in version 1.1. The routine 'Bst Stream' assumes, that the first
symbol can be found in the MSBs or LSBs of the provided unit, depending on
the endianness. Sometimes this is inconvenient, though, for example when
the input is ANSI or ASCII and the first symbol to be streamed in is at a
byte boundary or, more cumbersome, at any arbitrary position within the
unit. In such cases, use this routine. It continues to read the input
stream from the position after the last encountered EOS symbol, and treats
that symbol as the first symbol of the new bitstream. However, even this
may be inconvenient at times, for example in the situation, when several
input strings are separated by a CR+LF sequence, of which the first, CR,
was defined to be the EOS symbol. In this situation you don't want to
resume with the next symbol, which would be LF, but you want to skip it.
The number of /symbols/ to be skipped can be given in C: in aforementioned
example you would use 1. Usually, you would use 0 here, though.)
(-------------------------------------------------------------------------)
(Input: C: Symbols to skip after the last EOS symbol, before starting
to stream out. This skips /output symbols/, not input
symbols: Should the position from which we resume to stream
in start with an EOS symbol, then it would /not/ be skipped
and the procedure would immediately terminate with a total
of 0 streamed out bits. Usually, this is 0 = stream out
from first symbol after last EOS symbol. Further notes see
in this routine's header above.
D: 'Target', Address of output bitstream
NOTE: The output area's units are NULLED before the bits
are streamed out. However, only as many units are nulled
as required for the bitstream: all units after the last
used one are left as they were, i.e. may contain garbage.
Check [bst bits] for the number of streamed bits.
{[bst bits] + 31} > 5 provides the number of used 32 bit
units.
NOTE: If the area is not large enough, the output still
will continue to create the bitstream and overwrite what-
ever follows your too small area, most likely resulting in
a crash.
Output: [bst bits]: Number of streamed out bits, beginning at [D] in
its most significant bit.
All registers unchanged.
Can not cause errors. This routine will jump into 'Bst Stream' after the
neccessary symchronization tasks. There it exits with 'leave': Do not
query 'ok' or 'failed'!
)
(-------------------------------------------------------------------------)
(These are the same register saves as in 'Bst Stream'. The routine will
jump to 'Bst Stream' and terminate there. This requires same stack
arrangement.)
A -->; B -->; C -->; D -->; E -->;
(-------------------------------------------------------------------------)
(Effectively skipping the symbols is done in 'Bst Stream'. Here we just
initialize the counter variable.)
[bst skip symbols] = C;
(-------------------------------------------------------------------------)
(The offset of the out-stream is initialized with 32. This will be
reduced by LSW prior to each bitstream being streamed out into an address
at [D] and following.)
[bst oos] = 32; (No symbol ready yet to stream out)
[bst bits] = 0; (No bits streamed out yet)
(-------------------------------------------------------------------------)
(Clear the first bitstream unit)
[D] = 0;
(-------------------------------------------------------------------------)
(Any previous call to 'Bst Stream' or 'Bst Resume' stored the position,
at which the last EOS symbol was found, into two variables. The address in
which the EOS occured was stored in [bst last eos unit], and the LSB of
the EOS symbol in [bst last eos bit]. We load the appropriate registers
with those values now, as if we just processed the EOS symbol as a valid
symbol and are about to process the next symbol.)
E = [bst last eos unit]; (Unit in which EOS was found)
B = [bst last eos bit]; (LSB of EOS symbol)
(-------------------------------------------------------------------------)
(The actual unit must be present when we jump into 'Bst Stream'. If needed,
it must be brought to big endian. It is expected in register C.)
C = [E]; (Get the actual unit)
? [bst status] + BST BIG ENDIAN -> Bst Resume Big Endian;
=> Bst Reverse Bytes; (In C, bytes 'DCBA' reversed to 'ABCD')
"Bst Resume Big Endian"
(-------------------------------------------------------------------------)
(The PSW mask was left in a position suitable to extract the EOS last
time. It is expected in A.)
A = [bst pswm];
(-------------------------------------------------------------------------)
(These are all informations we need. We can continue to stream in now as
if we never did something else.)
-> Bst Stream Resume;
(-------------------------------------------------------------------------)
("Bst Resume Exit OK"
This routine jumps into 'Bst Stream' and finishes there. It has no
own exit labels.
)
(=========================================================================)
(*****************************************************************************)
"Bst Reverse Bytes"
(=========================================================================)
(Internal routine. Reverses the bytes in C such, that 'DCBA' becomes 'ABCD'.
This reversion is needed for Small Endian processors such as 'Intel' to get
a bitstream readable from left to right.)
(-------------------------------------------------------------------------)
(Input: C: Bits 'DDDDDDDDCCCCCCCCBBBBBBBBAAAAAAAA'
Output: C: Bits 'AAAAAAAABBBBBBBBCCCCCCCCDDDDDDDD'
All registers except C unchanged.
Can not cause errors. Exits with 'leave': Do not query 'ok' or 'failed'!
)
(-------------------------------------------------------------------------)
A -->;
(-------------------------------------------------------------------------)
('DCBA' are the four bytes in a unit)
A = C; (A: 'DCBA')
A @> 8; (A: 'ADCB')
C <@ 8; (C: 'CBAD')
A & FF 00 FF 00h; (A: 'A-C-')
C & 00 FF 00 FFh; (C: '-B-D')
C | A; (C: 'ABCD')
(Although 6 instructions are not that bad, if you can think of a faster
method to revert the bytes within a unit, drop me a note: I like efficient
bit manipulation techniques. Thank you.)
(-------------------------------------------------------------------------)
"Bst Reverse Bytes Exit OK"
<-- A;
leave;
(=========================================================================)
(*****************************************************************************)
Linoleum Syntax Highlighting produced with LSH (© 2007 by Herbert Glarner)