Mathematical Formula Decomposer


URI:

http://herbert.gandraxa.com/herbert/mfd.asp

Link template:   

<a href="http://herbert.gandraxa.com/herbert/mfd.asp">Mathematical Formula Decomposer</a>


Link symbols:   

Local LinkOn current page | DocumentOn this site | External PageOn external site | WikipediaWikipedia article | Compressed ArchiveZIP archive | PDF documentPDF | E-MailE-Mail


Article

Organization

DocumentHome » DocumentOnline Tools » Mathematical Formula Decomposer

Scope

This page describes a technique to decompose a mathematical formula into its components, consisting of single operations in correct order. It lets you enter a mathematical formula (linearily like you would enter it into an algebraic calculator), and shows each parsing step using your input as an example.

If you find any bugs, please let me know.

Author

DocumentHerbert Glarner

Credits

To display mathematics, this page uses External SitejsMath from Davide P. Cervone, Department of Mathematics, Union College at Schenectady, New York. Details about jsMath see in the control panel, located in the bottom right corner of this page.

Published

2008-May-07

2008-May-09 (update), fixed a bug detected by Stefan Clementz: π and e were treated as numbers too early. This led to the situation, that e.g. the expression π/e was retranslated as the number 1, only considering its sign. Similarly, -e·π was retranslated as -1. This is fixed now. Thanks, Stefan.


Introduction

Approach

In computing, it is quite often the case that we need to "understand" a given formula, i.e. comprehend of which individual basic operations it consists, before we can go on and do something meaningful with the formula (for example differentiating it).

A process which is able to do that oftentimes is called a parser. A parser usually has the notion of analyzing an expression syntactically or lexically. However, our approach demonstrated here actually decomposes the formula in such a way, that we are able to process it bottom-up or top-down, whichever comes in more handy or is more suitable for a specific problem. Consequently we call the process «Mathematical Formula Decomposing» and shall refer to it as MFD in the remainder of this document.

Because we believe that many things are much clearer when one can work with own examples, this page does not only include a full script of an MFD, but also lets you enter formulae linearily and study the results directly beneath the text describing a particular step.

However, before doing that, we need to provide a proper framework, under and with which our MFD shall work as intended. The next subsections will outline a such.

Providing Formulae

As always, the first step is to have something to work with, in our case a mathematical formula, entered linearily. "Linearily" in this context means, that the formula is provided in such a way like you would enter it in many computer languages or algebraic calculators. Examples of such formulae include:

2*x+sqrt(3*a)
sin(cos(tan(x))))
(sin(x))^2

However, our MFD shall be able to go a step further, in that we also accept mathematical expressions with implicit operators and rules, as shown in the following examples:

2x+sqrt(3a) implicit multiplication between 2 and x, and between 3 and a.
sin cos tan x When no parantheses are given, unary functions always work on the next term on which they can be applied, hence tan on x, but cos on tan x and sin on cos(tan x). The other way round, e.g. sin on tan, can not be applied directly, because tan was not worked out yet.
sin x^2 Unary functions are priorised over exponentiation, so in the example to the left sin works on x, and exponentiation on sin x. However, because this looks dangerously ambigeous, mathematicians usually prefer the notation sin2x before sin x2. Notice, however, that in both cases the same is meant. If we would want to exponentiate x before applying the function to it, we would need to use parantheses and write sin(x^2) to mean sin(x2).

Determining the Rules

Priority Classes

By mathematical wisdom accumulated over the recent millenia, we pass on the following conventions and rules by which we want our MFD to comply:

Evaluation Directions

Now that we have defined operative priorities, we need to settle the potential conflicts within the same priority class. It is true, that as long as an expression has different priorities only, everything is allright and we marrily can go along. However, things look different when in an expression same priorities clash, and unfortunately the conventions were not always set up in a consistent way.

For example, look at the operations of class 1. Whereas additions are commutative (e.g.: 2+3+5 can be calculated from left to right, i.e. (2+3)+5, and also from right to left, i.e. 2+(3+5), and nevertheless both yield the same result), the same is not true for subtractions (i.e. 5-3-2 must be calculated from left to right, i.e. (5-3)-2, yielding 0, and not from right to left, i.e. 5-(3-2) yielding 4 and thus a different (and wrong) result).

So, because we need to obey direction for subtractions, and have a don't-care for additions, the most sensible rule for class 1 would be: process from left to right.

It turns out, that exactly the same is true for the class 2 operators. Multiplications are commutative, but divisions are not (e.g.: consider the expression 24/4/2. Now, from left to right, we have (24/4)/2=3, but from right to left we obtain wrongly 24/(4/2)=12). So, also for class 2 the rule must be: process from left to right.

Unfortunately, we can not extend this any further to class 3. Mathematicians have set forth (somewhat arbitrarily, perhaps?) the convention, that "stacked powers" are evaluated top-down, or, in terms of linear formulae, from right to left. This is important, because exponentiation is not commutative either. E.g. 5^2^3 must be evaluated from right to left as 5^(2^3)=5^8=390625, and not from left to right as (5^2)^3=25^3=15625. Hence the proper rule for class 3 operations is to process them from right to left.

Unary functions of class 4 in one sense are the least complicated, as we only have to deal with one operand, which usually is the one to the right of a linear formula (e.g.: simply the x in the function sin x, but also the content of a possibly complex expression, if enclosed in parantheses, e.g.: everything in the parantheses of the expression sin(x^2+tan x/2) ). because of this the issue of a priorisation is largely obsolete. However, as with the other classes 1 to 3 we have the possibility to nest functions into each other. An example involving classes 1 to 3 would be: x^(4*(2+3)). As long as this is done explicitely with parantheses, priorities are apparent immediately, e.g. in sin(cos x) one always will evaluate the parantheses first, i.e. cos x, before applying sin to the result. However, with this class we have the possibility to write implicitely nested formulae, e.g. sin cos x. Here the only possibility leading to a practical result is to priorize cos x as well, because one can not apply sin on a per se meaningless cos (it doesn't get a meaning before it is applied to something). But this implies also the way in which we have to process this class of unary functions, and this is from right to left.

Above justifications provide the following two rules in concise form:

Defining the Input

Build of Formula Elements

Now a rather dry section: we need to define exactly, what we want to accept as a valid input to the MFD. Without such a definition, we would not get very far in our attempt to build a decomposing machine.

Because there are different elements, which do work together in different ways, we will look at them individually. In this process we will name these elements, and these names shall have the status as definitions for the remainder of the document (and for that matter, for all references to our MFD).

It may or may not come as a surprise, but the notion number in its informal everyday meaning in reality is a truely complex animal with many facets. Nevertheless we must have a very thorough understanding of how it may look like, or we will fail in recognizing a number when we encounter a such.

Digit. A digit is any of the numerical symbols 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

Digit Sequence. A Digit Sequence consists of at least 1 digit, or the special numbers e or p (Latinized from the Greek letter π, which usually is not readily available as an input in online forms). (Note, that no negation or paranthesation is defined here yet.)

Negator. The negation sign -. (The negation sign in fact is the factor -1 applied to the immediately following expression, e.g. -3 is shorthand for (-1)*3.)

Number. A number is a Digit Sequence, which may be preceded by a Negator, all optionally in parantheses. It may be preceded by a Negator, and if so, it may optionally be in parantheses again. Examples of valid Numbers as per this definition: 0, e, 144, (144), -321, -(321), (-p), and in the most complex form allowed: (-(-321)).

Divider. The division operator /.

(Fraction.) A fraction consists of two Number elements, separated by a Divider. This sequence may optionally be in parantheses, in which case the whole construction can be preceded by a negator, in which case the negated construction can be in parantheses once more. Examples of valid Fractions in their simplest forms are: 1/2, -1/-2, (-(-1))/(-(-2)), and an example of the most complex form would be (-((-(-p))/(-(-e)))).

Numeral. The Fraction above was introduced merely as a predefinition for the constructions which we really want to accept: the Numerals. A Numeral is defined as a Fraction as above, but with the inner elements sequence Divider plus the Number following it being optional. If the optional sequence is omitted, then we in essence are back to a simple Number, and if it is used, we have a Fraction. Hence for examples see above in Number and Fraction, as all of them are Numerals by this definition.

Coefficient. With the above, purely numerical expressions can be entered in every needed form, with the exception of a mere negation of a following non-numerical argument, as for example in -x. For this reason, we define an additional element, allowing for exactly an isolated Negator, and also for a full-fledged Numeral. Because such a construct does not make sense in other instances than as a Coefficient to something else, we call it that way.

Variable. Variables are another dominant feature of formulae. To facilitate our future machine's life, we define, that every lower-case character of the Latin alphabet, with the exception of those letters representing special numbers, are variables. This gives us 24 variables (26 letters, minus the special numbers e and p). Examples of valid Variables are: a, n, x.

Function Name. An unary function is defined as consisting of a sequence of alphabetic letters in predefined ways. The exact match of such letters is important, because cos refers to the function name cosine, but the only slightly modified cor means an (implicit) multiplication of the three variables c, o and r. ("Safe" characters not possibly interfering unintentionally with function names are b, d, f, g, h, j, k, m, u, v, w, x, y, and z.) Our MFD shall be able to recognize the following unary functions: arccos, arccot, arccsc, arcsec, arcsin, arctan, sqrt, cos, cot, csc, sec, sin, tan, and ln, all of them in prefix notation, i.e. having exactly one following argument.

Operator. The last group is formed by the binary operator signs. Our machine shall support the 5 binary operators +, -, *, / and ^.

With above definitions we are able to make the first step in identifying any formula's components, as we shall see in the next section.

Elements in BNF Notation

For readers familiar with the concise Wikipedia ArticleBNF metasyntax, here it is:

Digit       := "0"..."9"

Negator     := "-"

Divider     := "/"

DigitSeq    := Digit+ | "e" | "p"

Number      :=   (     [Negator] ( DigitSeq | ( "(" [Negator] DigitSeq ")" ))     ) 
               | ( "("  Negator  ( DigitSeq | ( "(" [Negator] DigitSeq ")" )) ")" )

Numeral     :=   (                   Number [ Divider Number ]         )
               | (     [Negator] "(" Number [ Divider Number ] ")"     )
               | ( "("  Negator  "(" Number [ Divider Number ] ")" ")" )

Coefficient := Numeral | Negator

Variable    := "a"..."d" | "f"..."o" | "q"..."z"

Function    := "arccos" | "arccot" | "arccsc" | "arcsec" | "arcsin" | "arctan
               | "sqrt" | "cos" | "cot" | "csc" | "sec" | "sin" | "tan" | "ln"

Operator    := "+" | Negator | "*" | Divider | "^"

Building the MFD Machine

Your Input

Enter a mathematical formula using expressions defined in the last sections, or use one of the examples provided beneath the input box, then click on the button.

(Note, that the input is restricted to 200 characters.)

Predefined examples:

(Own input)
e^(2*x^2^3*a*2)
sin cos tan x
((((-(3/5))*-sin ((-3/5)x) )+-(-3/5))/((-3/5)(x+(-3/5))))^-x^((-3/5)x)
x^(a*(-1/2))^(-1/3)*tanx/(ln((-1/4)+x/(a+(-1/5)-sin(x/(-1/6))+sin(cos(-1/7))))+1/8)
 

Error 1: There was no input at all. You might want to check out one of the examples. To do so, just select any of the radio buttons with the exception of the first one, then click on the button. – Alternatively, enter your own formula into the text box.

Atomizing the Formula

The first task is to reduce the input in such a way, that we are left with the smallest possible identifiable elements, as defined in the subsection Link on same pageDefining the Input above.

This reduction happens such, that we use just one distinct symbol for each of these elements. These symbols are stripped from any meaning other than being a placeholder for any possible representation of the element in question. To distinguish these symbols from their counterparts with a concrete meaning, we will call them Atoms from now on.

We also want to make sure, that the input formula complies with some basic semantical rules.

For example, it is non-sensical to have two or more level 2 or level 3 operators in a sequence, because no meaning is assigned [1] to such constructs like, e.g.: **, */, *^ or ^/.

On the other hand, a meaning certainly is assigned to level 1 operators in a sequence, because the ones closer to the following operand can be regarded as the operand's sign, which is either - or +. Mixing level 2/level 3 and level 1 operators are allowed to a certain degree, namely insofar that a single level 2 or 3 operator, followed by any number of level 1 operators acting as the following operand's sign, is allowed.

To enforce a proper operation level assignment later on, it is a good idea to add (additional) brackets [ and ] around any negated term. To what extent a negated term is included depends largely on what follows after the immediate term after the negation sign. Technically, an exponentiation or a multiplication/division after the first term both are included in the negation. However, because [-A*B] is equivalent to [-A]*B, we only need to check for exponentiation (since [-A^B] is not equivalent to [-A]^B).

Thus, in concise form, the rules to which the input formula must adhere are the following:

[1] To be specific, no meaning is assigned in our context. However, a meaning can be assigned in other contexts. For instance, there are programming languages in which the sequence ** bears the notion Exponentiation, etc.

We define the following Atoms:

Atom Symbol Remarks
Coefficient # Note, that this Atom defines Coefficients, and hence does include the stand-alone negation sign before variables and function, e.g.: -x. Negation signs before a numeral were sublimated already into the numeral.
Variable @
Class 1 Binary Operators + This Atom includes the binary operators + and -.
Class 2 Binary Operators * This Atom includes the binary operators * and /.
Class 3 Binary Operators ^
Class 4 Unary Operators (Function names) ~
Parantheses ( and )
Brackets [ and ] These atoms are inserted artificially to group negated terms. They do not have corresponding characters in the originally input formula.

We can reduce any formula to the atoms in this table, and the process doing this shall be called "atomizing the formula".

Note, that whereas any formula can be atomized, this does not automatically warrant validity. The result will be valid only then, when the whole string consists of nothing else then valid atom symbols.

If you provided a formula in the above form, the script on this page will attempt an atomization for it now, and also tell you something about its validity.

Thorough Semantical Validation

Although some basic semantical rules were implicitely applied in the previous subsection, no actual tests were performed; for example was rule 8 not cosidered at all. We need to gain some more confidence, that we indeed deal with a semantically correct mathematical formula.

One requirement is to make sure, that for every opening paranthese a matching closing one exists, and vice-versa.

Note, that the emphasis on matching means, that not only the number of opening and closing parantheses shall be considered, but also their positional meaning. For instance would both the sequences "(())" and "()()" be perfectly wellformed, but the same would not be true for ")()(" or "(()". A simple strategy to test this is by scanning the whole string and counting an opening paranthese as +1 and a closing one as -1. Should the sum of these values fall below 0, then an opening paranthese is missing. Should the sum in the end be above 0, then a closing paranthese is missing.

It also is an important question family, what atoms are allowed to follow each other.

For example, an unary level 4 function is applied to exactly one argument to its right. This argument can be a coefficient, a variable, an expression in parantheses, or a function again.

The whole question family is examined thoroughly in the following table. In the first column we find the atom, for which the rules about the following characters are set forth in the columns to the right. An entry in the table cell indicates, that the two atoms may follow each other.

Maybe somewhat surprisingly it turns out, that we have only two different continuation models, i.e. all atoms are in one of two choice sets.

Atoms Following Atom
# @ + * ^ ~ ( ) [ ] End
# @ ) ] 1 1 x x x 1 1 x 1 x x
+ * ^ ~ ( [ x x x x x

x: This sequence is allowed. – 1: This sequence is allowed, but it implies a multiplication in between the two atoms.

Hence the following rules can be defined:

Because rule 10 does not allow for a sensible test, it may seem to be superfluous to even state the rule. After all, when everything is allowed as a continuation, why even bother to test anything? – The important part lies in the rule's second half, namely in the statement about the implicit multiplication. This is an important step in understanding the formula. In fact, it is so important, that we provide a distinct atom symbol ' (apostrophe) for implicit multiplications, and insert this atom at the proper places into the atom string.

The atom ' is a class 2 operator like *, but still is warranted an own symbol, because there is no corresponding character in the formula. Later, when we have progressed so far that the whole formula is "understood" by program code, we will need to reassemble the formula, using the atoms string as the "controller". It is then, when the choice of a distinct symbol pays off.

If you provided a formula, and that formula turned out in the previous test to consist of valid atoms only, then the script on this page will check for above semantical rules now.

Establishing an Order of Operations

In this step we structure the atoms string such, that an order of operations is established. Initially, we are not yet interested in the priorisation of operations, but in resolving levels imposed by parantheses.

To indicate this initial order, we can attach an "operations level" to any atom of the atoms string. We start out with a level of 1, signifying lowest priority, and then increase and decrease the actual level for each atom as appropriate by the following rules:

While creating the order of operations string, we keep track of the highest used level: we will need this value in the next step. We also count the number of effectively registered atoms, i.e. all atom symbols with the exception of the parantheses symbols ( and ).

Note, that the paranthese atoms ( and ) are not needed anymore in the atoms string, from a technical view, and thus could be discarded from the atoms string altogether. However, it is a good idea to still retain them, because they will be of great help in the next subsection, when we need to parse through the input string again, with the atoms string as a controller. – The brackets [ and ] are effectively not needed anymore, as they do have no corresponding characters in the originally input formula.

If you provided a formula, and that formula passed semantical validation, then the script on this page will determine the order of operations now as per rules set out above.

To better visualize the levels, the script converts the internally used levels 1, 2, 3... etc. into the letters A, B, C etc. Unregistered parantheses (internally level 0) are displayed with a . (period). Such a conversion contributes much to a better legibility, especially when there are 10 or more levels. However, legibility suffers for higher levels, if there are more than 26, making necessary a rendering beyond Z. Such levels are indicated by a + sign. Fortunately, these cases arise only in very complicated formulae with very deep nesting levels. Also note, that internally up to 256 levels could be stored, but with a restricted input length of 200 characters such depths can not possibly be attained.

Creating Process Tables

Atom Records

We know much about the formula by now. In particular we have all needed data to work with the individual atoms, except the true values of numeral atoms and variables, represented by # and @, resp.

In order to do also this last step, we need to parse the input string, with the atoms string as a "controller". This means, that we actually iterate through the atoms string, and attempt to fetch the according sequence from the actual location in the input string.

This is not really a difficult task for all atoms, with the exception of the numerals #. These can be of pretty complex formatting; in essence we need a BNF parser to extract valid numerals from the input string.

For all atoms, with the exception of the paranthese atoms ( and ) on the one hand, and the bracket atoms [ and ] on the other hand, there will be a record containing information about that particular atom.

The role of the paranthese atoms ( and ) is restricted to indicate extraction of the according symbol in the input string, and with such to keep the input string synchronized with the controlling atoms string. This is not the case for the bracket atoms [ and ]. However, both the parantheses as well as the brackets provide level information, which was sublimated already into the Order of Operations string. Consequently, there is no need anymore to keep any information about parantheses and brackets and hence there will be no record for them: they are discarded as redundant.

An atoms record is indexed by an integer starting with 0 and incremented in steps of 1. It consists of the following fields:

In the remainder of this document, we shall display any atom record as follows:

Index  Atom Order of Operation Position Name Value of Numeral Molecule
Sign Numerator Denominator

Or for short:

Index  Atom Order of Operation Position Name Sign Numerator Denominator Molecule

If you provided a formula, and that formula passed semantical validation, then the script on this page will create and display the atom records now as per rules set out above.

Priority Table and Priority List

In parallel to the creation of a record for each atom, a priority table is created.

The priority table has one record (row) per Order of Operations level, and a field (column) for each of the operator priority classes 1 to 4. Entries in the fields consist of comma-delimited enumerations of atom record indices for those atom records representing operations, in the order they need to be processed.

Note, that this means, that the atom records representing variables and numerals do not appear as entries in the comma-separated fields of the priority table. – It also means, that the indices are not necessarily collected in the order they appeared when the atom records were created (that is, in ascending order, starting with 0). Instead, they will be in reverse (descending) order for class 3 exponentiations and class 4 functions, because those need to be processed from right to left, according to rule 5.

As soon as the table has been created, we can concatenate its fields into a single priority list, by working through the table bottom-up (highest order of operation first) and within each order of operation from right to left (highest operation class first). Within each entry, the priorities are already in the correct order, and so the whole content of a cell can be concatenated as a whole.

Molecularizing the Formula

Building the Molecules

By now we have identified all operations and the order in which they need to be applied to their "environment" in order to comply with the rules set out in above sections.

What remains is to effectively apply the identified operations to our atoms, i.e. to state uncompound operations with their operator and the two involved operands, in case of a binary operator, or its only operand, in case of an unary function.

This process involves grouping of two or three elements into a compound structure. In analogy to the term Atomization we call this process Molecularization.

Note, that the use of the term elements is intentional here: usually, only in the minority of all cases we do effectively work on two or three Atoms directly; as we progress, more and more already molecularized atoms, i.e. Molecules, will become involved as the operands of operators. The term element does include both Atoms and Molecules.

With unary class 4 functions, only one operand is involved (which, however, can be a parantheses group). This operand is following the function name. Such an operand always also exists with binary class 1 to 3 operators. Because the operand follows the operator, we call it the postfix operand. Binary operators always involve two operands, though, the other one preceding the operator. Such operands do not appear in unary class 4 functions. We call them prefix operands.

With this step it is time to retranslate the atom symbols # into their effective number representation in order to display a proper formula component (a molecule). Note, that this number may differ from what was entered originally, i.e. it may undergo a little simplification. For example would the (per se perfectly legit) expression -(-(-3/-5)) be represented by the equivalent but simpler 3/5.

However, the decomposer is not a calculator, hence it will neither simplify fractions from 6/10 into 3/5, nor will it recognize that 6/3 or even 3/3 are integers and don't require representation as a fraction, etc.

The whole molecule, i.e. the operator including its operand(s), is then typeset into TeX. For example will a term like sin 3/5 be typeset as sin {3 \over 5}. The latter expression is called the Molecule Body. Every Molecule Body has an internal Molecule Name, which is an index starting with 0. In our tables, the numbers are rendered with the upper-case letters A, B, C... (which may become awkward to read for definition numbers greater than 26).

From the second molecule body on, nesting may take place, and as the script iterates through the priority list, it will make more and more references to already named molecules, i.e. refer to whole already defined terms by their name as the involved operands. For example will an input formula sin(cos p) define a first molecule named A with the molecule content cos p, and subsequently make use of it in the next molecule named B by referring to it as sin A.

Above paragraphs may have reminded you, that the atoms table had a so far unused final column labelled Molecule. It is during this Molecularization, that we fill in the actual molecule name into that column, for the involved operator as well as its one or two associated operands. Note, however, that "operand" can involve many atoms as we progress through the table, and ultimately the last (topmost) molecule will comprise all atoms. Hence, this column is only of use while internal processing takes place; its content both before and after processing is absolutely predictable and of no particular interest in both cases.

Only a few typesetting rules are applied by the script on this page:

Typesetting the Molecules

We have achieved the goal of decomposing the input formula. What remains is to typeset it properly. A very pleasing tool is provided by Davide P. Cervone (see Link on same pageCredits). Should the formula have passed all validations, the components' formulae will appear beneath now.