Using nested intervals with rational endpoints to model tree structures.

What is a tree again?

 - a finite set of nodes
 - a root node
 - each node except the root node has a parent node
 - the root has no parent
 - child nodes are ordered

A tree can be represented as a diagram with the root node
at the top, and arrows leading down to the child nodes,
which are ordered left to right.

The order is important in this discussion. For example, these two trees
are different:-

     1        1
    / \      / \
   2   3    3   2

Nodes that have the same parents are called siblings. So in the trees
above, 1 is the parent of 2 and 3, who are siblings.

Suppose we have a tree consisting of n nodes, how can we
make this into a tree with n+1 nodes?

 - the new node must have a parent, so choose one
 - add the new node to the list of child nodes for the parent
   at a specific position

All trees are constructed this way. That is, we start with one root
node, and then we keep adding nodes, attach them to parent nodes,
at a specific location.

The 'adding-a-node' procedure is one of these:-

 C  add a child node to a node that does not already have any children
    (i.e. a leaf node)

 LS add a new sibling to the immediate left of an existing child node

 RS add a new sibling to the immediate right of an existing child node

Note on redundancy: the last two operations may coincide:

      1
     /|\
    / | \
   /  |  \
  2   4   3

Node 4 is both a right sibling of 2 as well as a left sibling of 3.

We are going to introduce more redundancy and add some other node-
adding operations

 LC  add a child node to the left of all its siblings (if any)

 RC  add a child to the right of all its siblings

For leaf nodes, operations C, LC and RC are all equivalent.

In the diagram below

      1
     /|\
    / | \
   /  |  \
  2   3   4

Node 4 could be viewed either as a new rightmost child of 1, or as a
new right sibling of 3.

--

One way to model trees is by thinking of them as nested sets:
- each node is represented as an interval, with a left and a right endpoint
- every node's interval is properly contained within
  the interior of the parent's interval (that is, no interval endpoints
  coincide)
- the order of the child nodes is preserved by the order of the intervals

For example, the tree

     *
    / \
   /   \
  *     *
  |    / \
  |   /   \
  *  *     *

(I won't number the nodes, because I'm going to put some other numbers next
to them shortly) can be modeled as


  [-------------------]
    [----]  [--------]
      [-]     [-] [-]

To realize this model in, say, a database, there must be some way to determine
what the endpoints for the corresponding endpoints should be. There is a well
known [*] technique that involves walking along the tree from the root in a
counter-clockwise direction, and labeling the left and right sides of nodes as
we meet them with increasing numbers.

In our example tree above:

    1*12
    /  \
   /    \
 2*5    6*11
  |     / \
  |    /   \
 3*4 7*8   9*10

And our intervals hence become:
[1,12], [2,5], [6,11], [3,4], [7,8], [9,10].

The problem with doing things this way is that if we insert a node anywhere
in our tree, at least one interval will have to be adjusted. And, indeed, if
we add a left child to the root *all* endpoints must be recalculated.

So we want a way to use (pairs of) numbers to represent tree nodes that
don't need to be adjusted when a node is added to the tree.

One obvious solution is to just use real numbers: make the root node correspond
to the unit interval [0,1], and choose ever smaller numbers as endpoints
for the child intervals. The problem here of course is that computers and
real numbers don't really work together.

For example, here is one spectacularly naive implementation: suppose we want to
add a new right child to the root node.
Then we cannot use 0 as the new left endpoint, and we cannot use 1 as the new
right endpoint. So the obvious solution (you would think) is to go somewhere
in the middle:

   [--------|--------|--------]
   0       1/3      2/3       1

This approach would make each new interval a third the size of the existing
space. Which means that as soon as you have added about 32 right children to
the root node, the endpoints will be numbers very close to 1, but with a
difference so small, that the numbers become equal when expressed using
floating-point arithmetic.

My goal here is to demonstrate something that works slightly better.

--

First, it is more convenient to think in rational, rather than real numbers.
So we are going to deal with numbers of the form

       numerator / denominator

The goal is to

        minimize the denominators!

Let's look at the unit interval [0,1] again. We want to divide this into
three parts to obtain a child node and room for more siblings later.

The 'simplest' (in the sense that it has the smallest denominator) rational
number between 0 and 1 is 1/2. The next simplest is 1/3.
So why not use those numbers then?

   [--------|--------|--------]
   0       1/3      1/2       1

Here we have our first child, [1/3, 1/2].

What do we do if we want to add a child to that?

We need two fractions between 1/3 and 1/2 that have the smallest possible
denominator. A little guessing reveals that these numbers are 2/5 and 3/7.
In fact, we can do without any guesswork, because there is a theorem about
fractions we can use here:-

  if a, b, c, d are positive integers such that

       bc - ad = 1

  then

      ( a + b ) / ( c + d )

  is the fraction that lies between a / b and c / d that has the smallest
  possible denominator

(meaning that: if p/q is another fraction between a/b and c/d, and q > 0,
then q >= c+d)

(Proof:- let p/q be a number between a/b and c/d with q > 0.
Let x = -dp + cq, y = bp -aq
Then xa + yc = p (since bc-ad=1), and xb + yd = q.
Then
  0 < p/q - a/b = (again since bc-ad=1) = x / bq
and
  0 < c/d - p/q = y / bq
Since x and y are integers, and b and q are positive, it follows that both
numbers must be equal to or greater than 1(!).
Since b and d are positive, it follows that xb + yd >= b + d,
or in other words, q >= b + d. QED.

The expression bc - ad = 1 is another way of saying that
the determinant of the matrix

  a b
  c d

is -1. Hence that matrix has an inverse with integer coefficients. This
is how one obtains the formulas for x and y.)

Furtunately the numbers a/b = 1/3 and c/d = 1/2 satisfy the 'determinant'
relation bc - ad = 1. Hence the simplest rational between 1/3 and 1/2 is
(1+1)/(3+2) = 2/5, which is indeed one of the numbers we found earlier.

The next crucial observation is that whenever ( a/b, c/d ) are pairs of
fractions for which the 'determinant' property holds, the property also holds
for the pairs

  ( a/b, (a+c)/(b+d) )

and

  ( (a+c)/(b+d), b/d )

From this, it follows that the fraction with the smallest denominator
between a/b and (a+c)/(b+d) is (2a+c)/(2b+d), and the fraction with the
smallest denominator between (a+c)/(b+d) and c/d is (a+2c)/(b+2c)!

Returning to our [1/3, 1/2] interval: we first found 2/5 as the first
endpoint by 'adding' (1/3, 1/3). We can now repeat this with the pair
(2/5, 1/2) and get 3/7 as our second endpoint. Note that the pair
(1/3, 2/5) also has the determinant property, but results in a larger
denominator.

To summarize:

Let a, b, c, d be positive integers such that

   bc - ad = 1

Then the two numbers between a/b and c/d with the lowest possible
denominator will be

   (2a+c)/(2b+d), (a+c)/(b+d)           if b <= d
   (a+c)/(b+d), (a+2c)/(b+2d)           otherwise

The final observation is that the 'determinant' property (bc-ad=1) holds
for all pairs of *successive* endpoints that are obtained in this way.

Our original set of endpoints was: (0/1, 1/1).
We then got (0/1, 1/3, 1/2, 1/1).
And subsequentely (0/1, 1/3, 2/5, 3/7, 1/2, 1/1).

When we calculate the 'bc-ad' of successive endpoints we get, as expected,

 1*1 - 0*3 = 1
 3*2 - 1*5 = 1
 5*3 - 2*7 = 1
 7*1 - 3*2 = 1
 2*1 - 1*1 = 1

The 'determinant' property has some other nice implications as well.
Let (a/b, c/d) again be fractions with bc - ad = 1. Then gcd(a,b) = 1,
and gcd(c,d) = 1, so the fractions are already in lowest terms.
Also gcb(b,d) = 1, which imples that b and d will be unequal, unless they
are both equal to 1.

Now, just to practice, let's divide the remaining intervals as well.


   [--------|--|--|--|--------]
   0        1  2  3  1        1
   1        3  5  7  2        1

   [--|--|--|--|--|--|--|--|--]
   0  1  1  1  2  3  1  2  3  1
   1  5  4  3  5  7  2  3  4  1

--

Now we can translate the tree-building tasks into something more concrete.

Let's say we have a tree, that is, a set of nodes, and hence, a set
of intervals, and a set of endpoints.

Grab a node, let's call it the current node. We want to add a new node
to the tree in relation to the current node.

To create a Left Sibling, calculate new endpoints as described above
between the current node's left endpoint and the largest endpoint
that is smaller. If there is no such endpoint, the current node is
the root node, for which siblings make no sense.

To create a Right Sibling, determine the smallest endpoint that is greater
than the current node's right endpoint, and create a new interval in between
those. Again, if such and endpoint does not exist the current node is the
root node.

To create a Left Child, determine the smallest endpoint that is greater than
our the current node's left endpoint. Such an endpoint will always exist.
(Because the right endpoint is greater than the left endpoint.)

To create a Right Child, get the largest endpoint less than the current node's
right endpoint.

The remaining process (adding a Child to a leaf node) is slightly different.
Here the verification is the bigger task.

To create a proper Child node, determine the smallest endpoint greater
than the current node's left endpoint (similar to what is done for LC.)
If this endpoint is not the current node's right endpoint, the node already
has children, and we stop. Otherwise, use the current node's left and right
endpoints to create a new interval.

Let's apply this to our sample tree illustrated below:

     *
    / \
   /   \
  *     *
  |    / \
  |   /   \
  *  *     *

This corresponds to the following set of nodes:-

  [--------------------------------]
     [--------]  [--------------]
        [--]        [--]  [--]

  0  1  2  3  1  1  4  3  2  3  1  1
  1  5  9 13  4  3 11  8  5  7  2  1

Note that the order in which nodes are added determine the value of
the endpoints. For example the following would also be valid representations
of the same tree:-

  [--------------------------------]
     [--------]  [--------------]
        [--]        [--]  [--]


  0  1  2  3  1  2  7  5  8 11  3  1
  1  3  5  7  2  3 10  7 11 15  4  1

--

The depth of a node is the amount of steps it takes to get from that node
via its parent (if it exists) to the root node.

The root node has depth 0, the children of the root have depth 1, the
children of the children of the root have depth 2, and so on.

A node A is an ancestor of a node B if A is on the path between B and the
root. B is then also called a descendant of A.

For a node it is sometimes needed to calculate its ancestors or its
descendants. For our nested interval representation this is relatively
straigtforward:

 a node A is an ancestor of B if and only if the interval corresponding
 with B is contained within the interval corresponding with A

To calculate the path to the root when using our numbering scheme,
we can do the following.

 starting with node [a/b, c/d]:

  - if b and d are equal, both numbers are equal to 1 (see above),
    which can only happen if the node is the root node [0, 1],
    and we are done
  - otherwise there exists a node with endpoint (a-c)/(b-d) (!!)
  - take that node, say, [a'/b', c'/d']
  - if a'/b' < a/b and c'/d' > c/d, the node is our parent,
    and we note this as one of our ancestors
  - otherwise, the node is a sibling
  - either way, replace a,b,c,d with a', b', c' and d', and restart

Since the largest denominator will decrease, this process will always
end.