Using nested intervals with rational endpoints to model tree structures. What is a tree again? - a finite set of nodes - a root node - each node except the root node has a parent node - the root has no parent - child nodes are ordered A tree can be represented as a diagram with the root node at the top, and arrows leading down to the child nodes, which are ordered left to right. The order is important in this discussion. For example, these two trees are different:- 1 1 / \ / \ 2 3 3 2 Nodes that have the same parents are called siblings. So in the trees above, 1 is the parent of 2 and 3, who are siblings. Suppose we have a tree consisting of n nodes, how can we make this into a tree with n+1 nodes? - the new node must have a parent, so choose one - add the new node to the list of child nodes for the parent at a specific position All trees are constructed this way. That is, we start with one root node, and then we keep adding nodes, attach them to parent nodes, at a specific location. The 'adding-a-node' procedure is one of these:- C add a child node to a node that does not already have any children (i.e. a leaf node) LS add a new sibling to the immediate left of an existing child node RS add a new sibling to the immediate right of an existing child node Note on redundancy: the last two operations may coincide: 1 /|\ / | \ / | \ 2 4 3 Node 4 is both a right sibling of 2 as well as a left sibling of 3. We are going to introduce more redundancy and add some other node- adding operations LC add a child node to the left of all its siblings (if any) RC add a child to the right of all its siblings For leaf nodes, operations C, LC and RC are all equivalent. In the diagram below 1 /|\ / | \ / | \ 2 3 4 Node 4 could be viewed either as a new rightmost child of 1, or as a new right sibling of 3. -- One way to model trees is by thinking of them as nested sets: - each node is represented as an interval, with a left and a right endpoint - every node's interval is properly contained within the interior of the parent's interval (that is, no interval endpoints coincide) - the order of the child nodes is preserved by the order of the intervals For example, the tree * / \ / \ * * | / \ | / \ * * * (I won't number the nodes, because I'm going to put some other numbers next to them shortly) can be modeled as [-------------------] [----] [--------] [-] [-] [-] To realize this model in, say, a database, there must be some way to determine what the endpoints for the corresponding endpoints should be. There is a well known [*] technique that involves walking along the tree from the root in a counter-clockwise direction, and labeling the left and right sides of nodes as we meet them with increasing numbers. In our example tree above: 1*12 / \ / \ 2*5 6*11 | / \ | / \ 3*4 7*8 9*10 And our intervals hence become: [1,12], [2,5], [6,11], [3,4], [7,8], [9,10]. The problem with doing things this way is that if we insert a node anywhere in our tree, at least one interval will have to be adjusted. And, indeed, if we add a left child to the root *all* endpoints must be recalculated. So we want a way to use (pairs of) numbers to represent tree nodes that don't need to be adjusted when a node is added to the tree. One obvious solution is to just use real numbers: make the root node correspond to the unit interval [0,1], and choose ever smaller numbers as endpoints for the child intervals. The problem here of course is that computers and real numbers don't really work together. For example, here is one spectacularly naive implementation: suppose we want to add a new right child to the root node. Then we cannot use 0 as the new left endpoint, and we cannot use 1 as the new right endpoint. So the obvious solution (you would think) is to go somewhere in the middle: [--------|--------|--------] 0 1/3 2/3 1 This approach would make each new interval a third the size of the existing space. Which means that as soon as you have added about 32 right children to the root node, the endpoints will be numbers very close to 1, but with a difference so small, that the numbers become equal when expressed using floating-point arithmetic. My goal here is to demonstrate something that works slightly better. -- First, it is more convenient to think in rational, rather than real numbers. So we are going to deal with numbers of the form numerator / denominator The goal is to minimize the denominators! Let's look at the unit interval [0,1] again. We want to divide this into three parts to obtain a child node and room for more siblings later. The 'simplest' (in the sense that it has the smallest denominator) rational number between 0 and 1 is 1/2. The next simplest is 1/3. So why not use those numbers then? [--------|--------|--------] 0 1/3 1/2 1 Here we have our first child, [1/3, 1/2]. What do we do if we want to add a child to that? We need two fractions between 1/3 and 1/2 that have the smallest possible denominator. A little guessing reveals that these numbers are 2/5 and 3/7. In fact, we can do without any guesswork, because there is a theorem about fractions we can use here:- if a, b, c, d are positive integers such that bc - ad = 1 then ( a + b ) / ( c + d ) is the fraction that lies between a / b and c / d that has the smallest possible denominator (meaning that: if p/q is another fraction between a/b and c/d, and q > 0, then q >= c+d) (Proof:- let p/q be a number between a/b and c/d with q > 0. Let x = -dp + cq, y = bp -aq Then xa + yc = p (since bc-ad=1), and xb + yd = q. Then 0 < p/q - a/b = (again since bc-ad=1) = x / bq and 0 < c/d - p/q = y / bq Since x and y are integers, and b and q are positive, it follows that both numbers must be equal to or greater than 1(!). Since b and d are positive, it follows that xb + yd >= b + d, or in other words, q >= b + d. QED. The expression bc - ad = 1 is another way of saying that the determinant of the matrix a b c d is -1. Hence that matrix has an inverse with integer coefficients. This is how one obtains the formulas for x and y.) Furtunately the numbers a/b = 1/3 and c/d = 1/2 satisfy the 'determinant' relation bc - ad = 1. Hence the simplest rational between 1/3 and 1/2 is (1+1)/(3+2) = 2/5, which is indeed one of the numbers we found earlier. The next crucial observation is that whenever ( a/b, c/d ) are pairs of fractions for which the 'determinant' property holds, the property also holds for the pairs ( a/b, (a+c)/(b+d) ) and ( (a+c)/(b+d), b/d ) From this, it follows that the fraction with the smallest denominator between a/b and (a+c)/(b+d) is (2a+c)/(2b+d), and the fraction with the smallest denominator between (a+c)/(b+d) and c/d is (a+2c)/(b+2c)! Returning to our [1/3, 1/2] interval: we first found 2/5 as the first endpoint by 'adding' (1/3, 1/3). We can now repeat this with the pair (2/5, 1/2) and get 3/7 as our second endpoint. Note that the pair (1/3, 2/5) also has the determinant property, but results in a larger denominator. To summarize: Let a, b, c, d be positive integers such that bc - ad = 1 Then the two numbers between a/b and c/d with the lowest possible denominator will be (2a+c)/(2b+d), (a+c)/(b+d) if b <= d (a+c)/(b+d), (a+2c)/(b+2d) otherwise The final observation is that the 'determinant' property (bc-ad=1) holds for all pairs of *successive* endpoints that are obtained in this way. Our original set of endpoints was: (0/1, 1/1). We then got (0/1, 1/3, 1/2, 1/1). And subsequentely (0/1, 1/3, 2/5, 3/7, 1/2, 1/1). When we calculate the 'bc-ad' of successive endpoints we get, as expected, 1*1 - 0*3 = 1 3*2 - 1*5 = 1 5*3 - 2*7 = 1 7*1 - 3*2 = 1 2*1 - 1*1 = 1 The 'determinant' property has some other nice implications as well. Let (a/b, c/d) again be fractions with bc - ad = 1. Then gcd(a,b) = 1, and gcd(c,d) = 1, so the fractions are already in lowest terms. Also gcb(b,d) = 1, which imples that b and d will be unequal, unless they are both equal to 1. Now, just to practice, let's divide the remaining intervals as well. [--------|--|--|--|--------] 0 1 2 3 1 1 1 3 5 7 2 1 [--|--|--|--|--|--|--|--|--] 0 1 1 1 2 3 1 2 3 1 1 5 4 3 5 7 2 3 4 1 -- Now we can translate the tree-building tasks into something more concrete. Let's say we have a tree, that is, a set of nodes, and hence, a set of intervals, and a set of endpoints. Grab a node, let's call it the current node. We want to add a new node to the tree in relation to the current node. To create a Left Sibling, calculate new endpoints as described above between the current node's left endpoint and the largest endpoint that is smaller. If there is no such endpoint, the current node is the root node, for which siblings make no sense. To create a Right Sibling, determine the smallest endpoint that is greater than the current node's right endpoint, and create a new interval in between those. Again, if such and endpoint does not exist the current node is the root node. To create a Left Child, determine the smallest endpoint that is greater than our the current node's left endpoint. Such an endpoint will always exist. (Because the right endpoint is greater than the left endpoint.) To create a Right Child, get the largest endpoint less than the current node's right endpoint. The remaining process (adding a Child to a leaf node) is slightly different. Here the verification is the bigger task. To create a proper Child node, determine the smallest endpoint greater than the current node's left endpoint (similar to what is done for LC.) If this endpoint is not the current node's right endpoint, the node already has children, and we stop. Otherwise, use the current node's left and right endpoints to create a new interval. Let's apply this to our sample tree illustrated below: * / \ / \ * * | / \ | / \ * * * This corresponds to the following set of nodes:- [--------------------------------] [--------] [--------------] [--] [--] [--] 0 1 2 3 1 1 4 3 2 3 1 1 1 5 9 13 4 3 11 8 5 7 2 1 Note that the order in which nodes are added determine the value of the endpoints. For example the following would also be valid representations of the same tree:- [--------------------------------] [--------] [--------------] [--] [--] [--] 0 1 2 3 1 2 7 5 8 11 3 1 1 3 5 7 2 3 10 7 11 15 4 1 -- The depth of a node is the amount of steps it takes to get from that node via its parent (if it exists) to the root node. The root node has depth 0, the children of the root have depth 1, the children of the children of the root have depth 2, and so on. A node A is an ancestor of a node B if A is on the path between B and the root. B is then also called a descendant of A. For a node it is sometimes needed to calculate its ancestors or its descendants. For our nested interval representation this is relatively straigtforward: a node A is an ancestor of B if and only if the interval corresponding with B is contained within the interval corresponding with A To calculate the path to the root when using our numbering scheme, we can do the following. starting with node [a/b, c/d]: - if b and d are equal, both numbers are equal to 1 (see above), which can only happen if the node is the root node [0, 1], and we are done - otherwise there exists a node with endpoint (a-c)/(b-d) (!!) - take that node, say, [a'/b', c'/d'] - if a'/b' < a/b and c'/d' > c/d, the node is our parent, and we note this as one of our ancestors - otherwise, the node is a sibling - either way, replace a,b,c,d with a', b', c' and d', and restart Since the largest denominator will decrease, this process will always end.