Self Joins
Introduction
In this paper we introduce a new database operator, the self join, that encapsulates queries on data with inter-row dependencies. The idea is not new, they are also known as recursive joins or hierarchical queries.
This document describes the self join from a logical development point of view. This means the useful aspects of self joins are provided last, along with some of the simplest examples. I may make a document that covers this same material, but with emphasis on reducing the learning curve, at a later time.
Self Join syntax is not supported by KMAL at this time.
History
In the standard database theory there are eight relational operators: The set operators (Union, Intersection, Difference, Product) and the relational operators (Select, Filter, Join, Divide). These operators assume that all the rows in any table are independent. This assumption is limiting when the rows are not independent. This type of dependency appears in adjacency tables; tables defining graph edges. An example is:
CLASS DirectedEdge
FIELD Key AS Integer; //every edge will be given an identification
FIELD Head AS Vertex; //the verticies connected to this edge
FIELD Tail AS Vertex;
END DirectedEdge
Suppose we have a database of employee relationships and we need to write a query to find all the employees of a particular manager. To solve the problem, there are currently two choices. First, the DB administrator can design the schema to show every manager of each employee. This is an extremely redundant table, and breaks normal form. Second, the DB administrator can use some, often slow, proprietary extension of SQL to do the job for you. I find both options ungraceful.
Benefits
Some may argue that the usefulness and efficiency of databases comes from the row independence; adding a dependency operator would be a bad thing. I do not totally agree. I advocate designs that use row independence as much as possible. Yet row independence can not solve all problems, and the addition of the self join solves these extra problems in an elegant way.
By defining the self join as a standard we are able to separate the use of the self join from the implementation. It allows specialization either in the application of, or in the implementation of, the self-join. By identifying a self-join, and defining its interface, we can simplify the DB administrators SQL, and allow the backend designers to optimize for this type of query.
Long Paths
- The use of long paths is only a conceptual view of self-joins, actual implementation using this method can lead to an exponential explosion. There are many implementation shortcuts that can be performed to bring most self joins down to a reasonable order.
- The self join does not consider cycles and is guaranteed completion.
- We will see that in the presence of aggregation many optimizations can be made.
The context of the Self-Join is taken with respect to the long paths in an existing graph. This means that dynamically determined paths are not handled by the self join. Yet this restriction may not be a hindrance, because we may be able to show that any dynamic path can be calculated in a two phase self join. We will leave this proof for later.
Given a set of start verticies, we have P paths that can be taken. Any vertex in one of those paths will have E edges pointing in. Each of those edges can represent a subset of the P paths.
Let us define the type to hold all these paths
CLASS Path
FIELD PathKey AS Integer //A UID
FIELD Node AS Vertex
FIELD SubpathKey AS Path
END Path
We will consider directed graphs with no cycles. Here are the most three most interesting patterns:
Example A, (list)
Technically, the list is a specialization of a tree, yet the list has the added flexibility to be upward or downward pointing. This is good semantically when defining limits to the data structure and providing parameters for optimization.
There is only one long path in a list, (P = 1)
Example B, (tree)
The number of long paths P = Nleaf
Example C, (double list)
We have this example to demonstrate the possible exponential explosion in long paths. The number of long paths here is 8, but in general P = 2 ^ (N / 2) = 2 ^ (E - 1)
Semantics
The self join only considers that edge set that defines the graph, the vertex set is completely dependent. The syntax of the (relational) self is as follows:
SELFJOIN ( start = <seed vertex>,
set = <edge set>,
next = <next field>,
prev = <previous field>,
groupby = <Path | Node | Leaf>
)
<start code>
EXECUTE
<code>
END SELFJOIN
The seed vertex is the starting point of the self join, this can be the root of a tree, the beginning of a list, or any arbitrary vertex in a graph. The edge set is just that. By defining the edge set, the edge type is implied. This means each edge is expected to have head and tail attributes, defined by the next and prev fields respectively. We will get into the details of groupby and "relational" distinction later.
Traveling Salesperson
(relational, groupby=Path)
This is a classic problem that has a simple specification in self join notation. We will assume we have a directed graph, and we want to find the path of least total weight that includes all verticies, starting at START. This can be a Hamiltonian cycle if the graph is complete or a Eulerian cycle if not. Here is the table definition for members of the edge set
CLASS DirectedEdge
FIELD Key AS Integer;
//every edge will be given an identification
FIELD Head AS Vertex;
//the vertex connected to this edge
FIELD Tail AS Vertex;
FIELD Weight AS Float;
END DirectedEdge
Since the self join considers long paths, one of those long paths will be a solution to TSP.
FIELD AllPaths AS Set_Class;
AllPaths = SELFJOIN (start=START,
set=DirectedEdge,
next=Head, prev=Tail,
groupby=Path
)
FIELD Length=0 AS Integer_Class;
FIELD Path=List_Class.newInstance() AS List_Class;
EXECUTE
Length=Length+Weight
Path.append(Key)
END SELFJOIN
The type of the members of AllPaths is defined in the start code , and we can now simply filter out the shortest:
BestPaths = FILTER AllPaths HAVING Length=Max(Length)
Of course, the implementation of this specification is O(2^n).
Types of Self Joins
There are effectively 6 different types of self join, all made to be semantically similar, but each has a significantly different efficient implementation. These are:
| Type |
|
|
| groupby=Path |
|
|
| groupby=Node |
|
|
| groupby=Leaf |
|
|
Aggregation
Very rarely are all paths though a directed graph needed. It is more common to aggregate the results. With aggregation there are efficiency gains to be made. The Node and Leaf options on the groupby clause reduces the order of the result set to around O(|V|).
Grouping by Node allows every node in the graph to be considered after all it's parents, and calculate its attributes based on the parent attributes. Note that cycles are still not considered.
Grouping by Leaf is much like grouping by Node, except only the final leaf nodes are provided as output.
Hierarchical Domains
Many times the verticies and edges of the graph are not distinguished by separate vertex and edge classes. They are often the same class, and effectively define a hierarchical structure. The example we use is one that defines a corporate employee structure.
Employee
Key
Name
Manager
Using the self join on this type of hierarchical structure is exactly the same as in the case of the relational case except that one of the next/prev parameters is excluded, depending on the direction taken though the hierarchy. We will see examples of this later.
Garbage Collector
(relational, groupby=Node)
We can make a simple marking query for a garbage collector. We have a SeedObject and access to all the variables/object slots.
Variable
Object
//the object this variable belongs
Field
//the particular slot in the object
Value
//the object pointed to
First we annotate all objects with a marker for deletion. Second we a self join to toggle that marker if it is referenced. Finally we remove all objects marked for delete.
ANNOTATE Object_Class {
FIELD DeleteMe=TRUE AS Boolean_Class;
END ANNOTATE
SELFJOIN ( start=SeedObject,
set=Variable,
next=Value, prev=Object,
groupby=Node
) EXECUTE
DeleteMe=FALSE;
END SELFJOIN
DELETE FROM
Object_Set
WHERE
DeleteMe=TRUE
;
Notice that there is no start code required for this self join, start code is always optional. In this case, the GC is considering more than just the nodes reachable from the SeedObject .
END OF DOCUMENT, REST IS NOTES
- The corporate substructure of the 90's allowed for the possibility of two direct managers (dotted line relationships). This information has to be held in a many-many join table; separate from the employee table.
Like the above example
There are two flavors of the self join. The first is the Single Self Join. This is a join that would happen between a field
have at least one field defined recursively to be useful. To not have a recursive result field would result in an expensive join.
Lets look at an example. Here we are finding the depth of every person in the corporate structure.
SELECT
key,
depth=depth+1
FROM
Employee
WHERE
Boss=Key
INITAILIZE
key=<key of head-honcho>,
depth=0The recursive relationship (depth=depth+1) calculates how to know an employees level if the bosses is known. The where clause is in a special order. Here the dependent field is on the left, and the independent on the right. The INITIALIZE assigns the depth of the head-honcho.
Another example will calculate the height of a manger in the corporate ladder. We have arbitrarily decided that height of the manager should be based on the deepest tree she manages.
SELECT
key,
height=MAX(height+1)
FROM
Employee
WHERE
Key=Boss
INITIALIZE
<set all records to height=0>It is important to initially set all employees heights to 0. Then the self-join will alter the ones that have employees. The recursive relationship (height=MAX(height+1)) calculates the managers height based on all the direct employees heights. The independent variable here is a field, and therefore has to be grouped-by. If the independent variable was NOT grouped, then calculating the height of the employee would result in may different answers, one for each employee. Non-aggregate calculations will be handled later in this paper.
Double Self Joins need another example. The corporate structure of the 90s allows for the dreadful possibility of two direct managers (dotted line relationships). This information has to be held in a many-many join table; separate from the employee table.
Let us calculate the depth of an employee in this new structure
SELECT
key=Employee,
depth=MAX(depth+1)
FROM
Managers
WHERE
Boss=Employee
INITAILIZE
key=<key of head-honcho>,
depth=0Now we see the effect of having to find depth, there has to be an aggregate function to reduce all the possible depths to a single number. Otherwise a manager might have a range of depths.
We can find all the employees under the control any one.
SELECT
employee=Boss,
indicator=MIN(indicator)
FROM
Managers
WHERE
Employee=Boss
INITIALIZE
key=<key of the boss>,
indicator=1By adding a set of initialized records several bosses can be worked on at once. Notice that the MIN aggregate specifies preference for one boss over another. This preference may not be what you desire when you run this type of query. There could be a scenario where you have a list of managers, and would like a complete list of employees for each. The aggregation function is preventing this from happening. In our next example we will see a Double Self Join with no aggregation.
We can find all the employees under the control any one.
SELECT
employee=Boss,
indicator=indicator
FROM
Managers
WHERE
Employee=Boss
INITIALIZE
key=A, indicator=1
key=B, indicator=2We will look at the source table for this query.
Boss Employee A C A D B C B E C F C G C H We will get the following result from the query:
Employee Indicator A 1 C 1 D 1 F 1 G 1 H 1 B 2 C 2 E 2 F 2 G 2 H 2 Manager C has three individuals working for her, and she has two bosses. Therefore her whole group, including herself, will show indicated as managed by both A and B.
Single Self Join
- There is a special case of self join when the directed graph has an out-degree of, at most, one; that is one tail edge per vertex. The representation of this type of graph can be done with a simpler primitive; exposing the 1-1 relationship. It is the 1-1 nature that allows the omission of the aggregate term and simplifies the self join specification
UnaryEdge
Key
Next //Points to object of type UnaryEdge - Dispit that logical difference, semantically this structure is not too different from the DirectedEdge structure (Let Key=Tail, and Next=Head).
- Unary graphs are best for representing lists and hierarchies.
The number of long paths that can be found in any unary graph is equal to the number of nodes considered.
Employee
Key
Manager
The Down Join never needs to aggregate because the number of long paths equals the number of nodes. WRONG, MAY WANT MAXIMUM DEPTH OF WHOLE TREE.
- Find the depth of each person in the hierarchy.
SELECT
depth=depth+1
FROM
Employee
SELF JOIN (COULD BE DOWN JOIN)
Head=Manager
Tail=Key
INITAILIZE
depth=0
If the in-degree or out-degree are not allowed to be greater than one, then the self join is allowed for semantic brevity.
Find the height of the head honcho on the hierarchy. The WHERE clause reduces the result set to the single employee we are interested in.
SELECT
Key
height=max(height)+1
FROM
Employee
SELF JOIN (COULD BE UP JOIN)
Head=Key
Tail=Manager
WHERE
Key=<key of head honcho>
GROUP BY
Key
INITAILIZE
height=0
Better yet, the addition of the self join to the relation database operators makes SQL Turing complete!
<MAKE Universal Turing Machine>