In this scribble I'll discuss an implementation of different topological structures, which can be used for indexing different "relation systems", for example trees for menus, book structures, etc... graphs for caching links between pages.
We can split the problem into two main subproblems. First, navigation and queries withing a single relation domain. For example local table of contents for a section in a book. Second is higher order or inter-domain queries. For example: given a node, determine the all related nodes, ordering them by their distance from the node, based on taxonomy, their position in a book(s), and their position in the site's "link web". We should be able to implement second and hier order based queries, based on the results of these two basic problems.
The first problem is domain structure specific. We can have unordered, linear ordered sets, as well as having topologigal order imposed - trees or, more generally, graphs. The graphs can be directed or undirected.
We can represent an undirected graph, by either having two arcs for each undirected arc, one each way, or by simply assuming undirected semantics, and tailoring the code to be direction ignorant. The general relation form is (x,y,R), where x,y are things. R is the relation domain the relation domain is R(X,Y), where X is the set of all possible x of some types Tx which can act as the left side of the relation, similarly for the right side Y is the set of al possible y of some types Ty.
The first query of interest is - which things are related to X.
Y = related(R,x); Y is a set of items directly related to x
The second query is a generalisation of the first:
Y = related_hier(R,x,depth) - Y is a tree of items related indirectly to x upto depth.
In pseudo code:
function related_tree(R,x,depth) {
Y=related(R,x);
if(depth != 0) {
foreach(y in Y) {
Y[y] = related_hier(Rx, depth-1);
}
}
return Y;
}
We could generalise this to related_graph(R,x,depth), where the tree from the previous example is folded into a graph, so that there is no datas redundancy in the structure.
If we know the structure of R ins some case we can build indexes to help us optimise the performance of the this algorithm. For trees, we can use the nested sets model to avoid recursion, which is not efficient when you need to implement the function in SQL.
For domains, where the order is generally unknown, or complex we will need to fail over to the recursive function, or an iterative alternative, performing a sequential traversal of the graph.
To create the model, we need to know what can be X, Y, and what is R. X is evry possible x:T (of type in T), where T is the set of possible types distinguished by their id - tid. This means that we should have x is (xid,tid), similarly y is (yid, tid). R is the relation domain rid. This means that a relation is identified uniquely by:
((xid, txid),(yid,tyid), rid)
We can create an SQL table to reflect that in db.
CREATE TABLE relations (
left_id INTEGER NOT NULL,
lt_id INTEGER NOT NULL,
right_id INTEGER NOT NULL,
rt_id INTEGER NOT NULL,
relt_id INTEGER NOT NULL,
KEY left (left_id,lt_id),
KEY right (left_id,lt_id),
KEY rel (rt_id)
);
To insert a relation into the db :
INSERT INTO relations ($left_id,$left_type, $right_id, $right_type, $relation_type);
To get all related things to a particular thing:
SELECT right_id AS thing, rt_id AS type, relt_id AS relation
FROM relations
WHERE left_id = $thing->id AND lt_id = $thing->type;
To get all related by R things to a particular thing:
SELECT right_id AS thing, rt_id AS type
WHERE left_id = $thing->id AND lt_id = $thing->type AND relt_id=$relation_type;
The wrapper code in php is obvious so I want pollute the explanation, with anything but the interface of the functions:
function add_relation($left_thing,$right_thing,$relation)
function related($thing){}
function related_by_type($thing,$relation_type){}
In theory this should be sufficient to represent every type of relation, but indirect or computed ones, although even for them the selection functions interface could suffice.
The nested are nothing more but a pre-walked tree, with left and right index of each node filled in during the depth-first walk. The relation becomes:
(((xid,tid),(ns_left_id,ns_right_id)),((xid,tid),(ns_left_id,ns_right_id),rid)
We need to create an index table:
CREATE TABLE nested_set_index (
thing_id INTEGER NOT NULL,
thing_type INTEGER NOT NULL,
left INTEGER NOT NULL,
right INTEGER NOT NULL,
rel INTEGER NOT NULL,
KEY thing (thing_id,thing_type),
KEY left (left),
KEY right (right),
KEY rel (rel)
);
To select all "children" or "next", i.e things to the right we can do:
SELECT rel.left_id,rel.lt_id FROM relations AS rel
LEFT JOIN nested_set_index AS t ON rel.left_id=t.thing_id AND rel.lt_id=t.thing_type AND rel.relt_id=t.rel
WHERE t.left > $thing->left AND t.right right
ORDER by t.left;
To select all "parents" or "previous", i.e things to the left we can do:
SELECT rel.left_id,rel.lt_id FROM relations AS rel
LEFT JOIN nested_set_index AS t ON rel.left_id=t.thing_id AND rel.lt_id=t.thing_type AND rel.relt_id=t.rel
WHERE t.left left AND t.right right
ORDER by t.left;
To select all things in a subtree or the current branch starting from the root we can do:
SELECT rel.left_id,rel.lt_id FROM relations AS rel
LEFT JOIN nested_set_index AS t ON rel.left_id=t.thing_id AND rel.lt_id=t.thing_type AND rel.relt_id=t.rel
WHERE (t.left left AND t.right right)
OR t.left => $thing->left AND t.right =right
ORDER by t.left;
To add a new element to the tree we need to know the parent and the parent's rightmost immediate child.
INSERT INTO nested_set_index VALUES ($thing_id,thing_type,$sibling->right+($parent->right-$sibling->right)/3,$parent->right-($parent->right-$sibling->right)/3, $thing->rel);
The operations to determine the left and right index are required in order to minimise the need for tree reindexing, by using the whole INTERGER number space.
As you can see these are domain specific operations, relying on the fact that the relations domain has a tree structure.
This is an example of a compund or more complex relation, and is based on the existing drupal structures. It can be refactored to use a similar notation as the earlier examples, but I'm too lazy to do it at the moment.
SELECT tn.nid, n.title, r.teaser, count(tn.tid) AS metric
FROM {node} AS n
LEFT JOIN {term_node} AS tn ON tn.nid=n.nid
INNER JOIN {node_revisions} AS r ON r.vid = n.vid
WHERE tn.tid IN (SELECT tid FROM term_node WHERE nid=%d) AND tn.nid != %d
GROUP BY tn.nid ORDER by metric DESC LIMIT %d;
This is a simple example off a higher order query.
If we were to make everything in Drupal a node - users, terms, vocabularies, relations,... we will be
able to simplify the above examples, by losing the thing type ids, when determining the relations, as well, as allowing for easier building richer semantics.
To insert a relation into the db :
INSERT INTO relations ($left, $right, $relation);
To get all related things to a particular thing:
SELECT right AS thing , relation
FROM relations
WHERE left = $thing->id;
To get all related by R things to a particular thing:
SELECT right AS thing,relation
WHERE left = $thing->id AND relation=$relation_type;
And so on. This will result in faster queries, and we will be able to abstract things. Most current queries will fall in either one of the simple queries immediately above, or will look like the freetagging case query.
For menus, books and similar structures, we will be able to use tree indexes, based on the nested sets model. We will be able provide a cosistent api for the most common types of queries, so reducing a lot of code duplication, thus speeding up the code loading stage of drupal.
Oh, yes. Another herecy. What if we consider the field types of cck as atomic, rather than nodes? Everything a field?
That will come in another installment of my scribbles. The basic idea is to break up the SQL implementation into their different parts and produce code generators, capable of producing relatively complicated statements.
Another inetersting consequence of the approach described here, is the introduction of relation metrics, similar to the freetagging case, but within multiple domains. So we could produce ordered lists based on different 'closeness' measures. It can be really useful to produce semantically rich applications. Think related books on Amason, related things in anyMeta (look at http://mediamatic.net), and many others. And yes, we don't need to change Drupal core. Just reimplement user.module, taxonomy.module, maybe block.module in contrib. Just maintain the interfaces consistent. Then it should be possible to test them in site specific directories and bingo - everything works.