Jacques Mattheij

Technology, Coding and Business

A simple way to store and retrieve tree based information in the Django object relational model (ORM)

Django is a pretty neat piece of software, but if you want to stick to using just the Django based ORM and not resort to putting hard coded SQL in your code then there are some things that are quite difficult.

One of these is the storage of tree-based data in Django. For instance, a nested comments set.

You could store these ‘naively’, but it would require a lot of extra trips to the database in order to retrieve a tree or a portion of it. Every ‘subtree’ (all the child nodes hanging off a particular node) needs another call to the database. This gets expensive very quickly, especially if a tree is large.

There are solutions for this, such as the Modified, preorder tree traversal method. This is a clever method where two extra fields (‘left’ and ‘right’) are added to each element in the tree and the ordering is done by using these fields.

This is more efficient, because in order to retrieve the subtree hanging off any node all you need to do is retrieve items whose ‘left’ field are between the ‘left’ and ‘right’ fields of the node you are interested in, and that’s a single database query.

The downside of this method is that if you want to insert a node all the nodes that are to the ‘right’ of that node get new numbers. This can get very expensive, especially with large trees, and it will require locking the table during the update.

Here is a different way of achieving the same effect, minus some of the downsides:

The assumption is still that you are going to be doing less writes to your database than reads, so a bit of overhead during the write phase is acceptable, but not as bad as it was in the previous example.

Instead of storing the ‘links’ between the fields, let’s store the whole path inside a node. This puts a price on the depth of the tree (that means longer paths), and it puts a price on inserting nodes if you want to change the sort order (you’ll have to update the paths of all the other nodes at this level).

You can still select all the other nodes descending from a single node in the tree with one query, only this time you use the ‘%’ wildcard operator in the query like this:

select [fields] from nodes where nodepath like ‘/0000000008/0000000010/%’;

Because all of the leading part of the nodepath is absolute this is quite fast.

In order to keep the nodepath contents consistent you will have to override the save method of your model.

If you want you can use the path to store items in an order that is pre-sorted by some arbitrary field or a formula.

Note that this has a scale limit built in because of the number of ‘0’s in the node id, make sure you use enough of them to accomodate your applications needs (storage is cheap, and modifying the datastructure to expand it is a pain).

I’m not an expert django programmer and if there is a much better way of doing this then I would be very happy to learn about it!

HN Submission/Discussion
If you read this far you should probably follow me on twitter: