Wednesday, March 24, 2010

Rendering Trees with Closure Tables

I got a comment from a reader about the Naive Trees section of my presentation SQL Antipatterns Strike Back. I've given this presentation at the MySQL Conference & Expo in the past.

I'd also like to mention that I've developed these ideas into a new book, SQL Antipatterns: Avoiding the Pitfalls of Database Programming. The book is now available in Beta and for pre-order from Pragmatic Bookshelf.

Here's the reader's question:
I would like to ask if there's a way I can dump all the hierarchies in a single query using a closure table? For example I have a following tree:

rootree
- 1stbranch
- midbranch
- corebranch
- leafnodes
- lastbranch
- lastleaf

and I want to display it like:

rootree -> 1stbranch
rootree -> midbranch
rootree -> midbranch -> corebranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch
rootree -> lastbranch -> lastleaf

The Closure Table is a design for representing trees in a relational database by storing all the paths between tree nodes. Using the reader's example, one could define and populate two tables like this:
drop table if exists closure;
drop table if exists nodes;

create table nodes (
node int auto_increment primary key,
label varchar(20) not null
);

insert into nodes (node, label) values
(1, 'rootree'),
(2, '1stbranch'),
(3, 'midbranch'),
(4, 'corebranch'),
(5, 'leafnodes'),
(6, 'lastbranch'),
(7, 'lastleaf');

create table closure (
ancestor int not null,
descendant int not null,
primary key (ancestor, descendant),
foreign key (ancestor) references nodes(node),
foreign key (descendant) references nodes(node)
);

insert into closure (ancestor, descendant) values
(1,1), (1,2), (1,3), (1,4), (1,5), (1,6), (1,7),
(2,2),
(3,3), (3,4), (3,5),
(4,4), (4,5),
(5,5),
(6,6), (6,7),
(7,7);
What we need to do is find all the descendants of the root node 1, then for each of these descendant nodes, list its ancestors in order, separated by an arrow. We can use MySQL's useful GROUP_CONCAT() function to build this list for us.
select group_concat(n.label order by n.node separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

Here's the output in the MySQL client. It looks like what the reader asked for:
+-------------------------------------------------+
| path |
+-------------------------------------------------+
| rootree -> 1stbranch |
| rootree -> midbranch |
| rootree -> midbranch -> corebranch |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch |
| rootree -> lastbranch -> lastleaf |
+-------------------------------------------------+
I do assume for the purposes of ordering that all of a node's ancestors have a lower node number. You could alternatively use a pathlength column to the closure table and sort by that.

The Closure Table design is nice compared to the Nested Sets (or Preorder Traversal) design, because it supports the use of referential integrity. By using indexes, the EXPLAIN report shows that MySQL query optimizer does a pretty good job on it (I've omitted a few columns for brevity):
+-------+--------+-------------------+--------------------------+
| table | type | ref | Extra |
+-------+--------+-------------------+--------------------------+
| d | range | NULL | Using where; Using index |
| a | ref | test.d.descendant | |
| n | eq_ref | test.a.ancestor | |
+-------+--------+-------------------+--------------------------+

71 comments:

lakiboy said...

Wow. It seems your book is what I need. Ordered a PDF version -)

Unknown said...

Hi Bill,

How would you do this and taking all the structure into an array instead of a big string?

Thanks!

Vani said...

I came across your slides while looking for better patterns for storing hierarchical data. It's definitely one of the best articles i came across.

Closure tables just got my attention. I am planning to use it for a folders design for my project.Folders in my project are heavily shared, moved, created, deleted in the system. They are also permission-ed.

Do you think closure tables is a good idea? or Path enumeration is better?

I really appreciate your feedback.

Thanks!

Bill Karwin said...

@Vani: Hi thanks for your comment.

Closure Table has many advantages, but if one of the common operations you want to do is to move a subtree, this may not be the best design.

Path Enumeration may be easier for that particular kind of change. To move a subtree, you can do it by doing a string-replace on the prefix of a path.

In the example in this blog, suppose I want to move 'corebranch' to be a child of 'lastbranch':

UPDATE Folders
SET path = REPLACE(path, 'rootree/midbranch/', 'rootree/lastbranch/)
WHERE path LIKE 'rootree/midbranch/corebranch/%'

The best tree design for your app depends on what kinds of operations you need to do with your tree.

Vani said...

@Bill: Thank you Sir for your prompt reply.

My app simply resembles the google docs for example. I let users organize their files in folders and share them with fellow users.I was looking at hierarchical tags(labels, as in gmail/google docs). But the problem with that is sharing, i am not sure if hierarchical tags works well with permissions.

Sheldon said...

Hi Bill

First, thanks a lot for the slides from your presentation on closure trees. They are great, finally I'll be able to get fast trees working with referential integrity.

Anyway, I wanted to ask, when you mention that you depend on the id order in this post, and the alternative would be to depend on some kind of length field, I'm really not sure what you mean, or how that could be used. Would you mind elaborating, I'm trying to get this working with preorder traversal like in the example, but as far as I can tell the order wouldn't be correct if I say, moved a subtree about?

Would really appreciate the help.

Regards
Sheldon

Unknown said...

Bill, thanks for pointing me to your blog! I'm reading your book and it's very helpful (as well as Joe Celko's two SQL for Smarties books on hierarchical data). I'm sold on the versatility and integrity of closure tables; however, I'm finding the queries for returning the data I need to be less than intuitive.

Using the above example, how would one restructure the SELECT statement to list only rows with complete paths, i.e., exclude the rows containing sub-paths of succeeding rows?

Desired result:
rootree -> 1stbranch
rootree -> midbranch -> corebranch -> leafnodes
rootree -> lastbranch -> lastleaf

Thanks again for your help!

Unknown said...

Hi Bill, first of all, I am reading your and I think it is a treasure! Its full of real examples.

My question is how I can implement a threaded comment system, where the comments are paginated?

I don't know how I can get it done with closure tables.

For example: Show 5 comments and their replies

Comment 1
Reply
Reply
Reply
Comment 2
Reply
Comment 3
Comment 4
Comment 5
Reply
Reply
Reply
Reply

Unknown said...

I think I'll do 2 queries. The first to return all 5 comments and then make a query that returns me the answers to these 5 comments.

In a column of the table where comment information resides, will store if it is a comment or a response.

Sergio said...

Hi Bill. I've noticed in your closure table samples each node (even leaf nodes) are referencing to themselves. There's no way to distinguish root nodes from branches and last segments of the tree, if yo do it this way. I would rather let root nodes to have a reference to themselves only. In this case they can be distinguished from the rest by either length, which is 0 or equal ancestor and descendant values.

On another note, wondering if you have any suggestions regarding the most effective way to find all last segments of a particular tree?

Bill Karwin said...

Hi @Sergio, thanks for the questions.

The root node of a tree is the node that has no ancestors, besides itself.

SELECT c.* FROM closure AS c
LEFT OUTER JOIN closure AS anc
ON anc.descendant = c.descendant AND anc.ancestor <> c.ancestor
WHERE anc.ancestor IS NULL

Finding leaves is the complementary query, finding nodes with no descendent besides themselves.

SELECT c.* FROM closure AS c
LEFT OUTER JOIN closure AS des
ON des.ancestor = c.ancestor AND des.descendant <> c.descendant
WHERE des.descendant IS NULL

Russ said...

I've looked into the closure tables and implemented a version using path_length, and so far I love it.

However, there is one thing I have yet to solve. I can grab all ancestors or descendants of a node easily with path length. However, in my tree some nodes are in multiple subtrees, and I haven't figured out how to enumerate a single path. That is, when I say give me all ancestors and sort by path_length, this does not give me a single path to that particular node (as it would if the node was only in a single path). Instead, it's a mix of two or more paths, but I can't match a level 2 with a level 3 correctly (which ancestor of the level 3 is the correct match for that particular path?).

Is there a way around this in SQL, or must it be left up to the application layer. The path chosen isn't all that important - just that it's a correct path (or alternatively, return all paths and then leave it up to the application layer to pick one).

Altec said...

Hi, interesting topic.
I would like ask which models (for hierarchical data) it is best for which application.
(I mean shich model is the best for e-shop, blog, ,....)
Thanks.

Altec said...
This comment has been removed by the author.
Bill Karwin said...

Hi Altec, thanks for your question, but I can't answer it at that level. There are many types of tasks you might need to do with hierarchical data in any given application. The choice of what design to use for the data has more to do with the specific queries you need to run, not so much with the purpose of the application.

Niko Brauc said...

I've used closure tables in some of my recent apps and have mixed feelings. I still like nested set very much but
I wanted to implement multiple parents. That's ideal for closure tables. And now everything works simple and fast.

But I wonder if self referencing nodes are really needed.
Isn't it simpler using table with null-able anestors, eg:

CREATE TABLE `category_closure` (
`category_id` int(10) unsigned NOT NULL,
`ancestor_id` int(10) unsigned DEFAULT NULL,
`distance` tinyint(3) unsigned DEFAULT NULL,
UNIQUE KEY `category_id` (`category_id`,`ancestor_id`),
KEY `ancestor_id` (`ancestor_id`)
);

In this case we needn't (shouldn't) inserting self referencing nodes, eg. (2, 2, 0),
only childrend and other descendants.

I know that now we should by hand prevent inserting multimple inserting
of the same root category, eg.
(1, null, null)
(1, null, null)

But on the other hand root categories manipulating can be much simpler.

What's your opinion? Is it worth using null-able ancestor?

Bill Karwin said...

@Nico: I have at least three reasons why the self-referencing nodes are useful.

1. Primary key columns must be non-nullable, and the two columns of ancestor, descendant in the closure table serve as the best primary key.

2. The self-referencing row makes it easier when you add a new child node. For example, if you have a path A-B-C-D and you want to add a new child of D, you just run:

SELECT ancestor, E FROM closure WHERE descendant = D

If you didn't have a self-referencing row (D,D), you'd have to add the last path (D,E) by hand anyway.

3. Most of the time when you want to query either a chain of ancestors of D, you'd want to include D in the result too. E.g. a breadcrumbs query. If you didn't have the self-referencing row, you'd get the ancestors of D, but not D itself, from this query:

SELECT ancestor FROM closure WHERE descendant = D

With the self-referencing row, you get ancestors and also D itself. You get a similar benefit when you want to query for a subtree and include the top node of that subtree.

SELECT descendant FROM closure WHERE ancestor = B

So yes, I do think the self-referencing row gives several benefits, even though it looks superfluous at first.

Niko Brauc said...

More than enough reasons for the symmetrical structure :-) Thanks for the quick and comprehensive response.

Justin Zaun said...

Hi Bill,

First, thanks for all your different posts and reticles on Closure Tables. They've helped a lot. I have everything just about working but I can't seem to select my tree in the correct order out of my table. I get all the correct nodes, just not in the correct order, even if ordering by pathlength.

You can see my full SQL here:

http://stackoverflow.com/questions/8252323/mysql-closure-table-hierarchical-database-how-to-pull-information-out-in-the-c

Thanks.

Tukki said...

@Bill, thanks for your great posts on closure tables.

And I made a closure table demo base on your posts, and choosing another solution to render sub-tree without care about the parents. I'm just a newbie, and I don't know if this absolutely right, but it worked.

Here this my note: http://goo.gl/UIXPA

anietog said...

Hi Bill,
first of all thanks a lot for this article, it is really useful at least for me.

I was wondering this desing needs to be changed slightly if we want to store more than one tree? i mean to have N root nodes? Or as you have answered @Sergio each node with no ancestors will be the root of a particular tree? Thanks a lot for your time

anietog said...

I have already found an answer to my question. http://stackoverflow.com/questions/4958570/how-to-find-distinct-root-nodes-of-newest-nodes-in-trees-hold-in-closure-table

Thanks a lot

Unknown said...

Bill, thanks for all your great work on the subject. It's been helpful to me multiple times in the past and currently with your writings on closure tables.

I have replicated the model you prescribed in this post, but the query you use to generate the below...

+-------------------------------------------------+
| path |
+-------------------------------------------------+
| rootree -> 1stbranch |
| rootree -> midbranch |
| rootree -> midbranch -> corebranch |
| rootree -> midbranch -> corebranch -> leafnodes |
| rootree -> lastbranch |
| rootree -> lastbranch -> lastleaf |
+-------------------------------------------------+

I can't figure out how to also get the ID of each element in there. Once I extract that data from my database, I will manipulate the results and build a tree structure for my application. The user will select something from that tree and I will have to know the primary key (not the friendly name) of the node they selected for when I go to commit the changes back to the database... So I'd like something like the following

+-------------------------------------------------+
| path |
+-------------------------------------------------+
| 1.rootree -> 2.1stbranch |
| 1.rootree -> 3.midbranch |
| 1.rootree -> 3.midbranch -> 4.corebranch |
| 1.rootree -> 3.midbranch -> 4.corebranch -> 5.leafnodes |
+-------------------------------------------------+

That would allow me to parse out the ID into my applications rendered tree object so I can associate that ID with the user's selection (from the tree).

It's not obvious to me how I'd do this with the group_concat function providing some of the heavy lifting.

Bill Karwin said...

Kevin, thanks for your comment.

You can put expressions inside the GROUP_CONCAT. For example:

SELECT GROUP_CONCAT( CONCAT(id, '.', n.label) ORDER BY n.node SEPARATOR ' -> ') AS path

Unknown said...

Bill, thanks that worked well... Oddly, that's the first thing I tried, before I replied to you, but I was accessing my database through phpMyAdmin running on Ubuntu/Apache. The results of embedding the CONCAT() within the GROUP_CONCAT came up as the below

path
[BLOB - 13B]
[BLOB - 17B]
[BLOB - 12B]
[BLOB - 43B]
[BLOB - 20B]
[BLOB - 42B]
[BLOB - 56B]
[BLOB - 67B]

When I ran the query from the command line, it came up just as expected.

+---------------------------------------------------------------------+
| path |
+---------------------------------------------------------------------+
| 1.root>2.Form |
| 1.root>3.Function |
| 1.root>6.Fit |
| 1.root>2.Form>4.Architectural Documentation |
| 1.root>2.Form>5.Code |
| 1.root>2.Form>5.Code>7.Software Components |
| 1.root>2.Form>5.Code>7.Software Components>8.Source Code |
| 1.root>2.Form>5.Code>7.Software Components>9.Compiled / Binary Code |
+---------------------------------------------------------------------+

Unknown said...

Ahhh.. in PhpMyAdmin, there's an expansion above the results that allows you to select "Show BLOB Contents"... that did it.

Bill Karwin said...

Paul d'Aoust asked me: "You allude to using a pathlength column to assist in sorting, in a blog post; can you elaborate on that?"

insert into closure (ancestor, descendant, pathlength) values
(1,1,0), (1,2,1), (1,3,1), (1,4,2), (1,5,3), (1,6,1), (1,7,2),
(2,2,0),
(3,3,0), (3,4,1), (3,5,2),
(4,4,0), (4,5,1),
(5,5,0),
(6,6,0), (6,7,1),
(7,7,0);

select group_concat(n.label order by a.pathlength desc separator ' -> ') as path
from closure d
join closure a on (a.descendant = d.descendant)
join nodes n on (n.node = a.ancestor)
where d.ancestor = 1 and d.descendant != d.ancestor
group by d.descendant;

This allows you to order the nodes correctly even if the node id is not usable for sorting.

JDelage said...

Bill,

This is super useful. I'm wondering if you'd have a suggestion for adding a sort order to a closure table. That is to say, what if I want to list all the leaves of a given tree in order (e.g., starting with the first subtree on the left and so on).

How would you do that? The best I can come up with is an additional table for all the relationships where depth=1, and include a sort_order column...

Gerryjun said...
This comment has been removed by the author.
Gerryjun said...

Sir, can you help me, how can i get the top most ancestor of a given child, the child can be of any depth and i need only the ancestor with no parent or is self referencing.

Thanks

Danilo said...

To create whole tree, even if you have more than one root node (as in my case), you can do this:

SELECT * FROM `nodes` a
join closure b
on a.id=b.descendant
where ancestor in (
SELECT c.ancestor FROM closure AS c
LEFT OUTER JOIN closure AS anc
ON anc.descendant = c.descendant AND anc.ancestor <> c.ancestor
WHERE anc.ancestor IS NULL
)

Thanks for this BrILLiant technique!

Unknown said...

You've got a pretty amazing answer there. I wonder if you could help me with my dilemma: http://stackoverflow.com/questions/11790108/what-table-structure-to-use-for-nested-data

It's basically a hierarchical structure like the one you describe but each branch has numerous sub-branches that can be moved around (id stays the same position is different) and I can't wrap my head around calling up (for example):

the first top-level branch, the second mid-level branch within that top-level branch, and the fourth bottom level branch within that mid-level branch. Does that make sense?

I can imagine adding a position column to the closure table to simplify this but i still can't wrap my head around getting this done.

Bill Karwin said...

Hi Antonin,

Order of siblings within a tree isn't handled well in any design for hierarchical data that I've seen. The Nested Sets design is the closest one that allows you to order siblings implicitly, since the left/right values have a progressive sequence. But figuring out a given ordinal child from that is not supported.

If you add a position column in a closure table, you'd have to store the same position in many rows, because a given node could have many paths leading to it. Storing the values redundantly like that would create a risk for data anomalies.

I'm still struggling with this problem, but I think in closure table it would require an extra table to record the sibling order for each entry in the hierarchy.

P.S.: I don't answer questions on StackOverflow anymore.

NexTech IT Solutions said...

@Bill,

It was really a valuable resource for me to get to know about Closure Tables at the right time. I am not novice, but yes, I am new to the field, and I am doing an MLM business website.

Closure tables are good for it, I suppose. But do you suggest it's use for a very deep structure? A tree that may go down up to say 500 steps? At next step, there will be 500 inserts to the closure table per node insertion. According to you, is Closure table a good way in that situation?

Thanks.

Bill Karwin said...

@NexTech IT Solutions,

No algorithm or pattern is best for all cases. To find out if any given solution is the best for your application, you need to test it yourself.

The answer, I would guess, depends on how frequently you insert versus how frequently you query the tree. So it depends on your application's use of the tree data. Even if it's expensive or inconvenient to add entries to the tree, but you only need to do that once per day, maybe it's worth it.

I'm not going to be able to answer that for you.

nayyer said...

Hi Bill, do you know of a generic sql query for finding all the leaves which would work on all databases?

Bill Karwin said...

@nayyer,

I can't speak for all databases, because there are some databases that fail to support standard SQL.

But the following solutions should both find leaf nodes, and they are both standard SQL:

SELECT leaf.ancestor
FROM closure AS leaf
LEFT OUTER JOIN closure AS subleaf
ON leaf.ancestor = subleaf.ancestor
AND subleaf.ancestor <> subleaf.descendant
WHERE subleaf.ancestor IS NULL;

SELECT leaf.ancestor
FROM closure AS leaf
GROUP BY leaf.ancestor
HAVING COUNT(*) = 1;

NZ said...

Hi Bill, thanks for the answer above. Do you know of a better way using standard sql to display the closure tree. Right now I get all the leaves and then build using all the paths from leaves to root. Any other better way?

Bill Karwin said...

NZ,

I haven't found a way to get this kind of output except by using GROUP_CONCAT(). That function is MySQL-specific, but there are ways to simulate it with other brands of database.

There could also be a way to use standard SQL recursive CTE syntax to get this output, but I haven't experimented with that, since MySQL doesn't support it.

You might find some interesting tips here: http://explainextended.com/2010/06/21/group_concat-in-sql-server/

ken said...

Hi Bill,

Did not know where to post a question.

I am using a closure tree to describe the relationship of questions within sections of a dynamically created question air. I am able to build the questions just fine. Now I want to traverse the nodes. When I get the end of a branch, I want to move to the next branch. Since sibling nodes are not necessarily sequential how do I do this.

for example
Q1 leads to Q1.1 which leads to Q1.1.1, now it is time to move to Q2, how do I do this?

I am using php and mysql

Bill Karwin said...

Hi ken, thanks for your question.

This question comes up frequently with discussions of rendering trees. All tree designs have the same challenge, it's not just a Closure Table thing.

The problem is to define a preferred order when rendering a tree, and make a query that can fetch the tree entries in that order.

The solution is to generate a string that contains the "breadcrumbs" of ancestors of a given node, and then sort by that string.

If you want the sort order to be independent of the node id's or their names, then you have to store the sort order in another table. Basically, these would represent ordinal position of siblings in any subtree. Then generate a breadcrumbs string based on those values.

I would like to write another blog to demonstrate this technique.

ken said...

I for one would love to read it :)

Webnet said...

I would also enjoy reading it :)

Unknown said...

I would also be quite interested in reading about the sorting method

Bangali said...

Awesome post..I was searching for such a easy post.

vinoth said...

Hi Bill,

Am going to use it for category management for a shopping card website. Do you think this closure method will be useful for me. Because I may not know the depth of the child nodes in category. For example

Shoes->Fila->black shoes->shoe1

This chain may increase accoding to its need. As of I know this will resolve my requirements. But am asking you for suggestion, will this me helpful for me or do you have some other suggestion.

Bill Karwin said...

Hi Vinoth,

I think Closure Tables are quite useful. But the right solution in any given application depends on the queries you need to run against hierarchical data. So I can't say for certain which is best for your app.

I would recommend you to experiment with a few different solutions that I describe in my presentation "Models for Hierarchical Data" (http://www.slideshare.net/billkarwin/models-for-hierarchical-data) and see which one fits your needs the best.

Pawan Joshi said...

Hi Bill,
I am trying to create a business directory where there will be category levels max depth 5 levels including root.
I can't figure out which solutions is more appropriate in the given situation i.e. adjecent table or closure table.
I want a output like
root
Garments
Gents Garments
Ladies Garments
Bridal Garments
IT
IT Hardware
IT Software
Web Design
Web Solution
I have put up a detailed question on stack-overflow with no answer.

http://stackoverflow.com/questions/19601585/mysql-query-to-indent-child-and-group-all-childs?noredirect=1#comment29095945_19601585

I will appreciate your advice in in arriving at a decision which is more appropriate in the given situation.

Pawan Joshi said...

Hi Bill
I am adding this category layout in my previous post category tree was not correctly reflected the way I want
root
--Garments
----Gents Garments
----Ladies Garments
------Bridal Garments
--IT
----IT Hardware
----IT Software
------Web Design
------Web Solution

vinoth said...
This comment has been removed by the author.
Bill Karwin said...

Hi Pawan Joshi,

Your question is similar to several of those above, which is how to produce a result showing the hierarchy of data elements, but also sort them according to that hierarchy.

I give a solution to this problem in this StackOverflow question:
Sorting a subtree in a closure table hierarchical-data structure

vinoth said...
This comment has been removed by the author.
vinoth said...

Hi Bill,

Thanks for replying my previous comment. I have a issue in fetching and displaying the tree as below using php

Example:
Roottree
-1stbranch
--midbranch
---corebranch
----leafnodes
-lastbranch
--lastleaf

Please help me with this.

Bill Karwin said...

Hi vinoth,

See my answer to MySQL Closure Table hierarchical database - How to pull information out in the correct order for an example of formatting the result like you show.

Unknown said...

Hi Bill,

Wonderful knowledge here. I am still struggling with deleting full subtree. I tried
DELETE FROM TreePaths
WHERE descendant IN (SELECT descendant
FROM TreePaths
WHERE ancestor = 4);
but that does not delete everything to the leaf (inclusive). am I missing something?

Cheers

Bill Karwin said...

Hi Tsvetan, thanks for your comment.

You almost had it.

Say for example you have a rudimentary (and somewhat skinny) subtree, where 4 is the parent of 5 and 5 is the parent of 6.

4 -> 5 -> 6

You want to delete all the paths that start with 4:
(4,4,0), (4,5,1), (4,6,2).

But that would not delete any of the other path that start from a descendant of 4: (5,5,0), (5,6,1), (6,6,0).

So you need to delete any path that starts from any of the descendants of 4. "Starts from" means "ancestor" in this context.

DELETE FROM TreePaths
WHERE ancestor IN (SELECT descendant
FROM TreePaths
WHERE ancestor = 4);

Unknown said...

Hello Bill,

I love the simplicity of the Closure Table approach in comparison to Nested Sets but I would like to know if there is any comparison table available between the performance of Nested Sets to that of Closure Table for different problems (printing whole tree, printing subtrees, moving subtrees, deleting subtrees etc.).

Thanks,

Sven

Bill Karwin said...

Hi Sven, thanks for your comment.

I don't have benchmarks or quantitative comparisons, but I outlined the pros and cons in my presentation Models for Hierarchical Data.

Unknown said...

Hi Bill,

Thanks for your previous comment. That helped. I initially saw your method in a presentation http://www.slideshare.net/billkarwin/models-for-hierarchical-data there you claim that adjacency list can't handle deep trees for example if we want to query the whole subtree of an ancestor. What about this query:

SELECT ancestor, @child := descendant
FROM treeTable
JOIN (
SELECT @child :=4
) tmp
WHERE descendant = @child

This would get all descendants of ancestor #4 in an adjacency list. It seems easy not sure about performance. What would be the draw back compared to closure table method?

Thanks

Bill Karwin said...

Hi Tsvetan,

Why don't you try an experiment with a practice table and see if it works?

It turns out that assigning a new value to the variable in the SELECT-list does not make the WHERE clause re-evaluate the user variable. Effectively the WHERE clause only sees the initial value of 4 for that variable.

Even if MySQL did allow a constantly-changing variable in a WHERE expression, it would not be efficient. Indexes work by searching for a constant value or range of values that are known *before* execution time. But if the values of the variable are not known until execution begins, then the optimizer can't infer that it can use an index.

Another SQL blogger has come up with more complex uses of variables in MySQL statements to simulate recursive queries with adjacency list hierarchies. See http://explainextended.com/2009/07/20/hierarchical-data-in-mysql-parents-and-children-in-one-query/

But I would say the solution shown in that blog is too complex to be practical.

It makes me think of a classic quote from Brian Kernighan: "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you're as clever as you can be when you write it, how will you ever debug it?"

Unknown said...

Thanks Bill. Can't agree more. Love the quote

xafarr said...

Hi Bill,
Great article and slides, very well written and explained. I have not come across any better than your stuff. Thanks for sharing your knowledge.

I am designing an in-house financial accounting system, and for that trying to model Chart of Accounts (COA). Do you think its better to model COA as Closure Tables in mysql database than Adjacency List. I have come across an example of COA being modelled using Adjacency List.

Thanks a lot.

Bill Karwin said...

Hi @xafarr,

As with most data design decisions, the best design depends what you're going to do with the data. Different ways of storing hierarchical data have their own strengths and weaknesses. That is, they each make some types of searches easier or harder.

Adjacency List is really easy to understand, and it's easy to represent immediate child or parent relationships. Also there is no chance of the data consistency failing. Either an entry is referencing its parent or not. There is no possibility that two rows can disagree on that. That is a big advantage.

You can even query whole trees stored with Adjacency List if you use an RDBMS that supports recursive queries.

Unknown said...

Hi Bill!
I see Closure table as right solution for threaded comments in my application, but I can't do right sorting to get threads like

Comment1
- Comment 2. Reply to c1
-- Comment 5. Reply to c2
- Comment 4. Reply to c1
Comment 6

All I can get is:
Comment1
- Comment 2. Reply to c1
- Comment 4. Reply to c1
-- Comment 5. Reply to c2
Comment 6

I have an additional field "depth" in db which represents the level of the comment if it can help.
Thanks a lot if you would help me to find the right way.

Bill Karwin said...

Hello Юрий Петрушкин,

Yes, it's a pretty common question to sort trees so we can view them in depth-first order. I answered this question some time ago on StackOverflow, let me know if this helps you:

http://stackoverflow.com/questions/14069674/sorting-a-subtree-in-a-closure-table-hierarchical-data-structure/14074310#14074310

Unknown said...

Hello @Bill!
Thanks for the reply! Great solution. Looks a bit complicated. And yes, it helps when I know the very one root of the tree, but in my case there can be lots of roots of threaded comments.
I think about adding empty root comment as a parent of others and hide it on page. But I don't know if it is a good solution.

nguyenquy said...

I was read your book, but i don't understand how to render structure closure table to HTML like ul, li tag display parent-sub

koofah said...

How can you parse the data in a JSON format. with the three. There is a vague solution at stack overflow and help is really needed thanks.

http://stackoverflow.com/questions/19348694/how-to-create-array-of-nested-comments-out-of-flat-array-from-db

Seth said...

I'm struggling with finding a way to find the direct parent or direct child of a node. If you use depth in your closure table, it is easy, but if you already have a closure table without depth and you need to find a parent or child, I don't think I'm using the most elegant way. This seems to work, but I'm not sure if there is a better or more elegant way.

SELECT n.label
FROM nodes n
JOIN `closure` anc ON n.node = anc.ancestor
JOIN `closure` d ON anc.ancestor = d.ancestor
WHERE anc.descendant =4
AND anc.descendant <> anc.ancestor
GROUP BY anc.ancestor
ORDER BY COUNT( * ) ASC
LIMIT 1

Also, if you are trying to add a depth column on an existing closure table, you would need something like this to populate the data in the column. Is there a better way of calculating depth for an existing closure table?

Anonymous said...

Hi @bill karwin

please i need your reply on this, because i am a little caught up.

I used the closure table exactly as explained in your article. I have a PRODUCTS table where I store products (productID PK) with categoryID (FK from categories table)

I want to show the count of products in each category on a page for managing products.
Now the issue is that when adding a product i add LEAF ID and not all its ancestors IDs. say if i add "samsung 32"

stock->electronics->tv

then in the products table TV leaf's categoryID is stored. in this case if I want to COUNT products in STOCK or ELECTRONICS categories then it will not count TV products under them.

so please how to query to find out this?

thank you and please reply.

Marc Shakes Blog said...

I'll try to write what I want to archive.

I need to create a "adjancy list" for importing in an ERP-System. I have already stored my data correctly with this model but now I need

myname
1
2


myname_child_of1
1
3


myname_child_of3
3
5