The basic table from the vendor is an adjacency list that looks something
like this. This is pseudo-code, so please no comments about data
element naming:
CREATE TABLE GENERIC_H
(
 FREAKING_LONG_IDENTFIER varchar(100)
,HIERARCHY_TYPE varchar(20)
,ROOT varchar(40)
,NODE varchar(40)
,PARENT_NODE varchar(40)
)
ALTER TABLE GENERIC_H ADD CONSTRAINT GENERIC_H_PK PRIMARY KEY
(
FREAKING_LONG_IDENTFIER
,HIERARCHY_TYPE
,ROOT
,NODE
)
Some interesting notes:
1. Long (weak) primary key: not only does the table hold different
types of hierarchies (second column), it also holds multiple
hierarchies of the same type (thus the ROOT column, which
identifies a particular hierarchy within a given type).
2. The same child (NODE) can belong to more than one parent
(PARENT_NODE), because the table has more than one hierarchy per
type.
I have to traverse this structure all the time to pick up all the
descendants for a given hierarchy type, and a particular hierarchy
(i.e., root node) within that type, starting at any node the user
specifies.
I created another column of the SQL Server hierarchyid datatype,
populated by a CTE, to make the tree easier to traverse instead of
running the CTE every time someone wants all the descendants of a
particular node. Of course, the hierarchyid is not unique, because
it's just the enumerated path within a particular hierarchy, of which
there are more than one in the table.
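That build can be sketched roughly like this (a sketch only: the @type/@root parameters, the PARENT_NODE IS NULL root test, and the two-step CTE that numbers siblings first are my illustration, patterned on the Books Online approach, not the real code):

```sql
-- Sketch: derive an enumerated path per node for one hierarchy,
-- then cast it to hierarchyid.  Assumes root rows have
-- PARENT_NODE IS NULL (an assumption about the vendor data).
WITH Numbered AS
(
    SELECT NODE, PARENT_NODE,
           ROW_NUMBER() OVER (PARTITION BY PARENT_NODE
                              ORDER BY NODE) AS seq
    FROM GENERIC_H
    WHERE HIERARCHY_TYPE = @type AND ROOT = @root
),
Paths AS
(
    SELECT NODE, CAST('/' AS varchar(400)) AS path
    FROM Numbered
    WHERE PARENT_NODE IS NULL
    UNION ALL
    SELECT c.NODE,
           CAST(p.path + CAST(c.seq AS varchar(10)) + '/'
                AS varchar(400))
    FROM Numbered AS c
    JOIN Paths AS p ON c.PARENT_NODE = p.NODE
)
SELECT NODE, CAST(path AS hierarchyid) AS hid
FROM Paths;
```

The final SELECT could just as well feed the UPDATE that populates the hierarchyid column in the write-once table.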
My thought was to remove NODE from the primary key and replace it
with the hierarchyid, clustering so the traversal would go faster.
However, a hierarchyid can be up to 892 bytes long, so I overrun the
900-byte limit for the key. From what I have read, I'll never need
anywhere near 892 bytes given the number of nodes in my hierarchy.
So, I created a column that uses ToString() on the hierarchyid, made
it a reasonable length, and used that column in the primary key.
This thing performs okay (returns in a few seconds), but I think it
should fly, and it doesn't.
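For the record, that string column can be a persisted computed column so it never drifts from the hierarchyid (a sketch; NODE_HID and NODE_PATH are illustrative names, and the 400-character cap is an assumption about the real tree depth):

```sql
-- Sketch: persisted string form of an assumed NODE_HID hierarchyid
-- column, capped so a composite key stays under the 900-byte limit.
ALTER TABLE GENERIC_H
ADD NODE_PATH AS CAST(NODE_HID.ToString() AS varchar(400)) PERSISTED;
```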
Questions:
1. What do you do about the crazy long maximum length of hierarchyid?
2. Any other suggestions (besides #3) to speed this up?
3. When I get a little time in the next week or so, I would like to
rewrite the CTE to create a nested-sets (hat tip, CELKO) version,
since this thing has a write-few, read-many-many usage pattern. I
have a feeling a nested-sets query for descendants would blow away
the IsDescendantOf function performance-wise.
4. What is the advantage of hierarchyid over nested sets for MSFT?
Why didn't they write built-in functions for manipulating a
nested-sets tree instead of doing the enumerated path?
Thanks,
Bill
CREATE TABLE Foo (
fookey INT PRIMARY KEY,
foo_hierarchy HIERARCHYID);
INSERT INTO Foo VALUES(1, HIERARCHYID::GetRoot());
INSERT INTO Foo VALUES(2, '/1/');
INSERT INTO Foo VALUES(3, '/1/2/');
INSERT INTO Foo VALUES(4, '/2/');
SELECT DATALENGTH(foo_hierarchy) AS bin_bytes,
DATALENGTH(foo_hierarchy.ToString()) AS string_bytes
FROM Foo;
/*
bin_bytes string_bytes
----------- ------------
0 2
1 6
2 10
1 6
*/
--
Plamen Ratchev
http://www.SQLStudio.com
If I try to specify a shorter length when creating the table, SQL
Server gives me an error saying I cannot specify a length for the
hierarchyid datatype. At least with the stringed version, I can limit
it to some reasonable length, and then put a key over all the columns
that are required.
Is there some way to limit the size of the hierarchyid so I don't get
the warning when creating the key? All the examples I have seen seem
to assume that a table holds only one hierarchy of one type, thus
making the hierarchyid unique, so you can just put a primary key on
that column alone. In my case, I can't do that; there are multiple
hierarchies of multiple types in the same table (i.e., multiple roots,
multiple /1/1 children, etc.).
Thanks,
Bill
--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se
Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx
I am not sure what warning you get, but I can create nonclustered index
on the example I gave you without problems:
CREATE NONCLUSTERED INDEX ix_Hierarchy
ON Foo(foo_hierarchy);
The warning message is:
Warning! The maximum key length is 900 bytes. The index
'TESTHIERARCHY_UK' has maximum length of 935 bytes. For some
combination of large values, the insert/update operation will fail.
To get this message you have to make your index or constraint a
composite of the hierarchyid and at least one other column, where the
other column(s) have a total length of more than 8 bytes. That way,
the maximum key size is the 892 bytes of the hierarchyid plus the
9 or more bytes of the other columns, which exceeds 900.
I agree that I'll never need anything near the 892 total bytes. I
would rather be able (but know I cannot) to limit the size of the
hierarchyid column and then watch for an error on INSERT (that will
never occur), instead of getting the warning message from my code and
having to remember it's normal.
Question: Why did MSFT work on an enumerated-path solution instead
of functions that let you manipulate nested sets easily? The one
reason I can think of (and it's a big one) is that with the
enumerated-path approach, you can insert nodes without having to
recompute the left and right values for the rest of the tree. I
think with the enumerated path, the engine still has to recompute the
paths if you move a subtree, but the total amount of recomputing is
probably less.
Do you guys agree with this, or am I missing something?
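On the move-a-subtree point: with hierarchyid the recompute is confined to the subtree itself, via GetReparentedValue (a sketch; Tree, hid, and the parameters are illustrative names, not the real schema):

```sql
-- Sketch: move the subtree rooted at @old_pos to a new position
-- @new_pos (a fresh child slot under the new parent, obtained
-- e.g. via GetDescendant).  Only rows in the moved subtree are
-- touched; every other node keeps its path, unlike nested sets,
-- where lft/rgt values shift across much of the tree.
UPDATE Tree
SET hid = hid.GetReparentedValue(@old_pos, @new_pos)
WHERE hid.IsDescendantOf(@old_pos) = 1;
```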
Materialized path makes more sense to implement as an optimized CLR
data type with methods, since CLR can shine at parsing and
manipulating the path. Plus, nested sets have issues that can be more
difficult to resolve (not all hierarchies are static, which is where
the strength of nested sets is):
http://www.dbazine.com/oracle/or-articles/tropashko4
Nested sets should be much faster, since you get entire subtrees in
one query. The bad news is that a recursive CTE is a cursor in
disguise. Doing the traversal over and over with each query might be
expensive. Put it in a temp table?
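For comparison, the nested-sets descendants query is a single range predicate (a sketch; the Tree table and node/lft/rgt columns are illustrative names):

```sql
-- Sketch: all descendants of @start_node in a nested-sets tree.
-- A child's (lft, rgt) interval nests inside its ancestors',
-- so one BETWEEN does all the work; no recursion needed.
SELECT c.node
FROM Tree AS p
JOIN Tree AS c
  ON c.lft BETWEEN p.lft AND p.rgt
WHERE p.node = @start_node;
```

As written this returns @start_node itself as well; add `AND c.lft > p.lft` to get strict descendants only.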
>> 4. What is the advantage of hierarchyID over the nested sets for MSFT? Why didn't they write built in functions for manipulating a nested sets tree instead of doing the enumerated path? <<
I don't know of any advantages for the users. Nested sets work well
with a clustered index, which is current technology. Get a copy of
TREES & HIERARCHIES for the simple stored procedures you can write in
SQL/PSM or any other 4GL that your SQL might have.
But for Microsoft, locking people into proprietary code is a big
advantage :) Ever wonder about the CONNECT BY in Oracle?
Insertion is not that bad with nested sets. If you do frequent
insertions, leave extra free space on the data pages. Since each row
is two INTEGERs and a FK to the nodes table, the rows are short and a
lot of them fit onto a data page. It is usually as fast to insert an
entire subtree as one node (which is just a special case of a
subtree).
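The usual insertion sketch: open a gap of 2 at the parent's right boundary, then drop the new node in (illustrative names again; Tree, @parent, and @new_node are not from the original schema):

```sql
-- Sketch: insert @new_node as the rightmost child of @parent
-- in a nested-sets tree (Tree with node/lft/rgt columns).
BEGIN TRANSACTION;

DECLARE @r int;
SELECT @r = rgt FROM Tree WHERE node = @parent;

-- Open a gap of 2 at position @r: the parent and its ancestors
-- widen, and everything entirely to the right shifts over.
UPDATE Tree SET rgt = rgt + 2 WHERE rgt >= @r;
UPDATE Tree SET lft = lft + 2 WHERE lft > @r;

INSERT INTO Tree (node, lft, rgt) VALUES (@new_node, @r, @r + 1);

COMMIT;
```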
I probably didn't explain my reference to the CTE very well. I just
use the CTE to turn the adjacency list into a write-once-in-a-while,
read-a-lot table that has a hierarchyid. I want to try changing that
to make the CTE spit out table(s) with a nested-sets model. So the
CTE would only run once in a while (the hierarchies don't change very
often), but then I could query on the nested sets.
After tweaking, I am getting pretty darn good performance with the
hierarchyid table, but I've just got to think that the nested sets
will be blindingly fast to query. I figure the IsDescendantOf
function is either parsing the enumerated path (at best) or (at
worst) traversing the tree. In either case, it seems the simple
comparison operators of a nested-sets query would be a lot less work
for the engine.
I do very few insertions (in fact, right now, I can build the
write-once table in under 15 seconds), but it's good to know that
nested-sets insertions are fast. What about computing all the
{Left, Right} values? I suppose your book goes into that . . .
Speaking of nested sets (and overlapping sets), does your book talk
about an easy way to identify date ranges that improperly overlap? I
see this quite frequently, because people do crazy stuff like putting
both START_DT and END_DT in the primary key when just one of them
should be in the key. But if the table is designed that way, and it's
a given, it would at least be good to find the overlaps. I usually
just write a query, but it would be nice to put it into a
parameterized stored procedure; I always have to think a little about
it when I write the query.
Thanks,
Bill
>> The bad news is that a recursive CTE is a cursor in disguise. <<
I do that in another book, THINKING IN SETS. The quickest way is to
set up a Calendar table, then see if there is a calendar day that
falls inside two date ranges.
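Without a Calendar table, a plain self-join also finds the offending pairs (a sketch; Ranges, pk, START_DT, and END_DT are illustrative names, and it assumes END_DT >= START_DT on every row):

```sql
-- Sketch: report each pair of rows whose date ranges overlap.
-- Two closed ranges overlap iff each one starts no later than
-- the other one ends.
SELECT a.pk AS row1, b.pk AS row2
FROM Ranges AS a
JOIN Ranges AS b
  ON a.pk < b.pk                -- report each pair only once
 AND a.START_DT <= b.END_DT
 AND b.START_DT <= a.END_DT;
```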
If SQL Server were up to Standards, this would be easy, but you can
still use a TRIGGER or a VIEW that has a WITH CHECK OPTION.