
Hierarchyid is long, how to index?


bill

May 6, 2009, 5:02:55 PM
Got a table that holds multiple hierarchies in the same structure.
Suggestions for redesign will not be of interest, because I am working
with a vendor package.


The basic table from vendor is an adjacency list that looks something
like this. This is pseudo code, so please no comments about data
element naming:

CREATE TABLE GENERIC_H
(
 FREAKING_LONG_IDENTFIER varchar(100)
,HIERARCHY_TYPE varchar(20)
,ROOT varchar(40)
,NODE varchar(40)
,PARENT_NODE varchar(40)
)

ALTER TABLE GENERIC_H ADD CONSTRAINT GENERIC_H_PK PRIMARY KEY
(
FREAKING_LONG_IDENTFIER
,HIERARCHY_TYPE
,ROOT
,NODE
)

Some interesting notes:
1. Long (weak) primary key because not only does the table hold
different types of hierarchies (second column), it also holds multiple
hierarchies of the same type (thus the ROOT column, which identifies a
particular hierarchy within a given type).
2. The same child (NODE) can belong to more than one parent
(PARENT_NODE), because the table has more than one hierarchy per
type.

I have to traverse this structure all the time to pick up all the
descendants for a given hierarchy type and a particular hierarchy
(i.e. root node) within that type, starting at any node the user
specifies.

I created another column of the SQL Server hierarchyid datatype,
populated by a CTE, to make the tree easier to traverse without
running the CTE every time someone wants all the descendants of a
particular node.
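
Roughly the CTE I mean, following the Books Online pattern for
populating a hierarchyid from an adjacency list (the #Children staging
table and the filter values are made up, and it does one hierarchy at
a time):

SELECT NODE, PARENT_NODE,
       ROW_NUMBER() OVER (PARTITION BY PARENT_NODE ORDER BY NODE) AS Num
INTO   #Children
FROM   GENERIC_H
WHERE  HIERARCHY_TYPE = 'SOME_TYPE'  -- hypothetical filter values
  AND  ROOT = 'SOME_ROOT';

WITH paths (NODE, node_hid) AS
(
    -- anchor: assumes the root rows carry a NULL parent
    SELECT NODE, hierarchyid::GetRoot()
    FROM   #Children
    WHERE  PARENT_NODE IS NULL
    UNION ALL
    -- append each child's sibling number to its parent's path
    SELECT C.NODE,
           CAST(P.node_hid.ToString()
                + CAST(C.Num AS varchar(30)) + '/' AS hierarchyid)
    FROM   #Children AS C
    JOIN   paths AS P ON P.NODE = C.PARENT_NODE
)
SELECT NODE, node_hid
FROM   paths;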

Of course, the hierarchyid is not unique, because it's just the
enumerated path within a particular hierarchy, of which there is more
than one in the table.

My thought was to remove NODE from the primary key and replace it
with the hierarchyid, clustering so the traversal would go faster.
However, the hierarchyid can be up to 892 bytes long, so I overrun the
900-byte limit for the key. From what I have read, I'll never need
anywhere near 892 bytes given the number of nodes in my hierarchy.

So, I created a column that uses ToString() on the hierarchyid, made
it a reasonable length, and used that column in the primary key.
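
One way to do that is a persisted computed column; the names here are
made up, and a plain column filled in by the CTE works just as well:

ALTER TABLE GENERIC_H
  ADD NODE_PATH_STR AS
      CONVERT(varchar(100), NODE_PATH.ToString()) PERSISTED;
-- NODE_PATH is the hierarchyid column; note the explicit CONVERT
-- silently truncates if a path ever outgrows varchar(100)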

This thing performs okay (returns in a few seconds), but I think it
should fly, and it isn't.

Questions:
1. What do you do about the crazy long length of hierarchyid?
2. Any other suggestions (besides #3) to speed things up?
3. When I get a little time in the next week or so, I would like to
rewrite the CTE to create a nested-sets (hat tip, CELKO) version,
since this thing has a write-few, read-many-many usage pattern. I have
a feeling a nested-sets query for descendants would blow away the
IsDescendantOf function performance-wise.
4. What is the advantage of hierarchyid over nested sets for MSFT?
Why didn't they write built-in functions for manipulating a
nested-sets tree instead of doing the enumerated path?

Thanks,

Bill

Plamen Ratchev

May 6, 2009, 5:45:37 PM
Why do you use the ToString version of HIERARCHYID to index? The
original binary version is a lot more compact and efficient.

CREATE TABLE Foo (
fookey INT PRIMARY KEY,
foo_hierarchy HIERARCHYID);

INSERT INTO Foo VALUES(1, HIERARCHYID::GetRoot());
INSERT INTO Foo VALUES(2, '/1/');
INSERT INTO Foo VALUES(3, '/1/2/');
INSERT INTO Foo VALUES(4, '/2/');

SELECT DATALENGTH(foo_hierarchy) AS bin_bytes,
DATALENGTH(foo_hierarchy.ToString()) AS string_bytes
FROM Foo;

/*

bin_bytes string_bytes
----------- ------------
0 2
1 6
2 10
1 6

*/


--
Plamen Ratchev
http://www.SQLStudio.com

bill

May 6, 2009, 6:04:38 PM
The hierarchyid datatype is new to me, so I may well be missing
something. I know I'll never need the 892 bytes of the id, but it
seems to be counted against the 900-byte limit when creating a key.
Granted, it's just a warning, not an error, but I hate having code
where warnings are "normal".

If I try to specify a shorter length when creating the table, SQL
Server gives me an error saying I cannot specify a length for the
hierarchyid datatype. At least with the stringed version, I can limit
it to some reasonable length, and then put a key over all the columns
that are required.

Is there some way to limit the size of the hierarchyid so I don't get
the warning when creating the key? All the examples I have seen seem
to assume that a table holds only one hierarchy of one type, making
the hierarchyid unique, so you can just put a primary key on that
column alone. In my case, I can't do that; there are multiple
hierarchies of multiple types in the same table (i.e. multiple roots,
multiple /1/1 children, etc.).


Thanks,

Bill

Erland Sommarskog

May 6, 2009, 5:52:28 PM
bill (billma...@gmail.com) writes:
> The hierarchyid datatype is new to me, so I may well be missing
> something. I know I'll never need the 892 bytes of the id, but it
> seems to be counted against the 900-byte limit when creating a key.
> Granted, it's just a warning, not an error, but I hate having code
> where warnings are "normal".

Sometimes you have to live with it. This was quite common in SQL 2000
if you used sql_variant in a table, because an sql_variant column can
hold 8000 bytes, so with a few more columns in the table, the possible
max size of the columns exceeded what would fit on a page. (This is no
longer a problem, since SQL 2005 and later are able to store a row
across multiple pages under some circumstances.)


--
Erland Sommarskog, SQL Server MVP, esq...@sommarskog.se

Links for SQL Server Books Online:
SQL 2008: http://msdn.microsoft.com/en-us/sqlserver/cc514207.aspx
SQL 2005: http://msdn.microsoft.com/en-us/sqlserver/bb895970.aspx
SQL 2000: http://www.microsoft.com/sql/prodinfo/previousversions/books.mspx

Plamen Ratchev

May 6, 2009, 6:16:05 PM
HierarchyId is a system data type with variable length, and you
cannot set a size. It will expand as needed up to 892 bytes. But you
would need a huge hierarchy with many levels to reach that limit.

I am not sure what warning you get, but I can create a nonclustered
index on the example I gave you without problems:

CREATE NONCLUSTERED INDEX ix_Hierarchy
ON Foo(foo_hierarchy);

bill

May 6, 2009, 6:38:20 PM
I guess I'll just live with the warning (yuck), though I think I found
some other things to index that work very fast and avoid the
problem . . .

Warning message is:
Warning! The maximum key length is 900 bytes. The index
'TESTHIERARCHY_UK' has maximum length of 935 bytes. For some
combination of large values, the insert/update operation will fail.

To get this message you have to make your index or constraint a
composite of the hierarchyid and at least one other column, where the
other column(s) have a total length of more than 8 bytes. That way,
the max key size is the 892 bytes of the hierarchyid plus the 9 or
more bytes of the other columns.
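
A hypothetical repro, for what it's worth; the possible max key here
is 892 bytes for the hierarchyid plus 60 bytes for the varchars:

CREATE TABLE TEST_HIERARCHY
(
 HIERARCHY_TYPE varchar(20) NOT NULL
,ROOT varchar(40) NOT NULL
,NODE_PATH hierarchyid NOT NULL
);

ALTER TABLE TEST_HIERARCHY ADD CONSTRAINT TESTHIERARCHY_UK
UNIQUE (HIERARCHY_TYPE, ROOT, NODE_PATH);
-- Warning! The maximum key length is 900 bytes. The index
-- 'TESTHIERARCHY_UK' has maximum length of 952 bytes. ...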

I agree that I'll never need anything near the 892 total bytes. I
would rather be able (but know I cannot) to limit the size of the
hierarchyid column, and then watch for an error on INSERT (one that
will never occur), instead of getting the warning message from my
code and writing it off as normal.

Question: Why didn't MSFT work on functions that let you manipulate
nested sets easily, instead of an enumerated path solution? The one
reason I can think of (and it's a big one) is that with the enumerated
path approach, you can insert nodes without having to recompute the
left and right values for the rest of the tree. I think with the
enumerated path, the engine still has to recompute the path if you
move a subtree, but the total amount of recomputing is probably less.

Do you guys agree with this, or am I missing something?

Plamen Ratchev

May 6, 2009, 7:02:42 PM
bill wrote:
> Question: Why didn't MSFT work on functions that let you manipulate
> nested sets easily, instead of an enumerated path solution? The one
> reason I can think of (and it's a big one) is that with the
> enumerated path approach, you can insert nodes without having to
> recompute the left and right values for the rest of the tree. I
> think with the enumerated path, the engine still has to recompute
> the path if you move a subtree, but the total amount of recomputing
> is probably less.
>

Materialized path makes more sense to implement as an optimized CLR
data type with methods, since CLR shines at parsing and manipulating
the path. Plus, nested sets have issues that can be more difficult to
resolve (not all hierarchies are static, which is where the strength
of nested sets lies):
http://www.dbazine.com/oracle/or-articles/tropashko4

--CELKO--

May 6, 2009, 7:05:04 PM
>> 3. When I get a little time in the next week or so, I would like to rewrite the CTE to create a nested-sets (hat tip, CELKO) version, since this thing has a write-few, read-many-many usage pattern. I have a feeling a nested-sets query for descendants would blow away the IsDescendantOf function performance-wise. <<

Nested sets should be much faster, since you get entire subtrees in
one query. The bad news is that a recursive CTE is a cursor in
disguise. Doing the traversal over and over with each query might be
expensive. Put it in a temp table?
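
The descendants query is a simple self-join on the (lft, rgt) pairs;
a sketch, assuming a Tree table with lft and rgt columns:

SELECT T2.node
  FROM Tree AS T1, Tree AS T2
 WHERE T1.node = @start_node
   AND T2.lft BETWEEN T1.lft AND T1.rgt; -- whole subtree, no recursion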

>> 4. What is the advantage of hierarchyid over nested sets for MSFT? Why didn't they write built-in functions for manipulating a nested-sets tree instead of doing the enumerated path? <<

I don't know of any advantages for the users. Nested sets work well
with a clustered index, which is current technology. Get a copy of
TREES & HIERARCHIES for the simple stored procedures you can write in
SQL/PSM or any other 4GL that your SQL might have.

But for Microsoft, locking people into proprietary code is a big
advantage :) Ever wonder about the CONNECT BY in Oracle?

--CELKO--

May 6, 2009, 7:14:18 PM
>> The one reason I can think of (and it's a big one) is that with the enumerated path approach, you can insert nodes without having to recompute the left and right values for the rest of the tree. I think with the enumerated path, the engine still has to recompute the path if you move a subtree, but the total amount of recomputing is probably less. <<

Insertion is not that bad with nested sets. If you do frequent
insertions, leave extra free space on the data pages. Since each row
is two INTEGERs and a FK to the nodes table, the rows are short and a
lot of them fit onto a data page. It is usually as fast to insert an
entire subtree as one node (which is just a special case of a
subtree).
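
In SQL Server terms, the extra free space means a fill factor on the
index; a sketch with made-up names:

CREATE UNIQUE CLUSTERED INDEX ix_tree_lft
    ON Tree (lft)
  WITH (FILLFACTOR = 70); -- keep ~30% of each page free for inserts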

bill

May 6, 2009, 8:13:05 PM
I'm gonna get a copy of your book.

I probably didn't explain my reference to the CTE very well. I just
use the CTE to turn the adjacency list into a write-once-in-a-while,
read-a-lot table that has a hierarchyid. I want to try changing that
to make the CTE spit out table(s) with a nested-set model. So the CTE
would only run once in a while (the hierarchies don't change very
often), but then I could query on the nested sets.

After tweaking, I am getting pretty darn good performance with the
hierarchyid table, but I've just got to think that nested sets will
be blindingly fast to query. I figure the IsDescendantOf function is
either parsing the enumerated path (at best) or traversing the tree
(at worst). In either case, it seems the simple math operators of the
nested set would be a lot less work for the engine.
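
The query I'm timing is basically this (table, column, and variable
names made up):

SELECT NODE
FROM   GENERIC_H_HID
WHERE  NODE_PATH.IsDescendantOf(@start_node_hid) = 1;
-- note IsDescendantOf counts a node as a descendant of itself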

I do very few insertions (in fact, right now, I can build the
write-once table in under 15 seconds), but it's good to know that
nested-sets insertions are fast. What about computing all the {Left,
Right} values? I suppose your book goes into that . . .

Speaking of nested sets (and overlapping sets), does your book talk
about an easy way to identify date ranges that improperly overlap? I
see this quite frequently, because people do crazy stuff like putting
both the START_DT and END_DT in the primary key, when just one of them
should be in the key. But if the table is designed that way, and it's
a given, it would at least be good to find the overlaps. I usually
just write a query, but it would be nice to put it into a
parameterized stored procedure. I always have to think a little about
it when I write the query.
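
For the record, the query I keep rewriting is the pairwise self-join,
something like this (table made up; assumes an ID column and
consistent ranges):

SELECT A.ID, B.ID
FROM   PERIODS AS A
JOIN   PERIODS AS B
  ON   A.ID < B.ID             -- report each pair once
 AND   A.START_DT < B.END_DT   -- A starts before B ends
 AND   B.START_DT < A.END_DT;  -- and B starts before A ends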

Thanks,

Bill


--CELKO--

May 7, 2009, 6:07:52 PM
>> Speaking of nested sets (and overlapping sets), does your book talk about an easy way to identify date ranges that improperly overlap? <<

I do that in another book, THINKING IN SETS. The quickest way is to
set up a Calendar table, then see if there is a calendar day that
falls inside two date ranges.
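
A sketch of that, with made-up names for the Calendar and Periods
tables:

SELECT C.cal_date
  FROM Calendar AS C, Periods AS P
 WHERE C.cal_date BETWEEN P.start_dt AND P.end_dt
 GROUP BY C.cal_date
HAVING COUNT(*) > 1; -- a day that falls inside two or more ranges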

If SQL Server were up to Standards, this would be easy, but you can
still use a TRIGGER or a VIEW that has a WITH CHECK OPTION.

Erland Sommarskog

May 7, 2009, 5:49:39 PM
bill (billma...@gmail.com) writes:
> Speaking of nested sets (and overlapping sets), does your book talk
> about an easy way to identify date ranges that improperly overlap? I
> see this quite frequently, because people do crazy stuff like putting
> both the START_DT and END_DT in the primary key, when just one of
> them should be in the key. But if the table is designed that way,
> and it's a given, it would at least be good to find the overlaps. I
> usually just write a query, but it would be nice to put it into a
> parameterized stored procedure. I always have to think a little
> about it when I write the query.

SQL Server MVP Alex Kuznetsov has presented some ideas in a blog post:
http://sqlblog.com/blogs/alexander_kuznetsov/archive/2009/03/08/storing-intervals-of-time-with-no-overlaps.aspx
His idea does not fit all cases, but it is interesting reading.