Q:关于permuterm的B树实现

25 views
Skip to first unread message

Han Jiang

unread,
Mar 20, 2013, 10:38:46 AM3/20/13
to cs41...@googlegroups.com
这些天试了下用B树处理permuterm,有这么一个问题,因为permuterm中每个元素都是字符串,这样,对应B树的每个结点key大小都是变长的了。请问,这种变长的结点结构要如何在磁盘上进行存储呢?

我目前只有一个支持整形key值的B树结构,对于每个结点,当它不在内存中时,直接利用结点大小与其id算出在磁盘上的偏移,换进内存里。我觉得,如果限定结点中的字符串为定值,这恐怕会浪费很多的空间;而如果把字符串id化的话,permuterm的id化字典也不能假设放得进内存吧。请问有没有更好的设计呢?

谢谢!

Hongfei Yan

unread,
Mar 21, 2013, 1:32:24 AM3/21/13
to cs41...@googlegroups.com
因为permuterm中每个元素都是字符串
为什么不能每个元素是一个字母或者$?


2013/3/20 Han Jiang <jiang...@gmail.com>
这些天试了下用B树处理permuterm,有这么一个问题,因为permuterm中每个元素都是字符串,这样,对应B树的每个结点key大小都是变长的了。请问,这种变长的结点结构要如何在磁盘上进行存储呢?

我目前只有一个支持整形key值的B树结构,对于每个结点,当它不在内存中时,直接利用结点大小与其id算出在磁盘上的偏移,换进内存里。我觉得,如果限定结点中的字符串为定值,这恐怕会浪费很多的空间;而如果把字符串id化的话,permuterm的id化字典也不能假设放得进内存吧。请问有没有更好的设计呢?

谢谢!

--
You received this message because you are subscribed to the Google Groups "cs410pku" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cs410pku+u...@googlegroups.com.
To post to this group, send email to cs41...@googlegroups.com.
Visit this group at http://groups.google.com/group/cs410pku?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Han Jiang

unread,
Mar 21, 2013, 6:36:58 AM3/21/13
to cs41...@googlegroups.com
唔,老师我不是很理解,如果每个元素只是取字母的话,那么这棵树的高度就是最长字母的长度对吗?这是构造成前缀树的形式吗?这要如何用B树来实现呢?在B树中,每当结点过满,会出现同一层结点跑到上一层的情况,这样要如何区别哪一层是针对一个词的哪一个字母呢?

2013/3/21 Hongfei Yan <yhf...@gmail.com>



--
Han Jiang

Team of Search Engine and Web Mining,
School of Electronic Engineering  and Computer Science
,
Peking University, China

Han Jiang

unread,
Mar 21, 2013, 6:40:14 AM3/21/13
to cs41...@googlegroups.com
啊,我好像明白了,当结点中总的词长度过满时,进行分裂。这样,虽然B树上每个结点的最小/最大词数会不一样,但确实能保证结点大小的平衡对吧。

2013/3/21 Han Jiang <jiang...@gmail.com>

Hongfei Yan

unread,
Mar 21, 2013, 8:39:45 PM3/21/13
to cs41...@googlegroups.com

http://sewm.pku.edu.cn/cs_book/%5bcs_book%5d.McGraw.Hill.Introduction.To.Algorithms,.Third.Edition.pdf
see chapter 18 B-Trees@page 484

a B-tree with a branching factor of 1001 and height 2 that can store over one billion keys;nevertheless, since
we can keep the root node permanently in main memory, we can find any key in this tree by making at most only two disk accesses.
....
A B-tree T is a rooted tree (whose root is T:root) having the following properties:
1. Every node x has the following attributes:
a. x:n, the number of keys currently stored in node x,
b. the x:n keys themselves, x:key1; x:key2; : : : ;x:keyx: n, stored in nondecreasing
order, so that x:key1 x:key2 x:keyx: n,
c. x:leaf , a boolean value that is TRUE if x is a leaf and FALSE if x is an internal
node.
2. Each internal node x also contains x:nC1 pointers x:c1; x:c2; : : : ;x:cx: nC1 to
its children. Leaf nodes have no children, and so their ci attributes are undefined.
....

2013/3/21 Han Jiang <jiang...@gmail.com>

Han Jiang

unread,
Mar 21, 2013, 10:07:39 PM3/21/13
to cs41...@googlegroups.com
嗯,好的,谢谢老师!

Li Yechen

unread,
Mar 23, 2013, 8:54:58 PM3/23/13
to cs41...@googlegroups.com
2层树霸气
直接退化为静态文件指针索引
Reply all
Reply to author
Forward
0 new messages