字符数组检索[TL][算法]

lianghu xu

unread,

Feb 9, 2012, 1:28:44 AM2/9/12

to pon...@googlegroups.com

假设有一个简单的数据库使用C数组表示(以下只是概念表示）

foostruct table[] =

{

{"aaa", ptr_0},

{"bbbb", ptr_1},

.....................

{"xxxxxx", ptr_xxx},

}

其中每个元素都是一个foostruct结构类型，且其第一个成员是唯一的。

请问，检索这个数据库的最快算法？ (比如，通过“bbbb"检索到其在表中的序号为1，从而得到其对应的操作指针 ptr_1).

Li Ferdinand

unread,

Feb 9, 2012, 1:31:09 AM2/9/12

to pon...@googlegroups.com

hash

WindyWinter

unread,

Feb 9, 2012, 1:31:12 AM2/9/12

to pon...@googlegroups.com

Trie

Soli Deo gloria.

WindyWinter
Email: wi...@ream.at
梦.:如此短暂: http://d.ream.at

2012/2/9 lianghu xu <liang...@gmail.com>

Xinyu LIU

unread,

Feb 9, 2012, 3:20:25 AM2/9/12

to pon...@googlegroups.com

Hi,

我前些年写的：
https://sites.google.com/site/algoxy/trie

但是主要还是取决于你的数据特点，如果你的所谓数据库并不大，Hash是个很不错的选择，最好能达到constant时间的效率(O(1)).
当然Hash函数的构造取决于你的字符串的特点。

如果你的数据库很大，Hashing就不成了，如果你的字符集小，字符串长度不大，那么Trie可以是个不错的选择。
压缩成Patricia能改善不少，速度是O(m)，其中m是你的字符串的长度。

如果你的数据库特别大，Patricia也承受不了，你可以用B-tree，速度是O(lg N)，其中N是你前面项目的个数。
一个简化的情况是用二叉树或者红黑树，速度也是O(lg N)的，适合中等规模数据的情况。

2012/2/9 WindyWinter <wi...@ream.at>

--
Larry, LIU Xinyu
https://sites.google.com/site/algoxy/
https://github.com/liuxinyu95/AlgoXY

Li Ferdinand

unread,

Feb 9, 2012, 3:28:26 AM2/9/12

to pon...@googlegroups.com

Trie，我感觉字符少，发挥不出来比hash的优势，字符量大，又太占内存。。。Trie像是个学术玩具。。。。

rockeet febird

unread,

Feb 9, 2012, 5:04:51 AM2/9/12

to TopLanguage

hash_strmap , 不同于一般的 hash map，这个 map 内部完全是用数组实现的，hash 冲突的链接也是用数组下标来链，而且
使用了 strpool。速度比标准 hash_map<string, val> 快 10 倍。

trie 和 Patricia 之类的 lookup table 相比 hash 的优势在于（前缀）匹配长字符串（比如DNA），匹配到的时候，
不需要再调用 equal 确认。

rockeet febird

unread,

Feb 9, 2012, 5:05:18 AM2/9/12

to TopLanguage

http://code.google.com/p/febird/source/browse/trunk/febird/src/febird/hash_strmap.hpp

emac...@gmail.com

unread,

Feb 9, 2012, 10:07:06 AM2/9/12

to pon...@googlegroups.com

On Thu, Feb 09, 2012 at 02:31:09PM +0800, Li Ferdinand wrote:
> hash

优雅性来说 trie 系列无疑比 hash 好，但效率我不知道是怎么样。
假设有很长的串。因为树不是序列化储存，访问不同节点间造成的CPU跳转的开销不知如何度量。
还有就是 trie
的空间消耗（把一个字节拆成2段减小空间消耗不知性能会怎么样）。

lianghu xu

unread,

Feb 9, 2012, 8:42:17 PM2/9/12

to pon...@googlegroups.com

如果这个数组的大小在1000以内，字符串长度在10以内的话呢？直接比较是不是也不是那么耗时间？

2012/2/9 <emac...@gmail.com>

Li Ferdinand

unread,

Feb 9, 2012, 9:15:00 PM2/9/12

to pon...@googlegroups.com

这么小的数据量,不需要考虑什么高效算法了...

Xpol Wan

unread,

Feb 9, 2012, 9:17:19 PM2/9/12

to pon...@googlegroups.com

你是要做函数名到函数的映射？

看下lua的Table实现，或者Python的字典的实现呢。

Best Regards!

Xpol Wan
_G['China']['Human']['Male']['Software']['Programmer']['Embedded']['Gaming']['C/C++/Lua']

2012/2/10 lianghu xu <liang...@gmail.com>

milo

unread,

Feb 11, 2012, 4:17:36 AM2/11/12

to TopLanguage

我的看法：
国外学术界比较流行hash算法，比如cuckoohash算法，不少库中都用了。这个方法空间大，查询快，索引创建快
然后就是树，b+比较经典。这个占用空间小，支持的数据量大，但是建库慢。

On Feb 9, 2:28 pm, lianghu xu <lianghu...@gmail.com> wrote:

shicker

unread,

Feb 19, 2012, 9:06:57 PM2/19/12

to pon...@googlegroups.com

128位的key 1000w数据查找基本上不耗时间，这个要到什么程度才能提现出性能？

Zhangming Niu

unread,

Feb 20, 2012, 12:47:45 PM2/20/12

to pon...@googlegroups.com

index + pre sort

search("alphabet")

indexs:a,b,c,d,e,f,......

X(index的多少) * log(1000w/x)

.或者加上多层的索引吧。

google数据更多，但是搜什么都是很快。

2012/2/20 shicker <wbs...@gmail.com>

Reply all

Reply to author

Forward