从一组 IP range --> value 的数据中查找单个 IP，允许使用 C++ STL

Shuo Chen

unread,

Apr 2, 2011, 11:27:28 PM4/2/11

to TopLanguage

在这里看到一道面试题：
http://blog.csdn.net/whinah/archive/2011/04/02/6299288.aspx

从一组 IP range --> value 的数据中查找单个 IP，允许使用 C++ STL。

我的解法：

每个 range 的表示是三元组：(start ip, end ip, value)

struct IPrange // 闭区间 [startIp, endIp]
{
uint32_t startIp;
uint32_t endIp; // startIp <= endIp
int value; // 或者是 Value* pValue;
};

写一个函数，输入是一个 ip，类型为 uint32_t，如果某个 range 包含它，那么输出该 IP range 的 value，否则返回
-1。
假设合法的 value 不含 -1。如果 value 不是 int 而是 Value*，那么返回 NULL 表示失败。
假设 IP range 不重复，那么一个 ip 至多对应一个 value。

数据结构：vector<IPRange> ranges;
定义排序关系：

bool operator<(const IPrange& lhs, const IPrange& rhs)
{
return lhs.startIp < rhs.startIp;
}

整个 vector 是排好序且没有重复元素：

sort(ranges.begin(), ranges.end());

assert(adjacent_find(ranges.begin(), ranges.end()) == ranges.end());

其中 operator==() 定义为：

bool operator==(const IPrange& lhs, const IPrange& rhs)
{
return lhs.startIp == rhs.startIp;
}

然后，find_ip_value 函数可以用 lower_bound 实现：

int findIpValue(const vector<IPrange>& ranges, uint32_t ip)
{
int result = -1;

if (!ranges.empty())
{
IPrange needle = { ip, 0, 0 };
vector<IPrange>::const_iterator it = lower_bound(ranges.begin(),
ranges.end(), needle);
if (it == ranges.end())
{
--it;
}
else if (it != ranges.begin() && it->startIp > ip)
{
--it;
}

if (it->startIp <= ip && it->endIp >= ip)
{
result = it->value;
}
}

return result;
}

算法的复杂度为 O(log N), N = ranges.size()。

Shuo Chen

unread,

Apr 2, 2011, 11:29:48 PM4/2/11

to TopLanguage

完整的代码：
https://gist.github.com/900153

lzprgmr

unread,

Apr 3, 2011, 12:05:28 AM4/3/11

to pon...@googlegroups.com

看来出题者自己也给出了算法，也是利用了lower_bound/upper_bound：http://blog.csdn.net/whinah/archive/2010/08/05/5791255.aspx

Blog: http://www.cnblogs.com/baiyanhuang/
Douban:http://www.douban.com/people/baiyanhuang/

2011/4/3 Shuo Chen <gian...@gmail.com>

lzprgmr

unread,

Apr 3, 2011, 12:16:44 AM4/3/11

to pon...@googlegroups.com

请教个问题，你一般会在代码中用类似于adjacent_find这样O(n)的方法来assert一个序列是否有重复，或者是否有序吗？

Blog: http://www.cnblogs.com/baiyanhuang/
Douban:http://www.douban.com/people/baiyanhuang/

2011/4/3 lzprgmr <baiya...@gmail.com>

Shuo Chen

unread,

Apr 3, 2011, 12:20:04 AM4/3/11

to TopLanguage

他的解法是错的，举一个反例即可：

考虑 range 是 [255.255.0.0, 255.255.255.255] => 香港。（这是个闭区间）
输入 255.255.255.255，那么输出应该是"香港"，但是他的代码判断 iter != ipmap.end()，认为 value 不存
在。

On Apr 3, 12:05 pm, lzprgmr <baiyanhu...@gmail.com> wrote:
> 看来出题者自己也给出了算法，也是利用了lower_bound/upper_bound：http://blog.csdn.net/whinah/archive/2010/08/05/5791255.aspx
>
> Blog:http://www.cnblogs.com/baiyanhuang/
> Douban:http://www.douban.com/people/baiyanhuang/
>

> 2011/4/3 Shuo Chen <giantc...@gmail.com>

Shuo Chen

unread,

Apr 3, 2011, 12:25:56 AM4/3/11

to TopLanguage

你想问什么？
1. O(n) 太慢？"一个序列是否有重复，或者是否有序" 没有 o(n) 的算法吧？
2. adjacent_find 一般用不用？如果正好适用，就代替手写循环呗。
3. assert 一般用不用？不会用，除非在程序启动阶段。

On Apr 3, 12:16 pm, lzprgmr <baiyanhu...@gmail.com> wrote:
> 请教个问题，你一般会在代码中用类似于adjacent_find这样O(n)的方法来assert一个序列是否有重复，或者是否有序吗？
>
> Blog:http://www.cnblogs.com/baiyanhuang/
> Douban:http://www.douban.com/people/baiyanhuang/
>

> 2011/4/3 lzprgmr <baiyanhu...@gmail.com>

>
>
>
>
>
>
>
> > 看来出题者自己也给出了算法，也是利用了lower_bound/upper_bound：
> >http://blog.csdn.net/whinah/archive/2010/08/05/5791255.aspx
>
> > Blog:http://www.cnblogs.com/baiyanhuang/
> > Douban:http://www.douban.com/people/baiyanhuang/
>

> > 2011/4/3 Shuo Chen <giantc...@gmail.com>

lzprgmr

unread,

Apr 3, 2011, 1:28:38 AM4/3/11

to pon...@googlegroups.com

>>你想问什么？

我主要是考虑assert的效率，在验证一个序列的属性时，可能需要线性的复杂度，如验证输入序列是否有序，或者验证某有序序列是否有相邻元素相等，这可能会对调试版本的速度有很大的影响，尤其在输入有多个序列的时候。

>>1. O(n) 太慢？"一个序列是否有重复，或者是否有序" 没有 o(n) 的算法吧？

验证一个序列是否有序是O(n)吧，”一个序列是否有重复“是我没说清楚，是指你这里用的adjacent_find

>>2. adjacent_find 一般用不用？如果正好适用，就代替手写循环呗。

所以你的回答是会对序列进行assert？

>>3. assert 一般用不用？不会用，除非在程序启动阶段。

这句什么意思？

Blog: http://www.cnblogs.com/baiyanhuang/
Douban:http://www.douban.com/people/baiyanhuang/

2011/4/3 Shuo Chen <gian...@gmail.com>

Shuo Chen

unread,

Apr 3, 2011, 1:38:56 AM4/3/11

to TopLanguage

adjacent_find 没有在 find_ip_value() 里调用，只是在初始化（和每次修改它） vector ranges 之后候调
用了一次，之后 ranges 传给 find_ip_value() 是 by const reference 的。
就是我说的"不会在程序里用 assert，除非在启动阶段用"。

On Apr 3, 1:28 pm, lzprgmr <baiyanhu...@gmail.com> wrote:
> >>你想问什么？
>

> 我主要是考虑assert的效率，在验证一个序列的属性时，可能需要线性的复杂度，如验证输入序列是否有序，或者验证某有序序列是否有相邻元素相等，这可能会对调试版本的速度有很大的影响，尤其在输入有多个序列的时候。

>
> >>1. O(n) 太慢？"一个序列是否有重复，或者是否有序" 没有 o(n) 的算法吧？
>
> 验证一个序列是否有序是O(n)吧，"一个序列是否有重复"是我没说清楚，是指你这里用的adjacent_find
>
> >>2. adjacent_find 一般用不用？如果正好适用，就代替手写循环呗。
>
> 所以你的回答是会对序列进行assert？
>
> >>3. assert 一般用不用？不会用，除非在程序启动阶段。
>
> 这句什么意思？
>
> Blog:http://www.cnblogs.com/baiyanhuang/
> Douban:http://www.douban.com/people/baiyanhuang/
>

> 2011/4/3 Shuo Chen <giantc...@gmail.com>

Wang Danqi

unread,

Apr 3, 2011, 3:15:35 AM4/3/11

to TopLanguage

博主用的是开区间，不过这样最后ip 255.255.255.255就无法包括在内了

rockeet

unread,

Apr 3, 2011, 9:01:31 AM4/3/11

to TopLanguage

这的确是个问题，为了解决方案更优雅，使用开区间需要排除 KEY_MAX。
当然也可以通过增加代码量解决该问题，就如同 LoserTree 的结束条件。

Shuo Chen

unread,

Apr 3, 2011, 9:47:32 AM4/3/11

to TopLanguage

呵呵，作者亲自回复了。

Jia

unread,

Apr 7, 2011, 1:50:50 AM4/7/11

to TopLanguage

对于ip搜索，如果不允许出现找不到的情况（一般不会允许），感觉 qqwry.dat 的索引格式好一些这样，不用定义 iprange，只要有
startip 即可，即
不是定义
(ips1, ipe1),(ips2, ipe2),……(ipsN, ipeN)
而是定义
ips1,ips2,……ipsN
落在 ipsi 到 ipsi+1 之内的，都算 ipsi 的。

这样的数据结构，占用的空间，查找的速度都会快一些，维护数据库也比较容易，找到相应的地方插入就行。
而且，考虑到最极端的情况，每个ip一个地址，读入到容器内处理感觉很困难。

另外，我实在受不了map的实现，有序序列的查找，感觉怎么也应该是线型的数据结构。

jiang...@gmail.com

unread,

Apr 7, 2011, 8:53:27 AM4/7/11

to pon...@googlegroups.com, Jia

假设有 m 个IP范围, 每个IP对应一个bit, 每个范围的起始对应 bit 置 1, 其它为零. 对此序列 S 计算 rank, 将 m 个IP范围对应的值存成一个数组 values, 给定 ip, 返回 values[rank(ip)] 即可. 关于 rank/select 的论文比较多, 大概可以在 O(nH_0(S)) (H_0 为零阶熵) 空间复杂度, 对数时间复杂度里实现.

我没实现过, 想, 理论如此, 实际可能还不如二分查找靠谱.

Changsheng Jiang

2011/4/7 Jia <lij...@gmail.com>

Wang Feng

unread,

Apr 7, 2011, 11:02:44 AM4/7/11

to pon...@googlegroups.com, Shuo Chen

2011/4/3 Shuo Chen <gian...@gmail.com>

在这里看到一道面试题：
http://blog.csdn.net/whinah/archive/2011/04/02/6299288.aspx

从一组 IP range --> value 的数据中查找单个 IP，允许使用 C++ STL。

我的解法：

每个 range 的表示是三元组：(start ip, end ip, value)

我看到了一个 sparse matrix

huangchao

unread,

Apr 7, 2011, 10:05:00 AM4/7/11

to TopLanguage

如果无序的话，sort()函数的时间复杂度估计在O(NlogN)吧，除非用计数排序等方法实现O(N)的方法。所以我认为如果无序的话，这个题的瓶
颈还是在排序上。

Shuo Chen

unread,

Apr 7, 2011, 10:46:36 PM4/7/11

to TopLanguage

按照你的思路，Binary search tree 的查找操作的时间复杂度是Ω(N)，因为生成一个 N 节点的 bst 至少要线性时间。

Wang Feng

unread,

Apr 7, 2011, 11:35:51 PM4/7/11

to pon...@googlegroups.com, Shuo Chen

2011/4/8 Shuo Chen <gian...@gmail.com>

按照你的思路，Binary search tree 的查找操作的时间复杂度是Ω(N)，因为生成一个 N 节点的 bst 至少要线性时间。

我记得 Binary Search Tree 在插入新的节点的时候平均时间复杂度是 O(log n)，最坏情形下是 O(n)，插入 N 个节点则需要 O(N log N) 到 O(N^2)

Navi

unread,

Apr 8, 2011, 1:18:38 AM4/8/11

to pon...@googlegroups.com

预处理有预处理的时间，查询有查询的时间... 复杂度应该是O(预处理时间+Q*每个查询的时间) 其中Q是查询数。当然这里说的是在线算法的情况。

2011/4/8 Shuo Chen <gian...@gmail.com>

--
Sincerely,

Junyuan Zhuang (Navi)

Email: nav...@gmail.com
Twitter: http://twitter.com/navimoe
Facebook: http://www.facebook.com/navi.amoy

Larry, LIU Xinyu

unread,

Apr 8, 2011, 1:28:57 AM4/8/11

to TopLanguage

Hi,

The map in STL is implemented with red-black-tree, which is a kind of
popular balanced binary search tree.
Because of this, we can avoid the worst case.

If we need frequently looked up an IP address and decide which area it
is located. I would recommend std::map as the back-end
data structure, so that it ensure O(lg N) time to locate an IP addr.

Here is my pseudo code for this problem. I am sorry that I have no
time to turn it into real programming language for the time being.

\begin{algorithmic}

data ip-range = (ip-start, ip-end, area)

function start(ip-range@(s, e, a)) = s
function end(ip-range@(s, e, a)) = e
function area(ip-range@(s, e, a)) = a

function key(ip-range@(s, e, a)) = (s, e)
function value = area

function `<=`(x, y):
where x, y are IP addr
return dec(x)<=dec(y) where dec = fold-left (lambda(a, b)-> a*10+b)
0 ip

function `<`(r1, r2):
where r1, r2 are ip-range
return start(r1)<start(r2) and end(r1)<end(r2)

function `in`(x, r):
where x is IP, r is IP range
return start(r)<=x<=end(r)

function insert(dict, x)
where x is IP range
return red-black-tree-insert(dict, x, `<`)

function lookup(dict, x)
where x is IP
if x in dict then return value(dict)
if x < start(dict) then return lookup(left(dict), x)
if not x < end(dict) then return lookup(right(dict), x)

function build-dict(ips)
where ips is a list of IP-range
return fold-left insert empty ips

\end{algorithmic}

Cheers,
--
LIU
https://sites.google.com/site/algoxy/home

On Apr 8, 11:35 am, Wang Feng <wanng.fe...@gmail.com> wrote:
> 2011/4/8 Shuo Chen <giantc...@gmail.com>

Reply all

Reply to author

Forward