一道统计题

7 views
Skip to first unread message

feifei

unread,
Jun 12, 2008, 10:25:31 AM6/12/08
to TopLanguage
面试经典题:
给出1000w条记录,每条记录为1byte-1k,长度不等,请查询出重复次数最高的20条记录,如何优化?

1)如果不考虑内存的话,使用shell或者awk最方便,
sort file|uniq -c |sort -nr|head -n20
awk '{a[$1]++} END{ for(i in a) print a[i],i}' file|sort -nr|head -n20

2)优化

瓶颈在不定字符串的比较,可以先求出签名,比如64位无符号长整形,在此基础上做hashmap进行计数,不过hashmap不能对值排序吧,在把值做
个最小堆,堆的大小为20。复杂度为O(nlogN) ,n为总记录数,N为最高重复记录数
不过想想既然字符串做了签名,在堆排序的时候怎末对应上原有的字符串呢?

呵呵,希望大家交流。

pongba

unread,
Jun 12, 2008, 10:55:03 AM6/12/08
to pon...@googlegroups.com
贴段Tim Bray写的ruby(via):
counts = {}
counts.default = 0

ARGF.each_line do |line|
if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
counts[$1] += 1
end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
puts "#{counts[key]}: #{key}"
end


2008/6/12 feifei <dongf...@gmail.com>:



--
刘未鹏(pongba)|C++的罗浮宫
http://blog.csdn.net/pongba
TopLanguage
http://groups.google.com/group/pongba

Wang Xin

unread,
Jun 12, 2008, 10:34:46 PM6/12/08
to pon...@googlegroups.com
你丫居然把beautiful code里面的代码直接copy过来了~~~

2008/6/12 pongba <pon...@gmail.com>:



--
Everything is possible and available if we trust ourselves!

dd_engi

unread,
Jun 13, 2008, 5:14:33 AM6/13/08
to TopLanguage
可以用radix sort,线性复杂度地全部排好序,然后统计就简单了。
不过如果不能一次全部读进内存里,这些记录在硬盘中储存的方式应该预处理成适合radix sort的形式。

dd_engi

unread,
Jun 13, 2008, 5:15:26 AM6/13/08
to TopLanguage
Ruby 的不懂。讲讲用的啥算法吧。

On Jun 12, 10:55 pm, pongba <pon...@gmail.com> wrote:
> 贴段Tim Bray写的ruby(via <http://www.codinghorror.com/blog/archives/001131.html>
> ):
>
> counts = {}
> counts.default = 0
>
> ARGF.each_line do |line|
> if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
> counts[$1] += 1
> end
> end
>
> keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
> keys_by_count[0 .. 9].each do |key|
> puts "#{counts[key]}: #{key}"
> end
>
> 2008/6/12 feifei <dongfei...@gmail.com>:

hayate

unread,
Jun 15, 2008, 3:12:25 AM6/15/08
to pon...@googlegroups.com
就是hash table......

2008/6/13 dd_engi <tian...@gmail.com>:
Reply all
Reply to author
Forward
0 new messages