1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.
2. Support Ruby Quiz by submitting ideas as often as you can:
3. Enjoy!
Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:
http://software77.net/cgi-bin/ip-country/geo-ip.pl
To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.
$ time ruby ip_to_country.rb 68.97.89.187
US
real 0m0.314s
user 0m0.259s
sys 0m0.053s
Is an 'initialisation run' allowed to massage the data?
(we should at least split the benchmarks to keep it fair)
Is it motivating or a spoiler to post timings?
cheers
Simon
> Ruby Quiz wrote:
>> [...]
>>
>> $ time ruby ip_to_country.rb 68.97.89.187
>> US
>>
>> real 0m0.314s
>> user 0m0.259s
>> sys 0m0.053s
>
> Is an 'initialisation run' allowed to massage the data?
> (we should at least split the benchmarks to keep it fair)
My script does need and initialization run, yes. I don't see any
harm in paying a one time penalty to set things up right.
> Is it motivating or a spoiler to post timings?
Motivating, definitely. :)
James Edward Gray II
Ok, my script does not need any initialization, it uses the file
IpToCountry.csv exactly as downloaded.
----------------------------------------------------------------
$ ruby -v
ruby 1.8.4 (2005-12-24) [i386-cygwin]
$ time ruby quiz139.rb 68.97.89.187
US
real 0m0.047s
user 0m0.030s
sys 0m0.030s
$ time ruby quiz139.rb 84.191.4.10
DE
real 0m0.046s
user 0m0.046s
sys 0m0.015s
----------------------------------------------------------------
This is on a Pentium M 2.13GHz Laptop with 2GB RAM and rather slow HD.
cheers
Simon
Wow, I'm impressed. Can't wait to see that code!
James Edward Gray II
Thanks. :)
Because startup time startet to take over the benchmark:
----------------------------------------------------------------
$ time ruby quiz139.rb 68.97.89.187 84.191.4.10 80.79.64.128 210.185.128.123
202.10.4.222 192.189.119.1
US
DE
RU
JP
AU
EU
real 0m0.078s
user 0m0.046s
sys 0m0.031s
----------------------------------------------------------------
and by the way: thanks for telling me such a database exists!
cheers
Simon
Maybe we could also write a ruby script that runs all the entry
scripts and time them, and that could be another ruby quiz which will
also be voted on speed and then we could write a ruby script to time
those entries an then we could .... Just ignore this paragraph
Diego Scataglini
> I think that the timing of the scripts are not a good index. It all
> depends on what hardware/os you are running it on.
> If we want to use speed as an index we should probably have J.E.
> compare them all on the same machine.
I view it that we are getting a rough idea of speeds, not exact
counts. Both scripts timed so far seem to be able to answer the
question in under a second on semi-current hardware. Good enough for
me.
> Maybe we could also write a ruby script that runs all the entry
> scripts and time them, and that could be another ruby quiz which
> will also be voted on speed and then we could write a ruby script
> to time those entries an then we could .... Just ignore this paragraph
Thank you for volunteering... ;)
James Edward Gray II
We probably did something similar. :) Mine also works on the
unmodified IpToCountry.csv file.
$ time ruby 139_ip_to_country.rb 67.19.248.74 70.87.101.66 205.234.109.18 217.146.186.221 62.75.166.87
US
US
US
GB
DE
real 0m0.122s
user 0m0.015s
sys 0m0.000s
(ruby 1.8.4 (2005-12-24) [i386-mswin32], timed from cygwin bash shell,
2GHz athlon64, winxp.)
I don't think the timings are very accurate on this system. It
didn't change much whether I looked up one IP or five.
. . Looking up 80 IPs on one command line resulted in:
real 0m0.242s
user 0m0.015s
sys 0m0.016s
Regards,
Bill
Same here.
$ time ruby locate.rb 1.2.3.4 4.5.6.77 128.129.130.131 255.255.255.255
no country
US
US
ZZ
real 0m0.287s
time for 300 random IPs is:
real 0m1.187s
I assume that it's not ok for folk to use my Ruby GeoIP gem?
It has a reasonable compromise between speed and memory usage.
:-)
Clifford Heath.
Please do. I would love to see how it stacks up against the custom
solutions we will no doubt see.
James Edward Gray II
Hi,
This is my second submission to Ruby Quiz, any tip will be greatly
appreciated. First I downloaded the file and removed all comments,
then tried a straightforward approach. Here it is:
require 'faster_csv'
ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'
FasterCSV.foreach(file) do |line|
if (line[0].to_i <= ip && line[1].to_i >= ip)
puts line[4]
exit
end
end
It allows to pass the IP as the first parameter, then transforms it to
the format in the file (the "base" 256). The file can be passed as the
second argument, by default I used my modified file without the
comments.
The method uses FasterCSV to parse the file line by line, checking ip
against the range. The result for the IP in the example was:
time ruby quiz139a.rb 68.97.89.187
US
real 0m0.355s
user 0m0.328s
sys 0m0.024s
So, it was a little bit more than the one presented with the problem.
Unfortunately, things don't go so good when you search for an IP that
matches entries at the end of the file:
time ruby quiz139a.rb 255.0.0.1
ZZ
real 0m5.337s
user 0m5.036s
sys 0m0.292s
More than 5 seconds :-(
So, I went on to build a second version. This time the approach I
took, as the file is ordered is a binary search directly on the file.
Here it is:
require 'ftools'
ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'
File.open(file) do |f|
low = 0
high = f.stat.size
f.seek(high / 2)
while true
while ((a = f.getc) != 10)
f.seek(-2, IO::SEEK_CUR)
end
pos = f.pos
line = f.readline.split(",")
low_range = line[0][1..-2].to_i
high_range = line[1][1..-2].to_i
if (low_range > ip)
high = pos
offset = (f.pos - pos) + ((high - low) / 2)
f.seek(-offset, IO::SEEK_CUR)
elsif (high_range < ip)
low = f.pos
f.seek((high-low) / 2, IO::SEEK_CUR)
else
puts line[4][1..-2]
exit
end
end
end
What the method does is a binary search in the file. It starts a low
and high variables to manage the partition, then positions the cursor
in the file to the middle. Every time, the method reads back until it
finds a 10 (\n), then reads that line to check the IP range. If the
lower range is higher than the IP, the method moves the cursor in the
file, down to half of the current low-high range. If the higher range
is lower than the IP, then it moves up half of the range. The timings
here are much better:
time ruby quiz139b.rb 68.97.89.187
US
real 0m0.077s
user 0m0.048s
sys 0m0.004s
time ruby quiz139b.rb 255.0.0.1
ZZ
real 0m0.069s
user 0m0.060s
sys 0m0.008s
Apart from the general strategy of the binary search, I haven't had
the time to give too much thought to the details, like how to extract
the values after the readline. Any tips regarding that would be
greatly appreciated. Also I don't know if there's a better way of
moving back in the file to the previous line break (the while (a =
f.getc) != 10 part). Tips?
Thanks for this quiz. I hope I had time to work on every quiz.
Regards,
Jesus.
one of the (without a doubt) many binary search solutions. This one reads a 1K
chunk of the file around the current position and uses a regexp to extract
valid lines (and values) from this chunk.
The reasoning behind that is that data from disk is read in blocks anyway (in
most cases even larger than 1K) and the last 2 or 3 iterations of the algorithm
can be avoided. (max depth encountered so far was 14)
Using the regexp frees us from 'manually' looking for newlines (it will find
about 10 valid sets of data in each iteration).
I tried a lot of stuff like biasing the median towards the side where the
solution will be most likely found, using 'static' buffers for the File#read
calls or unrolling the recursion but nothing proofed realy usefull so i decided
to go with the short and fast enough(tm) solution.
------------------------------------------------------------------------------
require 'ipaddr'
class IPAddr
def country_code()
open('IpToCountry.csv'){|file| code(file, to_i, 0, file.stat.size)}
end
private
def code(file, ip, min, max)
median = (min + max) / 2
file.seek(median - 512)
ary = file.read(1024).scan(/^"(\d*)","(\d*)","(?:\w*)","(?:\d*)","(\w*)"/)
return code(file, ip, median, max) if ary.empty? || ary.last[1].to_i < ip
return code(file, ip, min, median) if ary.first[0].to_i > ip
ary.find{|l| l[1].to_i >= ip}[2]
end
end
ARGV.each{|arg| puts IPAddr.new(arg).country_code} if $0 == __FILE__
------------------------------------------------------------------------------
timings:
$ time ruby quiz139.rb 0.0.0.0 68.97.89.187 84.191.4.10 80.79.64.128
210.185.128.123 202.10.4.222 192.189.119.1 255.255.255.255
ZZ
US
DE
RU
JP
AU
EU
ZZ
real 0m0.062s
user 0m0.046s
sys 0m0.030s
cheers
Simon
This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation.
Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups.
First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line. On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline").
I use the script like this:
$ ./rand_ip.rb 100000 byline > ip-test-list
Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries.
The packer doesn't accept any parameters, just call
$ ./packer.rb
Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory.
The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all.
Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip >= addr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided.
I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line:
$ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null
For more addresses I use one of the following -- the last usage is to make successive runs with equal input:
$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null
My results for 100000 lookups are:
$ time ./packer.rb
real 0m1.634s
user 0m1.576s
sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list
real 0m0.797s
user 0m0.740s
sys 0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null
real 0m11.091s
user 0m9.673s
sys 0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null
real 0m7.131s
user 0m6.960s
sys 0m0.168s
btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$ time ruby /dev/null
real 0m0.129s
user 0m0.120s
sys 0m0.012s
$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
Cheers,
- Matthias
Hi All,
Here's my submission. It's almost exactly the same as Jesus'. I have a feeling everyone is going to have very similar code for this one.
http://pastie.caboo.se/97332
os = ARGV[0].split(/\./)
ip = os[3].to_i + os[2].to_i*256 + os[1].to_i*256**2 + os[0].to_i*256**3
f = File.open("IpToCountry.csv")
# perform binary search on the file
low = 0
high = f.stat.size
while low <= high
mid = (low + high) / 2
f.seek mid # seek to the middle of our search window
f.seek -2, IO::SEEK_CUR until f.getc == ?\n # walk backwards until we hit a newline
new_high = f.pos - 1
line = f.gets
new_low = f.pos
from, to, x, x, country = line[1..-1].split(/","/)
if to.to_i < ip # we are too low, set new low to after this line
low = new_low
elsif from.to_i > ip # we are too high, set new high to before the last newline
high = new_high
else
puts country; exit
end
end
puts "no country"- steve
#!/usr/bin/env ruby -wKU
require "open-uri"
require "zlib"
begin
require "rubygems"
rescue LoadError
# load without gems
end
begin
require "faster_csv"
FCSV.build_csv_interface
rescue LoadError
require "csv"
end
class IP
def initialize(address)
@address_chunks = address.split(".").map { |n| Integer(n) }
raise AgumentError, "Malformed IP" unless @address_chunks.size == 4
end
def to_i
@address_chunks.inject { |result, chunk| result * 256 + chunk }
end
STRING_SIZE = new("255.255.255.255").to_i.to_s.size
def to_s
"%#{STRING_SIZE}s" % to_i
end
end
class IPToCountryDB
REMOTE = "http://software77.net/cgi-bin/ip-country/geo-ip.pl?
action=download"
LOCAL = "ip_to_country.txt"
RECORD_SIZE = IP::STRING_SIZE * 2 + 5
def self.build(path = LOCAL)
open(path, "w") do |db|
open(REMOTE) do |url|
csv = Zlib::GzipReader.new(url)
last_range = Array.new
csv.each do |line|
next if line =~ /\A\s*(?:#|\z)/
from, to, country = CSV.parse_line(line).values_at(0..1, 4).
map { |f| Integer(f) rescue f }
if last_range[2] == country and last_range[1] + 1 == from
last_range[1] = to
else
build_line(db, last_range)
last_range = [from, to, country]
end
end
build_line(db, last_range)
end
end
end
def self.build_line(db, fields)
return if fields.empty?
db.printf("%#{IP::STRING_SIZE}s\t%#{IP::STRING_SIZE}s\t%s\n",
*fields)
end
private_class_method :build_line
def initialize(path = LOCAL)
begin
@db = open(path)
rescue Errno::ENOENT
self.class.build(path)
retry
end
end
def search(address)
binary_search(IP.new(address).to_i)
end
private
def binary_search(ip, min = 0, max = @db.stat.size / RECORD_SIZE)
return "Unknown" if min == max
middle = (min + max) / 2
@db.seek(RECORD_SIZE * middle, IO::SEEK_SET)
if @db.read(RECORD_SIZE) =~ /\A *(\d+)\t *(\d+)\t([A-Z]{2})\n\z/
if ip < $1.to_i then binary_search(ip, min, middle)
elsif ip > $2.to_i then binary_search(ip, middle + 1, max)
else $3
end
else
raise "Malformed database at offset #{RECORD_SIZE * middle}"
end
end
end
if __FILE__ == $PROGRAM_NAME
require "optparse"
options = {:db => IPToCountryDB::LOCAL, :rebuild => false}
ARGV.options do |opts|
opts.banner = "Usage:\n" +
" #{File.basename($PROGRAM_NAME)} [-d PATH] IP\n" +
" #{File.basename($PROGRAM_NAME)} [-d PATH] -r"
opts.separator ""
opts.separator "Specific Options:"
opts.on("-d", "--db PATH", String, "The path to database file")
do |path|
options[:db] = path
end
opts.on("-r", "--rebuild", "Rebuild the database") do
options[:rebuild] = true
end
opts.separator "Common Options:"
opts.on("-h", "--help", "Show this message.") do
puts opts
exit
end
begin
opts.parse!
raise "No IP address given" unless options[:rebuild] or
ARGV.size == 1
rescue
puts opts
exit
end
end
if options[:rebuild]
IPToCountryDB.build(options[:db])
else
puts IPToCountryDB.new(options[:db]).search(ARGV.shift)
end
end
__END__
James Edward Gray II
BTW, all solutions already submitted will lie for subnets 1,2 and 5 :)
Most (but not all) will break on out of bounds submissions (256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)
(yes, I know, just trying to find an excuse for not very elegant solution -
edge cases kill elegancy :)
>-----------------------------------------------------------------------------
def bin_find(file,search)
if block_given?
compare = lambda { |a, b| yield(a, b) }
else
compare = lambda { |a, b| a <=> b }
end
open(file) do |f|
return bin_search(f,search,f.lstat.size,&compare)
end
end
def bin_search(f,search,size)
start,finish=0,size
while start<finish
dist=(finish-start)/2
f.seek(start+dist)
f.readline unless start+dist==0
case (l1=f.readline.chomp rescue false) ? yield(search,l1) : -1
when -1
next if (finish=start+dist)>0
break
when 0
return l1
else
case (l2=f.readline.chomp rescue false) ? yield(search,l2) : -1
when -1
return l1
when 0
return l2
else
start+=dist; next
end
end
end
nil
end
nums=[]
out=true
if ARGV[0]=='test'
n=ARGV[1].to_i
n.times{nums << rand(4294967296)}
out=false
else
ARGV.each do |argv|
nums << ((($1.to_i*256)+$2.to_i)*256+$3.to_i)*256+$4.to_i if
argv=~/(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})/
end
end
if nums.empty?
puts "Please enter valid ip(s) (or use '#{$0} test NNN' for testing)"
exit
end
nums.each do |num|
ctry='Unknown'
res=bin_find('IpToCountry.csv',num) { |search, str|
str.empty? || str[0,1]!='"' ? 1 : search <=>
str.gsub('"','').split(',')[0].to_i
}.gsub('"','').split(',')
ctry=res[4] if (res[0].to_i..res[1].to_i)===num
puts ctry if out
end
Thanks for the fun quiz.
Looks like my solution is a little more verbose than many
submitted so far.
Like many others, I'm performing a binary search on the
original data file. For better or for worse, I imposed the
following restrictions on my design, just out of curiosity
for how it would turn out:
- non-recursive binary search
- only using seek() and gets() to access the database
(using gets means no backward scanning for the beginning
of a record)
Regards,
Bill
====== Begin file: 139_ip_to_country.rb =====
# IP FROM IP TO REGISTRY ASSIGNED CTRY CNTRY COUNTRY
# "1346797568","1346801663","ripencc","20010601","IL","ISR","ISRAEL"
IpRec = Struct.new(:from, :to, :registry, :assigned, :ctry, :cntry, :country) do
def contains?(ip_value)
ip_value >= self.from && ip_value <= self.to
end
end
class IpToCountry
DATABASE_FNAME = "IptoCountry.csv"
def initialize(db_fname=DATABASE_FNAME)
@db_size = File.stat(db_fname).size
@db = File.open(db_fname, "r")
@db_pos = 0
end
def close
@db.close
@db = nil
end
# Lookup IpRec containing ip. Exception is raised if not found.
def search(ip)
ip_value = self.class.ip_to_int(ip)
find_rec(ip_value)
end
# Binary search through sorted IP database.
# The search uses non-record-aligned byte offsets into the
# file, but always scans forward for the next parsable
# record from a given offset. It keeps track of the
# byte offset past the end of the parsed record in @db_pos.
def find_rec(ip_value)
lower, upper = 0, @db_size - 1
prev_rec = nil
loop do
ofst = lower + ((upper - lower) / 2)
rec = next_rec_from_ofst(ofst)
if rec == prev_rec
# We have narrowed the search to where we're hitting
# the same record each time. Can't get any narrower.
# But these are variable-length records, so there may
# be one or more at our lower bound we haven't seen.
# Do a linear scan from our lower bound to examine the
# few records remaining.
ofst = lower
while (rec = next_rec_from_ofst(ofst)) != prev_rec
return rec if rec.contains? ip_value
ofst = @db_pos
end
raise("no record found for ip_value #{ip_value}")
end
return rec if rec.contains? ip_value
if ip_value < rec.from
upper = ofst
else
lower = @db_pos
end
prev_rec = rec
end
end
def next_rec_from_ofst(ofst)
@db.seek(ofst)
@db_pos = ofst
while line = @db.gets
@db_pos += line.length
break if rec = self.class.parse_rec(line)
end
rec || raise("no record found after ofst #{ofst}")
end
def self.ip_to_int(ip_str)
ip_str.split(".").map{|s|s.to_i}.pack("c4").unpack("N").first
end
# NOTE: Using a strict regexp instead of a simpler split operation,
# because it's important we find a valid record, not one embedded
# in a comment or such.
def self.parse_rec(line)
if line =~ %r{\A \s*"(\d+)"\s*,
\s*"(\d+)"\s*,
\s*"(\w+)"\s*,
\s*"(\d+)"\s*,
\s*"(\w+)"\s*,
\s*"(\w+)"\s*,
\s*"([^"]+)"\s* \z
}x
rec = IpRec.new($1.to_i, $2.to_i, $3, $4.to_i, $5, $6, $7)
end
end
end
if $0 == __FILE__
# Accepts zero-or-more IP addresses on the command line.
ip2c = IpToCountry.new
ARGV.each do |ip|
rec = ip2c.search(ip) rescue nil
puts( rec ? rec.ctry : "(#{ip} not found)" )
end
end
====== End file: 139_ip_to_country.rb =====
====== Begin file: test_139_ip_to_country.rb ======
require '139_ip_to_country'
require 'test/unit'
class TestIpToCountry < Test::Unit::TestCase
# NOTE: the following are the first two and last two records in
# my database file:
REC_0 = IpToCountry.parse_rec(%{"0","16777215","IANA","410227200","ZZ","ZZZ","RESERVED"})
REC_1 = IpToCountry.parse_rec(%{"50331648","67108863","ARIN","572572800","US","USA","UNITED STATES"})
REC_NEG_1 = IpToCountry.parse_rec(%{"4261412864","4278190079","IANA","410227200","ZZ","ZZZ","RESERVED"})
REC_LAST = IpToCountry.parse_rec(%{"4278190080","4294967295","IANA","410227200","ZZ","ZZZ","RESERVED"})
def test_find_rec
ip2c = IpToCountry.new
assert_equal( REC_0, ip2c.find_rec(REC_0.from) )
assert_equal( REC_0, ip2c.find_rec(REC_0.to) )
assert_equal( REC_1, ip2c.find_rec(REC_1.from) )
assert_equal( REC_1, ip2c.find_rec(REC_1.to) )
assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.from) )
assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.to) )
assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.from) )
assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.to) )
ip2c.close
end
def test_search
ip2c = IpToCountry.new
rec = ip2c.search("67.19.248.74")
assert_not_nil( rec )
assert_equal( "ARIN", rec.registry )
assert_equal( "US", rec.ctry )
end
end
====== End file: test_139_ip_to_country.rb ======
Hi, could you clarify what is meant by lying about subnets
1, 2, and 5?
Thanks,
Bill
Ah, OK. I get:
ruby 139_ip_to_country.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
ZZ
(1.1.1.1 not found)
(2.1.1.1 not found)
US
US
(5.1.1.1 not found)
Regards,
Bill
Well, we weren't all broken:
$ ruby ip_to_country.rb 5.1.1.1
Unknown
James Edward Gray II
> Well, we weren't all broken:
I've sent my comment and refreshed new headers - and yes, your solution came
in :)
--EK
On 9/16/07, steve d <oks...@yahoo.com> wrote:
> Here's my submission. It's almost exactly the same as Jesus'.
[...]
> while low <= high
[...]
and this:
On 9/16/07, Eugene Kalenkovich <rub...@softover.com> wrote:
> BTW, all solutions already submitted will lie for subnets 1,2 and 5 :)
> Most (but not all) will break on out of bounds submissions (256.256.256.256
> or 0.0.0.-1, latter if comments are stripped out)
I made a couple of modifications to my code. Here is the latest version:
require 'ftools'
ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'
File.open(file) do |f|
low = 0
high = f.stat.size
f.seek(high / 2)
while low < high
while (((a = f.getc) != 10) && (f.pos > 2))
f.seek(-2, IO::SEEK_CUR)
end
pos = f.pos
line = f.readline.split(",")
low_range = line[0][1..-2].to_i
high_range = line[1][1..-2].to_i
if (low_range > ip)
high = pos
offset = (f.pos - pos) + ((high - low) / 2)
f.seek(-offset, IO::SEEK_CUR)
elsif (high_range < ip)
low = f.pos
f.seek((high-low) / 2, IO::SEEK_CUR)
else
puts line[4][1..-2]
exit
end
end
puts "No country found"
end
I changed to while low < high, and also the walking backwards failed
when reading the first line (btw, Steve, won't your solution also fail
for that case?).
Now:
time ruby quiz139b.rb 2.1.1.1
No country found
real 0m0.030s
user 0m0.016s
sys 0m0.004s
time ruby quiz139b.rb 5.1.1.1
No country found
real 0m0.032s
user 0m0.008s
sys 0m0.008s
time ruby quiz139b.rb 1.1.1.1
No country found
real 0m0.033s
user 0m0.016s
sys 0m0.000s
time ruby quiz139b.rb 256.256.256.256
No country found
real 0m0.096s
user 0m0.008s
sys 0m0.000s
Thanks,
Jesus.
Please shoot...
gegroet,
Erik V. - http://www.erikveen.dds.nl/
BTW, lines 17899 and 17900 of the downloaded file both contain
"2624585728", which is strange... (Downloaded: 2007-09-17 08:34
(UTC).)
----------------------------------------------------------------
$ time ruby quiz139a.rb 11.22.33.44
11.22.33.44 US
real 0m0.011s
user 0m0.006s
sys 0m0.004s
----------------------------------------------------------------
$ wc all_ip_blocks
82358 82358 905115 all_ip_blocks
$ time ruby quiz139b.rb all_ip_blocks
real 0m45.681s # That's 1800 IPs per second.
user 0m40.351s
sys 0m5.300s
----------------------------------------------------------------
class OrderedLinesFile
def initialize(file_name)
@file_name = File.expand_path(file_name)
end
def find_line(&block)
line = nil
File.open(@file_name) do |io|
position = 0
delta = File.size(@file_name)/2
line = io.gets # The first line of the file.
line = io.gets while /^\s*#/ =~ line or /^\s*$/ =~
line
while delta > 0 and line and (space_ship = block.call(line)) !=
0
position += space_ship < 0 ? -delta : +delta
delta /= 2
if position > 0
io.pos = position
broken_line = io.gets # Skip the current (broken)
line.
line = io.gets # The current line of the
file.
line = io.gets while /^\s*#/ =~ line or /^\s*
$/ =~ line
else
delta = 0 # Stop.
end
end
line = nil if delta == 0 # Nothing found.
end
line
end
end
class IpToCountry
FILE = "IpToCountry.csv"
def country_of(ip_addr)
ip_addr = ip_addr.split(/\./).collect{|n| n.to_i}.inject{|n,
m| n*256+m} # "1.2.3.4" --> 16909060
olf = OrderedLinesFile.new(FILE)
res = nil
olf.find_line do |line|
ip_from, ip_to, registry, assigned, ctry, cntry, country =
line.gsub(/"/, "").split(/,/, 7)
if ip_addr < ip_from.to_i
-1 # Go back in the file.
elsif ip_addr > ip_to.to_i
+1 # Go forward in the file.
else
res = ctry
0 # Bingo!
end
end
res
end
end
itc = IpToCountry.new
ARGV.each do |ip_addr|
puts "%-16s %s" % [ip_addr, itc.country_of(ip_addr)||"??"]
end
----------------------------------------------------------------
:}
gegroet,
Erik V.
Yea it seems that mine will loop forever if you give an invalid ip address that evalutes to less than 0. Changing to 'low < high' seems to fix that. Also it incorrectly reports nil instead of no country on 0.0.0.0. This is all running on the stock csv file (with comments), haven't tried it at all on a modified file.
- steve
It's about twice as fast and requires a fraction of the "sys" times.
Also the "empty" ruby run is only 5 ms. Mysteriously.
The packer is down at 0.8s, and 100000 lookups are at 6.2 s (file-based
aka memory efficient) and 3.7 s (everything in-memory).
Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*
Matthias Wächter schrieb:
> My results for 100000 lookups are:
>
> $ time ./packer.rb
>
> real 0m1.634s
> user 0m1.576s
> sys 0m0.056s
$ time ./packer.rb
real 0m0.716s
user 0m0.708s
sys 0m0.008s
> $ time ./rand_ip.rb 100000 byline > ip-test-list
>
> real 0m0.797s
> user 0m0.740s
> sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list
real 0m0.322s
user 0m0.317s
sys 0m0.005s
> $ time ./search_file.rb < ip-test-list > /dev/null
>
> real 0m11.091s
> user 0m9.673s
> sys 0m1.420s
$ time ./search_file.rb <ip-test-list >/dev/null
real 0m6.201s
user 0m5.316s
sys 0m0.885s
> $ time ./search_str.rb < ip-test-list > /dev/null
>
> real 0m7.131s
> user 0m6.960s
> sys 0m0.168s
$ time ./search_str.rb <ip-test-list >/dev/null
real 0m3.714s
user 0m3.707s
sys 0m0.006s
> btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
> $ time ruby /dev/null
>
> real 0m0.129s
> user 0m0.120s
> sys 0m0.012s
$ time ruby /dev/null
real 0m0.005s
user 0m0.004s
sys 0m0.001s
> $ uname -a
> Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007
x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux
> $ ruby --version
> ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]
- Matthias
> Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*
Great. Run it for us and let us know how we do. ;)
James Edward Gray II
It depends what you mean by unmodified - my algorithm runs off the
original file, the only "modification" I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like l could have done that every time my script
was run and only added 5 ms.
> I would have bet Simon's would be faster. Strange!
I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it's the regexp that slowed it
down.
-Adam
Hi,
Yes, I only tested with my version of the file, from which I manually
removed the comments. I don't think I'll have time to fix that, at
least this week. Anyway, I find strange the FasterCSV version doesn't
work, because it delegates the parsing of the file to that gem, and
the rest is pretty simple. On the other hand I don't expect the first
version to perform anywhere near the other solutions, so it's not so
important :-).
Jesus.
> On 9/19/07, Matthias Wächter <matt...@waechter.wiz.at> wrote:
>> James Edward Gray II schrieb:
>>> On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
>>>> Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*
>>> Great. Run it for us and let us know how we do. ;)
>>
>> [**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
>> the later version you sent your direct parsing breaks when it
>> comes to
>> detecting the commented lines in the first part of the file. I
>> couldn't
>> manage to make it run, sorry.
>
> Hi,
>
> Yes, I only tested with my version of the file, from which I manually
> removed the comments. I don't think I'll have time to fix that, at
> least this week. Anyway, I find strange the FasterCSV version doesn't
> work, because it delegates the parsing of the file to that gem, and
> the rest is pretty simple.
Comments are not a part of the CSV specification, so FasterCSV
doesn't address them. I would like to add ignore patterns at some
point though.
James Edward Gray II
mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK
real 0m0.004s
user 0m0.004s
sys 0m0.000s
Here's my code:
#!/usr/bin/ruby
ARGV.each { |ip|
f = ip.split(/\./).join "/"
puts File.open(f).readlines[0] rescue puts "Unknown"
}
I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :-)
- Mark.
Pretty clever.
I bet with the right prep, this could even be a pretty viable
approach. Instead of building a file for each address you could
create a directory structure for the hexadecimal representations of
each piece of the address. The final layer could be handled as you
have here or with a search through a much smaller file.
James Edward Gray II
LOL! Nice... :)
> Pretty clever.
>
> I bet with the right prep, this could even be a pretty viable
> approach. Instead of building a file for each address you could
> create a directory structure for the hexadecimal representations of
> each piece of the address. The final layer could be handled as you
> have here or with a search through a much smaller file.
Indeed, I was thinking last night about preprocessing the data
into an on-disk hash table. I was thinking about a flat file
at the time, but one could use a subdirectory technique like the
above... using hex components of the hash value to index through
a couple subdirs to reach a leaf file containing one or more
records.
Another thing I'd like to try but won't have time for, is to
use ruby-mmap somehow. (Preprocess the data into a flat binary
file, then use ruby-mmap when performing the binary search.)
Anyway thanks for the fun quiz. :)
Regards,
Bill
>ruby IPToCountry.rb 121.121.121.121
MY
0.000000 0.000000 0.000000 ( 0.000000)
Wrapping the code in 100.times {...}
>ruby IPToCountry.rb 121.121.121.121
MY
MY
...
MY
MY
0.062000 0.000000 0.062000 ( 0.078000)
Hmmmm...clearly I'm using a supercomputer. :P
My solution uses no initialization run. Instead, it uses a binary search, making good use of IO#seek and IO#pos. It's a little atypical though - it uses different numbers to compare less than and greater than, and looking at a position really means looking at the first two numbers on the next line.
Here's the code:
require 'benchmark'
puts Benchmark.measure { 100.times {
dot_dec_ip = ARGV[0].chomp
dec_ip = dot_dec_ip[0..2].to_i << 24
dot_dec_ip = dot_dec_ip[(dot_dec_ip.index(?.)+1)..-1]
dec_ip += dot_dec_ip[0..2].to_i << 16
dec_ip += dot_dec_ip[dot_dec_ip.index(?.)+1,3].to_i << 8
#Last 8 bits are all in the same country; they don't matter
dec_ip = dec_ip
dataf = File.new("IPToCountry.csv")
###Begin binary search, finding high and low
#Hardcoded character offset of where to start. This should be the index of
#a character on the last line of comments
#
#Earlier versions used 0 or calculated this each iteration.
#The former yielded bad results (for obvious reasons);
#the latter doubled the time needed.
low = 6603
dataf.seek(0,IO::SEEK_END)
flen = dataf.pos
high = flen
while true
if low == high - 1
puts "IP not assigned"
break
end
mid = (low + high) >> 1
dataf.seek(mid,IO::SEEK_SET)
dataf.gets
dataf.getc
range_start = dataf.gets('"')
range_start.slice!(-1)
range_start = range_start.to_i
cmpno = dec_ip <=> range_start
if cmpno == -1
high = mid
next
else
dataf.read(2)
range_end = dataf.gets('"')
range_end.slice!(-1)
range_end = range_end.to_i
if (dec_ip <=> range_end) == 1
low = mid
next
else
puts dataf.gets.match(/"(\w\w)"/)[1]
break
end
end
end
}}
----- Original Message ----
From: Ruby Quiz <ja...@grayproductions.net>
To: ruby-talk ML <ruby...@ruby-lang.org>
Sent: Friday, September 14, 2007 7:31:41 AM
Subject: [QUIZ] IP to Country (#139)
The three rules of Ruby Quiz:
1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.
2. Support Ruby Quiz by submitting ideas as often as you can:
3. Enjoy!
Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.
-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:
http://software77.net/cgi-bin/ip-country/geo-ip.pl
To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.
$ time ruby ip_to_country.rb 68.97.89.187
US
real 0m0.314s
user 0m0.259s
sys 0m0.053s
____________________________________________________________________________________
Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=list&sid=396545469
>> I would have bet Simon's would be faster. Strange!
>
> I thought block file reads would be faster too, that was the next
> thing I was planning to try. Maybe it's the regexp that slowed it
> down.
Without looking at the other solutions in detail i think one of the problems
may be that my solution opens the file for each lookup - thats of course easy
to fix. I don't know if thats the problem or the overhead of creating ten
thousand IPAddr objects - i refuse to analyse this in depth because i don't
have a usecase for looking up that many locations in a single run.
(on the other hand i do understand how much fun it can be to optimize such a
problem to death - so go on if you like, i don't have the motivation - this time :)
cheers
Simon
This benchmark ran each solution 1000 times with random IPs (invoking a new ruby instance on each run).
-bash-2.05b$ ruby bm.rb
sorted by real time
[["steved.rb", 10.5210349559784],
["adam.rb", 10.817939043045],
["justin.rb", 11.6265380382538],
["james.rb", 12.3434510231018],
["eugene.rb", 12.9024469852448],
["jesus.rb", 13.2812840938568],
["erik.rb", 13.3421401977539],
["simon.rb", 14.0733840465546]]
sorted by sys time
[["simon.rb", 0.796875],
["steved.rb", 1.3046875],
["adam.rb", 1.5078125],
["justinrb", 1.515625],
["jesus.rb", 1.765625],
["eugene.rb", 2.265625],
["erik.rb", 2.421875],
["james.rb", 2.4609375]]
sorted by user time
[["simon.rb", 0.078125],
["eugene.rb", 0.09375],
["jesus.rb", 0.109375],
["steved.rb", 0.1171875],
["justin.rb", 0.1171875],
["jamesrb", 0.125],
["adam.rb", 0.125],
["erik.rb", 0.1953125]]
Here's the benchmark source:
require 'benchmark'
require 'pp'
random_ips = Array.new(1000).map { Array.new(4).map {rand(256)}.join('.') }
sols = Dir["*.rb"] - [$0]
times = {}
sols.each do |fn|
stats = Benchmark.measure do
random_ips.each do |ip|
%x{ ruby #{fn} #{ip} }
end
end
times[fn] = stats
end
sort_times = proc { |m| times.sort_by{|(k,v)| v.send(m)}.map{|(k,v)| [k, v.send(m)]} }
puts "sorted by real time"
pp sort_times[:real]
puts
puts "sorted by sys time"
pp sort_times[:stime]
puts
puts "sorted by user time"
pp sort_times[:utime]
Also, Adam, your solution seems to give incorrect answers sometimes, I noticed 33.33.33.33 and 88.88.88.88 both gave Not Found where as all the other solutions gave US and NO respectively.
- steve
Cool. I guess mine was overlooked but it also does no prep work:
http://groups.google.com/group/comp.lang.ruby/msg/6162dae2bcf1bc3f?dmode=source
Regards,
Bill
Sorry about that, re-ran with yours included:
10000 times each:
-bash-2.05b$ ruby bm.rb
sorted by real time
[["james.rb", 97.2237520217896],
["adam.rb", 100.450285196304],
["steved.rb", 101.930130004883],
["erik.rb", 103.835056066513],
["bill.rb", 104.454197883606],
["eugene.rb", 105.771279811859],
["justin.rb", 106.694982051849],
["jesus.rb", 118.557897090912],
["simon.rb", 141083032131195]]
sorted by sys time
[["simon.rb", 16.359375],
["bill.rb", 17.6328125],
["erik.rb", 17.7109375],
["eugene.rb", 17.734375],
["steved.rb", 17.84375],
["justin.rb", 17.921875],
["jesus.rb", 17.9921875],
["adam.rb", 18.046875],
["james.rb", 18.140625]]
sorted by user time
[["adam.rb", 0.7578125],
["justin.rb", 0.8203125],
["erik.rb", 0.8203125],
["steved.rb", 0.84375],
["jesus.rb", 0.8984375],
["bill.rb", 0.9140625],
["james.rb", 0.9140625],
["simon.rb", 0.9140625],
["eugene.rb", 0.953125]]
Thanks! :D
But first, let's talk speed.
Most of the solutions are quite quick, relatively speaking. What's the shared
secret of speed? Binary search. The records are in sorted order, so it's
possible to perform a simple cut-the-possible-matches-in-half-at-each-step
lookup. That favored algorithm makes short work of what could otherwise be a
lengthy search.
Now you can do a binary search on the database file as is and several solutions
did. This is a little trickier, because the records are not of a standard size.
You have to be super careful not to accidently skip records. While the
solutions handled the challenge very well, you can make things quite a bit
easier if you are willing to preprocess the file.
You can also speed things up with some clever preprocessing. That was the other
trick Matthias used to find answers so quickly.
Matthias realized that while the actual records contain quite a bit of data, we
only really care about the start of a range, the end of a range, and the country
code. You can fit all of that in just ten bytes with four for each address and
two for the country.
The file also contains some consecutive ranges that can be collapsed for the
purposes of our search. Such ranges may differ in some attributes, but as long
as their country codes are the same we can treat them as a single unit.
Now that you know the goal, let's take a look at the packing code Matthias sent
in:
#!/usr/bin/ruby
# comment
last_start=nil
last_end=nil
last_country=nil
File.open("packed-ip.dat","wb") do |wfile|
IO.foreach("geo-ip.csv") do |line|
next if !(line =~ /^"/ )
s,e,d1,d2,co=line.delete!("\"").split(",")
s,e = s.to_i,e.to_i
if !last_start
# initialize with first entry
last_start,last_end,last_country = s,e,co
else
if s==last_end+1 and co==last_country
# squeeze if successive ranges have zero gap
last_end=e
else
# append last entry, remember new one
wfile << [last_start,last_end,last_country].pack("NNa2")
last_start,last_end,last_country = s,e,co
end
end
end
# print last entry
if last_start
wfile << [last_start,last_end,last_country].pack("NNa2")
end
end
The three variables declared up front are for tracking the ranges. These will
be used to collapse consecutive ranges.
The next bit of code creates the binary database we will write into and parses
the CSV formated data. Though the data is in the CSV format, the actual content
is trivial and you don't need a proper parser to get at it. As you can see,
Matthias just checks for lines starting with a quote, pulls out all quote
characters, and split()s on commas. This is faster than using a parser.
The if/else chain in the middle of the code does most of the work. First, it
stores the initial range in the tracking variables. For all other ranges it
tests to see if it is consecutive with the last one recorded and has the same
country code. When it does, the endpoint of the last range is just bumped to
include them both. When it doesn't, the last range is written to the file and
the code starts tracking the new range. The final if statement ensures that we
write out the final range before exiting.
This really isn't too much work and it runs in under two seconds on my box.
That's a small price to pay for a speed gain every time we look up an address.
Let's dive into the search code now. Here's how it begins:
#!/usr/bin/ruby
# take command line or stdin -- the latter has performance advantage
# for long lists
if ARGV[0]
arr=ARGV
else
arr=$stdin
end
# ...
As the comment tells us, Matthias added another way to feed the program
addresses. Where I said we should take one from the command-line arguments,
this code actually handles any number of addresses from command-line arguments
or STDIN.
This next bit of code opens up our database:
# ...
# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
rfile.seek(0,IO::SEEK_END)
record_max=rfile.pos/10-1
# ...
The seek() and pos() calls here are used to find the end of the file. You need
to know both ends of a data set to perform a binary search. Note that dividing
by ten gives us the count of records since they are a fixed width and Matthias
subtracts one because we will never need to seek() to the end of the last record
again.
Now there's a touch more preparation for each address we want to find:
# ...
arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr= argv.split(".").map {|x| x.to_i.chr}.join
# ...
In order to avoid a bunch of to_i() calls on each range extracted from the
database, Matthias just encodes the address we are hunting for into the
character encoding used in the database. This way simple String comparisons can
determine if the address it in the range.
We're now ready for the actual search code:
# ...
# low/high water marks initialized
low,high=0,record_max
while true
mid=(low+high)/2 # binary search median
rfile.seek(10*mid) # one record is 10 byte, seek to position
str=rfile.read(8) # for range matching, we need only 8 bytes
# at comparison, values are big endian, i.e. packed("N")
if ipstr>=str[0,4] # is this IP not below the current range?
if ipstr<=str[4,4] # is this IP not above the current range?
puts rfile.read(2) # a perfect match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts "no country"
break
end
end
}
end
This is a pretty textbook binary search. You always begin by going to the
middle of the current low and high. In this case that's a seek() call and we
have to remember to multiply the midpoint by our record size of ten.
Once we are at the right record, the code read()s the first eight bytes and
compares the address against the low and high for the range. When it is in the
range, the final two bytes are read and printed. If the address is below the
current range, we drop the high to below the current range. If it's above, we
raise the low to above the current range. A final check is added to catch the
cases where no match was found, in which case our low will bypass the high.
Remember that a second goal of the quiz described search was to be memory
friendly. This code does terrific on that front since only one record needs to
be in memory at a time. Once the checks are done, it can be replaced by the
next record.
My thanks to all who showed so many great examples of this classic algorithm.
Tomorrow we will write programs that can help me to better understand my
wardrobe...