Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

[QUIZ] IP to Country (#139)

10 views
Skip to first unread message

Ruby Quiz

unread,
Sep 14, 2007, 8:31:41 AM9/14/07
to
The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:

http://software77.net/cgi-bin/ip-country/geo-ip.pl

To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.

$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s

Simon Kröger

unread,
Sep 14, 2007, 3:18:45 PM9/14/07
to
Ruby Quiz wrote:
> [...]
>
> $ time ruby ip_to_country.rb 68.97.89.187
> US
>
> real 0m0.314s
> user 0m0.259s
> sys 0m0.053s

Is an 'initialisation run' allowed to massage the data?
(we should at least split the benchmarks to keep it fair)

Is it motivating or a spoiler to post timings?

cheers

Simon

James Edward Gray II

unread,
Sep 14, 2007, 3:23:54 PM9/14/07
to
On Sep 14, 2007, at 2:20 PM, Simon Kröger wrote:

> Ruby Quiz wrote:
>> [...]
>>
>> $ time ruby ip_to_country.rb 68.97.89.187
>> US
>>
>> real 0m0.314s
>> user 0m0.259s
>> sys 0m0.053s
>
> Is an 'initialisation run' allowed to massage the data?
> (we should at least split the benchmarks to keep it fair)

My script does need and initialization run, yes. I don't see any
harm in paying a one time penalty to set things up right.

> Is it motivating or a spoiler to post timings?

Motivating, definitely. :)

James Edward Gray II

Simon Kröger

unread,
Sep 14, 2007, 3:32:28 PM9/14/07
to

Ok, my script does not need any initialization, it uses the file
IpToCountry.csv exactly as downloaded.

----------------------------------------------------------------
$ ruby -v
ruby 1.8.4 (2005-12-24) [i386-cygwin]

$ time ruby quiz139.rb 68.97.89.187
US

real 0m0.047s
user 0m0.030s
sys 0m0.030s

$ time ruby quiz139.rb 84.191.4.10
DE

real 0m0.046s
user 0m0.046s
sys 0m0.015s
----------------------------------------------------------------

This is on a Pentium M 2.13GHz Laptop with 2GB RAM and rather slow HD.

cheers

Simon

James Edward Gray II

unread,
Sep 14, 2007, 3:38:10 PM9/14/07
to

Wow, I'm impressed. Can't wait to see that code!

James Edward Gray II

Simon Kröger

unread,
Sep 14, 2007, 4:15:08 PM9/14/07
to
> Wow, I'm impressed. Can't wait to see that code!

Thanks. :)

Because startup time startet to take over the benchmark:
----------------------------------------------------------------
$ time ruby quiz139.rb 68.97.89.187 84.191.4.10 80.79.64.128 210.185.128.123
202.10.4.222 192.189.119.1
US
DE
RU
JP
AU
EU

real 0m0.078s
user 0m0.046s
sys 0m0.031s
----------------------------------------------------------------

and by the way: thanks for telling me such a database exists!

cheers

Simon

diego scataglini

unread,
Sep 14, 2007, 5:58:44 PM9/14/07
to
I think that the timing of the scripts are not a good index. It all
depends on what hardware/os you are running it on.
If we want to use speed as an index we should probably have J.E.
compare them all on the same machine.

Maybe we could also write a ruby script that runs all the entry
scripts and time them, and that could be another ruby quiz which will
also be voted on speed and then we could write a ruby script to time
those entries an then we could .... Just ignore this paragraph

Diego Scataglini

James Edward Gray II

unread,
Sep 14, 2007, 6:16:27 PM9/14/07
to
On Sep 14, 2007, at 4:58 PM, diego scataglini wrote:

> I think that the timing of the scripts are not a good index. It all
> depends on what hardware/os you are running it on.
> If we want to use speed as an index we should probably have J.E.
> compare them all on the same machine.

I view it that we are getting a rough idea of speeds, not exact
counts. Both scripts timed so far seem to be able to answer the
question in under a second on semi-current hardware. Good enough for
me.

> Maybe we could also write a ruby script that runs all the entry
> scripts and time them, and that could be another ruby quiz which
> will also be voted on speed and then we could write a ruby script
> to time those entries an then we could .... Just ignore this paragraph

Thank you for volunteering... ;)

James Edward Gray II

Bill Kelly

unread,
Sep 14, 2007, 7:45:12 PM9/14/07
to

From: "Simon Kröger" <SimonK...@gmx.de>

>
> Ok, my script does not need any initialization, it uses the file
> IpToCountry.csv exactly as downloaded.

We probably did something similar. :) Mine also works on the
unmodified IpToCountry.csv file.

$ time ruby 139_ip_to_country.rb 67.19.248.74 70.87.101.66 205.234.109.18 217.146.186.221 62.75.166.87
US
US
US
GB
DE

real 0m0.122s
user 0m0.015s
sys 0m0.000s


(ruby 1.8.4 (2005-12-24) [i386-mswin32], timed from cygwin bash shell,
2GHz athlon64, winxp.)

I don't think the timings are very accurate on this system. It
didn't change much whether I looked up one IP or five.

. . Looking up 80 IPs on one command line resulted in:

real 0m0.242s
user 0m0.015s
sys 0m0.016s


Regards,

Bill


steve d

unread,
Sep 14, 2007, 8:11:24 PM9/14/07
to
From: "Simon Kröger" <SimonK...@gmx.de>
>
> Ok, my script does not need any initialization, it uses the file
> IpToCountry.csv exactly as downloaded.

Same here.

$ time ruby locate.rb 1.2.3.4 4.5.6.77 128.129.130.131 255.255.255.255
no country
US
US
ZZ

real 0m0.287s

time for 300 random IPs is:

real 0m1.187s


Clifford Heath

unread,
Sep 15, 2007, 3:44:26 AM9/15/07
to
Ruby Quiz wrote:
> This week's Ruby Quiz is to write a simple utility. Your program should accept
> an IP address as a command-line argument and print out the two letter code for
> the country that IP is assigned in.

I assume that it's not ok for folk to use my Ruby GeoIP gem?
It has a reasonable compromise between speed and memory usage.
:-)

Clifford Heath.

James Edward Gray II

unread,
Sep 15, 2007, 11:47:55 AM9/15/07
to

Please do. I would love to see how it stacks up against the custom
solutions we will no doubt see.

James Edward Gray II

Jesús Gabriel y Galán

unread,
Sep 16, 2007, 8:37:15 AM9/16/07
to
On 9/14/07, Ruby Quiz <ja...@grayproductions.net> wrote:
> This week's Ruby Quiz is to write a simple utility. Your program should accept
> an IP address as a command-line argument and print out the two letter code for
> the country that IP is assigned in. You can find a database for the matching
> at:
>
> http://software77.net/cgi-bin/ip-country/geo-ip.pl
>
> To keep the problem interesting though, let's write our programs with a focus on
> speed and memory efficiency.

Hi,

This is my second submission to Ruby Quiz, any tip will be greatly
appreciated. First I downloaded the file and removed all comments,
then tried a straightforward approach. Here it is:

require 'faster_csv'
ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

FasterCSV.foreach(file) do |line|
if (line[0].to_i <= ip && line[1].to_i >= ip)
puts line[4]
exit
end
end

It allows to pass the IP as the first parameter, then transforms it to
the format in the file (the "base" 256). The file can be passed as the
second argument, by default I used my modified file without the
comments.

The method uses FasterCSV to parse the file line by line, checking ip
against the range. The result for the IP in the example was:

time ruby quiz139a.rb 68.97.89.187
US

real 0m0.355s
user 0m0.328s
sys 0m0.024s

So, it was a little bit more than the one presented with the problem.
Unfortunately, things don't go so good when you search for an IP that
matches entries at the end of the file:

time ruby quiz139a.rb 255.0.0.1
ZZ

real 0m5.337s
user 0m5.036s
sys 0m0.292s

More than 5 seconds :-(

So, I went on to build a second version. This time the approach I
took, as the file is ordered is a binary search directly on the file.
Here it is:

require 'ftools'

ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

File.open(file) do |f|
low = 0
high = f.stat.size
f.seek(high / 2)
while true
while ((a = f.getc) != 10)
f.seek(-2, IO::SEEK_CUR)
end
pos = f.pos
line = f.readline.split(",")
low_range = line[0][1..-2].to_i
high_range = line[1][1..-2].to_i
if (low_range > ip)
high = pos
offset = (f.pos - pos) + ((high - low) / 2)
f.seek(-offset, IO::SEEK_CUR)
elsif (high_range < ip)
low = f.pos
f.seek((high-low) / 2, IO::SEEK_CUR)
else
puts line[4][1..-2]
exit
end
end
end

What the method does is a binary search in the file. It starts a low
and high variables to manage the partition, then positions the cursor
in the file to the middle. Every time, the method reads back until it
finds a 10 (\n), then reads that line to check the IP range. If the
lower range is higher than the IP, the method moves the cursor in the
file, down to half of the current low-high range. If the higher range
is lower than the IP, then it moves up half of the range. The timings
here are much better:

time ruby quiz139b.rb 68.97.89.187
US

real 0m0.077s
user 0m0.048s
sys 0m0.004s

time ruby quiz139b.rb 255.0.0.1
ZZ

real 0m0.069s
user 0m0.060s
sys 0m0.008s

Apart from the general strategy of the binary search, I haven't had
the time to give too much thought to the details, like how to extract
the values after the readline. Any tips regarding that would be
greatly appreciated. Also I don't know if there's a better way of
moving back in the file to the previous line break (the while (a =
f.getc) != 10 part). Tips?

Thanks for this quiz. I hope I had time to work on every quiz.

Regards,

Jesus.

Simon Kröger

unread,
Sep 16, 2007, 9:13:52 AM9/16/07
to
Hi,

one of the (without a doubt) many binary search solutions. This one reads a 1K
chunk of the file around the current position and uses a regexp to extract
valid lines (and values) from this chunk.

The reasoning behind that is that data from disk is read in blocks anyway (in
most cases even larger than 1K) and the last 2 or 3 iterations of the algorithm
can be avoided. (max depth encountered so far was 14)

Using the regexp frees us from 'manually' looking for newlines (it will find
about 10 valid sets of data in each iteration).

I tried a lot of stuff like biasing the median towards the side where the
solution will be most likely found, using 'static' buffers for the File#read
calls or unrolling the recursion but nothing proofed realy usefull so i decided
to go with the short and fast enough(tm) solution.

------------------------------------------------------------------------------
require 'ipaddr'

class IPAddr
def country_code()
open('IpToCountry.csv'){|file| code(file, to_i, 0, file.stat.size)}
end

private
def code(file, ip, min, max)
median = (min + max) / 2
file.seek(median - 512)
ary = file.read(1024).scan(/^"(\d*)","(\d*)","(?:\w*)","(?:\d*)","(\w*)"/)
return code(file, ip, median, max) if ary.empty? || ary.last[1].to_i < ip
return code(file, ip, min, median) if ary.first[0].to_i > ip
ary.find{|l| l[1].to_i >= ip}[2]
end
end

ARGV.each{|arg| puts IPAddr.new(arg).country_code} if $0 == __FILE__
------------------------------------------------------------------------------

timings:

$ time ruby quiz139.rb 0.0.0.0 68.97.89.187 84.191.4.10 80.79.64.128
210.185.128.123 202.10.4.222 192.189.119.1 255.255.255.255
ZZ


US
DE
RU
JP
AU
EU

ZZ

real 0m0.062s
user 0m0.046s
sys 0m0.030s

cheers

Simon

Matthias Wächter

unread,
Sep 16, 2007, 9:15:00 AM9/16/07
to
Ruby Quiz wrote:
> This week's Ruby Quiz is to write a simple utility. Your program should accept
> an IP address as a command-line argument and print out the two letter code for
> the country that IP is assigned in. You can find a database for the matching
> at:
>
> http://software77.net/cgi-bin/ip-country/geo-ip.pl
>
> To keep the problem interesting though, let's write our programs with a focus on
> speed and memory efficiency.
>
> $ time ruby ip_to_country.rb 68.97.89.187
> US
>
> real 0m0.314s
> user 0m0.259s
> sys 0m0.053s

This is my first submission to Ruby Quiz, so don't expect ground-breaking techniques from this implementation.

Let me characterize my solution: Instead of looking up the csv file directly, I precompile it. The search can then be performed quickly. I think my solution will beat some others when it comes to massive lookups in the range of 100th of thousand lookups.

First, I wrote a random IP address generator. It can alternatively output the IP addresses as a space delimited string or each IP address on a separate line. On my bash submitting all IP address worked up to about 8 k entries until the argument list was too long, so I implemented that the addresses can be given on $stdin also. For a proper argument list on stdin, the entries should be separated by \n, so just add a second argument to rand_ip.rb to generate the list of ips -- the script doesn't care what the second parameter is, it just has to be there (i use "byline").

I use the script like this:

$ ./rand_ip.rb 100000 byline > ip-test-list


Second, I use a precompile run. I don't know how folks here can stay under 100 milliseconds without parsing the file beforehand. My precompilation generates a binary table that stores 10 bytes for each entry: 4 bytes each for start/end of address space and 2 bytes for the country code. Additionally, for saving memory, I squeeze memory areas with adjacent addresses in the same country. This saves more than 50% of the records while it doesn't significantly speed up the search (just about 1 step more in my binary search algorithm, see below). the packer (packr.rb) uses pack("NNa2") for building the binary table, and look, the addresses are stored in network byte order (i.e. big endian). The output table holds about 32 k Entries.

The packer doesn't accept any parameters, just call

$ ./packer.rb


Third comes the actual address search. I implemented two versions for comparison: One with a small RAM footprint that looks up everything in the packed table file. The other one reads the whole table into memory and performs all lookups in memory.

The algorithm is the same in both implementations: I use binary search over all entries until the ip address either matches one range or there is no match at all.

Comparison is sped up by making 4-letter string comparisons of the respective binary addresses. That way "search_ip >= addr_start" works even when the addresses are strings. This saves a lot of time because at search time there is no .to_i calculation avoided.

I run the searchers (search_file.rb, search_str.rb) for small numbers of random IP address using the command line:

$ time ./search_file.rb `./rand_ip.rb 3000` > /dev/null

For more addresses I use one of the following -- the last usage is to make successive runs with equal input:

$ time ./rand_ip.rb 100000 byline | ./search_file.rb > /dev/null
$ ./rand_ip.rb 100000 byline > ip-test-list; time ./search_file.rb < ip-test-list > /dev/null

My results for 100000 lookups are:

$ time ./packer.rb

real 0m1.634s
user 0m1.576s
sys 0m0.056s
$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.797s
user 0m0.740s
sys 0m0.056s
$ time ./search_file.rb < ip-test-list > /dev/null

real 0m11.091s
user 0m9.673s
sys 0m1.420s
$ time ./search_str.rb < ip-test-list > /dev/null

real 0m7.131s
user 0m6.960s
sys 0m0.168s

btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
$ time ruby /dev/null

real 0m0.129s
user 0m0.120s
sys 0m0.012s
$ uname -a
Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux
$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

Cheers,
- Matthias

rand_ip.rb
packer.rb
search_file.rb
search_str.rb

steve d

unread,
Sep 16, 2007, 9:20:44 AM9/16/07
to
On 9/14/07, Ruby Quiz <ja...@grayproductions.net> wrote:
> This week's Ruby Quiz is to write a simple utility. Your program should accept
> an IP address as a command-line argument and print out the two letter code for
> the country that IP is assigned in. You can find a database for the matching
> at:
>
> http://software77.net/cgi-bin/ip-country/geo-ip.pl
>
> To keep the problem interesting though, let's write our programs with a focus on
> speed and memory efficiency.

Hi All,

Here's my submission. It's almost exactly the same as Jesus'. I have a feeling everyone is going to have very similar code for this one.

http://pastie.caboo.se/97332
os = ARGV[0].split(/\./)
ip = os[3].to_i + os[2].to_i*256 + os[1].to_i*256**2 + os[0].to_i*256**3

f = File.open("IpToCountry.csv")

# perform binary search on the file


low = 0
high = f.stat.size

while low <= high
mid = (low + high) / 2

f.seek mid # seek to the middle of our search window
f.seek -2, IO::SEEK_CUR until f.getc == ?\n # walk backwards until we hit a newline

new_high = f.pos - 1
line = f.gets
new_low = f.pos

from, to, x, x, country = line[1..-1].split(/","/)

if to.to_i < ip # we are too low, set new low to after this line
low = new_low
elsif from.to_i > ip # we are too high, set new high to before the last newline
high = new_high
else
puts country; exit
end
end

puts "no country"- steve


James Edward Gray II

unread,
Sep 16, 2007, 1:34:49 PM9/16/07
to
This is the code I used to generate the quiz example. To give credit
where credit is due though, it was heavily inspired from some code
Allan Odgaard shared with me:

#!/usr/bin/env ruby -wKU

require "open-uri"
require "zlib"

begin
require "rubygems"
rescue LoadError
# load without gems
end

begin
require "faster_csv"
FCSV.build_csv_interface
rescue LoadError
require "csv"
end


class IP
def initialize(address)
@address_chunks = address.split(".").map { |n| Integer(n) }
raise AgumentError, "Malformed IP" unless @address_chunks.size == 4
end

def to_i
@address_chunks.inject { |result, chunk| result * 256 + chunk }
end

STRING_SIZE = new("255.255.255.255").to_i.to_s.size

def to_s
"%#{STRING_SIZE}s" % to_i
end
end

class IPToCountryDB
REMOTE = "http://software77.net/cgi-bin/ip-country/geo-ip.pl?
action=download"
LOCAL = "ip_to_country.txt"

RECORD_SIZE = IP::STRING_SIZE * 2 + 5

def self.build(path = LOCAL)
open(path, "w") do |db|
open(REMOTE) do |url|
csv = Zlib::GzipReader.new(url)

last_range = Array.new
csv.each do |line|
next if line =~ /\A\s*(?:#|\z)/
from, to, country = CSV.parse_line(line).values_at(0..1, 4).
map { |f| Integer(f) rescue f }
if last_range[2] == country and last_range[1] + 1 == from
last_range[1] = to
else
build_line(db, last_range)
last_range = [from, to, country]
end
end
build_line(db, last_range)
end
end
end

def self.build_line(db, fields)
return if fields.empty?
db.printf("%#{IP::STRING_SIZE}s\t%#{IP::STRING_SIZE}s\t%s\n",
*fields)
end
private_class_method :build_line

def initialize(path = LOCAL)
begin
@db = open(path)
rescue Errno::ENOENT
self.class.build(path)
retry
end
end

def search(address)
binary_search(IP.new(address).to_i)
end

private

def binary_search(ip, min = 0, max = @db.stat.size / RECORD_SIZE)
return "Unknown" if min == max

middle = (min + max) / 2
@db.seek(RECORD_SIZE * middle, IO::SEEK_SET)

if @db.read(RECORD_SIZE) =~ /\A *(\d+)\t *(\d+)\t([A-Z]{2})\n\z/
if ip < $1.to_i then binary_search(ip, min, middle)
elsif ip > $2.to_i then binary_search(ip, middle + 1, max)
else $3
end
else
raise "Malformed database at offset #{RECORD_SIZE * middle}"
end
end
end

if __FILE__ == $PROGRAM_NAME
require "optparse"

options = {:db => IPToCountryDB::LOCAL, :rebuild => false}

ARGV.options do |opts|
opts.banner = "Usage:\n" +
" #{File.basename($PROGRAM_NAME)} [-d PATH] IP\n" +
" #{File.basename($PROGRAM_NAME)} [-d PATH] -r"

opts.separator ""
opts.separator "Specific Options:"

opts.on("-d", "--db PATH", String, "The path to database file")
do |path|
options[:db] = path
end
opts.on("-r", "--rebuild", "Rebuild the database") do
options[:rebuild] = true
end

opts.separator "Common Options:"

opts.on("-h", "--help", "Show this message.") do
puts opts
exit
end

begin
opts.parse!
raise "No IP address given" unless options[:rebuild] or
ARGV.size == 1
rescue
puts opts
exit
end
end

if options[:rebuild]
IPToCountryDB.build(options[:db])
else
puts IPToCountryDB.new(options[:db]).search(ARGV.shift)
end
end

__END__

James Edward Gray II

Eugene Kalenkovich

unread,
Sep 16, 2007, 2:01:46 PM9/16/07
to
My solution suffers from over-generality (is supposed to work for any sorted
string file by any sorting criteria), and from over-limiting memory usage -
I like Simon's approach, reading in chunks makes everything easier and
faster.

BTW, all solutions already submitted will lie for subnets 1,2 and 5 :)
Most (but not all) will break on out of bounds submissions (256.256.256.256
or 0.0.0.-1, latter if comments are stripped out)

(yes, I know, just trying to find an excuse for not very elegant solution -
edge cases kill elegancy :)

>-----------------------------------------------------------------------------

def bin_find(file,search)
if block_given?
compare = lambda { |a, b| yield(a, b) }
else
compare = lambda { |a, b| a <=> b }
end
open(file) do |f|
return bin_search(f,search,f.lstat.size,&compare)
end
end

def bin_search(f,search,size)
start,finish=0,size
while start<finish
dist=(finish-start)/2
f.seek(start+dist)
f.readline unless start+dist==0
case (l1=f.readline.chomp rescue false) ? yield(search,l1) : -1
when -1
next if (finish=start+dist)>0
break
when 0
return l1
else
case (l2=f.readline.chomp rescue false) ? yield(search,l2) : -1
when -1
return l1
when 0
return l2
else
start+=dist; next
end
end
end
nil
end

nums=[]
out=true
if ARGV[0]=='test'
n=ARGV[1].to_i
n.times{nums << rand(4294967296)}
out=false
else
ARGV.each do |argv|
nums << ((($1.to_i*256)+$2.to_i)*256+$3.to_i)*256+$4.to_i if
argv=~/(\d{1,3}).(\d{1,3}).(\d{1,3}).(\d{1,3})/
end
end
if nums.empty?
puts "Please enter valid ip(s) (or use '#{$0} test NNN' for testing)"
exit
end

nums.each do |num|
ctry='Unknown'
res=bin_find('IpToCountry.csv',num) { |search, str|
str.empty? || str[0,1]!='"' ? 1 : search <=>
str.gsub('"','').split(',')[0].to_i
}.gsub('"','').split(',')
ctry=res[4] if (res[0].to_i..res[1].to_i)===num
puts ctry if out
end

Bill Kelly

unread,
Sep 16, 2007, 3:12:15 PM9/16/07
to
Greetings!

Thanks for the fun quiz.

Looks like my solution is a little more verbose than many
submitted so far.

Like many others, I'm performing a binary search on the
original data file. For better or for worse, I imposed the
following restrictions on my design, just out of curiosity
for how it would turn out:

- non-recursive binary search

- only using seek() and gets() to access the database
(using gets means no backward scanning for the beginning
of a record)

Regards,

Bill


====== Begin file: 139_ip_to_country.rb =====

# IP FROM IP TO REGISTRY ASSIGNED CTRY CNTRY COUNTRY
# "1346797568","1346801663","ripencc","20010601","IL","ISR","ISRAEL"

IpRec = Struct.new(:from, :to, :registry, :assigned, :ctry, :cntry, :country) do
def contains?(ip_value)
ip_value >= self.from && ip_value <= self.to
end
end


class IpToCountry
DATABASE_FNAME = "IptoCountry.csv"

def initialize(db_fname=DATABASE_FNAME)
@db_size = File.stat(db_fname).size
@db = File.open(db_fname, "r")
@db_pos = 0
end

def close
@db.close
@db = nil
end

# Lookup IpRec containing ip. Exception is raised if not found.
def search(ip)
ip_value = self.class.ip_to_int(ip)
find_rec(ip_value)
end

# Binary search through sorted IP database.
# The search uses non-record-aligned byte offsets into the
# file, but always scans forward for the next parsable
# record from a given offset. It keeps track of the
# byte offset past the end of the parsed record in @db_pos.
def find_rec(ip_value)
lower, upper = 0, @db_size - 1
prev_rec = nil
loop do
ofst = lower + ((upper - lower) / 2)
rec = next_rec_from_ofst(ofst)
if rec == prev_rec
# We have narrowed the search to where we're hitting
# the same record each time. Can't get any narrower.
# But these are variable-length records, so there may
# be one or more at our lower bound we haven't seen.
# Do a linear scan from our lower bound to examine the
# few records remaining.
ofst = lower
while (rec = next_rec_from_ofst(ofst)) != prev_rec
return rec if rec.contains? ip_value
ofst = @db_pos
end
raise("no record found for ip_value #{ip_value}")
end
return rec if rec.contains? ip_value
if ip_value < rec.from
upper = ofst
else
lower = @db_pos
end
prev_rec = rec
end
end

def next_rec_from_ofst(ofst)
@db.seek(ofst)
@db_pos = ofst
while line = @db.gets
@db_pos += line.length
break if rec = self.class.parse_rec(line)
end
rec || raise("no record found after ofst #{ofst}")
end

def self.ip_to_int(ip_str)
ip_str.split(".").map{|s|s.to_i}.pack("c4").unpack("N").first
end

# NOTE: Using a strict regexp instead of a simpler split operation,
# because it's important we find a valid record, not one embedded
# in a comment or such.
def self.parse_rec(line)
if line =~ %r{\A \s*"(\d+)"\s*,
\s*"(\d+)"\s*,
\s*"(\w+)"\s*,
\s*"(\d+)"\s*,
\s*"(\w+)"\s*,
\s*"(\w+)"\s*,
\s*"([^"]+)"\s* \z
}x
rec = IpRec.new($1.to_i, $2.to_i, $3, $4.to_i, $5, $6, $7)
end
end
end


if $0 == __FILE__

# Accepts zero-or-more IP addresses on the command line.

ip2c = IpToCountry.new

ARGV.each do |ip|
rec = ip2c.search(ip) rescue nil
puts( rec ? rec.ctry : "(#{ip} not found)" )
end

end

====== End file: 139_ip_to_country.rb =====

====== Begin file: test_139_ip_to_country.rb ======

require '139_ip_to_country'

require 'test/unit'

class TestIpToCountry < Test::Unit::TestCase

# NOTE: the following are the first two and last two records in
# my database file:
REC_0 = IpToCountry.parse_rec(%{"0","16777215","IANA","410227200","ZZ","ZZZ","RESERVED"})
REC_1 = IpToCountry.parse_rec(%{"50331648","67108863","ARIN","572572800","US","USA","UNITED STATES"})
REC_NEG_1 = IpToCountry.parse_rec(%{"4261412864","4278190079","IANA","410227200","ZZ","ZZZ","RESERVED"})
REC_LAST = IpToCountry.parse_rec(%{"4278190080","4294967295","IANA","410227200","ZZ","ZZZ","RESERVED"})

def test_find_rec
ip2c = IpToCountry.new
assert_equal( REC_0, ip2c.find_rec(REC_0.from) )
assert_equal( REC_0, ip2c.find_rec(REC_0.to) )
assert_equal( REC_1, ip2c.find_rec(REC_1.from) )
assert_equal( REC_1, ip2c.find_rec(REC_1.to) )
assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.from) )
assert_equal( REC_NEG_1, ip2c.find_rec(REC_NEG_1.to) )
assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.from) )
assert_equal( REC_LAST, ip2c.find_rec(REC_LAST.to) )
ip2c.close
end

def test_search
ip2c = IpToCountry.new
rec = ip2c.search("67.19.248.74")
assert_not_nil( rec )
assert_equal( "ARIN", rec.registry )
assert_equal( "US", rec.ctry )
end

end

====== End file: test_139_ip_to_country.rb ======

Bill Kelly

unread,
Sep 16, 2007, 3:27:41 PM9/16/07
to

From: "Eugene Kalenkovich" <rub...@softover.com>

>
> BTW, all solutions already submitted will lie for subnets 1,2 and 5 :)
> Most (but not all) will break on out of bounds submissions (256.256.256.256
> or 0.0.0.-1, latter if comments are stripped out)

Hi, could you clarify what is meant by lying about subnets
1, 2, and 5?


Thanks,

Bill


Eugene Kalenkovich

unread,
Sep 16, 2007, 7:43:01 PM9/16/07
to
"Bill Kelly" <bi...@cts.com> wrote in message
news:00b301c7f897$a8c9dc20$6442a8c0@musicbox...
Check what ccountry is 5.1.1.1. If you get any valid answer - this answer
is a lie :)


Bill Kelly

unread,
Sep 16, 2007, 8:21:04 PM9/16/07
to

From: "Eugene Kalenkovich" <rub...@softover.com>

Ah, OK. I get:

ruby 139_ip_to_country.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
ZZ
(1.1.1.1 not found)
(2.1.1.1 not found)
US
US
(5.1.1.1 not found)


Regards,

Bill

James Edward Gray II

unread,
Sep 16, 2007, 9:46:20 PM9/16/07
to

Well, we weren't all broken:

$ ruby ip_to_country.rb 5.1.1.1
Unknown

James Edward Gray II

Eugene Kalenkovich

unread,
Sep 16, 2007, 11:58:38 PM9/16/07
to
"James Edward Gray II" <ja...@grayproductions.net> wrote in message
news:4A414BD9-8B56-4897-A13E-

> Well, we weren't all broken:

I've sent my comment and refreshed new headers - and yes, your solution came
in :)

--EK


Jesús Gabriel y Galán

unread,
Sep 17, 2007, 3:42:09 AM9/17/07
to
After this:

On 9/16/07, steve d <oks...@yahoo.com> wrote:
> Here's my submission. It's almost exactly the same as Jesus'.

[...]
> while low <= high
[...]

and this:

On 9/16/07, Eugene Kalenkovich <rub...@softover.com> wrote:
> BTW, all solutions already submitted will lie for subnets 1,2 and 5 :)
> Most (but not all) will break on out of bounds submissions (256.256.256.256
> or 0.0.0.-1, latter if comments are stripped out)

I made a couple of modifications to my code. Here is the latest version:

require 'ftools'

ip = ARGV[0].split(/\./)
ip = ip[0].to_i * 16777216 + ip[1].to_i * 65536 + ip[2].to_i * 256 + ip[3].to_i
file = ARGV[1] || 'ipdb.csv'

File.open(file) do |f|


low = 0
high = f.stat.size

f.seek(high / 2)
while low < high
while (((a = f.getc) != 10) && (f.pos > 2))


f.seek(-2, IO::SEEK_CUR)
end
pos = f.pos
line = f.readline.split(",")
low_range = line[0][1..-2].to_i
high_range = line[1][1..-2].to_i
if (low_range > ip)
high = pos
offset = (f.pos - pos) + ((high - low) / 2)
f.seek(-offset, IO::SEEK_CUR)
elsif (high_range < ip)
low = f.pos
f.seek((high-low) / 2, IO::SEEK_CUR)
else
puts line[4][1..-2]
exit
end
end

puts "No country found"
end

I changed to while low < high, and also the walking backwards failed
when reading the first line (btw, Steve, won't your solution also fail
for that case?).

Now:


time ruby quiz139b.rb 2.1.1.1
No country found

real 0m0.030s
user 0m0.016s
sys 0m0.004s

time ruby quiz139b.rb 5.1.1.1
No country found

real 0m0.032s
user 0m0.008s
sys 0m0.008s

time ruby quiz139b.rb 1.1.1.1
No country found

real 0m0.033s
user 0m0.016s
sys 0m0.000s

time ruby quiz139b.rb 256.256.256.256
No country found

real 0m0.096s
user 0m0.008s
sys 0m0.000s

Thanks,

Jesus.

Erik Veenstra

unread,
Sep 17, 2007, 4:57:15 PM9/17/07
to
I decided to make one extra, reusable (?), abstraction layer:
Class OrderedLinesFile implements method find_line, which is
given a block. This block is called with a line of the file.
The block has to parse this line and returns -1, 0 or 1 for "go
back", "bingo!" and "go forward", respectively. Method
find_line then repositions the pointer in the file and calls
block again, if necessary.

Please shoot...

gegroet,
Erik V. - http://www.erikveen.dds.nl/

BTW, lines 17899 and 17900 of the downloaded file both contain
"2624585728", which is strange... (Downloaded: 2007-09-17 08:34
(UTC).)

----------------------------------------------------------------

$ time ruby quiz139a.rb 11.22.33.44
11.22.33.44 US

real 0m0.011s
user 0m0.006s
sys 0m0.004s

----------------------------------------------------------------

$ wc all_ip_blocks
82358 82358 905115 all_ip_blocks

$ time ruby quiz139b.rb all_ip_blocks

real 0m45.681s # That's 1800 IPs per second.
user 0m40.351s
sys 0m5.300s

----------------------------------------------------------------

class OrderedLinesFile
def initialize(file_name)
@file_name = File.expand_path(file_name)
end

def find_line(&block)
line = nil

File.open(@file_name) do |io|
position = 0
delta = File.size(@file_name)/2
line = io.gets # The first line of the file.
line = io.gets while /^\s*#/ =~ line or /^\s*$/ =~
line

while delta > 0 and line and (space_ship = block.call(line)) !=
0
position += space_ship < 0 ? -delta : +delta
delta /= 2

if position > 0
io.pos = position
broken_line = io.gets # Skip the current (broken)
line.
line = io.gets # The current line of the
file.
line = io.gets while /^\s*#/ =~ line or /^\s*
$/ =~ line
else
delta = 0 # Stop.
end
end

line = nil if delta == 0 # Nothing found.
end

line
end
end

class IpToCountry
FILE = "IpToCountry.csv"

def country_of(ip_addr)
ip_addr = ip_addr.split(/\./).collect{|n| n.to_i}.inject{|n,
m| n*256+m} # "1.2.3.4" --> 16909060
olf = OrderedLinesFile.new(FILE)
res = nil

olf.find_line do |line|
ip_from, ip_to, registry, assigned, ctry, cntry, country =
line.gsub(/"/, "").split(/,/, 7)

if ip_addr < ip_from.to_i
-1 # Go back in the file.
elsif ip_addr > ip_to.to_i
+1 # Go forward in the file.
else
res = ctry
0 # Bingo!
end
end

res
end
end

itc = IpToCountry.new

ARGV.each do |ip_addr|
puts "%-16s %s" % [ip_addr, itc.country_of(ip_addr)||"??"]
end

----------------------------------------------------------------


Erik Veenstra

unread,
Sep 17, 2007, 5:12:44 PM9/17/07
to
$ ruby quiz139.rb 0.1.1.1 1.1.1.1 2.1.1.1 3.1.1.1 4.1.1.1 5.1.1.1
0.1.1.1 ZZ
1.1.1.1 ??
2.1.1.1 ??
3.1.1.1 US
4.1.1.1 US
5.1.1.1 ??

:}

gegroet,
Erik V.


steve d

unread,
Sep 17, 2007, 6:36:23 PM9/17/07
to
> I changed to while low < high, and also the walking backwards failed
> when reading the first line (btw, Steve, won't your solution also fail
> for that case?).

Yea it seems that mine will loop forever if you give an invalid ip address that evalutes to less than 0. Changing to 'low < high' seems to fix that. Also it incorrectly reports nil instead of no country on 0.0.0.0. This is all running on the stock csv file (with comments), haven't tried it at all on a modified file.

- steve

Matthias Wächter

unread,
Sep 18, 2007, 5:00:37 AM9/18/07
to
I just performed the tests another time on my desktop computer with a
freshly installed Sabayon Linux. I don't know why this one performs that
much better -- it's gentoo-based, it's the same file system, but it has
no SCSI controller - maybe that makes it faster? Or is it the dual-core
.. anyway.

It's about twice as fast and requires a fraction of the "sys" times.
Also the "empty" ruby run is only 5 ms. Mysteriously.

The packer is down at 0.8s, and 100000 lookups are at 6.2 s (file-based
aka memory efficient) and 3.7 s (everything in-memory).

Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*

Matthias Wächter schrieb:


> My results for 100000 lookups are:
>
> $ time ./packer.rb
>
> real 0m1.634s
> user 0m1.576s
> sys 0m0.056s

$ time ./packer.rb

real 0m0.716s
user 0m0.708s
sys 0m0.008s

> $ time ./rand_ip.rb 100000 byline > ip-test-list
>
> real 0m0.797s
> user 0m0.740s
> sys 0m0.056s

$ time ./rand_ip.rb 100000 byline > ip-test-list

real 0m0.322s
user 0m0.317s
sys 0m0.005s

> $ time ./search_file.rb < ip-test-list > /dev/null
>
> real 0m11.091s
> user 0m9.673s
> sys 0m1.420s

$ time ./search_file.rb <ip-test-list >/dev/null

real 0m6.201s
user 0m5.316s
sys 0m0.885s

> $ time ./search_str.rb < ip-test-list > /dev/null
>
> real 0m7.131s
> user 0m6.960s
> sys 0m0.168s

$ time ./search_str.rb <ip-test-list >/dev/null

real 0m3.714s
user 0m3.707s
sys 0m0.006s

> btw: I don't understand the following result -- can you help me improving this? 129 ms just for firing ruby up! Can't be!
> $ time ruby /dev/null
>
> real 0m0.129s
> user 0m0.120s
> sys 0m0.012s

$ time ruby /dev/null

real 0m0.005s
user 0m0.004s
sys 0m0.001s

> $ uname -a
> Linux server 2.6.21-modgentoo #2 SMP Wed May 2 19:07:13 CEST 2007 x86_64 AMD Athlon(tm) 64 Processor 3000+ AuthenticAMD GNU/Linux

$ uname -a
Linux sabayon2me 2.6.22-sabayon #1 SMP Mon Sep 3 00:33:06 UTC 2007
x86_64 Intel(R) Core(TM)2 CPU 6600 @ 2.40GHz GenuineIntel GNU/Linux

> $ ruby --version
> ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

$ ruby --version
ruby 1.8.6 (2007-03-13 patchlevel 0) [x86_64-linux]

- Matthias

James Edward Gray II

unread,
Sep 18, 2007, 8:35:37 AM9/18/07
to
On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:

> Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*

Great. Run it for us and let us know how we do. ;)

James Edward Gray II

Message has been deleted
Message has been deleted

Adam Shelly

unread,
Sep 18, 2007, 9:15:25 PM9/18/07
to
On 9/18/07, Bill Kelly <bi...@cts.com> wrote:
>
> >> On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
> >
> > Here are the results of the supplied solutions so far, and it looks like my solution can take the 100k-performance victory :)
>
> If I've understood correctly, it looks like my solution seems
> to be the fastest (so far) of those that operate on the
> unmodified .csv file?
>

It depends what you mean by unmodified - my algorithm runs off the
original file, the only "modification" I am doing in the setup stage
is searching for and saving the byte offset of the first and last
records. It looks like l could have done that every time my script
was run and only added 5 ms.

> I would have bet Simon's would be faster. Strange!

I thought block file reads would be faster too, that was the next
thing I was planning to try. Maybe it's the regexp that slowed it
down.


-Adam

Message has been deleted
Message has been deleted

Jesús Gabriel y Galán

unread,
Sep 19, 2007, 3:27:23 AM9/19/07
to
On 9/19/07, Matthias Wächter <matt...@waechter.wiz.at> wrote:
> James Edward Gray II schrieb:

> > On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
> >> Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*
> > Great. Run it for us and let us know how we do. ;)
>
> [**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
> the later version you sent your direct parsing breaks when it comes to
> detecting the commented lines in the first part of the file. I couldn't
> manage to make it run, sorry.

Hi,

Yes, I only tested with my version of the file, from which I manually
removed the comments. I don't think I'll have time to fix that, at
least this week. Anyway, I find strange the FasterCSV version doesn't
work, because it delegates the parsing of the file to that gem, and
the rest is pretty simple. On the other hand I don't expect the first
version to perform anywhere near the other solutions, so it's not so
important :-).

Jesus.

James Edward Gray II

unread,
Sep 19, 2007, 8:19:56 AM9/19/07
to
On Sep 19, 2007, at 2:27 AM, Jesús Gabriel y Galán wrote:

> On 9/19/07, Matthias Wächter <matt...@waechter.wiz.at> wrote:
>> James Edward Gray II schrieb:
>>> On Sep 18, 2007, at 4:00 AM, Matthias Wächter wrote:
>>>> Anyway -- i'd like to see a 100000 lookups comparison :) *hehe*
>>> Great. Run it for us and let us know how we do. ;)
>>
>> [**]: O Jesus :), I can't make your FasterCSV version (a) run, and in
>> the later version you sent your direct parsing breaks when it
>> comes to
>> detecting the commented lines in the first part of the file. I
>> couldn't
>> manage to make it run, sorry.
>
> Hi,
>
> Yes, I only tested with my version of the file, from which I manually
> removed the comments. I don't think I'll have time to fix that, at
> least this week. Anyway, I find strange the FasterCSV version doesn't
> work, because it delegates the parsing of the file to that gem, and
> the rest is pretty simple.

Comments are not a part of the CSV specification, so FasterCSV
doesn't address them. I would like to add ignore patterns at some
point though.

James Edward Gray II


Mark Thomas

unread,
Sep 19, 2007, 11:11:42 AM9/19/07
to
Ok, here's the result of mine:

mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
UK

real 0m0.004s
user 0m0.004s
sys 0m0.000s


Here's my code:

#!/usr/bin/ruby
ARGV.each { |ip|
f = ip.split(/\./).join "/"
puts File.open(f).readlines[0] rescue puts "Unknown"
}

I think it's pretty obvious what the preparation step was. Of course,
the tradeoff for this speed is a MASSIVE waste of disk resources, but
that was unlimited in this contest, was it not? :-)

- Mark.

James Edward Gray II

unread,
Sep 19, 2007, 11:42:45 AM9/19/07
to

Pretty clever.

I bet with the right prep, this could even be a pretty viable
approach. Instead of building a file for each address you could
create a directory structure for the hexadecimal representations of
each piece of the address. The final layer could be handled as you
have here or with a search through a much smaller file.

James Edward Gray II

Bill Kelly

unread,
Sep 19, 2007, 1:33:37 PM9/19/07
to

From: "James Edward Gray II" <ja...@grayproductions.net>

> On Sep 19, 2007, at 10:15 AM, Mark Thomas wrote:
>
>> Ok, here's the result of mine:
>>
>> mark@server1:~/rubyquiz/139$ time ./ip2country.rb 195.135.211.255
>> UK
>>
>> real 0m0.004s
>> user 0m0.004s
>> sys 0m0.000s
>>
>>
>> Here's my code:
>>
>> #!/usr/bin/ruby
>> ARGV.each { |ip|
>> f = ip.split(/\./).join "/"
>> puts File.open(f).readlines[0] rescue puts "Unknown"
>> }
>>
>> I think it's pretty obvious what the preparation step was. Of course,
>> the tradeoff for this speed is a MASSIVE waste of disk resources, but
>> that was unlimited in this contest, was it not? :-)

LOL! Nice... :)

> Pretty clever.
>
> I bet with the right prep, this could even be a pretty viable
> approach. Instead of building a file for each address you could
> create a directory structure for the hexadecimal representations of
> each piece of the address. The final layer could be handled as you
> have here or with a search through a much smaller file.

Indeed, I was thinking last night about preprocessing the data
into an on-disk hash table. I was thinking about a flat file
at the time, but one could use a subdirectory technique like the
above... using hex components of the hash value to index through
a couple subdirs to reach a leaf file containing one or more
records.

Another thing I'd like to try but won't have time for, is to
use ruby-mmap somehow. (Preprocess the data into a flat binary
file, then use ruby-mmap when performing the binary search.)

Anyway thanks for the fun quiz. :)


Regards,

Bill

James Koppel

unread,
Sep 19, 2007, 3:22:52 PM9/19/07
to
I am extremely pleased with my solution this week:

>ruby IPToCountry.rb 121.121.121.121
MY
0.000000 0.000000 0.000000 ( 0.000000)

Wrapping the code in 100.times {...}

>ruby IPToCountry.rb 121.121.121.121
MY
MY
...
MY
MY
0.062000 0.000000 0.062000 ( 0.078000)

Hmmmm...clearly I'm using a supercomputer. :P


My solution uses no initialization run. Instead, it uses a binary search, making good use of IO#seek and IO#pos. It's a little atypical though - it uses different numbers to compare less than and greater than, and looking at a position really means looking at the first two numbers on the next line.

Here's the code:

require 'benchmark'
puts Benchmark.measure { 100.times {

dot_dec_ip = ARGV[0].chomp

dec_ip = dot_dec_ip[0..2].to_i << 24
dot_dec_ip = dot_dec_ip[(dot_dec_ip.index(?.)+1)..-1]
dec_ip += dot_dec_ip[0..2].to_i << 16
dec_ip += dot_dec_ip[dot_dec_ip.index(?.)+1,3].to_i << 8
#Last 8 bits are all in the same country; they don't matter

dec_ip = dec_ip

dataf = File.new("IPToCountry.csv")

###Begin binary search, finding high and low

#Hardcoded character offset of where to start. This should be the index of
#a character on the last line of comments
#
#Earlier versions used 0 or calculated this each iteration.
#The former yielded bad results (for obvious reasons);
#the latter doubled the time needed.
low = 6603

dataf.seek(0,IO::SEEK_END)
flen = dataf.pos

high = flen

while true
if low == high - 1
puts "IP not assigned"
break
end
mid = (low + high) >> 1
dataf.seek(mid,IO::SEEK_SET)
dataf.gets
dataf.getc
range_start = dataf.gets('"')
range_start.slice!(-1)
range_start = range_start.to_i
cmpno = dec_ip <=> range_start
if cmpno == -1
high = mid
next
else
dataf.read(2)
range_end = dataf.gets('"')
range_end.slice!(-1)
range_end = range_end.to_i
if (dec_ip <=> range_end) == 1
low = mid
next
else
puts dataf.gets.match(/"(\w\w)"/)[1]
break
end
end
end
}}

----- Original Message ----
From: Ruby Quiz <ja...@grayproductions.net>
To: ruby-talk ML <ruby...@ruby-lang.org>
Sent: Friday, September 14, 2007 7:31:41 AM
Subject: [QUIZ] IP to Country (#139)

The three rules of Ruby Quiz:

1. Please do not post any solutions or spoiler discussion for this quiz until
48 hours have passed from the time on this message.

2. Support Ruby Quiz by submitting ideas as often as you can:

http://www.rubyquiz.com/

3. Enjoy!

Suggestion: A [QUIZ] in the subject of emails about the problem helps everyone
on Ruby Talk follow the discussion. Please reply to the original quiz message,
if you can.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

This week's Ruby Quiz is to write a simple utility. Your program should accept
an IP address as a command-line argument and print out the two letter code for
the country that IP is assigned in. You can find a database for the matching
at:

http://software77.net/cgi-bin/ip-country/geo-ip.pl

To keep the problem interesting though, let's write our programs with a focus on
speed and memory efficiency.

$ time ruby ip_to_country.rb 68.97.89.187
US

real 0m0.314s
user 0m0.259s
sys 0m0.053s



____________________________________________________________________________________
Be a better Globetrotter. Get better travel answers from someone who knows. Yahoo! Answers - Check it out.
http://answers.yahoo.com/dir/?link=list&sid=396545469

Simon Kröger

unread,
Sep 19, 2007, 4:43:29 PM9/19/07
to
Adam Shelly wrote:

>> I would have bet Simon's would be faster. Strange!
>
> I thought block file reads would be faster too, that was the next
> thing I was planning to try. Maybe it's the regexp that slowed it
> down.

Without looking at the other solutions in detail i think one of the problems
may be that my solution opens the file for each lookup - thats of course easy
to fix. I don't know if thats the problem or the overhead of creating ten
thousand IPAddr objects - i refuse to analyse this in depth because i don't
have a usecase for looking up that many locations in a single run.

(on the other hand i do understand how much fun it can be to optimize such a
problem to death - so go on if you like, i don't have the motivation - this time :)

cheers

Simon

steve d

unread,
Sep 19, 2007, 11:51:50 PM9/19/07
to
I did some benchmarking of my own. This one tests out the performance of a single lookup. I only ran this on solutions which work with IpToCountry.csv out of the box (aka do no prep work).

This benchmark ran each solution 1000 times with random IPs (invoking a new ruby instance on each run).

-bash-2.05b$ ruby bm.rb
sorted by real time
[["steved.rb", 10.5210349559784],
["adam.rb", 10.817939043045],
["justin.rb", 11.6265380382538],
["james.rb", 12.3434510231018],
["eugene.rb", 12.9024469852448],
["jesus.rb", 13.2812840938568],
["erik.rb", 13.3421401977539],
["simon.rb", 14.0733840465546]]

sorted by sys time
[["simon.rb", 0.796875],
["steved.rb", 1.3046875],
["adam.rb", 1.5078125],
["justinrb", 1.515625],
["jesus.rb", 1.765625],
["eugene.rb", 2.265625],
["erik.rb", 2.421875],
["james.rb", 2.4609375]]

sorted by user time
[["simon.rb", 0.078125],
["eugene.rb", 0.09375],
["jesus.rb", 0.109375],
["steved.rb", 0.1171875],
["justin.rb", 0.1171875],
["jamesrb", 0.125],
["adam.rb", 0.125],
["erik.rb", 0.1953125]]


Here's the benchmark source:

require 'benchmark'
require 'pp'

random_ips = Array.new(1000).map { Array.new(4).map {rand(256)}.join('.') }

sols = Dir["*.rb"] - [$0]

times = {}

sols.each do |fn|
stats = Benchmark.measure do
random_ips.each do |ip|
%x{ ruby #{fn} #{ip} }
end
end
times[fn] = stats
end

sort_times = proc { |m| times.sort_by{|(k,v)| v.send(m)}.map{|(k,v)| [k, v.send(m)]} }

puts "sorted by real time"
pp sort_times[:real]
puts

puts "sorted by sys time"
pp sort_times[:stime]
puts

puts "sorted by user time"
pp sort_times[:utime]


Also, Adam, your solution seems to give incorrect answers sometimes, I noticed 33.33.33.33 and 88.88.88.88 both gave Not Found where as all the other solutions gave US and NO respectively.

- steve


Bill Kelly

unread,
Sep 20, 2007, 1:00:12 AM9/20/07
to

From: "steve d" <oks...@yahoo.com>

>
> I did some benchmarking of my own. This one tests out the
> performance of a single lookup. I only ran this on solutions
> which work with IpToCountry.csv out of the box (aka do no prep work).

Cool. I guess mine was overlooked but it also does no prep work:
http://groups.google.com/group/comp.lang.ruby/msg/6162dae2bcf1bc3f?dmode=source


Regards,

Bill

steve d

unread,
Sep 20, 2007, 2:02:00 AM9/20/07
to
From: Bill Kelly <bi...@cts.com>

> Cool. I guess mine was overlooked but it also does no prep work:

Sorry about that, re-ran with yours included:

10000 times each:


-bash-2.05b$ ruby bm.rb
sorted by real time

[["james.rb", 97.2237520217896],
["adam.rb", 100.450285196304],
["steved.rb", 101.930130004883],
["erik.rb", 103.835056066513],
["bill.rb", 104.454197883606],
["eugene.rb", 105.771279811859],
["justin.rb", 106.694982051849],
["jesus.rb", 118.557897090912],
["simon.rb", 141083032131195]]

sorted by sys time
[["simon.rb", 16.359375],
["bill.rb", 17.6328125],
["erik.rb", 17.7109375],
["eugene.rb", 17.734375],
["steved.rb", 17.84375],
["justin.rb", 17.921875],
["jesus.rb", 17.9921875],
["adam.rb", 18.046875],
["james.rb", 18.140625]]

sorted by user time
[["adam.rb", 0.7578125],
["justin.rb", 0.8203125],
["erik.rb", 0.8203125],
["steved.rb", 0.84375],
["jesus.rb", 0.8984375],
["bill.rb", 0.9140625],
["james.rb", 0.9140625],
["simon.rb", 0.9140625],
["eugene.rb", 0.953125]]


Bill Kelly

unread,
Sep 20, 2007, 4:01:53 AM9/20/07
to

From: "steve d" <oks...@yahoo.com>

> From: Bill Kelly <bi...@cts.com>
> > Cool. I guess mine was overlooked but it also does no prep work:
>
> Sorry about that, re-ran with yours included:

Thanks! :D


Ruby Quiz

unread,
Sep 20, 2007, 8:18:03 AM9/20/07
to
Matthias Wätcher made a classic mistake this week. He sent in his first ever
quiz solution and told us not to expect anything special out of it. We later
learned that it was a lot faster than all of the other solutions, thanks to more
effort from Matthias. His code taught me some new tricks and I want to share
those with you.

But first, let's talk speed.

Most of the solutions are quite quick, relatively speaking. What's the shared
secret of speed? Binary search. The records are in sorted order, so it's
possible to perform a simple cut-the-possible-matches-in-half-at-each-step
lookup. That favored algorithm makes short work of what could otherwise be a
lengthy search.

Now you can do a binary search on the database file as is and several solutions
did. This is a little trickier, because the records are not of a standard size.
You have to be super careful not to accidently skip records. While the
solutions handled the challenge very well, you can make things quite a bit
easier if you are willing to preprocess the file.

You can also speed things up with some clever preprocessing. That was the other
trick Matthias used to find answers so quickly.

Matthias realized that while the actual records contain quite a bit of data, we
only really care about the start of a range, the end of a range, and the country
code. You can fit all of that in just ten bytes with four for each address and
two for the country.

The file also contains some consecutive ranges that can be collapsed for the
purposes of our search. Such ranges may differ in some attributes, but as long
as their country codes are the same we can treat them as a single unit.

Now that you know the goal, let's take a look at the packing code Matthias sent
in:

#!/usr/bin/ruby
# comment

last_start=nil
last_end=nil
last_country=nil
File.open("packed-ip.dat","wb") do |wfile|
IO.foreach("geo-ip.csv") do |line|
next if !(line =~ /^"/ )
s,e,d1,d2,co=line.delete!("\"").split(",")
s,e = s.to_i,e.to_i
if !last_start
# initialize with first entry
last_start,last_end,last_country = s,e,co
else
if s==last_end+1 and co==last_country
# squeeze if successive ranges have zero gap
last_end=e
else
# append last entry, remember new one
wfile << [last_start,last_end,last_country].pack("NNa2")
last_start,last_end,last_country = s,e,co
end
end
end
# print last entry
if last_start
wfile << [last_start,last_end,last_country].pack("NNa2")
end
end

The three variables declared up front are for tracking the ranges. These will
be used to collapse consecutive ranges.

The next bit of code creates the binary database we will write into and parses
the CSV formated data. Though the data is in the CSV format, the actual content
is trivial and you don't need a proper parser to get at it. As you can see,
Matthias just checks for lines starting with a quote, pulls out all quote
characters, and split()s on commas. This is faster than using a parser.

The if/else chain in the middle of the code does most of the work. First, it
stores the initial range in the tracking variables. For all other ranges it
tests to see if it is consecutive with the last one recorded and has the same
country code. When it does, the endpoint of the last range is just bumped to
include them both. When it doesn't, the last range is written to the file and
the code starts tracking the new range. The final if statement ensures that we
write out the final range before exiting.

This really isn't too much work and it runs in under two seconds on my box.
That's a small price to pay for a speed gain every time we look up an address.

Let's dive into the search code now. Here's how it begins:

#!/usr/bin/ruby

# take command line or stdin -- the latter has performance advantage
# for long lists
if ARGV[0]
arr=ARGV
else
arr=$stdin
end

# ...

As the comment tells us, Matthias added another way to feed the program
addresses. Where I said we should take one from the command-line arguments,
this code actually handles any number of addresses from command-line arguments
or STDIN.

This next bit of code opens up our database:

# ...

# the binary table file is looked up with each request
File.open("packed-ip.dat","rb") do |rfile|
rfile.seek(0,IO::SEEK_END)
record_max=rfile.pos/10-1

# ...

The seek() and pos() calls here are used to find the end of the file. You need
to know both ends of a data set to perform a binary search. Note that dividing
by ten gives us the count of records since they are a fixed width and Matthias
subtracts one because we will never need to seek() to the end of the last record
again.

Now there's a touch more preparation for each address we want to find:

# ...

arr.each { |argv|
# build a 4-char string representation of IP address
# in network byte order so it can be a string compare below
ipstr= argv.split(".").map {|x| x.to_i.chr}.join

# ...

In order to avoid a bunch of to_i() calls on each range extracted from the
database, Matthias just encodes the address we are hunting for into the
character encoding used in the database. This way simple String comparisons can
determine if the address it in the range.

We're now ready for the actual search code:

# ...

# low/high water marks initialized
low,high=0,record_max
while true
mid=(low+high)/2 # binary search median
rfile.seek(10*mid) # one record is 10 byte, seek to position
str=rfile.read(8) # for range matching, we need only 8 bytes
# at comparison, values are big endian, i.e. packed("N")
if ipstr>=str[0,4] # is this IP not below the current range?
if ipstr<=str[4,4] # is this IP not above the current range?
puts rfile.read(2) # a perfect match, voila!
break
else
low=mid+1 # binary search: raise lower limit
end
else
high=mid-1 # binary search: reduce upper limit
end
if low>high # no entries left? nothing found
puts "no country"
break
end
end
}
end

This is a pretty textbook binary search. You always begin by going to the
middle of the current low and high. In this case that's a seek() call and we
have to remember to multiply the midpoint by our record size of ten.

Once we are at the right record, the code read()s the first eight bytes and
compares the address against the low and high for the range. When it is in the
range, the final two bytes are read and printed. If the address is below the
current range, we drop the high to below the current range. If it's above, we
raise the low to above the current range. A final check is added to catch the
cases where no match was found, in which case our low will bypass the high.

Remember that a second goal of the quiz described search was to be memory
friendly. This code does terrific on that front since only one record needs to
be in memory at a time. Once the checks are done, it can be replaced by the
next record.

My thanks to all who showed so many great examples of this classic algorithm.

Tomorrow we will write programs that can help me to better understand my
wardrobe...

0 new messages