WHY IS RUBY SO SLOW?
I implemented a _DataReader_ class in Ruby and Python. The reader:
- reads in a CSV file, in this case tab-separated,
- gets variable names from the header line,
- splits up each row into single items,
- checks for and counts missing values,
- determines the type of each item using regular expressions
(integer, float, or otherwise string), and
- counts the number of unique items in each column
and finally outputs a short report on what it found. This report is
quite useful even if you later perform data mining tasks on the
data with other tools.
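
The per-item checks in the list above can be sketched roughly as follows. This is only a minimal illustration, not the full class listed further down: the names `classify_item` and `summarize_column` are made up for this sketch, and distinct values are counted with a Hash, which is an assumption of the sketch rather than what the listing does.

```ruby
# Hypothetical helpers for illustration; these names are not part of DataReader.

INTEGER_RE = /\A\s*[+\-]?\d+\s*\z/
FLOAT_RE   = /\A\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*\z/

# Classify one field as :missing, :integer, :float or :string.
def classify_item(item, missing = "?")
  if item == missing
    :missing
  elsif item =~ INTEGER_RE
    :integer
  elsif item =~ FLOAT_RE
    :float
  else
    :string
  end
end

# Summarize one column: widest type seen, number of missing values,
# and number of distinct non-missing values.
def summarize_column(items, missing = "?")
  rank = { :missing => 0, :integer => 1, :float => 2, :string => 3 }
  widest = :missing
  n_missing = 0
  seen = {}
  items.each do |item|
    kind = classify_item(item, missing)
    if kind == :missing
      n_missing += 1
    else
      seen[item] = true
    end
    widest = kind if rank[kind] > rank[widest]
  end
  { :type => widest, :missing => n_missing, :unique => seen.size }
end

# One tab-separated row, split into single items and classified.
fields = "17\t3.5\t?\tabc".split("\t")
p fields.map { |f| classify_item(f) }

# One small column with a repeated value and a missing value.
p summarize_column(%w[1 2 2 ? 3.0])
```

A column is reported with the widest type any of its items needs, so a single float among integers makes the whole column a float column.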
The implementation is straightforward, with no attempt at optimization
in this first version. I tested it on a fairly large data file of
4.3 MB with 1.6 million data items, most of them integers.
Here are the running times for some available Ruby implementations
under Windows:
data items                      1,600,000      320,000
------------------------------------------------------
Ruby 1.6.5-2                    17:10 min       46 sec
Ruby 1.6.6-0                    18:43 min       58 sec
Ruby 1.7.2 (i586-mswin32)       18:05 min       54 sec
As a comparison, I implemented the same method in Python, with the
following results:
Python 2.1.1 (Zope)                58 sec       10 sec
Python 2.2                         49 sec        9 sec
ActiveState Python 2.2a            49 sec       11 sec
I also tested the data with the _read.table_ function of the
open-source statistical package *R*, which has almost the same
functionality (in a way, I tried to model my reader on it):
R::read.table                      30 sec        2 sec
One can see that the Python implementation compares reasonably with
such a well-known package. Unfortunately, the Ruby implementation of
the same method is *unacceptably* slow.
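
Put as ratios, the figures in the tables above look like this. The snippet is just arithmetic on the reported times (converted to seconds); the constant and method names are made up for the illustration.

```ruby
# Wall-clock times in seconds, copied from the tables above:
# [time for 1,600,000 items, time for 320,000 items]
# (17:10 min = 1030 s, 18:43 min = 1123 s, 18:05 min = 1085 s).
TIMES = {
  "Ruby 1.6.5-2" => [1030, 46],
  "Ruby 1.6.6-0" => [1123, 58],
  "Ruby 1.7.2"   => [1085, 54],
  "Python 2.1.1" => [58, 10],
  "Python 2.2"   => [49, 9],
}

# How many times longer the large file (5x the items) takes than the small one.
def growth(large, small)
  large.to_f / small
end

TIMES.each do |name, (large, small)|
  printf("%-14s large/small = %5.1fx\n", name, growth(large, small))
end

# Ruby 1.6.5-2 relative to Python 2.2 for each file size.
ruby   = TIMES["Ruby 1.6.5-2"]
python = TIMES["Python 2.2"]
printf("Ruby/Python, 1,600,000 items: %.1fx\n", ruby[0].to_f / python[0])
printf("Ruby/Python,   320,000 items: %.1fx\n", ruby[1].to_f / python[1])
```

For five times the data, the Python versions take about 5 to 6 times as long, while the Ruby versions take about 20 times as long. That suggests, without proving it, that some step in the Ruby version scales worse than linearly with the number of rows.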
I had previous experience with a text analysis task in which I split
some 5,000 news messages into words, then counted and stored these
words for retrieval and for determining the similarity between the
news articles.
Ruby was 20-30% slower than Python in that task, which I could readily
accept because Ruby is such a nice language. But the time differences
above will kill my project, I'm afraid.
The tests were done on a 1.1 GHz Pentium III PC with 512 MB of main
memory under Windows 2000. I didn't try Linux because the final
application has to run under MS Windows anyway.
So for me the question remains: Why is Ruby so unbelievably slow
(roughly 5 to 20 times slower than Python) in this task, especially
for larger data sets?
Many thanks, Hans Werner.
______________________________________________________________________
Loading data set test.dat...
10001 rows loaded, of required length 32.
2.824 secs needed.
0 Id: TYPE Integer (10000 items, 0 missing).
1 V1: TYPE Set (2 items, 0 missing).
2 V2: TYPE Integer (75 items, 0 missing).
3 V3: TYPE Set (2 items, 0 missing).
4 V4: TYPE Set (6 items, 0 missing).
5 V5: TYPE Integer (885 items, 0 missing).
6 V6: TYPE Integer (467 items, 0 missing).
7 V7: TYPE Integer (402 items, 0 missing).
8 V8: TYPE Set (9 items, 0 missing).
9 V9: TYPE Integer (19 items, 0 missing).
10 V10: TYPE Integer (70 items, 0 missing).
11 V11: TYPE Integer (1653 items, 0 missing).
12 V12: TYPE Integer (1316 items, 0 missing).
13 V13: TYPE Integer (52 items, 0 missing).
14 V14: TYPE Set (6 items, 0 missing).
15 V15: TYPE Set (2 items, 0 missing).
16 V16: TYPE Integer (29 items, 0 missing).
17 V17: TYPE Integer (49 items, 0 missing).
18 V18: TYPE Integer (69 items, 0 missing).
19 V19: TYPE Integer (13 items, 0 missing).
20 V20: TYPE Set (11 items, 0 missing).
21 V21: TYPE Set (9 items, 0 missing).
22 V22: TYPE Integer (15 items, 0 missing).
23 V23: TYPE Integer (19 items, 0 missing).
24 V24: TYPE Set (10 items, 0 missing).
25 V25: TYPE Integer (15 items, 0 missing).
26 V26: TYPE Set (12 items, 0 missing).
27 V27: TYPE Integer (17 items, 0 missing).
28 V28: TYPE Integer (15 items, 0 missing).
29 V29: TYPE Integer (25 items, 0 missing).
30 V30: TYPE Set (2 items, 0 missing).
31 Target: TYPE Set (2 items, 0 missing).
48.655 secs needed.
______________________________________________________________________
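module CSV
  # Split one line into fields. Returns [number_of_fields, fields];
  # blank lines and comment lines yield [0, []].
  # The missing parameter is accepted but not used here.
  def parse_line(line, sep="\t", missing='?', comment='#')
    line.chomp!
    # line[0, 1] is the first character as a String in every Ruby
    # version; line[0] is an Integer in Ruby 1.6/1.7 and would never
    # equal comment.
    if line == '' or line[0, 1] == comment
      fields = []
      nfields = 0
    else
      fields = line.split(sep)
      nfields = fields.length
    end
    return nfields, fields
  end
end #module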
### -- c l a s s DataReader ---------------------------------------
class DataReader
  include CSV

  def initialize(fname, header=true, sep="\t", missing="?", comment="#")
    ### ------------------------------------------------
    @fname = fname
    @header = header; @hfields = []
    @dtypes = []; @dfields = []
    @nrows = 0; @ncols = 0
    @sep = sep; @missing = missing
    @comment = comment
    ### ------------------------------------------------
  end
  def load(logging=false)
    t1 = Time.now
    if logging
      puts
      puts "---------------------------------------------- LOADING DATA ----"
      puts "Loading data set #{@fname}..."
    end
    csvFile = File.open(@fname, 'r')
    if @header
      # Ruby arguments are positional here: line, sep, missing, comment
      @ncols, @hfields = parse_line(csvFile.gets, @sep, @missing, @comment)
    else
      raise "Not Implemented Error."
    end
    @row = []; @col = []
    @row[0] = @hfields
    (0...@ncols).each { |j| @col << [] }
    no_short = 0; no_long = 0
    ln_short = []; ln_long = []
    n = 0
    while line = csvFile.gets
      n += 1
      m, fields = parse_line(line, @sep, @missing, @comment)
      # skip blank and comment lines (their slot in @row stays nil)
      if m == 0 then next end
      # fill row up with NA character or cut if too long
      if m < @ncols
        no_short += 1; ln_short << n+1
        (@ncols - m).times { fields << @missing }
      elsif m > @ncols
        no_long += 1; ln_long << n+1
        fields = fields[0...@ncols]
      end
      # store the data both row-wise and column-wise
      @row[n] = fields
      (0...@ncols).each { |j| @col[j] << fields[j] }
    end
    csvFile.close
    @nrows = @row.size   # includes the header row
    t2 = Time.now
    if logging
      puts "#{@nrows} rows loaded, of required length #{@ncols}."
      if no_short > 0
        puts "#{no_short} rows too short: #{ln_short[0]}, ..."
      end
      if no_long > 0
        puts "#{no_long} rows too long: #{ln_long[0]}, ..."
      end
      puts "#{t2 - t1} secs needed."
      puts
    end
  end
  def prelyze(logging=false, missing=@missing)
    t1 = Time.now
    dtypes = {0 => 'NA', 1 => 'Integer', 2 => 'Continuous',
              3 => 'String', 4 => 'Set'}
    @dtypes = []
    for j in (0...@ncols) do
      ctype = 0; mitms = 0
      # the column type is the widest type needed by any of its items
      @col[j].each { |item|
        if item == missing
          ctype = [ctype, 0].max
          mitms += 1
        elsif item =~ /^\s*[+\-]?\d+\s*$/
          ctype = [ctype, 1].max
        elsif item =~ /^\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*$/
          ctype = [ctype, 2].max
        else
          ctype = [ctype, 3].max
        end
      }
      # Number of distinct non-empty items. This relies on Array#- also
      # dropping duplicates, as it evidently does in the Ruby versions
      # tested (see the item counts in the output above).
      nitms = (@col[j]-['']).nitems
      # few distinct values relative to the row count: treat as a set
      if 0 < nitms and nitms <= 12 and nitms <= 0.1*(@nrows-mitms)
        ctype = 4
      end
      ctype = dtypes[ctype]
      @dtypes << ctype
      if logging
        puts "#{j.to_s.rjust(3)} #{(@row[0][j]).rjust(15)}:\tTYPE #{ctype} " +
             "(#{nitms} items, #{mitms} missing)."
      end
    end
    t2 = Time.now
    if logging
      puts
      puts "#{t2 - t1} secs needed."
      puts "----------------------------------------------------------------"
      puts " Copyright (C) 2001, Data Mining Center."
      puts
    end
  end
  ### -- accessor functions --
  attr_reader :nrows, :ncols
  attr_reader :dtypes
  def nrow(); @nrows; end
  def ncol(); @ncols; end
  def hfields(); @row[0]; end
  def [](i, j); @row[i][j]; end
  def col(j); @col[j]; end
  def row(i); @row[i]; end
end #class
### -- m a i n ( ) ------------------------------------------------#
# arguments: fname, header, sep, missing, comment
tData = DataReader.new("test2.dat", true, "\t", "", "%")
tData.load(true)      # logging = true
tData.prelyze(true)   # logging = true
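
One way to narrow down where the time goes would be to time the individual steps in isolation. The sketch below is only an illustration on synthetic data: it times the regular-expression check and two ways of counting distinct items with `Benchmark.realtime` from the standard library. The column contents and all helper names are made up, it is written for a current Ruby, and its timings will not match the 1.6/1.7 builds measured above.

```ruby
require 'benchmark'

# Synthetic column of integer-like strings (made-up data, not test.dat).
def make_column(n, distinct)
  (0...n).map { |i| (i % distinct).to_s }
end

# Step 1: regular-expression type check on every item.
def count_integers(col)
  col.count { |item| item =~ /\A\s*[+\-]?\d+\s*\z/ }
end

# Step 2a: distinct non-empty items via a Hash used as a set.
def distinct_with_hash(col)
  seen = {}
  col.each { |item| seen[item] = true unless item == '' }
  seen.size
end

# Step 2b: distinct non-empty items via Array#- and Array#uniq.
def distinct_with_uniq(col)
  (col - ['']).uniq.size
end

col = make_column(50_000, 1_000)
timings = {
  "regex check"   => Benchmark.realtime { count_integers(col) },
  "distinct hash" => Benchmark.realtime { distinct_with_hash(col) },
  "distinct uniq" => Benchmark.realtime { distinct_with_uniq(col) },
}
timings.each { |label, secs| printf("%-14s %.4f s\n", label, secs) }
```

Running the same measurements at two column sizes (for example 10,000 and 50,000 rows) would show which step, if any, grows faster than the data does.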