I implemented a _DataReader_ class in Ruby and Python. The reader:
- reads in a CSV file, in this case tab-separated,
- gets variable names from the header line,
- splits up each row into single items,
- checks for and counts missing values,
- determines the type of the item - using regular expressions -
(integer, float, or else classified as string), and
- counts the number of unique items in each column
finally outputting a short report on what it found. And this result is
quite useful even if you later on perform data mining tasks on these
data utilizing other tools.
The implementation is straightforward, with no attempt at optimization
in the first version. I tested it on a fairly large data file of 4.3 MB
with 1.6 million data items, most of them integers.
Here are the running times for some available Ruby implementations
under Windows:
                              1,600,000 items    320,000 items
Ruby 1.6.5-2                  17:10 min          46 sec
Ruby 1.6.6-0                  18:43 min          58 sec
Ruby 1.7.2 (i586-mswin32)     18:05 min          54 sec
As a comparison, I implemented the method in Python too, with the
following results:
Python 2.1.1 (Zope)           58 sec             10 sec
Python 2.2                    49 sec             9 sec
Active State Python 2.2a      49 sec             11 sec
And I also tested the data with the _read.table_ function of the
open-source statistical package *R*, which has almost the same
functionality (in a way, I tried to model it):
R::read.table                 30 sec             2 sec
One can see that the Python implementation compares reasonably with
such a well-known package. Unfortunately, the Ruby implementation of
the same method is *unacceptably* slow.
I have prior experience with a text analysis task where I split some
5,000 news messages into words and then counted and stored these words
for retrieval and for determining similarity between the news
articles.
Ruby was 20-30% slower than Python in this task, which I could really
accept because Ruby is such a nice language. But the time differences
above will kill my project, I'm afraid.
The tests were done on a 1.1 GHz Pentium III PC under Windows 2000 and
with 512 MB main memory. I didn't try Linux for that because the final
application has to run under MS Windows anyway.
So for me the question remains: Why is Ruby so unbelievably slow
(roughly 5 to 20 times slower than Python) in this task -- esp. for
larger data sets?
Many thanks, Hans Werner.
______________________________________________________________________
Loading data set test.dat...
10001 rows loaded, of required length 32.
2.824 secs needed.
0 Id: TYPE Integer (10000 items, 0 missing).
1 V1: TYPE Set (2 items, 0 missing).
2 V2: TYPE Integer (75 items, 0 missing).
3 V3: TYPE Set (2 items, 0 missing).
4 V4: TYPE Set (6 items, 0 missing).
5 V5: TYPE Integer (885 items, 0 missing).
6 V6: TYPE Integer (467 items, 0 missing).
7 V7: TYPE Integer (402 items, 0 missing).
8 V8: TYPE Set (9 items, 0 missing).
9 V9: TYPE Integer (19 items, 0 missing).
10 V10: TYPE Integer (70 items, 0 missing).
11 V11: TYPE Integer (1653 items, 0 missing).
12 V12: TYPE Integer (1316 items, 0 missing).
13 V13: TYPE Integer (52 items, 0 missing).
14 V14: TYPE Set (6 items, 0 missing).
15 V15: TYPE Set (2 items, 0 missing).
16 V16: TYPE Integer (29 items, 0 missing).
17 V17: TYPE Integer (49 items, 0 missing).
18 V18: TYPE Integer (69 items, 0 missing).
19 V19: TYPE Integer (13 items, 0 missing).
20 V20: TYPE Set (11 items, 0 missing).
21 V21: TYPE Set (9 items, 0 missing).
22 V22: TYPE Integer (15 items, 0 missing).
23 V23: TYPE Integer (19 items, 0 missing).
24 V24: TYPE Set (10 items, 0 missing).
25 V25: TYPE Integer (15 items, 0 missing).
26 V26: TYPE Set (12 items, 0 missing).
27 V27: TYPE Integer (17 items, 0 missing).
28 V28: TYPE Integer (15 items, 0 missing).
29 V29: TYPE Integer (25 items, 0 missing).
30 V30: TYPE Set (2 items, 0 missing).
31 Target: TYPE Set (2 items, 0 missing).
48.655 secs needed.
______________________________________________________________________
module CSV
def parse_line(line, sep="\t", missing='?', comment='#')
line.chomp!
if line == '' or line[0] == comment
fields = []
nfields = 0
else
fields = line.split(sep)
nfields = fields.length
end
return nfields, fields
end
end #module
### -- c l a s s DataReader ---------------------------------------
class DataReader
include CSV
def initialize(fname, header=true, sep="\t", missing="?", comment="#")
### ------------------------------------------------
@fname = fname;
@header = header; @hfields = []
@dtypes = []; @dfields = []
@nrows = 0; @ncols = 0
@sep = sep; @missing = missing
@comment = comment
### ------------------------------------------------
end
def load(logging=false)
t1 = Time.now
if logging
puts
puts "---------------------------------------------- LOADING DATA ----"
puts "Loading data set #{@fname}..."
end
csvFile = File.open(@fname, 'r')
if @header
@ncols, @hfields = parse_line(csvFile.gets, \
sep=@sep, missing=@missing, comment=@comment)
else
raise "Not Implemented Error."
end
@row = []; @col = []
@row[0] = @hfields
(0...@ncols).each { |j| @col << [] }
no_short = 0; no_long = 0
ln_short = []; ln_long = []
n = 0
while line = csvFile.gets
n += 1
m, fields = parse_line(line, \
sep=@sep, missing=@missing, comment=@comment)
if m == 0 then next end
# fill row up with NA character or cut if too long
if m < @ncols
no_short +=1; ln_short << n+1
(@ncols - m).times { fields << @missing }
elsif m > @ncols
no_long += 1; ln_long << n+1
fields = fields[0...@ncols]
end
@row[n] = fields
(0...@ncols).each { |j| @col[j] << fields[j] }
end
csvFile.close
@nrows = @row.size
t2 = Time.now
if logging
puts "#{@nrows} rows loaded, of required length #{@ncols}."
if no_short > 0
puts "#{no_short} rows too short: #{ln_short[0]}, ..."
end
if no_long > 0
puts "#{no_long} rows too long: #{ln_long[0]}, ..."
end
puts "#{t2 - t1} secs needed."
puts
end
end
def prelyze(logging=false, missing=@missing)
t1 = Time.now
dtypes = {0 => 'NA', 1 => 'Integer', 2 => 'Continuous',
3 => 'String', 4 => 'Set'}
@dtypes = []
for j in (0...@ncols) do
ctype = 0; mitms = 0
@col[j].each { |item|
if item == missing
ctype = [ctype, 0].max
mitms += 1
elsif item =~ /^\s*[+\-]?\d+\s*$/
ctype = [ctype, 1].max
elsif item =~ /^\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*$/
ctype = [ctype, 2].max
else
ctype = [ctype, 3].max
end
}
nitms = (@col[j]-['']).nitems
if 0 < nitms and nitms <= 12 and nitms <= 0.1*(@nrows-mitms) then ctype = 4 end
ctype = dtypes[ctype]
@dtypes << ctype
if logging
puts "#{j.to_s.rjust(3)} #{(@row[0][j]).rjust(15)}:\tTYPE #{ctype} (#{nitms} items, #{mitms} missing)."
end
end
t2 = Time.now
if logging
puts
puts "#{t2 - t1} secs needed."
puts "----------------------------------------------------------------"
puts " Copyright (C) 2001, Data Mining Center."
puts
end
end
### -- accessor functions --
attr_reader :nrows, :ncols
attr_reader :dtypes
def nrow(); @nrows; end
def ncol(); @ncols; end
def hfields(); @row[0]; end
def [](i, j); @row[i][j]; end
def col(j); @col[j]; end
def row(i); @row[i]; end
end #class
### -- m a i n ( ) ------------------------------------------------#
tData = DataReader.new("test2.dat", header=true, \
sep="\t", missing="", comment="%")
tData.load(logging=true)
tData.prelyze(logging=true)
Please see the "file reading impossibly slow?" thread.
http://blade.nagaokaut.ac.jp/cgi-bin/scat.rb/ruby/ruby-talk/35367
HTH
--
<[ Kent Dahl ]>================<[ http://www.stud.ntnu.no/~kentda/ ]>
)__(stud.techn.; ind. econ & management: computer technology)__(
/"Opinions expressed are mine and not those of my Employer, "\
( "the University, my girlfriend, stray cats, banana fruitflies, " )
\"nor the frontal lobe of my left cerebral hemisphere. "/
The thread you referenced above said that this problem is fixed in 1.7, but
Venherm provided a 1.7.2 test which showed it to be nearly as slow as the
other versions of Ruby.
If Venherm's numbers are correct, it looks like there is still a problem
here.
Curt
IIRC, the problem was very dependent on how it is compiled up on
Windows, with regards to Cygwin, MinGW etc. I was under the impression
that it was fixed as far as Linux goes, but that portability to Windows
left something to be desired. Last I remember was a post that commented
on some "tricks" Cygwin apparently was doing. (Not too sure, as I
started reading it with only half-a-brain, after the problem sounded
Winblows specific :-)
Have you tried using something other than gets?
Jim
Perhaps, but as I recall he showed that the 1.7 test he ran was under
cygwin (Windows and cygwin) - once again, could cygwin be the culprit in
that case?
Phil
> Curt Hibbs wrote:
>> The thread you referenced above said that this problem is fixed in 1.7, but
>> Venherm provided a 1.7.2 test which showed it to be nearly as slow as the
>> other versions of Ruby.
>
> IIRC, the problem was very dependent on how it is compiled up on
> Windows, with regards to Cygwin, MinGW etc. I was under the
> impression that it was fixed as far as Linux goes, but that
> portability to Windows left something to be desired. Last I remember
> was a post that commented on some "tricks" Cygwin apparently was
> doing. (Not too sure, as I started reading it with only
> half-a-brain, after the problem sounded Winblows specific :-)
With respect to IO, 1.7.* native windows should be as fast as cygwin
windows or Unix. That is what I find with empirical tests.
--
matt
> WHY IS RUBY SO SLOW?
We probably can't tell just by reading your code. You may have
identified an area where Ruby can be improved, or where the speed
under Windows isn't good.
To rule out the IO slowness mentioned earlier, you could write a
simple program like this:
save = []
while line = datafile.gets
save << line
end
And time that on your data files. That'll tell you how much time it
takes Ruby to read in the file and store it in an array. The rest of
the overhead will likely be in your string manipulation code.
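
A minimal sketch of that read-only timing, using Time.now as in the
original post. The helper name time_read_only and the small generated
sample file are hypothetical stand-ins; point the helper at the real
data file to get a meaningful number.

```ruby
require 'tempfile'

# Hypothetical helper: read every line with gets, keep it in an array,
# and return the lines together with the elapsed wall-clock seconds.
def time_read_only(path)
  save = []
  t1 = Time.now
  File.open(path, 'r') do |datafile|
    while (line = datafile.gets)
      save << line
    end
  end
  [save, Time.now - t1]
end

# Small tab-separated sample standing in for the real file (an assumption).
sample = Tempfile.new('sample')
sample.write("Id\tV1\n")
1000.times { |i| sample.write("#{i}\t#{i % 2}\n") }
sample.close

lines, secs = time_read_only(sample.path)
puts "#{lines.size} lines read in #{secs} secs"
```

If this number is small compared with the total running time, the
reading itself is probably not the bottleneck.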
Also, Ruby has a built-in profiler. Run your program like "ruby -r
profile your_program.rb". Make sure to use a much smaller data set,
since profiling will really slow your program down.
But I find that I like RBProf much better:
http://aspectr.sourceforge.net/rbprof/
It is a little quirky, but it gives you more information and doesn't
profile every built-in function, so the program doesn't run anywhere
near as slowly.
--
matt
At Tue, 19 Mar 2002 03:41:31 +0900,
Venherm Borchers wrote:
> WHY IS RUBY SO SLOW?
>
> I implemented a _DataReader_ class in Ruby and Python. The reader:
(snip)
> Here are the running times for some available Ruby implementations
> under Windows:
>
> ________data items______1,600,000_________320,000_______
>
> Ruby 1.6.5-2 17:10 min 46 sec
> Ruby 1.6.6-0 18:43 min 58 sec
> Ruby 1.7.2 (i586-mswin32) 18:05 min 54 sec
tData.load ran in 2.824 secs, but tData.prelyze took 48.655
secs, so that is clearly the part that is too slow.
> def prelyze(logging=false, missing=@missing)
> t1 = Time.now
> dtypes = {0 => 'NA', 1 => 'Integer', 2 => 'Continuous',
> 3 => 'String', 4 => 'Set'}
> @dtypes = []
> for j in (0...@ncols) do
> ctype = 0; mitms = 0
> @col[j].each { |item|
> if item == missing
> ctype = [ctype, 0].max
> mitms += 1
> elsif item =~ /^\s*[+\-]?\d+\s*$/
> ctype = [ctype, 1].max
> elsif item =~ /^\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*$/
> ctype = [ctype, 2].max
> else
> ctype = [ctype, 3].max
> end
> }
>
> nitms = (@col[j]-['']).nitems
Possibly here. Array#- builds a hash on every call, so it is a little
expensive. Try with:
nitms = @col[j].nitems - @col[j].grep(/^$/).nitems
Or it may be better to count nitms inside the @col[j].each block.
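
A minimal sketch of that last idea, counting the items in the same pass
that classifies them. The helper name scan_column and its return value
are hypothetical. It is not the code from the post, only an
illustration that reuses the two regular expressions quoted above.

```ruby
# Hypothetical helper: classify one column and count its items in a
# single pass, so no second array (Array#- or grep) is needed.
# Returns [ctype, nitms, mitms] with
#   ctype: 0 = NA, 1 = Integer, 2 = Continuous, 3 = String
#   nitms: number of non-empty items
#   mitms: number of items equal to the missing marker
def scan_column(items, missing = '?')
  ctype = 0
  mitms = 0
  nitms = 0
  items.each do |item|
    nitms += 1 unless item.nil? || item == ''
    if item == missing
      mitms += 1
    elsif item =~ /^\s*[+\-]?\d+\s*$/
      ctype = 1 if ctype < 1
    elsif item =~ /^\s*[+\-]?(?:\d+\.\d*|\d*\.\d+)\s*$/
      ctype = 2 if ctype < 2
    else
      ctype = 3
    end
  end
  [ctype, nitms, mitms]
end

ctype, nitms, mitms = scan_column(['12', '-3', '4.5', '?'])
puts "type code #{ctype}, #{nitms} items, #{mitms} missing"
```

Comparing ctype directly instead of calling [ctype, n].max also avoids
creating a temporary two-element array per item. Whether that matters
for the timings above would have to be measured.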
--
Nobu Nakada
save = IO.readlines(file)
and got about a 50% improvement.
7.81u 1.73s 0:13.95 68.3% # 4 line method
4.48u 1.54s 0:06.25 96.3% # 1 line method
dir bigfile
-rw-rw-r-- 1 jfn cad 40500000 Mar 19 08:47 bigfile
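
A comparison of this kind can be sketched as follows. The helper name
timed and the generated stand-in file are assumptions, since the real
bigfile is not available; the absolute numbers will differ from the
ones quoted above.

```ruby
require 'tempfile'

# Hypothetical helper: run the block and return [result, elapsed seconds].
def timed
  t1 = Time.now
  result = yield
  [result, Time.now - t1]
end

# Generated stand-in for bigfile (an assumption; the real file was about 40 MB).
big = Tempfile.new('bigfile')
20_000.times { |i| big.write("line #{i}\n") }
big.close

# The 4 line method: read with gets and append each line to an array.
by_gets, t_gets = timed do
  save = []
  File.open(big.path, 'r') do |f|
    while (line = f.gets)
      save << line
    end
  end
  save
end

# The 1 line method: let IO.readlines build the array.
by_readlines, t_readlines = timed { IO.readlines(big.path) }

puts "gets loop:    #{t_gets} secs"
puts "IO.readlines: #{t_readlines} secs"
```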
--
Jim Freeze