The latest instance of having to write such a program was simple
enough I figured I'd try my hand at using Haskell instead. The short
story is that I could quickly write a Haskell program that achieved
the goal. Yay! However, the performance was ~8x slower than a
comparable Perl implementation. With a lot of effort I got the Haskell
version to only 2x slower. A lot of the optimization was done with
guesses that the performance difference was due to how each line was
being read from the file. I couldn't determine much using GHC's
profiler.
I still have two questions after all this:
- Can I get a Haskell implementation as fast as the Perl?
- What do I need to do to get GHC's profiler to provide me usable
information? Telling me that 98% of the time was in "main" is not very
enlightening ;-)
All the code and data for this little experiment I've placed here:
http://tothepowerofdisco.com/repo/sum_optimization/
My first, and shortest, version is SumTiny.hs. The optimized version
is SumFast.hs.
The long version for the curious:
The (cleaned up) data was a 78k line file consisting of lines like the
following:
framerate (prev == no pts): 15
framerate (delta): 25
framerate (invalid timebase): 12.5
.. and so on.
The need was for a program that calculated 1.0 / framerate for each
line and produced the sum. Easy no?
My straightforward Haskell solution was:
-------------------------------------------------------------------
import Text.Regex.Posix
main = do
f_lines <- readFile "test.out" >>= return . lines
let duration = foldl add_line 0.0 f_lines
add_line sum line =
let [[_,m]] = line =~ "([0-9.]+)"
framerate = read m
delta = 1.0 / framerate
in sum + delta
putStrLn $ "Duration (sec): " ++ show duration
-------------------------------------------------------------------
Just for fun I decided to compare it to a Perl implementation:
-------------------------------------------------------------------
#!/usr/bin/perl
my @f_lines = split(/\n/,`cat test.out`);
my $sum = 0.0;
foreach(@f_lines)
{
/([0-9.]+)/;
my $delta = 1.0 / $1;
$sum += $delta;
}
print("Duration (sec): ", $sum, "\n");
-------------------------------------------------------------------
(I'm sure there are other ways to write this same program in both languages.)
I was pretty happy with how the Haskell implementation's code compared
to the Perl implementation's just in terms of looks. Though I think
the Perl implementation is easier to understand at the part where the
regex match is extracted.
For fun I compared the performance of the two:
(64bit linux running on a 2.4ghz Core2Duo)
$ time perl ./Sum.pl
Duration (sec): 3155.62666666438
real 0m0.121s
user 0m0.103s
sys 0m0.016s
$ time ./SumTiny
Duration (sec): 3155.626666664377
real 0m1.099s
user 0m1.073s
sys 0m0.009s
Youch! ~1s is fast enough that I don't care, but I'm still curious why
there is a 8x performance difference. Profiling with manual cost
center annotations (See SumTinyProf.hs) indicated the "do" expression
main is equal to was responsible 90% of the tiny. Which isn't
revealing at all!
Some experimenting led me to find a 2x slower implementation
(SumFast.hs). Quick notes ont he changes I made:
- Used ByteStrings instead of String.
- Used hGetLine instead of reading the entire file lazily and splitting on "\n"
- Fusing the foldl' with the loop of reading each line from the file.
- Using bang patterns to make the fused loop strict on the
accumulator argument.
I think the largest performance gain was from changing how each line
was read from the file. The lazy read and split seemed very slow
compared to a loop that used hGetLine.
--
-Corey O'Connor
_______________________________________________
Haskell-Cafe mailing list
Haskel...@haskell.org
http://www.haskell.org/mailman/listinfo/haskell-cafe
Could you try using lazy and/or strict bytestrings. For *any* IO
performance issue, that should be your first stop.
See, e.g., this sum-file example,
http://shootout.alioth.debian.org/gp4/benchmark.php?test=sumcol&lang=ghc&id=6
which outperforms C on the benchmark.
The second point would be to avoid Text.Regex.Posix, which is
algorithmically much worse than the PCRE regexes used by perl. Instead,
try the pcre-light or regex-pcre packages, which have both better
complexity, and operate on efficient bytestrings.
Finally, there's some small details about writing efficient loops. Give
type declarations for atomic types like Int and Double, and use strict
folds or explicit recursion, as in this post,
http://cgi.cse.unsw.edu.au/~dons/blog/2008/05/16#fast
For your task here, with those simple changes, it should be relatively
easy to produce competitive-with-C performance.
-- Don
-Corey
--
-Corey O'Connor
{-# OPTIONS -fbang-patterns #-}
import qualified Data.ByteString.Char8 as S
import Data.ByteString.Lex.Double
import Debug.Trace
main = print . go 0 =<< S.getContents
where
go !n !s = case readDouble str of
Nothing -> n
Just (k,t) -> let delta = 1.0 / k in go (n+delta) t
where
(_, xs) = S.break ((==) ':') s
str = S.drop 2 xs
It uses the bytestring-lexing package on Hackage to read the Doubles
out,
$ ghc --make Fast.hs -O2 -fvia-C -optc-O2
$ time ./Fast < test.out
3155.626666664377
./Fast < test.out 0.07s user 0.01s system 97% cpu 0.078 total
So that's twice as fast as the perl entry on my box,
$ time perl Sum.pl < test.out
Duration (sec): 3155.62666666438
perl Sum.pl < test.out 0.15s user 0.03s system 100% cpu 0.180 total
Note that the safe Double lexer uses a bit of copying, so
we can in fact do better still with a non-copying Double parser,
but that's only for the hardcore.
-- Don
Don did a great job answering teh first question. To get better
profiling information, add "SCC" annotations to your code. That will
tell you how much time is spent in a particular expression. For your
main function, take a divide and conquer approach - add SCCs to
determine what is taking the most time.
Take Don's advice first though before benchmarking - you probably
won't care after switching to bytestring & a better regex engine.
Justin
$ time awk -F: '{sum += 1 / $2} END{print sum}' test.out
3155.63
real 0m0.197s
user 0m0.184s
sys 0m0.004s
compared to Don's Haskell version:
$ time ./fastSum < test.out
3155.626666664377
real 0m0.072s
user 0m0.056s
sys 0m0.004s
compared to the Corey's perl version:
$ time perl Sum.pl
Duration (sec): 3155.62666666438
real 0m0.181s
user 0m0.164s
sys 0m0.012s
Regards,
Brad Larsen
I did write a version of SumTiny.hs with SCC annotations:
http://tothepowerofdisco.com/repo/sum_optimization/SumTinyProf.hs
Maybe I'm missing something, but the profiling still gives me poor
results. Here is the results from the profiler (cut down and, no
doubt, with poor post-email formatting):
-------------------------------------------------------------------
COST CENTRE %time %alloc
do 92.8 87.6
regex 2.8 4.8
add_line 2.2 0.4
div 1.1 0.3
readFile 0.6 6.2
individual inherited
COST CENTRE %time %alloc %time %alloc
MAIN
CAF 0.0 0.0 100.0 100.0
main 0.0 0.0 100.0 100.0
do 92.8 87.6 100.0 100.0
show 0.0 0.0 0.0 0.0
foldl 0.0 0.3 0.0 0.3
add_line 2.2 0.4 6.6 5.9
regex 2.8 4.8 2.8 4.8
sum 0.0 0.1 0.0 0.1
div 1.1 0.3 1.1 0.3
read Double 0.6 0.2 0.6 0.2
readFile 0.6 6.2 0.6 6.2
-------------------------------------------------------------------
Which tells me the "do" expression that main is equal to is
responsible for 92.8% of the run time. Which really doesn't tell me
anything.
--
-Corey O'Connor
Another variant, using a non-copying readDouble (available in
bytestring-lexing package 0.1.2),
{-# OPTIONS -fbang-patterns #-}
import qualified Data.ByteString.Char8 as S
import qualified Data.ByteString.Unsafe as S
import Data.ByteString.Lex.Double
main = print . go 0 =<< S.getContents
where
go :: Double -> S.ByteString -> Double
go !n !s = case unsafeReadDouble s' of
Nothing -> n
Just (k,t) -> let delta = 1 / k in go (n+delta) t
where
s' = case S.elemIndex ':' s of
Nothing -> S.empty
Just i -> S.unsafeDrop (i+2) s
Computes in:
$ time ./Fast < test.out
3155.626666664377
./Fast < test.out 0.02s user 0.00s system 80% cpu 0.025 total
So that's probably a good place to stop.
Note the general rules:
* strict bytestrings
* tight tail-recursive loops with strict, atomic accumulators
* non-copying parsing.
I'm sure the other suggestions have improved performance far more than
this suggestion will, but it's still worth a shot.
I don't think I've _ever_ seen a case where using foldl was a good
idea. foldl interacts poorly with laziness; use foldr when the
function you are folding is nonstrict, and use foldl' (from
Data.List), a strict version of foldl, when the function you are
folding is strict. This is, of course, only a rule of thumb.
Changing that foldl to foldl' will probably speed this up a decent amount.
Luke