slow wordcount


Attila Zséder

Nov 28, 2015, 5:19:31 PM
to julia...@googlegroups.com
Hi,

I'm new to Julia and wrote a baseline implementation of word count that counts words and writes them to stdout sorted by count (highest first), breaking ties alphabetically.
My code is here:
https://github.com/juditacs/wordcount/blob/master/julia/wordcount.jl

Half of the time is spent on the secondary sort (for alphabetical order); this is the first bottleneck. But adding strings to the dict is also quite slow.
The Python/C++ implementations run in about 20-25 seconds, while this implementation runs for 80 seconds.
Can you give me some hints on why this implementation is slow?

Thank you!

Attila

Lampkld

Nov 29, 2015, 1:08:49 AM
to julia-users
Maybe it's the lambda? Those are slow in Julia right now.

Cedric St-Jean

Nov 29, 2015, 4:28:09 PM
to julia-users
What I would try:

1. ProfileView to pinpoint the bottleneck further
2. FastAnonymous to fix the lambda
3. http://julia-demo.readthedocs.org/en/latest/manual/performance-tips.html In particular, you may check `code_typed`. I don't have experience with `split` and `eachline`. It's possible that they are not type stable (the compiler can't predict their output's type). I would try `for w::ASCIIString in ...`
4. Dict{ASCIIString, Int}()
5. Your loop will hash each string twice. I don't know how to fix that, anyone?

Good luck,

Cédric

Milan Bouchet-Valat

Nov 29, 2015, 4:42:57 PM
to julia...@googlegroups.com
On Sunday, November 29, 2015, at 08:28 -0800, Cedric St-Jean wrote:
> What I would try:
>
> 1. ProfileView to pinpoint the bottleneck further
> 2. FastAnonymous to fix the lambda
> 3. http://julia-demo.readthedocs.org/en/latest/manual/performance-tips.html
> In particular, you may check `code_typed`. I don't have
> experience with `split` and `eachline`. It's possible that they are
> not type stable (the compiler can't predict their output's type). I
> would try `for w::ASCIIString in ...`
> 4. Dict{ASCIIString, Int}()
> 5. Your loop will hash each string twice. I don't know how to fix
> that, anyone?
You can use the unexported Base.ht_keyindex() function like this:
https://github.com/nalimilan/FreqTables.jl/blob/7884c000e6797d7ec621e07b8da58e7939e39867/src/freqtable.jl#L36

But this is at your own risk, as it may change without warning in a
future Julia release.

We really need a public API for it.
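In code, the pattern from FreqTables looks roughly like this (a sketch only; `ht_keyindex` and the `vals` field are unexported Dict internals and may change):

```julia
# Sketch of counting with a single hash lookup per existing key.
# WARNING: Base.ht_keyindex and Dict's vals field are unexported
# internals and may change in a future Julia release.
function addcount!(wc::Dict, w)
    idx = Base.ht_keyindex(wc, w)    # one hash lookup
    if idx > 0
        @inbounds wc.vals[idx] += 1  # key present: bump the slot directly
    else
        wc[w] = 1                    # key absent: normal insert (hashes again)
    end
    return wc
end
```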


Regards

Yichao Yu

Nov 29, 2015, 4:59:42 PM
to Julia Users
On Sun, Nov 29, 2015 at 11:42 AM, Milan Bouchet-Valat <nali...@club.fr> wrote:
> On Sunday, November 29, 2015, at 08:28 -0800, Cedric St-Jean wrote:
>> What I would try:
>>
>> 1. ProfileView to pinpoint the bottleneck further
>> 2. FastAnonymous to fix the lambda
>> 3. http://julia-demo.readthedocs.org/en/latest/manual/performance-tips.html
>> In particular, you may check `code_typed`. I don't have
>> experience with `split` and `eachline`. It's possible that they are
>> not type stable (the compiler can't predict their output's type). I
>> would try `for w::ASCIIString in ...`
>> 4. Dict{ASCIIString, Int}()
>> 5. Your loop will hash each string twice. I don't know how to fix
>> that, anyone?
> You can use the unexported Base.ht_keyindex() function like this:
> https://github.com/nalimilan/FreqTables.jl/blob/7884c000e6797d7ec621e07b8da58e7939e39867/src/freqtable.jl#L36
>
> But this is at your own risk, as it may change without warning in a
> future Julia release.
>
> We really need a public API for it.

IIUC, https://github.com/JuliaLang/julia/issues/12157

Attila Zséder

Nov 30, 2015, 2:21:51 PM
to julia-users
Hi,

Thank you all for the responses.

1. I tried simple profiling, but its output was difficult for me to interpret; maybe it would make more sense if I put more time into it. I will try ProfileView later.
2. FastAnonymous gave me a serious speedup (20-30%). (But since it is an external dependency, it's kind of cheating, given the purpose of this small word count test.)
3. Using ASCIIString is not an option right now, since the data contains Unicode characters. I am trying both UTF8String and AbstractString; I don't see any difference in performance so far.
4. Using ht_keyindex() is out of scope for me right now, because this is a pet project: I just wanted to see how fast a straightforward implementation is, without this kind of solution.

I think I will keep trying with later versions of Julia, sticking to the standard library only, without any external packages.

Attila

Dan

Nov 30, 2015, 3:08:56 PM
to julia-users
Can you provide the comparable Python code? Perhaps even the data used for testing?

Since you are evaluating Julia, there are two important points to remember:
1) Because the language is fast enough to implement basic functionality in Julia itself, the distinction between Base Julia and additional packages is small. Opting to use 'just' the core makes less sense: the core is essentially a pre-compiled package.
2) The community is part of the language, so it should be taken into account as well.

Tim Holy

Nov 30, 2015, 3:20:33 PM
to julia...@googlegroups.com
If you don't want to figure out how to use the profiler, your next best bet is
to split out the pieces so you can understand where the bottleneck is. For
example:

function docount(io)
    wc = Dict{AbstractString,Int64}()
    for l in eachline(io)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end
    wc
end

@time open("somefile.tex") do io
docount(io)
end;
0.010617 seconds (27.70 k allocations: 1.459 MB)

vs

@time open("somefile.tex") do io
main(io)
end;
# < lots of printed output >
1.233154 seconds (330.59 k allocations: 10.829 MB, 1.53% gc time)

(I modified your `main` to take an io input.)

So it's the sorting and printing which is taking 99% of the time. Most of that
turns out to be the printing.
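One way to cut the printing cost (a sketch, not something benchmarked in this thread) is to batch the lines into an in-memory buffer and hand STDOUT the whole thing in one write:

```julia
# Sketch: accumulate all output lines in an IOBuffer, then write
# STDOUT once instead of once per line. (STDOUT and takebuf_array
# are the Julia 0.4-era names.)
function printcounts(v)
    out = IOBuffer()
    for t in v
        print(out, t.first, '\t', t.second, '\n')
    end
    write(STDOUT, takebuf_array(out))
end
```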

--Tim

Dan

Nov 30, 2015, 3:21:30 PM
to julia-users
Some points:
1) If you are timing the program by running it from the shell (i.e. including the whole Julia load time), that's not a good idea, for a bunch of reasons.
2) The printouts are also a little dubious to include in a benchmark.
3) Too short an input makes the test noisy.



Attila Zséder

Nov 30, 2015, 3:31:20 PM
to julia...@googlegroups.com
Hi,


The data I'm using is part of a (Hungarian) Wikipedia dump with 5M lines of text. On this data, Python runs for 65 seconds, C++ for 35 seconds, the Julia baseline for 340 seconds, and Julia with FastAnonymous.jl for 280 seconds. (See https://github.com/juditacs/wordcount#leaderboard for details.)

Dan:
I can use external packages; it's not a big issue. However, even with FastAnonymous the results weren't comparable to Python's.
The baseline python code I compare to is here:
https://github.com/juditacs/wordcount/blob/master/python/wordcount_py2.py

> 2) The community is part of the language, so it should be regarded when making considerations.
What do you mean by this?
My (our) purpose is not to judge one language as better/faster than another because it is faster at unoptimized word counting. So I don't want to make any judgements like that; this is just for fun. And even though it looks like _my_ Julia implementation of wc is not fast right now, I haven't lost interest in following what's going on with this language.


Your other points:
1) I do this with all the other languages as well. The test runs for about 30-300 seconds; if Julia load time or anything else takes a serious amount of time, then so be it. The test is not precise. I didn't include C++ compile time, for example, but it took less than a second. My feeling is that my naive implementation takes the time, not Julia load.
2) What if my test is about IO + dictionary storage? Then I have to include the printouts in my test.
3) I think a 5M-line text file is enough to avoid this noise.



Tim:
Yes, I did split the code like this, and with larger files it looked like dictionary manipulation (including hashing) took most of the time after sorting, and printing was less of an issue. But seeing your numbers, I should analyze this more precisely.


Thank you all for your help!

Attila

Dan

Nov 30, 2015, 3:38:16 PM
to julia-users
My suggestion would be to replace

    for t in sort(collect(wc), by=x -> (-x.second, x.first))
        println(t.first, "\t", t.second)
    end

with

customlt(a,b) = (b.second < a.second) ? true : b.second == a.second ? a.first < b.first : false

function main()
    :
    :
    for t in sort(collect(wc), lt=customlt)
        println(t.first, "\t", t.second)
    end
end

Dan

Nov 30, 2015, 4:20:54 PM
to julia-users
Using the `customlt` function for the sort order cut the time in half on a test file. So try:

customlt(a,b) = (b.second < a.second) ? true : b.second == a.second ? a.first < b.first : false

function main()
    wc = Dict{UTF8String,Int64}()
    for l in eachline(STDIN)
        for w in split(l)
            wc[w]=get(wc, w, 0) + 1
        end
    end

    v = collect(wc) 
    sort!(v,lt=customlt) # in-place sort saves a memory copy
    for t in v
        println("$(t.first)\t$(t.second)")
    end
end

main()

Dan

Nov 30, 2015, 4:29:28 PM
to julia-users
and replacing
        println("$(t.first)\t$(t.second)")

with
        @printf("%s\t%d\n",t.first,t.second)

also halves the print time (which might or might not make a big difference but definitely > 1 second)

Tim Holy

Nov 30, 2015, 4:35:00 PM
to julia...@googlegroups.com
For me, docount2 is about 1.7x faster than docount1 (on my old laptop, 28s vs
47s) for your "Hungarian Wikipedia" test dataset. We might want to implement
some of these tweaks in the standard library.

--Tim

import Base: start, next, done, eltype, readuntil

function docount1(io)
    wc = Dict{AbstractString,Int64}()
    for l in eachline(io)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end
    wc
end

type EachLn{T}
    stream::IO
end

start(itr::EachLn) = nothing
function done(itr::EachLn, nada)
    if !eof(itr.stream)
        return false
    end
    true
end
next{T}(itr::EachLn{T}, nada) = (readuntil(T, itr.stream, '\n'), nothing)
eltype{T}(::Type{EachLn{T}}) = T

function readuntil{T}(::Type{T}, s::IO, delim::Char)
    if delim < Char(0x80)
        data = readuntil(s, delim % UInt8)
        return T(data)
    end
    out = IOBuffer()
    while !eof(s)
        c = read(s, Char)
        write(out, c)
        if c == delim
            break
        end
    end
    T(takebuf_array(out))
end

function docount2(io)
    wc = Dict{SubString{UTF8String},Int64}()
    for l in EachLn{UTF8String}(io)
        for w in split(l)
            wc[w] = get(wc, w, 0) + 1
        end
    end
    wc
end



Stefan Karpinski

Nov 30, 2015, 4:45:34 PM
to Julia Users
We definitely need a fair amount of tweaking for string and I/O performance to be as good as Python or Perl. Thus far it hasn't been the primary focus of our performance development, but that's starting to change. If anyone encounters specific slow cases, opening issues is helpful.

Attila Zséder

Nov 30, 2015, 4:54:59 PM
to julia...@googlegroups.com

Hi,

following your suggestions:

- FastAnonymous without string type annotations made my code slower
- UTF8String annotations alone also made the code slower
- using both of them together resulted in the significant 30% improvement
- using Dan's customlt instead of my (FastAnonymous-ized) lambda function resulted in another 10% improvement
- using Tim's docount2 did speed up the reading, but only by about 10% compared to UTF8String-typed reading in docount1

If you think it's worthwhile, I can open an issue for this.


Attila

Judit Acs

Nov 30, 2015, 6:08:09 PM
to julia-users
Hi,

I'm the author of the repository and this whole thing started as a pet project and is still purely for fun.
Here is a blog post about it if you're interested: http://juditacs.github.io/2015/11/26/wordcount.html

Although the results are not yet included in the repository's leaderboard, I do run tests on an even bigger dataset, namely the full Hungarian Wikipedia, which is 65 million lines, but there I only include the fastest languages (C++, Go, Python2 and Java right now). I'm not sure the current Julia script would fit into 16GB of memory, given that the Python2 version uses 8.7GB when run on the full huwiki.

I merged getzdan's pull request and the new wordcount.jl is twice as fast as the previous one (340 s -> 175 s). Thank you very much for the improved solution.

Best,
Judit

James Gilbert

Dec 3, 2015, 12:35:50 PM
to julia-users
I don't understand this line in Tim's code:

  data = readuntil(s, delim % UInt8)

Is it the same as:

  data = readuntil(s, UInt8(delim))

? I guess its purpose is to select this method:

  readuntil(s::IOStream, delim::UInt8)

which is faster than:

  readuntil(s::IO, delim::Char)


Stefan Karpinski

Dec 3, 2015, 3:59:08 PM
to Julia Users
UInt8(delim) converts with a check to make sure that delim fits in the UInt8 type. (delim % UInt8) computes delim modulo 256 as a UInt8 – i.e. it just drops the high bits.
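The difference shows up with a character outside the 0-255 range; for example (a sketch using a character not mentioned in the thread):

```julia
# '€' is U+20AC (8364), which does not fit in a UInt8.
UInt8('€')    # throws InexactError: the checked conversion rejects it
'€' % UInt8   # 0xac: modular reduction keeps just the low 8 bits of 0x20ac
```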