How to efficiently count the number of lines in a file


Daniel Carrera

Aug 19, 2015, 6:10:40 AM
to julia-users
Hello,

I need to count the lines in a large number of UNIX text files, most of which are fairly large, and I need help coming up with an efficient line-count implementation. A naive approach like length(readlines("foo.txt")) is very slow, since it loads the whole file into memory just to count newlines. I imagine it should be possible to count the newline characters quickly, the way the "wc" command does, but I can't figure out how. Does anyone have any ideas?

Thanks for the help.

Daniel.

René Donner

Aug 19, 2015, 6:13:24 AM
to julia...@googlegroups.com
I guess you could access it using mmap and simply loop through the array:

http://docs.julialang.org/en/latest/stdlib/io-network/?highlight=mmap#memory-mapped-i-o

I'd be curious, are there even faster alternatives?

René Donner

Aug 19, 2015, 6:20:43 AM
to julia...@googlegroups.com
This should work:

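using Mmap  # needed on Julia 0.7 and later; in 2015 Mmap was available without it

function count_newlines_mmap(path)
    a = Mmap.mmap(path)    # maps the file as a Vector{UInt8}
    n = 0                  # start at 0: we count newline bytes, as `wc -l` does
    for i in 1:length(a)
        if a[i] == 0x0a    # 0x0a (decimal 10) is '\n'
            n += 1
        end
    end
    return n
end

@show count_newlines_mmap("test.txt")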

Avik Sengupta

Aug 19, 2015, 6:22:41 AM
to julia-users
You can't get much better than unix command line tools for this sort of thing. Any reason you can't use `wc` directly? I'd do this using run(`wc`...) from within Julia. 

Regards
-
Avik

Keith Campbell

Aug 19, 2015, 7:13:43 AM
to julia-users
You could try countlines().
Also, you likely want eachline() rather than readlines(). eachline() iterates over the file one line at a time instead of loading all of it into memory.
Cheers,
Keith
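
To make the two suggestions above concrete, here is a minimal sketch that compares countlines() with a manual loop over eachline(). The helper name count_with_eachline and the temporary sample file are assumptions made for illustration only.

```julia
# Hypothetical helper: count lines by streaming over the file with eachline,
# so only one line is held in memory at a time.
function count_with_eachline(path::AbstractString)
    n = 0
    for _ in eachline(path)
        n += 1
    end
    return n
end

# Small sample file so the example is self-contained.
sample_path, sample_io = mktemp()
write(sample_io, "Hello\nworld.\n")
close(sample_io)

n_builtin = countlines(sample_path)          # built-in counter
n_stream  = count_with_eachline(sample_path) # streaming loop
println("countlines: ", n_builtin, ", eachline loop: ", n_stream)
```

The loop is wrapped in a function so that the counter is a local variable; for real workloads countlines() is likely the better choice, since it avoids building a String for every line.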


Daniel Carrera

Aug 19, 2015, 7:35:39 AM
to julia-users
I tried running `wc` directly, but the run() command does not return the output of `wc`.

René Donner

Aug 19, 2015, 7:38:04 AM
to julia...@googlegroups.com
You can use

parse(Int,split(readall(`wc -l test.txt`))[1])

for that.
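
On later Julia releases readall is no longer available, and read(cmd, String) serves the same purpose. A minimal sketch of the same idea, assuming a Unix-like system with wc on the PATH; the wrapper name wc_lines is hypothetical:

```julia
# Hypothetical wrapper around the external `wc` tool.
# Assumes a Unix-like system where `wc` is on the PATH.
function wc_lines(path::AbstractString)
    out = read(`wc -l $path`, String)      # capture the command's stdout as a String
    return parse(Int, first(split(out)))   # first whitespace-separated field is the count
end
```

Calling wc_lines("test.txt") returns the count as an Int. Interpolating $path inside the backticks passes it as a single argument, so file names containing spaces do not need extra quoting.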

FYI, a small benchmark showed countlines to be 3x faster than run(wc..) and 4x faster than the mmap approach.

The code for countlines is interesting, a nice example of highly efficient Julia code:
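
The source referred to is not reproduced in this thread. As a rough illustration of the general technique only (reading fixed-size chunks into a reusable buffer and counting newline bytes), a sketch might look like the following. It is not the actual Base implementation; the name count_newlines and the 8 KiB default chunk size are assumptions.

```julia
# Hypothetical sketch, NOT the Base source: count '\n' bytes by reading
# the stream in fixed-size chunks into a single reusable buffer.
function count_newlines(io::IO; chunk_size::Int = 8192)
    buf = Vector{UInt8}(undef, chunk_size)
    n = 0
    while !eof(io)
        nread = readbytes!(io, buf, chunk_size)  # number of bytes actually read
        for i in 1:nread
            n += (buf[i] == 0x0a)                # 0x0a is '\n'; Bool adds as 0 or 1
        end
    end
    return n
end

# Convenience method: open the file, count, and close it again.
count_newlines(path::AbstractString; kwargs...) =
    open(io -> count_newlines(io; kwargs...), path)
```

Like wc -l, this counts newline bytes, so a final line without a trailing newline is not included in the total.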

Daniel Carrera

Aug 19, 2015, 7:43:37 AM
to julia-users
Thanks! I didn't know about countlines(). Interestingly, countlines() does not seem to include blank lines. That's not a problem for me, but it's good to be aware of:

--------------------------
$ cat > foo.txt
Hello      
world.
$ cat > bar.txt

Hello

World.
$ julia
...
julia> run(`wc -l foo.txt`)
2 foo.txt
julia> run(`wc -l bar.txt`)
4 bar.txt
julia> countlines("foo.txt")
2
julia> countlines("bar.txt")
2
--------------------------
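
The result above reflects the Julia version in use at the time. On recent Julia releases countlines counts end-of-line characters, so blank lines are included. A quick way to check the behaviour on your own version, using a temporary file with the same contents as bar.txt (the variable names are illustrative):

```julia
# Recreate bar.txt from the session above: two blank lines and two text lines.
bar_path, bar_io = mktemp()
write(bar_io, "\nHello\n\nWorld.\n")
close(bar_io)

n = countlines(bar_path)                       # built-in line count
newline_bytes = count(==(0x0a), read(bar_path)) # raw number of '\n' bytes, as wc -l would report

println("countlines: ", n, ", newline bytes: ", newline_bytes)
```

If the two numbers differ, the installed version skips blank lines the way the session above did.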

Thanks again.

Daniel.

Daniel Carrera

Aug 19, 2015, 7:46:28 AM
to julia-users
Thanks!

I did not realize that you could use readall() like that. I had been struggling to get the output out of run(). And thanks for the benchmark.

Cheers,
Daniel.

Yichao Yu

Aug 19, 2015, 7:49:42 AM
to Julia Users
On Wed, Aug 19, 2015 at 7:43 AM, Daniel Carrera <dcar...@gmail.com> wrote:
> Thanks! I didn't know about countlines(). Interestingly, countlines() does
> not seem to include blank lines. That's not a problem for me, but it's good
> to be aware of:

https://github.com/JuliaLang/julia/pull/11947