[erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second


Max Lapshin

Mar 23, 2012, 6:30:32 AM
to Erlang-Questions Questions
I need to load a large CSV file into memory very fast.

I've tried to use an Erlang parser, but my results were very bad (in fact
file:read_line is very slow), so I've tried to make a NIF for it.
The target speed is 1 microsecond per line.


My CSV has a very strict format: only numbers, no quoting, \n at the
end. I also moved the parsing of integers and dates into the NIF.

My results are here: http://github.com/maxlapshin/csv_reader and I get
only 15 microseconds per line: 4.5 seconds for a 300K-line CSV.

Currently I use fgets to read the file line by line. Maybe it is a
bad idea and I should use mmap or implement a 1MB read buffer?

Bengt Kleberg

Mar 23, 2012, 6:42:57 AM
to Erlang-Questions Questions
Greetings,

With a working NIF you probably do not need to consider file:read_file/1
plus binary:split/2 on <nl>. Otherwise, that is an alternative to
read_line/1.
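
A minimal sketch of that alternative, for reference (assuming the whole file fits in memory; the function name is mine):

read_lines(Path) ->
    {ok, Bin} = file:read_file(Path),
    %% the sub-binaries produced by split share the underlying buffer,
    %% so the file contents are not copied again
    binary:split(Bin, <<"\n">>, [global]).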


bengt

Ulf Wiger

Mar 23, 2012, 7:29:15 AM
to Max Lapshin, Erlang-Questions Questions

open_port({spawn, "/bin/cat " ++ File}, [{line, MaxLen}, binary]) 

will pour the file, one line at a time, into your message queue. :)

Wicked fast, but no flow control.
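
For reference, the receiving side could look roughly like this ({eol, _}/{noeol, _} is the documented port message format for the {line, _} option; exit_status is added here so end-of-file can be detected):

read_lines(File, MaxLen) ->
    Port = open_port({spawn, "/bin/cat " ++ File},
                     [{line, MaxLen}, binary, exit_status]),
    collect(Port, []).

collect(Port, Acc) ->
    receive
        {Port, {data, {eol, Line}}} ->
            collect(Port, [Line | Acc]);
        {Port, {data, {noeol, Part}}} ->
            %% the line exceeded MaxLen; a real reader would stitch parts together
            collect(Port, [Part | Acc]);
        {Port, {exit_status, _}} ->
            lists:reverse(Acc)
    end.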

Fredrik Svahn made a flow-control hack for stdin back in 2008, but I don't know (a) if that's applicable to you or (b) if it ever made it into the OTP source somehow.


Otherwise, opening the file in [raw, binary] mode and using re:split() ought to work reasonably well.
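
That variant might look roughly like this (a sketch that reads the whole file in one go rather than in chunks):

{ok, Fd}  = file:open(File, [read, raw, binary]),
{ok, Bin} = file:read(Fd, filelib:file_size(File)),
ok        = file:close(Fd),
Lines     = re:split(Bin, <<"\r?\n">>, [{return, binary}]).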

file:read_line() is indeed dreadfully slow, but is OTOH extremely nice in distributed embedded systems, as it supports IO redirection across nodes.

BR,
Ulf W

Gordon Guthrie

Mar 23, 2012, 8:54:00 AM
to Max Lapshin, Erlang-Questions Questions
Max

There is a csv parser for RFC 4180 compliant csv files which is now
being maintained by Eric Merritt:
https://github.com/afiniate/erfc_parsers/tree/master/src

Gordon

--
Gordon Guthrie
CEO hypernumbers

http://hypernumbers.com
t: hypernumbers
+44 7776 251669

Tim Watson

Mar 23, 2012, 8:55:51 AM
to Gordon Guthrie, Erlang-Questions Questions
On 23 Mar 2012, at 12:54, Gordon Guthrie wrote:

> Max
>
> There is a csv parser for RFC 4180 compliant csv files which is now
> being maintained by Eric Merritt:
> https://github.com/afiniate/erfc_parsers/tree/master/src
>

This appears to read the whole file into memory, so probably not very space efficient for dealing with large files.

Tomasz Maciejewski

Mar 23, 2012, 9:03:28 AM
to Max Lapshin, Erlang-Questions Questions
On 23 March 2012 11:30, Max Lapshin
<max.l...@gmail.com> wrote:

> Currently I use fgets to read line by line from file. Maybe it is a
> bad idea and I should use mmap or implement 1MB buffer for read?

mmap is the fastest way to read lines if you don't much care about portability.

--
Tomasz Maciejewski

Max Lapshin

Mar 23, 2012, 9:10:53 AM
to Gordon Guthrie, Erlang-Questions Questions
On Fri, Mar 23, 2012 at 4:54 PM, Gordon Guthrie <gor...@hypernumbers.com> wrote:
> Max
>
> There is a csv parser for RFC 4180 compliant csv files which is now
> being maintained by Eric Merritt:
> https://github.com/afiniate/erfc_parsers/tree/master/src
>

It is great, but:

1) it doesn't support headers, and usually the first line is a header
2) it is very strict about the structure of the CSV. The structure may
change, so it is impossible to tie to the index of a column, only to its name
3) it is incredibly slow: NIF: 4812 ms, RFC: 38371 ms,
i.e. roughly 8 times slower than my code.

Tim Watson

Mar 23, 2012, 1:28:49 PM
to Max Lapshin, Erlang-Questions Questions
The problem doesn't appear to be anything to do with the speed of read_line, but rather one of binary splitting/processing. Consider this example, which simply uses file:read_line to get a list of lines from the 300k csv file:

t4@malachi:csv_reader $ ./benchmark.erl example.csv
./benchmark.erl:23: Warning: type transform_fun() is unused
./benchmark.erl:149: Warning: variable 'Pattern' is unused
read_line time: 1365
read size: 300000
t4@malachi:csv_reader $

But just using binary:split/3 on the individual lines, without even processing the column cells, slows things down considerably:

t4@malachi:csv_reader $ ./benchmark.erl example.csv
./benchmark.erl:23: Warning: type transform_fun() is unused
read_line time: 12654
read size: 300000
t4@malachi:csv_reader $

This is rather unfortunately slow, as binary processing is one of the things that Erlang is supposed to be exceptionally good at. The parsing code (which is part of a larger example I was playing with) boils down to these functions (with a few setup functions and record definitions omitted for brevity):

main([Path]) ->
    {ok, Fd} = file:open(Path, [raw, binary, {read_ahead, 1024 * 1024}]),
    try
        T1 = erlang:now(),
        Res = parse(make_parser(Fd), start),
        T2 = erlang:now(),
        io:format("read_line time: ~p~n", [timer:now_diff(T2, T1) div 1000]),
        io:format("read size: ~p~n", [length(Res)])
    catch
        Ex:R ->
            io:format("~p~n", [erlang:get_stacktrace()]),
            throw({Ex, R})
    after
        file:close(Fd)
    end.

parse(P=#csv_parser{ iodevice=Fd, header=true }, start) ->
    file:read_line(Fd),
    parse(P, []);
parse(P=#csv_parser{ iodevice=Fd,
                     delimiter=Pattern }, Acc) ->
    case file:read_line(Fd) of
        {ok, Data} ->
            % Record = process(P, binary:split(Data, Pattern, [global])),
            Record = binary:split(Data, Pattern, [global]),
            parse(P, [Record|Acc]);
        eof ->
            Acc;
        Other ->
            throw(Other)
    end.

***************************

So it doesn't look like file:read_line is really the cause of the slowness. I've also tried several other processing examples (to actually generate the records and convert to floats etc) and of course these just add processing time. I do wonder if there is a more efficient manner of splitting the line-oriented binaries and processing the data, but I am currently assuming that binary:split/3 is a pretty efficient mechanism. I might investigate this a little further at some point, if I get time.

Cheers,

Tim

Max Lapshin

Mar 23, 2012, 1:31:45 PM
to Tim Watson, Erlang-Questions Questions
On Fri, Mar 23, 2012 at 9:28 PM, Tim Watson <watson....@gmail.com> wrote:
> The problem doesn't appear to be anything to do with the speed of read_line, but rather one of binary splitting/processing instead. Consider this example, which simply uses file:read_line to get a list of lines from the 300k csv file:
>
> read_line time: 1365
> read size: 300000

But to be honest, 1365 milliseconds is 10 times slower than the plain C variant.

> t4@malachi:csv_reader $
>
> But just using binary:split/3 on the individual lines, without even processing the column cells, slows down considerably:
>
> t4@malachi:csv_reader $ ./benchmark.erl example.csv
> ./benchmark.erl:23: Warning: type transform_fun() is unused
> read_line time: 12654

And naive splitting in C is 3 times faster. Can you run my code
on your machine so we can normalize the numbers?

Tim Watson

Mar 23, 2012, 6:11:39 PM
to Max Lapshin, Erlang-Questions Questions
So if you are willing to lose some space efficiency (which you would with mmap anyway) then reading the entire binary into memory is a lot faster:

t4@malachi:csv_reader $ ./benchmark2.erl example.csv
./benchmark2.erl:13: Warning: variable 'Bin' is unused
read_file time: 298
t4@malachi:csv_reader $

So now the only question is: can you split it in acceptable time!? I've started playing with this and can chunk up the file into lines in around 729ms, but now it's a question of splitting up the lines into fields. What I'm doing up to now is still sequential, of course.

As this problem actually appears to be cpu bound rather than io bound, we could start looking at how to break up the problem most effectively to utilise many cores. It's not going to be anywhere near fast enough for your needs I suspect, but I'd like to know if it's possible to squeeze a 300k record file through in a couple of seconds with a bit of careful work.

Although I haven't finished with results collection and transforming the split binaries into records, I'm hoping this is achievable:

t4@malachi:csv_reader $ ./csv_reader.erl example.csv
./csv_reader.erl:10: Warning: record evt is unused
./csv_reader.erl:25: Warning: type transform_fun() is unused
./csv_reader.erl:104: Warning: variable 'Size' is unused
read_chunk time: 2765
t4@malachi:csv_reader $

Which (slightly faster than before) is based on the following (incomplete) code:

#!/usr/bin/env escript
%%! -pa ebin

-mode(compile).

-record(evt, {
key,date,time,
l_1,l_2,l_3,l_4,l_5,l_6,l_7,l_8,l_9,l_10,
l_11,l_12,l_13,l_14,l_15,l_16,l_17,l_18,l_19,l_20,
l_21,l_22,l_23,l_24,l_25,l_26,l_27,l_28,l_29,l_30,
l_31,l_32,l_33,l_34,l_35,l_36,l_37,l_38,l_39,l_40,
last
}).

-record(csv_field, {
name :: atom() | string(),
type = 'binary' :: atom(), %% TODO: use erl_types:type()
converter :: fun((binary()) -> term())
}).

-type transform_fun() :: fun((term()) -> term()).

-record(csv_parser, {
    io_device               :: file:io_device(),
    collector               :: pid(),
    input_file              :: file:name(),
    header                  :: boolean(),
    strict = true           :: boolean(),
    size                    :: non_neg_integer(),
    fields                  :: [#csv_field{} | constant()],
    buffer_size = 16 * 1024 :: non_neg_integer(),
    transform = fun(E) -> E end :: transform_fun(),
    delimiter = binary:compile_pattern(<<",">>) :: binary:cp(),
    supported_newlines = binary:compile_pattern([<<"\n">>, <<"\r\n">>]) :: binary:cp()
}).

main([Path]) ->
    %% TODO: don't be so naive
    try
        T1 = erlang:now(),
        chunker(make_parser(Path)),
        T2 = erlang:now(),
        io:format("read_chunk time: ~p~n", [timer:now_diff(T2, T1) div 1000])
    catch
        Ex:R ->
            io:format("~p~n", [erlang:get_stacktrace()]),
            throw({Ex, R})
    % after
    %     file:close(Fd)
    end.

make_parser(F) ->
    %% TODO: write a function that takes a 'typed record' and
    %% generates the csv_field mappings for us using erl_types or whatever...
    Fields = [
        #csv_field{ name="KEY" },
        #csv_field{ name="Date" }, %% NB: we will worry about converting dates later on...
        #csv_field{ name="Time" }, %% NB: we will worry about converting time(s) later on...
        #csv_field{ name="l_1", type=float }, #csv_field{ name="l_2", type=float },
        #csv_field{ name="l_3", type=float }, #csv_field{ name="l_4", type=float },
        #csv_field{ name="l_5", type=float }, #csv_field{ name="l_6", type=float },
        #csv_field{ name="l_7", type=float }, #csv_field{ name="l_8", type=float },
        #csv_field{ name="l_9", type=float }, #csv_field{ name="l_10", type=float },
        #csv_field{ name="l_11", type=float }, #csv_field{ name="l_12", type=float },
        #csv_field{ name="l_13", type=float }, #csv_field{ name="l_14", type=float },
        #csv_field{ name="l_15", type=float }, #csv_field{ name="l_16", type=float },
        #csv_field{ name="l_17", type=float }, #csv_field{ name="l_18", type=float },
        #csv_field{ name="l_19", type=float }, #csv_field{ name="l_20", type=float },
        #csv_field{ name="l_21", type=float }, #csv_field{ name="l_22", type=float },
        #csv_field{ name="l_23", type=float }, #csv_field{ name="l_24", type=float },
        #csv_field{ name="l_25", type=float }, #csv_field{ name="l_26", type=float },
        #csv_field{ name="l_27", type=float }, #csv_field{ name="l_28", type=float },
        #csv_field{ name="l_29", type=float }, #csv_field{ name="l_30", type=float },
        #csv_field{ name="l_31", type=float }, #csv_field{ name="l_32", type=float },
        #csv_field{ name="l_33", type=float }, #csv_field{ name="l_34", type=float },
        #csv_field{ name="l_35", type=float }, #csv_field{ name="l_36", type=float },
        #csv_field{ name="l_37", type=float }, #csv_field{ name="l_38", type=float },
        #csv_field{ name="l_39", type=float }, #csv_field{ name="l_40", type=float },
        #csv_field{ name="Last" }
    ],
    #csv_parser{
        input_file=F,
        header=true,
        strict=true,
        transform=fun(L) -> list_to_tuple(['evt'] ++ L) end,
        size=length(Fields),
        fields=Fields
    }.

%parse(P=#csv_parser{ iodevice=Fd, buffer_size=Size }) ->
%    parse(P, file:read(Fd, Size), 0, undefined, []).

%% the butcher process chops up the binary into chunks
chunker(P=#csv_parser{ input_file=F, buffer_size=Size }) ->
    {ok, Fd} = file:open(F, [raw, binary, {read_ahead, 1024 * 1024}]),
    CPid = spawn(fun loop/0),
    read(P#csv_parser{ io_device=Fd, collector=CPid }, start, 0, <<>>, []).

loop() ->
    receive
        {ok, _Pid, _X} ->
            % io:format("Got ~p parts!~n", [length(X)])
            loop()
    end.

read(P=#csv_parser{ io_device=Fd, buffer_size=Size }, start, Idx, Rem, Pids) ->
    read(P, file:read(Fd, Size), Idx, Rem, Pids);
read(P=#csv_parser{ io_device=Fd, buffer_size=Size,
                    delimiter=Delim, collector=Parent,
                    supported_newlines=NL },
     {ok, Chunks}, Idx, Rem, Pids) ->
    % io:format("read ~p characters~n", [size(Chunk)]),
    Lines = binary:split(Chunks, NL, [global]),
    {Data, [Remaining]} = lists:split(length(Lines) - 1, Lines),
    Pid = spawn(fun() ->
                    [H|T] = Data,
                    Sz = size(H),
                    %% TODO: fix this code so it properly deals with
                    %% left over data between processed segments
                    %% i.e., where is \n in relation to Rem ....
                    WorkingSet = case size(Rem) of
                                     X when X < Sz ->
                                         [<<Rem/binary, H/binary>>|T];
                                     _ ->
                                         [Rem|Data]
                                 end,
                    [ return(Parent, binary:split(L, Delim, [global])) ||
                        L <- WorkingSet, is_binary(L) ]
                end),
    read(P, file:read(Fd, Size), Idx + 1, Remaining, [{Pid, Idx}|Pids]);
read(_P, eof, _, _Rem, _Pids) ->
    %% TODO: in strict mode, fail unless size(Rem) == 0
    ok.

return(Collector, Matches) ->
    Collector ! {ok, self(), Matches}.

collect_results(Results, []) ->
    array:to_list(Results);
collect_results(Results, Pids) ->
    receive
        {ok, Pid, Data} ->
            {value, {Pid, Idx}, RemainingPids} = lists:keytake(Pid, 1, Pids),
            collect_results(array:set(Idx, Data, Results), RemainingPids);
        Other ->
            throw(Other)
    end.

process(P=#csv_parser{ strict=Strict, fields=Fields }, Data) ->
    case lists:foldl(fun process_field/2, {Fields, [], Strict}, Data) of
        {[], Acc, _} ->
            (P#csv_parser.transform)(Acc);
        {MissingFields, Result, _} when is_list(MissingFields) ->
            case Strict of
                true ->
                    throw({parse_error, {unexpected_eol, MissingFields}});
                false ->
                    Result
            end
    end.

%% NB: every clause must return the {Fields, Acc, Strict} accumulator shape,
%% otherwise the fold crashes on the next element when extra fields are ignored
process_field(_E, {[], Acc, false}) ->
    {[], Acc, false};
process_field(E, {[], _Acc, true}) ->
    throw({parse_error, {expected_eol, E}});
process_field(E, {[#csv_field{ type=float }|Rest], Acc, Strict}) ->
    {Rest, [list_to_float(binary_to_list(E))|Acc], Strict};
process_field(E, {[#csv_field{ type=binary }|Rest], Acc, Strict}) ->
    {Rest, [E|Acc], Strict};
process_field(E, {[#csv_field{ type=list }|Rest], Acc, Strict}) ->
    {Rest, [binary_to_list(E)|Acc], Strict}.

Tim Watson

Mar 23, 2012, 6:15:03 PM
to Max Lapshin, Erlang-Questions Questions
Forgot to cc the list....

On 23 Mar 2012, at 19:27, Tim Watson wrote:

> On 23 Mar 2012, at 19:19, Tim Watson wrote:


>
>> On 23 Mar 2012, at 17:31, Max Lapshin wrote:
>>
>>> On Fri, Mar 23, 2012 at 9:28 PM, Tim Watson <watson....@gmail.com> wrote:
>>>> The problem doesn't appear to be anything to do with the speed of read_line, but rather one of binary splitting/processing instead. Consider this example, which simply uses file:read_line to get a list of lines from the 300k csv file:
>>>>
>>>> read_line time: 1365
>>>> read size: 300000
>>>
>>> But to be honest 1365 milliseconds is 10 times slower than plain C variant.
>>

>> Sure I appreciate that.


>>
>>>
>>>> t4@malachi:csv_reader $
>>>>
>>>> But just using binary:split/3 on the individual lines, without even processing the column cells, slows down considerably:
>>>>
>>>> t4@malachi:csv_reader $ ./benchmark.erl example.csv
>>>> ./benchmark.erl:23: Warning: type transform_fun() is unused
>>>> read_line time: 12654
>>>
>>> And naive splitting in C gives 3 times faster. Can you launch my code
>>> on your machine to normalize numbers?
>>

>> t4@malachi:csv_reader $ ./csv_bench.erl example.csv
>> Load csv_reader: ok
>> Load time: 5300
>> t4@malachi:csv_reader $ ./csv_bench.erl example.csv
>> Load csv_reader: ok
>> Load time: 5213
>> t4@malachi:csv_reader $ evm info
>> R15B compiled for i386-apple-darwin10.8.0, 64bit
>>
>
> And again with a lower erts version:
>
> t4@malachi:csv_reader $ ./csv_bench.erl example.csv
> Load csv_reader: ok
> Load time: 5518
> t4@malachi:csv_reader $ evm info
> R14B01 compiled for i386-apple-darwin10.5.0, 64bit
> t4@malachi:csv_reader $
>
> Hardware/OS profile: Apple Macbook Pro running OS-X 10.6.8 (Snow Leopard), 2.8GHz Intel Core2 Duo, 8Gb RAM.

Toby Thain

Mar 23, 2012, 11:07:02 PM
to erlang-questions
On 23/03/12 9:03 AM, Tomasz Maciejewski wrote:
> On 23 March 2012 11:30, Max Lapshin
> <max.l...@gmail.com> wrote:
>> Currently I use fgets to read line by line from file. Maybe it is a
>> bad idea and I should use mmap or implement 1MB buffer for read?
>
> mmap is the fastest way to read lines is you don't much care about portability.
>

Even Windows offers mmap functionality. Of course machines without an MMU
will not, but that doesn't seem a likely venue for this operation (they
probably won't have fast disks and network either :)

--Toby

Max Lapshin

Mar 24, 2012, 7:36:11 AM
to Tim Watson, Erlang-Questions Questions
On Sat, Mar 24, 2012 at 2:11 AM, Tim Watson <watson....@gmail.com> wrote:
> So if you are willing to loose some space efficiency (which you would with mmap anyway) then reading the entire binary into memory is a lot faster:
>

Tim, it looks like your solution is twice as fast due to using threads. Am I right?

Masklinn

Mar 24, 2012, 8:00:21 AM
to Toby Thain, erlang-questions
On 2012-03-24, at 04:07 , Toby Thain wrote:
> On 23/03/12 9:03 AM, Tomasz Maciejewski wrote:
>> On 23 March 2012 11:30, Max Lapshin
>> <max.l...@gmail.com> wrote:
>>> Currently I use fgets to read line by line from file. Maybe it is a
>>> bad idea and I should use mmap or implement 1MB buffer for read?
>>
>> mmap is the fastest way to read lines is you don't much care about portability.
>>
>
> Even Windows offers mmap functionality.

Yep; although the exact options may differ, all modern OSes can memory-map
files, so there really is no reason *not* to use it. Especially when
running on a 64-bit OS, where there is no risk of mmapping a file bigger than
VMEM.

But due to Erlang semantics, I believe mmap would have to be supported
by the VM itself to work correctly with an ideal API (so that it can be
integrated well with binary handling, e.g. as a special kind of refc
binary); otherwise you have to use a file-type interface à la emmap:
https://github.com/krestenkrab/emmap

Tim Watson

Mar 24, 2012, 12:37:35 PM
to Max Lapshin, Erlang-Questions Questions
On 24 Mar 2012, at 11:36, Max Lapshin wrote:

> On Sat, Mar 24, 2012 at 2:11 AM, Tim Watson <watson....@gmail.com> wrote:
>> So if you are willing to loose some space efficiency (which you would with mmap anyway) then reading the entire binary into memory is a lot faster:
>>
>
> Tim, looks like your solutions is twice faster due to using threads. Am I right?

Yes, although I still haven't quite hit the sweet spot with this approach. You can sequentially chunk your way through the file (using file:read/2 on a file opened with [raw, binary]) using 64k chunks and break these up into lines in around 533ms. Breaking up the 300k individual lines on the comma and collecting the results requires a bit more thought about how best to split the work up, so currently there's not really that much of an improvement. I am playing around with this to see what's possible though, as it's an interesting problem.

Max Lapshin

Mar 24, 2012, 12:45:29 PM
to Tim Watson, Erlang-Questions Questions
I'm trying to use file:read from Erlang combined with a compiled scanf-style pattern or some regex. There are two ideas:
1) combining two binary:split passes means walking the memory twice. I think it is a good idea to parse the memory once
2) the intermediate creation of binaries for all those floats (12 million of them in this example) is also evil. Perhaps line parsing should be combined with float extraction, in the same manner as decode_packet does.
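
A sketch of what idea 2 could look like: scan the digits straight off the binary and hand back the number together with the rest of the input, so no intermediate sub-binary or list is built per float (simplified to non-negative numbers; the function names are mine):

to_number(Bin) -> int_part(Bin, 0).

int_part(<<D, Rest/binary>>, Acc) when D >= $0, D =< $9 ->
    int_part(Rest, Acc * 10 + (D - $0));
int_part(<<$., Rest/binary>>, Acc) ->
    frac_part(Rest, Acc, 0.1);
int_part(Rest, Acc) ->
    {Acc, Rest}.  %% integer field; Rest begins at the delimiter

frac_part(<<D, Rest/binary>>, Acc, Scale) when D >= $0, D =< $9 ->
    frac_part(Rest, Acc + (D - $0) * Scale, Scale / 10);
frac_part(Rest, Acc, _Scale) ->
    {Acc, Rest}.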

Tim Watson

Mar 24, 2012, 6:14:37 PM
to Max Lapshin, Erlang-Questions Questions
On 24 Mar 2012, at 16:45, Max Lapshin wrote:

> I'm trying to use file:read from erlang combined with compiled scanf pattern or some regex. There are two ideas:
> 1) combine two binary:split will give double memory walking. I think it is a good idea to parse memory once

Actually there's a double problem here. In a CSV file it is possible that a delimiter might be present 'inside' one of the fields, provided the contents of the cell are in quotes. For example:

>> 1,2,16,John Snow,"Flat A, 128, Watling Street", etc....

So just doing binary:split(Bin, binary:compile_pattern(<<",">>), [global]) isn't enough. I will take a look at how binary:split/3 is implemented, but I'm starting to think that you really do need to drop into C to make this fast enough. Maybe for your data set you don't care about this possibility (because you have some control over the inputs) but for a general purpose CSV parser, this has to be handled, and it requires too much introspection of the data to be efficient over 70MiB and more if written in pure Erlang.
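
For illustration, a minimal two-state scanner for exactly this case (it ignores RFC 4180's doubled-quote escaping and copies bytes instead of using sub-binaries; a sketch, not anyone's actual implementation):

split_fields(Line) -> field(Line, <<>>, []).

field(<<$", Rest/binary>>, Acc, Fs) -> quoted(Rest, Acc, Fs);
field(<<$,, Rest/binary>>, Acc, Fs) -> field(Rest, <<>>, [Acc | Fs]);
field(<<C, Rest/binary>>, Acc, Fs)  -> field(Rest, <<Acc/binary, C>>, Fs);
field(<<>>, Acc, Fs)                -> lists:reverse([Acc | Fs]).

%% inside quotes, a comma is ordinary data
quoted(<<$", Rest/binary>>, Acc, Fs) -> field(Rest, Acc, Fs);
quoted(<<C, Rest/binary>>, Acc, Fs)  -> quoted(Rest, <<Acc/binary, C>>, Fs).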

With file:read_line and binary:split, I can get the data out (without float/int conversion) sequentially in about 10 seconds. It's not particularly useful to parallelise the binary:split work, as for the most part it isn't taking a long time per operation but performs a vast number of them, which slows things down. Spawning a process for each line isn't really very smart, as even in SMP mode the amount of churn due to scheduling seems likely to be prohibitive. Even if you hit the sweet spot with the right number of worker processes (1 per cpu +/- 1 for example) you've got the additional work of reconstituting the data in the correct order, which takes up a lot of time.

Now if I manually chunk my way through the file, using an optimal segment size (on my mac, 16 * 1024 seems to be best) with a sizeable read_ahead of 1024 * 1024, then I can get through the whole file and binary:split(Bin, CommaPattern, [global]) on each chunk, in an average of 2300ms. Of course this is too slow, is ignoring newlines at the moment (unlike the other tests) and I've removed the code for dealing with fields that sit across the segment boundaries (which slows things down also):

read(P=#csv_parser{ io_device=Fd, buffer_size=Size }, start, Idx, Rem) ->
    read(P, file:read(Fd, Size), Idx, Rem);
read(P=#csv_parser{ io_device=Fd, buffer_size=Size,
                    delimiter=Delim, collector=Collector,
                    supported_newlines=NL },
     {ok, Chunks}, Idx, Rem) ->
    binary:split(Chunks, Delim, [global]),
    read(P, file:read(Fd, Size), Idx + 1, <<>>);
read(_P, eof, _, _Rem) ->
    %% in strict mode, fail unless size(Rem) == 0
    ok.

>>>>


t4@malachi:csv_reader $ ./csv_reader.erl example.csv

Starting read....
read_chunk time: 2376
t4@malachi:csv_reader $

Now if I switch the split to use binary:matches instead, and look for either ',' or a newline, we get a big performance improvement:

t4@malachi:csv_reader $ ./csv_reader.erl example.csv

Starting read....
read_chunk time: 953
t4@malachi:csv_reader $

That's getting much faster, *but* we're not doing any hard work of parsing yet and we're not dealing with 'cross-segment' parts, which I'll look at next.

> 2) intermediate creation of binaries for all those floats (12 millions of them in this example) is also evil. Perhaps line parsing should be combined with float extraction in the same manner as decode_packet does.


Like I said I think that with the approach I mentioned above (using nicely segmented file:read calls and binary:matches) we've got the i/o subsystem and splitting of fields to go as fast as they're likely to go without dropping into C. Now we get into the more cpu intensive parts - taking the {Location, Size} data from binary:matches and combined with your segment Idx working out the correct offset to use in calling binary:part, plus reconstituting stuff that sits across two segments and then finally converting binary to other data types for the final results. If I could make these parts happen in 2 seconds to get combined throughput of 3 seconds I'd be pleased, but I'm not sure it'll be possible. And even then, your desired time of 1 second isn't looking at all likely for even this 300k record file, let alone a larger 1million line file.

It is possible that applying some parallelism to this will help, but the timings we're after are so small that I'm not sure. Spawning per split is of course a waste of time, but it is possible that having a worker process per core that does binary:matches and parses might help. The difficulty here is that you have to deal with the data sitting across two segments before you can pass it off to a worker, because you need the previous 'remainder' if any - I suppose you could 'pass this in' along with the data. If this could be made to work efficiently, then you might also benefit from a parallel collector process taking the completed/parsed records from the workers and storing them so they're ready for retrieval. Again, I'm not convinced all of this could be done in pure Erlang in less than 3 - 4 seconds overall. Perhaps putting ets into the mix might provide some speedup, as I would guess that a shared table can be atomically accessed faster than the processes can exchange data
via a mailbox.
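
A sketch of the ets variant (parse_line/1 is a hypothetical per-line parser, and the assumption that this beats mailbox traffic is untested here):

%% Tab = ets:new(csv_rows, [set, public, {write_concurrency, true}]),
%% then give each worker a disjoint range of row indices, so inserts never
%% contend on the same key and rows can be read back in order afterwards
worker(Tab, StartIdx, Lines) ->
    lists:foldl(fun(Line, I) ->
                        ets:insert(Tab, {I, parse_line(Line)}),
                        I + 1
                end, StartIdx, Lines).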

I'm going to see how fast this can be made to go in pure Erlang (allowing for ETS if it helps) just as a fun exercise. I don't know how much spare time I'll have to do it though and I'm convinced a solution written in C is the right answer for 1million records per second, and to do that a segment at a time approach that does almost all the work in C including all the type conversion is probably the way to go. In that case I would probably do the whole i/o part in C too, using a similar series of buffered read operations rather than line oriented fgets calls or things like that.

I do not think that mmap or even reading all the file into memory will be of any use, as you mentioned earlier, the problem is really cpu bound.

Dmitry Kolesnikov

Mar 24, 2012, 6:32:24 PM
to Tim Watson, Max Lapshin, erlang-questions
Hello,

You have raised a freaking interesting issue!

I have decided to invest my time here and share the binary parsing techniques I use.
In a nutshell, I was able to parse a line in 2.17 microseconds using a native Erlang implementation. I have tried to productize the code and put it here:

https://github.com/fogfish/csv

The algorithm is very simple
1. load a whole file into memory
2. shard the file into N chunks
3. parse each chunk in parallel, parsing is based on binary scan (no binary:split, etc)
4. merge results.
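
For readers following along, steps 1-4 could be sketched like this (this is not Dmitry's actual code; parse_chunk/1 stands in for the per-chunk scanner, shard/2 snaps each cut to the next newline so no row straddles two workers, and the missing-final-newline case is elided):

run(Path, N) ->
    {ok, Bin} = file:read_file(Path),                       %% 1. whole file in memory
    Chunks = shard(Bin, byte_size(Bin) div N),              %% 2. newline-aligned chunks
    Self = self(),
    Pids = [spawn(fun() -> Self ! {self(), parse_chunk(C)} end)
            || C <- Chunks],                                %% 3. parse in parallel
    lists:append([receive {P, Rows} -> Rows end || P <- Pids]).  %% 4. merge in order

shard(Bin, Sz) when byte_size(Bin) =< Sz ->
    [Bin];
shard(Bin, Sz) ->
    {Pos, 1} = binary:match(Bin, <<"\n">>, [{scope, {Sz, byte_size(Bin) - Sz}}]),
    Len = Pos + 1,
    <<Chunk:Len/binary, Rest/binary>> = Bin,
    [Chunk | shard(Rest, Sz)].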

The test was run on Mac Mini server with R15B
Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]

Since you have not shared the details of the csv file, I used the following setups:
* set.txt contains 300K rows, each row has 16 fields of 3 hex digits each
* set2.txt contains 300K rows, each row has 48 fields of 3 hex digits each

Here is the results:
csv_example:run("priv/set.txt", 80).
{{lines,300000},
{size,18000000},
{read_ms,132.269},
{parse_ms,652.865},
{line_us,2.1762166666666665}}

2> csv_example:run("priv/set2.txt", 80).
{{lines,300000},
{size,54000000},
{read_ms,204.259},
{parse_ms,1860.286},
{line_us,6.2009533333333335}}

- Dmitry

Tim Watson

Mar 24, 2012, 10:19:32 PM
to Dmitry Kolesnikov, erlang-questions
On 24 Mar 2012, at 22:32, Dmitry Kolesnikov wrote:

> Hello,
>
> You have raise a freaking interesting issue!
>
> I have decided to invest my time here and share binary parsing techniques used by me.
> In the nutshell, I was capable to parse a line in 2.17 micro seconds using native erlang implementation. I have tried to productize code and put it here:
>
> https://github.com/fogfish/csv
>
> The algorithm is very simple
> 1. load a whole file into memory
> 2. shard the file into N chunks
> 3. parse each chunk in parallel, parsing is based on binary scan (no binary:split, etc)
> 4. merge results.
>

Thanks for sharing this, it sounds great. I was under the impression from reading the documentation that using the binary module to split everything up would be quicker than hand coding.

> The test was run on Mac Mini server with R15B
> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]
>
> Since, you have not shared a details of csv file I used the following setups:
> * set.txt contains 300K rows, each row has 16 fields of 3 hex digit each
> * set2.txt contains 300K rows, each row has 48 fields of 3 hex digit each
>

This is a bit different from the input Max is trying to work with, isn't it? Can you try with his generated example.csv? I tried doing this and your output seems to be a small list of numbers, so I'm not sure what's going wrong.

13> csv_example:run("example.csv", 80).
total parse time: 5897.321
Result =
[3711,3750,3750,3751,3751,3750,3750,3749,3750,3751,3751,3751,3750,3752,3751,
3750,3750,3751,3751,3751,3750,3750,3751,3751,3750,3751,3751,3750,3753,3751,
3749,3750,3751,3751,3750,3751,3750,3750,3751,3751,3750,3751,3751,3750,3750,
3750,3751,3751,3751,3751,3751,3751,3750,3751,3751,3751,3749,3751,3749,3751,
3751,3750,3751,3751,3750,3752,3750,3751,3750,3751,3750,3751,3749,3750,3750,
3751,3750,3749,3750,3750]


> Here is the results:
> csv_example:run("priv/set.txt", 80).
> {{lines,300000},
> {size,18000000},
> {read_ms,132.269},
> {parse_ms,652.865},
> {line_us,2.1762166666666665}}
>
> 2> csv_example:run("priv/set2.txt", 80).
> {{lines,300000},
> {size,54000000},
> {read_ms,204.259},
> {parse_ms,1860.286},
> {line_us,6.2009533333333335}}
>

I changed the example code a little bit, to account for the fact that we consider the file:read_file and the parsing all together in all our other examples. I get very different results, although I'm admittedly (probably?) on a less powerful machine:

4@malachi:csv $ erl -pa src -pa priv
Erlang R14B01 (erts-5.8.2) [source] [64-bit] [smp:2:2] [rq:2] [async-threads:0] [hipe] [kernel-poll:false]

Eshell V5.8.2 (abort with ^G)
1> csv_example:run("example.csv", 80).
{{lines,300001},
{size,78599457},
{parse_ms,5727.757},
{line_us,19.09245969180103}}
2>

Tim Watson

Mar 24, 2012, 10:26:54 PM
to Tim Watson, erlang-questions

On further inspection, I can see that this happens because the fun you pass to the parser (to handle the {line, _} event) is simply incrementing a count. My mistake!

Tim Watson

Mar 24, 2012, 10:54:11 PM
to Tim Watson, erlang-questions
On 25 Mar 2012, at 03:26, Tim Watson wrote:
>>> The test was run on Mac Mini server with R15B
>>> Erlang R15B (erts-5.9) [source] [64-bit] [smp:8:8] [async-threads:0] [hipe] [kernel-poll:false]
>>>
>>> Since, you have not shared a details of csv file I used the following setups:
>>> * set.txt contains 300K rows, each row has 16 fields of 3 hex digit each
>>> * set2.txt contains 300K rows, each row has 48 fields of 3 hex digit each
>>>
>>
>> This is a bit different to the input Max is trying to work with isn't it? Can you try with his generated example.csv? I tried doing this and your output seems to be a small list of numbers, so I'm not sure what's going wrong.
>>

The kind of lines in the generated example.csv BTW look like this:

KEY1,20120201,12:15:44.543,34.28,54.74,16.74,88.51,32.48,15.7,54.19,71.52,69.5,55.35,3.9,20.08,33.83,63.43,12.4,9.66,0.29,59.61,52.94,82.49,78.96,70.52,55.73,79.37,61.25,54.19,49.31,14.14,40.18,21.39,82.26,40.79,36.57,86.14,39.58,28.3,20.1,24.07,51.35,8.38,zz

Accumulating the results takes even longer than this, though. I have the line event handler simply aggregate the lines and take the wall-clock time measurements as Max did (just for consistency), and try it with Wrk set to 80 and then larger (to 200), which starts to degrade somewhere between the two:

test(Filename, Wrk) ->
    T1 = erlang:now(),
    {ok, Bin} = file:read_file(Filename),
    R = csv:pparse(Bin, Wrk,
                   fun({line, Line}, Acc) -> [Line|Acc] end, []),
    T2 = erlang:now(),
    io:format("total parse time: ~p~n", [timer:now_diff(T2, T1) div 1000]),
    io:format("size(Results) = ~p~n", [length(R)]).

And here's what I get:

Eshell V5.8.2 (abort with ^G)

3> csv_example:test("example.csv", 80).
total parse time: 15232
size(Results) = 80
ok
4>
5> csv_example:test("example.csv", 200).
total parse time: 17086
size(Results) = 200
ok
6>

So these appear to be taking between 15 and 20 seconds, and of course, you've got to then work your way through the results in the data for each shard. This part has switched me on to the idea of having a stream oriented API - I suspect that will work well, although it's worth bearing in mind that the ordering isn't going to be maintained if you start 'sharding' the data set so as to process it sequentially.

So do we think the additional time is spent because of the comparative width of the line(s) in Max's inputs? I can't imagine that binary:split is doing anything vastly different, except that it uses binary:matches which is implemented as a BIF - I am guessing that a BIF is going to beat hand-coded binary pattern matching every time, is it not? Please also do run this code on some other (faster) machines to see how much of the slowness is specific to my machine.
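
For concreteness, the binary:matches route being compared looks roughly like this (a sketch; binary:matches returns {Offset, Length} pairs over the whole chunk, and binary:part then lifts out the fields):

split_chunk(Bin) ->
    CP = binary:compile_pattern([<<",">>, <<"\n">>]),
    parts(Bin, 0, binary:matches(Bin, CP)).

parts(Bin, From, [{Off, Len} | Rest]) ->
    [binary:part(Bin, From, Off - From) | parts(Bin, Off + Len, Rest)];
parts(Bin, From, []) ->
    [binary:part(Bin, From, byte_size(Bin) - From)].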

Tim Watson

Mar 24, 2012, 10:57:19 PM
to Tim Watson, erlang-questions
On 25 Mar 2012, at 03:54, Tim Watson wrote:
> Please also do run this code on some other (faster) machines to see how much of the slowness is specific to my machine.

Oh and can you please pass on the set and set2 text files so I can test them on my machine to baseline the comparative differences between our environments?

Cheers!

Tim

Thomas Lindgren

Mar 25, 2012, 5:18:53 AM
to Tim Watson, erlang-questions
Not a direct comment, I'll just hook into the general discussion here. 

Here's some older work that might be interesting: http://www.tbray.org/ongoing/When/200x/2007/09/20/Wide-Finder

Best,
Thomas

>________________________________
> From: Tim Watson <watson....@gmail.com>
>To: Tim Watson <watson....@gmail.com>
>Cc: erlang-questions <erlang-q...@erlang.org>
>Sent: Sunday, March 25, 2012 4:57 AM
>Subject: Re: [erlang-questions] Speed of CSV parsing: how to read 1M of lines in 1 second

Dmitry Kolesnikov

Mar 25, 2012, 6:09:57 AM
to Tim Watson, Max Lapshin, erlang-questions
Hello,

1. Yes, I have to admit that I missed the example file from the beginning of the thread. I've put my set generators into the repository. You can use them like this:
sh ./priv/dset line-ex.txt 1000 > ./priv/set-1M-ex.txt
The first parameter is the line template, the second is kilolines to generate. line-16/48 are my original examples; line-ex.txt is from the original example.

2. I am a bit curious why we are getting results of different magnitude, especially with Robert's run. My HW config is pretty close. So, I've built Erlang R15B from source with the following config:
./configure --prefix=/usr/local/otp-R15B --enable-threads --enable-smp-support --enable-kernel-poll --enable-sctp --enable-hipe --disable-dynamic-ssl-lib --enable-darwin-64bit --enable-m64-build --without-javac

3. I've re-run the cases with/without the +native flag and the results are very interesting: we can parse 300K lines in less than a second; less than a second for 1M rows is challenging:

Parse time of data set is 300K:
set-300K-16 653 ms / 2.18 us per line
set-300K-48 1832 ms / 6.11 us per line
set-300K-ex 2561 ms / 8.54 us per line

Parse time of data set is 300K +native:
set-300K-16 277 ms / 0.92 us per line
set-300K-48 672 ms / 2.24 us per line
set-300K-ex 925 ms / 3.09 us per line

Parse time of data set is 1M:
set-300K-16 4406 ms / 2.20 us per line
set-300K-48 6076 ms / 6.08 us per line
set-300K-ex 8670 ms / 8.67 us per line

Parse time of data set is 1M +native:
set-300K-16 1908 ms / 0.95 us per line
set-300K-48 2293 ms / 2.29 us per line
set-300K-ex 3119 ms / 3.12 us per line

4. It might be unfair to comment, but I have a good feeling that Max is evaluating/challenging a type of trading system ;-) Otherwise, why would you have such hard latency requirements, and why does the dataset look like a price snapshot ;-) IMHO, 2 - 4 us parse time per row is acceptable for a "normal" web app.

Regards, Dmitry

Max Lapshin

Mar 25, 2012, 6:35:41 AM
to Dmitry Kolesnikov, erlang-questions
I'm not speaking about latency. I'm speaking about speed.

Steve Davis

Mar 25, 2012, 10:12:14 AM
to erlang-pr...@googlegroups.com, erlang-questions
I know there must be a reason why nobody has suggested:

re:split(Bin, <<",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))">>).

...just not sure what that reason is?

/s

Tim Watson

Mar 25, 2012, 10:52:48 AM
to Dmitry Kolesnikov, erlang-questions
Thanks for posting this detail Dmitry, it's very helpful. I'm going to
recompile with +native and give it a go. I also note that you're not
dealing with nested delimiters (e.g., where a field is enclosed in
single or double "quotation marks" and can therefore contain the
delimiters). If I can get the speed up to what you're seeing, then I
might just send you a couple of pull requests. :)

Also I'm interested to know how you called the csv:parse/3 function
(or pparse instead) for these examples? Are you, for example,
accumulating the results, or are you just counting the occurrences?
Please let me know so I can replicate the same setup and test +native
for myself.

Also would you be kind enough to let us all know what kind of hardware
your mac mini has, especially in terms of CPU and available memory?

All in all, if we can clear up the differences I think parsing 300k in
just under a second without having to resort to a NIF is a very good
result and makes this the fastest csv parsing facility we've seen so
far!!!

Nicolas Dufour

Mar 25, 2012, 12:59:49 PM
to Tim Watson, erlang-questions
Hello everybody,

While we are talking about csv parsers, I've been working on one for which the priority is not raw speed but being able to parse huge files as a stream (if your csv is 60G big, no worry), or csv sent over any wire (http, whatever), using a callback to process each line together with a general state for the whole parse.

You can take a look at https://github.com/refuge/ecsv . It's actually based on a 3-state machine.
The main goal was to be able to parse a stream and dispatch the processing across many processes.
So far it's able to parse double-quoted fields and different delimiters, and is written in 100% Erlang.

The last benchmark I did, though, was showing something like 1700 rows/s (example here http://friendpaste.com/58s9VAVswaczu4Zav49GEq ).
But again, my goal was to withstand big files through a stream.

Feel free to take a look at it.

Thank you,

Nicolas Dufour

On Mar 25, 2012, at 10:52 AM, Tim Watson wrote:

Joe Armstrong

Mar 25, 2012, 1:27:04 PM
to Max Lapshin, Erlang-Questions Questions
I'm joining this thread rather late, ...

Would it be possible to change the problem a bit?

The fastest approach would seem to be
(assume a quad core):

- split file into four
- parse each bit in parallel
- combine the results

Well splitting (the first part) is essentially sequential
(or at least difficult to do in parallel unless you really
understand your hardware - and may be impossible to do in parallel)

I would assume that the input file has been created
by a sequential program - if this were the case
could it not produce four files instead of one?

If this were the case then the split-into-four bit would go away.

Do the bits have to be recombined later? - if not the
last bit can be removed as well.

The key to performance might be to redesign the entire
processing pipeline making it as parallel as possible
as soon as possible and keeping it as parallel as possible as long as possible.

Cheers

/Joe

On Fri, Mar 23, 2012 at 11:30 AM, Max Lapshin <max.l...@gmail.com> wrote:
> I need to load large CSV file into memory very fast.
>
> I've tried to use erlang parser, but my results were very bad (in fact
>  file:read_line is very slow), so I've tried to make a NIF for it.
> Target speed is 1 microsecond per line.
>
>
> My CSV has very strict format: only numbers, no quoting, \n in the
> end. Also I moved parsing of integers and date into NIF.
>
> My results are here: http://github.com/maxlapshin/csv_reader and I get
> only 15 microseconds:  4,5 seconds for 300K lines CSV.
>

> Currently I use fgets to read line by line from file. Maybe it is a
> bad idea and I should use mmap or implement 1MB buffer for read?

Max Lapshin

Mar 25, 2012, 1:39:43 PM
to Steve Davis, erlang-pr...@googlegroups.com, erlang-questions
On Sun, Mar 25, 2012 at 6:12 PM, Steve Davis
<steven.cha...@gmail.com> wrote:
> I know there must be a reason why nobody has suggested:
>
> re:split(Bin, <<",(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))">>).
>
> ...just not sure what that reason is?
>


Because after you get binaries with encoded floats, you need a very
expensive translation from binary to float via a list.

Max Lapshin

Mar 25, 2012, 1:47:19 PM
to Steve Davis, erlang-pr...@googlegroups.com, erlang-questions
So, currently my time is 850 ms for a 300K-line file, which is 2.8 us per line.

https://github.com/maxlapshin/csv_reader

$ ./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok
csv_reader:133 {chunk,210,19650181}
....
csv_reader:142 {loader,<0.36.0>,finish}
NIF: 851


The ideas are the following:

1) Parse the file in 4 threads. Detect where the real line boundaries of
each part are and spawn 4 workers with offsets and limits
2) Don't prepend data left over from the previous block. If some
line is read only partially, then read the next block size(Rest) bytes
earlier
3) Use a special C NIF that splits binaries into lines and parses them
in the same call. Parsing is done via a precompiled pattern.

It was a good idea to throw away non-Erlang IO, because there are no
lags in the Erlang file:read layer. But there are lags even in
binary:split. A NIF that returns a sub-binary up to the first EOL is several
times faster than binary:split(Bin, <<"\n">>).

james

Mar 25, 2012, 2:53:58 PM
to Max Lapshin, Erlang-Questions Questions
I think you should be able to parse something like this at approaching
the speed of memcpy, but you have to try hard. Have you tried writing
something that runs outside erlang and just tries to parse the file into
numbers in memory, say using re2c or ragel?

You might need to try using mmap and a thread that just reads every 16th
byte and forces the file to read in. Or double buffer with readahead.

Is the data in your operating system's VM cache already, or on disk?

James

Tim Watson

Mar 25, 2012, 5:02:21 PM
to Max Lapshin, erlang-pr...@googlegroups.com, Steve Davis, erlang-questions
I think all these ideas have merit.

For (1) I agree with Joe that to take proper advantage of parallelism, you've got to redesign the problem space a little, possibly by removing the need for splitting, recombining and/or re-sequencing. The problem to my mind is that parsing a csv input stream (be it from a file or a socket or whatever) is more naturally a sequential task. If we add parallelism for performance gains, we pay later on when we have to 'work around' the issues that parallelism introduces (such as out of order data). Having said that, maybe your problem space isn't specifically about parsing a large range of csv input files.

BTW my macbook pro appears to have appalling performance characteristics when testing this, taking +/- 300ms longer to do the work:

t4@malachi:csv_reader $ make bench
./rebar compile
==> csv_reader (compile)


./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok

NIF: 1202

Note that I also had to remove the io:format/2 debugging from the code, as this caused the code to take much longer:

t4@malachi:csv_reader $ make bench
./rebar compile
==> csv_reader (compile)


./csv_bench.erl example.csv
./csv_bench.erl:66: Warning: variable 'Evt' is unused
Load csv_reader: ok

csv_reader:133 {chunk,210,19650023}
....
csv_reader:142 {loader,<0.35.0>,finish}
NIF: 1828

I would *love* to know why all this io:format takes so much longer on my machine than yours, as I was under the impression that Erlang/OTP behaved nicely on darwin (apart from not doing proper kernel poll as kqueue seems pretty broken on OS-X) and this is making me wonder.

Max Lapshin

Mar 25, 2012, 5:05:04 PM
to Tim Watson, Steve Davis, erlang-pr...@googlegroups.com, erlang-questions
On Mon, Mar 26, 2012 at 1:02 AM, Tim Watson <watson....@gmail.com> wrote:

>
> BTW my macbook pro appears to have appalling performance characteristics when testing this, taking +/- 300ms longer to do the work:
>

I changed the HDD to an SSD and threw away the cdrom. I don't know which
affects speed more =)

I really want to load CSV files very fast, because I need to load
several files and merge them with sorting.

Dmitry Kolesnikov

Mar 25, 2012, 5:19:06 PM
to Tim Watson, erlang-questions
Hello Tim,

1. Yes, you are right about nested/escaped delimiters. They have to be fixed.

2. Reference platform:
* MacMini, Lion Server,
* 1x Intel Core i7 (2 GHz), 4x cores
* L2 Cache 256KB per core
* L3 Cache 6MB
* Memory 4GB 1333 MHZ DDR3
* Disk 750GB 7200rpm WDC WD7500BTKT-40MD3T0
* erlang R15B + native build of the library

3. I've tried to improve the documentation as part of README & csv.erl file. Please find them in the project: https://github.com/fogfish/csv

Then, I have put extra example in csv_example.erl to show how to perform intake procedure to ets table.

- Dmitry

Max Lapshin

Mar 25, 2012, 5:26:43 PM
to Dmitry Kolesnikov, erlang-questions
>
> 3. I've tried to improve the documentation as part of README & csv.erl file. Please find them in the project: https://github.com/fogfish/csv
>

Can you commit some script to compare my benchmarks against your numbers?

james

Mar 25, 2012, 6:48:27 PM
to Tomasz Maciejewski, Erlang-Questions Questions
> mmap is the fastest way to read lines is you don't much care about
> portability.

While I think mmap is useful (and might see it as a way to avoid a split
being sequential, since you only really need to divide it roughly and
can probe for EoL), I think it's worth qualifying your statement.

mmap is NOT necessarily the fastest way to read lines.

The issue is whether the operating system will perform read-ahead in its
disk system, and how many times you fault and wait for a real disk IO if
not, so it's rather important to know whether the file is actually in the
operating system's VM cache already or is actually on a disk drive.

As a first cut it would be handy to know how fast the OP's hardware can
do atof, compared with the number of numbers in the file.

Fyodor

Mar 25, 2012, 9:50:10 PM
to Max Lapshin, Steve Davis, erlang-pr...@googlegroups.com, erlang-questions
Greetings,
I looked at the code: you basically take the file, break it into 4
chunks, one per process, and let each process do its chunk. Wouldn't
this cause lots of IO?

i.e. wouldn't something like "a single process reads the file sequentially
until EOL, spawns a process, repeat" be faster?


--
http://www.o0o.nu

Michael Turner

Mar 25, 2012, 11:00:14 PM
to Tim Watson, erlang-questions
> The kind of lines in the generated example.csv BTW look like this:
>
> KEY1,20120201,12:15:44.543,34.28,54.74,16.74,88.51,32.48,15.7,54.19,71.52,69.5,55.35,3.9,20.08,33.83,63.43,12.4,9.66,0.29,59.61,52.94,82.49,78.96,70.52,55.73,79.37,61.25,54.19,49.31,14.14,40.18,21.39,82.26,40.79,36.57,86.14,39.58,28.3,20.1,24.07,51.35,8.38,zz

Which made me wonder: is Max actually trying to parse a CSV file of
*floating point* numbers, or just numbers that *happen* to be
expressed in floating point notation but that can be far more narrowly
characterized? If you have 1 million lines of positive 4-digit fixed
point numbers with only two significant digits past the decimal point
(as the above would suggest), the numbers will repeat (on average)
about 100 times in the file. So you could at least save yourself a
factor of 100 in the text-to-float conversion part of the problem.
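
One way to cash in on that repetition would be to memoise the text-to-float conversion so that each distinct textual number is converted only once (a sketch using the process dictionary; an ets table would work across processes, and list_to_float is used because binary_to_float/1 did not exist in R15):

to_float(Bin) ->
    case get({float_cache, Bin}) of
        undefined ->
            F = list_to_float(binary_to_list(Bin)),  %% the expensive conversion
            put({float_cache, Bin}, F),
            F;
        F ->
            F
    end.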

This *general* problem of parsing CSV is important too, of course. But
the *general* solution could also (FTW) admit of such approaches in
its API.

-michael turner

Tim Watson

Mar 26, 2012, 2:50:57 AM
to Dmitry Kolesnikov, erlang-questions
Interesting. Like Max said, would you update or post the script you're
using to test Max's inputs please, as I can't reproduce on a
comparable system. I've tried Macbook Pro (Intel Core 2 Duo, 3.6GHz,
8Gb Ram, twice the L2 cache) and also on an HP with Intel Core i7 8Gb
fast RAM and a similar disk profile. I think the difference between our
results has to do with how the output is captured, and that's what I'd
like to see (and be able to reproduce). Is this where you're
utilising an ets table? That sounds like a sensible approach.

Cheers,
Tim

Tim Watson

Mar 26, 2012, 2:59:52 AM
to Michael Turner, erlang-questions
On 26 March 2012 04:00, Michael Turner <michael.eu...@gmail.com> wrote:
>> The kind of lines in the generated example.csv BTW look like this:
>>
>> KEY1,20120201,12:15:44.543,34.28,54.74,16.74,88.51,32.48,15.7,54.19,71.52,69.5,55.35,3.9,20.08,33.83,63.43,12.4,9.66,0.29,59.61,52.94,82.49,78.96,70.52,55.73,79.37,61.25,54.19,49.31,14.14,40.18,21.39,82.26,40.79,36.57,86.14,39.58,28.3,20.1,24.07,51.35,8.38,zz
>
> Which made me wonder: is Max actually trying to parse a CSV file of
> *floating point* numbers, or just numbers that *happen* to be
> expressed in floating point notation but that can be far more narrowly
> characterized? If you have 1 million lines of positive 4-digit fixed
> point numbers with only two significant digits past the decimal point
> (as the above would suggest), the numbers will repeat (on average)
> about 100 times in the file. So you could at least save yourself a
> factor of 100 in the text-to-float conversion part of the problem.
>
> This *general* problem of parsing CSV is important too, of course. But
> the *general* solution could also (FTW) admit of such approaches in
> its API.
>

Hi Michael. Yes, I agree with this, and in my own experiments (and in
Dmitry's library) there is no data type conversion going on. Being
able to parse a 300k-line CSV in around a second is a pretty good
achievement (once it's stable), and if turning off any automatic
to_int/to_float conversion is necessary to save a few seconds of
processing time, then I think most users will be happy with that.

As you say, Max seems to have some more specialised requirements of
the system in his 1 million rows in 1 second (on commodity hardware?)
goal.

> -michael turner

Angel J. Alvarez Miguel

Mar 26, 2012, 3:44:58 AM
to erlang-q...@erlang.org
On Sunday, 25 March 2012 19:27:04, Joe Armstrong wrote:
> I'm joining this thread rather late, ...
>
> Would it be possible to change the problem a bit?
>
> The fastest approach would seem to be
> (assume a quad core):
>
> - spilt file into four
> - parse each bit in parallel
> - combine the results
>
> Well splitting (the first part) is essentially sequential
> (or at least difficult to do in parallel unless you really
> understand your hardware - and may be impossible to do in parallel)

Maybe opening the file 4 times (with appropriate fseeks on the second,
third, etc.)?

You can read data from every descriptor, so you get your file split (taking
care not to overlap ranges during reads).
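
A minimal sketch of that scheme (newline alignment of the chunk boundaries and the remainder bytes of the last quarter are elided, and process_chunk/1 is a placeholder):

read_range(Path, Offset, Len) ->
    {ok, Fd} = file:open(Path, [read, raw, binary]),
    {ok, Offset} = file:position(Fd, Offset),  %% each worker has its own descriptor
    {ok, Chunk} = file:read(Fd, Len),
    ok = file:close(Fd),
    Chunk.

split_four(Path) ->
    Sz = filelib:file_size(Path),
    Q = Sz div 4,
    [spawn(fun() -> process_chunk(read_range(Path, N * Q, Q)) end)
     || N <- [0, 1, 2, 3]].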

--
-
>>----------------------------------------------------------------------------

Angel J. Alvarez Miguel, Servicios Informáticos
Edificio Torre de Control, Campus Externo UAH
Alcalá de Henares 28806, Madrid ** ESPAÑA **

-------------[taH pagh taHbe', DaH mu'tlheghvam vİqelnİS]--<<-

Max Lapshin

Mar 26, 2012, 4:23:10 AM
to cl...@uah.es, erlang-q...@erlang.org
Colleagues: there is absolutely no problem with disk IO. wc -l and
Erlang file:read both take about 100-200 ms to read the whole file.

The problem is in handling and parsing this data.

Max Lapshin

Mar 26, 2012, 4:37:17 AM
to Robert Melton, erlang-q...@erlang.org
On Mon, Mar 26, 2012 at 12:33 PM, Robert Melton <rme...@gmail.com> wrote:
>
> Agreed.  Do we have any baseline implementation in pure C or (insert
> fastest language/implementation you are aware of)?  I am working on
> speeding this up (and having a lot of fun!), but I have no idea the
> theory-craft maximum process speed (with proper escaping, etc) on my
> hardware.
>

I really can't understand why parsing should be slower than reading from HDD =)

However, it is slower. Currently I have 950 ms for a 300K-line CSV with
40 float columns when read on a cold system, and 820 ms when read from
the disk cache.

Copying from the kernel cache and byte-by-byte reading of all the data while
searching for '\n' takes 100 ms (that is the time of wc -l), so it takes about
700 ms for Erlang to parse + create all the proper objects.

Benoit Chesneau

Mar 26, 2012, 4:58:50 AM
to erlang-questions
On Sun, Mar 25, 2012 at 7:47 PM, Max Lapshin <max.l...@gmail.com> wrote:
> So, currently my time is 850 ms for 300K line file which is 2.8 us per line.
>
> https://github.com/maxlapshin/csv_reader
>
> $ ./csv_bench.erl example.csv

can you provide the example.csv file somewhere ?

Benoit Chesneau

Mar 26, 2012, 5:02:25 AM
to erlang-questions
On Mon, Mar 26, 2012 at 10:58 AM, Benoit Chesneau <bche...@gmail.com> wrote:
> On Sun, Mar 25, 2012 at 7:47 PM, Max Lapshin <max.l...@gmail.com> wrote:
>> So, currently my time is 850 ms for 300K line file which is 2.8 us per line.
>>
>> https://github.com/maxlapshin/csv_reader
>>
>> $ ./csv_bench.erl example.csv
>
> can you provide the example.csv file somewhere ?

nm...

$ ruby fill.rb > example.csv

did the trick.

Michael Turner

Mar 26, 2012, 6:36:28 AM
to Max Lapshin, erlang-q...@erlang.org
> I really can't understand why should parsing be slower than reading from HDD =)

Are you converting the ASCII-coded floating point numbers to actual
floating point? That's actually quite a lot more overhead per
character than ... well, anything else I can think of in processing a
CSV file.

And what do these numbers look like? Do they repeat? Are they short?
Or are they high-precision and varying wildly in order of magnitude,
and widely distributed statistically?

-michael turner

Max Lapshin

Mar 26, 2012, 7:40:25 AM
to Michael Turner, erlang-q...@erlang.org


>
> And what do these numbers look like? Do they repeat? Are they short?

Right, as in the example csv. It is trading data.


> Or are they high-precision and varying wildly in order of magnitude,
> and widely distributed statistically?


They are very close to each other and vary by no more than a few percent. Do you think it is a good place for optimization?


In fact I have achieved good enough results: less than a second, and thanks to the whole community for it.

Tim Watson

Mar 26, 2012, 8:16:28 AM
to Dmitry Kolesnikov, erlang-questions
Hi Dmitry,

On 25 March 2012 11:09, Dmitry Kolesnikov <dmkole...@gmail.com> wrote:

> Hello,
>
> 1. Yes, I'd like to admit that I've missed the example file from the beginning of thread. I've put my sets generators into the repository. You can use them like this:
> sh ./priv/dset line-ex.txt 1000 > ./priv/set-1M-ex.txt
> The first parameter is line, second is kilolines to generate. line-16/48 are my original examples, line-ex.txt is from original example.
>

What is line-ex.txt supposed to contain - is it a single line from the
original example? - I can't find it in the repo.

> 2. I am a bit of curious why we are getting results of different magnitude, especially with Robert's run. My HW config is pretty close. So, I've build Erlang R15B from sources with following config:
> ./configure --prefix=/usr/local/otp-R15B --enable-threads --enable-smp-support --enable-kernel-poll --enable-sctp --enable-hipe --disable-dynamic-ssl-lib --enable-darwin-64bit --enable-m64-build --without-javac

I'll rebuild that way, but essentially my build config looked
something like this (from config.log):

configure:4674: running /bin/sh
'/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B/lib/configure'
--prefix=/usr/local '--cache-file=/dev/null' '--enable-hipe'
'--enable-darwin-64bit' '--enable-threads' '--enable-smp'
'--enable-kernel-poll'
'ERL_TOP=/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B'
--cache-file=/dev/null
--srcdir=/Users/t4/.kerl/builds/r15b-64-hipe-smp/otp_src_R15B/lib

>
> 3. I've re-run the cases again with/without the +native flag; the results are very interesting. We can parse 300K lines in less than a second, but less than a second for 1M rows is challenging:
>
> Parse time of data set is 300K:
> set-300K-16     653 ms / 2.18 us per line
> set-300K-48   1832 ms / 6.11 us per line
> set-300K-ex   2561 ms / 8.54 us per line
>
> Parse time of data set is 300K +native:
> set-300K-16     277 ms / 0.92 us per line
> set-300K-48     672 ms / 2.24 us per line
> set-300K-ex     925 ms / 3.09 us per line
>

I can't replicate this even with +native turned on (for both the csv
modules and the csv_example), and I'm bemused, as you don't have 'that
much more' horsepower in your 4 cores than my mbpro's 2. Here's what I
get for parsing the original file - as you would expect, there is very
little difference when setting the segments to any number higher than
the number of cpus:

5> csv_example:import("example.csv", 40).
size (MB): 74.958283
read (ms): 109.798000
parse (ms): 3239.264000
266257
6> csv_example:import("example.csv", 80).
size (MB): 74.958283
read (ms): 105.949000
parse (ms): 3231.541000
270352
7> csv_example:import("example.csv", 4).
size (MB): 74.958283
read (ms): 102.250000
parse (ms): 3275.317000
274450
8> csv_example:parse("example.csv", 4).
lines: 300001
size (MB): 74.958283
read (ms): 102.768000
parse (ms): 2737.098000
per line (us): 9.123630
ok
9> csv_example:parse("example.csv", 44).
lines: 300001
size (MB): 74.958283
read (ms): 106.153000
parse (ms): 2775.781000
per line (us): 9.252572
ok
10> csv_example:parse("example.csv", 2).
lines: 300001
size (MB): 74.958283
read (ms): 104.410000
parse (ms): 2758.367000
per line (us): 9.194526
ok
11> csv_example:import("example.csv", 2).
size (MB): 74.958283
read (ms): 108.705000
parse (ms): 3390.453000
278547

How did you build Max's example file? I'm really struggling to
understand how I've got 2 seconds more processing time for such a
similar setup.

> Parse time of data set is 1M:
> set-300K-16   4406 ms / 2.20 us per line
> set-300K-48   6076 ms / 6.08 us per line
> set-300K-ex   8670 ms / 8.67 us per line
>
> Parse time of data set is 1M +native:
> set-300K-16    1908 ms / 0.95 us per line
> set-300K-48    2293 ms / 2.29 us per line
> set-300K-ex    3119 ms / 3.12 us per line
>
> 4. It might be unfair to comment but I have a good feeling that Max is evaluating/challenging a type of trade system ;-) Otherwise, why you have such hard latency requirements and dataset looks like a price snapshot ;-) IMHO, 2 - 4 us parse time per row is acceptable for "normal" web app.

I agree your numbers seem perfectly reasonable, but I'm not able to
reproduce them. I'm going to try on a 2 core linux machine later on
and see how I get on.

Tim Watson

unread,
Mar 26, 2012, 4:05:45 PM3/26/12
to Max Lapshin, erlang-q...@erlang.org
Max, have you seen
http://blogtrader.net/blog/tim_bray_s_erlang_exercise2? It states
"0.93 sec on 1 million lines file on my 4-core linux box" which sounds
pretty impressive and is based on pure Erlang (with some ets thrown
into the mix by the looks of things). Might be worth looking at
whether this can potentially out-perform the NIF!
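
The core trick there, as far as I can tell, is chunked parallel
parsing - roughly like this sketch (not the blog's actual code;
parse_lines/1 stands in for whatever per-chunk parser you already
have, and the file is assumed to end with a newline):

%% Cut the binary at newline boundaries and parse chunks in parallel.
parallel_parse(Bin, N) ->
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), parse_lines(Chunk)} end)
            || Chunk <- split_chunks(Bin, byte_size(Bin) div N)],
    [receive {Pid, Res} -> Res end || Pid <- Pids].

split_chunks(Bin, Sz) when byte_size(Bin) =< Sz ->
    [Bin];
split_chunks(Bin, Sz) ->
    Cut = next_nl(Bin, Sz),          %% never split a line in two
    <<Chunk:Cut/binary, $\n, Rest/binary>> = Bin,
    [Chunk | split_chunks(Rest, Sz)].

next_nl(Bin, Pos) ->
    case binary:at(Bin, Pos) of
        $\n -> Pos;
        _   -> next_nl(Bin, Pos + 1)
    end.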

Dmitry Kolesnikov

unread,
Mar 26, 2012, 4:13:44 PM3/26/12
to Tim Watson, erlang-questions
Hello Tim, Max, et al,

Let's start from the beginning... it looks like we got a bit mixed up.

1. I've cleaned up the repository. The folder ./priv contains a perl script gen_set.pl that produces data sets in Max's original format:

As an example:
key299991,20120326,21:06:31.543,24.16,92.39,22.68,1.71,43.50,53.29,90.53,10.05,91.01,80.66,23.09,18.55,41.38,98.90,61.31,40.44,14.26,42.23,78.22,54.78,18.86,11.72,97.45,47.39,61.50,39.98,73.25,51.65,9.84,42.33,18.84,23.52,60.11,94.82,55.87,86.04,25.12,20.40,32.69,4.51,zz

2. For historical reasons I run tests against three data sets; they differ in the number of fields (float numbers). They are:
* 300K lines, each line contains 8 fields, ~23MB
* 300K lines, each line contains 24 fields, ~50MB
* 300K lines, each line contains 40 fields, ~77MB <- Max's original file

You can generate those data sets via
perl priv/gen_set.pl 300 8 > priv/set-300K-8.txt
perl priv/gen_set.pl 300 24 > priv/set-300K-24.txt
perl priv/gen_set.pl 300 40 > priv/set-300K-40.txt
or via make example if you are a fan of GNU build tooling like I am.

3. I have made a number of tests / examples to validate performance. Essentially we are talking about ETL processes here:
* Extract data from the csv file
* Transform csv lines into some other format
* Load data into some storage for processing

The following test cases were implemented:
* extract data from CSV: this just parses the file and does nothing with the data. The objective is to validate the speed of the parser as such
* extract/transform: parses the CSV file and calculates a rolling hash against the data
* extract/transform: parses the CSV file and converts each line into a tuple. The tuple is not stored anywhere. The objective is to validate the speed of parse + transform
* extract/transform/load: parses the CSV file, converts each line to a tuple and stores it in some in-memory storage.

You can find those cases in priv/csv_example.erl and run them via csv_example:run(80).

4. The results I got are the following (they are also available in the README):

E/Parse        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds  23.41       91.722      350.000       1.16
300K, 24 flds  50.42       489.303     697.739       2.33
300K, 40 flds  77.43       780.296     946.003       3.15

ET/hash        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds  23.41       91.722      384.598       1.28
300K, 24 flds  50.42       489.303     761.414       2.54
300K, 40 flds  77.43       780.296     1047.329      3.49

ET/tuple       Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds  23.41       91.722      228.306       0.76
300K, 24 flds  50.42       489.303     601.025       2.00
300K, 40 flds  77.43       780.296     984.676       3.28

ETL/ets        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds  23.41       91.722      1489.543      4.50
300K, 24 flds  50.42       489.303     2249.689      7.50
300K, 40 flds  77.43       780.296     2519.401      8.39

ETL/pts        Size (MB)   Read (ms)   Handle (ms)   Per Line (us)
-------------------------------------------------------------------
300K,  8 flds  23.41       91.722      592.886       1.98
300K, 24 flds  50.42       489.303     1190.745      3.97
300K, 40 flds  77.43       780.296     1734.898      5.78


The biggest frustration came from the ets table: it was too slow to load data into ets. Now I believe that your results are slow because you are doing the ets load... In my final test I swapped ets for a lightweight pts (process term storage), a process that holds data addressable by key. You can see that Max's original file is parsed / transformed / loaded into in-memory process-based storage pretty fast, just 5.78 us per line. My original file is even faster, 1.98 us per line.
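
For the curious, the essence of the pts idea is just this (a minimal
sketch with an assumed API - the real module in the repo does more):

pts_start() ->
    spawn(fun() -> pts_loop(dict:new()) end).

pts_loop(Dict) ->
    receive
        {put, Key, Val} ->
            pts_loop(dict:store(Key, Val, Dict));
        {get, From, Key} ->
            From ! {pts, Key, dict:find(Key, Dict)},
            pts_loop(Dict)
    end.

No locks and no copying into an ets table; a plain process owning a
dict, addressed by message.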

5. If you wish to replicate the results on your HW, then compile the csv library, generate the data sets and compile csv_example. Please let me know if you run into trouble at any of those phases, and keep in mind that the +native flag was used...


Regards, Dmitry

Dmitry Kolesnikov

unread,
Mar 26, 2012, 4:25:13 PM3/26/12
to Tim Watson, erlang-q...@erlang.org
Oh, my bad... I had completely forgotten about these aspects of the VM.
I boosted performance so that Max's original file is now parsed at 2 us per line vs 3.15 us, and a full ETL cycle (see my previous mail) takes just 7.8 us per line vs 8.39 us.

And a very good hint on Boyer-Moore searching...

- dmitry

Max Lapshin

unread,
Mar 26, 2012, 4:27:23 PM3/26/12
to Dmitry Kolesnikov, erlang-questions
Dmitry, I tried to compile your library, but it has so many external
dependencies that I failed.
Sorry.

Tim Watson

unread,
Mar 26, 2012, 4:47:07 PM3/26/12
to Max Lapshin, erlang-questions
Max, just ignore the makefile and do this instead:

$ echo '{erl_opts, [native]}.' >> rebar.config
$ rebar clean compile

And then you'll need to compile the csv_example module, which is in
priv - I did this lazily in the shell with `make:all([native]).'
before running the test functions.

Tim Watson

unread,
Mar 26, 2012, 4:48:02 PM3/26/12
to Dmitry Kolesnikov, erlang-questions
Thanks for posting back Dmitry. I'll go through this exercise tomorrow
morning and let you know how I get on. For now, I need a little sleep.
:)

Toby Thain

unread,
Mar 26, 2012, 7:36:10 PM3/26/12
to erlang-q...@erlang.org
On 26/03/12 6:36 AM, Michael Turner wrote:
>> I really can't understand why parsing should be slower than reading from HDD =)
>
> Are you converting the ASCII-coded floating point numbers to actual
> floating point? That's actually quite a lot more overhead per
> character than ... well, anything else I can think of in processing a
> CSV file.
>
> And what do these numbers look like? Do they repeat? Are they short?
> Or are they high-precision and varying wildly in order of magnitude,
> and widely distributed statistically?

I see where you're heading ... by the look of those 4-digit numbers
the FP conversion could be done by a lookup in a 10,000-element array -
assuming this is cheaper than the straightforward conversion.
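
In Erlang terms the lookup would be something like this sketch (it
assumes the fixed NN.NN shape; the real data varies in width, so you
would need one table per width, and in the NIF it would be a plain C
array instead):

make_tab() ->
    list_to_tuple([X / 100 || X <- lists:seq(0, 9999)]).

float_lookup(<<A, B, $., C, D>>, Tab) ->
    element((A - $0) * 1000 + (B - $0) * 100
          + (C - $0) * 10 + (D - $0) + 1, Tab).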

--Toby

Max Lapshin

unread,
Mar 27, 2012, 5:44:46 AM3/27/12
to Toby Thain, erlang-q...@erlang.org
I'm experimenting more and found a very strange thing:

sscanf((char *)start, "%4d%02d%02d", &year, &month, &day);

this adds 3 microseconds per line. Why?? I've changed it to a macro:
#define D2(s) (((s)[0] - '0')*10 + ((s)[1] - '0'))
and this macro runs in a few nanoseconds.

Is sscanf really so slow??
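
(FWIW, the same fixed-width trick is cheap on the Erlang side too - a
sketch, for comparison:

parse_date(<<Y1, Y2, Y3, Y4, M1, M2, D1, D2>>) ->
    {(Y1 - $0) * 1000 + (Y2 - $0) * 100 + (Y3 - $0) * 10 + (Y4 - $0),
     (M1 - $0) * 10 + (M2 - $0),
     (D1 - $0) * 10 + (D2 - $0)}.

No scanning and no format-string interpretation, just byte arithmetic.)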


And I'll also really have to cache timestamps somehow, because timegm
takes more than 200 microseconds per line. It is impossible to call it
on each line.
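
The caching idea, sketched in Erlang for clarity (in the NIF it is the
same thing with a static variable): convert each distinct date once,
then add the time-of-day offset per line.

%% 62167219200 = gregorian seconds at 1970-01-01 00:00:00
to_unix(Date, H, Min, S) ->
    Midnight =
        case get({midnight, Date}) of
            undefined ->
                T = calendar:datetime_to_gregorian_seconds({Date, {0,0,0}})
                    - 62167219200,
                put({midnight, Date}, T),
                T;
            T0 ->
                T0
        end,
    Midnight + H * 3600 + Min * 60 + S.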

Max Lapshin

unread,
Mar 27, 2012, 6:08:58 AM3/27/12
to Toby Thain, erlang-q...@erlang.org
Yes. I have to use hand-made macros for fixed-width string-to-integer
conversion, and static caching to avoid useless calls to timegm.