Yes, I know this code is not yet optimal (I'm still learning :), but
it raises a few questions about the VM that I'd like to understand.
1) It runs fine with a small subset, but once I load the 930k-line
file, the VM sucks up a lot of RAM/virtual memory: first a burst
of about 2G (I have a 4G MacBook Pro), and then, once the call has
returned in the erl shell, the VM goes ballistic and consumes >7G of
virtual memory ;(
Q1: Why does the VM exhibit this behaviour? Is the garbage collector going bad/mad??
2) I will push the data into an ETS table of sorts, as I'll try to find
duplicate files, but I was thinking of an initial pull into a list, and
then from there do the tests etc. The idea might be to pull in one
disk, and then compare it to another removable disk's files.
Q2: Should I rather do this straight into an ETS/DETS?
Q3: Should I preferably start to consider DETS 'cause of the size??
Q4: Will Mnesia help in this case?
%%--------------------------------------------------------------------
%% Function: process_line/1
%% Description: takes a properly formatted line, parses it, and
%% returns the tuple {Type,File,Hash}
%% Line: "MD5 (/.file) = d41d8cd98f00b204e9800998ecf8427e"
%% Note: some might be SHA1 in future.
%%--------------------------------------------------------------------
process_line(Line) ->
    {match,[Type,File,Hash]}=
        re:run(Line,
               "\(.*\)[ ][\\(]\(.*\)[\\)][ ][=][ ]\([0-9a-f]*\)\n",
               [{capture,all_but_first,list}]),
    {Type,File,Hash}.
%%--------------------------------------------------------------------
%% Function: read_lines/1
%% Description: reads in all the lines from a "properly formatted"
%% md5 output on MacOSX, returning a list of the tuples.
%%--------------------------------------------------------------------
read_lines(IOfd) ->
    case file:read_line(IOfd) of
        {ok,Line} ->
            [process_line(Line)|read_lines(IOfd)];
        eof ->
            []
    end.
________________________________________________________________
erlang-questions mailing list. See http://www.erlang.org/faq.html
erlang-questions (at) erlang.org
Since it consumes >7G, am I right in guessing that you're
running 64-bit Erlang?
If so (and in any case), you should really use 'binary' instead
of 'list' in the regular expression option list. Using list
representation of the string data, each byte will consume two
heap words of memory - 8 bytes in 32-bit Erlang and 16 bytes in
64-bit.
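
A minimal sketch of that change, assuming the md5 output format shown in
the original post. The module name md5_parse is made up for illustration,
and the pattern is a simplified rewrite of the original one rather than a
verbatim copy; the only change that matters here is 'binary' in the
capture option.

```erlang
-module(md5_parse).
-export([process_line/1]).

%% Parse one line such as
%%   "MD5 (/.file) = d41d8cd98f00b204e9800998ecf8427e\n"
%% and return {Type, File, Hash} with every field as a binary.
%% Crashes with badmatch if the line does not fit the pattern,
%% just like the original version.
process_line(Line) ->
    {match, [Type, File, Hash]} =
        re:run(Line,
               "^(.*) \\((.*)\\) = ([0-9a-f]*)\\n?$",
               [{capture, all_but_first, binary}]),
    {Type, File, Hash}.
```

Each captured field then comes back as a binary (roughly one byte per
character plus a small header) rather than as a list costing two heap
words per character.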
Regarding the GC, consider what it has to work with. You are
building a very large data structure in a tight loop. The
process will continuously run out of heap, triggering the GC.
The GC will copy live data (which is going to be most of it)
to another copy of the heap. If that's not enough, it will
run a fullsweep, also looking at data that survived the
previous GC (no garbage there, since the list just keeps
growing). This creates yet another heap copy.
Finally, it does a resize of the heap, if necessary.
It is possible to pre-size the heap using spawn_opt() and the
min_heap_size option. Given that you have a very large data
structure, this may still turn out problematic.
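
For illustration, pre-sizing could look roughly like the sketch below.
The module name presized and the function run/2 are assumptions, not
anything from the original code; note that min_heap_size is given in
words, not bytes.

```erlang
-module(presized).
-export([run/2]).

%% Run Fun in a fresh, monitored process whose heap starts out at
%% MinHeapWords words, and hand Fun's result back to the caller.
%% Caveat: the result is sent as a message and therefore copied to the
%% caller's heap, so for a very large result this only moves the problem.
run(Fun, MinHeapWords) ->
    Parent = self(),
    Ref = make_ref(),
    {Pid, MRef} = spawn_opt(fun() -> Parent ! {Ref, Fun()} end,
                            [monitor, {min_heap_size, MinHeapWords}]),
    receive
        {Ref, Result} ->
            %% Drop the monitor and any 'DOWN' message already queued.
            erlang:demonitor(MRef, [flush]),
            {ok, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            {error, Reason}
    end.
```

A call such as presized:run(fun() -> lists:sum(lists:seq(1, 100)) end, 100000)
would return {ok, 5050}; the heap size used here is arbitrary.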
You should definitely try putting the data in ETS instead of
accumulating it on the heap.
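
A rough sketch of what that could look like, under a few assumptions:
the module name md5_ets, the table layout {Hash, File, Type} and the
helper duplicates/2 are invented for this example, and the line pattern
is a simplified rewrite of the one in the original post.

```erlang
-module(md5_ets).
-export([load/1, duplicates/2]).

%% Load every line of an md5 listing into a new ETS table keyed on the
%% hash. A 'bag' table keeps all files that share a hash under one key.
load(Path) ->
    Tab = ets:new(md5_files, [bag, public]),
    {ok, Fd} = file:open(Path, [read, raw, binary, read_ahead]),
    try
        load_lines(Fd, Tab)
    after
        file:close(Fd)
    end,
    Tab.

%% Tail-recursive loop: nothing accumulates on the process heap,
%% each parsed line goes straight into the table.
load_lines(Fd, Tab) ->
    case file:read_line(Fd) of
        {ok, Line} ->
            {Type, File, Hash} = process_line(Line),
            true = ets:insert(Tab, {Hash, File, Type}),
            load_lines(Fd, Tab);
        eof ->
            ok
    end.

%% Same parsing idea as in the original post, capturing binaries.
process_line(Line) ->
    {match, [Type, File, Hash]} =
        re:run(Line,
               "^(.*) \\((.*)\\) = ([0-9a-f]*)\\n?$",
               [{capture, all_but_first, binary}]),
    {Type, File, Hash}.

%% All files recorded under Hash; more than one entry means duplicates.
duplicates(Tab, Hash) ->
    [File || {_Hash, File, _Type} <- ets:lookup(Tab, Hash)].
```

Keying on the hash is a design choice made for the duplicate search:
ets:lookup/2 on a hash then returns every file with that checksum.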
BR,
Ulf W
--
Ulf Wiger
CTO, Erlang Training & Consulting Ltd
http://www.erlang-consulting.com
    {ok,Line} ->
        [process_line(Line)|read_lines(IOfd)];
...would maybe not be "tail recursive", so a stack frame is kept for
every line read, hence chewing up 7G or more? What happens if you change
the read_lines function to use an accumulator for the result? E.g.:
read_lines(IOfd) ->
    read_lines(IOfd, []).
read_lines(IOfd, Acc) ->
    case file:read_line(IOfd) of
        {ok, Line} ->
            read_lines(IOfd, [process_line(Line) | Acc]);
        eof ->
            lists:reverse(Acc)
    end.
MacOSX 64bit erlang yes... perhaps I need to recompile to 32bit while
testing/playing with this one...
> but I
> think that...
>
> {ok,Line} ->
> [process_line(Line)|read_lines(IOfd)];
>
> ...would maybe not be "tail recursive" so the entire stack is being
> run hence chewing up 7G or more? What happens if you change the
> read_lines function to use an accumulator for the result? e.g...
Tried this; same type of trouble in the erl shell.
The symptoms:
The call
    List=read_lines(FD).
executes, doesn't *appear* to be using lots of VM space, and
outputs a partial result (being the shell, it only prints part of the
list, with the [...]...] stuff).
Then it *hangs* at that, not showing the prompt (this is inside
Aquamacs's erlang shell mode). This is then where it appears the
system goes "west", as the memory utilization (as measured/shown by
the MacOSX Activity Monitor) starts to grow and grow. With the last
test (using the Accumulator as below) it grew to 7G at which point I
killed the beam.smp process.
This appears to be an improvement!
> Then it *hangs* at that, not showing the prompt (this is inside
> Aquamacs's erlang shell mode). This is then where it appears the
> system goes "west", as the memory utilization (as measured/shown by
> the MacOSX Activity Monitor) starts to grow and grow. with the last
> test (using the Accumulator as below) it grew to 7G at which point I
> killed the beam.smp process.
Next I'd suspect your regex is causing issues. Try some logging to
figure out which lines are causing trouble with the regular expression
you defined (and perhaps incorporate Ulf's suggestion to use binaries).
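
One way to do that logging, as a sketch only: the module name md5_check,
the {ok, ...} / {error, ...} return shape and the simplified pattern are
assumptions for illustration, not the original code.

```erlang
-module(md5_check).
-export([process_line/1, read_lines/1]).

%% Return {ok, {Type, File, Hash}} for a well-formed line and
%% {error, {nomatch, Line}} otherwise, instead of crashing on badmatch.
process_line(Line) ->
    case re:run(Line,
                "^(.*) \\((.*)\\) = ([0-9a-f]*)\\n?$",
                [{capture, all_but_first, binary}]) of
        {match, [Type, File, Hash]} ->
            {ok, {Type, File, Hash}};
        nomatch ->
            {error, {nomatch, Line}}
    end.

%% Read all lines, logging (and skipping) the ones that do not match.
read_lines(Fd) ->
    read_lines(Fd, 1, []).

read_lines(Fd, N, Acc) ->
    case file:read_line(Fd) of
        {ok, Line} ->
            case process_line(Line) of
                {ok, Parsed} ->
                    read_lines(Fd, N + 1, [Parsed | Acc]);
                {error, {nomatch, Bad}} ->
                    %% Report the line number and the offending text.
                    io:format(standard_error,
                              "line ~p does not match: ~p~n", [N, Bad]),
                    read_lines(Fd, N + 1, Acc)
            end;
        eof ->
            lists:reverse(Acc)
    end.
```

Running this over the full file should show whether any particular
lines trip up the pattern.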
If you don't find a solution there, it could well be that you'll have to
push the results out to ETS as Ulf suggested (and my experience is that
Ulf is generally right about such things - not to mention the fact that
he has many more years of experience with OTP than I do).
/s