Puzzling behavior with multiple processes and STDOUT usage


Harry B

Apr 13, 2015, 3:28:17 PM4/13/15
to julia...@googlegroups.com
Hi,

My goal is to have 10 parallel processes read the same file, with each process consuming 1/10th of it. They all read every line of the file, but each skips over lines not belonging to it. So process #1 would process lines 1, 11, 21, etc., and the second would process lines 2, 12, 22, etc. The issue I am seeing has nothing to do with the efficiency or performance of this approach, so let us set efficiency aside for now.
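As a minimal sketch of the intended striping (illustrative only; the real code is quoted below):

```julia
# Which worker (1-based) owns a given line, for nw workers.
# Worker 1 gets lines 1, 11, 21, ...; worker 2 gets lines 2, 12, 22, ...
owner(lnum, nw) = mod1(lnum, nw)

owner(21, 10)  # -> 1
owner(12, 10)  # -> 2
```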

The code is checked into a repository here https://github.com/harikb/scratchpad1 (including some sample data), and is also quoted at the end of this email.

$ julia --version
julia version 0.3.7

# Input: a sample from the NYC public database (see repo link above; any large file would do)
$ wc -l nyc311calls.csv
250000 nyc311calls.csv
# Ignore why I am not using a CSV reader; this is just test data. There is no multi-line quoted CSV data here.

$ julia -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
250001
# Non-parallel run, everything is fine. One extra line is the initial print statement from _driver.jl

# Now, let us run with 10 parallel processes
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
26420
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
40915
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -c
1919321
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -c
2172839

The output seems all over the place. I think the processes stop after reaching a certain point in the input.

$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | tail
        From worker 8:  Process 8 is processing line 46617
        From worker 5:  Process 5 is processing line 46614
        From worker 2:  Process 2 is processing line 50751
        From worker 4:  Process 4 is processing line 45593
        From worker 11: Process 11 is processing line 45380
        From worker 6:  Process 6 is processing line 46685
        From worker 7:  Process 7 is processing line 50756
        From worker 9:  Process 9 is processing line 46688
        From worker 10: Process 10 is processing line 46699
        From worker 3:  Process 3 is processing line 46692

Now, I could buy that STDOUT is getting clobbered by multiple parallel writes to it. I am used to STDOUT carrying garbled/interleaved data in other environments/languages, but I haven't seen missing data before; the characters eventually make it to the output in some form.

But if I redirect the output to a file, it is perfectly fine every single time. Why is it that STDOUT does not get clobbered in that case?

$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx
$ wc -l xx
250001 xx
$ wc -c xx
12988916 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx

== Code below same as the quoted github link ==

$ cat ./julia_test_parallel.jl
#!/usr/local/julia-cb9bcae93a/bin/julia
function processOneFile(filename)

    np = nprocs()
    jump = np - 1
    jump = jump == 0 ? 1 : jump

    selfid = myid()

    # in a single-process setup, this function will be called for parent (id=1)
    assert(jump == 1 || selfid != 1)

    f = open(filename);
    offset = np == 1 ? selfid : selfid - 1
    lnum = 0
    for l in eachline(f)
        lnum += 1
        if lnum == offset
            println("Process $(selfid) is processing line $(lnum)")
            offset += jump
        end
    end
end

$ cat ./julia_test_parallel_driver.jl
#!/usr/local/julia-cb9bcae93a/bin/julia
filename = "nyc311calls.csv"
np = nprocs()
println("Started $(np) processes")
if (np > 1)
    if (myid() == 1)
        # Multiprocess and I am the parent
        @sync begin
            for i = 2:nprocs()
                @async remotecall_wait(i, processOneFile, filename)
            end
        end
    end
else
    processOneFile(filename)
end

Any help is appreciated.

Thanks
--
Harry

Harry B

Apr 13, 2015, 6:07:41 PM4/13/15
to julia...@googlegroups.com

I have confirmed the same behavior with the latest nightly as well.

$ julia --version
julia version 0.4.0-dev+4159


$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
38968

$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
53050

$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
250001

Thanks
--
Harry

wil...@gmail.com

Apr 14, 2015, 9:56:35 PM4/14/15
to julia...@googlegroups.com
It looks like the program exits before the STDOUT output is finished. That is because the workers' console output is redirected to the master and processed asynchronously. If you run your program in the Julia console, you'll get the correct output.
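If that diagnosis is right, one crude workaround (illustrative and untested; not a proper fix) is to keep the master alive long enough for the forwarded worker output to drain before it exits, e.g. at the end of the driver script:

```julia
# After the @sync block in julia_test_parallel_driver.jl:
flush(STDOUT)  # flush the master's own stdout buffer
sleep(1)       # crude: give the async tasks forwarding worker output time to drain
```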

Harry B

Apr 15, 2015, 12:56:43 AM4/15/15
to julia...@googlegroups.com
Thank you. I changed the program to explicitly write to real files named uniquely by myid(), and that way all the work gets done.

I also noticed that the default file buffer is quite large: nothing is flushed to disk until several hundred KB have accumulated. This probably applies to STDOUT too.
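For reference, the change amounted to something like this (file name scheme is illustrative, not the exact code):

```julia
# In processOneFile: write to a per-worker file instead of STDOUT.
out = open("output_worker_$(myid()).txt", "w")
for l in eachline(f)
    lnum += 1
    if lnum == offset
        println(out, "Process $(myid()) is processing line $(lnum)")
        offset += jump
    end
end
close(out)  # close() flushes the large default buffer
```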

Thank you for looking into this.
--
Harry