Hi,
My goal is to have 10 parallel processes read the same file, with each process consuming 1/10th of it. They all, of course, read every line of the file, but each one skips the lines that do not belong to it. So process #1 would process lines 1, 11, 21, etc., and the second one would process lines 2, 12, 22, etc. The issue I have has nothing to do with the efficiency or performance of this approach, so let us set efficiency aside for now.
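To make the intended split concrete, here is a minimal sketch of which line numbers each process should end up with. The helper name lines_for_worker is hypothetical and is not part of the repository; it only mirrors the offset/jump arithmetic used in processOneFile, assuming the parent has id 1 and the workers have ids 2 and up.

```julia
# Hypothetical helper (not in the repository): lists the line numbers that
# processOneFile would print for a given process id, using the same
# offset/jump arithmetic as the real function.
function lines_for_worker(selfid, np, total)
    jump = np == 1 ? 1 : np - 1             # stride between lines owned by one process
    offset = np == 1 ? selfid : selfid - 1  # first line owned by this process
    return collect(offset:jump:total)
end

# With `-p 10`, nprocs() is 11 and the workers have ids 2 through 11.
println(lines_for_worker(2, 11, 25))   # first worker: lines 1, 11, 21
println(lines_for_worker(11, 11, 25))  # last worker: lines 10, 20
```

Taken together, the ten workers should cover every line exactly once, which is why I expect 250000 output lines plus the one initial print.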
The code is checked in to a repository here
https://github.com/harikb/scratchpad1 (including some sample data), but also quoted in the email at the end.
$ julia --version
julia version 0.3.7
# Input - some sample from nyc public database (see repo link above, but any file might be enough)
$ wc -l nyc311calls.csv
250000 nyc311calls.csv
# Ignore why I am not using a csv reader. This is just test data; there is no multi-line quoted csv data here.
$ julia -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
250001
# Non-parallel run, everything is fine. The one extra line is the initial print statement from _driver.jl
# Now, let us run with 10 parallel processes
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
26420
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -l
40915
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -c
1919321
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | wc -c
2172839
The output is all over the place. I think the processes stop after reaching a certain point in the input.
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl | tail
From worker 8: Process 8 is processing line 46617
From worker 5: Process 5 is processing line 46614
From worker 2: Process 2 is processing line 50751
From worker 4: Process 4 is processing line 45593
From worker 11: Process 11 is processing line 45380
From worker 6: Process 6 is processing line 46685
From worker 7: Process 7 is processing line 50756
From worker 9: Process 9 is processing line 46688
From worker 10: Process 10 is processing line 46699
From worker 3: Process 3 is processing line 46692
Now, I could buy that STDOUT is getting clobbered by multiple parallel writes to it. I am used to STDOUT getting garbled or interleaved data in other environments/languages, but I have never seen data go missing. There, the characters eventually make it to the output in some form.
But if I redirect the output to a file, it is perfectly fine every single time. Why is it that STDOUT does not get clobbered in that case?
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx
$ wc -l xx
250001 xx
$ wc -c xx
12988916 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
$ julia -p 10 -L ./julia_test_parallel.jl ./julia_test_parallel_driver.jl > xx; wc -l xx
250001 xx
== Code below is the same as in the GitHub repository linked above ==
$ cat ./julia_test_parallel.jl
#!/usr/local/julia-cb9bcae93a/bin/julia
function processOneFile(filename)
    np = nprocs()
    jump = np - 1
    jump = jump == 0 ? 1 : jump
    selfid = myid()
    # in a single-process setup, this function will be called for parent (id=1)
    assert(jump == 1 || selfid != 1)
    f = open(filename)
    offset = np == 1 ? selfid : selfid - 1
    lnum = 0
    for l in eachline(f)
        lnum += 1
        if lnum == offset
            println("Process $(selfid) is processing line $(lnum)")
            offset += jump
        end
    end
end
$ cat ./julia_test_parallel_driver.jl
#!/usr/local/julia-cb9bcae93a/bin/julia
filename = "nyc311calls.csv"
np = nprocs()
println("Started $(np) processes")
if (np > 1)
    if (myid() == 1)
        # Multiprocess and I am the parent
        @sync begin
            for i = 2:nprocs()
                @async remotecall_wait(i, processOneFile, filename)
            end
        end
    end
else
    processOneFile(filename)
end
Any help is appreciated.
Thanks
--
Harry