Hi Simone,
I think that if you are limited in RAM per node, then using MPI is a mistake; you should instead go for a threading-based solution so that all the threads can pool the available RAM. Alternatively, maybe you are running Open MPI in hybrid mode (MPI between nodes plus threads within a node), but that seems unlikely from your comments. PyTables with NumExpr as the backend is the obvious solution here. Threads do not usually scale as nicely as processes, but their setup/teardown time is minimal and they share RAM. If your cluster nodes are SMP (symmetric multiprocessing, i.e. shared-memory) machines, then they are usually suitable for this sort of thing. PyTables is just generally a better HDF5 interface than h5py, in my opinion.
In particular, pay attention to the 'evaluate' functionality, which comes from the NumExpr library: https://github.com/pydata/numexpr
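As a rough sketch of what I mean (untested, and the file/node names are made up, so substitute your own), you let PyTables pull the arrays off disk and NumExpr do one threaded pass over them:

    import tables
    import numexpr as ne

    # Hypothetical file and node names -- replace with yours.
    with tables.open_file("stack.h5", mode="r") as h5:
        stack = h5.root.images.read()   # load the stack into shared RAM
        dark = h5.root.dark.read()
        flat = h5.root.flat.read()
        # One threaded pass over the data, no Python-level loop:
        corrected = ne.evaluate("(stack - dark) / flat")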
If those 16-core machines are dual-socket (two 8-core CPUs), then NumExpr probably will not scale well past 8 cores, but 8 cores is still better than 3. It's very difficult to beat NumExpr at image processing on the CPU in the Python landscape, because the bottleneck is the volume of data flowing between the CPUs and memory. I've used pyFFTW and NumExpr in combination to keep up with competitors who program in Fortran 90.
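If it does turn out to be dual-socket, you can pin NumExpr to one socket's worth of cores and benchmark before going wider. The pyFFTW combination looks something like this (a sketch; the stack here is a random stand-in for your data):

    import numpy as np
    import numexpr as ne
    import pyfftw
    from pyfftw.interfaces import numpy_fft

    ne.set_num_threads(8)             # one socket's worth of threads
    pyfftw.interfaces.cache.enable()  # reuse FFTW plans across calls

    stack = np.random.rand(64, 512, 512)      # stand-in for your image stack
    F = numpy_fft.fft2(stack, threads=8)      # threaded 2-D FFT of each plane
    power = ne.evaluate("real(F * conj(F))")  # threaded power spectrum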
If your data is compressible you could also look at zarr, which uses Blosc to compress NumPy arrays chunk-by-chunk and decompress them on the fly, effectively into CPU cache, for processing: https://github.com/zarr-developers/zarr-python
PyTables also includes Blosc support now.
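A minimal zarr sketch (assuming zarr v2 with numcodecs; the shapes and compressor settings are just placeholders to tune for your data):

    import numpy as np
    import zarr
    from numcodecs import Blosc

    compressor = Blosc(cname="lz4", clevel=5, shuffle=Blosc.SHUFFLE)
    z = zarr.zeros((64, 512, 512), chunks=(1, 512, 512),
                   dtype="f4", compressor=compressor)
    z[:] = np.random.rand(64, 512, 512)  # compressed transparently on write
    plane = z[10]                        # decompressed transparently on read

    # The PyTables equivalent is a Filters object:
    #   tables.Filters(complib="blosc", complevel=5)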
Stuff like Dask and Hadoop is aimed at parallelizing algorithms in machine learning and similar fields; if you just want to do matrix algebra, they're probably sub-optimal. It comes down to: how big is each of your stacks? You said chunking isn't practical; why? (See the sketch below for what I mean by chunking.)
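In case it helps, here is a rough out-of-core chunking sketch with PyTables (node names invented), processing a large on-disk stack one slab at a time so RAM use stays bounded:

    import tables
    import numexpr as ne

    with tables.open_file("stack.h5", mode="r") as h5:
        images = h5.root.images  # hypothetical array node, shape (N, ny, nx)
        for start in range(0, images.shape[0], 100):
            slab = images[start:start + 100]    # read 100 planes from disk
            result = ne.evaluate("slab * 2.0")  # threaded work on the slab
            # ... write out or accumulate result here ...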
Robert