Image processing: Otsu's method thresholding. Help with optimizing code/algorithm

1,089 views
Skip to first unread message

Aneesh Sathe

unread,
Nov 8, 2014, 10:34:24 AM11/8/14
to julia...@googlegroups.com
Hello!

I use matlab for image processing and was thinking of moving some of my code to julia. But i found that the matlab function graythresh() was missing. So, i wrote my own version. 

Since I've been using julia for less than a week i doubt my code and/or algorithm is the best. So any input is much appreciated

I've used the Images package to read my tif file. and then passed it to the graythresh() function pasted below

Comparisons for a test image:
Matlab: Elapsed time is 0.013389 seconds. thresh value: 0.1294
Julia: elapsed time: 1.035032847 seconds thresh value: 0.13013116644890854

Clearly, the julia code is slower but seems to give the right answer. 

where i think it might be slowing down:
1) the function is not a .jl file
2) the conversion to Array from the image format is inefficient
3) i'm doing loops wrong
4) the float values are slowing things down

i tried using the profiler but couldn't quite get it to work for either just @profile graythresh()   or   @profile graythresh(img)

Cheers!
-Aneesh 



Attached files:
1) the test code for Otsu's method with the graythresh function(remember to run that cell before the rest)
2) the tif image i used for testing

The relevant part of code from the ijulia notebook

function graythresh(in_img)
    
    g=convert(Array,in_img);
    step=65536;
    
    #get histogram values
    e,counts=hist(g[:],linrange(minfinite(g[:]),maxfinite(g[:]),step))
        
   im_len=length(in_img);
    
    tot_sum=0;
    for t=1:step-1 
        tot_sum=tot_sum+t*counts[t];
    end
    
    sumB = 0;
    wB = 0;
    wF = 0;
    varMax = 0;
    threshold = 0;
    for t = 1:step-1
        
        wB=wB + counts[t]; #weight background
        
        wF=im_len - wB; #weight foreground
        
        sumB=sumB+t*counts[t];
        
        mB = sumB / wB;            #Mean Background
        mF = (tot_sum - sumB) / wF;    # Mean Foreground
        
        varBetween = wB * wF * (mB - mF)^2;
        # Check if new maximum found
        if (varBetween > varMax) 
            varMax = varBetween;
            threshold = t;
        end
    end
    return e[threshold]
end


Ostu's method-test.ipynb
control.mvd2 - XY point 3-1.tif

Jason Merrill

unread,
Nov 8, 2014, 2:52:09 PM11/8/14
to julia...@googlegroups.com
Here's a sketch of a different algorithm that I believe converges to the same value:

1. Initialize thresh to the mean value of the image pixels
2. Compute the mean of the pixels that are larger than thresh, mplus, and smaller than thresh, mminus.
3. Set thresh to the mean of mminus and mplus, and then loop back to 1. Iterate to a fixed point.

You can implement this version without allocating any extra data structures, so it may well be much faster.

Sorry I don't have any references here. Writing from a phone.

Jameson Nash

unread,
Nov 8, 2014, 2:58:36 PM11/8/14
to julia...@googlegroups.com
I suspect you may have a type-stability issue in your algorithm, limiting Julia's ability to optimize. Typically, you want to initialize variables with `zero(eltype(A))` or similar such statements. See http://docs.julialang.org/en/latest/manual/performance-tips/ and https://github.com/astrieanna/TypeCheck.jl

Tim Holy

unread,
Nov 8, 2014, 5:02:11 PM11/8/14
to julia...@googlegroups.com
When I profiled it, it looked like `hist` is taking >95% of the time. Seems
like we need an optimized path for when edges are specified by Ranges (which is
not true of the posted code, but could easily be modified to do so).

--Tim

Aneesh Sathe

unread,
Nov 8, 2014, 8:51:33 PM11/8/14
to julia...@googlegroups.com
@Tim this is naive, but what exactly did you type to profile it?

Also, yes, defined edges would be much more sensible. Will give it a try.

Andreas Noack

unread,
Nov 8, 2014, 9:00:32 PM11/8/14
to julia...@googlegroups.com
The profiler is explained in the documentation

Tim Holy

unread,
Nov 9, 2014, 9:15:17 AM11/9/14
to julia...@googlegroups.com
Dear Aneesh,

How are you even opening that image file on Matlab? I get this error:
>> img = imread('control.mvd2 - XY point 3-1.tif', 1);
Error using rtifc
TIFF library error: 'TIFFReadDirectory: Incorrect count for
"ColorMap"; tag ignored.'.

Error in readtif (line 49)
[X, map, details] = rtifc(args);

Error in imread (line 434)
[X, map] = feval(fmt_s.read, filename, extraArgs{:});


In contrast, julia's Images opens it without trouble :-). But one important
thing to note: your file is a multitiff (there are 11 images in it), and in
Julia the image is of size 512x512x11. If in Matlab you're just processing one
512x512 image, that fact alone will explain an 11-fold speed difference.

That said, there was a problem with Julia's `hist` function; an attempt at
making it better is found here: https://github.com/JuliaLang/julia/pull/8952.
With that pull request, running hist as
e, counts = hist(in_img, 65536)
is rather faster than what you're citing. (You also don't need all those
`convert` and `g[:]` calls, and in fact they slow you down.)

--Tim

Aneesh Sathe

unread,
Nov 9, 2014, 2:39:53 PM11/9/14
to julia...@googlegroups.com
Dear Tim,
Hahaha, Sorry i didnt specify earlier. I use the bio formats based bfopen() and i ran graythresh on the whole stack. 

I got the same error with imread, its cause the stack is cut out (using fiji) from a larger stack of 33 images (3 channels, with 11 z-slices). I'm guessing imread() looks for more images as per the description but doesn't find them.

Yes, Images does read it okay but only if i cut out the substack. If i don't, then it interprets the three channels as a time dimension, which isnt a pain at the moment but will be if i start using it for work. 


I realized that both the convert and the g[:] would slow me down but the hist function just wouldn't work without that kind of dance. Also, graythresh (http://www.mathworks.com/help/images/ref/graythresh.html) uses reshape to make it all one image which might also add to speed. 

The pull request is well and good but personally i would rather have a dedicated image histogram function like imhist: http://www.mathworks.com/help/images/ref/imhist.html
which would give histograms based on input images. To me that's the only way to make life easier. ....maybe i'll write one :)

Something about Images: do you think it possible to use the bio formats' .jar file to import images from a microscope format to Images?
Opening a microscope format image file in the relevant software and then exporting it as tiff takes too long and i'd rather be able to access the images directly. 

I am considering writing my own method to get that done. But i know neither julia nor java so it will be a while before i can manage that. 

-Aneesh

Tim Holy

unread,
Nov 9, 2014, 3:49:22 PM11/9/14
to julia...@googlegroups.com
On Sunday, November 09, 2014 11:39:53 AM Aneesh Sathe wrote:
> Yes, Images does read it okay but only if i cut out the substack. If i
> don't, then it interprets the three channels as a time dimension, which
> isnt a pain at the moment but will be if i start using it for work.

Hmm, that sounds like an annotation problem.

> I realized that both the convert and the g[:] would slow me down but the
> hist function just wouldn't work without that kind of dance. Also,
> graythresh (http://www.mathworks.com/help/images/ref/graythresh.html) uses
> reshape to make it all one image which might also add to speed.
>
> The pull request is well and good but personally i would rather have a
> dedicated image histogram function like
> imhist: http://www.mathworks.com/help/images/ref/imhist.html
> which would give histograms based on input images. To me that's the only
> way to make life easier. ....maybe i'll write one :)

imhist is necessary in matlab largely because hist works columnwise; in a
sense, Julia's `hist` is like imhist. Is there some specific functionality
you're interested in? There's no reason Images can't provide a custom version
of `hist`.

> Something about Images: do you think it possible to use the bio formats'
> .jar file to import images from a microscope format to Images?
> Opening a microscope format image file in the relevant software and then
> exporting it as tiff takes too long and i'd rather be able to access the
> images directly.

Yes, expansion of Images' I/O capabilities would be great. I've wondered about
Bio-Formats myself, but not had a direct need, nor do I know Java (but see
JavaCall.jl, if you haven't already).

The other way to go, of course, is Julia native support. Our support for NRRD
is a reasonable model of this approach. However, the reason we use ImageMagick
is because the reality is that there are a lot of formats out there; Bio-
Formats would fill a similar need for vendor-specific file formats. Out of
curiousity, what's the original format you're using?

--Tim

Jason Merrill

unread,
Nov 9, 2014, 9:03:34 PM11/9/14
to julia...@googlegroups.com
I was fooling around with this problem a little bit tonight, and noticed that mean(in_img) is not type stable (or at least inference produces a Union type for this operation).  E.g. in Aneesh's notebook, I ran

    @code_typed mean(in_img)

and the result ended with

    end::Union(Gray{Float32},Gray{Float64})

Jason Merrill

unread,
Nov 9, 2014, 9:24:26 PM11/9/14
to julia...@googlegroups.com

Here's a little bit of follow up on this algorithm. It turns out it is just the "standard algorithm" for k-means clustering, specialized to 2 clusters in 1 dimension:

    https://en.wikipedia.org/wiki/K-means_clustering#Standard_algorithm

Here's an implementation:

    https://gist.github.com/jwmerrill/179aae6c3d2d4e770614

I wasn't able to benchmark against your version because it fails with

    `linrange` has no method matching linrange(::Gray{UfixedBase{Uint16,16}}, ::Gray{UfixedBase{Uint16,16}}, ::Int64)

for me.

This code works on either in_img or data(in_img), but it is much faster on data(in_img). I'm not totally sure why.

    @time graythresh_kmeans(data(in_img))
    elapsed time: 0.072612812 seconds (132 bytes allocated)

    @time graythresh_kmeans(in_img)
    elapsed time: 4.849397962 seconds (3737126436 bytes allocated, 55.69% gc time)

 

Tim Holy

unread,
Nov 9, 2014, 9:32:47 PM11/9/14
to julia...@googlegroups.com
I agree that `mean` is not type-stable (via `sum`->`mapreduce`). Good catch! I
wonder how long that has been lurking?

But I can't replicate the difference between `in_img` and `data(in_img)`, to me
they are the same.

--Tim

Aneesh Sathe

unread,
Nov 9, 2014, 10:18:51 PM11/9/14
to julia...@googlegroups.com
ah cool!

I was thinking of doing a multi-thresh method like the one here: http://people.csail.mit.edu/jayadev/papers/mlt_thr_img.pdf

this should make it easier...thank you!

Aneesh Sathe

unread,
Nov 9, 2014, 10:38:27 PM11/9/14
to julia...@googlegroups.com
Tim,
i would like the imhist to be idiot proof. (i've been teaching matlab and nothing puts new people off more than things not being idiot proof).
things like using 256 bins by default.... returning a plot  if no outputs are specified (basically make it like matlab's imthresh() ) 

Btw, on matlab using bioformats is actually the slowest part of my algorithm, so unless it can be faster in julia native support might be nicer. Bioformats also fails in that it reads the whole sequence at once... so running things on laptops with even GB-level datasets is impossible. I wrote my own version of bfopen to only open the required XYZCT for specified series, but that only solves the memory usage. 

the source format for my image was .mvd2 (perkin elmer spinning disk). 

i know about JavaCall.jl just havent had the time to play with it...

i was thinking it might be fun to attempt native support for a few formats. I can also generate test data in a few vendor formats for a few microscopes. 
perhaps even make it a julia-box based project. ;) 

Tim Holy

unread,
Nov 10, 2014, 5:55:08 AM11/10/14
to julia...@googlegroups.com
All good plans. (I'm not sure about using 65536 bins for 16-bit images,
though, because that would be more bins than there are pixels in some images.
Still, it's not all that much memory, really, so maybe that would be OK.)

It would be great to add native support. Presumably you've found the docs on
adding support for new formats.

For formats that encode large datasets in a single block (like NRRD), you can
work with GB-sized datasets on a laptop because you can use mmap (I do it
routinely). But the love of TIFF does demand an alternative solution.
Presumably we should add a lower-level routine that returns a structure that
facilitates later access, e.g.,
imds = imdataset("my_image_file")
img = imds["z", 14, "t", 7]
or somesuch.

--Tim

Aneesh Sathe

unread,
Nov 10, 2014, 9:49:17 PM11/10/14
to julia...@googlegroups.com
Unless I understood wrong (which is very possible) the 65536 bins were to cover all possible values a 16bit pixel can take. Though, in the actual graythresh function i will probably use 256 bins by default.

I did find the docs for adding custom formats (https://github.com/timholy/Images.jl/blob/master/doc/extendingIO.md

But perhaps making bio formats .jar file will be better in the long run for few reasons:

1) A lot more formats are covered so implementing that would allow coverage of more formats faster. 
2) I understand your reasons for making all images in the Gray range, but i prefer having "real" pixel values. That way its easier to correlate test data with something like Fiji or Matlab. And I don't understand Julia float handling fully but there might be a gain in speed if using non-float values. 
3) Bio formats already allows the reading of individual images based on XYZCT so that doesn't  need to be rebuilt. 

Course, the above is the ideal thing to do. I'm still trying to figure out how to use the .jar file, so i might just end up adding the custom format first. 

Let's see...

-Aneesh

Tim Holy

unread,
Nov 10, 2014, 10:19:29 PM11/10/14
to julia...@googlegroups.com
On Monday, November 10, 2014 06:49:17 PM Aneesh Sathe wrote:
> 2) I understand your reasons for making all images in the Gray range, but i
> prefer having "real" pixel values. That way its easier to correlate test
> data with something like Fiji or Matlab. And I don't understand Julia float
> handling fully but there might be a gain in speed if using non-float
> values.

They're not really float values, underneath they are integers. You can just say
`reinterpret(Uint16, x)`.

--Tim

Aneesh Sathe

unread,
Nov 10, 2014, 11:18:44 PM11/10/14
to julia...@googlegroups.com
Ah! I had misunderstood that. Thank you! :)
Reply all
Reply to author
Forward
0 new messages