On Sat, Dec 26, 2009 at 1:52 PM, tinku99 <navee...@gmail.com> wrote:
>> One option is to try something like infileiter.py
> There is a typo in the file: _ReadFileMetaInfo should be
> _read_file_meta_info in 2 places, but otherwise it seems to work
> nicely.

Hi Naveen,

It looks like you have replied to me personally -- did you mean to reply to the group? You have good questions and it would be nice for the discussion to be public. I was about to post to the group about new code I recently pushed to the repository, which incorporates the fileiter directly into pydicom file reading and also provides a new argument to read_file() called stop_before_pixels, which can simplify partial file reading. I'll still post that message soon, but meanwhile I'll answer your questions here as well.

> I would prefer if the whole file, or optionally the whole header, was
> read from disk all at once, and then parsed in memory.
> If you preallocate memory for the files, the whole process would be
> much faster.

It shouldn't be much faster -- operating systems cache disk reads (and read ahead), so they probably read at least the non-pixel data before pydicom asks for it anyway. But it can be done ... keep reading.

You sparked my interest about this, so I wrote some timing code (file attached). Tests on my machine show essentially no difference between reading from file and reading to memory and then parsing. Surprisingly, even reading the pixel data only costs about 10% extra time for images read from local disk (for the 'small' images I used -- just 512x512 CT and MR images). If read from a network disk (in my case, a USB drive attached to an NSLU2 device), then skipping pixels dropped the time almost in half.

The images I used are freely available from http://pcir.org, by the way, so you could try the same set if you want. If you have a large number of big images, however, that would be a very useful test to run.

> This would also enable reading dicom files from memory from other
> sources, such as an embedded storescp server.

You can read dicom files from memory using pydicom, but it hasn't been documented.
There is a memory file object, modeled after python's StringIO, used for "deflate" files which need to be decompressed and then read. You could use it something like this:

>>> import dicom
>>> from dicom.filebase import DicomStringIO
>>> memory_file = DicomStringIO(open("dicom_file", 'rb').read(known_length))
>>> ds = dicom.read_file(memory_file)

The problem is, with DICOM files, I don't think you can predict (in a general way) how much to read in. My timing code used 4000 bytes, calculated based on the known image size. DICOM used to have the concept of "group length" tags, but that seems to be deprecated now. So for most files you can only parse one data element at a time. Even some of those have "unknown length" VRs, and you have to read until a delimiter is found.

> Also, it would be nice to be able to give filereader a pointer to
> where to put the pixel data, i.e. in a 3d numpy array.

I'm not clear on exactly what you are asking here, but I think you are asking something similar to this discussion: http://groups.google.com/group/pydicom/browse_thread/thread/ee9bcc87bfc088e6. The answer there is not a turn-key solution, but numpy does allow one to paste together different arrays, I believe. I also intend to write some code to make reading a complete series easier; it is a common request. I've just added this as an issue to the issue list.

> For displaying files, is there any problem with using / suggesting
> pygame, which allows displaying from a numpy array (through surfarray,
> http://www.pygame.org/docs/ref/surfarray.html), instead of matplotlib
> or PIL, which seem to require writing to disk and are much slower?

I don't know pygame, but if you try this and get something working, I'd be happy to add it to the contrib folder, which recently has had several examples added using other graphics libraries.

Regards,
Darcy
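[To illustrate the "one data element at a time" point above, here is a minimal sketch of reading a single explicit-VR little-endian element. It is a simplified illustration only, not pydicom's actual code: it handles just the short 8-byte header form, not the OB/OW/SQ/UN/UT VRs or undefined lengths.]

```python
import struct
from io import BytesIO

def read_explicit_vr_element(fp):
    """Read one explicit-VR little-endian data element from a file-like
    object and return (tag, VR, value bytes). Simplified: short (8-byte)
    header form only."""
    group, elem = struct.unpack("<HH", fp.read(4))   # 2-byte group, 2-byte element
    vr = fp.read(2).decode("ascii")                  # two-character VR code
    length, = struct.unpack("<H", fp.read(2))        # 16-bit value length
    return (group, elem), vr, fp.read(length)

# A hand-built element: tag (0010,0010) PatientName, VR 'PN', value 'DOE^JOHN'
raw = struct.pack("<HH", 0x0010, 0x0010) + b"PN" + struct.pack("<H", 8) + b"DOE^JOHN"
tag, vr, value = read_explicit_vr_element(BytesIO(raw))
```

Note that the length field must be unpacked before the next element can be located, which is why the file can only be walked element by element.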
On Dec 27, 6:22 pm, Naveen Garg <naveen.g...@gmail.com> wrote:
> When I read a set of 80 ct images with ahk, it takes only 2 seconds.
> http://github.com/tinku99/ahkpacs/blob/master/python/dcmload.ahk
> It takes >8 seconds for only 10(yes 10, not 100) images each with pydicom:
There must be something else going on -- with time_test.py, my system
is reporting about 6.3 seconds to fully read and parse 100 CT images
from local disk. The new partial reading (no pixels) was about 5.7 sec/
100 CT files. I'm using the latest pydicom repository code but I doubt
there is anything in there that would be that much faster than
previous pydicom versions. I added another test to the time_test.py,
to simply read (using usual python open, read) 100 CT files, which on
my system took about 3 seconds, consistent with your ahk time.
I've posted the new time_test.py and two files capturing my computer's
stats to the latest repository push under source/dicom/test/
performance.
Timing things is notoriously difficult to do properly or consistently,
so I'd be happy for anyone to point out possible flaws in the methods.
My tests were done after restarting the computer -- repeated tests
were faster, presumably because the OS is caching reads in memory. On
the other hand, running the test under ipython took more than twice as
long (the first time, but then times dropped on each further run,
until an error about not having enough file handles). I also did the
tests on another computer (Win XP) and got similar results. The
profiling stats shows that there is no big time sink in any one part
of pydicom. If I pointed to a network disk instead of local disk, then
times went up dramatically (~42 seconds for full file read and parse
of 100 CTs) and file reading was the clear time sink.
I'd be happy for others to try the same tests on their systems and see
what the differences are. As I mentioned before, the image sets I
used are available at pcir.org, you can tell which ones I used by
looking at the code in time_test.py.
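[A minimal harness in the spirit of the comparison above -- this is a generic sketch, not the time_test.py from the repository; the file list is a placeholder:]

```python
import time

def time_call(func, *args):
    """One-shot wall-clock timer: return (elapsed seconds, result).
    Crude on purpose -- restart the machine between runs if you want
    to defeat the OS disk cache, as described above."""
    start = time.time()
    result = func(*args)
    return time.time() - start, result

def plain_read(filenames):
    """Baseline: read every byte with plain python open/read, no parsing.
    Parsing time is whatever a full read_file() takes beyond this."""
    return [open(name, "rb").read() for name in filenames]

# e.g., with real files on disk:
# elapsed, images = time_call(plain_read, ct_filenames)
```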
Regards,
Darcy
> It takes >8 seconds for only 10(yes 10, not 100) images each with pydicom:
You could try the read_partial() function in the new code directly,
pass it a function which returns False until the tags you are
interested in have been read, then returns True. Another idea if the
files are accessed multiple times is to store the dataset (the
no-pixels one) using python's pickle module or something -- it doesn't
help on the first read, but should be fast to read it rather than the
DICOM file on subsequent accesses to the same file (could check file
name and date/time stamp to make sure there have been no changes).
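A sketch of that caching idea (the function and cache-file naming here are made up for illustration; parse_func stands in for whatever does the actual DICOM read):

```python
import os
import pickle

def cached_read(filename, parse_func):
    """Return the parsed dataset for filename, using a pickle cache file
    alongside it; re-parse with parse_func only when the cache is missing
    or the file's modification time has changed."""
    cache_name = filename + ".pkl"
    mtime = os.path.getmtime(filename)
    if os.path.exists(cache_name):
        with open(cache_name, "rb") as f:
            cached_mtime, dataset = pickle.load(f)
        if cached_mtime == mtime:
            return dataset          # cache hit: skip DICOM parsing entirely
    dataset = parse_func(filename)  # cache miss: parse and store for next time
    with open(cache_name, "wb") as f:
        pickle.dump((mtime, dataset), f)
    return dataset
```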
I'll certainly keep in mind trying to make the code faster, but python
is not known for its great speed, and as I mentioned, there are no
clear time sinks apparent in the tests I ran. If we doubled the speed
of the next highest time user (after file read), that would only
improve the read time by a few percent. Most speed gains would
probably come at the expense of code readability and ease of
maintenance.
> By the way, the 4th test in time_test.py gives me an error:
>   File "c:\python26\lib\site-packages\pydicom-0.9.4svn-py2.6.egg\dicom\filereader.py", line 223, in read_dataset
>     logger.error(str(details) + " in file " + fp.name) # XXX is this visible enough to user code?
> AttributeError: 'DicomFileLike' object has no attribute 'name'
> not sure what the problem is...
That looks like an error in the error catching -- it's complaining
that the file object has no name attribute, which the DicomStringIO
wouldn't. I'll correct that error message in a future commit -- instead
of fp.name, using getattr(fp, 'name', "<no filename>") should work. If
you replace that in your filereader.py, and run again, you should see
the real error. My guess is it is running out of bytes on the partial
read, and the 4000 needs to be bumped up.
Regards,
Darcy
> You could try the read_partial() function in the new code directly,
> pass it a function which returns False until the tags you are
> interested in have been read, then returns True. Another idea if the
> files are accessed multiple times is to store the dataset (the
> no-pixels one) using python's pickle module or something -- it doesn't
> help on the first read, but should be fast to read it rather than the
> DICOM file on subsequent accesses to the same file (could check file
> name and date/time stamp to make sure there have been no changes).
I'm sorry if this is an ignorant question, but does it make sense (do
you already) cache tags on read? I mean, could you (do you) gain
speed by only reading the requested tags? Or allow an option to do
that?
Again, sorry if that's silly,
Matthew
On Dec 28, 9:24 pm, Darcy Mason <darcyma...@gmail.com> wrote:
> On Mon, Dec 28, 2009 at 8:03 PM, Matthew Brett <matthew.br...@gmail.com> wrote:
> > I'm sorry if this is an ignorant question, but does it make sense (do
> > you already) cache tags on read? I mean, could you (do you) gain
> > speed by only reading the requested tags? Or allow an option to do
> > that?
Ok ... I'm a little slow sometimes. I thought about this a bit
more ... I'm not sure if this is what you or Naveen were saying, but a
really good answer is to _not_ parse the data element value on
reading, just get the VR, the length, and capture the raw bytes which
make up the value. Parsing those bytes into python types can happen
later, only when an item is requested by the user code. I think this
has been pointed out by a couple of people in the past, but I didn't
really pick up on it until now. It should bring the reading time down
close to the python limit. It will take some restructuring of the code
to make it happen, but I think this will be a very good change. I'll
post again when I have something working.
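A toy sketch of the deferred-parsing idea (the class name, attributes, and conversion rules here are illustrative only, not pydicom's actual API):

```python
class RawDataElement(object):
    """Hold the raw bytes read from the file; convert them to a python
    value only the first time .value is requested."""

    def __init__(self, tag, VR, raw_bytes):
        self.tag = tag
        self.VR = VR
        self.raw_bytes = raw_bytes
        self._parsed = False
        self._value = None

    @property
    def value(self):
        if not self._parsed:           # defer parsing until first access
            self._value = self._convert()
            self._parsed = True
        return self._value

    def _convert(self):
        # Toy conversion rules: a decimal-string VR becomes a float,
        # everything else is just decoded text. Real DICOM VR handling
        # is much richer than this.
        text = self.raw_bytes.decode("ascii").strip()
        return float(text) if self.VR == "DS" else text

# Slice Thickness (0018,0050), VR 'DS', raw value b"2.5"
elem = RawDataElement((0x0018, 0x0050), "DS", b"2.5")
```

Reading then only stores (tag, VR, raw bytes) tuples; conversion cost is paid solely for the elements the user code actually touches.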
Regards,
Darcy
> Ok ... I'm a little slow sometimes. I thought about this a bit
> more ... I'm not sure if this is what you or Naveen were saying, but a
> really good answer is to _not_ parse the data element value on
> reading, just get the VR, the length, and capture the raw bytes which
> make up the value. Parsing those bytes into python types can happen
> later, only when an item is requested by the user code. I think this
> has been pointed out by a couple of people in the past, but I didn't
> really pick up on it until now. It should bring the reading time down
> close to the python limit. It will take some restructuring of the code
> to make it happen, but I think this will be a very good change. I'll
> post again when I have something working.
Ah - yes - that was the kind of thing I meant - but your earlier answer
also lightened the darkness of my ignorance.
Have you thought about Cython for optimization? I have found it
very good for taking me down to C-level speed while allowing me to
keep my code structured, modular, and readable. It's very well
supported too.
Best,
Matthew
>> Have you thought about Cython for optimization? I have found it
>> very good for taking me down to C-level speed while allowing me to
>> keep my code structured modular and readable. It's very well
>> supported too.
>
> Other than having heard the name, I don't know anything about Cython either.
> After a quick look, though, it seems to be mainly about defining static
> types for a compiler to work on, to speed up e.g. repeated math operations
> in a loop. I'm not sure what parts of pydicom would be possibilities for it
> to work on.
That's a good summary; the thing that made me think of it, though, was
when you mentioned python function call overhead. If you define your
input types for a function, you can call it within cython / pyrex file
at C speed, (this is the 'cdef' or 'cpdef' def statements) sometimes
vastly reducing python overhead. You can also define python / C
struct types that allow very fast object attribute access.
I recently did a large refactor of the scipy matlab reading routines
that increased speed by an order of magnitude to comparable or better
than matlab itself, and - importantly - the code remains fairly easy
to read and maintain. There's a large user community too and it's
heavily used in Sage (a very large mathematics project), and
increasingly in Scipy as well as other projects. I should say, I'm
not one of the developers, but I have been very impressed with it...
Best,
Matthew
>> I recently did a large refactor of the scipy matlab reading routines
>> that increased speed by an order of magnitude to comparable or better
>> than matlab itself, and - importantly - the code remains fairly easy
>> to read and maintain. There's a large user community too and it's
>> heavily used in Sage (a very large mathematics project), and
>> increasingly in Scipy as well as other projects. I should say, I'm
>> not one of the developers, but I have been very impressed with it...
>
> It sounds very interesting, and you've certainly hit the key words with
> readable and maintainable -- two of the biggest reasons IMO why python is
> such a great language.
Yes - right - and also why it's such a good idea to have a DICOM
reader, in particular, in Python.
> Well, if you had any time to suggest some specific
> changes that would really boost pydicom speed, I'd be happy to entertain
> those.
I would very much like to help, but we're coming up for a release, and
things are getting slow over the holidays, so, for the next few weeks
at least, I'm tied up, but I hope very much to be able to try and help
after that, if I can.
> Even if you could point out a couple of examples of the matlab
> reading code before/after that led to those big performance gains, just to
> illustrate the concepts in a bit more depth, that would be very helpful.
Sadly, sort of, when I rewrote the reader in Cython, I did a huge
refactor at the same time to simplify the code semantics, so it's
difficult to compare directly. I'm confident that the same code in
Python would be about 10 times slower, but I haven't tested it
directly. The kind of thing I mean, you can see here:
http://projects.scipy.org/scipy/browser/trunk/scipy/io/matlab/mio5_utils.pyx
The basic unit of a matlab file read is 'read_tag' (def
read_tag(...)). Of course I am calling this many many thousands of
times for a big complicated matlab file. The Cython rule is that if
your function or method is defined with the standard python 'def' then
it can be called from outside the module, in python, and has the
normal python function overhead. However, if defined with 'cdef', it
is essentially a C function, cannot be called direct from Python (from
outside the module), and is very fast. However, in the Cython code,
it still looks like a python call, eg:
    cdef int tag_res = self.cread_tag(mdtype_ptr,
                                      byte_count_ptr,
                                      tag_data)
There's also 'cpdef' which allows both python and C-like function calls.
You'll see in that file, that I've defined a C version of 'read_tag',
called 'cread_tag' that is very highly optimized, and yet, I think,
also readable. I call this routine often in the Cython file, but at
C function call overhead.
Other optimizations were doing the byte-swapping directly in C, and
minimizing the temporary python objects I was creating in order to set
up the array creations. For example, in the previous code, I was
first reading the information about the shape of the array into a
temporary numpy array object, then using that to create the returned
arrays. In the Cython code, I can often use C objects for the
temporary stuff, so that there is tiny overhead, and only use the call
into Python or Numpy APIs to make arrays that I will return to the
caller.
> For now I'll continue on the mods I mentioned, as I think they will provide
> a gain on their own, which would still put the code in a better place for
> further optimization with Cython or other methods. For example, the changes
> I'm trying involve reading (at first) into simpler data structures (just
> python built-ins), which eliminates function calls, but perhaps is also
> easier to boost with Cython. Now that the idea of minimizing the initial
> parsing has latched in my brain, I'm fairly convinced it is the right thing
> to do just in terms of program logic, and I have to try it before I can let
> it go ;-)
Good luck - sorry not to be of more practical use,
Best,
Matthew
> I'm byte swapping using python's struct.unpack -- I don't know how fast that
> module is. But it also needs to be done for each data element length as it
> is read in, to know how much to read next. So for reading, I don't think it
> can be done on an array all at once. The values themselves (when fully
> parsed) could be, but that can already be done at C speed using python's
> array module, or a numpy array.
Yes; in my case, I was creating temporary numpy arrays to take care of
the byte-swapping, and I believe this was causing lots of object
creation overhead - that may well not be the case for you. Still, I
think struct unpack is in python, and if you are calling it a lot, I
suspect you may be hitting some speed issues.
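[As an aside on struct and byte order: the '<' and '>' prefixes in the format string select the endianness, and pre-compiling a format with struct.Struct avoids re-parsing the format string on every call -- a small but real saving when unpacking millions of element headers. A quick illustration:]

```python
import struct

raw = b"\x01\x02"                      # the same two bytes on disk

little, = struct.unpack("<H", raw)     # interpret as little-endian uint16
big, = struct.unpack(">H", raw)        # interpret as big-endian uint16

# Pre-compiled format: parse "<HH" once, reuse the bound unpack method.
unpack_tag = struct.Struct("<HH").unpack
group, elem = unpack_tag(b"\x10\x00\x10\x00")
```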
>> Good luck - sorry not to be of more practical use,
>
> Actually I think this discussion has been very useful, and I've learned a
> lot. I have no doubt something practical will follow at some point.
Likewise - I'm looking forward to the time I can help more. And thank
you for pydicom, I was very pleased to find it, and I'm sure it's the
way forward for an adaptable DICOM reader...
Best,
Matthew
I've made the changes I had mentioned I was working on, although they
had a number of side-effects which I'm still working through. Some of
the standard unit tests are still failing, generally around reading
"undefined length" items and some sequences, it appears.
But the changes work on the test files I have been using, and have
made a significant improvement in reading speed, according to the
time_test.py profiling. On my machine, reading the 100 CT files is
coming in at essentially the same time as simply reading all the bytes
in python with no parsing (about 2.6 seconds, or about 2.5 times
faster than previous code). Repeat runs appear to gain a lot from disk
caching, eventually coming in at 0.3 sec/100 files, as compared with
0.1 sec for python read.
I'm not sure what has made the most difference - it is not clear from
the profiler stats. I suspect it is a combination of reducing function
call overhead (through using a generator for data elements and
combining file reads/struct.unpack calls) and simplifying tag creation
and comparison. I'm not sure avoiding parsing the values themselves
had much of an impact, since for most VR types there isn't much
conversion.
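The generator-plus-combined-unpack combination might look roughly like this (a simplified sketch, not the actual pydicom code: explicit VR little endian, short header form only):

```python
import struct
from io import BytesIO

def data_elements(fp):
    """Yield (tag, VR, value) one element at a time. The generator keeps
    the read loop in a single frame, trimming per-element function-call
    overhead, and one pre-compiled Struct unpacks the whole 8-byte
    header (tag + VR + length) in a single call."""
    unpack_header = struct.Struct("<HH2sH").unpack
    while True:
        header = fp.read(8)
        if len(header) < 8:            # end of stream
            return
        group, elem, vr, length = unpack_header(header)
        yield (group, elem), vr.decode("ascii"), fp.read(length)

# Two hand-built elements: Modality 'CT' and PatientName 'DOE^JOHN'
raw = (struct.pack("<HH2sH", 0x0008, 0x0060, b"CS", 2) + b"CT" +
       struct.pack("<HH2sH", 0x0010, 0x0010, b"PN", 8) + b"DOE^JOHN")
elements = list(data_elements(BytesIO(raw)))
```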
If anyone wants to try this out, the code is posted at an experimental
repository clone [1]. As I said, it is somewhat broken in that it
fails many unit tests. It may take a while to clear those up, or I may
instead try a couple of the simple changes applied to the main
repository code to see whether most of the speed improvements can be
gained without changing the code as much.
Regards,
Darcy
The speed improvements code has been cleaned up and merged into the
main pydicom repository. Getting all unit tests to pass required more
logic to be added, but the time tests show it is still as fast as
simple python reading with no parsing.
I'd encourage those who are interested to try it out [1] and report
any problems found.
[1] http://code.google.com/p/pydicom/source/checkout
-Darcy
Great work!
Adit
> --
> You received this message because you are subscribed to the Google Groups "pydicom" group.
> To post to this group, send email to pyd...@googlegroups.com.
> To unsubscribe from this group, send email to pydicom+u...@googlegroups.com.
> For more options, visit this group at http://groups.google.com/group/pydicom?hl=en.
>
>
I was about to start the new package, but the new code failed when
reading a different DICOM file I was working with, one that version
0.9.3 reads fine. I'm hoping to debug that before putting the package
together. But one way or another, I'll try to post something in the
next week.
--
Lic. José M. Rodriguez Bacallao
  File "/Library/Frameworks/Python.framework/Versions/5.1.1/lib/python2.5/site-packages/pydicom-0.9.3-py2.5.egg/dicom/dataset.py", line 363, in _get_pixel_array
    return self._getPixelArray()
  File "/Library/Frameworks/Python.framework/Versions/5.1.1/lib/python2.5/site-packages/pydicom-0.9.3-py2.5.egg/dicom/dataset.py", line 358, in _getPixelArray
    self._PixelArray = self._PixelDataNumpy()
  File "/Library/Frameworks/Python.framework/Versions/5.1.1/lib/python2.5/site-packages/pydicom-0.9.3-py2.5.egg/dicom/dataset.py", line 317, in _PixelDataNumpy
    need_byteswap = (self.isLittleEndian != sys_isLittleEndian)
  File "/Library/Frameworks/Python.framework/Versions/5.1.1/lib/python2.5/site-packages/pydicom-0.9.3-py2.5.egg/dicom/dataset.py", line 235, in __getattr__
    raise AttributeError, "Dataset does not have attribute '%s'." % name
dicom.dataset.PropertyError: AttributeError in pixel_array property:
Dataset does not have attribute 'isLittleEndian'.
Any ideas? Is the 'isLittleEndian' attribute something I could set
manually in the script?
Many Thanks,
Jeff
On Feb 5, 3:55 pm, Jeff <jeff.macin...@gmail.com> wrote:
> ... I am getting an error that was not there in the
> previous version, however, when I try to write the pixel_array from
> each image into a larger numpy array:
> ...
> Dataset does not have attribute 'isLittleEndian'.
>
> Any ideas? Is the 'isLittleEndian' attribute something I could set
> manually in the script?
This is simply because I've been changing names to PEP-8 style and
have missed some spots that need to be changed to "is_little_endian".
I'll hunt down those incorrect names and fix them in the next (and
hopefully last before the release package) commit.
-Darcy
The big change is the reorganization (and speed-up) of the reading
code. Other changes include the addition of contributed code for
viewing images, updates to the DICOM dictionary and private
dictionary, and many bug fixes.
I'll post release notes with more details on the pydicom wiki once the
package has been put together.
-Darcy
No, sorry, there is nothing new on vtk. The new contrib codes use wx,
PIL, and Tkinter.
I haven't used vtk or itk, so I don't know anything really about how
to put that together. [Suggestions/starting code/etc. are welcome].
-Darcy