NetCDF Segmentation Fault


Nick Vance

Mar 1, 2013, 9:30:28 PM
to netcdf4...@googlegroups.com
Hi,

I'm hitting what seems to be a bug in the EPD NetCDF reading or writing library. I'm using EPD version 7.3.1 (epd-7.3-1-rh5-x86_64).  Here's the python -v dump of the netCDF packages that are loaded by my script:

--------------------------------------------
dlopen("/opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/netCDF4.so", 2);
import netCDF4_utils # from /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/netCDF4_utils.py
import netcdftime # from /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/netcdftime.py
import netCDF4 # dynamically loaded from /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/netCDF4.so
--------------------------------------------

I'm basically trying to write out three NumPy arrays (dataset, scan_lines, and cells) as a netCDF4 file and then read that file back in Python. The write operation seems to work fine, without any errors or issues. However, when I read the exact same file back, Python crashes with a segmentation fault, so I don't get any error message pointing to a specific line or library.
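
For reference, the basic round-trip looks roughly like the sketch below (a simplified stand-in, not the attached script; the real variable names, dtypes, and dimensions differ):

--------------------------------------------
import numpy as np
import netCDF4

scan_lines = np.arange(10, dtype='f8')  # stand-in for one of the real arrays

# Write the array out as a netCDF4 file (this part never complains).
nc = netCDF4.Dataset('test.netcdf', 'w', format='NETCDF4')
nc.createDimension('scan_line', len(scan_lines))
var = nc.createVariable('scan_lines', 'f8', ('scan_line',))
var[:] = scan_lines
nc.close()

# Re-reading the same file in the same process is where the crash happens.
nc = netCDF4.Dataset('test.netcdf', 'r')
read_back = nc.variables['scan_lines'][:]
nc.close()
--------------------------------------------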

I created a script that will usually end in this segmentation fault (I'm working on getting one that will segfault every time; see the note below). It's attached to this email; just run netcdfReadWriteTest.py.

I submitted this bug to Jonathan Rocher at Enthought, and he confirmed the issue under RHEL and directed me to contact you. He also confirmed that the same code triggers an "Access violation reading" error under Windows and works without issues under Mac OS.

I'm hoping you might be able to look into this to help suss out the problem.

Thanks,
Nick Vance

Note: Sometimes when I run the test I don't get a backtrace, just a Segmentation fault error:

--------------------------------------------
$ python netcdfReadWriteTest.py 
Reading Pickle Files In: ./raw_data_dumps/N65684083W167928375.5000m.2010.002.134000.aqua.modis_extracted
Read parameter sizes: 1, 10, 71
Writing out NetCDF File: /data/data-j/collection-pieces/hytes/scenes/lab-test-scenes/version-2-1-testdata/code/netCDF-test/test.netcdf
Re-reading NetCDF File: /data/data-j/collection-pieces/hytes/scenes/lab-test-scenes/version-2-1-testdata/code/netCDF-test/test.netcdf
----------1----------
----------2----------
----------3----------
----------4----------
Segmentation fault
--------------------------------------------

Other times I get a backtrace like the one below (I'm unsure what causes the difference, but it might depend on which group of Pickle data you load; there are several included):

--------------------------------------------
$ python netcdfReadWriteTest.py 
Reading Pickle Files In: ./raw_data_dumps/N65684083W167928375.5000m.2010.002.232000.aqua.modis_extracted
Read parameter sizes: 1, 10, 73
Writing out NetCDF File: /data/data-j/collection-pieces/hytes/scenes/lab-test-scenes/version-2-1-testdata/code/netCDF-test/test.netcdf
Re-reading NetCDF File: /data/data-j/collection-pieces/hytes/scenes/lab-test-scenes/version-2-1-testdata/code/netCDF-test/test.netcdf
----------1----------
----------2----------
----------3----------
----------4----------
----------5----------
----------6----------
Re-Read parameter sizes: 1, 10, 73
Reading Pickle Files In: ./raw_data_dumps/N65684083W167928375.5000m.2010.002.214000.aqua.modis_extracted.dat1
*** glibc detected *** /opt/epd-7.3-1-rh5-x86_64/bin/python: free(): invalid next size (normal): 0x00000000021ab200 ***
======= Backtrace: =========
/lib/libc.so.6(+0x78bb6)[0x7fb195b47bb6]
/lib/libc.so.6(cfree+0x73)[0x7fb195b4e483]
/opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/core/multiarray.so(+0x832af)[0x7fb193c292af]
/opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/core/multiarray.so(+0x8330e)[0x7fb193c2930e]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0x7b163)[0x7fb196777163]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyDict_SetItem+0x73)[0x7fb196778183]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalFrameEx+0x4d7b)[0x7fb1967d63cb]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCodeEx+0x8d2)[0x7fb1967d8c12]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyEval_EvalCode+0x32)[0x7fb1967d8c62]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(+0xf68e2)[0x7fb1967f28e2]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyRun_FileExFlags+0x96)[0x7fb1967f29b6]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(PyRun_SimpleFileExFlags+0x1f7)[0x7fb1967f3f17]
/opt/epd-7.3-1-rh5-x86_64/lib/libpython2.7.so.1.0(Py_Main+0xc86)[0x7fb196804a86]
/lib/libc.so.6(__libc_start_main+0xfd)[0x7fb195aedc4d]
/opt/epd-7.3-1-rh5-x86_64/bin/python[0x4006f9]
======= Memory map: ========
00400000-00401000 r-xp 00000000 08:21 45368962                           /opt/epd-7.3-1-rh5-x86_64/bin/python2.7
00600000-00601000 rw-p 00000000 08:21 45368962                           /opt/epd-7.3-1-rh5-x86_64/bin/python2.7
019a8000-026d1000 rw-p 00000000 00:00 0                                  [heap]
7fb188000000-7fb188021000 rw-p 00000000 00:00 0 
7fb188021000-7fb18c000000 ---p 00000000 00:00 0 
7fb18dc81000-7fb18dc97000 r-xp 00000000 08:21 5242885                    /lib/libgcc_s.so.1
7fb18dc97000-7fb18de96000 ---p 00016000 08:21 5242885                    /lib/libgcc_s.so.1
7fb18de96000-7fb18de97000 r--p 00015000 08:21 5242885                    /lib/libgcc_s.so.1
7fb18de97000-7fb18de98000 rw-p 00016000 08:21 5242885                    /lib/libgcc_s.so.1
7fb18de98000-7fb18e013000 r-xp 00000000 08:21 45369175                   /opt/epd-7.3-1-rh5-x86_64/lib/libcrypto.so.1.0.0
7fb18e013000-7fb18e212000 ---p 0017b000 08:21 45369175                   /opt/epd-7.3-1-rh5-x86_64/lib/libcrypto.so.1.0.0
7fb18e212000-7fb18e235000 rw-p 0017a000 08:21 45369175                   /opt/epd-7.3-1-rh5-x86_64/lib/libcrypto.so.1.0.0
7fb18e235000-7fb18e238000 rw-p 00000000 00:00 0 
7fb18e238000-7fb18e288000 r-xp 00000000 08:21 45371397                   /opt/epd-7.3-1-rh5-x86_64/lib/libssl.so.1.0.0
7fb18e288000-7fb18e487000 ---p 00050000 08:21 45371397                   /opt/epd-7.3-1-rh5-x86_64/lib/libssl.so.1.0.0
7fb18e487000-7fb18e48f000 rw-p 0004f000 08:21 45371397                   /opt/epd-7.3-1-rh5-x86_64/lib/libssl.so.1.0.0
7fb18e48f000-7fb18e492000 r-xp 00000000 08:21 45371005                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/_hashlib.so
7fb18e492000-7fb18e692000 ---p 00003000 08:21 45371005                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/_hashlib.so
7fb18e692000-7fb18e693000 rw-p 00003000 08:21 45371005                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/_hashlib.so
7fb18e693000-7fb18e697000 r-xp 00000000 08:21 45371000                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/_locale.so
7fb18e697000-7fb18e897000 ---p 00004000 08:21 45371000                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/_locale.so
7fb18e897000-7fb18e898000 rw-p 00004000 08:21 45371000                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/_locale.so
7fb18e898000-7fb18e8a8000 r-xp 00000000 08:21 45370984                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/datetime.so
7fb18e8a8000-7fb18eaa7000 ---p 00010000 08:21 45370984                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/datetime.so
7fb18eaa7000-7fb18eaab000 rw-p 0000f000 08:21 45370984                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/datetime.so
7fb18eaab000-7fb18eaee000 r-xp 00000000 08:21 1050318                    /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/random/mtrand.so
7fb18eaee000-7fb18eced000 ---p 00043000 08:21 1050318                    /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/random/mtrand.so
7fb18eced000-7fb18ed23000 rw-p 00042000 08:21 1050318                    /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/random/mtrand.so
7fb18ed23000-7fb18ed28000 r-xp 00000000 08:21 45370993                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/strop.so
7fb18ed28000-7fb18ef27000 ---p 00005000 08:21 45370993                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/strop.so
7fb18ef27000-7fb18ef29000 rw-p 00004000 08:21 45370993                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/strop.so
7fb18ef29000-7fb18ef32000 r-xp 00000000 08:21 1050128                    /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/fft/fftpack_lite.so
7fb18ef32000-7fb18f132000 ---p 00009000 08:21 1050128                    /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/fft/fftpack_lite.so
7fb18f132000-7fb18f133000 rw-p 00009000 08:21 1050128                    /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/site-packages/numpy/fft/fftpack_lite.so
7fb18f133000-7fb18f135000 r-xp 00000000 08:21 45371026                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/grp.so
7fb18f135000-7fb18f334000 ---p 00002000 08:21 45371026                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/grp.so
7fb18f334000-7fb18f335000 rw-p 00001000 08:21 45371026                   /opt/epd-7.3-1-rh5-x86_64/lib/python2.7/lib-dynload/grp.so
Aborted
--------------------------------------------
netCDF-test.zip

Jeffrey Whitaker

Mar 2, 2013, 9:01:20 AM
to netcdf4...@googlegroups.com
Nick:  I've confirmed this bug with the latest svn, netcdf 4.2.1.1 and hdf5 1.8.10 on macos x.  As you say, the file looks fine, and can be read just fine in a separate script.  Looks like the segfault is only triggered when the file is re-read in the same script.  Don't yet know whether this is a bug in the python module or one of the C libs.  Will get back to you when I know more.

Regards, Jeff

Jeffrey Whitaker

Mar 3, 2013, 9:36:36 AM
to netcdf4...@googlegroups.com
Nick:  That error message is saying that something continued using (and modifying) an object after it had been freed (deallocated). Specifically, it appears to be related to the 'cells' compound variable, which is based on a quite large (108-record) numpy array. It could be that an object is dropping out of scope and being cleaned up by the python garbage collector, but is then used again by the python module. Or it could be a bug in the C lib causing some object to be deallocated before the python interface is done with it. I'll continue to try to track it down, but it may take a while - this is a tough one. If it's an error in the C lib, I'll need to demonstrate that by creating a C program that triggers it, and send that to the netcdf developers.


Regards, Jeff


Nick Vance

Mar 3, 2013, 3:33:35 PM
to netcdf4...@googlegroups.com
Jeff, thanks for your work.  I'll pass the info on to my team and keep monitoring here if you have any updates.

-Nick

Jeffrey Whitaker

Mar 3, 2013, 11:43:07 PM
to netcdf4...@googlegroups.com
Nick:  I've found a workaround for you (script attached). It turns out the error happens only if the pickled numpy arrays are loaded in the same loop that creates and re-reads the netcdf files. In the attached script, I moved the pickle loads into a separate loop in which lists of numpy arrays are created from the pickled data. I think the pickle loads must somehow contain references to numpy arrays that are going out of scope and/or getting corrupted. I can't tell whether this is a consequence of a netcdf4-python bug, a bug in numpy, or something else. If I were a betting man, I'd bet on a numpy bug (mostly since I've been poring through the netcdf4-python code for the last couple of days and don't see anything suspicious). If you google 'numpy, pickle, segfault', you'll get a lot of hits. This one, pertaining to pickles of structured arrays, may be relevant:

https://github.com/numpy/numpy/issues/3003
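
In outline, the restructuring in the attached script is along these lines (the file names and loop bodies here are hypothetical, just to show the two-loop structure):

--------------------------------------------
import cPickle as pickle

pickle_files = ['dump1.pkl', 'dump2.pkl']  # hypothetical names

# Loop 1: load all the pickled numpy arrays up front and keep references
# to them in ordinary python lists.
loaded_arrays = []
for fname in pickle_files:
    with open(fname, 'rb') as f:
        loaded_arrays.append(pickle.load(f))

# Loop 2: create and re-read the netCDF files, with no pickle loads
# interleaved with the netCDF4 calls.
for arrays in loaded_arrays:
    # ... write arrays to a netCDF4.Dataset, close it, reopen and read ...
    pass
--------------------------------------------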

Regards, Jeff
netcdfReadWriteTest.py

Nick Vance

Mar 3, 2013, 11:58:17 PM
to netcdf4...@googlegroups.com
Jeff,

I'm certain the bug doesn't have anything to do with Pickle. It originally cropped up when importing data from custom data files; I just dumped the loaded values with Pickle to create a script that could be shared easily. Jonathan Rocher at Enthought actually sent me a modified version of the script earlier today that loads data internally without pickle and segfaults every time. I've attached that version. Maybe it will help narrow down the source of the error?

I will investigate and discuss your suggested work-arounds with my team on Monday.

Thanks,
Nick
netcdfReadWriteTest2.py

Jonathan Rocher

Mar 4, 2013, 4:17:08 PM
to netcdf4...@googlegroups.com
Dear Jeff, 

Thanks for looking into this. Following your first suggestion, I moved the reading part into another script, but on my linux machine that has not removed the segfault. The script you attached leads to a segfault as well, still on my linux VM (not on OSX). Have you tested it under Linux as well? It could also be that I am slightly behind on versions of things.

Best,
Jonathan

Jeffrey Whitaker

Mar 4, 2013, 4:50:39 PM
to netcdf4...@googlegroups.com
Jonathan:  Both my version and the version that Nick sent earlier today work for me on macos x with netcdf 4.2.1.1 and hdf5 1.8.10. I do get the segfault on linux with 4.2 and 1.8.9 - I'll try to upgrade the libraries on linux and report back. It's likely there is some memory corruption somewhere, and on some systems we're just lucky enough to avoid the segfault.

-Jeff

Jeffrey Whitaker

Mar 4, 2013, 5:05:20 PM
to netcdf4...@googlegroups.com


The segfault also occurs with 4.2.1.1/1.8.10 on linux. On linux I do get this traceback:

*** glibc detected *** python: munmap_chunk(): invalid pointer: 0x0000000001bf30f0 ***
======= Backtrace: =========
/lib64/libc.so.6(+0x75916)[0x2b542cdd8916]
/pan2/projects/gfsenkf/whitaker/lib/libhdf5.so.7(H5FL_blk_free+0x1c8)[0x2b54311843b8]
/pan2/projects/gfsenkf/whitaker/lib/libhdf5.so.7(+0x8a79c)[0x2b543114979c]
/pan2/projects/gfsenkf/whitaker/lib/libhdf5.so.7(H5Dread+0xc7)[0x2b5431149dc7]
/pan2/projects/gfsenkf/whitaker/lib/libnetcdf.so.7(nc4_get_vara+0x7f4)[0x2b5430bf4874]
/pan2/projects/gfsenkf/whitaker/lib/libnetcdf.so.7(NC4_get_vara+0x72)[0x2b5430bed6d2]
/pan2/projects/gfsenkf/whitaker/lib/libnetcdf.so.7(NC_get_vara+0x77)[0x2b5430bc1d67]
/pan2/projects/gfsenkf/whitaker/lib/libnetcdf.so.7(nc_get_vara+0x91)[0x2b5430bc301

Right now, I'm out of ideas.  I think the next thing to do is to try to reproduce this with a C program to see if the problem is in the HDF5 or netcdf libs, or in the python interface.  Unfortunately, I won't be able to do this anytime soon.

-Jeff

Jeffrey Whitaker

Mar 5, 2013, 7:47:50 AM
to netcdf4...@googlegroups.com
Nick:  Good news - I've found the problem and it's an easy fix. When a compound data type is created from a numpy dtype, the dtype must be created with align=True so that padding is added to the fields to match what a C compiler would produce for an equivalent C struct (http://docs.scipy.org/doc/numpy/reference/generated/numpy.dtype.html). Otherwise, when the data is read back in, it may not fit in the numpy array buffer, and memory may be overwritten and corrupted. Please give svn head a try and see if the problem is indeed fixed. If so, I will create a 1.0.4 release.
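
A minimal sketch of what this means in practice (the field names here are made up, not the ones in Nick's data):

--------------------------------------------
import numpy as np
import netCDF4

# align=True adds the padding a C compiler would insert for the same struct.
cell_dtype = np.dtype([('id', 'i4'), ('value', 'f8')], align=True)

nc = netCDF4.Dataset('compound_test.nc', 'w', format='NETCDF4')
nc.createDimension('cell', 3)
cell_t = nc.createCompoundType(cell_dtype, 'cell_t')
cells = nc.createVariable('cells', cell_t, ('cell',))

data = np.zeros(3, dtype=cell_dtype)
data['id'] = [1, 2, 3]
data['value'] = [0.1, 0.2, 0.3]
cells[:] = data
nc.close()

# Reading back into an array with the same aligned dtype now fits the buffer.
nc = netCDF4.Dataset('compound_test.nc')
read_back = nc.variables['cells'][:]
nc.close()
--------------------------------------------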

Regards, Jeff

Nick Vance

Mar 5, 2013, 4:49:22 PM
to netcdf4...@googlegroups.com
Yes, 1.0.4 pulled and built from SVN makes the test scripts work where they failed before. Thanks!

-Nick

Jeffrey Whitaker

Mar 6, 2013, 11:04:37 AM
to netcdf4...@googlegroups.com
OK, good. I've released version 1.0.4, including windows binaries. Let me know if you run into any more problems. The compound type code needs some exercise; I don't think it's gotten much use yet.

-Jeff