how to process a huge data set?


Yaowang Li

Jun 8, 2015, 5:26:29 PM
to emsp...@googlegroups.com
I extracted a data set of segments from 360 images; its size is around 500 GB. But it failed to process in the segmentclass command.

The parameter file:
Image input stack                        = /media/mybook/fibrils-grid3/segment_01_Jun_2015_13_00_54_4730/protein_stack.hdf
Class average stack                      = averages.hdf
Reference image option                   = False
Image reference stack                    = 
Spring database option                   = False
spring.db file                           = 
Pixel size in Angstrom                   = 0.883
Estimated helix width and height in Angstrom = (300, 600)
Number of classes                        = 50
Number of iterations                     = 0
Keep intermediate files                  = False
Limit in-plane rotation                  = True
Delta in-plane rotation angle            = 50.0
X and Y translation range in Angstrom    = (50, 50)
High-pass filter option                  = False
Low-pass filter option                   = False
High and low-pass filter cutoffs in 1/Angstrom = (0.0, 0.0)
Binning option                           = False
Binning factor                           = 1
MPI option                               = True
Number of CPUs                           = 16
Temporary directory                      = /tmp

The report.log:

###############
INFO:CPU0:startlog:progress state: 1 %  [>                                        ]
INFO:CPU0:classify:

INFO:CPU0:classify:progress state: 10 %  [====>                                    ]
INFO:CPU0:sx_kmeans:

INFO:think:/opt/spring_v0-83-1449/parts/openmpi/bin/mpirun -np 8 /opt/spring_v0-83-1449/bin/springenv /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py protein_stack.hdf sxk_means00 rectmask.hdf --K=50 --maxit=500 --rand_seed=-1 --crit=D --MPI
        logged on Mon, 08 Jun 2015 19:34:09
INFO:CPU0:sx_kmeans:springenv /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py protein_stack.hdf sxk_means00 rectmask.hdf --K=50 --maxit=500 --rand_seed=-1 --crit=D --MPI
        logged on Mon, 08 Jun 2015 19:48:48
Traceback (most recent call last):
  File "/opt/spring_v0-83-1449/bin/segmentclass", line 137, in <module>
    sys.exit(spring.segment2d.segmentclass.main())
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/segment2d/segmentclass.py", line 732, in main
    stack.classify()
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/segment2d/segmentclass.py", line 700, in classify
    avgstack, varstack = self.sx_kmeans(aligned_stack, self.maskfile, self.noclasses)
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/segment2d/segmentclass.py", line 451, in sx_kmeans
    external_kmeans_run.check_expected_output_file(program_to_be_launched, avgstack)
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/csinfrastr/csproductivity.py", line 543, in check_expected_output_file
    raise IOError(error_message)
IOError: /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py did not finish successfully. The output file sxk_means00/averages.hdf was not found, please check logfile of /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py for details.

And the logfile:
****************************************************************************************************
                    Beginning of the program k-means: Mon, 08 Jun 2015 19:34:13
****************************************************************************************************

that is all.

I checked the segmentclass_XXX folder and found that SPRING first copies the file inside it and then processes it. I guess there is a problem here.

Anyway, could you suggest how to process such a huge data set?

Currently I have split it into smaller data sets and process them one by one, but I still hope I can process it all together.
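For illustration, the splitting step could be sketched generically as below. This is a minimal NumPy sketch on a small placeholder array; a real EMAN2 .hdf segment stack would need to be read with the EMAN2 Python API or split with e2proc2d.py instead, so all names and sizes here are hypothetical:

```python
import numpy as np

# Hypothetical stand-in for the segment stack, with the same
# (n_segments, height, width) layout as a 2D image stack.
stack = np.zeros((1000, 64, 64), dtype=np.float32)

# Split along the segment axis into 4 roughly equal sub-stacks, so each
# sub-stack can be classified independently and the results combined later.
chunks = np.array_split(stack, 4, axis=0)
```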

thanks

yaowang



spring --version && springenv e2version.py
Spring environment loaded.
/opt/spring_v0-83-1449/lib/python2.7/site-packages/setuptools-15.2-py2.7.egg/pkg_resources/__init__.py:1250: UserWarning: /home/think/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
GUI from package Emspring-0.83.1449
Spring environment loaded.
/opt/spring_v0-83-1449/lib/python2.7/site-packages/setuptools-15.2-py2.7.egg/pkg_resources/__init__.py:1250: UserWarning: /home/think/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
EMAN 2.1 alpha2 (CVS 2013/08/07 17:01:09)
Your EMAN2 is running on:  Ubuntu 14.04.2 LTS 3.13.0-52-generic x86_64
Your Python version is:  2.7.2

Carsten Sachse

Jun 9, 2015, 2:51:02 AM
to emsp...@googlegroups.com, liyaow...@gmail.com
Dear Yaowang,

This looks like a problem with SPARX' k_means.py clustering that SPRING calls. You can check that logfile in one of the subfolders for further details.

Generally, I would recommend turning on binning. This speeds things up and gets around memory issues with large data sets. In addition, it acts like a low-pass filter.
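What binning does can be sketched as block averaging. This is a minimal NumPy sketch, not SPRING's actual implementation: each k x k block of pixels is replaced by its mean, which shrinks the data by k**2 and suppresses high spatial frequencies (the low-pass effect mentioned above):

```python
import numpy as np

def bin_image(img, k=2):
    """k x k binning by block averaging.

    Reduces the pixel count by k**2 and acts as a low-pass filter,
    since fine (high-frequency) detail is averaged away.
    """
    h, w = img.shape
    img = img[: h // k * k, : w // k * k]  # crop to a multiple of k
    return img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
```

For example, a 4x4 image binned with k=2 becomes 2x2, i.e. a quarter of the data.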

Best wishes,


Carsten

Yaowang Li

Jun 9, 2015, 4:45:01 AM
to emsp...@googlegroups.com, liyaow...@gmail.com
Dear Carsten,

I agree that using the binning option is a good choice.
I checked the logfile (logfile_08_Jun_2015_19_34_12) and posted it here; it seems k-means tried to start but stopped immediately.

****************************************************************************************************
                    Beginning of the program k-means: Mon, 08 Jun 2015 19:34:13
****************************************************************************************************

Just out of personal curiosity, why does the program have to copy the original file before processing?
Using my data as an example: it is 500 GB, and it is then copied into another subfolder for clustering, so in the end 1 TB is used. It is also very slow. I have to say, though, that data sets are normally much smaller than my current one.
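The disk-usage arithmetic can be made explicit. This is back-of-the-envelope only, using the 500 GB figure from this thread and a hypothetical binning factor of 2 (the k**2 scaling assumes binning in both image dimensions):

```python
# Back-of-the-envelope disk usage for the working copy that SPRING makes.
stack_gb = 500                  # original segment stack (from this thread)
k = 2                           # hypothetical binning factor

binned_gb = stack_gb / k**2     # 2x2 binning shrinks the stack by a factor of 4
peak_gb = 2 * binned_gb         # original + working copy of the binned stack
```

With these numbers the binned stack is 125 GB and the peak usage 250 GB, compared to the 1 TB needed when copying the unbinned stack.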

Carsten Sachse

Jun 9, 2015, 4:59:45 AM
to emsp...@googlegroups.com, liyaow...@gmail.com
Hi Yaowang,

The output of the logfile suggests some sort of memory problem. Again, binning may resolve it. I have never worked with data sets as large as that. The file copy is required because the multiple parallel CPU operations need access to the file. It is a precautionary measure to avoid crashes from other programs or parallel segmentclass runs that would interfere with your run.

Best wishes,


Carsten 