how to process a huge data set?


Yaowang Li

Jun 8, 2015, 5:26:29 PM
to emsp...@googlegroups.com
I extracted a data set of segments from 360 images; its size is around 500 GB. But it failed to process in the segmentclass command.

The parameter file:
Image input stack                        = /media/mybook/fibrils-grid3/segment_01_Jun_2015_13_00_54_4730/protein_stack.hdf
Class average stack                      = averages.hdf
Reference image option                   = False
Image reference stack                    = 
Spring database option                   = False
spring.db file                           = 
Pixel size in Angstrom                   = 0.883
Estimated helix width and height in Angstrom = (300, 600)
Number of classes                        = 50
Number of iterations                     = 0
Keep intermediate files                  = False
Limit in-plane rotation                  = True
Delta in-plane rotation angle            = 50.0
X and Y translation range in Angstrom    = (50, 50)
High-pass filter option                  = False
Low-pass filter option                   = False
High and low-pass filter cutoffs in 1/Angstrom = (0.0, 0.0)
Binning option                           = False
Binning factor                           = 1
MPI option                               = True
Number of CPUs                           = 16
Temporary directory                      = /tmp

The report.log:

###############
INFO:CPU0:startlog:progress state: 1 %  [>                                        ]
INFO:CPU0:classify:

INFO:CPU0:classify:progress state: 10 %  [====>                                    ]
INFO:CPU0:sx_kmeans:

INFO:think:/opt/spring_v0-83-1449/parts/openmpi/bin/mpirun -np 8 /opt/spring_v0-83-1449/bin/springenv /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py protein_stack.hdf sxk_means00 rectmask.hdf --K=50 --maxit=500 --rand_seed=-1 --crit=D --MPI
        logged on Mon, 08 Jun 2015 19:34:09
INFO:CPU0:sx_kmeans:springenv /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py protein_stack.hdf sxk_means00 rectmask.hdf --K=50 --maxit=500 --rand_seed=-1 --crit=D --MPI
        logged on Mon, 08 Jun 2015 19:48:48
Traceback (most recent call last):
  File "/opt/spring_v0-83-1449/bin/segmentclass", line 137, in <module>
    sys.exit(spring.segment2d.segmentclass.main())
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/segment2d/segmentclass.py", line 732, in main
    stack.classify()
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/segment2d/segmentclass.py", line 700, in classify
    avgstack, varstack = self.sx_kmeans(aligned_stack, self.maskfile, self.noclasses)
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/segment2d/segmentclass.py", line 451, in sx_kmeans
    external_kmeans_run.check_expected_output_file(program_to_be_launched, avgstack)
  File "/opt/spring_v0-83-1449/lib/python2.7/site-packages/emspring-0.83.1449-py2.7.egg/spring/csinfrastr/csproductivity.py", line 543, in check_expected_output_file
    raise IOError(error_message)
IOError: /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py did not finish successfully. The output file sxk_means00/averages.hdf was not found, please check logfile of /opt/spring_v0-83-1449/parts/EMAN2/bin/sxk_means.py for details.

And the logfile:
****************************************************************************************************
                    Beginning of the program k-means: Mon, 08 Jun 2015 19:34:13
****************************************************************************************************

that is all.

I checked the segmentclass_XXX folder and found that SPRING first copies the file inside it and then processes it. I guess there is a problem here.

Anyway, could you suggest how to process such a huge data set?

Currently I have split it into smaller data sets and process them one by one, but I still hope I can process it all together.
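For illustration, the splitting step could be sketched generically as below. This is a minimal NumPy sketch on a small placeholder array; a real EMAN2 .hdf segment stack would need to be read with the EMAN2 Python API or split with e2proc2d.py instead, so all names and sizes here are hypothetical:

```python
import numpy as np

# Hypothetical stand-in for the segment stack, with the same
# (n_segments, height, width) layout as a 2D image stack.
stack = np.zeros((1000, 64, 64), dtype=np.float32)

# Split along the segment axis into 4 roughly equal sub-stacks, so each
# sub-stack can be classified independently and the results combined later.
chunks = np.array_split(stack, 4, axis=0)
```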

thanks

yaowang



spring --version && springenv e2version.py
Spring environment loaded.
/opt/spring_v0-83-1449/lib/python2.7/site-packages/setuptools-15.2-py2.7.egg/pkg_resources/__init__.py:1250: UserWarning: /home/think/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
GUI from package Emspring-0.83.1449
Spring environment loaded.
/opt/spring_v0-83-1449/lib/python2.7/site-packages/setuptools-15.2-py2.7.egg/pkg_resources/__init__.py:1250: UserWarning: /home/think/.python-eggs is writable by group/others and vulnerable to attack when used with get_resource_filename. Consider a more secure location (set with .set_extraction_path or the PYTHON_EGG_CACHE environment variable).
EMAN 2.1 alpha2 (CVS 2013/08/07 17:01:09)
Your EMAN2 is running on:  Ubuntu 14.04.2 LTS 3.13.0-52-generic x86_64
Your Python version is:  2.7.2

Carsten Sachse

Jun 9, 2015, 2:51:02 AM
to emsp...@googlegroups.com, liyaow...@gmail.com
Dear Yaowang,

This looks like a problem with SPARX' k_means.py clustering that SPRING calls. You can check that logfile in one of the subfolders for further details.

Generally, I would recommend turning on binning. This speeds things up and gets around memory issues with large data sets. In addition, it acts like a low-pass filter.
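What binning does can be sketched as block averaging. This is a minimal NumPy sketch, not SPRING's actual implementation: each k x k block of pixels is replaced by its mean, which shrinks the data by k**2 and suppresses high spatial frequencies (the low-pass effect mentioned above):

```python
import numpy as np

def bin_image(img, k=2):
    """k x k binning by block averaging.

    Reduces the pixel count by k**2 and acts as a low-pass filter,
    since fine (high-frequency) detail is averaged away.
    """
    h, w = img.shape
    img = img[: h // k * k, : w // k * k]  # crop to a multiple of k
    return img.reshape(h // k, k, w // k, k).mean(axis=(1, 3))
```

For example, a 4x4 image binned with k=2 becomes 2x2, i.e. a quarter of the data.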

Best wishes,


Carsten

Yaowang Li

Jun 9, 2015, 4:45:01 AM
to emsp...@googlegroups.com, liyaow...@gmail.com
Dear Carsten,

I agree that using the binning option is a good choice.
I checked the logfile (logfile_08_Jun_2015_19_34_12) and posted it here; it seems k-means tried to start but stopped immediately.

****************************************************************************************************
                    Beginning of the program k-means: Mon, 08 Jun 2015 19:34:13
****************************************************************************************************

Just out of personal curiosity, why does the program have to copy the original file before processing?
Using my data as an example: it is 500 GB, and it is then copied into another subfolder for clustering, so in the end 1 TB is used. It is also very slow. I have to say, though, that data sets are normally much smaller than my current one.
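The disk-usage arithmetic can be made explicit. This is back-of-the-envelope only, using the 500 GB figure from this thread and a hypothetical binning factor of 2 (the k**2 scaling assumes binning in both image dimensions):

```python
# Back-of-the-envelope disk usage for the working copy that SPRING makes.
stack_gb = 500                  # original segment stack (from this thread)
k = 2                           # hypothetical binning factor

binned_gb = stack_gb / k**2     # 2x2 binning shrinks the stack by a factor of 4
peak_gb = 2 * binned_gb         # original + working copy of the binned stack
```

With these numbers the binned stack is 125 GB and the peak usage 250 GB, compared to the 1 TB needed when copying the unbinned stack.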

Carsten Sachse

Jun 9, 2015, 4:59:45 AM
to emsp...@googlegroups.com, liyaow...@gmail.com
Hi Yaowang,

The output of the logfile suggests some sort of memory problem. Again, binning may resolve it. I have never worked with data sets as large as that. The file copy is required because the multiple parallel CPU operations need access to the file. It is a precautionary measure to avoid crashes from other programs or parallel segmentclass runs that would interfere with your run.

Best wishes,


Carsten 