'parallel_stereo' error when using a cluster


yj

Jul 29, 2022, 2:03:00 PM
to Ames Stereo Pipeline Support

Hello,

I am pretty new to using ASP on a cluster, and I want to create a DEM from two WorldView-2 images.

I ran bundle_adjust first, then mapproject, and then got an error when generating the DEM.

I submit the job on the cluster with this command:

sbatch -p fat job8.sh

 job8.sh:

------------------------

#!/bin/bash
#SBATCH -N 1
#SBATCH -n 40

parallel_stereo --alignment-method none \
    --stereo-algorithm 1 \
    --sgm-collar-size 512 \
    --corr-tile-size 1024 \
    --corr-memory-limit-mb 1300 \
    --threads-multiprocess 40 \
    --threads-singleprocess 20 \
    --cost-mode 4 --corr-kernel 7 7 \
    --subpixel-mode 7 \
    p1/left_mapped_ba.tif p1/right_mapped_ba.tif \
    --bundle-adjust-prefix p1/run_ba/run \
    p1/run_pc/out dem-adj.tif

---------------------------

ASP fails to run, and I get a slurm-65608.out file.

In the file, I found that the main errors are:

 Warning! Your current config file enables debug logging. This will be slow.

………………..

Error: /usr/bin/time -f "stereo_corr: elapsed=%E ([hours:]minutes:seconds), memory=%M (kb)"  /public/home/102014/.conda/envs/asp/bin/stereo_corr --alignment-method none --stereo-algorithm 1 --sgm-collar-size 0 --corr-tile-size 2048 --corr-memory-limit-mb 1300 --cost-mode 4 --corr-kernel 7 7 --subpixel-mode 7 p1/left_mapped_ba.tif p1/right_mapped_ba.tif --bundle-adjust-prefix p1/run_ba/run p1/run_pc/out-9216_0_1024_1024/9216_0_1024_1024 dem-adj.tif --skip-low-res-disparity-comp --corr-seed-mode 1 --stereo-file ./stereo.default --threads 40 --trans-crop-win 8704 0 2048 1536: [Errno 2] No such file or directory: '/usr/bin/time': '/usr/bin/time'

 ………………..

Traceback (most recent call last):
  File "/public/home/102014/.conda/envs/asp/bin/parallel_stereo", line 1017, in <module>
    spawn_to_nodes(step, settings, parallel_args)
  File "/public/home/102014/.conda/envs/asp/bin/parallel_stereo", line 526, in spawn_to_nodes
    asp_system_utils.generic_run(cmd, opt.verbose)
  File "/public/home/102014/.conda/envs/asp/libexec/asp_system_utils.py", line 486, in generic_run
    raise Exception('Failed to run: ' + cmd_str)

Exception: Failed to run: parallel --will-cite --env ASP_DEPS_DIR --env PATH --env LD_LIBRARY_PATH --env ASP_LIBRARY_PATH --env PYTHONHOME -u -P 80 -a /public/home/102014/wv/tmpikxhuoax "/public/home/102014/.conda/envs/asp/bin/python /public/home/102014/.conda/envs/asp/bin/parallel_stereo --alignment-method none --stereo-algorithm 1 --sgm-collar-size 512 --corr-tile-size 1024 --corr-memory-limit-mb 1300 --threads-singleprocess 20 --cost-mode 4 --corr-kernel 7 7 --subpixel-mode 7 p1/left_mapped_ba.tif p1/right_mapped_ba.tif --bundle-adjust-prefix p1/run_ba/run p1/run_pc/out dem-adj.tif --skip-low-res-disparity-comp --processes 80 --threads-multiprocess 40 --entry-point 1 --stop-point 2 --work-dir /public/home/102014/wv --tile-id {}"

--------------------------------------------

I would like to know how to fix this error. How should I set things up to take advantage of the computing cluster?

I uploaded the four files I set up: .vwrc, stereo.default, job8.sh, slurm-65608.out

If anyone has insight into how to solve this issue, or suggestions for improving my overall workflow, I would be very thankful.

 John

 

--------------------------------------------

 

Notes:

Fat node configuration:

 Machine model: H3C UniServer R6700 G3

 4* C6248R (3.0GHz/24 core/35.75MB/205W) CPU processor;

 48* 32GB 2Rx4 DDR4-2933P-R memory module (FIO);

 2* 1.92TB 6G SATA 2.5in RI 5300PRO SSD Universal Hard Disk Module (CMCTO);

 OS: CentOS 7.6

slurm-65608.out
stereo.default
job8.sh
.vwrc

Oleg Alexandrov

Aug 2, 2022, 12:21:15 PM
to Ames Stereo Pipeline Support
Your email went to my spam folder for some reason, and I could not see it in time. 

I am not fully sure. The error I see says: "No such file or directory: '/usr/bin/time': '/usr/bin/time'". Can you check if you have the /usr/bin/time program? You can also try to run without that. You can edit your file /public/home/102014/.conda/envs/asp/bin/parallel_stereo and delete the two lines from here: https://github.com/NeoGeographyToolkit/StereoPipeline/blob/master/src/asp/Tools/parallel_stereo#L575, so, the lines which say:


if 'linux' in sys.platform:
    timeCmd = ['/usr/bin/time', '-f', prog + \
               ': elapsed=%E ([hours:]minutes:seconds), memory=%M (kb)']

Note that for parallel_stereo you also need to specify --nodes-list, to point to the list of computing nodes your job has access to. I don't know how to do that for slurm. Without it, it will use just the node the program started on.

BTW, you are also getting a warning about using debug logging which may be slow.  To make that go away you can edit your .vwrc and replace:

30 = stereo

with

20 = stereo

(the text in there will explain what these numbers mean).

Let me know how it goes. If /usr/bin/time is the problem, I will put in a fix.

Oleg Alexandrov

Aug 3, 2022, 12:48:27 PM
to Ames Stereo Pipeline Support
I added a robustness fix for parallel_stereo. It checks if /usr/bin/time exists before using it. This time command is only used to gather performance stats, which is not strictly necessary, so it is prudent to not run it if missing. 
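
A minimal sketch of the kind of check described above. This is not the actual parallel_stereo code: the function name build_timed_command and its arguments are hypothetical, and only the /usr/bin/time path and the format string are taken from the snippet quoted earlier in the thread.

```python
import os
import sys


def build_timed_command(prog, args, time_path='/usr/bin/time'):
    """Return the command list, prefixed with the time utility if it is usable.

    Hypothetical helper for illustration; the timing wrapper only gathers
    performance stats, so the command runs unwrapped when it is missing.
    """
    cmd = [prog] + list(args)
    have_time = os.path.isfile(time_path) and os.access(time_path, os.X_OK)
    if 'linux' in sys.platform and have_time:
        fmt = prog + ': elapsed=%E ([hours:]minutes:seconds), memory=%M (kb)'
        return [time_path, '-f', fmt] + cmd
    return cmd
```

With a missing time binary, build_timed_command('stereo_corr', ['--threads', '40'], time_path='/no/such/time') returns the plain command, so the job would no longer fail with "[Errno 2] No such file or directory".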

yj

Aug 3, 2022, 4:20:22 PM
to Ames Stereo Pipeline Support

Oleg,

Hi, thank you for your suggestion. The previous problems may have been caused by my using the wrong Slurm command. There are no errors now.

I now run the parallel_stereo command and get the out-PC.tif file.

But then the point2dem command is forced to quit without any message.

The main commands I use are as follows (ignore the slurm system command):

1. parallel_stereo 42233.tif 42129.tif 42233.xml 42129.xml dg3/out --threads-multiprocess 8 --threads-singleprocess 8 --session-type rpc

2. point2dem out-PC.tif -o dem/out --errorimage --tr 1.0

 1.png

I checked the series of folders generated by the parallel_stereo command, and each folder seems to have the correct *pc files inside.

Also, I found that "out-log-stereo_corr-08-03-2220-104744.txt" had an error saying "[ fileio ] : Error: GdalIO: dg3/out-D_sub.tif: No such file or directory (code = 4)". This error is weird, since I checked that the file is indeed at that path.

The relevant documents are attached.

Thank you for your patience and any suggestions are welcome.

Best

John



slurm-66311.out
out-log-stereo_corr-08-03-2220-104744.txt

Oleg Alexandrov

Aug 3, 2022, 4:31:30 PM
to Ames Stereo Pipeline Support
The error you see:

Error: GdalIO: dg3/out-D_sub.tif:

is harmless. It complains right before it tries to compute this file. As I remember, and as I see in the code, I have in the meantime tweaked the code to not print this, as it is indeed confusing. I wonder if you see this with ASP 3.1.0. If you do, I will take a second look.

I don't know what to say about your point2dem error. It looks like it actually completes the job, based on your screenshot. I don't know why it is forced to terminate after that.

yj

Aug 5, 2022, 10:26:11 AM
to Ames Stereo Pipeline Support

I upgraded the software version with the command "conda install stereo-pipeline==3.1.0", but the upgraded version still shows "ASP 3.0.1-alpha"?

I re-experimented a few times, and the "Error: /usr/bin/time -f" and "Error: GdalIO: dg3/out-D_sub.tif" both disappeared.

This fix works well, and those errors don't appear anymore.

I used the "parallel_stereo" command and the whole process generated the "out-pc.tif" file without any error or warning messages. I only use one node of the cluster, so I can avoid using the "--nodata-value" parameter.

Next, I used the "point2dem" command and got the "out-DEM.tif" file, but it was only 1kb, which is an error result. There were no errors or warning messages throughout the process.

I checked the log file generated by "point2dem" and there was no error there either. (See attachment.)

I repeated the experiment several times and got the same result. When "point2dem" runs, the steps "Statistics, Bounding box, and triangulation error range estimation, QuadTree" all look normal. Only the last step is different: "Writing: dg2/dem/out-DEM.tif" and "Writing: dg2/dem/out-IntersectionErr.tif" take a very long time, and some error seems to cause the program to get stuck.

I'm guessing this could be because the cluster uses Slurm while ASP uses PBS parameters. The two systems express the same settings with different parameters, as shown in the figure below.

1.jpg
By the way, the images I use are WV-2 stereo pairs, and each image is about 1.3G. After running parallel_stereo I get the out-PC.tif file, and the whole folder becomes very large in the process, 17G. I think this might be because the intermediate *.TIF files are stored in "Float32" format, which takes up disk space and slows the programs down. For satellite images, "Int16" or "Int8" is actually enough, but "parallel_stereo" does not provide an option to select the image bit depth.
Any suggestions? Thanks for your reply.
slurm-66420.out
slurm-66422.out
out-log-point2dem-08-05-1406-103396.txt

Oleg Alexandrov

Aug 5, 2022, 12:45:07 PM
to yj, Ames Stereo Pipeline Support

I upgraded the software version with the command "conda install stereo-pipeline==3.1.0", but the upgraded version still shows "ASP 3.0.1-alpha"?

That was likely an oversight on my side. In the meantime that string got updated. For the bleeding edge version of ASP from https://github.com/NeoGeographyToolkit/StereoPipeline/releases it now prints ASP 3.1.1-alpha (because we are after 3.1.0).  

I re-experimented a few times, and the "Error: /usr/bin/time -f" and "Error: GdalIO: dg3/out-D_sub.tif" both disappeared.


Very good.
 

I used the "parallel_stereo" command and the whole process generated the "out-pc.tif" file without any error or warning messages. I only use one node of the cluster, so I can avoid using the "--nodata-value" parameter.

See the example I added for SLURM here, https://stereopipeline.readthedocs.io/en/latest/examples.html#using-pbs-and-slurm. It shows how to set --nodes-list, if that's what you are referring to above.
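
As a rough illustration of what preparing such a node list involves (the linked documentation is the authoritative example), here is a Python sketch. It assumes the Slurm "scontrol show hostnames" command is on the PATH inside the job, that SLURM_JOB_NODELIST is set by the scheduler, and that --nodes-list accepts a text file with one hostname per line. The helper names are hypothetical.

```python
import os
import subprocess
import tempfile


def expand_slurm_nodes(nodelist):
    """Expand a compact Slurm node list (e.g. 'node[01-04]') into hostnames.

    Assumes 'scontrol' is available, which is only true on a Slurm cluster.
    """
    result = subprocess.run(['scontrol', 'show', 'hostnames', nodelist],
                            check=True, capture_output=True, text=True)
    return result.stdout.split()


def write_nodes_list(hostnames, directory='.'):
    """Write one hostname per line to a new file and return its path."""
    fd, path = tempfile.mkstemp(prefix='nodes_', suffix='.txt',
                                dir=directory, text=True)
    with os.fdopen(fd, 'w') as handle:
        for host in hostnames:
            handle.write(host + '\n')
    return path
```

Inside a job script one would call write_nodes_list(expand_slurm_nodes(os.environ['SLURM_JOB_NODELIST'])) and pass the returned path to parallel_stereo via --nodes-list. The file should live on a file system visible to all nodes.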

Next, I used the "point2dem" command and got the "out-DEM.tif" file, but it was only 1kb, which is an error result. There were no errors or warning messages throughout the process.

That is because you used the command:

point2dem dg2/out-PC.tif -o dg2/dem/out --errorimage --tr 1.0

The default point2dem projection is in degrees, and you set the grid size to 1 degree, which is 100000 m or so. You either need to use a projection in meters, and then --tr is in meters too, or you continue to use the projection in degrees, but then set --tr 0.001 or so. Better even, don't set --tr at all, and let the software find it for you. I now added a mention of how to use --tr here, https://stereopipeline.readthedocs.io/en/latest/tools/point2dem.html.
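
To make the units issue concrete, here is a small arithmetic sketch. It assumes one degree of latitude is about 111,320 m (a common spherical approximation, not a value from ASP), and the function names are made up for illustration.

```python
import math

# Approximate length of one degree of latitude in meters (an assumption).
METERS_PER_DEGREE = 111320.0


def degrees_to_meters(grid_deg):
    """Approximate north-south ground distance covered by a grid size in degrees."""
    return grid_deg * METERS_PER_DEGREE


def meters_to_degrees(grid_m, latitude_deg=0.0):
    """Return (lat_step, lon_step) in degrees for a grid size in meters.

    A degree of longitude shrinks with the cosine of the latitude,
    so the longitude step grows away from the equator.
    """
    lat_step = grid_m / METERS_PER_DEGREE
    lon_step = grid_m / (METERS_PER_DEGREE * math.cos(math.radians(latitude_deg)))
    return lat_step, lon_step
```

Under this approximation, a grid size of 1 degree covers roughly 111 km, which is why the DEM collapsed to a tiny file, and 0.001 degrees is on the order of 100 m. A 1 m grid corresponds to roughly 0.000009 degrees of latitude, which is one reason a projection in meters is more convenient.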
 
By the way, the images I use are WV-2 stereo pairs, and each image is about 1.3G. After running parallel_stereo I get the out-PC.tif file, and the whole folder becomes very large in the process, 17G. I think this might be because the intermediate *.TIF files are stored in "Float32" format, which takes up disk space and slows the programs down. For satellite images, "Int16" or "Int8" is actually enough, but "parallel_stereo" does not provide an option to select the image bit depth.

It is enough to have int16 for images, but it is not enough for a point cloud, like PC.tif, because then, if an xyz point was given with int values, every point would be rounded to 1 meter, which is not good enough.
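
The rounding effect can be seen with a toy example. The heights below are invented values in meters, not data from this run, and the helper name is hypothetical.

```python
def quantization_error(value_m):
    """Absolute error introduced by rounding a coordinate to a whole meter."""
    return abs(value_m - round(value_m))


# Made-up sample heights in meters, for illustration only.
heights = [1523.27, 1523.49, 1524.61]
errors = [quantization_error(h) for h in heights]
worst = max(errors)  # can approach 0.5 m per coordinate
```

Errors of this size are comparable to or larger than the pixel size of the input imagery, so integer storage would discard most of the sub-meter detail in the point cloud.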

After your run finishes, you can delete everything except the DEM. Also note the new parallel_stereo option named "--keep-only", which will wipe most files for you automatically (this is in the latest build, at the link earlier). Here's the doc explaining that new option. (https://stereopipeline.readthedocs.io/en/latest/tools/parallel_stereo.html)

yj

Aug 6, 2022, 11:12:59 AM
to Ames Stereo Pipeline Support
| See the example I added for SLURM here, https://stereopipeline.readthedocs.io/en/latest/examples.html#using-pbs-and-slurm. It shows how to set --nodes-list, if that's what you are referring to above.
Well done!
| The default point2dem projection is in degrees, and you set the grid size to 1 degree, which is 100000 m or so. You either need to use a projection in meters, and then --tr is in meters too, or you continue to use the projection in degrees, but then set --tr 0.001 or so. Better even, don't set --tr at all, and let the software find it for you. I now added a mention of how to use --tr here, https://stereopipeline.readthedocs.io/en/latest/tools/point2dem.html.
Yes, you are right. I forgot that my images were not mapprojected. I reran "point2dem" without --tr. The result is below.
| It is enough to have int16 for images, but it is not enough for a point cloud, like PC.tif, because then, if an xyz point was given with int values, every point would be rounded to 1 meter, which is not good enough.
| After your run finishes, you can delete everything except the DEM. Also note the new parallel_stereo option named "--keep-only", which will wipe most files for you automatically (this is in the latest build, at the link earlier). Here's the doc explaining that new option. (https://stereopipeline.readthedocs.io/en/latest/tools/parallel_stereo.html)
I understand. Thank you for your detailed explanation and the update.

1659797517607.png       1659798169434.png
I used "dem_mosaic --hole-fill-length 300" to fill the holes, but it had almost no effect. The picture on the left is the filled result, and the one on the right is the original.
Maybe I need to adjust the method of stereo matching, or some previous step?

Oleg Alexandrov

Aug 6, 2022, 12:24:15 PM
to yj, Ames Stereo Pipeline Support


1659797517607.png       1659798169434.png
I used "dem_mosaic --hole-fill-length 300" to fill the holes, but it had almost no effect. The picture on the left is the filled result, and the one on the right is the original.
Maybe I need to adjust the method of stereo matching, or some previous step?

It is hard for me to tell exactly what is going on with the holes. You may want to zoom in.

You could try doubling the grid size passed to --tr, relative to the autocomputed one, which can be found with gdalinfo. Or, you can increase --search-radius-factor to maybe 2 in point2dem (see its manual page).

For such mountainous terrain we strongly suggest mapprojecting these images either onto a lower-resolution version of the DEM obtained from this point cloud (so with an even bigger --tr, for the moment, for the new DEM), or onto another DEM. Also consider using --stereo-algorithm asp_mgm. See here for more details: https://stereopipeline.readthedocs.io/en/latest/next_steps.html#mapproj-example.
