Building ParFlow on the TACC supercomputer


Mirna Kassem

Jan 27, 2025, 3:45:16 PM
to ParFlow

Hello,

I am having difficulties running ParFlow on the TACC supercomputer.

I am able to successfully run ParFlow on my local computer (using Docker) with 1 processor (topology of 1 1 1), or even several processors (e.g., topology of 2 2 1).

When I built ParFlow on the TACC supercomputer, I was only able to run it successfully with 1 processor (topology of 1 1 1). Whenever I try to use more than 1 processor (e.g., topology of 2 2 1), I get this error:

cp: cannot stat '/home1/09949/kassem_mirna/Parflow-TACC/drv_*': No such file or directory

START
SET_CONSOLE
child process exited abnormally
    while executing
"exec sh /home1/09949/kassem_mirna/parflow/bin/run LW_var_dz_spinup 4"
    ("eval" body line 1)
    invoked from within
"eval exec sh $Parflow::PARFLOW_DIR/bin/run  $runname $NumProcs"
    invoked from within
"if [pfexists Process.Command] {
                    set command [pfget Process.Command]
                    puts [format "Using command : %s" [format $command $NumProcs $runname]]
                    puts [e..."
    (procedure "pfrun" line 53)
    invoked from within
"pfrun $runname"

 

I believe the issue is how I am building ParFlow on the TACC supercomputer.

This is the GitHub link with the ParFlow build file and the job submission file:

ParFlow-TACC/submit_parflow.sh at main · cdelcastillo21/ParFlow-TACC · GitHub

I would appreciate any guidance on how to resolve this issue.



For reference, this is the .sh file I use to submit the ParFlow job on the TACC supercomputer:

#!/bin/bash

# Function to print usage
print_usage() {
    echo "Usage: $(basename $0) INPUT_DIR [OPTIONS]"
    echo "Submit ParFlow simulation jobs"
    echo ""
    echo "Required arguments:"
    echo "  INPUT_DIR                      Directory containing ParFlow simulation input files"
    echo ""
    echo "Optional arguments:"
    echo "  -p, --parflow-dir DIR         ParFlow installation directory (default: \$HOME/parflow)"
    echo "  -r, --root-dir DIR            Root directory for job execution (default: \$SCRATCH)"
    echo "  -s, --stage-only              Only stage the job, don't submit to SLURM"
    echo ""
    echo "SLURM job options:"
    echo "  -j, --job-name NAME           Job name (default: derived from INPUT_DIR)"
    echo "  -N, --nodes N                 Number of nodes (default: 1)"
    echo "  -n, --tasks N                 Number of MPI tasks (default: 4)"
    echo "  -t, --time HH:MM:SS           Wall time limit (default: 00:30:00)"
    echo "  -q, --queue QUEUE             SLURM partition/queue (default: skx-dev)"
    echo "  -A, --account ACCOUNT         SLURM account/allocation"
    echo "  -m, --mail-address EMAIL      Email address for job notifications"
    echo ""
    echo "  -h, --help                    Show this help message"
}

# Function to validate directory exists
validate_dir() {
    if [ ! -d "$1" ]; then
        echo "Error: Directory $1 does not exist"
        exit 1
    fi
}

# Function to get base directory name
get_base_dirname() {
    # Remove trailing slash if present and get base name
    basename "$(echo "$1" | sed 's:/*$::')"
}

# Function to generate timestamp
get_timestamp() {
    date +"%Y%m%d_%H%M%S"
}

# Parse command line arguments
INPUT_DIR=""
PARFLOW_DIR="$HOME/parflow"
ROOT_DIR="$SCRATCH"
STAGE_ONLY=false

# SLURM defaults
NODES=1
TASKS=4
TIME="00:30:00"
QUEUE="skx-dev"
ACCOUNT=""
MAIL_ADDRESS=""
JOB_NAME=""

# Check if no arguments provided
if [ $# -eq 0 ]; then
    print_usage
    exit 1
fi

# Parse positional and optional arguments
INPUT_DIR="$1"
shift

while [[ $# -gt 0 ]]; do
    case $1 in
        -p|--parflow-dir)
            PARFLOW_DIR="$2"
            shift 2
            ;;
        -r|--root-dir)
            ROOT_DIR="$2"
            shift 2
            ;;
        -s|--stage-only)
            STAGE_ONLY=true
            shift
            ;;
        -j|--job-name)
            JOB_NAME="$2"
            shift 2
            ;;
        -N|--nodes)
            NODES="$2"
            shift 2
            ;;
        -n|--tasks)
            TASKS="$2"
            shift 2
            ;;
        -t|--time)
            TIME="$2"
            shift 2
            ;;
        -q|--queue)
            QUEUE="$2"
            shift 2
            ;;
        -A|--account)
            ACCOUNT="$2"
            shift 2
            ;;
        -m|--mail-address)
            MAIL_ADDRESS="$2"
            shift 2
            ;;
        -h|--help)
            print_usage
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            print_usage
            exit 1
            ;;
    esac
done

# Validate input directories
validate_dir "$INPUT_DIR"
validate_dir "$PARFLOW_DIR"
validate_dir "$ROOT_DIR"

# Set job name if not provided
if [ -z "$JOB_NAME" ]; then
    JOB_NAME="ParFlow_$(get_base_dirname "$INPUT_DIR")"
fi

# Create timestamped job directory
TIMESTAMP=$(get_timestamp)
RUN_NAME="${JOB_NAME/ParFlow_/}"  # Remove ParFlow_ prefix if present
JOB_DIR="${ROOT_DIR}/${RUN_NAME}_${TIMESTAMP}"
mkdir -p "$JOB_DIR"

# Create SLURM submit script with conditional sections
{
    cat << EOL
#!/bin/bash
#SBATCH -J ${JOB_NAME}      # Job name
#SBATCH -o ${JOB_NAME}.o%j  # Name of stdout output file
#SBATCH -e ${JOB_NAME}.e%j  # Name of stderr error file
EOL

    # Add conditional SLURM directives
    if [ -n "$ACCOUNT" ]; then
        echo "#SBATCH -A $ACCOUNT             # Allocation"
    fi

    cat << EOL
#SBATCH -p $QUEUE           # Queue (partition) name
#SBATCH -N $NODES          # Total # of nodes
#SBATCH -n $TASKS          # Total # of mpi tasks
#SBATCH -t $TIME          # Run time (hh:mm:ss)
EOL

    # Add mail options if specified
    if [ -n "$MAIL_ADDRESS" ]; then
        echo "#SBATCH --mail-user=$MAIL_ADDRESS"
        echo "#SBATCH --mail-type=all    # Send email at begin and end of job"
    fi

    cat << EOL

# Load required modules
module purge
module load intel/24.0
module load impi/21.11
module load cmake/3.28.1
module load hypre/2.30.0
module load silo/git2024
module load hdf5/1.14.4

# Set ParFlow directory
export PARFLOW_DIR=${PARFLOW_DIR}
# Add ParFlow binary directory to path
export PATH=\$PARFLOW_DIR/bin:\$PATH

# Set working directory (already created by submit script)
WORK_DIR=${JOB_DIR}
cd \$WORK_DIR

# Copy input files from source directory
cp ${INPUT_DIR}/*.pfb .
cp ${INPUT_DIR}/drv_* .
cp ${INPUT_DIR}/*.tcl .

# Run ParFlow using TCL script
tclsh LW_NetCDF_Test.tcl

# Copy results back to submission directory
cp -r * \$SLURM_SUBMIT_DIR/
EOL
} > "${JOB_DIR}/submit.sh"

# Make submit script executable
chmod +x "${JOB_DIR}/submit.sh"

# If not stage-only, submit the job
if [ "$STAGE_ONLY" = false ]; then
    cd "$JOB_DIR"
    # Capture the entire sbatch output and get the last line, then extract the last word
    SUBMITTED_JOB=$(sbatch submit.sh | tail -n1 | awk '{print $NF}')

    # Verify we got a numeric job ID
    if [[ ! "$SUBMITTED_JOB" =~ ^[0-9]+$ ]]; then
        echo "Error: Failed to get valid job ID from submission"
        exit 1
    fi

    echo "Job submitted in directory: $JOB_DIR"
    echo "Job name: $JOB_NAME"
    echo "Job ID: $SUBMITTED_JOB"
    echo ""
    echo "To monitor this job:"
    echo "  Single status check:"
    echo "    ./parflow_status.sh -j $SUBMITTED_JOB"
    echo ""
    echo "  Continuous monitoring (updates every 10 seconds):"
    echo "    ./monitor_parflow.sh -- -j $SUBMITTED_JOB"
    echo ""
    echo "  Continuous monitoring with custom interval (e.g., 5 seconds):"
    echo "    ./monitor_parflow.sh -i 5 -- -j $SUBMITTED_JOB"
    echo ""
    echo "  Monitor with detailed output (10 lines of log files):"
    echo "    ./monitor_parflow.sh -- -j $SUBMITTED_JOB -n 10"
else
    echo "Job staged in directory: $JOB_DIR"
    echo "Job name: $JOB_NAME"
    echo ""
    echo "To submit this job:"
    echo "  cd $JOB_DIR"
    echo "  sbatch submit.sh"
fi
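For example, I invoke it like this (the input directory and allocation name here are placeholders, not my actual values):

./submit_parflow.sh $WORK/LW_inputs -N 1 -n 4 -t 00:30:00 -q skx-dev -A MY_ALLOCATION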

Steven Smith

Jan 27, 2025, 4:23:56 PM
to ParFlow
Hi,

1) One of the errors is occurring before the parflow executable is launched:


      cp: cannot stat '/home1/09949/kassem_mirna/Parflow-TACC/drv_*': No such file or directory

The drv_* files are not being copied to the work directory in the shell script. The fact that the literal pattern drv_* shows up in the error message means the glob matched no files in that directory.
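One way to catch this early, as a minimal sketch using the INPUT_DIR variable from your submit script, is to test the glob before copying:

      # Fail early if no CLM driver files match, instead of letting cp
      # receive the literal drv_* pattern
      if ! compgen -G "${INPUT_DIR}/drv_*" > /dev/null; then
          echo "Error: no drv_* files found in ${INPUT_DIR}" >&2
          exit 1
      fi
      cp "${INPUT_DIR}"/drv_* .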

2) Is there any output in the <runname>.out.txt and <runname>.out.log files? These files sometimes contain helpful output and error messages.

3) Is the TCL script you are running the one in the parflow/tests/ directory? I'm looking at the tclsh lines in the script:


      # Run ParFlow using TCL script
      tclsh LW_NetCDF_Test.tcl

If so, there are some file copies going on in the test script as well. The LW_NetCDF_Test.tcl test script assumes some files will be in the ../../parflow_input and ../../clm_input directories relative to where the test is running, which is "Outputs" in the original script.

-------------------
# ParFlow Inputs
file copy -force "../../parflow_input/slopes.nc" .
file copy -force "../../parflow_input/IndicatorFile_Gleeson.50z.pfb" .
file copy -force "../../parflow_input/press.init.nc" .

# CLM Inputs
file copy -force "../../clm_input/drv_clmin.dat" .
file copy -force "../../clm_input/drv_vegp.dat" .
file copy -force "../../clm_input/drv_vegm.alluv.dat" .
file copy -force "../../clm_input/metForcing.nc" .

puts "Files Copied"
-------------------

Input files not being in the expected locations is a common issue, and we don't always print the greatest error messages from ParFlow.
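If it helps, a small guard near the top of your TCL script can make this kind of failure explicit. A minimal sketch, using the driver file names from the test script above:

foreach f {drv_clmin.dat drv_vegp.dat drv_vegm.alluv.dat} {
    if {![file exists $f]} {
        puts "Missing required input file: $f"
        exit 1
    }
}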

Steve

Mirna Kassem

Jan 27, 2025, 10:14:09 PM
to ParFlow
Hi Steven,

Thanks for the response!

1- How do I make sure the drv_* files are copied to the work directory? I never see those drv_* files. Are they generated temporarily and then deleted?
We have this line in the .sh file: "cp ${INPUT_DIR}/drv_* .". Does this not copy the drv_* files to the work directory?

2- I attach here additional output files that may help identify the problem.

3- The TCL script I am running is not the same as the one in the parflow/tests/ directory; it just happens to have the same name, "LW_NetCDF_Test". That is why I was not copying the files you referred to. I also attach the TCL script here.

Thank you in advance for all the help!
Best,
Mirna


ParFlow_kassem_mirna.e1461447
LW_var_dz_spinup.out.log
ParFlow_kassem_mirna.o1461447
LW_NetCDF_Test.tcl
LW_var_dz_spinup.out.txt

Mirna Kassem

Jan 28, 2025, 7:31:22 PM
to ParFlow
Hello Steven,

A follow-up on my previous message:

1- I was able to resolve the issue of the drv_* files not being found in the correct directory.
However, I still get the other error (in the .o file):
"START
SET_CONSOLE
child process exited abnormally
    while executing
"exec sh /home1/09949/kassem_mirna/parflow/bin/run LW_var_dz_spinup 4"
    ("eval" body line 1)
    invoked from within
"eval exec sh $Parflow::PARFLOW_DIR/bin/run  $runname $NumProcs"
    invoked from within
"if [pfexists Process.Command] {
set command [pfget Process.Command]
puts [format "Using command : %s" [format $command $NumProcs $runname]]
puts [e..."
    (procedure "pfrun" line 53)
    invoked from within
"pfrun $runname"
    (file "LW_NetCDF_Test.tcl" line 458)"

This is a further explanation of the error, as output in the .out file:
"Attach debugger to PID 4003969 and set var i = 1 to continue
Incorrect process allocation input"
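I suspect this means the Process.Topology values in my TCL script do not multiply out to the number of MPI tasks passed to pfrun (4 here). As a minimal sketch, assuming the standard Process.Topology keys, settings consistent with 4 tasks would be:

# P*Q*R must equal the number of MPI tasks (here 2*2*1 = 4)
pfset Process.Topology.P 2
pfset Process.Topology.Q 2
pfset Process.Topology.R 1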

I attach here the new error and output files as well as the .sh script used to submit the job.
Thanks in advance!
Mirna

ParFlow_kassem_mirna.o1502954
LW_var_dz_spinup.out.log
submit_parflow.sh
ParFlow_kassem_mirna.e1502954
LW_var_dz_spinup.out.txt