Building ParFlow on the TACC supercomputer


Mirna Kassem

Jan 27, 2025, 3:45:16 PM
to ParFlow

Hello,

I am having difficulties running ParFlow on the TACC supercomputer.

I am able to successfully run ParFlow on my local computer (using Docker) with 1 processor (topology of 1 1 1), or even several processors (e.g., topology of 2 2 1).

When I built ParFlow on the TACC supercomputer, I was only able to run it successfully with 1 processor (topology of 1 1 1). Whenever I try to use more than 1 processor (e.g., topology of 2 2 1), I get this error:

cp: cannot stat '/home1/09949/kassem_mirna/Parflow-TACC/drv_*': No such file or directory

START
SET_CONSOLE
child process exited abnormally
    while executing
"exec sh /home1/09949/kassem_mirna/parflow/bin/run LW_var_dz_spinup 4"
    ("eval" body line 1)
    invoked from within
"eval exec sh $Parflow::PARFLOW_DIR/bin/run  $runname $NumProcs"
    invoked from within
"if [pfexists Process.Command] {
                    set command [pfget Process.Command]
                    puts [format "Using command : %s" [format $command $NumProcs $runname]]
                    puts [e..."
    (procedure "pfrun" line 53)
    invoked from within
"pfrun $runname"

 

I believe the issue is how I am building ParFlow on the TACC supercomputer.

This is the GitHub link with the ParFlow build file and the job submission file:

ParFlow-TACC/submit_parflow.sh at main · cdelcastillo21/ParFlow-TACC · GitHub

I would appreciate any guidance on how to resolve this issue.



For reference, this is the .sh file I use to submit the ParFlow job on the TACC supercomputer:

#!/bin/bash

# Function to print usage
print_usage() {
    echo "Usage: $(basename $0) INPUT_DIR [OPTIONS]"
    echo "Submit ParFlow simulation jobs"
    echo ""
    echo "Required arguments:"
    echo "  INPUT_DIR                      Directory containing ParFlow simulation input files"
    echo ""
    echo "Optional arguments:"
    echo "  -p, --parflow-dir DIR         ParFlow installation directory (default: \$HOME/parflow)"
    echo "  -r, --root-dir DIR            Root directory for job execution (default: \$SCRATCH)"
    echo "  -s, --stage-only              Only stage the job, don't submit to SLURM"
    echo ""
    echo "SLURM job options:"
    echo "  -j, --job-name NAME           Job name (default: derived from INPUT_DIR)"
    echo "  -N, --nodes N                 Number of nodes (default: 1)"
    echo "  -n, --tasks N                 Number of MPI tasks (default: 4)"
    echo "  -t, --time HH:MM:SS           Wall time limit (default: 00:30:00)"
    echo "  -q, --queue QUEUE             SLURM partition/queue (default: skx-dev)"
    echo "  -A, --account ACCOUNT         SLURM account/allocation"
    echo "  -m, --mail-address EMAIL      Email address for job notifications"
    echo ""
    echo "  -h, --help                    Show this help message"
}

# Function to validate directory exists
validate_dir() {
    if [ ! -d "$1" ]; then
        echo "Error: Directory $1 does not exist"
        exit 1
    fi
}

# Function to get base directory name
get_base_dirname() {
    # Remove trailing slash if present and get base name
    basename "$(echo "$1" | sed 's:/*$::')"
}

# Function to generate timestamp
get_timestamp() {
    date +"%Y%m%d_%H%M%S"
}

# Parse command line arguments
INPUT_DIR=""
PARFLOW_DIR="$HOME/parflow"
ROOT_DIR="$SCRATCH"
STAGE_ONLY=false

# SLURM defaults
NODES=1
TASKS=4
TIME="00:30:00"
QUEUE="skx-dev"
ACCOUNT=""
MAIL_ADDRESS=""
JOB_NAME=""

# Check if no arguments provided
if [ $# -eq 0 ]; then
    print_usage
    exit 1
fi

# Parse positional and optional arguments
INPUT_DIR="$1"
shift

while [[ $# -gt 0 ]]; do
    case $1 in
        -p|--parflow-dir)
            PARFLOW_DIR="$2"
            shift 2
            ;;
        -r|--root-dir)
            ROOT_DIR="$2"
            shift 2
            ;;
        -s|--stage-only)
            STAGE_ONLY=true
            shift
            ;;
        -j|--job-name)
            JOB_NAME="$2"
            shift 2
            ;;
        -N|--nodes)
            NODES="$2"
            shift 2
            ;;
        -n|--tasks)
            TASKS="$2"
            shift 2
            ;;
        -t|--time)
            TIME="$2"
            shift 2
            ;;
        -q|--queue)
            QUEUE="$2"
            shift 2
            ;;
        -A|--account)
            ACCOUNT="$2"
            shift 2
            ;;
        -m|--mail-address)
            MAIL_ADDRESS="$2"
            shift 2
            ;;
        -h|--help)
            print_usage
            exit 0
            ;;
        *)
            echo "Unknown option: $1"
            print_usage
            exit 1
            ;;
    esac
done

# Validate input directories
validate_dir "$INPUT_DIR"
validate_dir "$PARFLOW_DIR"
validate_dir "$ROOT_DIR"

# Set job name if not provided
if [ -z "$JOB_NAME" ]; then
    JOB_NAME="ParFlow_$(get_base_dirname "$INPUT_DIR")"
fi

# Create timestamped job directory
TIMESTAMP=$(get_timestamp)
RUN_NAME="${JOB_NAME/ParFlow_/}"  # Remove ParFlow_ prefix if present
JOB_DIR="${ROOT_DIR}/${RUN_NAME}_${TIMESTAMP}"
mkdir -p "$JOB_DIR"

# Create SLURM submit script with conditional sections
{
    cat << EOL
#!/bin/bash
#SBATCH -J ${JOB_NAME}      # Job name
#SBATCH -o ${JOB_NAME}.o%j  # Name of stdout output file
#SBATCH -e ${JOB_NAME}.e%j  # Name of stderr error file
EOL

    # Add conditional SLURM directives
    if [ -n "$ACCOUNT" ]; then
        echo "#SBATCH -A $ACCOUNT             # Allocation"
    fi

    cat << EOL
#SBATCH -p $QUEUE           # Queue (partition) name
#SBATCH -N $NODES          # Total # of nodes
#SBATCH -n $TASKS          # Total # of mpi tasks
#SBATCH -t $TIME          # Run time (hh:mm:ss)
EOL

    # Add mail options if specified
    if [ -n "$MAIL_ADDRESS" ]; then
        echo "#SBATCH --mail-user=$MAIL_ADDRESS"
        echo "#SBATCH --mail-type=all    # Send email at begin and end of job"
    fi

    cat << EOL

# Load required modules
module purge
module load intel/24.0
module load impi/21.11
module load cmake/3.28.1
module load hypre/2.30.0
module load silo/git2024
module load hdf5/1.14.4

# Set ParFlow directory
export PARFLOW_DIR=${PARFLOW_DIR}
# Add ParFlow binary directory to path
export PATH=\$PARFLOW_DIR/bin:\$PATH

# Set working directory (already created by submit script)
WORK_DIR=${JOB_DIR}
cd \$WORK_DIR

# Copy input files from source directory
cp ${INPUT_DIR}/*.pfb .
cp ${INPUT_DIR}/drv_* .
cp ${INPUT_DIR}/*.tcl .

# Run ParFlow using TCL script
tclsh LW_NetCDF_Test.tcl

# Copy results back to submission directory
cp -r * \$SLURM_SUBMIT_DIR/
EOL
} > "${JOB_DIR}/submit.sh"

# Make submit script executable
chmod +x "${JOB_DIR}/submit.sh"

# If not stage-only, submit the job
if [ "$STAGE_ONLY" = false ]; then
    cd "$JOB_DIR"
    # Capture the entire sbatch output and get the last line, then extract the last word
    SUBMITTED_JOB=$(sbatch submit.sh | tail -n1 | awk '{print $NF}')

    # Verify we got a numeric job ID
    if [[ ! "$SUBMITTED_JOB" =~ ^[0-9]+$ ]]; then
        echo "Error: Failed to get valid job ID from submission"
        exit 1
    fi

    echo "Job submitted in directory: $JOB_DIR"
    echo "Job name: $JOB_NAME"
    echo "Job ID: $SUBMITTED_JOB"
    echo ""
    echo "To monitor this job:"
    echo "  Single status check:"
    echo "    ./parflow_status.sh -j $SUBMITTED_JOB"
    echo ""
    echo "  Continuous monitoring (updates every 10 seconds):"
    echo "    ./monitor_parflow.sh -- -j $SUBMITTED_JOB"
    echo ""
    echo "  Continuous monitoring with custom interval (e.g., 5 seconds):"
    echo "    ./monitor_parflow.sh -i 5 -- -j $SUBMITTED_JOB"
    echo ""
    echo "  Monitor with detailed output (10 lines of log files):"
    echo "    ./monitor_parflow.sh -- -j $SUBMITTED_JOB -n 10"
else
    echo "Job staged in directory: $JOB_DIR"
    echo "Job name: $JOB_NAME"
    echo ""
    echo "To submit this job:"
    echo "  cd $JOB_DIR"
    echo "  sbatch submit.sh"
fi
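For example, I invoke it like this (the input directory and allocation name here are placeholders, not my actual values):

./submit_parflow.sh $WORK/LW_inputs -N 1 -n 4 -t 00:30:00 -q skx-dev -A MY_ALLOCATION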

Steven Smith

Jan 27, 2025, 4:23:56 PM
to ParFlow
Hi,

1) One of the errors is occurring before the parflow executable is launched:


      cp: cannot stat '/home1/09949/kassem_mirna/Parflow-TACC/drv_*': No such file or directory

The drv_* files are not being copied to the work directory in the shell script. The fact that the literal pattern drv_* shows up in the error message means the glob matched no files in that directory.
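One way to catch this early, as a minimal sketch using the INPUT_DIR variable from your submit script, is to test the glob before copying:

      # Fail early if no CLM driver files match, instead of letting cp
      # receive the literal drv_* pattern
      if ! compgen -G "${INPUT_DIR}/drv_*" > /dev/null; then
          echo "Error: no drv_* files found in ${INPUT_DIR}" >&2
          exit 1
      fi
      cp "${INPUT_DIR}"/drv_* .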

2) Is there any output in the <runname>.out.txt and <runname>.out.log files? These files sometimes contain helpful output and error messages.

3) Is the TCL script you are running the one in the parflow/tests/ directory? I'm looking at the tclsh lines in the script:


      # Run ParFlow using TCL script
      tclsh LW_NetCDF_Test.tcl

If so, there are some file copies going on in the test script as well. The LW_NetCDF_Test.tcl test script assumes some files will be in the ../../parflow_input and ../../clm_input directories relative to where the test is running, which is "Outputs" in the original script.

-------------------
# ParFlow Inputs
file copy -force "../../parflow_input/slopes.nc" .
file copy -force "../../parflow_input/IndicatorFile_Gleeson.50z.pfb" .
file copy -force "../../parflow_input/press.init.nc" .

# CLM Inputs
file copy -force "../../clm_input/drv_clmin.dat" .
file copy -force "../../clm_input/drv_vegp.dat" .
file copy -force "../../clm_input/drv_vegm.alluv.dat" .
file copy -force "../../clm_input/metForcing.nc" .

puts "Files Copied"
-------------------

Input files not being in the expected locations is a common issue, and we don't always print the greatest error messages from ParFlow.
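If it helps, a small guard near the top of your TCL script can make this kind of failure explicit. A minimal sketch, using the driver file names from the test script above:

foreach f {drv_clmin.dat drv_vegp.dat drv_vegm.alluv.dat} {
    if {![file exists $f]} {
        puts "Missing required input file: $f"
        exit 1
    }
}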

Steve

Mirna Kassem

Jan 27, 2025, 10:14:09 PM
to ParFlow
Hi Steven,

Thanks for the response!

1- How do I make sure the drv_* files are copied to the work directory? I never see those drv_* files. Are they generated temporarily and then deleted?
We have this line in the .sh file: "cp ${INPUT_DIR}/drv_* .". Does this not copy the drv_* files to the work directory?

2- I attach here additional output files that may help identify the problem.

3- The TCL script I am running is not the same as the one in the parflow/tests/ directory; it just happens to have the same name, "LW_NetCDF_Test". That is why I was not copying the files you referred to. I also attach the TCL script here.

Thank you in advance for all the help!
Best,
Mirna


ParFlow_kassem_mirna.e1461447
LW_var_dz_spinup.out.log
ParFlow_kassem_mirna.o1461447
LW_NetCDF_Test.tcl
LW_var_dz_spinup.out.txt

Mirna Kassem

Jan 28, 2025, 7:31:22 PM
to ParFlow
Hello Steven,

A follow-up on my previous message:

1- I was able to resolve the issue of the drv_* files not being found in the correct directory.
However, I still get the other error (in the .o file):
"START
SET_CONSOLE
child process exited abnormally
    while executing
"exec sh /home1/09949/kassem_mirna/parflow/bin/run LW_var_dz_spinup 4"
    ("eval" body line 1)
    invoked from within
"eval exec sh $Parflow::PARFLOW_DIR/bin/run  $runname $NumProcs"
    invoked from within
"if [pfexists Process.Command] {
set command [pfget Process.Command]
puts [format "Using command : %s" [format $command $NumProcs $runname]]
puts [e..."
    (procedure "pfrun" line 53)
    invoked from within
"pfrun $runname"
    (file "LW_NetCDF_Test.tcl" line 458)"

This is a further explanation of the error, as output in the .out file:
"Attach debugger to PID 4003969 and set var i = 1 to continue
Incorrect process allocation input"
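I suspect this means the Process.Topology values in my TCL script do not multiply out to the number of MPI tasks passed to pfrun (4 here). As a minimal sketch, assuming the standard Process.Topology keys, settings consistent with 4 tasks would be:

# P*Q*R must equal the number of MPI tasks (here 2*2*1 = 4)
pfset Process.Topology.P 2
pfset Process.Topology.Q 2
pfset Process.Topology.R 1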

I attach here the new error and output files as well as the .sh script used to submit the job.
Thanks in advance!
Mirna

ParFlow_kassem_mirna.o1502954
LW_var_dz_spinup.out.log
submit_parflow.sh
ParFlow_kassem_mirna.e1502954
LW_var_dz_spinup.out.txt