BEBOP and Crossover#

Bebop is the Argonne HPC cluster on which we often run our simulations. It is operated by the LCRC division within Argonne. In addition to the shared computation nodes that LCRC operates for general ANL use, they also maintain a dedicated set of POLARIS-specific nodes that are only available to staff within the POLARIS group. These nodes are hosted on a separate partition and accessed via a separate set of login nodes called "crossover". Recently the new HPC cluster "Improv" has also come online and can be utilized for testing purposes.

Cluster      Login URL
---------    -------------------------------
Bebop        username@bebop.lcrc.anl.gov
Crossover    username@crossover.lcrc.anl.gov
Improv       username@improv.lcrc.anl.gov

LCRC Account#

You will need an LCRC account to log in to Bebop and to access the group's compute-time allocation.

You will then need to add your SSH key to your LCRC account.

You must activate your account before you can be added to the POLARIS project.
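
For reference, a typical key setup from your local machine looks something like the following (the key type and filename are your choice; the public key is added through the LCRC account management page):

ssh-keygen -t ed25519                 # generate a key pair locally
cat ~/.ssh/id_ed25519.pub             # copy the public key into your LCRC account profile
ssh username@bebop.lcrc.anl.gov       # once the key has propagated, log in to the cluster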

Build POLARIS on an LCRC cluster#

Bebop, Improv and Crossover all provide software through an environment module system (the underlying packages are built with Spack), which allows many potentially conflicting sets of software to be "installed" side by side. When you log in to a cluster you choose which of the many options you want available in your current environment.

For example, to load the modules that are required to build POLARIS you would run the following snippet:

module load gcc/11 anaconda3 hdf5         # Bebop
module load gcc/10.4 hdf5/1.12 anaconda3  # Crossover
# module load ninja/1.10.2-s5ukqre

conda activate polaris-<insert-cluster-name>
python -m pip install patchelf

Important

Note that the default version of Python can be quite old and we generally replace it with a more recent one via conda create, being careful to keep a different environment for each cluster. See the Anaconda section below.

Module snapshots#

After loading a set of modules you can save them for later use.

module save polaris_build_bebop      # Save a snapshot of currently loaded modules
module restore polaris_build_bebop   # Corresponding loading of module snapshot
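
You can list the snapshots you have previously saved with (assuming an Lmod-style module system, which is what the save/restore commands above come from):

module savelist                      # List previously saved module snapshots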

To have the POLARIS build requirements available each time you log in, just add the appropriate restore line (e.g. module restore polaris_build_bebop) and optionally conda activate polaris-<cluster> to your ~/.bashrc or ~/.bash_profile file, for example:

if hostname | grep xover > /dev/null; then
    module restore polaris_build_xover
elif hostname | grep bebop > /dev/null; then
    module restore polaris_build_bebop
    export PATH="${PATH}:~/bin/bebop"
fi

Anaconda#

When using Python, the best practice we have found is to use Anaconda to set up Python in your home directory and then create a project-specific conda environment for each cluster that builds on that Python version and adds the required dependencies. For example, the following snippet creates a conda environment called polaris-bebop based on Python 3.11 and installs the HPC requirements of polaris-studio.

conda create -y -n polaris-bebop python=3.11        # Note use of bebop here, replace with polaris-xover on that cluster
conda activate polaris-bebop                        # Also here
conda install --channel conda-forge libspatialite
python -m pip install -e ~/git/polaris-studio[hpc]  # installs the HPC dependencies of polaris-studio (assumed to be in ~/git)

If conda isn't an option, the following achieves essentially the same effect using a plain Python virtualenv.

python -m pip install virtualenv
python -m virtualenv ~/venvs/polarislib-bebop
. ~/venvs/polarislib-bebop/bin/activate
python -m pip install -e ~/git/polaris-studio[hpc] # installs the HPC dependencies of polaris-studio (assumed to be in ~/git)

You can also have conda automatically activate your environment, but care must be taken with the cluster environment. The following is a modified version of the code inserted into a user's .bashrc file by the conda init bash command.

# >>> conda initialize >>>
# !! Contents within this block are managed by 'conda init' !!
if hostname | egrep "xover|crossover" > /dev/null; then
  __conda_basedir=/gpfs/fs1/soft/xover/manual/anaconda3/2021.05
  __conda_env=polaris-xover
elif hostname | egrep "bebop|bdw" > /dev/null; then
  __conda_basedir=/gpfs/fs1/soft/bebop/software/custom-built/anaconda3/2024.06
  __conda_env=polaris-bebop
else
  __conda_basedir=/soft/anaconda3/2024.2
  __conda_env=polaris-improv
fi

__conda_setup="$("${__conda_basedir}/bin/conda" 'shell.bash' 'hook' 2> /dev/null)"
if [ $? -eq 0 ]; then
    eval "$__conda_setup"
else
    if [ -f "${__conda_basedir}/etc/profile.d/conda.sh" ]; then
        . "${__conda_basedir}/etc/profile.d/conda.sh"
    else
        export PATH="${__conda_basedir}/bin:$PATH"
    fi
fi
conda activate ${__conda_env}
unset __conda_setup
unset __conda_basedir
# <<< conda initialize <<<

Clone and build code#

The following snippet clones the repository, then configures using CMake (-c), builds (-b) and installs the binaries to the given location.

git clone https://git-out.gss.anl.gov/polaris/code/polaris-linux.git 
cd polaris-linux
python ./build.py --cb -d <deps_dir> --install <install_dir>

The default dependencies directory (-d) is /lcrc/project/POLARIS/bebop/opt/polaris/deps

If you supply the install path and point it to your project, the executable and all required SO/DLL files will be copied there. A common pattern is to use --install /lcrc/project/POLARIS/bebop/MyProject/bin.
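
Putting that together, a build against the shared dependencies that installs into a project directory might look like the following (the paths simply combine the default dependencies location and the MyProject pattern mentioned above):

python ./build.py --cb -d /lcrc/project/POLARIS/bebop/opt/polaris/deps --install /lcrc/project/POLARIS/bebop/MyProject/bin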

Important

Historically it has been common to compile POLARIS on the login nodes of the clusters. However, LCRC is becoming more proactive about shutting down computation on login nodes, so it may be necessary to request an actual compute node to do the compilation. On PBS this can be accomplished with the command qsub -A POLARIS -I -l select=1:mem=64gb,walltime=00:15:00 -j oe
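
On the Slurm side, a comparable interactive allocation can be requested with srun; the line below is a sketch using the POLARIS account and the bdwall partition referenced elsewhere on this page (adjust partition and resources as needed):

srun --account=POLARIS --partition=bdwall --nodes=1 --mem=64G --time=00:15:00 --pty bash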

Copy model data#

Watch demo on using WinSCP
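
If you prefer the command line over WinSCP, model data can be copied with scp or rsync over the same SSH connection; the paths below are purely illustrative, following the MyProject layout used later on this page:

rsync -avz ./grid-model/ username@bebop.lcrc.anl.gov:/lcrc/project/POLARIS/bebop/MyProject/data/grid-model/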

Schedule job#

After you have compiled your software changes, you can run an individual POLARIS simulation by wrapping the execution in a small script that adds the relevant Slurm or PBS options (account, partition, number of nodes, etc.). Full instructions for LCRC runs can be found at https://docs.lcrc.anl.gov/

Any line in the job script starting with the exact string #SBATCH or #PBS (called a “magic cookie”) will be interpreted by the scheduler as a special processing directive.

Important

On Crossover, use --account=TPS --partition=TPS to target the dedicated nodes. Because multiple jobs are allowed per node on Crossover, you must also make sure that you request more than a single core with the --ntasks-per-node=64 option, or use --exclusive to disable node sharing entirely.
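
These options can either be set as #SBATCH lines in the job script or supplied (and overridden) on the sbatch command line when submitting, for example (arguments as for the script shown below):

sbatch --account=TPS --partition=TPS --ntasks-per-node=64 srun_polaris.sh <data_dir> <exe> <scenario_json> <threads>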

$ cat srun_polaris.sh

#!/bin/bash
#
# Slurm options - note these can be overridden when submitting this script through sbatch, e.g.
# $ sbatch --time=01:00:00 srun_polaris.sh $data_dir $exe_name $scenario $threads
#
#SBATCH --job-name=polaris_base
#SBATCH --account=POLARIS # <= important
#SBATCH --partition=bdwall
#SBATCH --nodes=1
#SBATCH --time=04:00:00

# Equivalent setup for the PBS scheduler
# $ qsub srun_polaris.sh $data_dir $exe_name $scenario $threads
#PBS -N polaris_base
#PBS -A POLARIS                # <= important
#PBS -l select=4:mpiprocs=128
#PBS -l walltime=04:00:00

# Note: set -eu must come after the scheduler directives, otherwise they are ignored
set -eu

data=$1
program=$2
scenario=$3
threads=$4

cd $data
$program $scenario $threads

This can then be invoked as follows:

$ export project_dir=/lcrc/project/POLARIS/bebop/MyProject
$ sbatch srun_polaris.sh ${project_dir}/data/grid-model ${project_dir}/bin/Integrated_Model scenario_init.json 12
Submitted batch job 2153400

The console output from your run will be captured by Slurm and redirected into a file in the current directory called slurm-<job_id>.out. You can tail this file to watch progress:

$ tail -f slurm-2153400.out
LOG4CPP : [NOTICE] Successfully initialized logging utility.
LOG4CPP : [NOTICE] There are 3 arguments:
LOG4CPP : [NOTICE] /lcrc/project/POLARIS/bebop/polaris-bebop-test/data/integrated
LOG4CPP : [NOTICE] scenario_init.json
LOG4CPP : [NOTICE] 12
LOG4CPP : [ERROR] Failed to copy file into output folder.
0
LOG4CPP : [NOTICE] Initializing network skims...
LOG4CPP : [NOTICE] Network skims done.
LOG4CPP : [NOTICE] Starting simulation...

You can also follow the more detailed log that is written in the model run output directory.

$ tail -f ${project_dir}/data/grid-model/grid_ABM3/log/polaris_progress.log
23/06/2021 13:17:19,630 | Thread: 47944819926784 [NOTICE] 23:59:54,
23/06/2021 13:17:19,636 | Thread: 47944819926784 [NOTICE] departed= 0, arrived= 0, in_network= 0, VMT= 0.00, VHT= 0.00
23/06/2021 13:17:19,636 | Thread: 47944819926784 [INFO] PRINTING ALL THE STUCK CARS... Time: 86399
23/06/2021 13:17:19,637 | Thread: 47944819926784 [INFO] 23:59:54,
23/06/2021 13:17:19,637 | Thread: 47944819926784 [INFO] departed= 0, arrived= 0, in_network= 0, VMT= 0.00, VHT= 0.00
23/06/2021 13:17:19,638 | Thread: 47944819926784 [INFO] 0 cars are printed!
23/06/2021 13:17:19,726 | Thread: 47944783320896 [NOTICE] Finished!

Scheduling a Python script#

When running Python scripts on the cluster, we need to make sure that the appropriate modules and conda environment are available on the compute node. An example setup is shown in the run script below. This script would be invoked as

$ sbatch run_land_use_study.sh --data-dir foo --exe-name ${project_dir}/bin/Integrated_Model

Because the arguments to the script are passed down to the Python file, all the argument parsing can be handled in Python with a library like argparse.

$ cat run_land_use_study.sh

#!/bin/bash
#SBATCH --job-name=land_use
#SBATCH --account=TPS --partition=TPS --nodes=1
#SBATCH --time=60:00:00 --ntasks-per-node=64 --exclusive
#SBATCH --mail-user=<your_email> --mail-type=ALL

set -eu

main() {
    cd /lcrc/project/POLARIS/bebop/SMART_FY22_LAND_USE/PILATES-SRC-JA
    load_modules
    setup_venv
    python3 ./run.py "$@"
}

# Activate the first conda environment whose name matches this host
setup_venv() {
    if which conda; then
        # Conda Time
        set +u
        eval "$(conda shell.bash hook)"
        set -u
        local pattern=""
        if hostname | egrep "xover|crossover"; then
            pattern="xover|crossover"
        elif hostname | egrep "bebop|bdw"; then
            pattern="bebop"
        elif hostname | egrep "improv|ilogin|i0|i1|i2"; then
            pattern="improv"
        fi
        conda_env=$(conda env list | cut -d' ' -f1 | grep '^[a-z]' | grep polaris | egrep "${pattern}" | head -n1)
        [[ -z "${conda_env}" ]] && echo "Couldn't find a conda environment for this host" && exit 1
        conda activate ${conda_env}
    fi
}

load_modules() {
    if hostname | egrep "crossover|xover"; then
        module load polarislib_xover
    elif hostname | egrep "bebop|bdw"; then
        module load polarislib_bebop
    elif hostname | egrep "improv|ilogin|i0|i1|i2"; then
        module load polarislib_improv
    else
        echo "not bebop or crossover or improv? EXITING!"
        exit 1
    fi
}

main "$@"

Some useful LCRC commands#

james.cook@beboplogin2 $ squeue -u rweimer
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

james.cook@beboplogin2 $ qstat -u rweimer
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----

james.cook@beboplogin2 $ sbank-list-allocations -p POLARIS

Allocation  Suballocation  Start       End         Resource  Project  Jobs  Charged  Available Balance 
 ----------  -------------  ----------  ----------  --------  -------  ----  -------  ----------------- 
 900         847            2024-07-01  2024-10-01  bebop     POLARIS     6      3.5            5,552.5 

Totals:
  Rows: 1
  Bebop:
    Available Balance: 5,552.5 node hours
    Charged          : 3.5 node hours
    Jobs             : 6 

All transactions displayed are in Node Hours. 1 Improv Node Hour is 128 Core Hours. 1 Bebop Node Hour is 36 Core Hours. Balances and transactions displayed will update every 5 minutes.

james.cook@beboplogin2 $ lcrc-quota

----------------------------------------------------------------------------------------
Home                          Current Usage   Space Avail    Quota Limit    Grace Time
----------------------------------------------------------------------------------------
james.cook                         56 GB          43 GB         100 GB               
----------------------------------------------------------------------------------------
Project                       Current Usage   Space Avail    Quota Limit    Grace Time
----------------------------------------------------------------------------------------
POLARIS                          1456 GB        2637 GB        4096 GB               
POLARIS_TCF                       815 GB         209 GB        1024 GB               
TPS                                 0 GB        1024 GB        1024 GB