Controlling the system#

From a user perspective, the entire process can be controlled from a Jupyter notebook (see cluster_control.ipynb). This can be run directly from the VS Code setup on a user's own machine (with polaris-studio installed locally).

This notebook generally allows users to:

  1. Add jobs to the queue

  2. See the status of the queue and the available workers

  3. See the status of individual jobs

It is generally advisable to maintain a customized version of this notebook that is relevant to your study and is version controlled alongside your study-specific source code. A good example of this can be found in the GPRA 23 study repository.
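For illustration, a notebook cell that adds a job to the queue might look like the sketch below. It reuses the insert_task helper and database engine shown in the termination example later on this page; the task definition fields and the worker_id=None argument are hypothetical placeholders rather than the actual polaris-studio task schema, so refer to cluster_control.ipynb for the exact calls used in your study.

# Hypothetical sketch: queue one model run (these task fields are placeholders, not the real schema)
model_task = {"task-type": "model-run", "model": "chicago", "scenario": "baseline"}
with engine.connect() as conn:
    # worker_id=None is an assumption meaning "any available worker"; check cluster_control.ipynb for the exact call
    insert_task(conn=conn, worker_id=None, definition=model_task, input={})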

Killing a running node#

This is very useful if someone goes on holiday and leaves idle nodes running on Bebop. The worker to shut down is identified by its worker_id, which is typically visible in the queue and worker status output of the control notebook. Note that running tasks can't currently be terminated remotely.

# Queue an EQ_ABORT control task to shut down the targeted worker
eq_abort_task = {"task-type": "control-task", "control-type": "EQ_ABORT"}
with engine.connect() as conn:
    # worker_id identifies the specific worker node to terminate
    insert_task(conn=conn, worker_id='xover-0036-4095684', definition=eq_abort_task, input={})

Deploying to Bebop / Crossover#

With standard nodes based on a physical machine or desktop, a background service can be started that continually runs jobs for as long as the machine is turned on (and not needed for anything else). On HPC clusters like Bebop and Crossover, however, jobs are scheduled via a separate task scheduler to ensure equitable access to the resource. This means that our HPC jobs need to be aware of where they are running and must not needlessly consume CPU time while waiting for work to arrive.

The way this is handled within ANL is that we expand the number of nodes in our cluster when a workload requires it, for example when a study is being performed. This is done by scheduling worker nodes through Slurm (the Bebop task scheduler) that run only for a known period of time and that can be terminated when there is no more work to do.

This process will differ based on the availability of HPC resources for a particular deployment. An example of how this is achieved on the Crossover cluster at ANL is provided here for reference.

Crossover Case-study#

  1. An estimate of the number and capability of the required worker nodes is made based on the workload. For example, if we need to run 24 Chicago models:

    1. We know from experience that approximately two Chicago models can be run simultaneously on a single hardware node in this cluster, so we tune our sbatch arguments to allow two Slurm jobs to run at once (--ntasks-per-node=64, where there are 128 hyper-threads per hardware node).

    2. We know that we can likely request 6 hardware nodes without risking the ire of our HPC admins or fellow users, so we would schedule 12 workers, each of which we would expect to run 2 of our models (sequentially).

    3. We know that a full-model run takes ~24 hours, so we estimate that an allocation of 2 * 24 * 1.25 hours per worker would be appropriate (the factor of 1.25 allows a 25% buffer on our time estimate); the short sketch below works this through to the corresponding Slurm walltime.
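    The walltime arithmetic above can be written out explicitly; the inputs (2 sequential models per worker, ~24 hours per run, a 25% buffer) are simply the assumptions listed above, and the result is expressed in Slurm's D-HH:MM:SS walltime notation.

    # Rough walltime estimate per worker, using the assumptions above
    models_per_worker = 2     # each worker runs its models sequentially
    hours_per_model = 24      # observed duration of a full Chicago model run
    buffer = 1.25             # 25% safety margin on the estimate
    walltime_h = int(models_per_worker * hours_per_model * buffer)  # 60 hours
    days, hours = divmod(walltime_h, 24)
    print(f"--time={days}-{hours:02d}:00:00")  # prints: --time=2-12:00:00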

  2. Once we know how many jobs we need, we schedule that many jobs with the appropriate allocation of time. Each job runs our wrapper shell script, which is customized for that particular cluster (loads the required modules, uses Anaconda to set up Python and install requirements.txt, etc.):

    # Submit 12 identical worker jobs; --ntasks-per-node=64 lets two workers
    # share a single 128-hyper-thread hardware node
    for i in {1..12}; do
        sbatch --ntasks-per-node=64 --time=2-12:00:00 \
               --partition=xxx --account=xxx worker_loop_xover.sh
    done
  3. These worker nodes will then run until they are terminated by sending an EQ_ABORT task (see "Killing a running node" above) or until they run out of allocated time on the cluster.

  4. As HPC compute is heavily partitioned from the rest of the network, we utilise Globus to transfer data onto and off of the compute nodes. This means that the user who initiates these workers must also have working Globus credentials available on the cluster.

Important

Generally it’s enough to authenticate on a login node by running python polaris-studio/utils/copy_utils.py and following the prompts.