Preemption queue “testpreempt”

by Nathalie Furmento

The queue "testpreempt"

The queue "testpreempt" lets jobs run outside the usual limits on unused computing resources (during the day, but mostly at night and on weekends) without blocking access to those resources for jobs running on the other queues.

This means that a job can be stopped ("preempted") abruptly at any moment if a regular job needs the resources.

The job will be restarted when the resources become available.

The code must therefore regularly back up its state ("checkpoint"), make sure the backup is safe (the job could be stopped while the backup is being written), and be able to restart from a previous backup.

  • To prevent the queue from being used inadvertently, it is hidden from the default output of sinfo and squeue; it is only shown when the option -a is used.
  • The queue has been set up as follows (this could change depending on the feedback):
    • Execution time on the queue is limited to 8 hours.
    • All nodes are reachable. To restrict execution to some nodes, use the option --constraint, e.g. --constraint="Miriel"
  • To list all the available constraints, one can use
    • sinfo -o "%30N %10c %10m %30f %10G" (column FEATURES)

Advice on backing up your application state (checkpointing)

  • Use a specific function to backup and another function to restore from a backup.
  • Write the backup to a temporary file (or several files in a temporary directory), then rename (an atomic operation) the file or directory to its final name to stamp the backup.
    • If the application is stopped during the backup, the temporary backup will be ignored and the previous stamped backup will be used.
  • When your application starts, it should first check whether a backup is available and, if so, restart from it.
  • Backing up should typically be done after an MPI barrier to make sure all processes are synchronised.
  • The backup frequency should be adapted to the backup duration (the time spent writing data to disk). As a rule of thumb, an application should not run for more than 30 minutes to 1 hour without doing a backup.
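
The advice above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a recommended implementation: the function names (save_checkpoint, load_checkpoint, run), the file name state.ckpt.json, the JSON format and the toy computation are all hypothetical, and the sketch covers a single process only (an MPI code would add a barrier before saving). The rename is atomic only when the temporary file and the final file are on the same filesystem, which is why the temporary file is created next to the final one.

```python
import json
import os
import tempfile
import time

CHECKPOINT = "state.ckpt.json"  # hypothetical file name, chosen for this sketch


def save_checkpoint(state, path=CHECKPOINT):
    """Write the state to a temporary file, then rename it to stamp the backup."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory as the final file,
    # so that the rename below stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".ckpt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(state, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before renaming
        os.replace(tmp_path, path)  # atomic replacement of the previous backup
    except BaseException:
        # On a normal error, clean up the incomplete temporary file.
        # If the job is killed outright, the leftover file is simply ignored.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise


def load_checkpoint(path=CHECKPOINT):
    """Return the last stamped state, or None if no backup exists."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return None


def run(total_steps, interval_s=1800.0, path=CHECKPOINT):
    """Toy main loop: restart from a backup if there is one, save periodically."""
    state = load_checkpoint(path) or {"step": 0, "acc": 0}
    last_save = time.monotonic()
    while state["step"] < total_steps:
        state["acc"] += state["step"]  # placeholder for the real computation
        state["step"] += 1
        if time.monotonic() - last_save >= interval_s:
            save_checkpoint(state, path)
            last_save = time.monotonic()
    save_checkpoint(state, path)  # final backup at the end of the run
    return state
```

Calling run(1000) a second time after an interruption resumes from the last stamped backup instead of starting at step 0. The default interval of 1800 seconds simply mirrors the 30-minute guideline above and should be tuned to how long a backup actually takes.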
