FAQ

Publications

How to cite PlaFRIM in your publications

Don’t forget to cite PlaFRIM in all publications presenting results or content obtained or derived from the use of PlaFRIM.

The official acknowledgment to use in your publication must be the following:

Acknowledgment: Experiments presented in this paper were carried out using the PlaFRIM experimental testbed, supported by Inria, CNRS (LABRI and IMB), Université de Bordeaux, Bordeaux INP and Conseil Régional d’Aquitaine (see https://www.plafrim.fr/).

 

How to cite PlaFRIM in Hal (Open Archive)

When you deposit a publication (article, conference paper, thesis, poster, …) in Hal (Open Archive), do not forget to add plafrim in the Project/Collaboration field of the metadata.

Access

How to connect to PlaFRIM

Your SSH client must use a “ProxyCommand” to reach the target server.

Sample configuration of .ssh/config to reach plafrim on port 22:

(replace LOGIN_PLAFRIM with your actual login)

Host plafrim
User LOGIN_PLAFRIM
ForwardAgent yes
ForwardX11 yes
ProxyCommand ssh -A LOGIN_PLAFRIM@ssh.plafrim.fr -W plafrim:22

Check that your private key is loaded with ssh-add -l. If not, load it with

ssh-add ~/.ssh/private_key

Then use

ssh plafrim
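
As a side note, recent OpenSSH clients (version 7.3 and later) can express the same setup with a ProxyJump directive instead of ProxyCommand. This is an equivalent sketch, not the officially documented configuration:

Host plafrim
User LOGIN_PLAFRIM
ForwardAgent yes
ForwardX11 yes
ProxyJump LOGIN_PLAFRIM@ssh.plafrim.fr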

How to store data

There are six possible storage spaces on the machine with different purposes.

  1. /home/<LOGIN>
    • Max size: 20 GB
    • Deletion: Never
    • Hardware Protection (RAID): Yes
    • Backup: Regular + versioning
    • Primary use: individual
    • How to obtain: automatic
    • Quota usage command: quota -f /home
  2. /projets/<PROJET>
    • Size: 200 GB
    • Deletion: Never
    • Hardware Protection (RAID): Yes
    • Backup: Regular + versioning
    • Primary use: group. This space can be shared between several users to store data, software, …
    • How to obtain: on demand. To obtain such a space, simply send an email to PlaFRIM Support, specifying the name and description of the project, along with the list of people associated with the project.
    • Quota usage command: du -s /projets/<PROJET>
  3. DEPRECATED /lustre/<LOGIN>
    • Max size: 1 TB
    • Deletion: Never
    • Hardware Protection (RAID): Yes
    • Backup: No
    • Primary use: individual
    • How to obtain: automatic
    • Quota usage command: lfs quota -u <LOGIN> /lustre
  4. /beegfs/<LOGIN>
    • Max size: 1 TB
    • Deletion: Never
    • Hardware Protection (RAID): Yes
    • Backup: No
    • Primary use: individual
    • How to obtain: automatic
    • Quota usage command: beegfs-ctl --getquota --uid <LOGIN>
  5. /tmp
    • Max size: variable
    • Deletion: If needed and when machines are restarted
    • Hardware Protection (RAID): No
    • Backup: No
    • Primary use: individual
    • How to obtain: automatic
  6. /scratch
    • Max size: variable
    • Deletion: If needed and when machines are restarted
    • Hardware Protection (RAID): No
    • Backup: No
    • Primary use: individual
    • How to obtain: automatic. This space is only available on sirocco[14,15,16]

Restore lost files

Each directory under /home/<LOGIN> or /projets/<PROJET> has a .snapshot directory in which you can retrieve lost files.

Only /home and /projets directories have snapshots activated and are replicated off-site for 4 weeks.
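
A possible way to retrieve a lost file, where <SNAPSHOT_NAME> is a placeholder for one of the names listed by the first command (the available snapshot names depend on the storage system's snapshot schedule):

ls /home/<LOGIN>/.snapshot
cp /home/<LOGIN>/.snapshot/<SNAPSHOT_NAME>/path/to/lost_file /home/<LOGIN>/path/to/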

How to access an external site from PlaFRIM

Users must send a ticket to plafrim-support stating the site they need to access and the reason.

The technical team will approve the request after checking it does not lead to any technical issues (security…).

Job manager

How to run an interactive job

There are two commands to run interactive jobs. One can either connect to a node with srun or use salloc to run a job step by step.

srun

srun --pty bash -i

One can also pass the following arguments to the srun command:

  • -N 1 (or --nodes=1): the node count; it equals 1 by default.
  • -n 1 (or --ntasks=1): the number of tasks; it equals 1 by default and must be less than or equal to the number of cores of the node.
  • --exclusive: allocate nodes in exclusive mode.

$ hostname

devel02.plafrim.cluster

$ srun --pty bash -i
$ hostname

miriel004.plafrim.cluster

  • The option --pty also works when asking for more than one node. You will be connected to the first node. To see which nodes are part of the job, look at the environment variable $SLURM_JOB_NODELIST, as in the sketch below.
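
A minimal sketch of a two-node interactive session (the node list shown is illustrative):

$ srun -N 2 --pty bash -i
$ echo $SLURM_JOB_NODELIST
miriel[004-005]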

salloc

salloc allows running a job in several steps, each step using all or a subset of the allocated resources.

salloc -N 3

salloc: Granted job allocation 1155503
salloc: Waiting for resource configuration
salloc: Nodes sirocco[01-03] are ready for job

srun hostname

sirocco01.plafrim.cluster
sirocco02.plafrim.cluster
sirocco03.plafrim.cluster

srun -N 1 hostname

sirocco01.plafrim.cluster

and you can then connect to the first node using

$ srun --pty bash -i

How to run a non-interactive job

The sbatch command is used to submit a batch script to SLURM.

$ cat script-slurm.sh

#!/usr/bin/env bash
## name of the job
#SBATCH -J TEST_Slurm
## resources (nodes, procs, tasks, walltime, etc.)
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 00:05:00
## standard output file
#SBATCH -o batch%j.out
## standard error file
#SBATCH -e batch%j.err
module purge
module load compiler/gcc/4.9.0
echo "===== my job information ====="
echo "Node List: " $SLURM_NODELIST
echo "my jobID: " $SLURM_JOB_ID
echo "Partition: " $SLURM_JOB_PARTITION
echo "submit directory:" $SLURM_SUBMIT_DIR
echo "submit host:" $SLURM_SUBMIT_HOST
echo "In the directory: `pwd`"
echo "As the user: `whoami`"
srun -n4 hostname

and submit the job with

sbatch script-slurm.sh

You can then check the progression of your job using the commands squeue and scontrol.
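
For example, to list only your own jobs (replace <LOGIN> with your login):

squeue -u <LOGIN>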

How to get information on a job

To get information on a running job, one can use the commands squeue and scontrol.

squeue

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
1155454 routage plafrim-master-plafrim-gcc.sl furmento R 5:37 1 sirocco08

scontrol show job 1155454

JobId=1155454 JobName=plafrim-master-plafrim-gcc.sl
UserId=furmento(10193) GroupId=storm(11118) MCS_label=N/A
Priority=1 Nice=0 Account=(null) QOS=normal
JobState=RUNNING Reason=None Dependency=(null)
Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
RunTime=00:06:15 TimeLimit=01:00:00 TimeMin=N/A
SubmitTime=2021-01-19T08:34:06 EligibleTime=2021-01-19T08:34:06
AccrueTime=2021-01-19T08:34:06
StartTime=2021-01-19T08:34:06 EndTime=2021-01-19T09:34:06 Deadline=N/A
SuspendTime=None SecsPreSuspend=0 LastSchedEval=2021-01-19T08:34:06
Partition=routage AllocNode:Sid=devel02.plafrim.cluster:300429
ReqNodeList=(null) ExcNodeList=(null)
NodeList=sirocco08
BatchHost=sirocco08
NumNodes=1 NumCPUs=24 NumTasks=1 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=24,node=1,billing=24
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=sirocco DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/home/furmento/buildbot/plafrim-master-plafrim-gcc.sl
WorkDir=/home/furmento/buildbot
StdErr=/home/furmento/buildbot/slurm-1155454.out
StdIn=/dev/null
StdOut=/home/furmento/buildbot/slurm-1155454.out
Power=

How to get information on available nodes

You can see the list of all the nodes on the hardware page.

To see the state of the cluster, go to https://www.plafrim.fr/state/

To allocate a specific category of node with SLURM, you need to specify the node features. To display the list, call the command

$ sinfo -o "%60f %N"
AVAIL_FEATURES                                               NODELIST
sirocco,intel,haswell,mellanox,nvidia,tesla,k40m             sirocco[01-05]
bora,intel,cascadelake,omnipath                              bora[001-044]
sirocco,intel,broadwell,omnipath,nvidia,tesla,p100           sirocco[07-13]
sirocco,intel,skylake,omnipath,nvidia,tesla,v100             sirocco[14-16]
sirocco,intel,skylake,omnipath,nvidia,tesla,v100,bigmem      sirocco17
arm,cavium,thunderx2                                         arm01
brise,intel,broadwell,bigmem                                 brise
amd,diablo                                                   diablo[01-05]
kona,intel,knightslanding,knl                                kona[01-04]
sirocco,intel,skylake,nvidia,quadro,rtx8000                  sirocco[18-20]
souris,sgi,ivybridge,bigmem                                  souris
visu                                                         visu01
miriel,intel,haswell,infinipath                              miriel[044-045,048,050-053,056-058,060-064,066-073,075-076,078-079,081,083-088]
miriel,intel,haswell,omnipath,infinipath                     miriel[001-043]
amd,zonda                                                    zonda[01-06]
mistral                                                      mistral[02-03,06]

For example, to reserve a bora node, you need to call

$ salloc -C bora
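
The same feature constraint can also be used in a batch script; a minimal sketch reusing the sbatch structure shown earlier:

#!/usr/bin/env bash
#SBATCH -J TEST_bora
#SBATCH -C bora
#SBATCH -N 1
#SBATCH -t 00:05:00
srun hostname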

 

sinfo has many parameters, for example:

  • -N (--Node) Print information in a node-oriented format.
  • -l (--long) Print more detailed information.

sinfo -l

Tue Jan 19 10:41:23 2021
PARTITION AVAIL TIMELIMIT JOB_SIZE ROOT OVERSUBS GROUPS NODES STATE NODELIST
routage* up 3-00:00:00 1-infinite no NO all 20 drained* miriel[005,008,016-017,019-020,022,024,027,032,038,043-044,062,068,073,075,081,083,086]
routage* up 3-00:00:00 1-infinite no NO all 3 drained sirocco[07,10,13]
routage* up 3-00:00:00 1-infinite no NO all 8 mixed sirocco[09,12,15-16,18-20],souris
routage* up 3-00:00:00 1-infinite no NO all 51 allocated bora[001-011],diablo[04-05],miriel[001,003-004,006,009-015,018,021,023,030-031,033,036,039,045,048,050-053,056-058,060,064,067,069-071,076,078,087],sirocco14

sinfo -N

NODELIST NODES PARTITION STATE
arm01 1 routage* idle
bora001 1 routage* alloc
bora002 1 routage* alloc
bora003 1 routage* alloc
bora004 1 routage* alloc
bora005 1 routage* alloc
bora006 1 routage* alloc

How to ask for nodes with GPUs

As explained on the hardware page, the sirocco nodes have GPUs. You will need to specify the corresponding constraints if you want a specific GPU card.

It is advised to use the --exclusive parameter to make sure nodes are not used by another job at the same time.

srun --exclusive -C sirocco --pty bash -i

@sirocco08.plafrim.cluster:~> module load compiler/cuda
@sirocco08.plafrim.cluster:~> nvidia-smi
Tue Jan 19 10:52:07 2021
| NVIDIA-SMI 460.27.04    Driver Version: 460.27.04    CUDA Version: 11.2 |
| 0  Tesla P100-PCIE... On | ...
| 1  Tesla P100-PCIE... On | ...
...
@sirocco08.plafrim.cluster:~>
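
Based on the feature list shown in the previous question, a specific GPU model can be requested by its feature name; for example (a sketch, feature names may evolve with the hardware):

srun --exclusive -C v100 --pty bash -i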

How to kill a running job

To kill running batch jobs, use the scancel command either with your login name or with a list of job ids (separated by spaces):

scancel -u <user>

or

scancel jobid_1 ... jobid_N

The command squeue can be used to get the job ids.

squeue

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
2545 longq test1 bee R 4:46:27 1 miriel007
2552 longq test2 bee R 4:46:47 1 miriel003
2553 longq test1 bee R 4:46:27 1 miriel004

Interactive jobs can also be killed by exiting the current shell.

How to launch multi-prog jobs

It is possible to run a job on several nodes and launch different programs on different sets of tasks.

Here is an example of such a multi-program configuration file.

############################################################
# srun multiple program configuration file
#
# srun -n8 -l --multi-prog silly.conf
############################################################
4-6 hostname
1,7 echo task:%t
0,2-3 echo offset:%o

To submit such a file, use the following command

srun -n8 -l --multi-prog silly.conf

You will get an output similar to

4: miriel004.plafrim.cluster
6: miriel004.plafrim.cluster
5: miriel004.plafrim.cluster
7: task:7
1: task:1
2: offset:1
3: offset:2
0: offset:0

Parallel programming

How to run my application with OpenMPI?

To launch MPI applications on miriel, sirocco and devel nodes:

mpirun -np <nb_procs> --mca mtl psm ./apps

If you need the OmniPath interconnect (for miriel nodes 01 to 44):

mpirun -np <nb_procs> --mca mtl psm2 ./apps

All these features are available with OpenMPI versions > 2.0.0.
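
A minimal non-interactive sketch combining sbatch and mpirun, assuming an executable named ./my_mpi_app and an OpenMPI module (check module avail for the exact module name on the platform):

#!/usr/bin/env bash
#SBATCH -J TEST_OpenMPI
#SBATCH -N 2
#SBATCH -n 8
#SBATCH -t 00:10:00
module purge
## module name is an assumption, check `module avail`
module load mpi/openmpi
## psm2 selects the OmniPath interconnect, as explained above
mpirun -np $SLURM_NTASKS --mca mtl psm2 ./my_mpi_app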

How to run my application with Intel MPI?

Choose your build and execution environment using modules.

Example:

module load compiler/gcc
module add compiler/intel
module add mpi/intel

Launch with the mpiexec.hydra command. First generate the host file from your allocation:

srun hostname -s | sort -u > mpd.hosts

Select the particular network fabrics to be used with the environment variable I_MPI_FABRICS.

I_MPI_FABRICS=<fabric>|<intra-node fabric>:<inter-node fabric>

Where <fabric> := {shm, dapl, tcp, tmi, ofa}

For example, to select the shared memory fabric (shm) for intra-node MPI communication and the Tag Matching Interface fabric (tmi) for inter-node MPI communication, use the following commands:

export I_MPI_FABRICS=shm:tmi
mpiexec.hydra -f mpd.hosts -n $SLURM_NPROCS ./a.out

The available fabrics on the platform are:

tmi   TMI-capable network fabrics, including Intel True Scale Fabric and Myrinet (through the Tag Matching Interface)
ofa   OFA-capable network fabrics, including InfiniBand (through OFED verbs)
dapl  DAPL-capable network fabrics, such as InfiniBand, iWarp, Dolphin, and XPMEM (through DAPL)
tcp   TCP/IP-capable network fabrics, such as Ethernet and InfiniBand (through IPoIB)

You can also specify a list of fabrics with the environment variable I_MPI_FABRICS_LIST (the default value is dapl,tcp). The first fabric detected will be used at runtime:

I_MPI_FABRICS_LIST=<fabrics list>

Where <fabrics list> := <fabric>,…,<fabric>
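
For example, to try the TMI fabric first and fall back to TCP (a sketch based on the syntax above):

export I_MPI_FABRICS_LIST=tmi,tcp
mpiexec.hydra -f mpd.hosts -n $SLURM_NPROCS ./a.out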

(for more details visit https://software.intel.com/sites/products/documentation/hpc/ics/impi/41/lin/Reference_Manual/Communication_Fabrics_Control.htm )

Other

How to change my password on www.plafrim.fr?

  1. Go to http://www.plafrim.fr/wp-login.php
  2. Click on “Lost your password?”
  3. Enter either your PlaFRIM username (your SSH login) or the email address you used to create your PlaFRIM account, and click on “Get New Password”
  4. Check your email inbox
  5. Click on the link (the longer one) provided in this email
  6. Choose your new password
  7. Check that you can connect at http://www.plafrim.fr/connection/