Batch System - Job submission

1. Basic requirements
2. Batch job submission
3. Interactive job submission
4. Parallel Jobs
5. GPU Jobs
6. Most common qsub / qrsh submission switches
7. Requesting resources
8. Projects
9. CPU core binding
10. Some words about the stdout/stderr job output files
11. Examples
12. Best Practices
13. Other commands to handle your active jobs

1. Basic requirements

  • You need to be logged in on a submission host (see: submission hosts)
  • Your authentication credentials stored in $SGE_CERTFILE & $SGE_KEYFILE must be readable. Those credentials should get installed automatically on first usage!
  • Ensure that in your .[t]cshrc or .zshrc no commands are executed that need a terminal (tty) (users sometimes have a stty command in their startup scripts)

2. Batch job submission

  • Usual command: qsub [resource requirements] <script> [script parameter]
  • See Requesting resources for details about the available resources
  • Job requirements can also be provided within the job script.
  • Example:

#!/bin/zsh
#
#(otherwise the default shell would be used)
#$ -S /bin/bash
#(the running time for this job)
#$ -l h_rt=05:30:00
#
#(the maximum memory usage of this job)
#$ -l h_rss=2G
#
#(stderr and stdout are merged together to stdout)
#$ -j y
#
#(send mail on job's end and abort)
#$ -m ae

# change to scratch directory
cd $TMPDIR
# copy raw data to scratch directory
cp /afs/ifh.de/group/amanda/rawdata/bigfile bigfile
# write calculated output also to scratch first
/path/to/executable/which/works/on/bigfile -in bigfile -out output
# copy the output back into afs
cp output /afs/ifh.de/group/amanda/output/

  • Some notes on what the workflow of your job should look like:
  1. Please copy the data you need during your job to $TMPDIR (not to /tmp directly!) first.
  2. Let your job work on that data (generate the results in $TMPDIR also!).
  3. Before your job finishes copy the results back to AFS, dCache, etc. All data in $TMPDIR will be removed automatically when your job finishes!

3. Interactive job submission

  • You can get an interactive shell on an execution host.
  • Usual command: qrsh <resource requirements>
  • See Requesting resources for details about the available resources

4. Parallel Jobs

There are two kinds of parallel jobs Grid Engine supports:

1. Multithreaded / multiprocessor jobs on a single node:

  • You can reserve more than one core on a multicore host by specifying the qsub / qrsh parameter -pe multicore <number of cores>. This should of course only be done if your job is able to use all those cores simultaneously in an efficient way (i.e. 100% cpu usage per core).
  • As it is rather unlikely that you get the requested resources at once, you should always also specify the -R y switch to prevent your job from starving in the waiting queue.

2. MPI jobs:

  • use the parameter -pe mpi <number of cores>
  • see the Cluster pages for additional documentation about running parallel jobs of this kind.
  • to prevent "starving" jobs in the queue, request a reservation: qsub switch: -R y
Attention: Resources like h_rss & h_cpu are always requested per job slot, not per job! So you always have to adjust the requirements accordingly.

Example: your parallel multithreaded job consumes almost 8 GB memory altogether. This job runs in 4 slots (-pe multicore 4). So if you request -l h_rss=2G, your job is allowed to consume up to 8 GB resident memory (4*2GB).
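Since the limits apply per slot, the per-slot request is the job's total requirement divided by the number of slots. The following is only a sketch of that arithmetic: the helper name per_slot_rss_mb is hypothetical, and the numbers are the ones from the example above.

```shell
# Hypothetical helper: per-slot h_rss request (in MB) for a parallel job.
# $1 = memory the whole job needs (MB), $2 = number of slots (-pe multicore <n>)
per_slot_rss_mb() {
    # Round up so that slots * per-slot value always covers the total.
    echo $(( ($1 + $2 - 1) / $2 ))
}

slots=4
rss=$(per_slot_rss_mb 8192 "$slots")
# Only print the resulting command line; nothing is submitted here.
echo "qsub -pe multicore $slots -R y -l h_rss=${rss}M <job_script>"
```

For 8192 MB in 4 slots this yields h_rss=2048M, i.e. the 2G per slot used in the example.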
Here's a short overview of the available parallel environments:

PE name        description
multicore      intended for jobs using more than one CPU core on a single node
multicore-mpi  for jobs that run on more than one CPU core on a single node but parallelise via OpenMPI
mpi            for jobs linked against OpenMPI that run in parallel on several nodes

5. GPU Jobs

The Zeuthen batch farm provides a limited number of nodes with installed nVidia Tesla GPGPUs. The CUDA-SDK is installed on those nodes as well. To request a GPGPU use the qsub/qrsh switch -l gpu:

qsub -l gpu <other requirements> <job>


If you intend to use a specific GPGPU model, use the qsub/qrsh switch -l gpu_type=<model>. To see which models are available, execute this statement:

qhost -l gpu_type='*' -F gpu_type


Inside the gpu job you can access your reserved nVidia device by using the environment variable $SGE_GPU_DEVICE:

[bladed0] ~ % ls -l $SGE_GPU_DEVICE
crw------- 1 ahaupt sysprog 195, 0 Mar 7 21:24 /dev/nvidia0


In case you requested multiple GPGPUs per job, the environment variables $SGE_GPU_DEVICE0, $SGE_GPU_DEVICE1, ... hold the paths to your devices. Also $CUDA_VISIBLE_DEVICES is set accordingly.
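Inside a job script, the variables described above could be inspected as in the sketch below. The function name list_gpu_devices is hypothetical. Whether $SGE_GPU_DEVICE is also set when several GPGPUs are requested is not stated here, so the sketch simply prints whatever it finds.

```shell
# Print every GPU device path assigned to this job, one per line.
list_gpu_devices() {
    # Single-GPU case: $SGE_GPU_DEVICE holds the device path.
    if [ -n "${SGE_GPU_DEVICE:-}" ]; then
        echo "$SGE_GPU_DEVICE"
    fi
    # Multi-GPU case: $SGE_GPU_DEVICE0, $SGE_GPU_DEVICE1, ...
    idx=0
    while :; do
        eval "dev=\${SGE_GPU_DEVICE${idx}:-}"
        [ -n "$dev" ] || break
        echo "$dev"
        idx=$((idx + 1))
    done
}

list_gpu_devices
echo "CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-<not set>}"
```

Outside of a GPU job none of the variables are set, so only the last line is printed.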

6. Most common qsub / qrsh submission switches

This section describes the most common parameters for qsub and qrsh. For more details, read their man pages. The switches can also be specified directly in the job script, as shown in the example script in section 2.

Switch                         Description
-cwd                           execute the job from the current directory and not relative to your home directory; mostly used in conjunction with the -o and -e switches (if you just specify a relative path there)
-e <job's stderr output file>  specifies the path to the job's stderr output file (relative to the home directory, or to the current directory if the -cwd switch is used)
-hold_jid <job ids>            tell Grid Engine not to start the job until the specified jobs have finished successfully
-i <job's stdin file>          specifies the job's stdin
-j <y|n>                       merge the job's stderr with its stdout
-js <job share>                specifies the relative job share. Can be used to weight the importance of a single user's jobs. Higher numbers mean higher importance.
-l <job resource>              specifies the resources a job needs; multiple specifications can be stacked, see the next topic "Requesting resources" for details
-m <b|e|a>                     let Grid Engine send a mail on the job's begin (b), end (e) or abort (a)
-notify                        Grid Engine will notify the job about the upcoming kill (SIGKILL) by sending a SIGUSR2 first
-now <y|n>                     force / switch off an immediate execution of the job. y is the default for qrsh, n for qsub
-N <jobname>                   specifies the job name; the default is the name of the submitted script
-o <job's stdout output file>  specifies the path to the job's stdout output file; if you specify -j as well, it will also take the stderr output
-P <project>                   specifies the project the job should run under. See the projects topic for details
-pe <parallel environment>     specifies the parallel environment the job should run under. See the parallel job topic for details
-R <y|n>                       tell Grid Engine (not) to reserve a slot for huge jobs (i.e. parallel jobs, high demand of memory). This should prevent "starving" jobs in the waiting queue. If you know your job can run on the farm and it has already been waiting for a really long time, try this switch
-S <path to shell>             specifies the shell Grid Engine should start your job with. Default is /bin/zsh
-t <from-to:step>              submit an array job. That is essentially the same as submitting the same job several times (to - from + 1 times for a step size of 1). Optionally you can specify the step size between the task numbers. The task number within the array can be accessed in the job via the environment variable $SGE_TASK_ID. For more details see the man page
-tc <max running tasks>        in case you submit an array job, this switch limits the number of simultaneously running job tasks. Should be used e.g. to avoid overloading limited central services like network file systems.
-V                             pass the current shell environment on to the job. Note: LD_LIBRARY_PATH is not inherited.
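As an illustration of the -t and -tc switches, here is a sketch of an array job script that picks its input file from $SGE_TASK_ID. The file naming scheme (input_001.dat ... input_020.dat) and the helper input_for_task are hypothetical. The fallback to task 1 is only there so the script can also be tried outside of an array job.

```shell
#!/bin/bash
#$ -t 1-20
#$ -tc 5
#$ -l h_rt=01:00:00

# Map a task number to an input file name.
# The naming scheme input_001.dat ... input_020.dat is hypothetical.
input_for_task() {
    printf 'input_%03d.dat\n' "$1"
}

# Fall back to task 1 if $SGE_TASK_ID is unset or not a number,
# e.g. when the script is tried out outside of an array job.
task_id=${SGE_TASK_ID:-1}
case $task_id in
    *[!0-9]*) task_id=1 ;;
esac

input=$(input_for_task "$task_id")
echo "task $task_id would process $input"
```

With these directives the script runs as 20 tasks, at most 5 of them at the same time.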

7. Requesting resources

These are the most common resources you can request:

Note: jobs requesting less than 256M of memory (h_rss) will be rejected.

complex name  description                                   possible values  example              comment
arch          the required host architecture                i386,x86_64      -l arch=x86_64       currently useless since only 64 bit systems are available
os            the required host operating system            sl6,sl7          -l os=sl6            After 1st August 2018 no SL6-based nodes will be available!
h_rss         job's maximum resident memory usage (RSS)                      -l h_rss=2G          hard limit, default: 1G! Jobs requesting less than 256M will be rejected
s_cpu         job's maximum cpu time                                         -l s_cpu=03:00:00    soft limit
h_cpu         job's maximum cpu time                                         -l h_cpu=48:00:00    hard limit
s_rt          job's maximum wallclock time                                   -l s_rt=03:00:00     soft limit
h_rt          job's maximum wallclock time                                   -l h_rt=15:00:00     hard limit; if your job needs more than 30 minutes runtime, either h_cpu/s_cpu or h_rt/s_rt are mandatory
tmpdir_size   job's maximum scratch space usage in $TMPDIR                   -l tmpdir_size=5G    actually only needed if you need lots of space (>1G)
gpu           request a GPGPU                                                -l gpu
hostname      name of the execution host                                     -l hostname=bladeff  usage not recommended!

Some notes on requesting resources:

  • Only request the resources you really need! Especially for memory, this can increase your job throughput dramatically. Remember: the more resources you request, the lower the chances that your job gets started during the next scheduling interval! Very often only one job is running on a single node just because this job requests/reserves too much h_rss (although it does not actually need it), so a second job no longer fits on that node.
  • In order to make Grid Engine's feature "backfilling" work, either one of h_rt or s_rt needs to be specified for every job.
    If you omit those parameters, s_rt will be calculated and set automatically! s_rt = h_cpu + 5 minutes
  • Send a test job with a generous memory request (so that it won't die because of overuse) together with the mail option "-m ae". The batch system will send you an email when your job finishes. This email contains useful information about your job's resource consumption. Use it!
  • Most of the consumable resources (like h_cpu, h_rss, tmpdir_size) are requested per job slot - NOT per job. In the standard case (single threaded job running on one cpu core) this makes no difference. In case you submit parallel jobs, you have to keep this in mind!
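The automatic s_rt value mentioned above (h_cpu + 5 minutes) can be worked out as in the following sketch. The helper names dec, to_seconds and default_s_rt are hypothetical and only illustrate the arithmetic; the batch system does this calculation itself.

```shell
# Strip leading zeros so that e.g. "08" is not parsed as an octal number.
dec() { n=${1#"${1%%[!0]*}"}; echo "${n:-0}"; }

# Convert HH:MM:SS into seconds.
to_seconds() {
    hh=${1%%:*}; rest=${1#*:}; mm=${rest%%:*}; ss=${rest#*:}
    echo $(( $(dec "$hh") * 3600 + $(dec "$mm") * 60 + $(dec "$ss") ))
}

# Default s_rt as stated above: h_cpu + 5 minutes.
default_s_rt() {
    t=$(( $(to_seconds "$1") + 300 ))
    printf '%02d:%02d:%02d\n' $(( t / 3600 )) $(( t % 3600 / 60 )) $(( t % 60 ))
}

default_s_rt 13:00:00   # prints 13:05:00
```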


Some notes on signaling the end of the job:

  • With the qsub option -notify, a SIGUSR2 signal is sent to the job one minute before the job is suspended/killed.
  • You can also use soft limits. Just set them to a value a bit less than the hard limit and GE will notify your job with SIGUSR1 (SIGXCPU when using s_cpu, s_vmem) when the soft limit has been reached.

8. Projects

  • Every user must be member of at least one project.
  • Every job is assigned to a certain project. If not specified otherwise, all of a user's jobs run under her/his default project.
  • Your default project name is identical to your primary unix group. All available unix groups are mapped to an identically named project, so your SGE project memberships are identical to your unix group memberships.
  • If you do not want your job run under the default project, specify it with the qsub / qrsh parameter -P <project>
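Because projects mirror unix groups, the standard id command is enough to see which project names you can use. This is a small sketch built on that mapping; it does not query Grid Engine itself.

```shell
# Default project = primary unix group (as stated above).
default_project=$(id -gn)
# Projects you can pass to -P = all of your unix groups.
all_projects=$(id -Gn)

echo "default project:    $default_project"
echo "available projects: $all_projects"
```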

9. CPU core binding

All jobs requesting a runtime of more than 30 minutes will automatically run with cpu core binding enabled. You can of course change the default core-binding strategy - see qsub man page for details.

10. Some words about the stdout/stderr job output files

Unfortunately, Grid Engine stores the job's stdout/stderr files in your home directory by default. However, as soon as you run more than a few (say, 10 or more) jobs simultaneously, this will have a severe performance impact on the AFS file server hosting your home directory. That's why we strongly advise you to store those files in a different place (e.g. somewhere below your scratch volume at /afs/ifh.de/group/<your group>/scratch/<your account>). Furthermore, do not access this directory from within your job (e.g. make it the job's cwd, do an "ls" on it, etc.)!
Nevertheless, best would be to not create those files on a shared file system (AFS, Lustre) at all. To get this working, do something like this:

[oreade38] ~ % qsub -j y -o /dev/null <other requirements> <your jobscript>


Your job script should then start with a line like this:

exec > "$TMPDIR"/stdout.txt 2>"$TMPDIR"/stderr.txt


If you are interested in those two stdout/stderr files, you'll need to copy them to a common place (e.g. your scratch dir) at the end of the job. You can catch signals sent to the job in your script as shown here. USR1 is sent if you hit the s_rt limit, and 0 stands for the normal exit:

#$ -m e
#$ -l s_rt=00:00:02
echo "starting"
trap 'echo exiting normally' 0
trap 'echo exiting after USR1;exit 2' USR1
sleep 60
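The two ideas in this section (writing the logs to $TMPDIR and copying them away at the end of the job) can be combined with an exit trap. The sketch below assumes a hypothetical wrapper named run_with_local_logs, and the destination directory is whatever you pass in. It runs in a subshell, so the redirection does not affect the rest of the script.

```shell
# Usage: run_with_local_logs <log destination dir> <command> [args...]
# Hypothetical wrapper: run a command with stdout/stderr kept in the scratch
# area, then copy both files to the destination when the command has ended.
run_with_local_logs() (
    dest=$1; shift
    scratch=${TMPDIR:-/tmp}   # /tmp is only a fallback outside the batch system
    out="$scratch/stdout.txt"
    err="$scratch/stderr.txt"
    # On exit of this subshell, copy both log files to the destination.
    trap 'mkdir -p "$dest" && cp "$out" "$err" "$dest"/' EXIT
    exec > "$out" 2> "$err"
    "$@"
)
```

In a job script this could be called as run_with_local_logs <your log directory> <your program>, with the job itself submitted using -j y -o /dev/null as shown above.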

11. Examples

Each intention below is followed by the matching command (always without line breaks!).

You want to submit a job which needs 4 GB of RSS memory and 13 hours of CPU time.

qsub -l h_rss=4G -l h_cpu=13:00:00 <job_script>

Your job should run under the project z_nuastr (which you are a member of) although your default project is icecube. It needs only 20 minutes of CPU time.

qsub -P z_nuastr -l h_cpu=00:20:00 <job_script>

You want to submit the same 64 bit job 20 times at once. Every job needs 1.5 GB of RSS memory and runs for 30 hours.

qsub -t 1-20 -l h_rss=1.5G -l h_cpu=30:00:00 <job_script>

You want to have an interactive shell on a batch node. You intend to use this shell for three hours.

qrsh -l h_rt=03:00:00

Your job needs 48 hours of CPU time, but you are not sure whether this is enough. The job should receive a USR2 signal before actually being killed by the batch system. You also want to receive an email on the job's start, end and possible abort.

qsub -l h_cpu=48:00:00 -notify -m abe <job_script>

You have a number of jobs that run some lower priority "background tasks". You now want to submit more urgent jobs without killing / suspending the already running "background task" jobs.

qsub -l <resources> <low priority job>
qsub -l <resources> -js 10 <important_jobscript>

12. Best Practices

When running mass production on the farm, please keep in mind:

  • Memory (requested by h_rss) is a limited resource! In order to maximize throughput, keep it as small as possible and below the average 'per-slot amount' - see qhost output and have a look at the columns NCPU and MEMTOT. HINT: it's currently mostly ~4GB per slot ;-)
  • Keep a reasonable average job runtime (> 30 minutes)! Every job start and termination produces an overhead of some seconds. A good ratio of payload/overhead is desirable.
  • Minimize I/O (disk, network) as much as possible! I/O is probably the worst performance killer, so avoid any "unnecessary" operations.
  • Use array jobs whenever possible! They are much more convenient for the batch system scheduler.
  • Do not store large amounts of STDOUT/STDERR files into one directory! It will slow down directory access in Lustre, it will not in AFS at all - so use a reasonable subdirectory structure.
  • Keep the number of pending jobs small! Hundreds of thousands of pending jobs slow down the scheduler for everybody!
  • Check for enough free quota in the output directories before job submission! Insufficient quota is a typical source of massive job failures.
  • Think twice before submitting large amounts of jobs! Did you take all precautions to avoid massive job crashes in case of unforeseen problems? 1000s of dying jobs within a short period slow down the batch system for everybody!

By following these rules you will help everybody (including yourself) to get the most out of the existing, limited resources!

13. Other commands to handle your active jobs

Command         Function
qhold <job id>  Puts submitted (but not yet started) jobs into the 'hold' state, i.e. the job won't be considered for execution until it is 'released' again.
qrls <job id>   Removes the 'hold' state of a job.