DV-Zeuthen Computer Center

Batch System - Job submission
1. Basic requirements
2. Batch job submission
3. Interactive job submission
4. Parallel Jobs
5. GPU Jobs
6. Most common qsub / qrsh submission switches
7. Requesting resources
8. Projects
9. CPU core binding
10. Some words about the stdout/stderr job output files
11. Examples
12. Best Practices
13. Other commands to handle your active jobs

1. Basic requirements

2. Batch job submission

#!/bin/zsh
#
#(otherwise the default shell would be used)
#$ -S /bin/bash
#(the running time for this job)
#$ -l h_rt=05:30:00
#
#(the maximum memory usage of this job)
#$ -l h_rss=2G
#
#(stderr and stdout are merged together to stdout)
#$ -j y
#
#(send mail on job's end and abort)
#$ -m ae

# change to scratch directory
cd $TMPDIR
# copy raw data to scratch directory
cp /afs/ifh.de/group/amanda/rawdata/bigfile bigfile
# write calculated output also to scratch first
/path/to/executable/which/works/on/bigfile -in bigfile -out output
# copy the output back into afs
cp output /afs/ifh.de/group/amanda/output/ 

1. Please copy the data you need during your job to $TMPDIR (not to /tmp directly!) first.
2. Let your job work on that data (and generate the results in $TMPDIR as well!).
3. Before your job finishes, copy the results back to AFS, dCache, etc. All data in $TMPDIR will be removed automatically when your job finishes!
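
A job script like the one above is handed to the batch system with qsub; a minimal sketch (the script name is only a placeholder):

# submit the job script shown above
qsub job_script.sh

# check the state of your waiting and running jobs
qstat -u $USER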

3. Interactive job submission
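
Interactive jobs are started with qrsh instead of qsub (see also the examples further below); a minimal sketch, the command name being only a placeholder:

# open an interactive shell on a batch node for up to three hours
qrsh -l h_rt=03:00:00

# run a single command interactively
qrsh -l h_rt=00:30:00 ./my_analysis --check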

4. Parallel Jobs

Grid Engine supports two kinds of parallel jobs:

1. Multithreaded / multiprocessor jobs running on a single node

2. MPI jobs running on one or more nodes

Attention

Resources like h_rss & h_cpu are always requested per job slot, not per job! So you always have to adjust the requirements accordingly.

Example: your parallel multithreaded job consumes almost 8 GB memory altogether. This job runs in 4 slots (-pe multicore 4). So if you request -l h_rss=2G, your job is allowed to consume up to 8 GB resident memory (4*2GB).
Here's a short overview of the available parallel environments:

PE name         Description
multicore       intended for jobs using more than one CPU core on a single node
multicore-mpi   jobs running on more than one CPU core on a single node but parallelised via OpenMPI
mpi             jobs linked against OpenMPI and running in parallel on several nodes
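
As a sketch of how the per-slot accounting from the attention note above translates into a submission command (memory and runtime values are only illustrative):

# multithreaded job: 4 slots on one node, 2G resident memory per slot (8 GB for the whole job)
qsub -pe multicore 4 -l h_rss=2G -l h_rt=12:00:00 <job_script>

# OpenMPI job spread over several nodes, 8 slots in total
qsub -pe mpi 8 -l h_rss=1G -l h_rt=12:00:00 <mpi_job_script>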

5. GPU Jobs

The Zeuthen batch farm provides a limited number of nodes with nVidia Tesla GPGPUs installed. The CUDA SDK is installed on those nodes as well. To request a GPGPU, use the qsub/qrsh switch -l gpu=1:

qsub -l gpu=1 <other requirements> <job>


If you intend to use a specific GPGPU model, use the qsub/qrsh switch -l gpu_type=<model>. To see which models are available, execute this statement:

qhost -l gpu_type='*' -F gpu_type


Inside the gpu job you can access your reserved nVidia device by using the environment variable $SGE_GPU_DEVICE:

[bladed0] ~ % ls -l $SGE_GPU_DEVICE
crw------- 1 ahaupt sysprog 195, 0 Mar 7 21:24 /dev/nvidia0


In case you requested multiple GPGPUs per job (-l gpu=<number>), the environment variables $SGE_GPU_DEVICE0, $SGE_GPU_DEVICE1, ... hold the paths to your devices. Also $CUDA_VISIBLE_DEVICES is set accordingly.
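
A minimal GPU job script sketch combining the switches above (the application path is only a placeholder):

#!/bin/zsh
#$ -l gpu=1
#$ -l h_rt=04:00:00
#$ -l h_rss=4G

# the batch system exports the path of the reserved device
echo "reserved GPU device: $SGE_GPU_DEVICE"
echo "visible CUDA devices: $CUDA_VISIBLE_DEVICES"

# start the actual CUDA application (placeholder path)
/path/to/cuda_application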

6. Most common qsub / qrsh submission switches

This section describes the most common parameters for qsub and qrsh. For more details, read their man pages. The switches can also be specified directly in the job script, as shown in the example script above.

-cwd
    execute the job from the current directory and not relative to your home directory; mostly used in conjunction with the -o and -e switches (if you just specify a relative path there)
-e <job's stderr output file>
    specifies the path to the job's stderr output file (relative to the home directory, or to the current directory if the -cwd switch is used)
-hold_jid <job ids>
    tell Grid Engine not to start the job until the specified jobs have finished successfully
-i <job's stdin file>
    specifies the job's stdin
-j <y|n>
    merge the job's stderr with its stdout
-js <job share>
    specifies the relative job share; can be used to weight the importance of a single user's jobs, higher numbers mean higher importance
-l <job resource>
    specifies the resources a job needs; multiple specifications can be stacked, see the next topic "Requesting resources" for details
-m <b|e|a>
    let Grid Engine send a mail on the job's:
        b : begin
        e : end
        a : abort
-notify
    Grid Engine will announce an upcoming abort (SIGKILL) to the job by sending a SIGUSR2 first
-now <y|n>
    force / switch off an immediate execution of the job; y is the default for qrsh, n for qsub
-N <jobname>
    specifies the job name, default is the name of the submitted script
-o <job's stdout output file>
    specifies the path to the job's stdout output file; if you specify -j as well, it will also receive the stderr output
-P <project>
    specifies the project the job should run under; see the projects topic for details
-pe <parallel environment>
    specifies the parallel environment the job should run under; see the parallel jobs topic for details
-R <y|n>
    tell Grid Engine (not) to reserve a slot for huge jobs (i.e. parallel jobs, high demand of memory); this should prevent "starving" jobs in the waiting queue. If you know your job can run on the farm and it has already been waiting a really long time, try this switch.
-S <path to shell>
    specifies the shell Grid Engine should start your job with; default is /bin/zsh
-t <from-to:step>
    submit an array job, i.e. the same job is run once per task number from <from> to <to>. Optionally you can specify the step size between the task numbers. The task number can be accessed inside the job via the environment variable $SGE_TASK_ID. For more details see the man page; a sketch of an array job script follows after this list.
-tc <max running tasks>
    in case you submit an array job, this switch limits the number of simultaneously running tasks; should be used e.g. to avoid overloading limited central services like network file systems
-V
    pass the current shell environment on to the job. Note: LD_LIBRARY_PATH is not inherited.
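
As referenced in the -t entry, here is a sketch of an array job script that uses $SGE_TASK_ID (paths and file naming are only placeholders):

#!/bin/zsh
#$ -S /bin/bash
#$ -t 1-20
#$ -tc 5
#$ -l h_rt=01:00:00
#$ -l h_rss=1G

# each task picks its own input file via its task number (naming is an assumption)
INPUT=/path/to/inputs/input_${SGE_TASK_ID}.dat

cd $TMPDIR
cp $INPUT .
# ... process the file here and copy the results back, as in the batch job example above ...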

7. Requesting resources

These are the most common resources you can request:


arch
    description:      the required host architecture
    possible values:  i386, x86_64
    example:          -l arch=x86_64
    comment:          currently useless since only 64-bit systems are available

os
    description:      the required host operating system
    possible values:  sl7
    example:          -l os=sl7
    comment:          currently useless as only nodes running SL7 are available

h_rss
    description:      job's maximum resident memory usage (RSS)
    example:          -l h_rss=2G
    comment:          hard limit, default: 1G! Jobs requesting less than 256M will be rejected.

s_cpu
    description:      job's maximum CPU time
    example:          -l s_cpu=03:00:00
    comment:          soft limit

h_cpu
    description:      job's maximum CPU time
    example:          -l h_cpu=48:00:00
    comment:          hard limit, usage is discouraged in favour of h_rt

s_rt
    description:      job's maximum wallclock time
    example:          -l s_rt=03:00:00
    comment:          soft limit

h_rt
    description:      job's maximum wallclock time
    example:          -l h_rt=15:00:00
    comment:          hard limit; if your job needs more than 30 minutes of runtime, either h_cpu/s_cpu or h_rt/s_rt is mandatory

tmpdir_size
    description:      job's maximum scratch space usage in $TMPDIR
    example:          -l tmpdir_size=5G
    comment:          actually only needed if you need lots of space (>1G)

gpu
    description:      request a GPGPU
    example:          -l gpu=1

hostname
    description:      name of the execution host
    example:          -l hostname=bladeff
    comment:          usage not recommended!

Some notes on requesting resources:


Some notes on signaling the end of the job:

8. Projects

9. CPU core binding

All jobs requesting a runtime of more than 30 minutes will automatically run with CPU core binding enabled. You can of course change the default core-binding strategy; see the qsub man page for details.
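
A hedged sketch of overriding the default strategy via the -binding submission switch (exact availability and syntax depend on the installed Grid Engine version, so verify against the qsub man page mentioned above; all values are only illustrative):

# bind a 4-slot multithreaded job to four successive CPU cores
qsub -binding linear:4 -pe multicore 4 -l h_rss=2G -l h_rt=12:00:00 <job_script>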

10. Some words about the stdout/stderr job output files

Unfortunately, Grid Engine stores the job's stdout/stderr files in your home directory by default. However, as soon as you run more than a few (say: 10 or more) jobs simultaneously, this will cause severe performance problems on the AFS file server hosting your home directory. That's why we strongly advise you to store those files in a different place (e.g. somewhere below your scratch volume at /afs/ifh.de/group/<your group>/scratch/<your account>). Furthermore, do not access this directory from within your job (e.g. by making it the job's cwd or running "ls" on it)!

Best of all, however, is not to create those files on a shared file system (AFS, Lustre) at all. To achieve this, do something like this:

[oreade38] ~ % qsub -j y -o /dev/null <other requirements> <your jobscript>


Your job script should then start with a line like this:

exec > "$TMPDIR"/stdout.txt 2>"$TMPDIR"/stderr.txt


If you are interested in those two stdout/stderr files, you'll need to copy them to a common place (e.g. your scratch directory) at the end of the job. You can catch signals sent to the job in your script as shown below: USR1 is sent if you hit the s_rt limit, 0 stands for the normal exit:

#$ -m e
#$ -l s_rt=00:00:02
echo "starting"
# runs on normal exit (signal spec 0)
trap 'echo exiting normally' 0
# runs when the s_rt soft limit is reached (SIGUSR1)
trap 'echo exiting after USR1;exit 2' USR1
sleep 60
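
Putting the pieces together, a sketch of a job script that keeps stdout/stderr off the shared file systems and copies them back at the end (the target scratch path is only a placeholder):

#!/bin/zsh
#$ -S /bin/bash
#$ -j y
#$ -o /dev/null
#$ -l h_rt=01:00:00

# redirect all further output to local scratch instead of AFS
exec > "$TMPDIR"/stdout.txt 2> "$TMPDIR"/stderr.txt

# copy the output files back on normal exit or when the s_rt limit (USR1) is hit
finish() {
    cp "$TMPDIR"/stdout.txt "$TMPDIR"/stderr.txt /afs/ifh.de/group/<your group>/scratch/<your account>/
}
trap finish 0 USR1

# ... actual work here ...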

11. Examples

(Commands are always written without line breaks!)

Intention: You want to submit a job which needs 4 GB RSS memory and 13 hours of runtime.
Command:   qsub -l h_rss=4G -l h_rt=13:00:00 <job_script>

Intention: Your job should run under the project z_nuastr (which you are a member of) although your default project is icecube. It needs only 20 minutes of runtime.
Command:   qsub -P z_nuastr -l h_rt=00:20:00 <job_script>

Intention: You want to submit the same 64-bit job 20 times at once. Every job needs 1.5 GB RSS memory and runs for 30 hours.
Command:   qsub -t 1-20 -l h_rss=1.5G -l h_rt=30:00:00 <job_script>

Intention: You want to have an interactive shell on a batch node. You intend to use this shell for three hours.
Command:   qrsh -l h_rt=03:00:00

Intention: Your job needs 48 hours of runtime, but you are not sure whether this is enough. The job should receive a USR2 signal before actually being killed by the batch system. You further want to receive an email on the job's start, end and possible abort.
Command:   qsub -l h_rt=48:00:00 -notify -m abe <job_script>

Intention: You have a number of jobs that run some lower-priority "background tasks". You now want to submit more urgent jobs without killing or suspending the already running "background task" jobs.
Commands:  qsub -l <resources> <low priority job>
           qsub -l <resources> -js 10 <important_jobscript>

12. Best Practices

When running mass production on the farm, please keep in mind:

By following these rules you will help everybody (including yourself) to get the most out of the existing, limited resources!

13. Other commands to handle your active jobs

Please also consult the man pages of the mentioned commands for more detailed usage information!

qdel <job id> [job id] ...
    delete job(s) from the batch system
qhold <job id> [job id] ...
    put submitted (but not yet started) jobs into the 'hold' state, i.e. the job won't be considered for execution until it is 'released' again
qrls <job id> [job id] ...
    remove the 'hold' state of a job
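
A short usage sketch of these commands (the job IDs are only placeholders):

# remove two jobs from the batch system
qdel 1234567 1234568

# keep a waiting job from being started, then release it again later
qhold 1234569
qrls 1234569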