Batch System - Troubleshooting

1. Common problems
2. Retrieving Job Status Information
3. SGE Failure and Exit Codes

1. Common problems

1.1 Your job "starves" in the waiting queue

Possible reason: The farm is full.
Example: -
Solution: Check the output of "qstat -g c" for available nodes.

Possible reason: You requested resources that cannot be fulfilled.
Example: A request with -l h_cpu > 48:00:00.
Solution: Request at most 48 hours of CPU time.

Possible reason: Your job is in error state.
Example: qstat lists your job in "Eqw" state.
Solution: Check the reason for the error (qstat -j <job_id>) and remove the error flag (qmod -cj <job_id>).

Possible reason: You requested large amounts of consumable resources (high h_rss, or your job requested a PE).
Example:
  qsub -pe multicore 8 <jobscript>
  qsub -l h_rss=30G <jobscript>
Solution: Additionally use job reservation
  (qsub switch: -R y)
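For heavy consumable requests, the reservation switch can be combined directly with the examples above (a sketch; <jobscript> is a placeholder):

```shell
# Sketch: request a parallel environment plus large memory, with job
# reservation (-R y) so SGE accumulates slots instead of starving the job.
qsub -pe multicore 8 -l h_rss=30G -R y <jobscript>
```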

1.2 Only some of a set of identical jobs die

Possible reason: You did not specify your requirements correctly.
Example: You did not specify h_cpu.
Solution: If h_cpu is not specified, your job might be scheduled into short queues, where jobs needing more than 30 minutes of CPU time are killed. Always request the CPU time your job needs.

Possible reason: Too many jobs access data on the same file server at once.
Example: -
Solution: Use AFS! Do not submit too many jobs at once. If you really need to, try the qsub "-hold_jid" option.
Read the article about optimal storage usage at DESY.
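Throttling via "-hold_jid" can be sketched as a simple dependency chain (the job names batch1/batch2 and <jobscript> are placeholders):

```shell
# Sketch: submit the second batch with a hold on the first, so both
# batches never hit the file server at the same time.
qsub -N batch1 <jobscript>
qsub -N batch2 -hold_jid batch1 <jobscript>
```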

1.3 All your jobs die at once

Possible reason: There are problems writing the log files (the job's STDOUT/STDERR).

Example: The log directory (located in AFS) contains too many files. SGE's error mail (qsub parameter '-m a') contains a line like "/afs/ifh.de/...: File too large".
Solution: Do not store more than 1000 output files per directory.

Example: The output directory is not writable. SGE's error mail contains a line like "/afs/ifh.de/...: permission denied".
Solution: Check the directory permissions.

Example: The log directory does not exist on the execution host.
Solution: Use only network filesystems (AFS, NFS) as log directories. Local directories (e.g. /usr1/scratch) won't work.
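The 1000-files limit above can be kept by spreading log files over dated subdirectories; a minimal sketch, assuming a layout under $HOME (the directory name and jobscript name are examples, not site conventions):

```shell
# Create one log directory per day (assumed layout) so that no single
# AFS directory accumulates more than ~1000 log files.
LOGDIR="$HOME/batch-logs/$(date +%Y-%m-%d)"
mkdir -p "$LOGDIR"
# Point SGE's STDOUT/STDERR there at submission time, e.g.:
#   qsub -o "$LOGDIR" -e "$LOGDIR" jobscript.sh
echo "$LOGDIR"
```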

1.4 qrsh fails with an error message complaining 'Your "qrsh" request could not be scheduled, try again later'

Possible reason: The farm is full and qrsh tries to occupy a slot immediately.
Example: -
Solution: Try "qrsh -now n <other requirements>".
That way your request is put into the waiting queue and no immediate execution is forced.


2. Retrieving Job Status Information

Shortly after jobs have finished, their status information is no longer accessible via the normal SGE commands. The MACBAT web page contains, in the menu of a given farm, the heading 'Reporting - Finished Jobs'. The related link gives access to information about your finished jobs: it lists an overview of finished jobs per day, from which listings of the jobs finished on a given date can be retrieved. By following further links, every job detail can be displayed, down to the single tasks of array jobs.
The same information can be obtained with command line tools on Linux. The command arcx sgejobs retrieves this information; a short usage summary is printed with

arcx sgejobs -h


By default the information is displayed for the default farm ('uge'); another farm ('pax' or the old 'sge') can be chosen with the -f=<farm> switch. Information is shown only to authenticated users, and only for their own jobs. Group admins can be registered (please contact UCO) who are then able to view information on jobs belonging to other users of their group.
If arcx sgejobs is called without further arguments, a list of submission dates and the number of jobs submitted on each day is printed:

arcx sgejobs

Submit date    Jobs  User
2015-09-01     12    ahaupt

If a submission date or a submission interval is given (date format yyyy-mm-dd), the job data are printed in tabular form:

arcx sgejobs 2015-09-01

=== Tue Sep 1 2015 ===

Job ID  Hostname  Jobname         Submit    Delay  Run    %    Memory   Fail  Exit
944180  bladege   farmHEPSPEC.sh  15:41:01  46     12963  192  1320.0M  0     0
946942  bladecd   farmHEPSPEC.sh  21:12:16  11     9572   96   720.3M   0     0

Finally, if a job number is given, the full information belonging to that job is displayed:

arcx sgejobs 946942

         qname = std.q
         hostname = bladecd.zeuthen.desy.de
         unixgroup = sysprog
         owner = ahaupt
         job_name = farmHEPSPEC.sh
         job_number = 946942
         submission_time = 1441134736
         start_time = 1441134747
         end_time = 1441144320
         failed = 0
         exit_status = 0
         ru_wallclock = 9572
         ru_utime = 9047
         ru_stime = 166
         ru_maxrss = 493216
         ru_minflt = 23156122
         ru_majflt = 483
         ru_inblock = 2699188
         ru_oublock = 7691376
         ru_nvcsw = 114531
         ru_nivcsw = 943807
         project = sysprog
         granted_pe = NONE
         slots = 1
         task_number = 0
         cpu = 9213
         mem = 1023.24
         category = -l h_cpu=32400,h_rss=2G,h_stack=10M,hostname=bladec*,m_mem_free=2.1G,s_rt=32700,tmpdir_size=5G -P sysprog -binding linear_automatic 1 0 0 0 no_explicit_binding
         pe_taskid = NONE
         maxvmem = 720318000

3. SGE Failure and Exit Codes

The exit code is the return value of the exiting program. It can be a user-defined value if the job finishes with a call to 'exit(number)'. For abnormally terminated jobs it is the signal number + 128. If an SGE job is terminated because a limit was exceeded, SGE sends a SIGUSR1 signal (10) to the job, which results in an exit code of 138.
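The "signal number + 128" rule can be verified directly in a shell, independent of SGE:

```shell
# A process killed by SIGUSR1 (signal number 10) is reported by the
# shell with exit status 128 + 10 = 138 -- the same value SGE records
# when it terminates a job that exceeded a limit.
sh -c 'kill -USR1 $$'
echo "exit status: $?"   # prints: exit status: 138
```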
The SGE failure code indicates why a job was abnormally terminated. The following (incomplete) list shows the most frequent failure codes:

code  meaning
  1   failure before job (execd)
  7   failure before prolog
  8   failure in prolog
 10   failure in pestart
 11   failure before job (shepherd)
 15   failure in epilog
 19   no exit status
 21   failure in recognizing job
 25   rescheduling
 26   failure opening output
 27   no shell
 28   no current working dir
 29   AFS problem
 30   rescheduling on application error
 36   check daemon configuration
 37   qmaster enforced h_rt limit
100   failure after job