Batch System - Troubleshooting

1. Common problems
2. Retrieving Job Status Information
3. SGE Failure and Exit Codes

1. Common problems

1.1 Your job "starves" in the waiting queue

Possible reason: The farm is full.
Solution:        Check the output of "qstat -g c" for available nodes.

Possible reason: You requested resources which cannot be fulfilled.
Example:         -l h_cpu > 48:00:00
Solution:        Only CPU times below 48 hours can be requested.

Possible reason: Your job is in an error state.
Example:         qstat lists your job in "Eqw" state.
Solution:        Check the reason for the error and remove the error flag.

Possible reason: You requested large amounts of consumable resources
                 (a high h_rss value, or your job requested a PE).
Example:         qsub -pe multicore 8 <jobscript>
                 qsub -l h_rss=30G <jobscript>
Solution:        Additionally use job reservation (qsub switch: -R y),
                 as shown in the sketch below.
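A minimal sketch combining the pieces above (jobscript.sh stands for your own submit script):

# check cluster-wide slot availability first:
qstat -g c

# submission with large consumable requests; -R y turns on job
# reservation, so the scheduler collects slots for this job instead of
# letting smaller jobs overtake it indefinitely:
qsub -R y -pe multicore 8 -l h_rss=30G jobscript.sh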

1.2 Only some of a set of identical jobs die

Possible reason: You did not specify your requirements correctly.
Example:         You did not specify h_cpu.
Solution:        If h_cpu is not specified, your job might run in short
                 queues; if it then needs more than 30 minutes of CPU
                 time, it will be killed.

Possible reason: Too many jobs access data on the same file server at once.
Solution:        Use AFS! Do not submit too many jobs at once; if you
                 really need to, try the qsub "-hold_jid" option, as in
                 the sketch below. Read the article about optimal storage
                 usage at DESY.
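A sketch of job chaining with -hold_jid (stage1.sh and stage2.sh are placeholders for your own scripts; the -terse switch makes qsub print only the job ID):

# submit the first job and capture its ID:
JOBID=$(qsub -terse -l h_cpu=02:00:00 stage1.sh)

# hold the second job until the first has finished, so both do not
# hit the same file server at once:
qsub -hold_jid "$JOBID" -l h_cpu=02:00:00 stage2.sh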

1.3 All your jobs die at once

Possible reason: There are problems writing the log files
                 (the job's STDOUT/STDERR).
Example:         The log directory (located in AFS) contains too many
                 files; SGE's error mail (qsub parameter '-m a') contains
                 a line saying something like "/afs/ifh.de/...: File too
                 large".
Solution:        Do not store more than 1000 output files per directory.

Example:         The output directory is not writable; SGE's error mail
                 contains a line saying something like "/afs/ifh.de/...:
                 permission denied".
Solution:        Check the directory permissions.

Example:         The log directory does not exist on the execution host.
Solution:        Only network-enabled filesystems (AFS, NFS) work as log
                 directories; local directories (e.g. /usr1/scratch)
                 won't. See the sketch below.
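A sketch of a submission with an explicit, network-visible log directory (the path is illustrative and assumes your home directory lives in AFS):

# the log directory must exist and be writable before the job starts:
LOGDIR=$HOME/batch-logs
mkdir -p "$LOGDIR"

# point STDOUT/STDERR there and request error mail (-m a), so problems
# writing the logs show up in your mailbox:
qsub -o "$LOGDIR" -e "$LOGDIR" -m a jobscript.sh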

1.4 qrsh fails with the error message 'Your "qrsh" request could not be scheduled, try again later'

Possible reason: The farm is full and qrsh wants to occupy a slot at once.
Solution:        Try "qrsh -now n <other requirements>". That way your
                 request is put into the waiting queue and no immediate
                 execution is forced; see the example below.
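For example (the resource values are purely illustrative):

qrsh -now n -l h_cpu=01:00:00,h_rss=2G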


2. Retrieving Job Status Information

Shortly after a job has finished, its status information is no longer accessible via the normal SGE commands. The MACBAT web page contains, in the menu of a given farm, the heading 'Reporting - Finished Jobs'. The related link gives access to the information on your finished jobs: it lists an overview of finished jobs per day, listings of the jobs that finished on a given date can be retrieved from there, and by following the links further, every job detail can be displayed, down to the single tasks of array jobs.
The same information can be obtained with command line tools on Linux; the command arcx sgejobs is provided for retrieving it. A short usage message is printed with

arcx sgejobs -h


The information is displayed for the default farm ('uge'); another farm ('pax' or the old 'sge') can be chosen with the -f=<farm> switch. Information is only shown to authenticated users, and only for their own jobs. Group admins can be registered (please contact UCO) who are then able to view information on the jobs of other users in their group.
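
For example, to query the pax farm instead of the default:

arcx sgejobs -f=pax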
If arcx sgejobs is called without further arguments, a list of submission dates and the number of jobs submitted on each of those dates is printed.

arcx sgejobs

Submit date  Jobs  User
2015-09-01     12  ahaupt

If a submission date or a submission interval is given (date format yyyy-mm-dd), the job data are printed in tabular form.

arcx sgejobs 2015-09-01

=== Tue Sep 1 2015 ===
Job ID  Hostname  Jobname         Submit    Delay    Run    %   Memory  Fail  Exit
944180  bladege   farmHEPSPEC.sh  15:41:01     46  12963  192  1320.0M     0     0
946942  bladecd   farmHEPSPEC.sh  21:12:16     11   9572   96   720.3M     0     0

Finally, if a job number is given,

arcx sgejobs 946942

then the full information belonging to that job is displayed.

arcx sgejobs 946942

         qname = std.q
         hostname = bladecd.zeuthen.desy.de
         unixgroup = sysprog
         owner = ahaupt
         job_name = farmHEPSPEC.sh
         job_number = 946942
         submission_time = 1441134736
         start_time = 1441134747
         end_time = 1441144320
         failed = 0
         exit_status = 0
         ru_wallclock = 9572
         ru_utime = 9047
         ru_stime = 166
         ru_maxrss = 493216
         ru_minflt = 23156122
         ru_majflt = 483
         ru_inblock = 2699188
         ru_oublock = 7691376
         ru_nvcsw = 114531
         ru_nivcsw = 943807
         project = sysprog
         granted_pe = NONE
         slots = 1
         task_number = 0
         cpu = 9213
         mem = 1023.24
         category = -l h_cpu=32400,h_rss=2G,h_stack=10M,hostname=bladec*,m_mem_free=2.1G,s_rt=32700,tmpdir_size=5G -P sysprog -binding linear_automatic 1 0 0 0 no_explicit_binding
         pe_taskid = NONE
         maxvmem = 720318000
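
The submission_time, start_time and end_time fields are Unix epoch seconds; on Linux they can be converted to a readable date with GNU date:

date -d @1441134736   # prints the submission time in your local time zone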

3. SGE Failure and Exit Codes

The exit code is the return value of the exiting program. It can be a user-defined value if the job finishes with a call to 'exit(number)'. For abnormally terminated jobs it is the signal number + 128. If an SGE job is terminated because a limit was exceeded, SGE sends a SIGUSR1 signal (10) to the job, which results in an exit code of 138.
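
A minimal job script sketch reacting to that signal (save_results is a placeholder for your own checkpoint logic; whether the handler gets enough time to run before the hard limit strikes depends on the configured grace period):

#!/bin/bash

# exit code convention: 128 + signal number (SIGUSR1 = 10) -> 138
on_usr1() {
    echo "SIGUSR1 caught: a limit is about to be enforced" >&2
    # save_results   # placeholder for checkpoint/cleanup logic
    exit 138
}
trap on_usr1 USR1

# ... actual payload of the job runs here ...

Conversely, an exit code above 128 can be decoded on the command line with e.g. 'kill -l $((138-128))', which prints the signal name (USR1).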
The SGE failure code indicates why a job was abnormally terminated. The following incomplete list mentions the most frequent failure codes:

code  meaning
   1  failure before job (execd)
   7  failure before prolog
   8  failure in prolog
  10  failure in pestart
  11  failure before job (shepherd)
  15  failure in epilog
  19  no exit status
  21  failure in recognizing job
  25  rescheduling
  26  failure opening output
  27  no shell
  28  no current working dir
  29  AFS problem
  30  rescheduling on application error
  36  check daemon configuration
  37  qmaster enforced h_rt limit
 100  failure after job