Troubleshooting
Batch System - Troubleshooting
1. Common problems
2. Retrieving Job Status Information
3. SGE Failure and Exit Codes
1. Common problems
1.1 Your job "starves" in the waiting queue
possible reason | example | solution |
---|---|---|
The farm is full. | check the output of "qstat -g c" for available nodes | |
You requested resources which cannot be fulfilled. | -l h_cpu > 48:00:00 | you can just request cpu time < 48 hours |
Your job is in error state. | qstat lists your job in Eqw state | Check the reason for the error and remove the error flag (details about it can be found here) |
You requested high amounts of consumable resources (h_rss, your jobs requested a PE). | qsub -pe multicore 8 <jobscript>, qsub -l h_rss=30G <jobscript> | Use job reservation additionally! (qsub switch: -R y) |
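The 48 hour limit from the table can be checked before submitting. The following is a minimal sketch, assuming that h_cpu is written either as hours:minutes:seconds or as plain seconds, and that a request must stay strictly below 48 hours as the table suggests. The function names are hypothetical and not part of SGE.

```python
MAX_H_CPU_SECONDS = 48 * 3600  # limit quoted in the table above


def h_cpu_to_seconds(value: str) -> int:
    """Convert an h_cpu value ('HH:MM:SS' or plain seconds) to seconds."""
    parts = [int(p) for p in value.split(":")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) != 3:
        raise ValueError(f"unexpected h_cpu format: {value!r}")
    hours, minutes, seconds = parts
    return hours * 3600 + minutes * 60 + seconds


def fits_farm_limit(value: str) -> bool:
    """True if the request is below the 48 hour limit (assumed strict)."""
    return h_cpu_to_seconds(value) < MAX_H_CPU_SECONDS


print(h_cpu_to_seconds("09:00:00"))  # 32400
print(fits_farm_limit("72:00:00"))   # False
```

Whether a request of exactly 48:00:00 is accepted is not stated in the table, so the strict comparison is only a cautious choice.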
1.2 Only some of a set of identical jobs die
possible reason | example | solution |
---|---|---|
You did not specify your requirements correctly. | You did not specify h_cpu. | If h_cpu is not specified, your job might run on short queues. If your job needs more than 30 minutes cpu time, it will be killed. |
Too many jobs access data on the same file server at once. | | Use AFS! Do not submit too many jobs at once. If you really need to, try using the qsub "-hold_jid" option. Read the article about optimal storage usage at DESY. |
1.3 All your jobs die at once
possible reason | example | solution |
---|---|---|
There are problems writing the log files (job's STDOUT/STDERR). | The log directory (located in AFS) contains too many files. SGE's error mail (qsub parameter '-m a') contains a line saying something like "/afs/ifh.de/...: File too large". | Do not store more than 1000 output files per directory. |
 | The output directory is not writable. SGE's error mail contains a line saying something like "/afs/ifh.de/...: permission denied". | Check directory permissions. |
 | The log directory does not exist on the execution host. | You can only use network enabled filesystems (AFS, NFS) as log directory. Local directories (e.g. /usr1/scratch) won't work. |
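The checks in the table can be run on a log directory before submitting a large set of jobs. This is a rough sketch only: the function names are hypothetical, the 1000 file threshold is the recommendation from the table, and os.access is only an approximation on AFS, where access is governed by ACLs. A directory that exists on the submit host is also not guaranteed to exist on the execution host.

```python
import os
from typing import List

MAX_FILES_PER_DIR = 1000  # recommendation from the table above


def count_entries(directory: str) -> int:
    """Count the entries in a directory without building a full list."""
    with os.scandir(directory) as entries:
        return sum(1 for _ in entries)


def log_dir_problems(directory: str) -> List[str]:
    """Return a list of likely problems with a job log directory."""
    if not os.path.isdir(directory):
        return [f"{directory}: does not exist or is not a directory"]
    problems = []
    if not os.access(directory, os.W_OK):
        problems.append(f"{directory}: not writable for the current user")
    n = count_entries(directory)
    if n > MAX_FILES_PER_DIR:
        problems.append(f"{directory}: {n} entries, more than {MAX_FILES_PER_DIR}")
    return problems


# Check the current directory as an example.
print(log_dir_problems("."))
```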
1.4 qrsh fails with an error message complaining 'Your "qrsh" request could not be scheduled, try again later'
possible reason | example | solution |
---|---|---|
The farm is full and qrsh wants to occupy a slot at once. | You did not specify h_cpu. | Try "qrsh -now n <other requirements>". That way your request will be put into the waiting queue and no immediate execution will be forced. |
2. Retrieving Job Status Information
Shortly after jobs have finished, their status information is no longer accessible using the normal SGE commands. On the MACBAT web page, the menu for a given farm contains the heading 'Reporting - Finished Jobs'. The link there gives access to the information on your finished jobs. It shows an overview of finished jobs per day, from which listings of the jobs finished on a given date can be retrieved. By following the links further, every job detail can be displayed, down to the single tasks of array jobs.
The same information can be obtained using command line tools on Linux. The command arcx sgejobs is provided for this purpose. A short usage message is printed with
arcx sgejobs -h
The information is displayed for the default farm ('uge'), but another farm ('pax' or the old 'sge') can be chosen using the -f=<farm> switch. Information is displayed only to authenticated users and only for their own jobs. Group admins can be registered (please contact UCO); they are then able to view information on jobs belonging to other users of their group.
If arcx sgejobs is called without further arguments, a list of submission dates and the number of jobs submitted on each day is printed.
arcx sgejobs
Submit date | Jobs | User |
---|---|---|
2015-09-01 | 12 | ahaupt |
If a submission date or submission interval is given (date format yyyy-mm-dd), the job data are printed in tabular form.
arcx sgejobs 2015-09-01
=== Tue Sep 1 2015 ===
Job ID | Hostname | Jobname | Submit | Delay | Run | % | Memory | Fail | Exit |
---|---|---|---|---|---|---|---|---|---|
944180 | bladege | farmHEPSPEC.sh | 15:41:01 | 46 | 12963 | 192 | 1320.0M | 0 | 0 |
946942 | bladecd | farmHEPSPEC.sh | 21:12:16 | 11 | 9572 | 96 | 720.3M | 0 | 0 |
Finally, if a job number is given, the full information belonging to that job is displayed.
arcx sgejobs 946942
qname = std.q
hostname = bladecd.zeuthen.desy.de
unixgroup = sysprog
owner = ahaupt
job_name = farmHEPSPEC.sh
job_number = 946942
submission_time = 1441134736
start_time = 1441134747
end_time = 1441144320
failed = 0
exit_status = 0
ru_wallclock = 9572
ru_utime = 9047
ru_stime = 166
ru_maxrss = 493216
ru_minflt = 23156122
ru_majflt = 483
ru_inblock = 2699188
ru_oublock = 7691376
ru_nvcsw = 114531
ru_nivcsw = 943807
project = sysprog
granted_pe = NONE
slots = 1
task_number = 0
cpu = 9213
mem = 1023.24
category = -l h_cpu=32400,h_rss=2G,h_stack=10M,hostname=bladec*,m_mem_free=2.1G,s_rt=32700,tmpdir_size=5G -P sysprog -binding linear_automatic 1 0 0 0 no_explicit_binding
pe_taskid = NONE
maxvmem = 720318000
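The "key = value" listing above is easy to process further. The sketch below parses such a record and derives a few summary values. The function names are hypothetical, and the formulas are assumptions: they happen to reproduce the Delay, % and Memory columns shown for job 946942 in the table, but the text does not state how arcx sgejobs actually computes those columns.

```python
SAMPLE = """\
submission_time = 1441134736
start_time = 1441134747
end_time = 1441144320
ru_wallclock = 9572
cpu = 9213
maxvmem = 720318000
category = -l h_cpu=32400,h_rss=2G -P sysprog
"""


def parse_job_record(text: str) -> dict:
    """Parse 'key = value' lines, splitting only at the first '='."""
    record = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            record[key.strip()] = value.strip()
    return record


def summarize(record: dict) -> dict:
    """Derive waiting time, CPU usage and peak memory (assumed formulas)."""
    wallclock = int(record["ru_wallclock"])
    cpu = float(record["cpu"])
    return {
        "delay_s": int(record["start_time"]) - int(record["submission_time"]),
        "cpu_percent": round(100 * cpu / wallclock) if wallclock else 0,
        "maxvmem_mb": int(record["maxvmem"]) / 1e6,
    }


print(summarize(parse_job_record(SAMPLE)))
```

Splitting at the first '=' only matters for fields such as category, whose value itself contains '=' characters.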
3. SGE Failure and Exit Codes
The exit code is the return value of the exiting program. It can be a user-defined value if the job finishes with a call to 'exit(number)'. For abnormally terminated jobs it is the signal number + 128. If an SGE job is terminated because a limit was exceeded, SGE sends a SIGUSR1 signal (10) to the job, which results in an exit code of 138.
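The "signal number + 128" rule can be applied in reverse to interpret an exit code. This is a small sketch with a hypothetical function name. It cannot tell a real signal from a program that deliberately called exit() with a value above 128, so the result is only a hint.

```python
SIGNAL_OFFSET = 128  # exit code = signal number + 128 for killed jobs


def decode_exit_status(code: int) -> str:
    """Interpret an exit code according to the rule described above."""
    if code > SIGNAL_OFFSET:
        return f"terminated by signal {code - SIGNAL_OFFSET}"
    return f"program exit value {code}"


print(decode_exit_status(138))  # terminated by signal 10
print(decode_exit_status(0))    # program exit value 0
```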
The SGE failure code indicates why a job was abnormally terminated. The following incomplete list mentions the most frequent failure codes:
code | meaning |
---|---|
1 | failure before job (execd) |
7 | failure before prolog |
8 | failure in prolog |
10 | failure in pestart |
11 | failure before job (shepherd) |
15 | failure in epilog |
19 | no exit status |
21 | failure in recognizing job |
25 | rescheduling |
26 | failure opening output |
27 | no shell |
28 | no current working dir |
29 | AFS problem |
30 | rescheduling on application error |
36 | check daemon configuration |
37 | qmaster enforced h_rt limit |
100 | failure after job |
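For scripts that post-process job records, the table can be kept as a simple lookup. The sketch below copies the codes listed above; the function name is hypothetical, and treating 0 as "no failure" is an assumption based on the sample job record, where a normally finished job shows failed = 0.

```python
FAILURE_CODES = {
    1: "failure before job (execd)",
    7: "failure before prolog",
    8: "failure in prolog",
    10: "failure in pestart",
    11: "failure before job (shepherd)",
    15: "failure in epilog",
    19: "no exit status",
    21: "failure in recognizing job",
    25: "rescheduling",
    26: "failure opening output",
    27: "no shell",
    28: "no current working dir",
    29: "AFS problem",
    30: "rescheduling on application error",
    36: "check daemon configuration",
    37: "qmaster enforced h_rt limit",
    100: "failure after job",
}


def describe_failure(code: int) -> str:
    """Map the 'failed' field of a job record to a short description."""
    if code == 0:
        return "no failure"  # assumption, see the lead-in
    # The list is incomplete, so unknown codes are reported as such.
    return FAILURE_CODES.get(code, f"unknown failure code {code}")


print(describe_failure(26))  # failure opening output
```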