Batch System - Troubleshooting

1. Common problems
2. Retrieving Job Status Information
3. SGE Failure and Exit Codes

1. Common problems

1.1 Your job "starves" in the waiting queue

Possible reason: The farm is full.
Solution:        Check the output of "qstat -g c" for available nodes.

Possible reason: You requested resources which cannot be fulfilled.
Example:         -l h_cpu > 48:00:00
Solution:        Only CPU times below 48 hours can be requested.

Possible reason: Your job is in an error state.
Example:         qstat lists your job in "Eqw" state.
Solution:        Check the reason for the error and remove the error flag.

Possible reason: You requested large amounts of consumable resources
                 (a high h_rss value, or your job requested a PE).
Example:         qsub -pe multicore 8 <jobscript>
                 qsub -l h_rss=30G <jobscript>
Solution:        Additionally use job reservation (qsub switch: -R y),
                 as shown in the sketch below.
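A minimal sketch combining the pieces above (jobscript.sh stands for your own submit script):

# check cluster-wide slot availability first:
qstat -g c

# submission with large consumable requests; -R y turns on job
# reservation, so the scheduler collects slots for this job instead of
# letting smaller jobs overtake it indefinitely:
qsub -R y -pe multicore 8 -l h_rss=30G jobscript.sh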

1.2 Only some of a set of identical jobs die

Possible reason: You did not specify your requirements correctly.
Example:         You did not specify h_cpu.
Solution:        If h_cpu is not specified, your job might run in short
                 queues; if it then needs more than 30 minutes of CPU
                 time, it will be killed.

Possible reason: Too many jobs access data on the same file server at once.
Solution:        Use AFS! Do not submit too many jobs at once; if you
                 really need to, try the qsub "-hold_jid" option, as in
                 the sketch below. Read the article about optimal storage
                 usage at DESY.
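A sketch of job chaining with -hold_jid (stage1.sh and stage2.sh are placeholders for your own scripts; the -terse switch makes qsub print only the job ID):

# submit the first job and capture its ID:
JOBID=$(qsub -terse -l h_cpu=02:00:00 stage1.sh)

# hold the second job until the first has finished, so both do not
# hit the same file server at once:
qsub -hold_jid "$JOBID" -l h_cpu=02:00:00 stage2.sh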

1.3 All your jobs die at once

Possible reason: There are problems writing the log files
                 (the job's STDOUT/STDERR).
Example:         The log directory (located in AFS) contains too many
                 files; SGE's error mail (qsub parameter '-m a') contains
                 a line saying something like "/afs/ifh.de/...: File too
                 large".
Solution:        Do not store more than 1000 output files per directory.

Example:         The output directory is not writable; SGE's error mail
                 contains a line saying something like "/afs/ifh.de/...:
                 permission denied".
Solution:        Check the directory permissions.

Example:         The log directory does not exist on the execution host.
Solution:        Only network-enabled filesystems (AFS, NFS) work as log
                 directories; local directories (e.g. /usr1/scratch)
                 won't. See the sketch below.
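A sketch of a submission with an explicit, network-visible log directory (the path is illustrative and assumes your home directory lives in AFS):

# the log directory must exist and be writable before the job starts:
LOGDIR=$HOME/batch-logs
mkdir -p "$LOGDIR"

# point STDOUT/STDERR there and request error mail (-m a), so problems
# writing the logs show up in your mailbox:
qsub -o "$LOGDIR" -e "$LOGDIR" -m a jobscript.sh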

1.4 qrsh fails with the error message 'Your "qrsh" request could not be scheduled, try again later'

Possible reason: The farm is full and qrsh wants to occupy a slot at once.
Solution:        Try "qrsh -now n <other requirements>". That way your
                 request is put into the waiting queue and no immediate
                 execution is forced; see the example below.
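For example (the resource values are purely illustrative):

qrsh -now n -l h_cpu=01:00:00,h_rss=2G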


2. Retrieving Job Status Information

Shortly after a job has finished, its status information is no longer accessible via the normal SGE commands. The MACBAT web page contains, in the menu of a given farm, the heading 'Reporting - Finished Jobs'. The related link gives access to the information on your finished jobs: it lists an overview of finished jobs per day, listings of the jobs that finished on a given date can be retrieved from there, and by following the links further, every job detail can be displayed, down to the single tasks of array jobs.
The same information can be obtained with command line tools on Linux; the command arcx sgejobs is provided for retrieving it. A short usage message is printed with

arcx sgejobs -h


The information is displayed for the default farm ('uge'); another farm ('pax' or the old 'sge') can be chosen with the -f=<farm> switch. Information is only shown to authenticated users, and only for their own jobs. Group admins can be registered (please contact UCO) who are then able to view information on the jobs of other users in their group.
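
For example, to query the pax farm instead of the default:

arcx sgejobs -f=pax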
If arcx sgejobs is called without further arguments, a list of submission dates and the number of jobs submitted on each of those dates is printed.

arcx sgejobs

Submit date  Jobs  User
2015-09-01     12  ahaupt

If a submission date or a submission interval is given (date format yyyy-mm-dd), the job data are printed in tabular form.

arcx sgejobs 2015-09-01

=== Tue Sep 1 2015 ===
Job ID  Hostname  Jobname         Submit    Delay    Run    %   Memory  Fail  Exit
944180  bladege   farmHEPSPEC.sh  15:41:01     46  12963  192  1320.0M     0     0
946942  bladecd   farmHEPSPEC.sh  21:12:16     11   9572   96   720.3M     0     0

Finally, if a job number is given,

arcx sgejobs 946942

then the full information belonging to that job is displayed.

arcx sgejobs 946942

         qname = std.q
         hostname = bladecd.zeuthen.desy.de
         unixgroup = sysprog
         owner = ahaupt
         job_name = farmHEPSPEC.sh
         job_number = 946942
         submission_time = 1441134736
         start_time = 1441134747
         end_time = 1441144320
         failed = 0
         exit_status = 0
         ru_wallclock = 9572
         ru_utime = 9047
         ru_stime = 166
         ru_maxrss = 493216
         ru_minflt = 23156122
         ru_majflt = 483
         ru_inblock = 2699188
         ru_oublock = 7691376
         ru_nvcsw = 114531
         ru_nivcsw = 943807
         project = sysprog
         granted_pe = NONE
         slots = 1
         task_number = 0
         cpu = 9213
         mem = 1023.24
         category = -l h_cpu=32400,h_rss=2G,h_stack=10M,hostname=bladec*,m_mem_free=2.1G,s_rt=32700,tmpdir_size=5G -P sysprog -binding linear_automatic 1 0 0 0 no_explicit_binding
         pe_taskid = NONE
         maxvmem = 720318000
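
The submission_time, start_time and end_time fields are Unix epoch seconds; on Linux they can be converted to a readable date with GNU date:

date -d @1441134736   # prints the submission time in your local time zone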

3. SGE Failure and Exit Codes

The exit code is the return value of the exiting program. It can be a user-defined value if the job finishes with a call to 'exit(number)'. For abnormally terminated jobs it is the signal number + 128. If an SGE job is terminated because a limit was exceeded, SGE sends a SIGUSR1 signal (10) to the job, which results in an exit code of 138.
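
A minimal job script sketch reacting to that signal (save_results is a placeholder for your own checkpoint logic; whether the handler gets enough time to run before the hard limit strikes depends on the configured grace period):

#!/bin/bash

# exit code convention: 128 + signal number (SIGUSR1 = 10) -> 138
on_usr1() {
    echo "SIGUSR1 caught: a limit is about to be enforced" >&2
    # save_results   # placeholder for checkpoint/cleanup logic
    exit 138
}
trap on_usr1 USR1

# ... actual payload of the job runs here ...

Conversely, an exit code above 128 can be decoded on the command line with e.g. 'kill -l $((138-128))', which prints the signal name (USR1).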
The SGE failure code indicates why a job was abnormally terminated. The following incomplete list mentions the most frequent failure codes:

code  meaning
   1  failure before job (execd)
   7  failure before prolog
   8  failure in prolog
  10  failure in pestart
  11  failure before job (shepherd)
  15  failure in epilog
  19  no exit status
  21  failure in recognizing job
  25  rescheduling
  26  failure opening output
  27  no shell
  28  no current working dir
  29  AFS problem
  30  rescheduling on application error
  36  check daemon configuration
  37  qmaster enforced h_rt limit
 100  failure after job