Batch System - Monitoring the current farm and job status

Here are some of the most useful commands for querying the current farm status:

Command            Provided information
qhost              Print the execution host configuration and load
qstat -g c         Print the current queue utilization
qstat -u <user>    Show only the jobs of the specified user
qstat -j <job id>  Print detailed information about the job with the specified job ID
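
As a rough illustration of how the qstat output can be condensed, the following sketch counts jobs per state. It assumes the default qstat layout (two header lines, state code in the fifth column); the function name count_job_states is made up for this example and is not part of the batch system.

```shell
# Count jobs per state from qstat output read on stdin.
# Assumption: default qstat layout, i.e. two header lines
# followed by one line per job with the state in column 5.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) printf "%s %d\n", s, count[s] }' |
        LC_ALL=C sort
}

# Possible use on a submit host:
#   qstat -u "$USER" | count_job_states
```

The function only reads text from standard input, so it can be tried on saved qstat output as well. If the local qstat output uses a different column order, the column number in the awk script has to be adjusted.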


Jobs queried by qstat can be in different states:

Status   Explanation
qw       Job is waiting for execution
t        Job is transferring to the execution host
r        Job is currently running
Eqw      Job has failed. Use the command sge-job-error <job id> to determine why,
         then either delete the job with qdel <job id> (if the error is permanent)
         or clear the error status with qmod -cj <job id> (if the cause was temporary)
Rq / Rr  Job has been requeued / restarted because it was running on a node that crashed
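
The status codes above could be turned into a small lookup helper for use in scripts. This is only a sketch: the function name explain_job_state is hypothetical, and it covers only the states listed in the table.

```shell
# Print a short explanation for a qstat state code.
# Only the states from the table above are covered.
explain_job_state() {
    case "$1" in
        qw)  echo "waiting for execution" ;;
        t)   echo "transferring to the execution host" ;;
        r)   echo "running" ;;
        Eqw) echo "failed: check with sge-job-error, then qdel or qmod -cj" ;;
        Rq)  echo "requeued after a node crash" ;;
        Rr)  echo "restarted after a node crash" ;;
        *)   echo "state not covered here" ;;
    esac
}

# Example: explain_job_state Eqw
```

Any state that is not in the table falls through to the last branch instead of being guessed at.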

The farm status can also be viewed in a web browser. From the MACBAT overview page, more detailed information is available by clicking the link for a farm. See also the chapter on retrieving Job Status Information.

Additionally, a Grafana-based dashboard is available that visualizes some runtime details of your job. The URL can be retrieved with this command:

sge-job-url <job-id> [task-id]