Batch System - Monitoring the current farm and job status
Here are some of the most useful commands for querying the current farm status:
| Command | Provided information |
|---|---|
| qhost | Print the execution host configuration and load |
| qstat -g c | Print the current queue utilization |
| qstat -u <user> | Show only the jobs of a specific user |
| qstat -j <job id> | Print detailed information about the job with the specified job id |
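The last two commands combine naturally: list a user's jobs, then request details for each one. The following is a minimal sketch, assuming the default qstat listing format (two header lines, then one line per job with the numeric job id in the first column). The helper name `job_ids` is hypothetical and not part of the batch system.

```shell
#!/bin/sh
# job_ids: read a qstat job listing on stdin and print each job id once.
# Assumption: the listing starts with two header lines (column titles and a
# dashed separator) and the job id is the first whitespace-separated field.
job_ids() {
    # Skip the two header lines, keep only numeric first fields, and
    # collapse repeated ids (array jobs may appear on several lines).
    awk 'NR > 2 && $1 ~ /^[0-9]+$/ { print $1 }' | sort -un
}

# Usage on the farm (not executed here):
#   qstat -u "$USER" | job_ids | while read -r id; do qstat -j "$id"; done
```

If the listing format on your farm differs, the field number in the awk expression would need to be adjusted.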
Jobs queried by qstat can be in different states:
| Status | Explanation |
|---|---|
| qw | job is waiting for execution |
| t | job is transferring to the execution host |
| r | job is currently running |
| Eqw | job has failed; use the command sge-job-error <job id> to determine why. After that, either delete the job with qdel <job id> (if the error is permanent) or clear the error status with qmod -cj <job id> (if the cause of the error was temporary) |
|  | job has been requeued / restarted as it was running on a node that crashed |
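When scripting around qstat output, it can help to translate state codes into readable text. The sketch below covers only the codes named in the table above; the function name `describe_state` is hypothetical, and any other code is reported as unknown rather than guessed.

```shell
#!/bin/sh
# describe_state: print a short explanation for a qstat job state code.
# Only the codes documented in the table are handled explicitly.
describe_state() {
    case "$1" in
        qw)  echo "waiting for execution" ;;
        t)   echo "transferring to the execution host" ;;
        r)   echo "running" ;;
        Eqw) echo "failed: inspect with sge-job-error, then qdel or qmod -cj" ;;
        *)   echo "unknown state: $1" ;;
    esac
}

# Example: describe_state Eqw
```

For a job in state Eqw, the message points back to the recovery steps described in the table: check the reason first, then either delete the job or clear its error status.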
The farm status can also be visualized in the web browser. From the MACBAT overview page more detailed information can be retrieved by clicking on the link for a farm. Please see also the chapter on retrieving Job Status Information.
Additionally, a Grafana-based dashboard is available that visualizes some runtime details of your job. The URL can be retrieved with this command:
sge-job-url <job-id> [task-id]