/usr/bin
on the login node.
Use the cqstat command to see what jobs are queued or running. WallTime
is in hours:minutes:seconds
.
$ cqstat JobID User WallTime Nodes State Location ============================================================== 188178 luke 24:00:00 64 running 64_R001_J207_NA 188313 luke 24:00:00 64 running 64_R001_J203_N8 188345 luke 24:00:00 64 running 64_R001_J106_N2 188348 luke 24:00:00 64 running 64_R001_J102_N0 188400 hsolo 06:00:00 8 running 32_R001_J113_N5 188403 hsolo 04:00:00 64 queued None 188404 hsolo 04:00:00 64 queued None 188408 hsolo 04:00:00 64 queued None 188409 hsolo 04:00:00 64 queued None 188428 hsolo 06:00:00 8 running 32_R001_J117_N7 188445 hsolo 04:00:00 64 queued None 188450 yoda 24:00:00 128 running 128_R000_J102_N0 188451 yoda 24:00:00 128 running 128_R000_J111_N4 188452 yoda 24:00:00 128 running 128_R000_J210_NC 188453 yoda 24:00:00 128 running 128_R000_J203_N8 |
The -f
flag gives more info:
$ cqstat -f JobID JobName User WallTime QueuedTime RunTime Nodes State Location Mode Procs Queue Kernel StartTime Index ===================================================================================================================================================== 188178 - luke 24:00:00 00:03:15 23:04:06 64 running 64_R001_J207_NA vn 127 default ZeptoOS 04/22/08 13:09:01 None 188313 - luke 24:00:00 00:00:21 14:41:22 64 running 64_R001_J203_N8 vn 127 default default 04/22/08 21:31:45 None 188345 - luke 24:00:00 00:00:25 11:15:10 64 running 64_R001_J106_N2 vn 127 default default 04/23/08 00:57:56 None 188348 - luke 24:00:00 00:00:28 09:57:37 64 running 64_R001_J102_N0 vn 127 default default 04/23/08 02:15:30 None 188400 - hsolo 06:00:00 00:00:15 02:55:05 8 running 32_R001_J113_N5 vn 16 default default 04/23/08 09:18:02 None 188403 - hsolo 04:00:00 02:55:16 N/A 64 queued None vn 128 default ZeptoOS N/A None 188404 - hsolo 04:00:00 02:55:15 N/A 64 queued None vn 128 default default N/A None 188408 - hsolo 04:00:00 02:35:07 N/A 64 queued None vn 128 default default N/A None 188409 - hsolo 04:00:00 02:15:01 N/A 64 queued None vn 128 default default N/A None 188428 - hsolo 06:00:00 00:04:38 01:50:18 8 running 32_R001_J117_N7 vn 16 default default 04/23/08 10:22:49 None 188445 - hsolo 04:00:00 01:24:50 N/A 64 queued None vn 128 default default N/A None 188450 - yoda 24:00:00 00:00:18 00:07:01 128 running 128_R000_J102_N0 vn 256 default default 04/23/08 12:06:06 None 188451 - yoda 24:00:00 00:00:20 00:04:26 128 running 128_R000_J111_N4 vn 256 default default 04/23/08 12:08:40 None 188452 - yoda 24:00:00 00:00:22 00:02:04 128 running 128_R000_J210_NC vn 256 default default 04/23/08 12:11:02 None 188453 - yoda 24:00:00 00:00:21 00:01:16 128 running 128_R000_J203_N8 vn 256 default default 04/23/08 12:11:51 None |
There are 3 primary queues in use on Frost.
default
for long running jobsdebug
for small, short jobsJumboFridays
for half or full rack jobs during the big run window (usually 8-10am on Fridays.)man cqstat
). You can check which queues a partition is a member of with the partlist command.$ cqstat -q debug default JumboFridays Name State Users MinTime MaxTime MaxRunning MaxQueued MaxUserNodes TotalNodes Priority ============================================================================================================ JumboFridays running None None 03:00:00 4 None None None 0 debug running None None 02:00:00 4 50 None None 0 default running None None 24:00:00 4 None None None 0 |
Use the cqsub command to submit a job to the queue.
cqsub <required flags> <optional flags> executable |
Required Flag |
Description |
---|---|
|
where |
|
where |
Example: Simple submission
$ cqsub -n 32 -t 00:10:00 example.rts submitting walltime=10.0 minutes 162 |
In this example STDOUT is stored in 162.output
, and STDERR is stored in 162.error
in the current directory.
Some optional flags (see man cqstat for a full descrption of all flags):
Flag |
Description |
---|---|
|
where |
|
where |
|
where |
|
where |
|
where |
|
sends an email message at the start and stop of the job to the specified email address. Multiple email addresses, separated by colons, can be specified. |
|
forces the job to wait to run until the listed jobids have finished running |
Example: Specify 55 processes in virtual-node mode
$ cqsub -n 28 -c 55 -m vn -t 00:10:00 example.rts |
Use the cqdel command to cancel a job that has been submitted to the queue.
$ cqdel 162 Deleted Jobs JobID User ============== 162 voran |
It may take a bit for the job to be deleted if it is running, but you can check the `.error` file to see if the job is being deleted
Use the qmove command to move a job to another queue, after the job has been submitted.
$ qmove debug 118399 Moved Jobs to queue: debug 118399 |
Use the qhold and qrls commands to hold and release your jobs that are in the queue. When a job is in a hold state, it will not be scheduled to run by the scheduler.
$ qhold 118399 Placed user hold on jobs: 188499 $ qrls 118399 Removed user hold on jobs: 188499 |
The qalter command allows you to change many attributes of a job that you previously submitted to a queue, including walltime, number of nodes
$ qalter -t 60 -n 32 --mode vn 118399 |
See man qalter for more information.
The cqwait command can be used to wait for a job to either start running, or complete. If cqwait is used with no arguments, cqwait will not return until the job has finished running, or has left the queue because of a deletion. If cqwait is used with the --start option, cqwait will return when the job starts running in the queue.
$ cqwait 182949 |
Use the partlist command to see what partitions are available and which are in use.
$ partlist Name Queue State Backfill =================================================================================================== NCAR_R00 default:debug:teragrid:JumboFridays:admin blocked (128_R000_J102_N0) - NCAR_R000 default:debug:teragrid:JumboFridays:admin blocked (128_R000_J102_N0) - NCAR_R001 default:debug:teragrid:JumboFridays:admin blocked (128_R001_J102_N0) - 256_R000_J102_N0 default:teragrid:debug:JumboFridays:admin blocked (128_R000_J102_N0) - 256_R000_J203_N8 default:teragrid:debug:JumboFridays:admin blocked (128_R000_J203_N8) - 256_R001_J102_N0 debug:JumboFridays:admin blocked (128_R001_J102_N0) - 256_R001_J203_N8 debug:JumboFridays:admin blocked (64_R001_J203_N8) - 128_R000_J102_N0 debug:default:teragrid:admin busy - 128_R000_J111_N4 debug:default:teragrid:admin idle 4:23 128_R000_J203_N8 debug:default:teragrid:admin busy - 128_R000_J210_NC debug:default:teragrid:admin busy - 128_R001_J102_N0 debug:default:teragrid:admin busy - 128_R001_J111_N4 debug:default:teragrid:admin blocked (64_R001_J111_N4) - 128_R001_J203_N8 debug:default:teragrid:admin blocked (64_R001_J203_N8) - 128_R001_J210_NC debug:default:teragrid:admin blocked (64_R001_J210_NC) - ... |
This example shows that three of the four 128-node partitions in midplane R000 are in use by jobs ('busy'), which consequently block the full rack partition ('NCAR_R00'), and the 512-node and 256-node partitions on midplane R000 ('NCAR_R000, 256_R000_J102_N0, and 256_R000_J203_N8, respectively). Note that partition 128_R000_J111_N4 is idle, and the scheduler has calculated a 4h23m backfill window for it. This means that if a user submits a 128-node job with a requested walltime of less than 4h23m, the job can be run right away.
You may also use the nodes helper script which will display only the currently free partitions, along with their associated backfill windows. nodes -v lists all of the partitions (from the partlist -l output) and indents the output to show the partition hierarchy.
Use the showres -l command to see what system reservations are in place.
If your job is sitting 'queued' and it seems like it should be running, there are a few things to check before sending a message to frost-help@ucar.edu: