University of Nebraska–Lincoln

Research Computing Facility

PrairieFire - PBS

  1. What is PBS?
  2. How do I create a PBS script for running jobs?
  3. How do I submit a job?
  4. My job needs user intervention. Is there an interactive mode for PBS?
  5. What queues are available to me?
  6. Now that I've submitted my job, how do I check the status?
  7. Is there a way to see all of the jobs I have run on PrairieFire?
  8. I don't see any output from my job. What's going on?
  9. How do I kill a job?
  10. My job was killed. What should I do if I don't know why?
  11. Why shouldn't I just ask for the maximum amount of resources possible at all times?
  12. Is there a user's manual available?
  13. What do I need to run through PBS and what can I run on the head node?

What is PBS?

PBS stands for Portable Batch System. It lets you submit jobs to a queue without having to find free machines on the cluster yourself, and it lets the administrators make sure that no single user over-uses the machine. The version of PBS on PrairieFire is PBSPro version 7.1.3. Submitting jobs to a queue may seem like a pain, but it will make your life easier: PBS gives your jobs the resources they request while keeping all users within policy as much as it can. You do not need to worry about which nodes on the cluster are free or keep track of which nodes your jobs went to.

How do I create a PBS script for running jobs?

A PBS script is a script that is used to submit jobs to the queue. There are many possibilities for a PBS script. Here is a simple example of a PBS script for running a serial job on PrairieFire.

#PBS -W group_list=advisor_name
#PBS -N jobname
#PBS -l nodes=1,walltime=0:20:30
#PBS -S /bin/sh
#PBS -j oe
#PBS -q short
#PBS -M johndoe@cse.unl.edu
#PBS -m abe
#PBS -o out.log

#cd to your execution directory
cd ~/programs/code
./a.out parameter1 parameter2

Please note that the '#PBS' prefix is not a comment. These lines are PBS directives that request certain resources and modify the behavior for how the queue handles the job.

#PBS -W group_list=group_name - The first line tells PBS which 'group' you are in. The group name is usually your advisor's last name. If you are not sure, type 'groups' to list your groups. If you are a member of multiple groups, you should probably use the first one on the list.

#PBS -N jobname - This line tells PBS what the name of the job is. You should change 'jobname' to whatever will help you keep track of that job. PBS does not care what your jobname is as long as it only contains numerals and letters.

#PBS -l nodes=1,walltime=0:20:30 - This tells PBS how many nodes you are requesting. Because this is a serial (one processor) job, you should only ask for one CPU. The 'walltime' is the amount of time your job needs in hours:minutes:seconds. The sample script is asking for 20 minutes and 30 seconds.

#PBS -S /bin/sh - This tells PBS which shell to use. If your program requires bash, use bash. If it requires tcsh, use tcsh. If you aren't sure, sh is a safe choice. Note that the 'rc' files for the shell you pick (i.e. tcshrc and bashrc) DO get sourced, so you can add any setup code your job needs to those files.

#PBS -j oe - The -j option joins the output files. The 'oe' means to join the output and error files together. This is good for people who would like to know where the errors happened with respect to the output. If you would like separate files, you may omit this line.

#PBS -q short - This line tells PBS which queue you would like to use. To use the long queue, replace the word 'short' with 'long'. If this line is not included, the job will default to the 'short' queue.

#PBS -M johndoe@cse.unl.edu - You should use this line if you would like emails when your job does something. The next command allows you to control when you get emails. If you do not want emails sent to you to notify you of the status of your job, you may omit this and the next line.

#PBS -m abe - This tells PBS when it should email you. The 'a' stands for abort: PBS will email you when your job gets killed, either by you or by the superuser. The 'b' option emails you when your job begins. Sometimes PrairieFire is full when you submit your jobs, and this will notify you when your job actually starts. Finally, the 'e' option has PBS email you when your job exits for any reason, whether it ends successfully or dies.

#PBS -o out.log - This line is the name of your output file. If you have a program that will display an output to the screen, it will get included into this output file.

Everything below the #PBS directives is yours. You can put anything there you would like and it will run just as it would in a shell script. Write it in the language of the shell you chose. Also, in this case, the '#cd ...' line is a comment.
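The job name rule and the hours:minutes:seconds walltime format described above can be checked before you submit. The following is a minimal sketch, not part of PBS: the function names is_valid_jobname and walltime_to_seconds are hypothetical helpers introduced here for illustration.

```shell
# Hypothetical helpers, not part of PBS.

# Succeeds only if the job name is non-empty and contains just
# letters and numerals.
is_valid_jobname() {
    case "$1" in
        ''|*[!A-Za-z0-9]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Convert a walltime in hours:minutes:seconds to a number of seconds.
# awk is used so that leading zeros (e.g. 08) are read as decimal.
walltime_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

walltime_to_seconds 0:20:30    # the sample script's request; prints 1230
```

For example, 'is_valid_jobname "my job"' fails because of the space, while 'is_valid_jobname mpitesting' succeeds.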

Running an MPI or other parallel job requires a PBS script pretty much like the one for the serial job. Here is a sample PBS script for an MPI job.

#PBS -W group_list=advisor_name
#PBS -N mpitesting
#PBS -l nodes=4,walltime=0:20:30
#PBS -S /bin/sh
#PBS -q short
#PBS -M johndoe@cse.unl.edu
#PBS -m abe
#PBS -j oe
#PBS -o out.log

cd ~/mpitest
/opt-fs/mpich.pgi-64/bin/mpirun -np 4 greetings > ~/out.log

As mentioned before, the script for a parallel job looks very similar to that of a serial job. The main difference here is the following line:

#PBS -l nodes=4,walltime=0:20:30 - This line has the number of nodes changed to 4. You can change this number to the number of processors you would like to run your job on, up to the maximum allowed by the queue.

For a parallel job, a good thing to know is that you can get a list of the nodes you are running on by typing 'cat '. You can redirect this into a file and go from there if necessary.

To run a two processor SMP job, you should ask for a job in exclusive mode. Just add the following to your PBS script.

#PBS -l nodes=1:ppn=2,walltime=0:20:30#excl

This will give you two processors on one node. If you would like an 8 processor job on four nodes for any reason (such as running on the debug queue), you can use the following.

#PBS -l nodes=4:ppn=2,walltime=0:20:30

Notice that the nodes are now four and the processors per node is still two.
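The processor count of a request is the number of nodes times the processors per node. As a rough sketch of that arithmetic, the hypothetical helper below (total_procs is not a PBS command) reads a resource string in the form used above and assumes one processor per node when ':ppn=' is absent, as in the earlier 'nodes=4' example.

```shell
# Hypothetical helper: total processors requested by a "nodes=N:ppn=P" spec.
# If ":ppn=P" is absent, one processor per node is assumed.
total_procs() {
    echo "$1" | awk -F'[=:,]' '{
        nodes = 0; ppn = 1
        # Walk the fields looking for the keys and take the value after each.
        for (i = 1; i < NF; i++) {
            if ($i == "nodes") nodes = $(i + 1)
            if ($i == "ppn")   ppn   = $(i + 1)
        }
        print nodes * ppn
    }'
}

total_procs nodes=4:ppn=2,walltime=0:20:30    # prints 8
total_procs nodes=1:ppn=2                     # prints 2
```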

How do I submit a job?

To submit a job, create a PBS script and submit it to the queue by typing 'qsub PBS_script' where PBS_script is the name of the PBS script you created.
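Putting the two steps together might look like the sketch below. The file name serial_job.pbs is an assumption, and the script body is a shortened copy of the serial example from this page. The qsub call is guarded so that the sketch does nothing harmful on a machine without PBS.

```shell
# Write the sample serial script to a file (the file name is an assumption).
script=serial_job.pbs
cat > "$script" <<'EOF'
#PBS -W group_list=advisor_name
#PBS -N jobname
#PBS -l nodes=1,walltime=0:20:30
#PBS -S /bin/sh
#PBS -j oe
#PBS -q short

cd ~/programs/code
./a.out parameter1 parameter2
EOF

# Submit only where qsub exists, so the sketch is harmless off the cluster.
if command -v qsub >/dev/null 2>&1; then
    qsub "$script"
else
    echo "qsub not found; wrote $script but did not submit it"
fi
```

The quoted 'EOF' keeps the shell from expanding anything inside the script while it is being written out.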

My job needs user intervention. Is there an interactive mode for PBS?

Yes. Submit the job with the '-I' flag, 'qsub -I PBS_script', and you will be logged onto the first node that will run your job. From there you can run whatever your job needs and enter input as necessary.

What queues are available to me?

There are more queues than just short. At present, the regular queues are short, long, and debug. If you have special requirements, contact the system administrator, who may be able to negotiate a custom queue for your purposes.

Currently the PBS queues have the following settings:

Queue Name    CPU    Jobs    Time
short         32     8       72 hours
long          8      8       360 hours
debug         2      2       15 min

Please also know that if you publish, are a Co-PI or a PI for a grant, or donate money, you will be allowed extra resources.
1 Publication = 1 unit
1 Co-PI = 4 units
1 PI = 6 units
Donating money = negotiated

For more information on a certain queue, type:

qmgr -c 'print queue queuename'
The queuename in the above command is the name of the queue you want more information on.

Now that I've submitted my job, how do I check the status?

To check a job submitted to PBS, use the command 'qstat -u username', where 'username' is your username on PrairieFire. You will get information about your job in the following order.

Job ID           Username  Queue  Jobname     SessID  NDS  TSK  Req'd Memory  Req'd Time  S  Elap Time
700.prairiefire  johndoe   short  mpitesting  2365    2    4    512mb         10:00       R  5:30

Most of the above is self-explanatory. The Job ID is the PBS Job ID. When you kill a job or want more information about it through PBS, you will use this ID. The SessID is the Process ID on the first node that PBS is running your job on. If you aren't sure what a Process ID is, it is the number assigned to the process you are running. NDS is the number of nodes your job is using. Please note that this is nodes, not CPUs. The next number, TSK, is the number of tasks the job is running. Often this is the number of CPUs your job is using, although not always. If you see a user using one node and two tasks, the user is most likely running a two processor job on a single node.

Req'd Time is the time the user asked for to complete the job, and Elap Time is the time that has elapsed since the job began. When the Elap Time reaches the Req'd Time, PBS will kill the job. Finally, 'S' is for status. Your job may be 'Q', which means queued to run, 'R', which means running, or 'E', which means the job is currently exiting.
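If you want to pick the Job ID and status out of a qstat line in a script, something like the following sketch may help. It works on the sample line shown above rather than on live qstat output, and it assumes every column is a single whitespace-free word, which may not hold for all real output. The function describe_state is a hypothetical helper.

```shell
# Describe the one-letter status codes listed above.
describe_state() {
    case "$1" in
        Q) echo "queued" ;;
        R) echo "running" ;;
        E) echo "exiting" ;;
        *) echo "unknown" ;;
    esac
}

# Sample line copied from the qstat output above; on the cluster this
# would come from 'qstat -u username' instead.
line='700.prairiefire johndoe short mpitesting 2365 2 4 512mb 10:00 R 5:30'

# Whitespace-separated columns: field 1 is the Job ID, field 10 is S.
jobid=$(echo "$line" | awk '{ print $1 }')
state=$(echo "$line" | awk '{ print $10 }')

echo "$jobid is $(describe_state "$state")"    # prints: 700.prairiefire is running
```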

To see what is running on the entire cluster, use 'qstat -a'. This lets you see what is running so you can estimate when your job may run. To see a summary of all the jobs on PrairieFire, use 'qstat -Bf'. This will show you how many jobs are running, how many are queued, how many CPUs are being used, and much more. If you notice that not all CPUs are being used but your job is not running, it is probably because the nodes are reserved or being serviced/tested.

To see what nodes your job is running on, use 'qstat -u username -n'. The '-n' flag will tell you which nodes your job is running on. You can then type 'pbsnodes nodexxx', where nodexxx is the node your job is running on, to get more information about that node.

Similar qstat information can also be found at http://rcf.unl.edu/prairiefire/PFjobs.php

Is there a way to see all of the jobs I have run on PrairieFire?

Actually there is. You can see all jobs run on PrairieFire at http://rcf.unl.edu/prairiefire/PFstats.php .

I don't see any output from my job. What's going on?

Some jobs will not output information into your output files until your job is complete. However, PBS does keep this info in a file. To see this file, add the following into your PBS script.

OUT=$(echo | cut -f 1 -d .)
tail -f /var/spool/PBS/spool/.OU >> ~/output.log&

How do I kill a job?

Killing a job is simple. Just type 'qdel JobID'. The JobID is the PBS Job ID as mentioned in the sample above. If this does not work, you can use 'qdel -Wforce JobID'.
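The two commands above can be wrapped so the forced form is only tried when the normal one fails. This is just a sketch: kill_job is a hypothetical wrapper, and it assumes qdel returns a non-zero exit status when it cannot delete the job.

```shell
# Hypothetical wrapper: try a normal qdel first, then fall back to
# the forced form only if that fails.
kill_job() {
    if qdel "$1"; then
        echo "deleted $1"
    else
        echo "qdel failed for $1, retrying with -Wforce" >&2
        qdel -Wforce "$1"
    fi
}

# Usage on the cluster (the Job ID is the one from the qstat sample):
#   kill_job 700.prairiefire
```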

My job was killed. What should I do if I don't know why?

Take a look at the error log. If you 'joined' the output and error files, it will be 'jobname.JobID.o', or whatever you named your output file if you used '#PBS -o output.txt'. If you didn't join the output and error files, it will have the suffix '*.e'. Take a look at those files and see if you can figure anything out. If not, please email the errors to support@rcf.unl.edu for assistance.

Why shouldn't I just ask for the maximum amount of resources possible at all times?

It almost sounds like a good idea to hog all the resources you can, but it's actually to your disadvantage. PBS tries to load-balance the jobs, so if it sees that it can run a smaller job before a larger job will be able to run, it will run the smaller job in order to use as many resources as possible. So, if your job only needs 10 hours of runtime on 6 CPUs and you ask for 72 hours, PBS will first run a job that needs 4 CPUs and 5 hours if 6 CPUs will not be available for 6 hours. If you see a job start running after you have submitted yours to the queue, this is probably the reason why. Also, using exclusive mode sounds like a good idea, but please note that the chance of getting a node with nothing else running on it is lower than the chance of getting two nodes with one CPU open each. In general, ASKING ONLY FOR THE RESOURCES YOU NEED, AND NO MORE, HELPS EVERYBODY, INCLUDING YOU.
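One way to size a walltime request is to start from a runtime you have actually measured and add a modest margin. The sketch below does that arithmetic. The helper name suggest_walltime and the 20 percent margin are assumptions made for illustration, not PrairieFire policy.

```shell
# Hypothetical helper: turn a measured runtime in seconds into a walltime
# request with a safety margin (a percentage chosen by you, not by PBS).
suggest_walltime() {
    awk -v secs="$1" -v pct="$2" 'BEGIN {
        total = int(secs * (100 + pct) / 100)
        printf "%d:%02d:%02d\n", int(total / 3600), int((total % 3600) / 60), total % 60
    }'
}

suggest_walltime 36000 20    # a 10 hour run plus 20 percent; prints 12:00:00
```

The result is in the hours:minutes:seconds form that the 'walltime' resource expects.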

Is there a user's manual available?

Yes, the PBS user's manual is available on PF at /util/doc/PBSproUG.pdf. You can also use the man pages. There are a lot more possible commands that you can use with PBS. I would add them here, but there is a reason why the user's manual is very very long. The user's manual will help you use them.

What do I need to run through PBS and what can I run on the head node?

In general, all jobs need to go through PBS. You can compile, compress/uncompress, tar/untar, edit, and look through directories on the head node. Of course, you can also monitor jobs and run other everyday commands on the head node. Any other command that may take up resources should be run through PBS. If you have any questions, ask support for assistance.