Introduction
Introduction to using Condor on RCF systems
What is Condor?
The Condor project is designed to implement High Throughput Computing. Condor is a bundle of software that schedules applications and finds available computing resources in a clustered or grid environment.
RCF uses Condor for its ability to scavenge idle cycles and to run serial jobs at remote locations. Cycle scavenging lets Condor run jobs on processors that PBS is not using on the main Prairiefire cluster. Once PBS schedules a job on those processors, Condor checkpoints or evicts its own job so that it does not interfere with PBS. In addition, Condor makes it relatively easy to run jobs on machines not directly connected to Prairiefire. We now have -substantial- resources (named 'Village') in the WSEC 29 machine room dedicated to running Condor jobs submitted from Prairiefire. At the moment, we only allow serial jobs to be run through Condor, and we try to reserve the main Prairiefire cluster for parallel jobs that need the high-speed Myrinet interconnects.
IMPORTANT NOTE: While we have given separate names to the Prairiefire (located in Miller & Paine) and Village (located in WSEC 29) clusters, users always interact with both and submit their jobs through the Prairiefire head node. It is critical that you understand the following differences between these two clusters before reading further and trying to submit a Condor job.
- Prairiefire and Village do -not- currently share a filesystem. You will have to use Condor's file transfer mechanisms to move input files from your home directory on Prairiefire to Village, and output files back, in order for your job to run there. Thankfully, Condor makes this easy and sometimes even automatic.
- All nodes in Village are dedicated to Condor and will -not- kill or evict a job unless the job has exceeded the maximum runtime of 72 hours. Nodes on Prairiefire are intended to serve PBS jobs first, so there is a chance that PBS will start a job on a node running Condor jobs, thus evicting those Condor jobs. If your jobs run for a long time and are not easy to restart, you may wish to run them only on the Village nodes. If your jobs are short or very easy to restart, you may want to let them run on both Prairiefire and Village. Instructions on how to limit your jobs to a specific cluster are given in the Job Submission page.
- Currently, Village consists of retired Prairiefire nodes, which have the same 2.2GHz Opteron processors currently in Prairiefire. In the future we will add a variety of nodes and architectures as they become available, the soonest being a collection of old AMD Athlon (32-bit) single-processor desktop machines retired from the CSE department. If you need to limit your jobs to run only on specific architectures or processor speeds, please contact us at rcf-support@unl.edu and we will try to accommodate you and help modify your Condor submission scripts accordingly.
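
Because the two clusters do not share a filesystem, a job that may land on Village has to tell Condor which files to ship with it. The sketch below is one minimal way to generate such a submit description file from Python. It assumes the vanilla universe and uses the standard Condor submit commands should_transfer_files, when_to_transfer_output and transfer_input_files. The program name, the input file names and the job.out/job.err/job.log names are hypothetical placeholders, and the sketch does not restrict the job to either cluster (see the Job Submission page for that).

```python
def render_submit_description(executable, input_files, arguments=""):
    """Return the text of a Condor submit description for one serial job.

    Assumptions (not from the RCF documentation): vanilla universe, and
    output/error/log files named job.out, job.err and job.log.
    """
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments = {arguments}")

    # Ask Condor to copy files to the execute node and to bring the
    # job's output files back when the job exits.
    lines += [
        "should_transfer_files = YES",
        "when_to_transfer_output = ON_EXIT",
    ]
    if input_files:
        lines.append("transfer_input_files = " + ", ".join(input_files))

    lines += [
        "output = job.out",
        "error = job.err",
        "log = job.log",
        "queue",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical program and input files, for illustration only.
    print(render_submit_description("my_program", ["input.dat", "params.txt"]))
```

Saving the printed text to a file and passing that file to condor_submit on the Prairiefire head node would queue a single job. Check the result against the Job Submission page before relying on it, since site-specific settings may be required.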
How to get started using Condor to run serial jobs on RCF systems
Additional Help
The official Condor documentation is the best place to look if you want to do more advanced things with Condor.
If you're having trouble running jobs or can't find an answer in the official documentation, please email us at rcf-support@unl.edu.

