An introduction to Cluster Computing¶
Introduction¶
Welcome to the introduction to the HNC Condor. To use it efficiently, you need to be familiar with a few concepts. I will keep things as simple and short as possible and provide you with examples. But please bear in mind that the information you find on this page is the absolute minimum! So make sure you understand it, or drop by and ask. The introduction covers the following aspects:
What is a cluster?
How do I use the cluster?
What is a cluster?¶
General Remarks¶
In one short sentence: a cluster is a bunch of computers that are used by a bunch of people. These computers provide resources, in our case CPU power and memory, to all users. These resources have to be distributed among the users as optimally and fairly as possible.
Here is a picture of what the system looks like:
As you see, your computer is connected to your Personal Analysis Machine (a.k.a. Bomber). All the Bombers are connected to the HNC Condor Master Node. Through this connection, your Bomber tells the HNC Condor Master Node that it wants it to do a job. You can think of a job as a Python function or method that takes some parameters. The Master Node also needs some more information, for instance, how much RAM your job will need. The Master Node then collects all your jobs and the jobs of everybody else who wants to compute things on the cluster. It then asks the HNC Condor Execute Nodes whether they have the resources for the jobs (i.e., enough RAM and CPU). If one of them says yes, it will get one of the jobs and execute it.
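To make this flow a bit more concrete, here is a toy model of it in plain Python. It is only a sketch for illustration: the names `Job`, `ExecuteNode` and `match_jobs` are made up for this example and are not part of obob_condor or HTCondor, and the real Master Node is considerably smarter about how it picks a node.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Job:
    """A unit of work: a function, its parameters, and the RAM it asks for."""
    owner: str
    func: Callable
    args: tuple
    ram_gb: float


@dataclass
class ExecuteNode:
    """A machine with a certain amount of free RAM and free CPU cores."""
    name: str
    free_ram_gb: float
    free_cpus: int
    running: list = field(default_factory=list)

    def can_run(self, job):
        # A job needs one free CPU and at least the RAM it requested.
        return self.free_cpus >= 1 and self.free_ram_gb >= job.ram_gb

    def start(self, job):
        # Reserve the resources and remember the job.
        self.free_cpus -= 1
        self.free_ram_gb -= job.ram_gb
        self.running.append(job)


def match_jobs(queue, nodes):
    """Give each queued job to the first node that has room for it.

    Returns the jobs that could not be placed and therefore stay idle.
    """
    idle = []
    for job in queue:
        node = next((n for n in nodes if n.can_run(job)), None)
        if node is None:
            idle.append(job)
        else:
            node.start(job)
    return idle


def square(x):
    return x * x


# Hypothetical cluster: two small execute nodes, four jobs of 4 GB each.
nodes = [ExecuteNode("node_a", free_ram_gb=8, free_cpus=2),
         ExecuteNode("node_b", free_ram_gb=4, free_cpus=1)]
queue = [Job("alice", square, (n,), ram_gb=4) for n in range(4)]

idle = match_jobs(queue, nodes)
print([len(n.running) for n in nodes], len(idle))  # [2, 1] 1
```

Three of the four jobs find a node with enough RAM and a free CPU. The fourth one has to wait until resources become available again.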
Sounds complicated? Don’t worry! This obob_condor package makes it super easy for you!
About Fairness¶
Please always bear in mind that the Cluster is a resource that you share with all your colleagues. There are, of course, ways to use the system to your advantage while putting everyone else at a disadvantage. Please just do not! This system works best when everybody has everybody else in mind. And it also increases your Karma™.
As I wrote before, the HNC Condor Master Node collects all the jobs from all the users who want to use the Cluster and then distributes them to the Execute Nodes. It tries to be as fair as possible in the distribution of the jobs. For example, if two people are submitting jobs at the same time, it will make sure that each of them gets half of the resources. However, the Master Node cannot guess how many resources your jobs need. So, you need to tell it and try to be as exact as possible.
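The "both get half" idea can be sketched as a simple round-robin over the users who have jobs waiting. Please read this as a simplified illustration only: the function `fair_share` and the numbers are invented for this example, and the way HTCondor actually negotiates between users is more involved than this.

```python
from collections import deque


def fair_share(pending, slots):
    """Hand out `slots` job slots round-robin across users.

    `pending` maps a user name to the number of jobs that user has queued.
    Returns a dict mapping each user to the number of jobs that get to run.
    """
    granted = {user: 0 for user in pending}
    waiting = deque(user for user, n in pending.items() if n > 0)
    while slots > 0 and waiting:
        user = waiting.popleft()
        granted[user] += 1
        slots -= 1
        if granted[user] < pending[user]:
            waiting.append(user)  # still has jobs left, so back in line
    return granted


# Two users submitting lots of jobs at the same time: each gets half.
print(fair_share({"alice": 100, "bob": 100}, slots=40))
# {'alice': 20, 'bob': 20}

# If one user only has a few jobs, the rest goes to the other user.
print(fair_share({"alice": 100, "bob": 5}, slots=40))
# {'alice': 35, 'bob': 5}
```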
At the moment, the only thing you need to tell the Cluster is how much RAM your job will need. If your job consumes more RAM than you requested, it will be put on hold, which means that it will stop being executed. If you request too much RAM, fewer of your jobs will run at the same time.
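A bit of arithmetic shows why the RAM request matters in both directions. The sketch below assumes a hypothetical Execute Node with 64 GB of RAM and ignores CPU limits. The helper names are made up for this example.

```python
def jobs_that_fit(node_ram_gb, requested_ram_gb):
    """How many jobs fit on one node if each reserves `requested_ram_gb`."""
    return int(node_ram_gb // requested_ram_gb)


def would_be_held(peak_ram_gb, requested_ram_gb):
    """True if the job ends up using more RAM than it asked for."""
    return peak_ram_gb > requested_ram_gb


NODE_RAM_GB = 64  # hypothetical node size, not the real cluster hardware

# The more RAM each job requests, the fewer jobs run side by side.
for requested in (2, 8, 32):
    print(requested, "GB per job ->", jobs_that_fit(NODE_RAM_GB, requested), "jobs")

# A job that needs 3 GB but only asked for 2 GB would be put on hold.
print(would_be_held(peak_ram_gb=3, requested_ram_gb=2))  # True
```

So the goal is a request that is slightly above what your job really needs: high enough that it is not put on hold, low enough that many of your jobs can run at once.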
How do I use the Cluster?¶
If you want to use the cluster, all you need to do is to connect to your Bomber.
What is going on on the cluster?¶
To be able to monitor what is going on on the cluster, you first need to open a terminal. To do this, click on the “Applications Menu” and then on “Terminal Emulation”. In the new window enter this command:
watch -n4 'condor_q -global'
You will see something like this:
-- Schedd: cf000016.sbg.ac.at : <141.201.106.7:9618?... @ 06/01/22 08:46:49
OWNER BATCH_NAME SUBMITTED DONE RUN IDLE HOLD TOTAL JOB_IDS
bAAAAAAA ID: 46449 6/1 08:29 8 4 _ _ 12 46449.0-3
bBBBBBBB ID: 46450 6/1 08:32 _ 1 _ _ 1 46450.0
bCCCCCCC ID: 46451 6/1 08:46 _ 30 _ _ 30 46451.0-29
Total for query: 35 jobs; 0 completed, 0 removed, 0 idle, 35 running, 0 held, 0 suspended
Total for all users: 35 jobs; 0 completed, 0 removed, 0 idle, 35 running, 0 held, 0 suspended
As you can see, three Job Clusters, submitted by three different users, are currently running. Here is what some of the individual columns mean:
Column | Description |
---|---|
OWNER | The username of the person who submitted the job |
BATCH_NAME | Every Job Cluster gets an ID, which is shown here |
SUBMITTED | Date and time at which the job was submitted |
DONE | How many of the jobs have completed, successfully or not! |
RUN | How many jobs are currently running |
IDLE | How many jobs are waiting for resources to become available |
HOLD | How many jobs are on hold. If something went wrong, HTCondor might put jobs in this state. |
TOTAL | Total number of jobs in the Job Cluster |
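If you would rather check these numbers from a script than watch the terminal, the summary line at the bottom is easy to pick apart. The following is a small sketch that assumes the line looks exactly like the "Total for query" line in the sample output above. `parse_totals` is a helper written for this example and is not part of HTCondor or obob_condor.

```python
import re

# Matches e.g. "Total for query: 35 jobs; 0 completed, 0 removed, ..."
TOTALS = re.compile(
    r"Total for (?P<scope>query|all users): (?P<jobs>\d+) jobs; (?P<rest>.*)"
)


def parse_totals(line):
    """Turn a condor_q summary line into (scope, {state: count}).

    Returns None if the line is not a summary line.
    """
    match = TOTALS.match(line.strip())
    if match is None:
        return None
    counts = {"jobs": int(match.group("jobs"))}
    for part in match.group("rest").split(","):
        number, state = part.split()
        counts[state] = int(number)
    return match.group("scope"), counts


# The summary line copied from the sample output above.
sample = ("Total for query: 35 jobs; 0 completed, 0 removed, 0 idle, "
          "35 running, 0 held, 0 suspended")

scope, counts = parse_totals(sample)
print(scope, counts["running"], counts["idle"])  # query 35 0
```

On your Bomber, you could feed it the lines printed by `condor_q -global`, for instance after capturing them with Python's `subprocess` module.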
How do I submit my first job?¶
Go to How to use obob_condor and find out!