An introduction to Cluster Computing

Introduction

Welcome to the introduction to the HNC Condor. To use it efficiently, you need to be familiar with a few concepts. I will try to keep things as simple and short as possible and provide examples. Please bear in mind, however, that the information on this page is the absolute minimum! So make sure you understand it, or drop by and ask. The introduction covers the following aspects:

  1. What is a cluster?

  2. How do I use the cluster?

What is a cluster?

General Remarks

In one short sentence: a cluster is a bunch of computers that are used by a bunch of people. These computers provide resources, in our case CPU power and memory, to all users. These resources have to be distributed among the users as efficiently and fairly as possible.

Here is a picture of what the system looks like:

[Figure: network topology]

As you can see, your computer is connected to your Personal Analysis Machine (a.k.a. Bomber). All the Bombers are connected to the HNC Condor Master Node. Through this connection, your Bomber asks the HNC Condor Master Node to run a job. You can think of a job as a Python function or method that takes some parameters. The Master Node also needs some additional information, for instance, how much RAM your job will need. The Master Node collects all your jobs and the jobs of everybody else who wants to compute things on the cluster. It then asks the HNC Condor Execute Nodes whether they have the resources for the jobs (i.e., enough RAM and CPU). If one of them says yes, it gets one of the jobs and executes it.
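
The matching step described above can be pictured with a small sketch. This is only a toy model for illustration: the classes `Job` and `ExecuteNode` and the function `match` are hypothetical names, not part of HTCondor or obob_condor, and the real Master Node is considerably more sophisticated.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Job:
    """A unit of work plus the resources it asks for (hypothetical model)."""
    owner: str
    ram_gb: float
    cpus: int = 1


@dataclass
class ExecuteNode:
    """An Execute Node and the resources it still has free (hypothetical model)."""
    name: str
    free_ram_gb: float
    free_cpus: int

    def can_run(self, job: Job) -> bool:
        # The node "says yes" only if both RAM and CPU are sufficient.
        return self.free_ram_gb >= job.ram_gb and self.free_cpus >= job.cpus

    def start(self, job: Job) -> None:
        # Reserve the requested resources while the job is running.
        self.free_ram_gb -= job.ram_gb
        self.free_cpus -= job.cpus


def match(job: Job, nodes: List[ExecuteNode]) -> Optional[str]:
    """Return the name of the first node that accepts the job, or None if none can."""
    for node in nodes:
        if node.can_run(job):
            node.start(job)
            return node.name
    return None  # no node has enough resources, so the job has to wait


# Made-up node sizes, chosen only for this example.
nodes = [
    ExecuteNode("node1", free_ram_gb=4, free_cpus=2),
    ExecuteNode("node2", free_ram_gb=16, free_cpus=8),
]

small = match(Job("alice", ram_gb=2), nodes)   # fits on node1
large = match(Job("bob", ram_gb=8), nodes)     # too big for node1, fits on node2
huge = match(Job("carol", ram_gb=64), nodes)   # fits nowhere
```

In this toy model, a job that fits nowhere simply gets `None` back; on the real cluster such a job would wait until resources become available.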

Sounds complicated? Don’t worry! The obob_condor package makes it super easy for you!

About Fairness

Please always bear in mind that the Cluster is a resource that you share with all your colleagues. There are, of course, ways to use the system to your advantage while putting everyone else at a disadvantage. Please just don't do that! This system works best when everybody has everybody else in mind. And it also increases your Karma™.

As I wrote before, the HNC Condor Master Node collects all the jobs from all the users who want to use the Cluster and then distributes them to the Execute Nodes. It tries to be as fair as possible when distributing the jobs. For example, if two people submit jobs at the same time, it makes sure that each of them gets half of the resources. However, the Master Node cannot guess how many resources your jobs need. So you need to tell it, and try to be as exact as possible.
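
To make the "both get half" idea concrete, here is a deliberately simplified sketch that hands out free slots to users in turn. The function `fair_share` is a hypothetical name invented for this example; HTCondor's actual negotiation is more involved (it takes user priorities into account, for instance), so treat this as an illustration of the principle only.

```python
from collections import deque
from typing import Dict, List


def fair_share(queues: Dict[str, List[str]], slots: int) -> List[str]:
    """Pick up to `slots` jobs, taking one job per user in turn (round-robin)."""
    # Only users who actually have pending jobs take part.
    pending = {user: deque(jobs) for user, jobs in queues.items() if jobs}
    chosen: List[str] = []
    while pending and len(chosen) < slots:
        for user in list(pending):
            if len(chosen) == slots:
                break
            chosen.append(pending[user].popleft())
            if not pending[user]:
                # This user has no more jobs; the others share what is left.
                del pending[user]
    return chosen


# Two users submit at the same time and four slots are free:
# each of them gets two slots.
both_busy = fair_share(
    {"alice": ["a1", "a2", "a3"], "bob": ["b1", "b2", "b3"]}, slots=4
)

# If bob only has one job, alice may use the slots he does not need.
one_small = fair_share({"alice": ["a1", "a2", "a3"], "bob": ["b1"]}, slots=4)
```

Note that in this sketch nobody loses anything when a colleague submits only a few jobs: unused slots go to whoever still has work waiting.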

At the moment, the only thing you need to tell the Cluster is how much RAM your job will need. If your job consumes more RAM than you specified, it will be put on hold, which means that it will stop being executed. If you specify too much RAM, fewer of your jobs will run at the same time.
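
A bit of arithmetic shows why the RAM request matters in both directions. The node size of 64 GB below is an assumption made up for this example, not the actual size of the Execute Nodes, and the helper functions are hypothetical. The sketch also ignores CPU limits.

```python
def concurrent_jobs(node_ram_gb: float, requested_ram_gb: float) -> int:
    """How many jobs with this RAM request fit on one node at the same time."""
    return int(node_ram_gb // requested_ram_gb)


def is_put_on_hold(used_ram_gb: float, requested_ram_gb: float) -> bool:
    """A job that uses more RAM than it requested is put on hold."""
    return used_ram_gb > requested_ram_gb


NODE_RAM_GB = 64  # assumed node size, for illustration only

# Requesting 2 GB per job lets many jobs run side by side ...
fits_with_2gb = concurrent_jobs(NODE_RAM_GB, 2)
# ... while requesting 16 GB "just to be safe" leaves room for far fewer.
fits_with_16gb = concurrent_jobs(NODE_RAM_GB, 16)

# Requesting too little is no solution either: a job that really needs
# 3 GB but asked for 2 GB would end up on hold.
too_little = is_put_on_hold(used_ram_gb=3, requested_ram_gb=2)
just_right = is_put_on_hold(used_ram_gb=3, requested_ram_gb=4)
```

So the goal is a request that is slightly above what the job really needs: high enough to avoid being put on hold, low enough that many jobs can run in parallel.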

How do I use the Cluster?

If you want to use the cluster, all you need to do is to connect to your Bomber.

What is going on on the cluster?

To be able to monitor what is going on on the cluster, you first need to open a terminal. To do this, click on the “Applications Menu” and then on “Terminal Emulation”. In the new window enter this command:

watch -n4 'condor_q -global'

You will see something like this:

-- Schedd: cf000016.sbg.ac.at : <141.201.106.7:9618?... @ 06/01/22 08:46:49
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
bAAAAAAA ID: 46449    6/1  08:29      8      4      _      _     12 46449.0-3
bBBBBBBB ID: 46450    6/1  08:32      _      1      _      _      1 46450.0
bCCCCCCC ID: 46451    6/1  08:46      _     30      _      _     30 46451.0-29

Total for query: 35 jobs; 0 completed, 0 removed, 0 idle, 35 running, 0 held, 0 suspended 
Total for all users: 35 jobs; 0 completed, 0 removed, 0 idle, 35 running, 0 held, 0 suspended

As you can see, 3 Job Clusters, submitted by 3 different users, are currently running. Here is what some of the individual columns mean:

Column      Description
OWNER       The username of the person who submitted the job
BATCH_NAME  The ID assigned to the Job Cluster
SUBMITTED   Date and time at which the job was submitted
DONE        How many of the jobs have completed, successfully or not!
RUN         How many jobs are currently running
IDLE        How many jobs are waiting for resources to become available
HOLD        How many jobs are on hold. If something went wrong, HTCondor might put jobs in this state. You can use condor_q -hold to find out what happened.
TOTAL       Total number of jobs in the Job Cluster
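
If you would rather work with these numbers in Python than read them off the terminal, the rows shown above can be split into fields. The following is a minimal sketch that only handles output looking exactly like the sample: it assumes that every BATCH_NAME has the form "ID: <number>" and that "_" stands for zero. The function names are made up for this example.

```python
from typing import Dict, List, Union

# The header and job rows copied from the sample output above.
SAMPLE = """\
OWNER    BATCH_NAME    SUBMITTED   DONE   RUN    IDLE   HOLD  TOTAL JOB_IDS
bAAAAAAA ID: 46449    6/1  08:29      8      4      _      _     12 46449.0-3
bBBBBBBB ID: 46450    6/1  08:32      _      1      _      _      1 46450.0
bCCCCCCC ID: 46451    6/1  08:46      _     30      _      _     30 46451.0-29
"""


def to_int(field: str) -> int:
    # Assumption: "_" in a count column means zero.
    return 0 if field == "_" else int(field)


def parse_rows(text: str) -> List[Dict[str, Union[str, int]]]:
    """Turn the job rows into dictionaries, skipping the header line."""
    rows = []
    for line in text.splitlines()[1:]:
        parts = line.split()
        if len(parts) != 11:
            continue  # not a job row in the format assumed here
        owner, _, cluster_id, date, clock, done, run, idle, hold, total, job_ids = parts
        rows.append({
            "owner": owner,
            "cluster_id": int(cluster_id),
            "submitted": f"{date} {clock}",
            "done": to_int(done),
            "run": to_int(run),
            "idle": to_int(idle),
            "hold": to_int(hold),
            "total": to_int(total),
            "job_ids": job_ids,
        })
    return rows


rows = parse_rows(SAMPLE)
running = sum(row["run"] for row in rows)
held = sum(row["hold"] for row in rows)
```

For the sample, `running` adds up to the 35 running jobs reported in the "Total for query" line.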

How do I submit my first job?

Go to How to use obob_condor and find out!