A cluster, or compute cluster, is a group of computers connected by a network. At least one of these computers is singled out as a “head node” or “login node” that the users of the cluster can log in to. The other computers are generally referred to as “compute nodes”. The head node runs job management software that allows users to send “jobs” (in the form of scripts to be executed) to the compute nodes. Users can specify the resources (number of cores, amount of memory) that each job needs in order to run.
The job management software keeps track of resource usage on each compute node, such as how many processor cores and how much memory are currently in use, and therefore how much remains available. A job submitted to the cluster may have to wait in a “queue” if insufficient resources are available. The job is then run when resources become available.
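The bookkeeping described above can be illustrated with a small sketch. This is a toy model only, not how any particular job manager is implemented: the names `Job`, `Node` and `ToyScheduler` are hypothetical, and the strict first-come, first-served policy is an assumption made to keep the example short. Real job managers typically use more elaborate scheduling policies.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job description: a name plus the resources it requests.
    name: str
    cores: int
    memory_gb: int


@dataclass
class Node:
    # Hypothetical compute node with fixed capacity and current usage.
    name: str
    total_cores: int
    total_memory_gb: int
    used_cores: int = 0
    used_memory_gb: int = 0

    def can_fit(self, job):
        # True if the node has enough free cores and free memory for the job.
        free_cores = self.total_cores - self.used_cores
        free_memory = self.total_memory_gb - self.used_memory_gb
        return free_cores >= job.cores and free_memory >= job.memory_gb


class ToyScheduler:
    def __init__(self, nodes):
        self.nodes = nodes
        self.queue = deque()   # jobs waiting for resources, oldest first
        self.running = {}      # job name -> (job, node it runs on)

    def submit(self, job):
        # New jobs join the back of the queue, then we try to start jobs.
        self.queue.append(job)
        self._dispatch()

    def finish(self, job_name):
        # Release the finished job's resources, then try to start waiting jobs.
        job, node = self.running.pop(job_name)
        node.used_cores -= job.cores
        node.used_memory_gb -= job.memory_gb
        self._dispatch()

    def _dispatch(self):
        # Assumed policy: strict first-come, first-served. Start jobs from the
        # front of the queue and stop at the first one that fits nowhere.
        while self.queue:
            job = self.queue[0]
            node = next((n for n in self.nodes if n.can_fit(job)), None)
            if node is None:
                break
            self.queue.popleft()
            node.used_cores += job.cores
            node.used_memory_gb += job.memory_gb
            self.running[job.name] = (job, node)


scheduler = ToyScheduler([Node("node1", total_cores=4, total_memory_gb=16)])
scheduler.submit(Job("a", cores=3, memory_gb=8))   # starts immediately
scheduler.submit(Job("b", cores=2, memory_gb=4))   # only 1 core free, so it waits
scheduler.finish("a")                              # frees 3 cores, so "b" can start
```

In this sketch, job "b" sits in the queue while job "a" occupies three of the four cores, and it is started as soon as "a" finishes and its cores are released.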
The job management software provides tools for checking which jobs are running or waiting in the queue, checking the outcome of completed jobs, and cancelling jobs.
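As a rough illustration of what such tools do, the following sketch models a table of jobs that can be listed, inspected and cancelled. The class `JobTable`, its method names and the state names are all hypothetical and chosen for this example; actual job managers each have their own commands and terminology.

```python
from enum import Enum


class State(Enum):
    # Hypothetical job states for this sketch.
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


class JobTable:
    def __init__(self):
        self._jobs = {}     # job id -> {"name", "state", "exit_code"}
        self._next_id = 1

    def add(self, name):
        # Register a newly submitted job and return its id.
        job_id = self._next_id
        self._next_id += 1
        self._jobs[job_id] = {"name": name, "state": State.QUEUED, "exit_code": None}
        return job_id

    def start(self, job_id):
        self._jobs[job_id]["state"] = State.RUNNING

    def complete(self, job_id, exit_code):
        self._jobs[job_id]["state"] = State.COMPLETED
        self._jobs[job_id]["exit_code"] = exit_code

    def list_active(self):
        # Jobs that are running or still waiting in the queue.
        active = (State.QUEUED, State.RUNNING)
        return [(job_id, job["name"], job["state"])
                for job_id, job in self._jobs.items()
                if job["state"] in active]

    def outcome(self, job_id):
        # State and exit code of a job (exit code is None until it completes).
        job = self._jobs[job_id]
        return job["state"], job["exit_code"]

    def cancel(self, job_id):
        # Only queued or running jobs can be cancelled; returns True on success.
        job = self._jobs[job_id]
        if job["state"] in (State.QUEUED, State.RUNNING):
            job["state"] = State.CANCELLED
            return True
        return False


table = JobTable()
first = table.add("align_reads")
second = table.add("call_variants")
table.start(first)
table.complete(first, exit_code=0)
table.cancel(second)
```

Here the first job runs to completion, so its outcome can be inspected afterwards, while the second is cancelled before it ever leaves the queue.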