Parallel jobs¶
One of the main advantages of using an HPC system is the ability to utilise its large compute power to run jobs in parallel.
Important
Before running parallel jobs, consult your application's documentation to find out whether it supports parallel execution. Most modern applications support some level of parallelism, and many scientific software tools have -p or -t options to specify the number of CPUs to use when running in parallel. Applications that can make use of multiple nodes are less common.
If an application does not support parallelism, requesting additional resources will not improve performance and will likely lead to longer waiting times before your job is scheduled. It also wastes resources, which are allocated to your job but left unused.
If you request multiple CPUs/nodes for your job, it's a good idea to check how effectively your job uses them using sacct, as discussed in the job monitoring section.
Most applications do not scale infinitely and will reach a point where the marginal impact of allocating more resources is minimal.
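A common way to quantify this is Amdahl's law (included here for illustration; it is not part of the course example files): if a fraction p of a program's work can be parallelised, the best possible speedup on n CPUs is

S(n) = 1 / ((1 - p) + p / n)

For example, if 90% of the work is parallel (p = 0.9), even with unlimited CPUs the speedup can never exceed 1 / (1 - 0.9) = 10x, and going from 8 to 16 CPUs only improves it from roughly 4.7x to 6.4x.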
There are four types of parallel execution we'll discuss here, but we won't cover all of them in detail:
- Shared Memory Parallelism (SMP) / Multithreading
- Distributed Memory Parallelism / Multiprocessing
- Job Arrays
- GPUs
Before we delve deeper into each of these, let's clarify some terminology:
Processes: A process is an instance of a running program. It has its own memory space and resources allocated by the operating system. Processes are independent of each other and do not share memory, except through inter-process communication mechanisms.
Threads: Threads are units of execution (or CPU utilisation) within a process. A process can have multiple threads, and these threads share the same memory space. Threads within the same process can communicate directly with each other. Threads are often used to perform multiple tasks concurrently within a single program.
Tasks / Jobs: In the context of SLURM and other job scheduling systems, a task / job typically refers to a unit of work that can be executed independently of other units of work. Each task may correspond to a single process or a group of processes, depending on how the job is configured. So far, each time we've submitted something to the queue, we've created one task / job.
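To make the process/thread distinction concrete, here is a minimal C sketch (illustrative only, not one of the course example files) in which several threads inside one process write to the same shared array - something separate processes could not do without explicit inter-process communication:

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* One array in the process's memory, visible to every thread. */
int shared[NUM_THREADS];

void *worker(void *arg)
{
    int id = *(int *)arg;
    shared[id] = id * id;   /* each thread writes to its own slot */
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    for (int i = 0; i < NUM_THREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NUM_THREADS; i++)
        printf("shared[%d] = %d\n", i, shared[i]);
    return 0;
}

Compiled with gcc -pthread, all four threads update the same array directly; if these were separate processes, each would have its own private copy.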
Multithreaded/multicore (SMP) jobs¶
Shared Memory Parallelism, as the name suggests, is a form of parallelism where all parallel threads of execution have access to a block of shared memory. We can use this shared memory to store objects which multiple threads need access to and to allow the threads to communicate.
This can also be referred to as multithreading as the initial single thread of a process forks into a number of parallel threads.
This approach will generally use a library such as OpenMP (Open MultiProcessing), TBB (Threading Building Blocks), or pthreads (POSIX threads).
Parallel programming
Writing our own code which makes use of parallelism is a complex topic and is beyond the scope of this training. We use some minimal examples to help explain how different models of parallelism work, but don't worry if you don't fully understand all the examples.
As an example, let's run a minimal parallel C program, called omp_hello.c, that prints "Hello World" from each of a number of threads. A compiled version of this program is available at /datasets/hpc_training/utils/omp_hello.
Example files
Most of the example files we use in this section can be found on CREATE at /datasets/hpc_training/.
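For a flavour of what such a program looks like, here is a minimal OpenMP "hello world" in C - a sketch only, which may differ from the actual source of omp_hello.c:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Fork a team of threads; each thread runs the block below. */
    #pragma omp parallel
    {
        printf("Hello World from OpenMP thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

A program like this would be compiled with OpenMP support enabled, e.g. gcc -fopenmp omp_hello.c -o omp_hello.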
We need to create a shell script that requests appropriate resources to run the program:
#!/bin/bash -l
#SBATCH --job-name=omp_hello
#SBATCH --partition=cpu
#SBATCH --reservation=cpu_introduction
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH -t 0-0:02 # time (D-HH:MM)
/datasets/hpc_training/utils/omp_hello
Hint
You can also request memory per CPU, rather than per node, using the --mem-per-cpu option. --mem and --mem-per-cpu are mutually exclusive, meaning you can use one or the other in your resource request.
We submit the job with the following command:
sbatch run_omp_hello.sh
One possible output would be:
Hello World from OpenMP thread 2 of 4
Hello World from OpenMP thread 3 of 4
Hello World from OpenMP thread 0 of 4
Hello World from OpenMP thread 1 of 4
Note here that the lines don't come out in any particular order - each time you run the program you might end up with a different result. This is because the program doesn't make any attempt to synchronise the printing of each line; it just executes all of them in parallel.
Order of execution
When using parallelism, we cannot rely on threads / processes reaching a particular line of code in any particular order, though each individual thread will still execute its own instructions in the expected order. If we need things to happen in a particular order - for example, if we wanted the threads to say hello in sequence - we need to enforce this using synchronisation.
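As an illustration of one such mechanism (a sketch, not one of the course example files), OpenMP's ordered construct forces a marked block to run in loop-iteration order, even though the iterations are shared out across threads:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* The "ordered" clause permits an ordered block inside the loop. */
    #pragma omp parallel for ordered
    for (int i = 0; i < 4; i++) {
        /* This block executes in iteration order: 0, 1, 2, 3. */
        #pragma omp ordered
        printf("Hello from iteration %d, run by thread %d of %d\n",
               i, omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

Synchronisation like this has a cost: threads must wait their turn, which removes much of the benefit of running in parallel, so it should be used sparingly.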
Python example¶
Let's consider the slightly more complex example of a Python script, called squares_numba.py, which calculates the squares of one billion numbers using multithreading - again with OpenMP, but this time via the Numba library.
import time
import numba
import numpy as np

@numba.jit(parallel=True)
def calculate_squares(n):
    squares = np.zeros(n)
    # Square numbers in parallel using Numba
    for i in numba.prange(n):
        # print("Hello from thread", numba.get_thread_id())
        squares[i] = i ** 2
    return squares

if __name__ == "__main__":
    start_time = time.time()
    squares = calculate_squares(1_000_000_000)
    end_time = time.time()
    print("Used {} threads".format(numba.get_num_threads()))
    print("Took {:.4f} seconds".format(end_time - start_time))
The output should be something like:
Used 8 threads
Took 1.6164 seconds
The key steps in this code are:
- Import the necessary libraries: time to measure the execution time, NumPy for numerical computation, and Numba for parallelisation.
- Define a function calculate_squares() which calculates the squares of numbers from one to one billion - computers can square numbers very quickly, so this needs to be a large number.
- Use Numba's @jit decorator to enable JIT compilation, with the parallel=True option to enable parallel execution using OpenMP. Python libraries like Numba use OpenMP internally to parallelise computations. Numba uses a compiler called LLVM to compile our Python code - when we ask for parallel execution with @jit(parallel=True), the compiled code uses OpenMP just like our C example did.
- Use numba.prange for our loop instead of the usual range, which causes it to be executed in parallel across all available threads. Each thread is given an approximately equal share of the loop iterations to execute.
- In the main block, call the calculate_squares() function and time how long it takes to run.
To execute the above code on CREATE, we can use submit_squares.sh:
#!/bin/bash
#SBATCH --job-name=squares_numba
#SBATCH --partition=cpu
#SBATCH --reservation=cpu_introduction
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4
#SBATCH --mem=2G
#SBATCH -t 0-0:02 # time (D-HH:MM)
# Load any required modules
module load python/3.9.12-gcc-10.3.0
# Activate virtual environment
source numba_env/bin/activate
python /datasets/hpc_training/DH-RSE/scripts/squares_numba.py
But before we run that, we'll need to install the necessary packages (in this case NumPy and Numba) in a virtual environment. We load the same Python module first so the environment is created with the matching interpreter:
module load python/3.9.12-gcc-10.3.0
python -m venv numba_env
source numba_env/bin/activate
pip install numba numpy
We can then submit the job:
sbatch submit_squares.sh
Array jobs¶
Array jobs are a feature provided by job schedulers like SLURM and offer a mechanism for quickly and easily submitting and managing collections of similar tasks (each task corresponds to a single "job" within the "job array"). Each task in the job array runs independently. Job arrays are useful for running multiple instances of the same job with different input parameters or configurations. Tasks in a job array may or may not communicate with each other, depending on the specific requirements of the job.
Unlike in MPI or multithreading, the parallelisation is orchestrated at the level of the submission script rather than within your program (e.g. a Python script). Each task in the job array is an independent instance of the program: the scheduler launches a separate invocation of the submission script for each task index, and these run in parallel. This approach lets you parallelise execution across multiple inputs without modifying the program itself, making it a flexible and convenient method for running parallel workloads on HPC systems.
All tasks will have the same initial options (e.g. memory, number of CPUs, runtime, etc.) and will run the same commands. Using array jobs is an easy way to parallelise your workloads, as long as the following is true:
- Each array task can run independently of the others and there are no dependencies between the different components (an embarrassingly parallel problem).
- There is no requirement for all of the array tasks to run simultaneously.
- You can somehow link the array task ID (SLURM_ARRAY_TASK_ID) to your data or to the execution of your application.
To define an array job you use the --array=range[:step][%max_active] option:
- range defines the index values and can consist of a comma-separated list and/or a range of values with a "-" separator, e.g. 1,2,3,4, or 1-4, or 1,2-4
- step defines the increment between the index values, e.g. 0-15:4 would be equivalent to 0,4,8,12
- max_active defines the number of simultaneously running tasks at any given time, e.g. 1-10%2 means only two array tasks can run simultaneously for the given array job
A sample array job is given below:
#!/bin/bash -l
#SBATCH --job-name=array-sample
#SBATCH --partition=cpu
#SBATCH --reservation=cpu_introduction
#SBATCH --ntasks=1
#SBATCH --mem=1G
#SBATCH -t 0-0:02 # time (D-HH:MM)
#SBATCH --array=1-3
echo "Array job - task id: $SLURM_ARRAY_TASK_ID"
Submit the array job using:
sbatch submit_array.sh
Info
When the job starts running, a separate job ID in the format jobid_taskid will be assigned to each of the tasks. As a result, the array job will produce a separate log file for each of the tasks, i.e. you will see multiple files named slurm-jobid_taskid.out.
A more realistic example¶
For a more realistic example, let's revisit the Python script we used earlier to identify the most frequent words in a text file. It would be useful to be able to run this on many input text files, without having to modify the Python script or manually submit a job for each input. This is a great use case for array jobs.
Here's our original submission script, which specifies a single text file as input.
#! /bin/bash -l
#SBATCH --job-name=top_words
#SBATCH --partition=cpu
#SBATCH --reservation=cpu_introduction
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH -t 0-0:10 # time (D-HH:MM)
module load python/3.11.6-gcc-13.2.0
source top_words_env/bin/activate
python top_words.py paradise-lost.txt 20
There are multiple text files we can use as input in the /datasets/hpc_training/DH-RSE/data/ folder.
We'll use ls to create a list of these input files and save it to a file:
ls /datasets/hpc_training/DH-RSE/data/*.txt > input_files.txt
We can now use the head and tail commands to pull out individual lines of the file. For example, to extract the third line:
head -n 3 input_files.txt | tail -n 1
The head command extracts the first n lines of a file (here n = 3). We use the pipe | to pass this output directly to the tail command, which extracts the last n lines of its input. Here we specify n = 1 to extract just the last line returned by head, which is the third line of the input file.
In the submission script, we can use $SLURM_ARRAY_TASK_ID to extract the corresponding file name.
Here's what our updated submission script looks like:
#! /bin/bash -l
#SBATCH --job-name=top_words
#SBATCH --partition=cpu
#SBATCH --reservation=cpu_introduction
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH -t 0-0:10 # time (D-HH:MM)
#SBATCH --array=1-10
module load python/3.11.6-gcc-13.2.0
source top_words_env/bin/activate
input_file=$(head -n $SLURM_ARRAY_TASK_ID input_files.txt | tail -n 1)
echo "Analysing $input_file"
python top_words.py "$input_file" 20
Submit the array job with:
sbatch submit_top_words_array.sh
You can then monitor the jobs with squeue --me.
Output files will be produced for each item in the array.
Distributed Memory Parallelism (DMP) / MPI jobs¶
Sometimes you might want to utilise resources on multiple nodes simultaneously to perform computations. This is possible using the Message Passing Interface (MPI). In MPI, multiple independent processes run concurrently on separate nodes or processors. Each process has its own memory space. Communication between processes is achieved through explicit message passing: processes send and receive messages via MPI function calls.
As mentioned earlier, requesting the resources by itself will not make your application run in parallel - the application has to support parallel execution. MPI is the most common mechanism used by research software for parallel execution across multiple processes.
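To give a flavour of what explicit message passing looks like, here is a minimal C sketch (illustrative only, not one of the course example files) in which process 0 sends a single integer to process 1:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

    if (rank == 0) {
        value = 42;
        /* Process 0 explicitly sends one integer to process 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 explicitly receives it - the processes share no memory. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

A program like this is compiled with an MPI compiler wrapper (e.g. mpicc) and needs at least two processes to run, e.g. mpirun -np 2 ./send_recv.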
Although MPI programming is beyond the scope of this course, if your application uses or supports MPI then it can be executed on multiple nodes in parallel. For example, given the following submission script:
#!/bin/bash -l
#SBATCH --job-name=multinode-test
#SBATCH --partition=cpu
#SBATCH --nodes=2
#SBATCH --ntasks=16
#SBATCH --mem=2G
#SBATCH -t 0-0:05 # time (D-HH:MM)
module load openmpi/4.1.3-gcc-10.3.0-python3+-chk-version
mpirun /datasets/hpc_training/utils/mpi_hello
Which would be submitted using:
sbatch test_mpi.sh
A sample output would be:
Hello world from process 11 of 16 on host erc-hpc-comp006
Hello world from process 2 of 16 on host erc-hpc-comp005
Hello world from process 15 of 16 on host erc-hpc-comp006
Hello world from process 13 of 16 on host erc-hpc-comp006
Hello world from process 12 of 16 on host erc-hpc-comp006
Hello world from process 1 of 16 on host erc-hpc-comp005
Hello world from process 14 of 16 on host erc-hpc-comp006
Hello world from process 0 of 16 on host erc-hpc-comp005
Hello world from process 3 of 16 on host erc-hpc-comp005
Hello world from process 9 of 16 on host erc-hpc-comp006
Hello world from process 7 of 16 on host erc-hpc-comp005
Hello world from process 6 of 16 on host erc-hpc-comp005
Hello world from process 10 of 16 on host erc-hpc-comp006
Hello world from process 8 of 16 on host erc-hpc-comp006
Hello world from process 4 of 16 on host erc-hpc-comp005
Hello world from process 5 of 16 on host erc-hpc-comp005
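For reference, a program producing output like the above might look roughly like the following sketch (which may differ from the actual mpi_hello source):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                 /* start the MPI runtime */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id (0..size-1) */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of processes */
    MPI_Get_processor_name(host, &len);     /* name of the node we run on */

    printf("Hello world from process %d of %d on host %s\n",
           rank, size, host);

    MPI_Finalize();
    return 0;
}

Note how the 16 processes are spread across the two requested nodes (erc-hpc-comp005 and erc-hpc-comp006) - unlike the threads in the OpenMP example, which all lived inside a single process on one node.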
Exercises - parallel jobs and benchmarking¶
Work through the exercises in this section to practice submitting parallel jobs.