
Introduction

Many research projects involve processing large datasets or running computationally intensive pipelines on high performance computing (HPC) clusters. An HPC cluster often has several associated data storage locations, which can make it hard to know where, and how, best to store your data and code. This workshop introduces principles for organising your data and code, as well as tools for moving data between locations and for managing your code.

Learning objectives: After this training, you should be able to:

  • Identify the most appropriate location to store different kinds of data
  • Move data between locations as needed
  • Use version control to manage code, configuration and job submission scripts

Target audience: We expect this workshop to be useful if you:

  • Have logged in and submitted compute jobs on CREATE HPC
  • Are working with data on CREATE HPC
  • Aren’t sure how best to manage your data to balance secure long-term storage with easy access on the HPC

Prerequisites: We expect attendees at this workshop to:

Scenario

You have just joined a new research group and have been asked to work on a project that was started by a previous lab member, Alex. Alex has just finished their PhD and started a new job in Australia, so they'll be hard to get hold of for questions about the project! The aim of the project is to identify "blobs" in images, and measure some characteristics of these blobs. The data and code for the project are available in the /scratch/prj/hpc_training/example-project/ directory on CREATE HPC.
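A reasonable first step with any inherited project is simply to take stock of what is there. A minimal sketch, using a mocked-up local directory so the commands can be tried anywhere (on CREATE HPC you would point them at `/scratch/prj/hpc_training/example-project/` instead; the file names below are invented):

```shell
# Mock project layout standing in for Alex's directory (hypothetical names)
mkdir -p demo-project/raw demo-project/processed demo-project/results
touch demo-project/raw/blobs_001.tif demo-project/results/summary.csv

# List every file to get an overview of the layout
find demo-project -type f | sort

# Check the total size before deciding where things should live
du -sh demo-project
```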

To get started with the project, you will need to:

  • identify the raw data, processed data, and results files from Alex's analysis
  • make sure the raw data and important results files are securely backed up
  • identify which scripts were run, and in what order, to produce the results
  • identify and install the software required for the analysis
  • make sure the original code is backed up before you make any changes to it
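The last step above can be done with git: committing the untouched code once before editing gives you a snapshot you can always diff against or roll back to. A minimal sketch, using a placeholder directory and file standing in for your working copy of Alex's project (the `-c user.*` options are only needed if git is not yet configured on the cluster):

```shell
# 'project-copy' and 'find_blobs.py' are stand-ins for the real project files
mkdir -p project-copy
echo 'print("find blobs")' > project-copy/find_blobs.py

# Turn the directory into a git repository and record the original state
git -C project-copy init -q
git -C project-copy add .
git -C project-copy -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q -m "Original code from Alex, before any changes"

# Confirm the snapshot exists
git -C project-copy log --oneline
```

From this point on, any change you make shows up in `git diff`, and `git checkout -- <file>` restores Alex's original version of a file if an edit goes wrong.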