
Introduction

Many research projects involve processing large datasets or running computationally intensive pipelines on high performance computing (HPC) clusters. An HPC cluster often has several associated data storage locations, which can make it hard to know where, and how, best to store your data and code. This workshop introduces principles for organising your data and code, as well as tools for moving data between locations and for managing your code.

Learning objectives: After this training, you should be able to:

  • Identify the most appropriate location to store different kinds of data
  • Move data between locations as needed
  • Use version control to manage code, configuration and job submission scripts

Target audience: We expect this workshop to be useful if you:

  • Have logged in and submitted compute jobs on CREATE HPC
  • Are working with data on CREATE HPC
  • Aren’t sure how best to manage your data to balance secure long-term storage with easy access on the HPC

Prerequisites: We expect attendees at this workshop to:

Scenario

You have just joined a new research group and have been asked to work on a project that was started by a previous lab member, Alex. Alex has just finished their PhD and started a new job in Australia, so they'll be hard to get hold of for questions about the project! The aim of the project is to identify "blobs" in images, and measure some characteristics of these blobs. The data and code for the project are available in the /scratch/prj/hpc_training/example-project/ directory on CREATE HPC.
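A reasonable first step with any inherited project is simply to take stock of what is there. A minimal sketch, using a mocked-up local directory so the commands can be tried anywhere (on CREATE HPC you would point them at `/scratch/prj/hpc_training/example-project/` instead; the file names below are invented):

```shell
# Mock project layout standing in for Alex's directory (hypothetical names)
mkdir -p demo-project/raw demo-project/processed demo-project/results
touch demo-project/raw/blobs_001.tif demo-project/results/summary.csv

# List every file to get an overview of the layout
find demo-project -type f | sort

# Check the total size before deciding where things should live
du -sh demo-project
```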

To get started with the project, you will need to:

  • identify the raw data, processed data, and results files from Alex's analysis
  • make sure the raw data and important results files are securely backed up
  • identify which scripts were run, and in what order, to produce the results
  • identify and install the software required for the analysis
  • make sure the original code is backed up before you make any changes to it
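The last step above can be done with git: committing the untouched code once before editing gives you a snapshot you can always diff against or roll back to. A minimal sketch, using a placeholder directory and file standing in for your working copy of Alex's project (the `-c user.*` options are only needed if git is not yet configured on the cluster):

```shell
# 'project-copy' and 'find_blobs.py' are stand-ins for the real project files
mkdir -p project-copy
echo 'print("find blobs")' > project-copy/find_blobs.py

# Turn the directory into a git repository and record the original state
git -C project-copy init -q
git -C project-copy add .
git -C project-copy -c user.name="Demo" -c user.email="demo@example.com" \
    commit -q -m "Original code from Alex, before any changes"

# Confirm the snapshot exists
git -C project-copy log --oneline
```

From this point on, any change you make shows up in `git diff`, and `git checkout -- <file>` restores Alex's original version of a file if an edit goes wrong.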