Backing up your data

We have now explored Alex's project, made our own copy of it, and reorganised the data to separate raw data from processed data and results. Before we start making changes and running new analyses, we should also make sure the project is backed up. This is important to prevent accidental data loss from mistakes or technical issues: the scratch location on the HPC is not backed up, so if you accidentally delete a file stored only on scratch, it cannot be restored.

Discussion

What should be backed up? Is there anything in the project directory that doesn't need to be backed up? What solutions are you aware of for backing up data and code?

Sample answers

Things that should be backed up include:

  • raw data - anything that can't easily be recreated
  • any processed data or results files that would be difficult or time-consuming to reproduce
  • metadata / documentation for the raw data
  • code / scripts for reproducing processed data and results

It's good practice to consider what you really need to back up, so you don't use more storage space than necessary. Using lots of storage space has costs in terms of both research funding and environmental impact. Not everything needs to be backed up, for example:

  • raw data if it's downloaded from a publicly available source or you're sure that someone else has a backed-up copy
  • large intermediate processed data files, if they can be easily regenerated

CREATE RDS

King's provides a number of options to store data during your research, including options provided by e-Research, central IT, or locally in your faculty, department, or school. The Library has a webpage with information about these options.

For large research data, especially when using the HPC, e-Research provide the CREATE Research Data Storage (RDS). All King's staff members can use RDS by requesting an RDS project; students need to ask their supervisor to request one. For every registered project, central funding provides up to 5 TB of storage.

CREATE RDS is automatically backed up, with regular snapshots and multiple independent copies of all data. The standard snapshot scheme for a project is:

  • Hourly snapshots kept for 1 day
  • Daily snapshots kept for 1 week
  • Weekly snapshots kept for 1 month
  • Monthly snapshots kept for 1 year

RDS is accessible from the login nodes of the HPC, but not from the compute nodes. Raw data therefore needs to be copied to scratch before it can be worked on, and important outputs need to be copied back to RDS.

RDS can also be mounted on your laptop or desktop for viewing and working with files locally.

Warning

CREATE RDS is not suitable for the storage of unencrypted personally identifiable or sensitive data. For this type of data, consider using a Trusted Research Environment.

Copying data from scratch to RDS

We will now copy the raw data and results from Alex's analysis to RDS, where they will be securely backed up.

For the purposes of this training we will use the /rds/prj/rds_hpc_training/ project. First create a directory for yourself inside that RDS project, since it's a shared project.

mkdir /rds/prj/rds_hpc_training/k1234567

The simplest way to copy the data would be to use the cp command. However, we will use the more powerful rsync command. Its syntax is similar to cp but it has many options to control the behaviour.

We'll copy over the full data/ directory, as well as the README file which describes how the data is organised. From your example-project/ directory, run the following commands:

rsync -avPn README data /rds/prj/rds_hpc_training/k1234567/

Here we're using some additional options to control the behaviour of rsync.

  • -a - this stands for "archive", and tells rsync to sync recursively and to preserve file modification times, owners, and permissions, amongst other things.
  • -v - this stands for "verbose", so that rsync will give us output about which files it's copying.
  • -P - this enables both --progress and --partial: rsync will show the copying progress, which can be helpful for large files, and will keep partially transferred files so that an interrupted copy can be resumed.
  • -n - this requests a "dry-run" to test the command. In combination with -v, rsync will print the list of files it would copy, without actually copying anything.
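If you'd like to try these flags somewhere safe first, the same dry-run behaviour can be demonstrated with throwaway local directories (all names below are made up for the demo, not the real RDS project):

```shell
# Build a toy source tree (demo paths only)
mkdir -p demo_src/data demo_dest
echo "sample" > demo_src/data/file1.txt
echo "notes" > demo_src/README

# Dry-run: prints the list of files that would be copied...
rsync -avPn demo_src/README demo_src/data demo_dest/

# ...but transfers nothing: the destination is still empty
ls -A demo_dest/
```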

Exercise

What's the difference between the two commands below? Do a dry-run of both and compare the output.

  1. rsync -avPn README data /rds/prj/rds_hpc_training/k1234567/
  2. rsync -avPn README data/ /rds/prj/rds_hpc_training/k1234567/
Example

rsync source dest means copy the source directory into the destination directory. rsync source/ dest, with the trailing slash, means copy the contents of the source directory into the destination directory. You can see the difference by comparing the dry-run output.

Now that we've done a dry-run to test the command, let's run it for real.

rsync -avP README data /rds/prj/rds_hpc_training/k1234567/
ls /rds/prj/rds_hpc_training/k1234567/

As the "sync" part of the name suggests, rsync is a tool for synchronising files, not just copying them. What does this mean? If you already have a backup copy of your files (in RDS) and want to bring it up to date with changes you've made in your working copy (in scratch), rsync will check which files have changed and copy over only those that have been updated.

As a demonstration, let's make a small change to our README file, for example to document the project backup location. Then let's re-run the same (dry-run) command as above.

rsync -avPn README data /rds/prj/rds_hpc_training/k1234567/

The list of files that rsync will synchronise only includes the README, not the data files, as these have not changed since we last copied them.

Warning

By default, rsync does not delete files in the destination directory, even if they have been deleted in the source directory. To tell rsync to delete files in the destination directory that are not present in the source directory, use the --delete option. If you use this option, make sure to do a dry-run first and check that the output is what you expect!

Another advantage of rsync is that you can use it to copy data from a local to a remote location, or vice versa, like scp.

For example, you could run the following command on your laptop to copy a file from your laptop into your scratch directory on the HPC.

rsync -avP local_script.py k1234567@hpc.create.kcl.ac.uk:/scratch/users/k1234567/

Checksums

According to its man page, rsync always verifies that each transferred file was correctly reconstructed on the receiving side, by checking a whole-file checksum after the transfer.

Moving large data

If you want to copy large data to or from RDS, it's recommended to use the dedicated data mover node rather than the normal HPC login nodes. SSH to the data mover node, and run your rsync commands there.

ssh k1234567@erc-hpc-dm1.create.kcl.ac.uk
...
k1234567@erc-hpc-dm1:~$ rsync -avP /rds/prj/example_project/data /scratch/prj/example_project/

More details are available in the e-Research docs.

Backing up scripts

So far we've looked at how to back up data, but code and other files associated with a project, such as configuration files, should also be backed up. Since the RDS space is not available from the HPC compute nodes, you will typically be editing and running scripts from the scratch storage. It's a good idea to copy your scripts and config files to RDS periodically, and in particular at the end of a project, to make sure the data and analysis can be linked. However, while you are actively working on your scripts there are more powerful tools available, which will not just let you back up your scripts but also help you experiment and make temporary changes to your code. This is what we'll cover next.