Data organisation

Exploring the project

The first thing we need to do is explore and try to understand the project and the work that Alex has already done. Let's log in to CREATE HPC, where they were working.

ssh k1234567@hpc.create.kcl.ac.uk

When you log in to CREATE HPC, the "message of the day" lists the data storage locations you have access to and how much space is available in each.

Quota information for k1234567

Path                                  Type      Quota GB    Usage GB    Avail GB    Usage %  Updated
------------------------------------  ------  ----------  ----------  ----------  ---------  -------------------
/rds/prj/rds_hpc_training             Group         1099           0        1099          0  2025-05-16 14:19:56
/scratch/prj/hpc_training             Group          200           0         200          0  2025-05-16 07:30:17
/scratch/users/k1234567               User           200          22         178         11  2025-05-16 07:30:16
/users/k1234567                       User            50          30          20         60  2025-05-16 07:30:16

Only files stored within /rds and /users are backed up.

Tip

This quota usage information is updated daily. If you add/delete files and want to check your quota without waiting until the next day, use the ceph_quota command. This checks the space available to you in your home directory and /scratch/.

The list of storage locations includes your home directory and personal /scratch/ space. It may also include /scratch/ space that's shared across a group or project, or project-specific /rds/ space.
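The quota table shows how full a location is, but not what is filling it. As a generic complement (standard `du`, not a CREATE-specific tool), the sketch below prints the size of each item in a directory, with the largest last. It runs in a throwaway directory created with `mktemp`, and the directory and file names are made up for illustration.

```shell
# Build a throwaway directory containing two made-up items of different sizes.
demo=$(mktemp -d)
mkdir "$demo/big" "$demo/small"
head -c 2000000 /dev/zero > "$demo/big/data.bin"
head -c 1000 /dev/zero > "$demo/small/notes.txt"

# du -sk prints the total size of each argument in kilobytes;
# sort -n orders the lines numerically so the largest item comes last.
usage=$(cd "$demo" && du -sk * | sort -n)
echo "$usage"
```

On CREATE you could point the same pipeline at a real location, for example `du -sk /scratch/users/k1234567/* | sort -n`.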

Alex worked on this project in their home directory, and copied their work to a shared group directory in the /scratch/ space when they finished their PhD. Let's navigate to the /scratch/prj/hpc_training/example-project/ directory.

cd /scratch/prj/hpc_training/example-project/

Exercise

Spend some time exploring the project files. Don't worry too much about understanding the details of the code; focus on the overall project organisation.

Can you identify raw data, processed data and results files? Can you identify how the data was generated/obtained?
Which scripts were used to produce the processed data and results files?
Are there any changes you would make to the organisation of the data when you start working on the project?

Useful commands

pwd - print your current working directory

cd project_dir/ - move into a directory called project_dir

cd .. - move one directory level up from your current working directory

ls - list files and directories in your current working directory

ls project_dir/ - list files and directories in a directory called project_dir

less - view the contents of a file one screen at a time (press q to quit)

cat - print the contents of a file to the terminal
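When a directory holds many files, it can help to summarise them rather than read a long listing. The following is a minimal sketch using standard `find`, `sed`, `sort` and `uniq`; it works on a throwaway directory with made-up file names, not on Alex's actual project.

```shell
# Recreate a tiny, made-up project so the commands can be tried safely.
demo=$(mktemp -d)
touch "$demo/image 1.tif" "$demo/image 2.tif" "$demo/image 1.npy" \
      "$demo/plot.png" "$demo/README"

# List every file, with paths shown relative to the project directory.
(cd "$demo" && find . -type f | sort)

# Count files per extension: keep only names that contain a dot,
# strip everything up to the last dot, then count identical extensions.
ext_counts=$(cd "$demo" && find . -type f -name '*.*' | sed 's/.*\.//' | sort | uniq -c)
echo "$ext_counts"
```

Running the same two pipelines from inside the real project directory gives a quick overview of which file types are present and how many of each there are.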

Here are a few examples of potential issues with the project - you may identify others as well!

Project organisation issues

The raw data, processed data and results are all stored in the same directory. This might make it tricky to identify which files are which.
The file naming scheme is not consistent, which could make it hard to work with the files using code.
The file names contain spaces, which can be tricky to handle when working on the command line. Command line tools typically interpret spaces as separators, so file names containing spaces must be quoted or have each space escaped.
There's no explanation of how the raw data was generated or obtained. Is this public data or was it generated in the research group? What type of images are these?
There are multiple versions of some scripts and it's not clear which version was used to produce the results.
There's no list of software dependencies for the project - what packages and what package versions were used?
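To illustrate the point about spaces in file names, here is a small sketch of what goes wrong without quoting and two ways to fix it. The file name `control image 1.txt` is hypothetical and is created in a throwaway directory.

```shell
# Create a made-up file whose name contains spaces.
demo=$(mktemp -d)
printf 'pixel data\n' > "$demo/control image 1.txt"

# Unquoted, the shell splits the name into three separate arguments
# (".../control", "image" and "1.txt"), so cat cannot find any of them.
if cat "$demo"/control image 1.txt 2>/dev/null; then
    unquoted="ok"
else
    unquoted="failed"
fi
echo "unquoted attempt: $unquoted"

# Quoting the whole name keeps it as a single argument.
quoted=$(cat "$demo/control image 1.txt")

# Escaping each space with a backslash works too.
escaped=$(cat "$demo"/control\ image\ 1.txt)

# In loops, quote the variable so names with spaces survive intact.
for f in "$demo"/*.txt; do
    echo "found: $f"
done
```

Tab completion in the shell inserts the backslashes for you, which is often the easiest way to type such names interactively.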

Here are some general tips for file naming and organisation:

Avoid using spaces and special characters in file names, as these can cause problems. Use "-" or "_" as separators in file names.
Pick a consistent naming scheme to use for files of the same dataset. An ideal naming scheme should be easy for both humans and computers to read. For example, use consistent capitalisation, as most computational systems treat "Control" and "control" as different words.
Use YYYY-MM-DD date format. This is computer-readable and allows files to easily be sorted in date order.
Use zero-padding for numbers to allow for sorting in numerical order, e.g. "file-001", "file-099", "file-100" instead of "file-1", "file-10", etc.
Organise your data hierarchically rather than keeping everything in one folder. This helps identify related datasets and distinguish raw data from processed data or results files.
Keep metadata - data about your data - and data documentation in the same place as you keep your data. This helps track how data was obtained or generated.
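Several of these tips can be applied in one pass with a short loop. The sketch below assumes hypothetical file names of the form `<group> <number>.tif` (these are not the names in Alex's project) and renames them to lower-case, hyphen-separated, zero-padded names. It runs in a throwaway directory.

```shell
# Made-up files with inconsistent capitalisation, spaces and unpadded numbers.
demo=$(mktemp -d)
touch "$demo/Control 1.tif" "$demo/control 10.tif" "$demo/Treated 2.tif"

for old in "$demo"/*.tif; do
    name=$(basename "$old" .tif)      # e.g. "Control 1"
    # Text before the last space, converted to lower case.
    group=$(printf '%s' "${name% *}" | tr '[:upper:]' '[:lower:]')
    # Text after the last space, i.e. the number.
    number=${name##* }
    # Build the new name with a three-digit, zero-padded number.
    new=$(printf '%s-%03d.tif' "$group" "$number")
    # -n refuses to overwrite an existing file with the same name.
    mv -n "$old" "$demo/$new"
done

ls "$demo"
```

Before running a loop like this on real data, it is worth replacing `mv -n` with `echo mv -n` to preview the renames. Numbers that already have leading zeros, such as 08, would need the zeros stripped first, because `printf` reads them as octal.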

Additional resources on how to organise and document your data are available from the Library. The Library also provides resources for writing Data Management Plans, which are a structured way to document things like how data will be stored, who is responsible for it, and how it will be preserved at the end of a project. In addition, a data management plan may document any ethical or legal considerations related to storing or sharing data. Most funding organisations require data management plans for research they fund.

Reorganising the project

Before starting to work on the project, it would be a good idea to make your own copy of Alex's work and do some of the project reorganisation we talked about above.

First, let's copy the project to our own scratch space and change directory to our copy:

cd ..
cp -r example-project/ /scratch/users/k1234567/
cd /scratch/users/k1234567/
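Before reorganising anything, you may want to confirm that the copy is complete. One possible check, sketched here with stand-in directories rather than the real paths, is `diff -r`, which compares two directory trees and prints nothing when they match.

```shell
# Stand-in directories: src plays the shared project area,
# dest plays your personal scratch space. The file name is made up.
demo=$(mktemp -d)
mkdir -p "$demo/src/example-project/data" "$demo/dest"
printf 'raw\n' > "$demo/src/example-project/data/image 1.tif"

# Copy the project directory, as in the step above.
cp -r "$demo/src/example-project" "$demo/dest/"

# diff -r exits with status 0 only if both trees have identical contents.
if diff -r "$demo/src/example-project" "$demo/dest/example-project"; then
    copy_status="identical"
else
    copy_status="different"
fi
echo "copy is $copy_status"
```

On CREATE the equivalent check would be `diff -r /scratch/prj/hpc_training/example-project /scratch/users/k1234567/example-project`.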

Let's separate the raw data, processed data, and results files:

cd example-project/data/
mkdir raw processed results
mv *.tif raw/
mv *.npy processed/
mv *.png all_image_stats.csv results/
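After moving files it is worth checking that everything landed where you expected. The sketch below repeats the steps above on a throwaway directory (the image and plot names are made up; only `all_image_stats.csv` comes from the project), then lists the resulting layout and counts any files left behind at the top level.

```shell
# Made-up data directory containing one file of each type.
demo=$(mktemp -d)
mkdir "$demo/data"
touch "$demo/data/image 1.tif" "$demo/data/image 1.npy" \
      "$demo/data/histogram 1.png" "$demo/data/all_image_stats.csv"

# Same steps as above, run in a subshell so your own working directory is unchanged.
(
    cd "$demo/data" &&
    mkdir raw processed results &&
    mv *.tif raw/ &&
    mv *.npy processed/ &&
    mv *.png all_image_stats.csv results/
)

# Show where every file ended up, relative to the data directory.
layout=$(cd "$demo/data" && find . -type f | sort)
echo "$layout"

# Count files still sitting directly in data/ (ideally none).
leftover=$(find "$demo/data" -maxdepth 1 -type f | wc -l | tr -d ' ')
echo "files left at the top level: $leftover"
```

If the leftover count is not zero in the real project, those files have a type that the three `mv` commands did not cover and need a home of their own.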

Although it should be fairly clear from the directory names, we should also make sure to document how the project is organised. The README file is a good place to do this. We can also note down which script(s) are used for each step in the analysis.
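As a possible starting point, the sketch below writes a README skeleton with a here-document. The section headings are only a suggestion and the entries in angle brackets are placeholders to fill in yourself. It writes to a throwaway directory so that it cannot overwrite the project's existing README.

```shell
# Write a skeleton README into a throwaway directory.
demo=$(mktemp -d)
cat > "$demo/README" <<'EOF'
Project layout
--------------
data/raw/        original images (.tif)
data/processed/  arrays derived from the raw images (.npy)
data/results/    plots (.png) and all_image_stats.csv

Analysis steps
--------------
1. raw -> processed     : <script name>
2. processed -> results : <script name>

Data source
-----------
<where the raw images came from and who generated them>
EOF

# Print the result to check it.
cat "$demo/README"
```

To add a skeleton like this to an existing README, open the file in a text editor and paste the sections in, or use `>>` instead of `>` so that the text is appended rather than replacing what is already there.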

Exercise

Edit the README file and add notes about how the data is organised and which scripts are used for each step.

Then compare with your neighbour. Did you write the same things, or did you include different details?