Transferring large data sets

General strategy

It is usual to try to avoid very long-running processes that, if interrupted, would have to be restarted from the beginning. As a result, one or both of the following approaches may be used:

  • Logically partition the data into chunks (e.g. by folder or name pattern; see the sketch below)

  • Use a transfer process that can be easily restarted.

Also, it is best to start with a small subset of the data and verify everything worked as expected before moving on to longer-running transfers.
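
As an illustration of partitioning by name pattern, here is a minimal sketch (the paths and file names are placeholders) that lists the source files once and then splits the listing into fixed-size chunks, each of which can be transferred, verified and, if necessary, re-run independently:

# List every file under the source area, then split the listing into
# chunks of 1000 file names (chunk_aa, chunk_ab, ...).
find /local/data -type f > all_files.txt
split -l 1000 all_files.txt chunk_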

Suggested approach

This example uses the Pawsey Python client pshell for illustrative purposes. On the machine where you want to run your transfers, create command files that move the data in chunks or sets.

e.g. a text file called set1 might contain the pshell commands:

lcd /local/folder
cd /projects/myproject/folder
put directory1

(or use "get directory/file_pattern_1*" if downloading)

Then, assuming you've already created a delegate (https://pawsey.atlassian.net/wiki/x/QFgYAw), run:

nohup pshell -i set1 > set1.log &
tail -f set1.log

If you're SSH'd into this machine, you can log out (or get disconnected) without affecting the transfer; simply log back in and resume the tail command.
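
After reconnecting, a quick way to confirm the background transfer is still running and to re-attach to its log (generic shell commands, not pshell-specific):

pgrep -af pshell     # list any running pshell processes and their command lines
tail -f set1.log     # re-attach to the transfer log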

Verification

Strictly, a robust approach to verification is to compare the CRC32 checksums of all source files with the corresponding checksums of the copied files. This is relatively easy for small datasets, but much more time-consuming for large collections. A faster, but less robust, method is to compare the total sizes of the data.

Compare checksums

To compute the CRC32 checksum of local files:

crc32 /local/data/myfile
11be99ce
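
For more than a handful of files, a minimal sketch (the path and output file name are placeholders) that uses the same crc32 tool to record a checksum for every file under a local folder, one per line:

find /local/data -type f | while read -r f; do
    printf '%s  %s\n' "$(crc32 "$f")" "$f"   # checksum followed by the file path
done > local_checksums.txt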

To display the CRC32 of a file stored in Mediaflux:

pshell> info /projects/myproject/myfile
...
csum: 11BE99CE
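
Note that in the examples above the local crc32 tool prints lowercase hex while Mediaflux reports uppercase, so compare the values case-insensitively. A small sketch, assuming bash 4+ and reusing the example values:

local_sum=$(crc32 /local/data/myfile)   # e.g. 11be99ce
remote_sum="11BE99CE"                   # value reported by pshell info
if [ "${local_sum,,}" = "${remote_sum,,}" ]; then
    echo "checksums match"
else
    echo "checksum MISMATCH"
fi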

Compare total sum of file sizes

Compute the total number of bytes in a local folder, e.g.:

ls -lR /mnt/local/data_directory1 | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'

Don't use du for this (unless you have something like --apparent-size available), as it typically reports based on blocks consumed rather than actual bytes.
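
An alternative that sums only regular files (requires GNU find; the path is the same placeholder as above) and, for typical data folders, should match the ls total:

find /mnt/local/data_directory1 -type f -printf '%s\n' | awk '{total += $1} END {print "Total:", total}'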

Compute the total number of bytes under a Mediaflux folder:

pshell> asset.query :namespace "/projects/myproject" :action sum :xpath content/size
...
value="123456789"