Transferring large data sets
General strategy
It is best to avoid very long-running processes that, if interrupted, would have to be restarted from the beginning. As a result, one or both of the following approaches may be used:
Logically partition the data into chunks (eg by folder or name pattern)
Use a transfer process that can be easily restarted.
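One way to apply both points above is to generate a separate pshell command file per top-level folder, so each chunk can be transferred (and, if necessary, re-run) on its own. This is only a sketch: the folder names and the remote path /projects/myproject/folder are placeholders, not real project paths.

```shell
# Sketch: create one pshell command file per top-level folder.
# dirA/dirB stand in for the real data folders being partitioned.
workdir=$(mktemp -d)
mkdir "$workdir/dirA" "$workdir/dirB"
cd "$workdir"
i=0
for d in */; do
    i=$((i + 1))
    {
        echo "lcd $workdir"
        echo "cd /projects/myproject/folder"
        echo "put ${d%/}"
    } > "set$i"
done
cat set1
```

Each resulting file (set1, set2, ...) can then be fed to pshell independently, so an interruption only costs the chunk that was in flight.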
Also, it is best to start with a small subset of the data and verify everything worked as expected before moving on to longer-running transfers.
Suggested approach
This example uses the Pawsey Python client pshell for illustrative purposes. On the machine where you will run your transfers, create the appropriate commands to move the data in chunks or sets.
eg a text file called set1 might contain the pshell commands:
lcd /local/folder
cd /projects/myproject/folder
put directory1 (or "get directory/file_pattern_1*" if downloading)
Then, assuming you've already created a delegate (https://pawsey.atlassian.net/wiki/x/QFgYAw), run:
nohup pshell -i set1 > set1.log &
tail -f set1.log
If you're SSH'd into this machine, you can log out (or get disconnected) without affecting the transfer; simply log back in and resume the tail command.
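When there are several chunk files, they can be run in sequence with a stop on the first failure, so completed sets aren't repeated on a re-run. In this sketch, run_set is a hypothetical stand-in defined locally so the example is self-contained; in practice its body would simply be pshell -i "$1".

```shell
# Sketch: run chunk files one after another, stopping at the first failure.
# run_set is a stand-in; replace its body with: pshell -i "$1"
run_set() { cat "$1" > /dev/null; }
workdir=$(mktemp -d)
printf 'put directory1\n' > "$workdir/set1"
printf 'put directory2\n' > "$workdir/set2"
status=ok
for s in "$workdir"/set1 "$workdir"/set2; do
    if ! run_set "$s" > "$s.log" 2>&1; then
        status="failed at $s"
        break
    fi
done
echo "$status"
```

If a set fails, only that set (and those after it) need to be re-run, which keeps restarts cheap.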
Verification
Strictly, a robust approach to verification would be to compare the CRC32 checksum of every source file with the checksum of the corresponding copied file. This is relatively easy for small datasets, but much more time consuming for large collections. A faster, but less robust, method is to compare the total sizes of the data.
Compare checksums
To compute the CRC32 checksum of local files:
crc32 /local/data/myfile
11be99ce
To display the CRC32 of a file stored in Mediaflux:
pshell> info /projects/myproject/myfile
...
csum: 11BE99CE
Compare total sum of file sizes
Compute the total number of bytes in a local folder, eg:
ls -lR /mnt/local/data_directory1 | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'
Don't use du for this (unless an option like --apparent-size is available), as du typically reports based on blocks consumed rather than actual bytes.
Compute the total number of bytes under a Mediaflux folder:
pshell> asset.query :namespace "/projects/myproject" :action sum :xpath content/size
...
value="123456789"
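Putting the two totals together, the local byte count can be compared against the value reported by the Mediaflux sum query. In this sketch the remote total is typed in by hand from the pshell output; the sample files and the value 15 are fabricated purely so the example is self-contained.

```shell
# Sketch: compare a local byte total with the remote total from pshell.
workdir=$(mktemp -d)
cd "$workdir"
mkdir data
printf 'hello' > data/a        # 5 bytes of sample data
printf '1234567890' > data/b   # 10 bytes of sample data
# Same pipeline as above: skip directory lines, sum the size column.
local_total=$(ls -lR data | grep -v '^d' | awk '{total += $5} END {print total}')
remote_total=15                # value copied from the pshell query output
if [ "$local_total" -eq "$remote_total" ]; then
    echo "sizes match: $local_total bytes"
else
    echo "size mismatch: local=$local_total remote=$remote_total"
fi
```

Matching totals don't guarantee the files are identical, but a mismatch is a quick, cheap signal that something was missed and a per-file checksum comparison is warranted.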