Tips for ingesting large data sets
General strategy
Hopefully, you have a way of logically partitioning the data to be ingested (eg by folder or name pattern) so that the process can be performed in stages.
This avoids very long-running processes that may be interrupted (eg by outages).
If you're unfamiliar with the clients, it is also best to first move a small subset of the data and verify everything worked as it should.
Then, based on the total size of the data and the transfer rates you're getting, you can partition the remainder into reasonably sized chunks.
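As a rough sizing sketch, you can list the apparent size of each top-level directory (the directory layout here follows the examples later on this page) and divide by your measured transfer rate:
du -s --apparent-size --block-size=1 /mnt/local/data_directory*/
For example, a 500 GB chunk at a sustained 50 MB/s works out to roughly 10,000 seconds, ie a bit under three hours per chunk.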
Suggested approach
Use the Pawsey Python client, pshell.
On the machine where you want to run your long-running ingestion, create the appropriate commands to move the data in chunks or sets.
For example, a text file called set1 might contain the pshell commands:
lcd /local/source
cd /projects/My Project/destination
put directory1
put directory2
Then, assuming you've already logged in and/or created a delegate, run:
nohup pshell -i set1 > set1.log &
tail -f set1.log
If you're SSH'd into this machine, you can log out (or get disconnected) and it won't matter, as you can log back in and resume the tail command.
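If you've prepared several command files, one way to run them back to back is a small wrapper loop; this is only a sketch, and set2 and set3 here are hypothetical additional command files alongside set1:
nohup bash -c 'for s in set1 set2 set3; do pshell -i "$s" > "$s.log" 2>&1; done' &
Each set then gets its own log, which you can tail as before.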
Fallback approach
Use the vendor's (Arcitecta) own client, aterm.jar, for the transfers.
Requires Java >=1.7
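You can confirm which Java is on your path with:
java -version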
Create a config file, eg aterm.cfg, containing:
host=data.pawsey.org.au
transport=https
port=443
domain=ivec
Then start up the client (which will prompt for your Pawsey username and password) using something like:
java -Dmf.cfg=aterm.cfg -jar aterm.jar nogui
You can then use the standard service calls to ingest data. This will most likely be some variant of:
aterm> import -namespace "/projects/my project" /mnt/local/data_directory_1
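As a quick post-import sanity check, you can count the assets now sitting under the destination with something like the following (this uses the same asset.query service as the verification examples below; namespace>= matches the namespace and everything beneath it):
aterm> asset.query :where "namespace>='/projects/my project'" :action count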
Verification
Strictly speaking, the most robust approach is to compare the crc32 checksum of every local file with the corresponding checksum of its ingested copy.
This should be relatively easy to do for an initial small test dataset, but much more time consuming for very large collections.
Some alternatives:
Compare a random sample of checksums
Compute the local checksum, eg:
crc32 /mnt/local/data_directory1/file1
Retrieve the remote checksum for the uploaded file, eg:
aterm> asset.query :where "namespace='/projects/my project' and name='file1'" :action get-values :xpath content/csum
Most aterm commands (such as the one above) can also be run in pshell.
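A minimal sketch for picking the random sample on the local side, assuming the shuf and crc32 utilities are installed and using the directory from the example above:
find /mnt/local/data_directory1 -type f | shuf -n 20 | while read -r f; do echo "$f $(crc32 "$f")"; done
Each sampled file can then be checked against the value returned by the corresponding asset.query call.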
Compare total sum of file sizes
Compute the total number of bytes in the local files, eg:
ls -lR /mnt/local/data_directory1 | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'
Don't use du for this (unless you have something like --apparent-size available), as it will typically report blocks consumed rather than the actual byte count.
Compute the total number of bytes ingested into Mediaflux, eg:
aterm> asset.query :where "namespace>='/projects/my project'" :action sum :xpath content/size