Tips for ingesting large data sets

General strategy

Ideally, the data to be ingested can be partitioned logically (eg by folder or name pattern) so that the ingest can be performed in stages.

This avoids very long-running processes, which are more likely to be interrupted (eg by outages).

If you're unfamiliar with the clients, it is best to first move a small subset of the data and verify that everything worked as expected.

Then, based on the total size of the data and the transfer rates you're getting, you can partition the rest into reasonably sized chunks.
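
As a rough illustration of the sizing step above, the following Python sketch estimates how long a transfer would take at a measured rate and groups directories into chunks under a size limit. The function names, the greedy grouping and the example figures are all assumptions made for illustration; none of this is part of pshell or any Pawsey tooling.

```python
def estimated_hours(total_bytes, bytes_per_second):
    """Hours needed to move total_bytes at a sustained bytes_per_second."""
    return total_bytes / bytes_per_second / 3600.0


def plan_chunks(sizes, max_chunk_bytes):
    """Group (name, size_in_bytes) pairs into chunks, in the order given.

    A new chunk is started whenever adding the next item would push the
    current chunk over max_chunk_bytes. An item that is larger than the
    limit on its own still gets a chunk to itself.
    """
    chunks = []
    current = []
    current_size = 0
    for name, size in sizes:
        if current and current_size + size > max_chunk_bytes:
            chunks.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    # Hypothetical directory sizes (bytes) and a hypothetical measured rate.
    GB = 1000 ** 3
    sizes = [("directory1", 400 * GB), ("directory2", 300 * GB),
             ("directory3", 900 * GB), ("directory4", 100 * GB)]
    rate = 50 * 1000 ** 2  # 50 MB/s, taken from the small test transfer

    for chunk in plan_chunks(sizes, max_chunk_bytes=1000 * GB):
        chunk_bytes = sum(size for name, size in sizes if name in chunk)
        print(chunk, "approx %.1f hours" % estimated_hours(chunk_bytes, rate))
```

The chunk limit is a judgement call: pick it so that each chunk finishes in a period you are comfortable leaving unattended.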

Suggested approach

Use the Pawsey Python client, pshell.

On the machine where you want to run the long-running ingestion, create files containing the commands needed to move the data in chunks or sets.

For example, a text file called set1 might contain the pshell commands:

lcd /local/source
cd /projects/My Project/destination
put directory1
put directory2

Then, assuming you've already logged in and/or created a delegate, run:

nohup pshell -i set1 > set1.log &
tail -f set1.log

If you're connected to this machine via SSH, you can log out (or get disconnected) without affecting the transfer, because nohup keeps it running. Log back in and rerun the tail command to resume monitoring.
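
If there are many chunks, the command files described above could be generated rather than written by hand. The sketch below is one way to do that in Python; it only emits the lcd, cd and put commands shown in the set1 example, and the function names and the set1, set2, ... naming scheme are assumptions for illustration.

```python
import os


def build_commands(local_source, remote_destination, directories):
    """Return the text of one pshell command file for a chunk of directories."""
    lines = ["lcd " + local_source, "cd " + remote_destination]
    lines += ["put " + directory for directory in directories]
    return "\n".join(lines) + "\n"


def write_command_files(chunks, local_source, remote_destination, out_dir="."):
    """Write one file per chunk (set1, set2, ...) and return the file paths."""
    paths = []
    for index, directories in enumerate(chunks, start=1):
        path = os.path.join(out_dir, "set%d" % index)
        with open(path, "w") as handle:
            handle.write(build_commands(local_source, remote_destination,
                                        directories))
        paths.append(path)
    return paths


if __name__ == "__main__":
    # Hypothetical chunking of four directories into two command files.
    chunks = [["directory1", "directory2"], ["directory3", "directory4"]]
    for path in write_command_files(chunks, "/local/source",
                                    "/projects/My Project/destination"):
        print("wrote", path)
```

Each generated file can then be run in turn with the same nohup command, substituting the file name (set2, set3 and so on) and a matching log file.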

Fallback approach

Use the vendor's (Arcitecta's) own client, aterm.jar, for the transfers.

This requires Java 1.7 or later.

Create a config file (eg aterm.cfg) containing:

host=data.pawsey.org.au
transport=https
port=443
domain=ivec

Then start up the client (which will prompt for your Pawsey username and password) using something like:

java -Dmf.cfg=aterm.cfg -jar aterm.jar nogui

You can then use the standard service calls to ingest data. This will most likely be some variant of:

aterm> import -namespace "/projects/my project" /mnt/local/data_directory_1

Verification

Strictly speaking, the most robust approach is to compare the crc32 checksum of every local file with the checksum of the corresponding ingested file.

This should be relatively easy for an initial small test dataset, but much more time-consuming for very large collections.

Some alternatives:

Compare a random sample of checksums

Compute the local checksum, eg:

crc32 /mnt/local/data_directory1/file1

Retrieve the remote checksum for the uploaded file, eg:

aterm> asset.query :where "namespace='/projects/my project' and name='file1'" :action get-values :xpath content/csum

Most aterm commands (such as the one above) can be run in pshell as well.
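
If the crc32 command is not installed, or you want to pick the sample automatically, the local side of this check could be sketched in Python as below. It uses zlib.crc32 from the standard library; the function names, the sample size and the example path are assumptions for illustration. How Mediaflux formats the content/csum value (letter case, leading zeros) is not confirmed here, so allow for that when comparing against the remote values.

```python
import os
import random
import zlib


def file_crc32(path, block_size=1024 * 1024):
    """Return the CRC32 of a file as 8 lowercase hex digits.

    The file is read in blocks so large files do not need to fit in memory.
    """
    crc = 0
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            crc = zlib.crc32(block, crc)
    return "%08x" % (crc & 0xFFFFFFFF)


def sample_files(top, count, seed=None):
    """Return up to count randomly chosen file paths from under top."""
    paths = []
    for root, _dirs, files in os.walk(top):
        paths.extend(os.path.join(root, name) for name in files)
    paths.sort()  # stable order, so a fixed seed gives a repeatable sample
    return random.Random(seed).sample(paths, min(count, len(paths)))


if __name__ == "__main__":
    # Hypothetical local directory and sample size.
    for path in sample_files("/mnt/local/data_directory1", 20):
        print(file_crc32(path), path)
```

Each printed checksum can then be compared with the result of the asset.query for the file of the same name.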

Compare total sum of file sizes

Compute the total number of bytes in the local files, eg:

ls -lR /mnt/local/data_directory1 | grep -v '^d' | awk '{total += $5} END {print "Total:", total}'

Don't use du for this unless it can report apparent sizes (eg GNU du with --apparent-size, or -b for apparent size in bytes), as by default it reports the disk blocks consumed rather than the number of bytes in each file.

Compute the total number of bytes ingested into Mediaflux, eg:

aterm> asset.query :where "namespace>='/projects/my project'" :action sum :xpath content/size
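
The local total can also be computed without parsing ls output, which avoids surprises from unusual file names or ls formats. The Python sketch below is one way to do it; the function name and example path are assumptions for illustration. It skips symbolic links, whereas the ls pipeline above counts the size of each link itself, so the two figures can differ slightly if the data contains links.

```python
import os


def total_file_bytes(top):
    """Return (total_bytes, file_count) for the files under top.

    Sizes come from os.path.getsize, ie the length of each file in bytes,
    not the disk blocks it occupies. Symbolic links are skipped.
    """
    total_bytes = 0
    file_count = 0
    for root, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            if os.path.islink(path):
                continue
            total_bytes += os.path.getsize(path)
            file_count += 1
    return total_bytes, file_count


if __name__ == "__main__":
    # Hypothetical local directory.
    size, count = total_file_bytes("/mnt/local/data_directory1")
    print("Total:", size, "bytes in", count, "files")
```

The file count is a useful second figure: if the byte totals disagree, comparing the number of files on each side helps narrow down whether files are missing or truncated.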