Transferring Files into and out of Pawsey Filesystems
Pawsey filesystems are specialised for different uses, and therefore it is important that you transfer your files into the correct location.
Command-line clients for transfers into and out of Pawsey filesystems
Command line clients are a convenient way of moving data between computers.
The transfer of files into and out of Pawsey filesystems is performed over encrypted data streams. The Secure Copy Protocol (scp) and the Secure File Transfer Protocol (sftp), which are based on the Secure Shell protocol (SSH), are installed on our systems and can send data over a normal SSH connection. Most of the tools we recommend on this page make use of these protocols.
For Linux and macOS operating systems, the command-line clients listed here are available in a terminal or xterm.
For the Windows operating system, the command-line clients listed here are part of most of the common Linux environments that run on top of Windows, namely the "Windows Subsystem for Linux (WSL)", "Cygwin" and the terminal tool "MobaXterm" (see Alternative Programs to Log in from Windows).
Use ssh-keys with command-line clients
Pawsey strongly recommends the use of ssh-keys instead of the conventional and less secure method of typing your username and password. For a description of how to set up ssh-keys, see: Use of SSH Keys for Authentication.
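As a minimal sketch of how a key can be wired into the command-line clients, the snippet below writes an example OpenSSH client configuration entry to a local file for review. The alias pawsey-dm, the username yourUsername and the key path ~/.ssh/pawsey_key are placeholders chosen for illustration, not names defined by Pawsey; the hostname is the data-mover address given later on this page.

```shell
# Write an example OpenSSH client entry to a file for review.
# Alias, username and key path are hypothetical placeholders.
SNIPPET_FILE="${SNIPPET_FILE:-./pawsey_ssh_config.example}"

cat > "$SNIPPET_FILE" <<'EOF'
Host pawsey-dm
    HostName data-mover.pawsey.org.au
    User yourUsername
    IdentityFile ~/.ssh/pawsey_key
    IdentitiesOnly yes
EOF

echo "Review $SNIPPET_FILE, then append its contents to ~/.ssh/config"
```

With such an entry in ~/.ssh/config, clients such as scp, sftp and rsync can refer to the host as pawsey-dm and should pick up the key without extra options.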
GUI clients
GUI clients are an attractive option for file transfers because of their intuitive interface. Their advantage is that users do not need to remember the many options of the command-line tools, although they are still based on the command-line clients listed above. In practice, combining GUI and command-line clients within your workflows is often the most efficient approach.
Use ssh-keys with GUI transfer tools
Pawsey strongly recommends the use of ssh-keys instead of the conventional and less secure username/password method. In this section we describe how to set up some GUI tools using ssh-keys to authenticate access to Pawsey. For a description of how to set up ssh-keys, see: Use of SSH Keys for Authentication.
Avoiding the username/password method also means you do not need to save this sensitive data within the tool (which is never recommended). Furthermore, it protects you from the common problem of your account being blocked when the transfer tool repeatedly tries to connect to Pawsey with an old or wrong password.
Always pay attention to the source and destination directories
Most GUI clients will start in your /home directory when you first connect to a remote server, while some will start in your previously accessed directory. This is almost never where you need to put the data for the new session. In most cases you will need to browse to your own /scratch, /home or /software directories.
Transfers between Pawsey filesystems and Acacia
Transfers between Pawsey filesystems and Acacia object storage are performed by tools/clients compatible with the Amazon S3 protocol. On Pawsey clusters, we provide modules for the following S3-compatible clients:
- rclone:
module load rclone/<version>
- aws client:
module load awscli/<version>
An in-depth description of the different tools/clients for accessing and transferring data into and out of Acacia is available at /wiki/spaces/DATA/pages/54459526.
Best practices for your data and data transfers
Only keep at Pawsey the data that is still going to be needed for your project
Only keep on Acacia and Pawsey filesystems the files and objects that are going to be used during the rest of your project. Files, results and data that are finalised and will no longer be used at Pawsey supercomputing, cloud and visualisation facilities should be transferred to your own institution's storage and removed from our systems. The same applies when a project as a whole is finalised. None of the systems at Pawsey are designed for long-term storage.
Don't use login nodes to transfer large amounts of data
Transfers initiated by remote sites should connect to the data-mover nodes with the generic hostname data-mover.pawsey.org.au.
When large data transfers are initiated from within the Pawsey systems, they should use the "copyq" queue.
Prefer the use of TAR files to transfer your data
Transfer clients spend a lot of time building the list of files to be transferred and checking the correctness of each individual transfer. When the transfer involves a large number of files and directories (for example, the result files of multiple OpenFOAM runs), this "listing and checking" time adds up dramatically. Therefore, it is much more convenient to first use the tar command locally to pack the files that need to be transferred into a single TAR file (or a small number of TAR files). Then transfer the TAR files and "untar" them (tar -x) afterwards, once they are on the Pawsey filesystem.
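The pack, transfer and unpack sequence can be sketched as follows. The directory and file names are made up for the demonstration, and everything happens in a temporary directory so the commands can be tried safely. The -m flag on extraction is an addition in this sketch: it gives extracted files the current time instead of the archived modification time, which seems to fit the advice about time stamps on /scratch given elsewhere on this page.

```shell
# Pack a directory tree into one archive, then unpack it elsewhere.
# Directory and file names are hypothetical.
WORK="$(mktemp -d)"
mkdir -p "$WORK/source/case01" "$WORK/destination"
echo "sample result" > "$WORK/source/case01/output.dat"

# 1. Pack locally (-c create, -f archive name, -C change to that directory first).
tar -cf "$WORK/case01.tar" -C "$WORK/source" case01

# 2. Transfer the single archive with your preferred client (not shown here).

# 3. Unpack at the destination (-x extract, -m do not restore modification times).
tar -xmf "$WORK/case01.tar" -C "$WORK/destination"

ls "$WORK/destination/case01"
```

One archive means one listing and one integrity check on the wire, regardless of how many files it contains.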
Use tools that can resume transfers if something fails
Interactive use of SCP may be the most appropriate and practical way of transferring a small amount of data, but for larger data transfers, interactive use of SCP is discouraged. It is always preferred to perform large data transfers with tools that can resume transfers if the process/connection fails (for example Rsync or FileZilla).
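A resumable transfer with rsync might look like the sketch below. The function name resumable_copy is hypothetical, and the demonstration copies between two temporary local directories; for a real transfer the destination would take the form yourUsername@data-mover.pawsey.org.au:<destinationDirectory>. The option choice is an assumption of this sketch: -a and -t are deliberately left out so that source time stamps and group ownership are not copied, and --size-only is used so that a rerun skips files that already arrived complete.

```shell
# Resumable copy: rerunning the same command continues an interrupted transfer.
resumable_copy() {
    # $1 = source directory, $2 = destination (local path or user@host:path)
    # -r          recurse into directories
    # --partial   keep partially transferred files for the next attempt
    # --size-only skip files whose size already matches at the destination
    rsync -r --partial --size-only "$1/" "$2/"
}

if command -v rsync >/dev/null 2>&1; then
    SRC="$(mktemp -d)"
    DEST="$(mktemp -d)"
    echo "payload" > "$SRC/data.bin"
    resumable_copy "$SRC" "$DEST"   # first attempt
    resumable_copy "$SRC" "$DEST"   # rerun: completed files are skipped
    ls "$DEST"
else
    echo "rsync is not installed on this machine"
fi
```

Note that --size-only will not notice a file that changed without changing size; where that matters, rsync's --checksum option is a slower but stricter alternative.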
Use scripts to automate frequent transfers
For transfer processes that may occur routinely, the use of scripts with command-line tools for automating the processes is a great option.
To automate the connections needed within your script, you can use ssh-keys as indicated in the "Secure transfers using ssh-keys" section below.
WinSCP also allows scripting.
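A routine transfer could, for instance, be wrapped in a SLURM job script such as the sketch below, which writes the script to a file. The partition name copyq is taken from the text above; the account pawsey9999, the host remote.example.org, the file paths and the key file transfer_key are placeholders, and the resource requests are guesses to adapt to your case.

```shell
# Write a hypothetical SLURM job script for a scheduled transfer.
# Account, remote host, paths and key name are placeholders.
cat > transfer_job.sh <<'EOF'
#!/bin/bash -l
#SBATCH --job-name=data-transfer
#SBATCH --partition=copyq
#SBATCH --account=pawsey9999
#SBATCH --ntasks=1
#SBATCH --time=02:00:00

# Pull one packed archive from a third-party server into /scratch.
# --partial keeps a partly transferred file so a resubmitted job can continue.
# No -a or -t, so the original time stamp is not preserved.
rsync --partial \
    -e "ssh -i $HOME/.ssh/transfer_key" \
    user@remote.example.org:/data/run42.tar "$MYSCRATCH/"
EOF

echo "Submit with: sbatch transfer_job.sh"
```

Keeping the transfer in a job script makes it repeatable and keeps the load off the login nodes.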
Do not preserve the time stamp of your files when transferring them into /scratch
If files have not been accessed for more than 30 days on your own system and you then transfer them into /scratch using a "time stamp preserving" option, the recently copied files will be deleted inadvertently and almost immediately, resulting in data loss. Deletion happens because the 30-day purge policy identifies those files as old.
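The effect can be reproduced locally with a small sketch. It assumes that the relevant time stamp is the modification time, which this page does not state explicitly, and uses temporary, made-up paths. A copy made with a preserving option (cp -p here) keeps the old date and is picked up by a "older than 30 days" search, while a plain copy is not.

```shell
# Show how a time-stamp-preserving copy carries an old modification time along.
# All paths are temporary and hypothetical.
DEMO="$(mktemp -d)"
mkdir "$DEMO/preserved" "$DEMO/fresh"
echo "old data" > "$DEMO/old.dat"
touch -t 202001010000 "$DEMO/old.dat"      # pretend the file dates from 2020

cp -p "$DEMO/old.dat" "$DEMO/preserved/"   # -p keeps the 2020 time stamp
cp    "$DEMO/old.dat" "$DEMO/fresh/"       # plain copy gets the current time

# List copies whose modification time is more than 30 days old.
find "$DEMO/preserved" "$DEMO/fresh" -type f -mtime +30
```

Only the file under preserved/ is listed. Running a similar find on freshly uploaded data is a quick way to confirm that the transfer tool did not carry old time stamps across.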
Do not use spaces or special characters in the names
Handling filenames that contain spaces or special characters may cause problems. We strongly recommend avoiding them in files that are to be transferred into Pawsey.
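Before a transfer, problematic names can be spotted with a short check like the one sketched here. The function name find_awkward_names and the set of "safe" characters (letters, digits, dot, underscore and hyphen) are assumptions of this example rather than a Pawsey rule.

```shell
# List files and directories whose names contain anything other than
# letters, digits, dot, underscore or hyphen (an assumed "safe" set).
find_awkward_names() {
    # $1 = directory to scan
    find "$1" -mindepth 1 -name '*[!A-Za-z0-9._-]*'
}

# Demonstration in a temporary directory.
SCAN="$(mktemp -d)"
touch "$SCAN/run_01.dat" "$SCAN/my results (final).dat"
find_awkward_names "$SCAN"
```

In the demonstration only the second file is reported; renaming such files before packing or transferring them avoids surprises on the other side.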
Secure transfers using ssh-keys
Transferring data using a protocol based on SSH protects the information and ensures its integrity. However, setting up a proper configuration can be tricky and, if it is not done right, security risks arise. This is especially true when automating copy operations, for example through a SLURM job on data-mover nodes. In such a scenario, public key-based authentication is recommended because the SSH client, running on a Pawsey supercomputer node, only needs the private key to connect to a third-party system's SSH server, which in turn holds the corresponding public key used to perform a secure handshake. The private key, however, must not be protected by a passphrase; otherwise human input is required. There are several issues to address in this situation.
You must generate a key-pair specifically for this purpose, that is, data transfers to and from Pawsey systems (see Use of SSH Keys for Authentication). Let's call this key-pair COPYPAIR. Do not repurpose an existing key-pair used to log in to Pawsey or other systems (which, by the way, should use a passphrase). This isolates any unauthorised access resulting from a compromised key-pair.
The SSH server on the third-party system should be configured so that COPYPAIR's public key cannot authorise connections that do not originate from Pawsey's data-mover machines. This is a powerful capability that protects the third-party server from unauthorised use of COPYPAIR from outside the Pawsey network. To enable this feature, edit the COPYPAIR.pub public key file and prepend the following string:
from="data-mover*.pawsey.org.au",no-port-forwarding,no-pty
followed by a space and then the original key. The options must be separated by commas, with no spaces between them.
Here is an example:
from="data-mover*.pawsey.org.au",no-port-forwarding,no-pty ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDhGk1QdMVDVao1j9eclHPPhniU5x6rHYBhJp88DJZrEiDM3Kt70+gHvo/fCGaHmOMWQX0hjqLs5uin42VGUW7w3y0FrIBB/hZJro+JKXJzhUJFpTE/wR08CK8DI4c3GrxjrCqNRkd3ff4AOUIgS7VFGcmagg9aAj6iSas1ibvAMLMZuXkVyPcNcKhB+J38atc3u5/zuRqU9QgKGQvTQgLL7lx4CrsHGKd8bPzjdEVDaCoeD1KBdRq/S+am2wvaPwN5wqqgs6hVU83VvZggIBkGRLBbGEeMmnzu8dkG1osqE4S3RCmFVQ8MG9tiOiP0MN/jx/DpckP++NnuamJWcD/Z comment
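Rather than editing the key file by hand, the restricted entry can be produced with a small helper, sketched below. The function name restrict_pubkey and the output file name are hypothetical. OpenSSH expects the options in such an entry as a comma-separated list without spaces, followed by one space and the key, which is the form the helper emits.

```shell
# Prepend the access restrictions to a public key line.
restrict_pubkey() {
    # $1 = path to the public key file; prints the restricted entry
    printf 'from="data-mover*.pawsey.org.au",no-port-forwarding,no-pty %s\n' "$(cat "$1")"
}

# Example use, after generating a passphrase-less key-pair, e.g. with
#   ssh-keygen -t ed25519 -N "" -f COPYPAIR
# (the key type is only a suggestion):
#   restrict_pubkey COPYPAIR.pub > COPYPAIR_restricted.pub
```

The resulting line is what would be added to the authorized_keys file of the account on the third-party server.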
Common problems with transferred files
Files uploaded to /scratch have been unexpectedly purged
This may have happened if the original time stamp of the transferred files was preserved after copying them into /scratch (for example, with the -t option for Rsync or the default transfer settings for WinSCP). To avoid this problem, do not use a time-preserving option in your transfer tool.
Problem with ownership of files on /scratch
If the "group" property has been overridden
By default, your personal directory in /scratch (easily accessible through the environment variable $MYSCRATCH) is set to belong to "<yourUsername>:<yourProject>" in the "owner:group" properties of the directory. This combination of ownership properties has been set through setgid. Also, by default, all files and subdirectories created under your personal directory in /scratch should inherit the same ownership properties, that is, they should also belong to "<yourUsername>:<yourProject>".
Unfortunately, some file transfer programs override our setgid defaults for ownership properties. The typical unfortunate change is to set "<yourUsername>:<yourUsername>" instead of "<yourUsername>:<yourProject>" for the "owner:group" properties (see terminal 1). This makes the files subject to the quota restrictions set on /home, so they cannot make use of the extended quotas available on /scratch or /astro. You may receive errors during an upload if this is the case.
To resolve this problem, first the "group" part of the ownership properties needs to be fixed. Second, the configuration of the transfer tool needs to be fixed to prevent the problem from happening again.
Consider the following example, where the directory badDirectory ended up with the wrong settings after being transferred. The ls -la command shows the ownership properties of the directory as mickey-mickey instead of mickey-pawsey9999, which is the right "owner-group" combination of ownership properties. The rest of the directories listed have the correct properties.
$ ls -la
total 20
drwxrws---+  5 mickey pawsey9999 4096 Jul 30 16:00 .
drwxr-s---+ 17 mickey pawsey9999 4096 Jul 25 09:25 ..
drwxrws---+  2 mickey mickey     4096 Jul 30 16:00 badDirectory
drwxrws---+  2 mickey pawsey9999 4096 Jul 30 15:52 borra
drwxrws---+  4 mickey pawsey9999 4096 Jul 11 15:53 GS-8819-ONETEP
To resolve the problem, execute the chgrp (change group) command with the "-R" option to apply the change recursively to all the content (files and subdirectories) inside the affected directory. The general syntax is:
$ chgrp -R <yourProject> <yourAffectedDirectory>
The following fixes the group association of the directory named badDirectory and, recursively, of all the files in it:
$ chgrp -R pawsey9999 badDirectory
$ ls -la
total 20
drwxrws---+  5 mickey pawsey9999 4096 Jul 30 16:00 .
drwxr-s---+ 17 mickey pawsey9999 4096 Jul 25 09:25 ..
drwxrws---+  2 mickey pawsey9999 4096 Jul 30 16:00 badDirectory
drwxrws---+  2 mickey pawsey9999 4096 Jul 30 15:52 borra
drwxrws---+  4 mickey pawsey9999 4096 Jul 11 15:53 GS-8819-ONETEP
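To find out which entries are affected before (or after) running chgrp, a search by group can help. In the sketch below the function name list_wrong_group is hypothetical, and the demonstration runs on a temporary directory with the current user's primary group standing in for the project group.

```shell
# List everything under a directory whose group differs from the expected one.
list_wrong_group() {
    # $1 = directory to scan, $2 = expected group (your project)
    find "$1" ! -group "$2"
}

# On Pawsey this might be:  list_wrong_group "$MYSCRATCH" pawsey9999
# Demonstration on a temporary directory:
CHECK="$(mktemp -d)"
touch "$CHECK/result.dat"
chgrp -R "$(id -gn)" "$CHECK"            # same fix as described above
list_wrong_group "$CHECK" "$(id -gn)"    # prints nothing once every group matches
```

An empty result means every file and directory under the scanned path carries the expected group.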
If the problem has spread extensively through most of your /group directory, you can fix it by using fix.group.permission.sh, which is provided by the module pawseytools. For more information about this tool, see under "File Permissions and Quota" on the Pawsey Filesystems and their Use page.
Finally, to prevent this problem from happening again, configure your file transfer program to honour the setgid (set-group identification) default so that newly created files and directories belong to your project in the "group" property. This is explained for several documented tools in the following subsections.
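Whether a transfer left the setgid bit in place can be checked after the fact. The sketch below lists directories that lack the bit and then restores it; the function name list_dirs_without_setgid is hypothetical, and the demonstration uses a temporary directory rather than a real /scratch path.

```shell
# List directories that do not have the setgid bit set.
list_dirs_without_setgid() {
    # $1 = directory to scan
    find "$1" -type d ! -perm -2000
}

# Demonstration: "sub" is created before its parent gets the bit,
# so it does not inherit it.
TOP="$(mktemp -d)"
mkdir "$TOP/sub"
chmod g+s "$TOP"
list_dirs_without_setgid "$TOP"          # reports the "sub" directory

# Restore the bit on every directory that lacks it, then check again.
find "$TOP" -type d ! -perm -2000 -exec chmod g+s {} +
list_dirs_without_setgid "$TOP"          # prints nothing now
```

With the bit present on every directory, newly created files and subdirectories should again inherit the project group.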
Your institution firewall may block connections
SSH is enabled on Pawsey systems for both incoming and outgoing traffic. This, however, may not be true for firewalls on the client side. Most university, business and home internet connections only permit outgoing connections and have incoming SSH disabled in their firewall. This means that SCP should always be invoked on the client, that is, on your laptop or desktop, to copy data to or from the Pawsey supercomputers.
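In practice this means that uploads and downloads are both started from your own machine, as in the following sketch. The username, file names and the /scratch/path/to/dir location are placeholders, and the helper names run, upload and download are hypothetical. With DRY_RUN=1 (the default here) the commands are only printed, so the example can be tried without a connection.

```shell
# Both directions are started on the client (laptop/desktop).
# Username and paths are placeholders; DRY_RUN=1 only prints the commands.
DRY_RUN="${DRY_RUN:-1}"
REMOTE="yourUsername@data-mover.pawsey.org.au"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

upload()   { run scp "$1" "$REMOTE:$2"; }   # local file  -> Pawsey
download() { run scp "$REMOTE:$1" "$2"; }   # Pawsey file -> local

upload results.tar /scratch/path/to/dir/
download /scratch/path/to/dir/results.tar .
```

Setting DRY_RUN=0 would execute the scp commands for real. In both cases the connection is outgoing from the client, which is what most firewalls allow.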