Column |
---|
Note |
---|
This page is still a work in progress and support for Machine Learning workload has just started. Please check it frequently for updates. |
|
...
Column |
---|
|
Code Block |
---|
language | bash |
---|
theme | DJango |
---|
title | Terminal 4. Installing additional Python packages using virtual environments |
---|
| matilda@setonix:~> module load tensorflow/rocm5.6-tf2.12
matilda@setonix:~> mkdir -p $MYSOFTWARE/manual/pythonAdditionalpythonEnvironments/forContainerTensorflow
matilda@setonix:~> cd $MYSOFTWARE/manual/pythonAdditionalpythonEnvironments/forContainerTensorflowmatilda@setonixforContainerTensorflow
matilda@setonix:/software/projects/pawsey12345/matilda/manual/pythonAdditionalpythonEnvironments/forContainerTensorflow> bash
Singularity> python3 -m venv --system-site-packages myenv
Singularity> source myenv/bin/activate
(myenv) Singularity> python3 -m pip install xarray
Collecting xarray
Using cached xarray-2023.8.0-py3-none-any.whl (1.0 MB)
Requirement already satisfied: packaging>=21.3 in /usr/local/lib/python3.10/dist-packages (from xarray) (23.1)
Collecting pandas>=1.4
Using cached pandas-2.1.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.7 MB)
Requirement already satisfied: numpy>=1.21 in /usr/local/lib/python3.10/dist-packages (from xarray) (1.23.5)
Collecting pytz>=2020.1
Using cached pytz-2023.3.post1-py2.py3-none-any.whl (502 kB)
Collecting python-dateutil>=2.8.2
Using cached python_dateutil-2.8.2-py2.py3-none-any.whl (247 kB)
Collecting tzdata>=2022.1
Using cached tzdata-2023.3-py2.py3-none-any.whl (341 kB)
Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.2->pandas>=1.4->xarray) (1.16.0)
Installing collected packages: pytz, tzdata, python-dateutil, pandas, xarray
Successfully installed pandas-2.1.0 python-dateutil-2.8.2 pytz-2023.3.post1 tzdata-2023.3 xarray-2023.8.0
(myenv) Singularity> python3
Python 3.10.12 (main, Jun 11 2023, 05:26:28) [GCC 11.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import tensorflow
2023-09-07 14:59:00.339696: I tensorflow/core/platform/cpu_feature_guard.cc:182] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations.
To enable the following instructions: AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
>>> import xarray
>>> exit()
(myenv) Singularity> exit
matilda@setonix:/software/projects/pawsey12345/matilda/manual/pythonAdditionalpythonEnvironments/forContainerTensorflow> ls -l
drwxr-sr-x 5 matilda pawsey12345 4096 Apr 22 16:33 myenv |
|
...
Column |
---|
|
Code Block |
---|
language | bash |
---|
theme | DJango |
---|
title | Terminal 5. The environment can be used once again. |
---|
| $ module load tensorflow/rocm5.6-tf2.12
$ bash
Singularity> source $MYSOFTWARE/manual/pythonAdditionalpythonEnvironments/forContainerTensorflow/myenv/bin/activate
(myenv) Singularity> |
|
...
FROM quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0
To pull the image to your local desktop with Docker you can use:
$ docker pull quay.io/pawsey/tensorflow:2.12.1.570-rocm5.6.0
To know more about our recommendations of container builds with Docker and later translation into Singularity format for their use in Setonix please refer to the Containers Documentation.
...
Column |
---|
|
Code Block |
---|
language | bash |
---|
theme | Emacs |
---|
title | Listing 1. distribute_tf.sh : An example batch script to run a TensorFlow distributed training job. |
---|
| #!/bin/bash --login
#SBATCH --job-name=distribute_tf
#SBATCH --partition=gpu
#SBATCH --nodes=2 #2 nodes in this example
#SBATCH --exclusive #All resources of the node are exclusive to this job
# #8 GPUs per node (16 "allocation packs" in total for the job)
#SBATCH --time=00:05:00
#SBATCH --account=pawsey12345-gpu #IMPORTANT: use your own project and the -gpu suffix
#----
#Loading needed modules:
module load tensorflow/<version>
echo -e "\n\n#------------------------#"
module list
#----
#Printing the status of the given allocation
echo -e "\n\n#------------------------#"
echo "Printing from scontrol:"
scontrol show job ${SLURM_JOBID}
#----
#If additional python packages have been installed in user's own virtual environment
VENV_PATH=$MYSOFTWARE/manual/pythonAdditionalpythonEnvironments/forContainerTensorflow/myenv
#----
#Clear definition of the python script containing the tensorflow training case
PYTHON_SCRIPT=$MYSRATCH/matilda-machinelearning/models/01_horovod_mnist.py
#----
#TensorFlow settings if needed:
# The following two variables control the real number of threads in Tensorflow code:
export TF_NUM_INTEROP_THREADS=1 #Number of threads for independent operations
export TF_NUM_INTRAOP_THREADS=1 #Number of threads within individual operations
#----
#Execution
#Note: srun needs the explicit indication full parameters for use of resources in the job step.
# These are independent from the allocation parameters (which are not inherited by srun)
# Each task needs access to all the 8 available GPUs in the node where it's running.
# So, no optimal binding can be provided by the scheduler.
# Therefore, "--gpus-per-task" and "--gpu-bind" are not used.
# Optimal use of resources is now responsability of the code.
# "-c 8" is used to force allocation of 1 task per CPU chiplet. Then, the REAL number of threads
# for the code SHOULD be defined by the environment variables above.
echo -e "\n\n#------------------------#"
echo "Code execution:"
srun -N 2 -n 16 -c 8 --gres=gpu:8 bash -c "source $VENV_PATH/bin/activate && python3 $PYTHON_SCRIPT"
#----
#Printing information of finished job steps:
echo -e "\n\n#------------------------#"
echo "Printing information of finished jobs steps using sacct:"
sacct -j ${SLURM_JOBID} -o jobid%20,Start%20,elapsed%20
#----
#Done
echo -e "\n\n#------------------------#"
echo "Done"
|
|
...
...