Versions Compared

Key

  • This line was added.
  • This line was removed.
  • Formatting was changed.

Insert excerpt
TensorFlow
TensorFlow
nameWarningAfterJune2024
nopaneltrue

In this tutorial, you are going to see how to write a Horovod-powered distributed TensorFlow computation. More specifically, the final goal is to train different models in parallel by assigning each of them to a different GPU. The discussion is organised in two sections. The first section illustrates Horovod's basic concepts and its usage coupled with TensorFlow, the second one uses the MNIST classification task as test case.

...

Column
width900px


Code Block
languagepy
themeEmacs
titleListing 1. 01_horovod.mnist. Initialisation.py : Initialisation (first part of the script)
linenumberstrue
import tensorflow as tf
import horovod.tensorflow as hvd
import logging
 
# Show our log messages
logging.basicConfig(level=logging.INFO)

# ...but disable tesorflow's ones except for errors
logging.getLogger("tensorflow").setLevel(logging.ERROR)

# initialize horovod - this call must always be done at the beginning of our scripts.
hvd.init()


...

Column
width900px


Code Block
languagepy
themeEmacs
titleListing 2. 01_horovod_mnist.py : Assigning a different GPU to each process .(second part of the script)
linenumberstrue
# retrieve and print the process's global and local ranks
rank = hvd.rank()
local_rank = hvd.local_rank()
size = hvd.size()
local_size = hvd.local_size()
logging.info(f"This is process with rank {rank} and local rank {local_rank}")

# each process retrieves the list of gpus available on its node
gpus = tf.config.experimental.list_physical_devices('GPU')
if local_rank == 0:
    logging.info(f"This is process with rank {rank} and local rank {local_rank}: gpus available are: {gpus}")
 
# each process selects a gpu (if any gpu is available)
if local_rank >= len(gpus):
    raise Exception("Not enough gpus.")
tf.config.experimental.set_visible_devices(gpus[local_rank], 'GPU')
 
# From now on each process has its own gpu to use...



The first two lines are convenient function calls to retrieve the rank and local rank of the process, which are then logged for demonstration purposes. Next, each process retrieves the list of GPUs that are available on the node it is running on. Of course, processes on the same node will retrieve the same list, whereas any two processes running on different nodes will have different, non overlapping, sets of GPUs. In the latter case, resource contention is structurally impossible; it is in the former case that the local rank concept comes handy. Each process uses its local rank as index to select a GPU in the gpus list and will not share it with any other processes because:

...

The last function call sets the GPU Tensorflow will use for each process. Try the script using distributed_tf.sh.

To test what we have written so far, use the batch job script runTensorflow.sh provided in the previous page as a template for submitting the job. You will need to adapt the batch job script and remove the exclusive option to change the number of GPUs per node to 2 in the request of resources together with changes in the srun command, and use of the python script (01_horovod_mnist.py) containing the two parts described above. The adapted lines of the batch job script should look like:

#SBATCH --nodes=2      #2 nodes in this example
#SBATCH --gres=gpu:2   #2 GPUS per node
.
.
PYTHON_SCRIPT=$PYTHON_SCRIPT_DIR/01_horovod_mnist.py
.
.
srun -N 2 -n 4 -c 8 --gres=gpu:2 python3 $PYTHON_SCRIPT

(Note that the resource request for GPU nodes is different from the usual Slurm allocation requests and also the parameters to be given to the srun command. Please refer to the page Example Slurm Batch Scripts for Setonix on GPU Compute Nodes for a detailed explanation of resource allocation on GPU nodes.)
You should see an output similar to the following one.:

Column
width900px


Code Block
languagebash
themeEmacs
titleListing 3. Example job output.
linenumberstrue
INFO:root:This is process with rank 1 and local rank 1
INFO:root:This is process with rank 0 and local rank 0
INFO:root:This is process with rank 23 and local rank 01
INFO:root:This is process with rank 32 and local rank 10
INFO:root:This is process with rank 20 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
INFO:root:This is process with rank 02 and local rank 0: gpus available are: [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]


...

Column
width900px


Code Block
languagepy
themeEmacs
titleListing 4. 01_horovod_mnist.py : MNIST classification example (third part of the script)
linenumberstrue
# From now on each process has its own gpu to use.

# We will now train the same model on each gpu indipendently, and make each of them
# output a prediction for a different input.
mnist = tf.keras.datasets.mnist
 
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0
 
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Dense(10)
])
 
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer='adam', loss=loss_fn, metrics=['accuracy'])
 
# We will partition the training set evenly among processes so that the same model
# is trained by each process on different data.
dataset_size = len(x_train)
from math import ceil
# samples per model - number of samples to train each model with
spm = ceil(dataset_size / size)
 
model.fit(x_train[rank*spm:(rank+1)*spm], y_train[rank*spm:(rank+1)*spm], epochs=15)
print(model.evaluate(x_test,  y_test, verbose=2))


...

External Resources