Debugging GPU-Enabled Software

Debugging GPU-enabled code requires a debugger with GPU support. This section describes the debuggers available on Setonix that can debug GPU-enabled code.

Prerequisite knowledge

To start using GPU-enabled debuggers, you should first be familiar with how to compile and install GPU-enabled code (using APIs such as HIP) and with the general idea of debugging.

Available Debuggers

The number of GPU-enabled debuggers for AMD GPUs is growing, though not all are available on Setonix. Several tools are provided by AMD and others by Cray, and debuggers such as ARM DDT plan to add support for AMD GPUs.

The following debugger is currently available on Setonix:

  • ROCGDB - Tool developed by AMD and provided by the rocm/<VERSION> module.

Future debuggers

  • ARM Forge - Tool developed by ARM and provided by the arm-forge/<VERSION> module; AMD GPU support is not yet available.
  • CCDB - Tool developed by Cray and provided by the cray-ccdb/<VERSION> module; it should soon support AMD GPUs.

ROCGDB

The ROCGDB debugger is based on GDB, the GNU Project Debugger. It can do four main things to help you catch bugs:

  • Start your program, specifying anything that might affect its behaviour.
  • Make your program stop on specified conditions.
  • Examine what has happened, when your program has stopped.
  • Change things in your program, so you can experiment with correcting the effects of one bug and go on to learn about another.

It is designed to debug code written in C and C++. It is run from the command line and is not designed to work with MPI codes; instead, it focuses on HIP-enabled code.

Step-by-step Example

Step 1. Get HIP-enabled source code

The following is an example of HIP code containing a deliberate bug:

Source code
#include <vector>
#include <hip/hip_runtime.h>
int main(int argc, char ** argv) {
    int N = 100;
    int nbytes = N*sizeof(float);
    std::vector<float> p1(N);
    float *p1_dev = nullptr;
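    hipMalloc(&p1_dev, N); // deliberate bug: hipMalloc expects a size in bytes (nbytes), not the element count N
    // note: hipMemcpy takes (dst, src, size, kind); the argument order here does not match hipMemcpyHostToDevice
    hipMemcpy(p1.data(), p1_dev, nbytes, hipMemcpyHostToDevice);
    return 0;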
}
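
The bug flagged in the listing above comes from mixing up an element count (N) with a byte count (nbytes). The following is a minimal sketch of one way to guard against that mistake. It is written in plain C++ with no HIP calls so it compiles without ROCm. The helpers bytes_for, copy_fits and demo_sizes are hypothetical names introduced for illustration; they are not part of HIP or ROCGDB.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <stdexcept>

// Hypothetical helper (not part of HIP): converts an element count into a
// byte count, the unit that allocation and copy calls such as hipMalloc
// and hipMemcpy expect.
template <typename T>
std::size_t bytes_for(std::size_t count) {
    if (count > std::numeric_limits<std::size_t>::max() / sizeof(T)) {
        throw std::overflow_error("byte count overflows size_t");
    }
    return count * sizeof(T);
}

// Hypothetical helper: true when a copy of copy_bytes fits inside an
// allocation of alloc_bytes.
inline bool copy_fits(std::size_t alloc_bytes, std::size_t copy_bytes) {
    return copy_bytes <= alloc_bytes;
}

// Mirrors the sizes used in the listing above: 100 floats.
inline void demo_sizes() {
    const std::size_t N = 100;
    // The bug: allocating N bytes is too small for a copy of N floats.
    assert(!copy_fits(N, bytes_for<float>(N)));
    // The fix: allocate bytes_for<float>(N) bytes instead.
    assert(copy_fits(bytes_for<float>(N), bytes_for<float>(N)));
}
```

In the HIP example, the same idea amounts to passing nbytes (or an equivalent byte count) to hipMalloc instead of N.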

Step 2. Run ROCGDB with a GPU

To debug this code, load the module, compile the code and run the debugger.

Running ROCGDB
$ salloc -A <project>-gpu -p gpu-dev --nodes=1 --ntasks=1 --gpus-per-task=1 --cpus-per-task=8  
$ module load rocm/<VERSION>
$ module load craype-accel-amd-gfx90a
$ hipcc -o debugme debugme.cpp
$ rocgdb ./debugme
GNU gdb (rocm-rel-5.0-72) 11.1
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./debugme...
(gdb) run
Starting program: /software/projects/pawsey0001/pelahi/gpu-tests/debugme
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/opt/cray/pe/gcc/12.1.0/snos/lib64/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
	add-auto-load-safe-path /opt/cray/pe/gcc/12.1.0/snos/lib64/libstdc++.so.6.0.30-gdb.py
line to your configuration file "/home/pelahi/.config/gdb/gdbinit".
To completely disable this security protection add
	set auto-load safe-path /
line to your configuration file "/home/pelahi/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
	info "(gdb)Auto-loading safe path"
[New Thread 0x15554aa57700 (LWP 11904)]
1
[Thread 0x15554aa57700 (LWP 11904) exited]
[Inferior 1 (process 11895) exited normally]
(gdb) quit

With ROCGDB able to load and run the program as shown above, you can start debugging it, for example by setting breakpoints before issuing the run command.

CCDB

