Debugging GPU-Enabled Software
Debugging GPU-enabled code requires GPU-aware debuggers. This section describes the debuggers available on Setonix that can debug GPU-enabled code.
Prerequisite knowledge
To start using GPU-enabled debuggers, you should first be familiar with how to compile and install GPU-enabled code (using APIs such as HIP) and with the general idea of debugging.
Available Debuggers
The number of GPU-enabled debuggers for AMD GPUs is growing, though not all are available on Setonix. Several tools are provided by AMD and others by Cray, and debuggers such as ARM DDT plan to add support for AMD GPUs.
The debuggers currently available on Setonix are:
- ROCGDB - Tool developed by AMD and provided by the rocm/<VERSION> module.
Future debuggers
- ARM Forge - Tool developed by ARM and provided by the arm-forge/<VERSION> module, but AMD support is not yet available.
- CCDB - Tool developed by Cray and provided by the cray-ccdb/<VERSION> module; it should soon support AMD GPUs.
ROCGDB
The ROCGDB debugger is based on GDB, the GNU Project debugger. It can do four main things to help you catch bugs:
- Start your program, specifying anything that might affect its behaviour.
- Make your program stop on specified conditions.
- Examine what has happened, when your program has stopped.
- Change things in your program, so you can experiment with correcting the effects of one bug and go on to learn about another.
It is designed to debug codes written in C and C++. It runs from the command line and is not designed to work with MPI codes; instead, it focuses on HIP-enabled codes.
Step-by-step Example
Step 1. Get HIP-enabled source code
An example HIP code, saved as debugme.cpp, is:
#include <vector>
#include <hip/hip_runtime.h>

int main(int argc, char ** argv)
{
    int N = 100;
    int nbytes = N*sizeof(float);
    std::vector<float> p1(N);
    float *p1_dev = nullptr;
    hipMalloc(&p1_dev, N); // incorrect passing of size
    hipMemcpy(p1.data(), p1_dev, nbytes, hipMemcpyHostToDevice);
    return 0;
}
Step 2. Run ROCGDB with a GPU
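To debug this code, request a GPU node, load the required modules, compile the code and run the debugger. Adding the -g flag to the hipcc command includes debug symbols, which lets ROCGDB map instructions back to source lines.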
$ salloc -A <project>-gpu -p gpu-dev --nodes=1 --ntasks=1 --gpus-per-task=1 --cpus-per-task=8
$ module load rocm/<VERSION>
$ module load craype-accel-amd-gfx90a
$ hipcc -o debugme debugme.cpp
$ rocgdb ./debugme
GNU gdb (rocm-rel-5.0-72) 11.1
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./debugme...
(gdb) run
Starting program: /software/projects/pawsey0001/pelahi/gpu-tests/debugme
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
warning: File "/opt/cray/pe/gcc/12.1.0/snos/lib64/libstdc++.so.6.0.30-gdb.py" auto-loading has been declined by your `auto-load safe-path' set to "$debugdir:$datadir/auto-load".
To enable execution of this file add
    add-auto-load-safe-path /opt/cray/pe/gcc/12.1.0/snos/lib64/libstdc++.so.6.0.30-gdb.py
line to your configuration file "/home/pelahi/.config/gdb/gdbinit".
To completely disable this security protection add
    set auto-load safe-path /
line to your configuration file "/home/pelahi/.config/gdb/gdbinit".
For more information about this security protection see the
"Auto-loading safe path" section in the GDB manual.  E.g., run from the shell:
    info "(gdb)Auto-loading safe path"
[New Thread 0x15554aa57700 (LWP 11904)]
1
[Thread 0x15554aa57700 (LWP 11904) exited]
[Inferior 1 (process 11895) exited normally]
(gdb) quit
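With the program loading and running under ROCGDB, you can start debugging it with the usual GDB commands. For example, typing break main before run stops execution at the start of main, after which next, print and backtrace step through the code and inspect its state.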
CCDB
Related Pages
External
- For information on rocgdb, see https://docs.amd.com/bundle/ROCDubugger-User-Guide/page/index.html
- For information on GDB, see https://www.sourceware.org/gdb/