Loop Optimisations
Loop optimisations are a group of compiler optimisations aimed at increasing execution speed and reducing the overheads associated with loops.Small changes in loop operations can drastically increase the performance of the code. This is because loop optimisation can change the pattern of memory accesses performed within the loop to run more efficiently.
Prerequisite knowledge
In order to get started with loop optimisation, you should first be familiar with how to install and run your program, including knowing how to update compilation flags for compiled languages. You should also be familiar with the programming language used in your program.
Loop unrolling
Loop unrolling is a technique that aims to reduce the overhead of loop control instructions and provide more instructions per loop to give the compiler more opportunities to optimise.
Listings 1 and 2 show an example of a loop unrolling operation.
do n=1,nmax a(n)=somefunc(n) + a(n-1) end do
do n=1,nmax,2 a(n)=somefunc(n) + a(n-1) a(n+1)=somefunc(n+1) + a(n) end do
The main benefits of loop unrolling are elimination of branches and reuse of data in registers. The potential limitation is that excessive loop unrolling can lead to increased code size.
Loop blocking
Loop blocking is a technique that aims to reduce cache misses by structuring loops based on cache size.
Listings 3 and 4 show an example of loop blocking.
do j=1,jmax do i=1,imax a(i,j) = a(i,j) + b(j,i) end do end do
do j=1,jblksize,jmax do i=1,iblksize,imax do jj=j,j+jblksize do ii=i,i+iblksize a(ii,jj) = a(ii,jj) + b(jj,ii) end do end do end do end do
The innermost loops operate on small blocks of data (matrices a
and b
) which both fit into cache. Some compilers (for example Cray) can restructure the code automatically to implement these local cache operations.
Loop interchange
Loop interchange is a technique that reorders the levels of nested loops to improve the efficiency of memory access patterns and allow the compiler to use vectorisation instruction optimisation.
Listings 5 and 6 show an example of loop interchange.
for (i=0; i<N; i++) { for(j=0; j<N; j++) { for (k=0; k<N; k++) { c[i][j] += a[i][k] * b[k][j]; } } }
for (i=0; i<N; i++) { for(k=0; k<N; k++) { for (j=0; j<N; j++) { c[i][j] += a[i][k] * b[k][j]; } } }
The loop interchange in the example given above changes the i-j-k order to i-k-j. This alters the pattern of memory accesses for matrices c
and b
. The i-k-j version increases reuse of data stored in cache for both matrices.
Loop fission
Loop fission breaks one larger loop into smaller loops to take advantage of data locality in sections of the loop, and use cache more optimally.
Listings 7 and 8 show a simple example of loop fission.
do n=1, nmax lots of vectorisable work some I/O end do
do n=1, nmax lots of vectorised work end do do n=1, nmax some I/O end do
Loop fission may be beneficial in some cases, for example:
- When too much work is performed in a given loop and the registers or instruction cache become used up.
- When only part of a given loop is vectorisable.
Loop fusion
Loop fusion combines two similar loops to reduce loop control overhead and provide more scope for compiler optimisation of the instructions in the loop.
Listings 9 and 10 show a simple example of loop fusion.
do n=1, nmax a(n)=somefunc(n) end do do n=1, namx b(n)=someotherfunc(n) end do
do n=1, nmax a(n)=somefunc(n) b(n)=someotherfunc(n) end do
The main idea of loop fusion is to merge two or more loops into a single loop. This type of optimisation is usually used in conjunction with other techniques such as optimising memory accesses and cache reuse. Whether it is beneficial depends on the work inside the loops.
Compiler options
Most of the loop optimisations are enabled with various optimisation levels. However, compilers provide separate switches to control some of the most important loop transformations. The Cray Fortran compiler attempts to unroll all loops by default, whereas the Cray C/C++ compiler (which is based on Clang) does not.
Table 1. Compiler options to control specific loop transformations
Flag | Cray Fortran | Cray C/C++ (Clang) | AOCC | Intel | GNU | PGI |
---|---|---|---|---|---|---|
Loop unrolling | -h unrolln
| -funroll-loops | -funroll-loops |
| -funroll-loops |
|
Loop interchange |
| |||||
Loop blocking |
|
|