Loop optimisations are a group of compiler optimisations aimed at increasing execution speed and reducing the overheads associated with loops.Small changes in loop operations can drastically increase the performance of the code. This is because loop optimisation can change the pattern of memory accesses performed within the loop to run more efficiently.

Prerequisite knowledge

In order to get started with loop optimisation, you should first be familiar with how to install and run your program, including knowing how to update compilation flags for compiled languages. You should also be familiar with the programming language used in your program.

Loop unrolling

Loop unrolling is a technique that aims to reduce the overhead of loop control instructions and provide more instructions per loop to give the compiler more opportunities to optimise.

Listings 1 and 2 show an example of a loop unrolling operation.

Listing 1. Initial loop

do n=1,nmax
  a(n)=somefunc(n) + a(n-1)
end do

Listing 2. Loop unrolled 2 times

do n=1,nmax,2
  a(n)=somefunc(n) + a(n-1)
  a(n+1)=somefunc(n+1) + a(n)
end do

The main benefits of loop unrolling are elimination of branches and reuse of data in registers. The potential limitation is that excessive loop unrolling can lead to increased code size.

Loop blocking

Loop blocking is a technique that aims to reduce cache misses by structuring loops based on cache size.

Listings 3 and 4 show an example of loop blocking.

Listing 3. Initial Loop

do j=1,jmax
  do i=1,imax
    a(i,j) = a(i,j) + b(j,i)
  end do
end do

Listing 4. Loop Blocking

do j=1,jblksize,jmax
	do i=1,iblksize,imax
 		do jj=j,j+jblksize
  			do ii=i,i+iblksize
    			a(ii,jj) = a(ii,jj) + b(jj,ii)
  			end do
		end do
 	end do
end do

The innermost loops operate on small blocks of data (matrices a and b) which both fit into cache. Some compilers (for example Cray) can restructure the code automatically to implement these local cache operations.

Loop interchange

Loop interchange is a technique that reorders the levels of nested loops to improve the efficiency of memory access patterns and allow the compiler to use vectorisation instruction optimisation.

Listings 5 and 6 show an example of loop interchange.

Listing 5. Initial loop

for (i=0; i<N; i++) {
  for(j=0; j<N; j++) {
    for (k=0; k<N; k++) {
      c[i][j] += a[i][k] * b[k][j];
    }
  }
}

Listing 6. Loop interchange

for (i=0; i<N; i++) {
  for(k=0; k<N; k++) {
    for (j=0; j<N; j++) {
      c[i][j] += a[i][k] * b[k][j];
    }
  }
}

The loop interchange in the example given above changes the i-j-k order to i-k-j. This alters the pattern of memory accesses for matrices c and b. The i-k-j version increases reuse of data stored in cache for both matrices.

Loop fission

Loop fission breaks one larger loop into smaller loops to take advantage of data locality in sections of the loop, and use cache more optimally.

Listings 7 and 8 show a simple example of loop fission.

Listing 7. Initial loop

do n=1, nmax
  lots of vectorisable work
  some I/O
end do

Listing 8. Loop fission

do n=1, nmax
  lots of vectorised work
end do
do n=1, nmax
  some I/O
end do

Loop fission may be beneficial in some cases, for example:

When too much work is performed in a given loop and the registers or instruction cache become used up.
When only part of a given loop is vectorisable.

Loop fusion

Loop fusion combines two similar loops to reduce loop control overhead and provide more scope for compiler optimisation of the instructions in the loop.

Listings 9 and 10 show a simple example of loop fusion.

Listing 9. Initial loop

do n=1, nmax
  a(n)=somefunc(n)
end do
do n=1, namx
  b(n)=someotherfunc(n)
end do

Listing 10. Loop fusion

do n=1, nmax
  a(n)=somefunc(n)
  b(n)=someotherfunc(n)
end do

The main idea of loop fusion is to merge two or more loops into a single loop. This type of optimisation is usually used in conjunction with other techniques such as optimising memory accesses and cache reuse. Whether it is beneficial depends on the work inside the loops.

Compiler options

Most of the loop optimisations are enabled with various optimisation levels. However, compilers provide separate switches to control some of the most important loop transformations. The Cray Fortran compiler attempts to unroll all loops by default, whereas the Cray C/C++ compiler (which is based on Clang) does not.

Table 1. Compiler options to control specific loop transformations

Flag	Cray Fortran	Cray C/C++ (Clang)	AOCC	Intel	GNU	PGI
Loop unrolling	`-h unrolln`	`-funroll-loops`	`-funroll-loops`	`-unroll n`	`-funroll-loops`	`-Munroll`
Loop interchange					`-floop-interchange`
Loop blocking				`-qopt-block-factor=n`	`-floop-block`

User Support Documentation

Loop Optimisations