Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
Public
Document Table of Contents

5.1.4. Transferring Loop-Carried Dependency to Local Memory

For a loop-carried dependency that you cannot remove, improve the II by moving the array with the loop-carried dependency from global memory to local memory.

Consider the following kernel example:

1 #define N 128
2
3 __kernel void unoptimized( __global int* restrict A )
4 {
5     for (unsigned i = 0; i < N; i++)
6           A[N-i] = A[i];
7 }	 
==================================================================================
Kernel: unoptimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file unoptimized4.cl line 5)
   Pipelined with successive iterations launched every 324 cycles due to:

       Memory dependency on Load Operation from: (file unoptimized4.cl line 6)
         Store Operation (file unoptimized4.cl line 6)
       Largest Critical Path Contributors:
           49%: Load Operation  (file unoptimized4.cl line 6)
           49%: Store Operation  (file unoptimized4.cl line 6)

Global memory accesses have long latencies. In this example, the loop-carried dependency on the array A[i] causes the long latency. This latency is reflected by an II of 324 in the optimization report. To reduce the II value by transferring the loop-carried dependency from global memory to local memory, perform the following tasks:

  1. Copy the array with the loop-carried dependency to local memory. In this example, array A[i] becomes array B[i] in local memory.
  2. Execute the loop with the loop-carried dependence on array B[i].
  3. Copy the array back to global memory.
When you transfer array A[i] to local memory and it becomes array B[i], the loop-carried dependency is now on B[i]. Because local memory has a much lower latency than global memory, the II value improves.

Below is the restructured kernel optimized:

 1 #define N 128
 2
 3 __kernel void optimized( __global int* restrict A )
 4 {
 5     int B[N];
 6
 7     for (unsigned i = 0; i < N; i++)
 8         B[i] = A[i];
 9
10     for (unsigned i = 0; i < N; i++)
11         B[N-i] = B[i];
12
13     for (unsigned i = 0; i < N; i++)
14         A[i] = B[i];
15 }

An optimization report similar to the one below indicates the successful reduction of II from 324 to 2:

==================================================================================
Kernel: optimized
==================================================================================
The kernel is compiled for single work-item execution.

Loop Report:

 + Loop "Block1" (file optimized4.cl line 7)
   Pipelined well. Successive iterations are launched every cycle.


 + Loop "Block2" (file optimized4.cl line 10)
   Pipelined with successive iterations launched every 2 cycles due to:

       Memory dependency on Load Operation from: (file optimized4.cl line 11)
         Store Operation (file optimized4.cl line 11)
       Largest Critical Path Contributors:
           65%: Load Operation  (file optimized4.cl line 11)
           34%: Store Operation  (file optimized4.cl line 11)


 + Loop "Block3" (file optimized4.cl line 13)
   Pipelined well. Successive iterations are launched every cycle.