Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018
Public

4.3.4.5. No Stalls, Low Occupancy Percentage, and Low Bandwidth Efficiency

Loop-carried dependencies might create a bottleneck in your design that causes a load-store unit (LSU) or channel to show a low occupancy percentage and low bandwidth efficiency, even though it does not stall.
Remember: An ideal kernel pipeline condition has a stall percentage of 0%, an occupancy percentage of 100%, and a bandwidth that equals the board's available bandwidth.
Figure 71. Example OpenCL Kernel and Profiler Analysis

In this example, the store to dst[] executes once every 20 iterations of the FACTOR2 loop and once every four iterations of the FACTOR1 loop. Therefore, the FACTOR2 loop is the source of the bottleneck.
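The arithmetic behind this observation can be sketched in plain C. The kernel from Figure 71 is not reproduced here, so the loop structure below is an assumption: each work-item runs a FACTOR1 loop of 4 iterations and a FACTOR2 loop of 20 iterations before issuing a single store to dst[]. The names `EventCounts`, `count_events`, and `store_percent_of` are hypothetical helpers introduced only for this illustration, and the resulting percentages are a rough upper bound on how often the store can be active, not the value the profiler reports.

```c
#define FACTOR1 4   /* assumed trip count of the FACTOR1 loop */
#define FACTOR2 20  /* assumed trip count of the FACTOR2 loop */

/* Tallies of loop iterations and dst[] stores (hypothetical helper type). */
typedef struct {
    long factor1_iters;
    long factor2_iters;
    long dst_stores;
} EventCounts;

/* Walks the assumed per-work-item loop structure and counts events. */
EventCounts count_events(int work_items) {
    EventCounts c = {0, 0, 0};
    for (int wi = 0; wi < work_items; wi++) {
        for (int j = 0; j < FACTOR1; j++) {
            c.factor1_iters++;      /* loop-carried work would go here */
        }
        for (int k = 0; k < FACTOR2; k++) {
            c.factor2_iters++;      /* loop-carried work would go here */
        }
        c.dst_stores++;             /* stands in for: dst[wi] = result; */
    }
    return c;
}

/* Percentage of loop iterations on which a dst[] store is issued. */
double store_percent_of(long loop_iters, long stores) {
    if (loop_iters <= 0) {
        return 0.0;
    }
    return 100.0 * (double)stores / (double)loop_iters;
}
```

Under these assumptions, the store is issued on 25% of FACTOR1 iterations but only 5% of FACTOR2 iterations, which is why the longer FACTOR2 loop limits how busy the store unit can be.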

Solutions for resolving loop bottlenecks:

  • Unroll the FACTOR1 and FACTOR2 loops evenly. Simply unrolling the FACTOR1 loop further does not resolve the bottleneck.
  • Vectorize your kernel to allow multiple work-items to execute during each loop iteration.
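The effect of these two solutions can be estimated with a simplified model, sketched below in plain C. It is an assumption-laden approximation and not the offline compiler's scheduling algorithm: it assumes each loop issues one unrolled iteration per cycle, that the slower of the two loops sets the interval between dst[] stores, and that vectorizing by a given width lets that many work-items store per interval. It ignores memory bandwidth and resource limits. The function names `store_interval` and `stores_per_cycle` are hypothetical, and the trip counts 4 and 20 are taken from the example above.

```c
/* Ceiling division for positive integers. */
static int ceil_div(int a, int b) {
    return (a + b - 1) / b;
}

/*
 * Estimated cycles between consecutive dst[] stores.
 * Assumption: the loop with more remaining iterations after
 * unrolling dominates, so the estimate is the larger of the two.
 */
int store_interval(int factor1, int unroll1, int factor2, int unroll2) {
    int t1 = ceil_div(factor1, unroll1);
    int t2 = ceil_div(factor2, unroll2);
    return t1 > t2 ? t1 : t2;
}

/*
 * Estimated dst[] stores per cycle when simd_width work-items
 * pass through the pipeline together (assumed vectorization model).
 */
double stores_per_cycle(int interval, int simd_width) {
    return (double)simd_width / (double)interval;
}
```

In this model, `store_interval(4, 1, 20, 1)` and `store_interval(4, 4, 20, 1)` both give 20, which matches the point that unrolling only the FACTOR1 loop leaves the bottleneck in place. Unrolling both loops in proportion, for example `store_interval(4, 2, 20, 10)`, reduces the estimate to 2. This reading of "evenly" as "in proportion to each loop's trip count" is an interpretation, not something the guide states.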