Intel® FPGA SDK for OpenCL™ Standard Edition: Best Practices Guide

ID 683176
Date 9/24/2018

2.8.7. Load-Store Units

The Intel® FPGA SDK for OpenCL™ Offline Compiler generates a number of different types of load-store units (LSUs). For some types of LSU, the compiler might modify the LSU behavior and properties depending on the memory access pattern and other memory attributes.

While you cannot explicitly choose the load-store unit type or modifier, you can affect the type of LSU the compiler instantiates by changing the memory access pattern in your code, the types of memory that you use, and whether the memory accesses are to local or global memory.

Load-Store Unit Types

The compiler can generate several different types of load-store unit (LSU) based on the inferred memory access pattern, the types of memory available on the target platform, and whether the memory accesses are to local or global memory. It can generate the following types of LSU:

Burst-Coalesced Load-Store Units

A burst-coalesced LSU is the default LSU type instantiated by the compiler. It buffers requests until the largest possible burst can be made. The burst-coalesced LSU can provide efficient access to global memory, but it requires a considerable amount of FPGA resources.

kernel void burst_coalesced (global int * restrict in, 
                             global int * restrict out) {
  int i = get_global_id(0);
  int value = in[i/2];      // Burst-coalesced LSU
  out[i] = value;
}
Depending on the memory access pattern and other attributes, the compiler might modify a burst-coalesced LSU by adding a cache, a write-acknowledge signal, or support for nonaligned accesses, as described in Load-Store Unit Modifiers.

Prefetching Load-Store Units

A prefetching LSU instantiates a FIFO (sometimes called a named pipe) that burst-reads large blocks from memory to keep the FIFO full of valid data, based on the previous addresses and assuming contiguous reads. Non-contiguous reads are supported, but they incur a penalty to flush and refill the FIFO.

kernel void prefetching (global int * restrict in, 
                         global int * restrict out, 
                         int N) {
  int res = 1;
  for (int i = 0; i < N; i++) {
    int v = in[i];             // Prefetching LSU
    res ^= v;
  }
  out[0] = res;
}

Streaming Load-Store Units

A streaming LSU instantiates a FIFO that burst-reads large blocks from memory to keep the FIFO full of valid data. This block of data can be used only if the memory accesses are in order and the addresses can be calculated as a simple offset from the base address.

kernel void streaming (global int * restrict in, 
                       global int * restrict out) {
  int i = get_global_id(0);
  int value = in[i];          // Streaming LSU
  out[i] = value;             // Streaming LSU
}

Semi-Streaming Load-Store Units

A semi-streaming LSU instantiates a read-only cache. The cache has an area overhead, but it improves performance when you make repeated accesses to the same location in global memory. You must ensure that your data is not overwritten by a store within the kernel, as that would break the coherency of the cache. The LSU cache is flushed each time the associated kernel is started.

#define N 16
kernel void semi_streaming (global int * restrict in, 
                            global int * restrict out) {
  #pragma unroll 1
  for (int i = 0; i < N; i++) {
    int value = in[i]; // Semi-streaming LSU
    out[i] = value;
  }
}

Local-Pipelined Load-Store Units

A local-pipelined LSU is a pipelined LSU that is used for accessing local memory. Requests are submitted as soon as they are received. Memory accesses are pipelined, so multiple requests can be in flight at a time. If there is no arbitration between the LSU and the local memory, a local-pipelined never-stall LSU is created.

__attribute((reqd_work_group_size(1024,1,1)))
kernel void local_pipelined (global int* restrict in, 
                             global int* restrict out) {
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  int res = in[gi];
  for (int i = 0; i < 4; i++) {
    lmem[li - i] = res;                 // Local-pipelined LSU
    res >>= 1;
  }

  barrier(CLK_GLOBAL_MEM_FENCE);

  res = 0;
  for (int i = 0; i < 4; i++) {
    res ^= lmem[li - i];                // Local-pipelined LSU
  }

  out[gi] = res;
}
The compiler might modify a local-pipelined LSU into a never-stall LSU, as described in Load-Store Unit Modifiers.

Global Infrequent Load-Store Units

A global infrequent LSU is a pipelined LSU that is used for global memory accesses that can be proven to be infrequent. The global infrequent LSU is instantiated only for memory operations that are not contained in a loop and that are active for only a single work-item in an NDRange kernel.

The compiler implements a global infrequent LSU as a pipelined LSU because a pipelined LSU is smaller than other LSU types. While a pipelined LSU might have lower throughput, this tradeoff is acceptable because the memory accesses are infrequent.

kernel void global_infrequent (global int * restrict in, 
                               global int * restrict out, 
                               int N) {
  int a = 0;
  if (get_global_id(0) == 0)
    a = in[0];                   // Global Infrequent LSU
  for (int i = 0; i < N; i++) {
    out[i] = in[i] + a;
  }
}

Constant-Pipelined Load-Store Units

A constant-pipelined LSU is a pipelined LSU that is used mainly to read from the constant cache. The constant-pipelined LSU consumes less area than a burst-coalesced LSU. The throughput of a constant-pipelined LSU depends greatly on whether the reads hit in the constant cache. Cache misses are expensive.

kernel void constant_pipelined (constant int *src, 
                                global int *dst) {
  int i = get_global_id(0);
  dst[i] = src[i];              // Constant pipelined LSU
}
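
For example, in a sketch like the following (the kernel and argument names are illustrative and are not taken from this guide), every work-item rereads the same small coefficient table through the constant cache, so once the table is resident the constant-pipelined loads hit the cache on almost every access:

#define TAPS 8
// Illustrative sketch: a small coefficient table read through the constant cache.
// Assumes the host pads 'in' with at least TAPS extra elements.
kernel void constant_hits (constant int *coef, 
                           global int * restrict in, 
                           global int * restrict out) {
  int i = get_global_id(0);
  int acc = 0;
  #pragma unroll
  for (int t = 0; t < TAPS; t++) {
    acc += coef[t] * in[i + t];  // coef[t]: constant-pipelined LSU, mostly cache hits
  }
  out[i] = acc;
}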

For information about the constant cache, see Constant Cache Memory.

Atomic-Pipelined Load-Store Units

An atomic-pipelined LSU is used for all atomic operations. Using atomic operations can significantly reduce kernel performance.

kernel void atomic_pipelined (global int* restrict out) {
  atomic_add(&out[0], 1);  // Atomic LSU
}

Load-Store Unit Modifiers

Depending on the memory access pattern in your kernel, the compiler modifies some LSUs.

Cached

Burst-coalesced LSUs might sometimes include a cache. A cache is created when the memory access pattern is data-dependent or appears to be repetitive. The cache cannot be shared with other loads, even if those loads request the same data. The cache is flushed on kernel start and consumes more hardware resources than an equivalent LSU without a cache. You can disable the cache by simplifying the access pattern or by marking the pointer as volatile.

kernel void cached (global int * restrict in, 
                    global int * restrict out) {
  int i = get_global_id(0);
  int idx = out[i]; 
  int cached_value = in[idx];  // Burst-coalesced cached LSU
  out[i] = cached_value;
}
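
If the cache does not help (for example, because the data-dependent indices rarely repeat), you can apply the volatile workaround mentioned above. The following minimal sketch assumes the same access pattern as the previous example; the kernel name not_cached is illustrative:

kernel void not_cached (global volatile int * restrict in, 
                        global int * restrict out) {
  int i = get_global_id(0);
  int idx = out[i];
  int value = in[idx];  // Burst-coalesced LSU without the cache (pointer is volatile)
  out[i] = value;
}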

Write-Acknowledge (write-ack)

Burst-coalesced store LSUs sometimes require a write-acknowledgment signal when data dependencies exist. LSUs with a write-acknowledge signal require additional hardware resources. Throughput might be reduced if multiple write-acknowledge LSUs access the same memory.

kernel void write_ack (global int * restrict in, 
                       global int * restrict out, 
                       int N) {
  for (int i = 0; i < N; i++) {
    if (i < 2)
      out[i] = 0;            // Burst-coalesced write-ack LSU
    out[i] = in[i];
  }
}

Nonaligned

When a burst-coalesced LSU can access memory that is not aligned to the external memory word size, a nonaligned LSU is created. Additional hardware resources are required to implement a nonaligned LSU. The throughput of a nonaligned LSU might be reduced if it receives many unaligned requests.

In the following example, some of the 96-bit-wide memory accesses span two memory words, which requires two full lines of data to be read from memory.

kernel void non_aligned (global int * restrict in, 
                         global int * restrict out) {
  int i = get_global_id(0);
  
  // three loads are statically coalesced into one, creating a Burst-coalesced non-aligned LSU
  int a1 = in[3*i+0];
  int a2 = in[3*i+1];
  int a3 = in[3*i+2];
  
  // three stores statically coalesced into one
  out[3*i+0] = a3;
  out[3*i+1] = a2;
  out[3*i+2] = a1;
}

Never-stall

If a local-pipelined LSU is connected to a local memory without arbitration, a never-stall LSU is created because all accesses to the memory take a fixed number of cycles that are known to the compiler.

__attribute((reqd_work_group_size(1024,1,1)))
kernel void never_stall (global int* restrict in, 
                         global int* restrict out, 
                         int N) {
  local int lmem[1024];
  int gi = get_global_id(0);
  int li = get_local_id(0);

  lmem[li] = in[gi];                   // Local-pipelined never-stall LSU
  barrier(CLK_GLOBAL_MEM_FENCE);
  out[gi] = lmem[li] ^ lmem[li + 1];
}