
Abstract

The goal of this guide is to help Ripple users know what to do when an error occurs. The “Generic errors” section goes through known issues encountered while learning Ripple, on any supported target architecture, and their solutions. The “Hexagon-specific errors” section enumerates known issues found when targeting Hexagon, along with their solutions.


Copyright (c) 2024-2026 Qualcomm Innovation Center, Inc. All rights reserved. SPDX-License-Identifier: BSD-3-Clause-Clear

License

Clear 3-clause BSD License

Copyright (c) 2025 Qualcomm Technologies, Inc. All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted (subject to the limitations in the disclaimer below) provided that the following conditions are met:

  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

  • Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY’S PATENT RIGHTS ARE GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

Generic errors

Ripple introduces a slightly different SPMD representation, in which the shape of an operation is propagated from its operands. This differs from the traditional SPMD abstraction used for GPUs, in which an “ambient” block shape is associated with a function each time the function is called.

This section goes through errors users have encountered. For each error, we explain the issue and present a solution.

Undefined symbols reported by the linker typically happen when users fail to point clang to a library that contains the symbol definitions. In Ripple, there are two more potential sources of missing symbols, explained below.

Missing ripple_* symbols

Issue: clang complains about any of the following symbols being undefined:

  • ripple_id
  • ripple_set_block_shape
  • ripple_get_block_size
  • ripple_parallel
  • ripple_parallel_full

Likely cause 1: Ripple is not activated through command-line options, so clang does not recognize the symbols above.

Fix to likely cause 1: Use clang with the -fenable-ripple flag. For example:

$ clang -fenable-ripple -O2 [other options] my_prog.cpp

Function calls not being vectorized

When a call to a scalar function has a shape, Ripple tries to find a vector equivalent of the scalar function that can be used with that shape. To support this, Ripple ships with a “default” vector library for architecture-specific functions. Users can also define their own vector libraries for Ripple (as explained in the Ripple Documentation). One way these libraries are made available is by compiling them to the LLVM “bitcode” format (.bc extension). For instance, you or someone else may have made a my_lib.bc Ripple vector library available in a folder such as /usr/lib/ripple/my_lib.bc. In that case, you must tell clang where to look for Ripple libraries, using the -fripple-lib flag, as follows:

$ clang -fripple-lib=/usr/lib/ripple

Without this flag, Ripple will not detect the vector version to use, and will instead create sequential calls to the scalar function. If the scalar function is not defined in the input code or in a library, the linker will fail with an undefined symbol error.

Non-deterministic SIMD writes

Effect: Compilation error.

To avoid unintended concurrent write hazards, Ripple doesn’t allow users to write SIMD code in which several block elements explicitly write to the same memory location, as in the following example:

  void bad_write(float A[8], float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    A[0] = B[v]; // error! 8 different elements are written to A[0]
  }

Solution: Explicitly choose a value to be written out, as in the following good_write example functions.

  void good_write(float A[8], float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    A[0] = ripple_slice(B[v], 3); // We are writing B[3] to A[0]
  }

Or in this overly simple case,

  void good_write_simple(float A[8], float B[8]) {
    A[0] = B[3]; // We are writing B[3] to A[0]
  }

Interplay with the automatic broadcast rule

When the shape of the right-hand side (the “read”) of an assignment is compatible with but smaller than that of its left-hand side (the “write”), the automatic broadcast rule applies: the read is first broadcast to match the shape of the write, and there is no issue. On SIMD processing elements, a problem only arises when the write has a lower dimension than the read. The broadcast rule is illustrated in the following example:

void auto_bcast_example(float a, float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    B[v] = a; // a gets auto-broadcast to [8] before the write to B[v]
}
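As a point of reference, the write above behaves like the following plain-C loop. This is a hand-written sketch of the semantics, not Ripple output; the function name is illustrative.

```c
#include <stddef.h>

/* Scalar sketch of auto_bcast_example's semantics: the scalar 'a' is
 * replicated across all 8 lanes before the elementwise write. */
void auto_bcast_scalar(float a, float B[8]) {
    for (size_t v = 0; v < 8; ++v) {
        B[v] = a; /* each lane receives the broadcast value */
    }
}
```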

Interplay with scalar expansion

Ripple includes a convenient mechanism to propagate shapes through scalar temporary values (cf. the Ripple Manual’s Implicit Scalar Expansion section). For instance, in the following function, the scalar variable tmp gets automatically expanded to an 8-element vector.

void auto_expand_example(float A[8], float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    float tmp = 2 * B[v]; // the [8] shape of B[v] is propagated through tmp
    A[v] = tmp;
}

This automatic expansion mechanism is only valid for temporary scalars. Pointer-based, array, and data-structure writes are typically not subject to automatic expansion. In these cases, it is the programmer’s responsibility to ensure that the shape of a SIMD write has enough dimensions to accept the incoming value.

The following example illustrates a pointer-based write that can’t be expanded:

void no_auto_expand(float * a, float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    // *a has a [1] "scalar" shape,
    // but it is not a temporary scalar in 'no_auto_expand()'
    *a = B[v]; // error
}

Control shape vs value shape

Effect: Correctness (unexpected results)

In Ripple, shape is propagated through values, not through control. In other words, the shape of a statement that lies in a control block (for instance the “then” branch of an “if-then-else” statement) is not influenced by the shape of the condition. This is illustrated in the following cond_shape() example, in which a scalar computation is performed under the control of a vector conditional. cond_shape sets all the elements of A to 1 if any of B[0..7] is positive.

void cond_shape(int A[8], float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    int x = 0;
    if (B[v] > 0) { // conditional of shape [8]
        x = 1; // scalar computation -> shape not influenced by the conditional
    }
    A[v] = x; // x (0 or 1) is broadcast here to A[0..7]
}

Problem: Intuitively, we may think that A[v] contains the result of checking whether B[v] > 0.
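To see why it does not, here is a plain-C sketch of what cond_shape actually computes, with the vector shapes written out as loops. This is an illustration of the semantics, not compiler output, and the function name is made up for the example.

```c
#include <stddef.h>

/* Scalar sketch of cond_shape's semantics: x is a scalar, so the
 * [8]-shaped conditional degenerates to "did any lane take the branch". */
void cond_shape_scalar(int A[8], const float B[8]) {
    int x = 0;                      /* one scalar value shared by all lanes */
    for (size_t v = 0; v < 8; ++v)
        if (B[v] > 0)
            x = 1;                  /* sticky: any positive lane sets x */
    for (size_t v = 0; v < 8; ++v)
        A[v] = x;                   /* x is broadcast to every element */
}
```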

Solution 1: Avoid creating statements with smaller-dimensional shapes controlled by conditionals with higher-dimensional shapes, as these can be counter-intuitive.

In the cond_shape example, if we want A[v] to contain the result of whether B[v] is positive, we need to give x an explicit [8] shape, as follows:

void cond_shape_fixed(int A[8], float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    int x = ripple_broadcast(blk, 0b1, 0); // 0 --> [0 0 0 0 0 0 0 0]
    if (B[v] > 0) { // [8]-shaped conditional
        x = 1; // [8]-shaped computation controlled by an [8]-shaped conditional
    }
    A[v] = x; // the shape of x is already [8] -> elementwise write to A[v]
}

ripple_broadcast is often useful to explicitly adjust the shape of computations that would otherwise have too small a shape. Some extra ripple_broadcasts in the code are the price we pay for the ability to mix scalar, vector and tensor computations in the same function.

Solution 2: If what you want is the semantics of cond_shape, explicitly express that x should be an OR reduction of the values of B[v] > 0, as in the cond_shape_explicit code below:

void cond_shape_explicit(int A[8], float B[8]) {
    ripple_block_t blk = ripple_set_block_shape(VEC, 8);
    size_t v = ripple_id(blk, 0);
    int x = ripple_reduceor(0b1, B[v] > 0);
    A[v] = x; // x is broadcast here to A[0..7]
}

Using block sizes that don’t match hardware vector sizes

Effect: compiler error or incorrect code.

Not all compiler backends are designed to gracefully support the lowering of target-independent code that doesn’t exactly match full vector computations. Assume for instance that our target SIMD machine is 512 bits wide, and that its backend was production-tested only with full vectors. The following code, fit_my_loop, could cause a lowering issue for the target’s LLVM lowering backend.

void fit_my_loop(char in[67], char out[67]) {
  // Here we match the block size to the data, as opposed to the SIMD hardware
  ripple_block_t blk = ripple_set_block_shape(VEC, 67);
  size_t v = ripple_id(blk, 0);
  out[v] = -in[v];
}

Problem: compiler internal error or incorrect results when using block sizes that do not match the targeted SIMD hardware’s vector size. In the example above, the hardware target’s vector size is 64 bytes. However, the user requests the computation to be performed on a block (i.e. a vector) of 67 elements. The targeted compiler backend may not be good at lowering that to 64-byte vector code.

Solution: Add the -mllvm -ripple-pad-to-target-simd option to your clang compilation command line. This activates a Ripple behavior that produces explicit full-vector computations, so even a target that can only lower full-vector code will work with Ripple. An added bonus is that the performance behavior of the code becomes more predictable as the Ripple block size increases.
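For intuition, padding to the target SIMD width amounts to rounding the block size up to the next multiple of the hardware vector length. The helper below is an illustrative sketch of that arithmetic, not part of the Ripple API.

```c
#include <stddef.h>

/* Round a block size up to a multiple of the hardware vector length.
 * For the example above: pad_to_vector(67, 64) yields 128, i.e. two
 * full 64-element vectors. Names here are illustrative only. */
size_t pad_to_vector(size_t block_size, size_t vector_lanes) {
    return (block_size + vector_lanes - 1) / vector_lanes * vector_lanes;
}
```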

Incorrect ripple_to_vec and vec_to_ripple handling in C++ object members

We have noticed issues when using aligned vectors with C++ object member variables. The issue seems to come from the underlying clang/LLVM compiler, but it can affect the use of ripple_to_vec and vec_to_ripple when applied to C++ member variables.

Problem: Some conversions of aligned C++ object members are incorrect.

Solution: Don’t apply the ripple_to_vec and vec_to_ripple conversions to aligned C++ member variables.

Hexagon-specific errors

Stack overflow

Effect: a runtime segfault or error due to stack allocation past the stack limit.

Several aspects can contribute to a stack overflow.

Problem: Complex vector-loop epilogue mask computation

Loops annotated with ripple_parallel are split into full-vector loops plus an epilogue, which computes the last, possibly partial, vector iteration. When the loop’s upper bound is unknown at compile time, the epilogue has to compute a mask defining which vector lanes should be active. Computing a complex enough mask can lead to predicate register spills, which can increase the function’s required stack size significantly, in the following cases:

  • the operations need more predicate registers than are available in hardware. Spilling a predicate register requires converting it to a regular vector register, which increases the stack space required for the spill.
  • an operation that isn’t supported on predicate registers is performed. This requires a conversion to regular HVX vectors, which increases register pressure, leading to more spilling, i.e., more stack.
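To make the epilogue mask concrete, here is a plain-C sketch of the lane-enable computation for the final partial iteration. It is illustrative only: on Hexagon the mask lives in predicate/vector registers, not in a scalar bitmask.

```c
#include <stddef.h>
#include <stdint.h>

/* Lane-enable bitmask for the final (partial) vector iteration of a
 * loop over 'n' elements with 'lanes' lanes per vector (lanes <= 64).
 * Bit i is set iff lane i should be active in the epilogue. */
uint64_t epilogue_mask(size_t n, unsigned lanes) {
    size_t rem = n % lanes;          /* elements left after the full vectors */
    if (rem == 0) return 0;          /* trip count divides evenly: no epilogue */
    return ((uint64_t)1 << rem) - 1; /* enable lanes 0 .. rem-1 */
}
```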

Solution: ripple_parallel vectorizes the epilogue, which causes these risky mask computations. Two options are available:

  • if the loop body is an elementwise computation, simply replace ripple_parallel with ripple_parallel_peel, which executes the epilogue sequentially, avoiding the mask computation.
  • if your computation is not elementwise, separate out the full-tile code, annotate it with ripple_parallel_full to indicate that there is no epilogue, and write your own sequential version of the epilogue.
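The full-tile-plus-sequential-epilogue option has the following overall structure, shown here in plain C for the fit_my_loop computation from earlier. The Ripple annotation itself is omitted (its syntax is documented in the Ripple Manual); a comment marks where it would apply, and the function name is made up for the example.

```c
#include <stddef.h>

#define VLEN 64  /* illustrative hardware vector length, in elements */

/* Manually split loop: a full-tile loop (which the user would annotate
 * with ripple_parallel_full, since its trip count is a multiple of
 * VLEN) followed by a hand-written sequential epilogue. */
void negate_split(const char *in, char *out, size_t n) {
    size_t full = n - n % VLEN;      /* elements covered by full vectors */
    /* full-tile loop: no partial iteration, hence no mask computation */
    for (size_t i = 0; i < full; ++i)
        out[i] = -in[i];
    /* sequential epilogue: the remaining n % VLEN elements */
    for (size_t i = full; i < n; ++i)
        out[i] = -in[i];
}
```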

Problem: Heavy inlining increases the expected stack size

The compiler is in charge of allocating stack slots for the values that need to be spilled. The allocation algorithms used by the compiler are not guaranteed to be optimal, and bigger functions have a higher chance of suboptimal stack allocation. As a consequence, functions into which the compiler inlines calls to many other functions are also at risk of suboptimal stack allocation.

Solution: Since Hexagon has a high function-call overhead, inlining, which reduces that overhead, must be traded off against the risk of stack overflow.