Issue 211206.1: SIMD location expressions
Author: Markus Metzger
Type: Enhancement
Status: Open


Section 2.5.1.3, p.29ff

Implicitly vectorized code executes multiple instances of a source code
loop body or of a source code kernel function simultaneously in a single
sequence of instructions operating on a vector of data elements (cf. SIMD:
Single Instruction Multiple Data).

This can be used to implement a SIMT (cf. Single Instruction Multiple
Threads) execution model on SIMD hardware such as GPUs, e.g. for the
OpenCL, SYCL, or HIP languages.

It can also be used to implement SIMD loops on SIMD hardware such as GPUs
or on traditional CPU hardware with vector extensions like Intel AVX,
e.g. for OpenMP or for automatically vectorized code.

In all these cases, the original source code has been written for a single
thread in the case of SIMT or for a single vector element in the case of
SIMD.  The remainder of this text uses the term 'lane' to refer to a
single SIMT thread or a single SIMD vector element.

The number of lanes inside a vectorized code region is fixed.  For SIMT,
this typically covers an entire function.  For SIMD, this typically covers
an instance of a loop body.  For example, the following OpenCL kernel
function "vec_add" is logically invoked by a separate language thread for
each index in a range of indices specified when the kernel is dispatched:

~~~
__kernel void vec_add (__global char *dst, __global char *src)
{
  int i = get_global_id(0);
  dst[i] += src[i];
}
~~~

The following code fragment dispatches the "vec_add" kernel:

~~~
  size_t global_offset[1] = { 0 };
  size_t global_size[1] = { NITEMS };
  size_t group_size[1] = { NLANES };
  clEnqueueNDRangeKernel(queue, vec_add_kernel, 1, global_offset,
                         global_size, group_size, 0, NULL, NULL);
~~~

Each invocation calls 'get_global_id()', which returns a different value
in each OpenCL thread, to determine the index it is assigned to work on.
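
To make the lane mapping concrete, the following scalar C model mirrors
the dispatch above.  This is a sketch only: vec_add_model is a
hypothetical name, NITEMS and NLANES are given assumed values, and each
logical thread is modeled as one loop iteration rather than actual GPU
execution.

~~~
#include <stddef.h>

enum { NITEMS = 1024, NLANES = 16 };   /* assumed dispatch parameters */

/* Scalar model of the "vec_add" dispatch: one logical thread per index
   in [0, NITEMS), grouped NLANES at a time.  gid is the value each
   logical thread obtains from get_global_id(0); the inner induction
   variable corresponds to the lane. */
static void vec_add_model (char *dst, const char *src)
{
  for (size_t group = 0; group < NITEMS / NLANES; ++group)
    for (size_t lane = 0; lane < NLANES; ++lane) {
      size_t gid = group * NLANES + lane;
      dst[gid] += src[gid];
    }
}
~~~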

When compiling for GPUs, the compiler may map one OpenCL thread to one
lane on the SIMD hardware.  To do that, the compiler implicitly vectorizes
the code to match the natural number of lanes on the target hardware.
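
As an illustration of this transformation, widening "vec_add" by an
assumed factor of 8 could be modeled by the following pseudo-C sketch
(vec_add_simd8 and base are hypothetical names; this is not actual
compiler output):

~~~
/* One invocation covers 8 adjacent OpenCL threads; base stands for the
   global id of lane 0 of this invocation. */
static void vec_add_simd8 (char *dst, const char *src, int base)
{
  for (int lane = 0; lane < 8; ++lane) {  /* in hardware: one 8-wide SIMD add */
    int i = base + lane;                  /* this lane's get_global_id(0) */
    dst[i] += src[i];
  }
}
~~~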

The same loop or the same function may be compiled with different
vectorization factors, resulting in different instances with a different
number of lanes.

In the SIMD case, we may end up with different instances of the same
source loop using different vectorization factors, resulting in a different
number of lanes within the same function.  For example, when the following
C function is compiled with `gcc -O2 -fopenmp` on IA, the "for" statement
is transformed into both a 16-wide loop instance and a 1-wide loop
instance:

~~~
void vec_add (char dst[], char src[], int len) {
  #pragma omp simd
  for (int i = 0; i < len; ++i)
    dst[i] += src[i];
}
~~~

The vectorized loop instance packs 16 source elements into IA vector
registers and processes all 16 in one machine instruction: it loads 16
adjacent elements from each array, adds them element-wise, and stores the
16 results back.

The trip count, which is the number of times the loop iterates during
execution, is not known at compile-time.  Therefore the compiler also
generates a 1-wide instance of the loop that processes one source element
at a time.  Control flows from the vectorized loop instance to the scalar
loop instance when the remaining trip count falls below a threshold
determined by the compiler.
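
The following scalar C model sketches the resulting loop structure
(illustrative only, not actual gcc output; vec_add_widened is a
hypothetical name):

~~~
void vec_add_widened (char dst[], char src[], int len)
{
  int i = 0;
  for (; i + 16 <= len; i += 16)            /* 16-wide loop instance */
    for (int lane = 0; lane < 16; ++lane)   /* one 16-wide SIMD add */
      dst[i + lane] += src[i + lane];
  for (; i < len; ++i)                      /* 1-wide remainder instance */
    dst[i] += src[i];
}
~~~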

When debugging the above kernel function or the above loop, the user
would like to inspect the variable i and the array elements dst[i] and
src[i].  Since the code has been implicitly vectorized, multiple instances
of the source code execute in parallel.

Debuggers may need to focus on a single lane at a time.  To map such
vectorized machine code back to scalar source code, debug information
must describe the location of a given variable with respect to a given
lane.

To describe this, we propose a new operator:
    DW_OP_push_lane
that allows the location of a variable to be described as a function of the
lane.
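
For example, assuming a scalar register r3 holds the base index of the
current SIMD iteration, the widened loop counter i could be described as
follows (a sketch in the notation of the Appendix D example below):

~~~
    DW_OP_bregx r3, 0    ; base index of the current SIMD iteration
    DW_OP_push_lane      ; focus lane supplied by the consumer
    DW_OP_plus
    DW_OP_stack_value    ; i = r3 + lane
~~~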

The same scalar source code may be compiled using different vectorization
factors, resulting in a different number of lanes at different locations in
the machine code.  Debuggers may need to show the number of lanes for the
current machine code location and only allow the user to focus on a lane
within those boundaries.

To describe this, we propose a new subprogram attribute:
    DW_AT_num_lanes
that describes the number of lanes for implicitly vectorized code.
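
For the 16-wide and 1-wide loop instances above, the attribute could be
a location list along these lines (a sketch; .l_vec, .l_scalar, and
.l_end are hypothetical labels delimiting the two instances):

~~~
    DW_AT_num_lanes:
    range [.l_vec, .l_scalar)
        DW_OP_lit16
        DW_OP_stack_value
    range [.l_scalar, .l_end)
        DW_OP_lit1
        DW_OP_stack_value
    end-of-list
~~~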

Note that this only refers to implicitly vectorized code.  For explicitly
vectorized code, the source variables are themselves vectors and location
descriptions refer to the vector object.  Compilers may indicate this by
describing the code region as num_lanes == 1.  They would not use
DW_OP_push_lane in DWARF expressions.
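
For illustration, in the following explicitly vectorized OpenCL kernel
the source variables are themselves vectors; a location description
refers to the whole char4 object, and DW_OP_push_lane is not needed:

~~~
__kernel void vec_add4 (__global char4 *dst, __global char4 *src)
{
  int i = get_global_id(0);
  dst[i] += src[i];   /* explicit 4-wide vector add in the source */
}
~~~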

Note that for the example in Appendix D involving arrays split between
memory and registers, I am using operators introduced in issue 211206.2.

---

Section 2.2, p.17.

Add
    DW_AT_num_lanes  |  Number of available lanes
to Table 2.2.


Section 2.5.1.3, p.29.

Add
    16. DW_OP_push_lane
        The DW_OP_push_lane operation pushes a lane index value of generic
        type, which provides the context of the lane in which the
        expression is being evaluated.  See section 3.3.5 on page 79.

        Producers that widen source code into vectorized machine code may
        use this operation to describe the location of a source variable
        as a function of a single lane in the widened machine code.

        Consumers supply the lane argument to obtain the location of the
        instance of the source variable that corresponds to that lane.

to the end of the list on p.33.


Section 3.3.5, p.79.

Add
    SIMD instructions process multiple data elements in one instruction.
    The number of data elements processed by one instruction is typically
    referred to as the SIMD width; each individual data element is
    typically referred to as a SIMD lane.

    When generating code for a SIMD architecture, compilers may need to
    implicitly widen the source code to match the SIMD width of the
    instruction set they are using.  Source variables are widened into a
    vector of variables, with one instance per SIMD lane. 

    A subroutine that is implicitly vectorized may have a DW_AT_num_lanes
    attribute whose value is either a constant, a location expression, or
    a location list describing the implicit vectorization factor and the
    corresponding number of lanes in the generated code.

    To refer to individual lanes in such vectorized code, for example to
    describe the location of widened source variables, producers may use
    the DW_OP_push_lane operation to have consumers supply the current
    focus lane for which to evaluate the location expression.  The pushed
    lane index must be an unsigned integer value between zero (inclusive)
    and the value of this attribute (exclusive) at the current location.
    See section 2.5.1.3 on page 29.

    If the source code has already been vectorized and is not further
    widened by the compiler, the value should be one.  This also applies
    to non-vectorized code, where the attribute is typically omitted.

    This value applies not only to vector instructions: if a loop or
    function has been widened, the entire loop or function body shall be
    annotated with the vectorization factor.
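
As a non-normative illustration of the consumer side, a debugger might
validate a user-selected focus lane against DW_AT_num_lanes along these
lines (a C sketch; struct frame and eval_num_lanes are hypothetical
debugger internals):

~~~
#include <stdbool.h>

/* Hypothetical debugger-internal frame state. */
struct frame {
  unsigned focus_lane;
  /* ... */
};

/* Stub standing in for evaluating DW_AT_num_lanes at the frame's PC;
   returns 1 if the attribute is absent. */
static unsigned eval_num_lanes (const struct frame *frame)
{
  (void) frame;
  return 16;   /* e.g., within a 16-wide loop instance */
}

/* The pushed lane index must lie in [0, num_lanes) at the current
   location, so reject anything outside that range. */
static bool set_focus_lane (struct frame *frame, unsigned lane)
{
  if (lane >= eval_num_lanes (frame))
    return false;
  frame->focus_lane = lane;
  return true;
}
~~~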

Section 7.5.4, p.207.

Add
    DW_AT_num_lanes  |  TBD  | constant, exprloc, loclist
to Table 7.5.

Section 7.7, p.223.

Add
    DW_OP_push_lane  |  TBD  |  0  |
to Table 7.9.


Section D.17 (new)

Add
    D.17 SIMD location descriptions

    The following example uses a hypothetical machine with 64b scalar
    registers r0, r1, ..., and 256b vector registers v0, v1, ... that
    supports SIMD instructions with different SIMD widths.  Scalar
    arguments are passed in scalar registers starting with r0 for the
    first argument.

    Consider the source code in figure D.73, which is implicitly widened
    by a vectorization factor of 4 to match the 256b vector registers of
    the target machine, resulting in the pseudo-code in figure D.74.

    Figure D.73: C OpenMP source code
    ~~~
    void vec_add (int dst[], int src[], int len) {
      #pragma omp simd
      for (int i = 0; i < len; ++i)
          dst[i] += src[i];
    }
    ~~~

    Figure D.74: Pseudo Assembly code
    ~~~
    .l0:
    mov.64b     r3, 0
    .l1:
    add.64b     r4, r3, 4
    cmp.64b     r4, r2
    jmp.ge      .l2
    load.256b   v0, [r0+4*r3]
    .l1.1:
    load.256b   v1, [r1+4*r3]
    .l1.2:
    add.simd-4  v2, v0, v1
    store.256b  [r0+4*r3], v2
    .l1.3:
    mov.64b     r3, r4
    jmp         .l1
    .l2:
    add.64b     r4, r3, 1
    cmp.64b     r4, r2
    jmp.ge      .l3
    load.32b    r5, [r0+4*r3]
    .l2.1:
    load.32b    r6, [r1+4*r3]
    .l2.2:
    add.32b     r7, r5, r6
    store.32b   [r0+4*r3], r7
    .l2.3:
    mov.64b     r3, r4
    jmp         .l2
    .l3:
    return
    .l4:
    ~~~

    The machine code contains two instances of the source loop: one
    instance with SIMD width 4 at .l1, and one scalar instance at .l2 to
    handle any remaining elements.

    This function may be described in DWARF as shown in figure D.75.

    Figure D.75: Possible DWARF description of the function in D.73
                 compiled into pseudo machine code in D.74.
    ~~~
    DW_TAG_subprogram
        DW_AT_name "vec_add"
        DW_AT_num_lanes .loclist.0
        ...
        DW_TAG_variable
            DW_AT_name "i"
            DW_AT_type int
            DW_AT_location .loclist.1
            ...
    .type.arr:
    DW_TAG_array_type
        DW_AT_type int
        DW_TAG_subrange_type
            DW_AT_lower_bound 0

    .loclist.0:
    range [.l1, .l2)
        DW_OP_lit4
        DW_OP_stack_value
    end-of-list

    .loclist.1:
    range [.l0, .l1)
        DW_OP_regx r3
    range [.l1, .l2)
        DW_OP_bregx r3, 0
        DW_OP_push_lane
        DW_OP_plus
        DW_OP_stack_value
    range [.l2, .l4)
        DW_OP_regx r3
    end-of-list
    ~~~
to the end of Appendix D on page 365.
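
As a worked check of .loclist.1 in the vectorized range [.l1, .l2): the
expression DW_OP_bregx r3, 0; DW_OP_push_lane; DW_OP_plus;
DW_OP_stack_value yields i = r3 + lane.  The C sketch below spells this
out (i_at_lane is a hypothetical helper mirroring the expression):

~~~
#include <assert.h>

/* i = r3 + lane, per the expression in the range [.l1, .l2). */
static long i_at_lane (long r3, unsigned lane)
{
  return r3 + (long) lane;
}

int main (void)
{
  /* With r3 = 8 and focus lane 2, i is 10: the third element of the
     current 4-wide iteration. */
  assert (i_at_lane (8, 2) == 10);
  return 0;
}
~~~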

--
2022-05-19: Revised.  Previous version: http://dwarfstd.org/issues/211206.1-1.html 
2023-02-01: Revised.  Previous version: http://dwarfstd.org/issues/211206.1-2.html

