Issue 211206.1: SIMD location expressions
Author: Markus Metzger
Type: Enhancement
Status: Open
Champion: Markus Metzger


Section 2.5.1.3, pg 29ff

Implicitly vectorized code executes multiple instances of a source code
loop body or of a source code kernel function simultaneously in a single
sequence of instructions operating on a vector of data elements (cf. SIMD:
Single Instruction Multiple Data).

This can be used to implement a SIMT (cf. Single Instruction Multiple
Threads) execution model on SIMD hardware such as GPUs, e.g. for the
OpenCL, SYCL, or HIP languages.

It can also be used to implement SIMD loops on SIMD hardware such as GPUs
or on traditional CPU hardware with vector extensions like Intel AVX,
e.g. for OpenMP or for automatically vectorized code.

In all these cases, the original source code has been written for a single
thread in the case of SIMT or for a single vector element in the case of
SIMD.  The remainder of this text uses the term 'lane' to refer to a
single SIMT thread or a single SIMD vector element.

The number of lanes inside a vectorized code region is fixed.  For SIMT,
this typically covers an entire function.  For SIMD, this typically covers
an instance of a loop body.  For example, the following OpenCL kernel
function "vec_add" is logically invoked by a separate language thread for
each index in a range of indices specified when the kernel is dispatched:

~~~
__kernel void vec_add (__global char *dst, __global char *src)
{
  int i = get_global_id(0);
  dst[i] += src[i];
}
~~~
 
The following code fragment dispatches the "vec_add" kernel:
 
~~~
  size_t global_offset[1] = { 0 };
  size_t global_size[1] = { NITEMS };
  size_t group_size[1] = { NLANES };
  clEnqueueNDRangeKernel(queue, vec_add_kernel, 1, global_offset,
                         global_size, group_size, 0, NULL, NULL);
~~~

Each invocation calls 'get_global_id()', which returns a different value
in each OpenCL thread, to determine the index it is assigned to work on.
 
When compiling for GPUs, the compiler may map one OpenCL thread to one
lane on the SIMD hardware.  To do that, the compiler implicitly vectorizes
the code to match the natural number of lanes on the target hardware.
 
The same loop or the same function may be compiled with different
vectorization factors, resulting in different instances with different
numbers of lanes.
 
In the SIMD case, different instances of the same source loop may use
different vectorization factors, resulting in different numbers of lanes
within the same function.  For example, when the following C function is
compiled with `gcc -O2 -fopenmp` on IA, the "for" statement is transformed
into both a 16-wide loop instance and a 1-wide loop instance:

~~~
void vec_add (char dst[], char src[], int len) {
  #pragma omp simd
  for (int i = 0; i < len; ++i)
    dst[i] += src[i];
}
~~~

The vectorized loop instance packs 16 source elements into IA vector
registers and processes all 16 of them in one machine instruction: it
loads 16 adjacent elements from each of src and dst, adds them, and stores
the 16 results back to dst.

The trip count, which is the number of times the loop iterates during
execution, is not known at compile-time.  Therefore the compiler also
generates a 1-wide instance of the loop that processes one source element
at a time.  Control flows from the vectorized loop instance to the scalar
loop instance when the remaining trip count falls below a threshold
determined by the compiler.
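
Roughly, the two loop instances behave like the following hand-written
sketch, which uses SSE2 intrinsics purely for illustration; it is not the
code gcc actually emits, and the switch-over threshold is simplified:

~~~
#include <emmintrin.h>

void vec_add (char dst[], char src[], int len) {
  int i = 0;
  /* 16-wide loop instance: 16 char elements per machine iteration.  */
  for (; i + 16 <= len; i += 16) {
    __m128i d = _mm_loadu_si128 ((const __m128i *) (dst + i));
    __m128i s = _mm_loadu_si128 ((const __m128i *) (src + i));
    _mm_storeu_si128 ((__m128i *) (dst + i), _mm_add_epi8 (d, s));
  }
  /* 1-wide loop instance: remaining elements, one at a time.  */
  for (; i < len; ++i)
    dst[i] += src[i];
}
~~~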
 
When debugging the above kernel function or the above loop, the user would
like to inspect the variable i and the array elements dst[i] and src[i].
Since the code has been implicitly vectorized, multiple instances of the
source code execute in parallel.

Debuggers may need to focus on a single lane at a time.  To map such
vectorized machine code back to scalar source code, debug information
must describe the location of a given variable with respect to a given
lane.

To describe this, we propose a new operator:
    DW_OP_push_lane
that allows the location of a variable to be described as a function of
the lane.
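
For illustration, assume the 16 per-lane copies of i are laid out
contiguously in memory and their base address is held in a register; the
register number (6) and the 4-byte element size below are assumptions made
for this sketch, not part of the proposal.  The location of the copy of i
that belongs to the focused lane could then be described as:

~~~
DW_OP_breg6 0      ; base address of the per-lane copies of 'i' (assumed)
DW_OP_push_lane    ; lane index of the lane the consumer focuses on
DW_OP_lit4         ; assumed per-lane element size, sizeof (int)
DW_OP_mul
DW_OP_plus         ; address of the lane's copy: base + lane * 4
~~~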

The same scalar source code may be compiled using different vectorization
factors resulting in a different number of lanes at different locations in
the machine code.  Debuggers may need to show the number of lanes for the
current machine code location and only allow the user to focus on a lane
within those boundaries.

To describe this, we propose a new subprogram attribute:
    DW_AT_num_lanes
that describes the number of lanes for implicitly vectorized code.
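
For the OpenMP example above, a producer might, for instance, attach a
location list to the subprogram that yields 16 over the address range of
the vectorized loop instance and 1 over the range of the scalar instance.
The sketch below is purely illustrative; the address ranges are made up:

~~~
DW_TAG_subprogram
  DW_AT_name       "vec_add"
  DW_AT_num_lanes  <location list>
    [0x1000, 0x1080): DW_OP_lit16   ; 16-wide loop instance (made-up range)
    [0x1080, 0x10a0): DW_OP_lit1    ; 1-wide loop instance (made-up range)
~~~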

Note that this only refers to implicitly vectorized code.  For explicitly
vectorized code, the source variables are themselves vectors and location
descriptions refer to the vector object.  Compilers may indicate this by
describing the code region as num_lanes == 1.  They would not use
DW_OP_push_lane in DWARF expressions.
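
For example, the following hypothetical OpenCL kernel is explicitly
vectorized: dst[i] and src[i] are char16 vector objects, so an ordinary
location description already refers to the whole vector and
DW_OP_push_lane is not needed:

~~~
__kernel void vec_add16 (__global char16 *dst, __global char16 *src)
{
  int i = get_global_id(0);
  dst[i] += src[i];   /* operates on 16-element vector objects */
}
~~~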

---

Section 2.5.1.3, p.29.

Add
    16. DW_OP_push_lane
        The DW_OP_push_lane operation pushes a lane index value of generic
        type, which provides the context of the lane in which the
        expression is being evaluated.  See section 3.3.5 on page 79.

        Producers that widen source code into vectorized
        machine code may use this operation to describe the location of a
        source variable as a function of a single lane in the widened
        machine code.  Consumers supply the lane index to obtain the
        location of the instance of that source variable that corresponds
        to that lane.

Section 3.3.5, p.79.

Add
    A subroutine that is implicitly vectorized may have a DW_AT_num_lanes
    attribute whose value is either a constant, a location expression, or
    a location list describing the implicit vectorization factor and the
    corresponding number of lanes.

    A value of 1 means either that the code is not vectorized or that the
    source was already vectorized and was not implicitly widened.

    If the producer implicitly vectorized already vectorized source code,
    e.g. by widening an 8-wide vectorized source into 16-wide machine
    code, this value gives the implicit widening factor, 2 in the above
    example.

    This value does not apply only to vector instructions.  If a loop or
    function has been widened, the entire loop or function body shall be
    annotated with the widening factor.

Section 7.5.4, p.207.

Add

    Attribute Name   | Value | Classes
    ---------------- | ----- | -------
    DW_AT_num_lanes  |  TBD  | constant, exprloc, loclist

to Table 7.5.

--
2022-05-19: Revised.  Previous version: http://dwarfstd.org/ShowIssue.php?issue=211206.1-1 

