Issue 211206.1: Add lane support for SIMD/SIMT machines

Author: Markus Metzger
Champion: Markus Metzger
Date submitted: 2021-12-06
Date revised: 2024-05-13
Date closed: 2024-05-13
Type: Enhancement
Status: Accepted
DWARF Version: 6

Section 2.5.1.3, pg 29ff

Implicitly vectorized code executes multiple instances of a source code loop body or of a source code kernel function simultaneously in a single sequence of instructions operating on a vector of data elements (cf. SIMD: Single Instruction Multiple Data).

This can be used to implement a SIMT (cf. Single Instruction Multiple Threads) execution model on SIMD hardware such as GPUs, e.g. for the OpenCL, SYCL, or HIP languages.

It can also be used to implement SIMD loops on SIMD hardware such as GPUs or on traditional CPU hardware with vector extensions like Intel AVX, e.g. for OpenMP or for automatically vectorized code.

In all these cases, the original source code has been written for a single thread in the case of SIMT or for a single vector element in the case of SIMD. The remainder of this text uses the term 'lane' to refer to a single SIMT thread or a single SIMD vector element.

The number of lanes inside a vectorized code region is fixed. For SIMT, this typically covers an entire function. For SIMD, this typically covers an instance of a loop body. For example, the following OpenCL kernel function vec_add is logically invoked by a separate language thread for each index in a range of indices specified when the kernel is dispatched:

__kernel void vec_add (__global char *dst, __global char *src)
{
  int i = get_global_id(0);
  dst[i] += src[i];
}

The following code fragment dispatches the vec_add kernel:

size_t global_offset[1] = { 0 };
size_t global_size[1] = { NITEMS };
size_t group_size[1] = { NLANES };
clEnqueueNDRangeKernel(queue, vec_add_kernel, 1, global_offset,
                       global_size, group_size, 0, NULL, NULL);

Each invocation calls get_global_id() to determine the index it is assigned to work on, which returns a different value in each OpenCL thread.

When compiling for GPUs, the compiler may map one OpenCL thread to one lane on the SIMD hardware. To do that, the compiler implicitly vectorizes the code to match the natural number of lanes on the target hardware.

The same loop or the same function may be compiled with different vectorization factors resulting in different instances with a different number of lanes.

In the SIMD case, we may end up with different instances of the same source loop using different vectorization factors resulting in a different number of lanes within the same function. For example, when the following C function is compiled with gcc -O2 -fopenmp on IA, the "for" statement is transformed into both a 16-wide loop instance and a 1-wide loop instance:

void vec_add (char dst[], char src[], int len) {
  #pragma omp simd
  for (int i = 0; i < len; ++i)
    dst[i] += src[i];
}

The vectorized loop instance packs 16 source elements into IA vector registers and processes all 16 source elements in one machine instruction. The code would load 16 adjacent source elements, each, add them, and store the resulting 16 source elements back.

The trip count, which is the number of times the loop iterates during execution, is not known at compile-time. Therefore the compiler also generates a 1-wide instance of the loop that processes one source element at a time. Control flows from the vectorized loop instance to the scalar loop instance when the remaining trip count falls below a threshold determined by the compiler.

When debugging the above kernel function or the above loop, the user would like to be able to inspect the variable i and the array elements dst[i] and src[i]. Since the code had been implicitly vectorized, multiple instances of the source code are executed in parallel.

Debuggers may need to focus on a single lane at a time. To map such vectorized machine code back to scalar source code, debug information must describe the location of a given variable with respect to a given lane.

DWARF-5 can describe the location of the vectorized object but it cannot describe the location of an instance of a variable inside that vectorized object. Further, type information would describe an individual instance, not the vectorized variable. DWARF-5 cannot describe the relationship between the two.

To describe the location of a source variable inside an implicitly vectorized object, we propose a new operator:

DW_OP_push_lane

that allows the location of a variable to be described as function of the lane.

The same scalar source code may be compiled using different vectorization factors resulting in a different number of lanes at different locations in the machine code. Debuggers may need to show the number of lanes for the current machine code location and only allow the user to focus on a lane within those boundaries.

To describe this, we propose a new subprogram attribute:

DW_AT_num_lanes

that describes the number of lanes for implicitly vectorized code.

Note that this only refers to implicitly vectorized code. For explicitly vectorized code, the source variables are themselves vectors and location descriptions refer to the vector object. Compilers may indicate this by describing the code region as num_lanes == 1. They would not use DW_OP_push_lane in DWARF expressions.


Section 2.2, p.17.

Add

DW_AT_num_lanes  |  Number of implicitly vectorized lanes

to table 2.2.

Section 2.5.1.3, p.29.

Add

16. DW_OP_push_lane
The DW_OP_push_lane operation pushes a lane index value of generic type, which provides the context of the lane in which the expression is being evaluated. See section 3.3.5 on page 79.

Producers that widen source code into vectorized machine code may use this operation to describe the location of a source variable as function of a single lane in the widened machine code.

Consumers will supply the lane argument to obtain the location of the instance of that source variable that corresponds to the provided lane argument.

to the end of the list on p.33.

Section 3.3.5, p.79.

Add

SIMD instructions process multiple data elements in one instruction. The number of data elements that is processed with one instruction is typically referred to as SIMD width. Each individual data element is typically referred to as SIMD lane.

When generating code for a SIMD architecture, compilers may need to implicitly widen the source code to match the SIMD width of the instruction set they are using. Source variables are widened into a vector of variables, with one instance per SIMD lane.

A subroutine that is implicitly vectorized may have a DW_AT_num_lanes attribute whose value describes the implicit vectorization factor and the corresponding number of lanes in the generated code. The value of this attribute is determined as described in Section 2.19 on page 55.

To refer to individual lanes in such vectorized code, for example to describe the location of widened source variables, producers may use the DW_OP_push_lane operation to have consumers supply the current focus lane for which to evaluate the expression. The pushed lane index must be an unsigned integer value between zero (inclusive) and the value of DW_AT_num_lanes (exclusive) at the current location. See section 2.5.1.3 on page 29.

If the attribute is omitted, its value is defined by the ABI.

If the source code had already been vectorized and is not further widened by the compiler, the value should be one.

This value does not only apply to vector instructions. If a loop or function has been widened, the entire loop or function body shall be annotated with the vectorization factor.

Section 7.5.4, p.207.

Add

DW_AT_num_lanes  |  TBD  | constant, exprval, vallist

to Table 7.5.

Section 7.7, p.223.

Add

DW_OP_push_lane  |  TBD  |  0  |

to table 7.9.

Section D.17 (new)

Add

D.17 SIMD location descriptions

The following example uses a hypothetical machine with 64b scalar registers r0, r1, ..., and 256b vector registers v0, v1, ... that supports SIMD instructions with different SIMD widths. Scalar arguments are passed in scalar registers starting with r0 for the first argument.

Consider the source code in figure D.73, which is implicitly widened by a vectorization factor of 8 to match the 256b vector registers of the target machine, resulting in the pseudo-code in figure D.74.

Figure D.73: C OpenMP source code

void vec_add (int dst[], int src[], int len) {
  #pragma omp simd
  for (int i = 0; i < len; ++i)
      dst[i] += src[i];
}

Figure D.74: Pseudo Assembly code

.l0:
    move.64b    r3, 0              ; i = 0
.l1:            ; implicitly 8-wide vectorized loop body
    add.64b     r4, r3, 8          ; inext = i + 8
    cmp.64b     r4, r2             ; compare inext to len
    jmp.ge      .l2                ; jump to .l2 if inext >= len
    load.256b   v0, [r0+4*r3]      ; v0[n] = dst[i+n] for n in [0..7]
.l1.1:
    load.256b   v1, [r1+4*r3]      ; v1[n] = src[i+n] for n in [0..7]
.l1.2:          ; add 8 elements
    add.simd-8  v0, v0, v1         ; v0[n] = v0[n] + v1[n] for n in [0..7]
    store.256b  [r0+4*r3], v0      ; dst[i+n] = v0[n] for i in [0..7]
.l1.3:
    mov.64b     r3, r4             ; i = inext
    jmp         .l1                ; loop back for more
.l2:            ; scalar loop body
    add.64b     r4, r3, 1          ; inext = i + 1
    cmp.64b     r4, r2             ; compare inext to len
    jmp.ge      .l3                ; jump to .l3 if inext >= len
    load.32b    r5, [r0+4*r3]      ; r5 = dst[i]
.l2.1:
    load.32b    r6, [r1+4*r3]      ; r6 = src[i]
.l2.2:          ; add a single element
    add.32b     r5, r5, r6         ; r5 = r5 + r6
    store.32b   [r0+4*r3], r5      ; dst[i] = r5
.l2.3:
    mov.64b     r3, r4             ; i = inext
    jmp .l2                        ; loop back for more
.l3:
    return
.l4:

The machine code contains two instances of the source loop: one instance with SIMD width 8 at .l1, and one scalar instance at .l2 to handle any remaining elements.

This function may be described in DWARF as shown in figure D.75.

Figure D.75: Possible DWARF description of the function in D.73 compiled into pseudo machine code in D.74.

DW_TAG_subprogram
    DW_AT_name "vec_add"
    DW_AT_num_lanes .vallist.0
    ...
    DW_TAG_variable
        DW_AT_name "i"
        DW_AT_type int
        DW_AT_location .loclist.1
        ...

.vallist.0:
    range [.l1, .l2)
        DW_OP_lit8
    end-of-list

.loclist.1:
    range [.l0, .l1)
        DW_OP_regx r3
    range [.l1, .l2)
        DW_OP_bregx r3, 0
        DW_OP_push_lane
        DW_OP_plus
        DW_OP_stack_value
    range [.l2, .l4)
        DW_OP_regx r3
    end-of-list

to the end of Appendix D on page 365.


2022-05-19: Revised.

2023-02-01: Revised.

2023-04-18: Revised based on feedback from 4/17/23 meeting.

2023-05-25: Revised based on feedback from 5/24/23 GPU meeting.

2023-06-12: Revised based on email feedback.

2023-09-11: Revised examples.

2024-04-29: Needs revision to use vallist class for DW_AT_num_lanes and remove DW_OP_stack_value from loclist.0 in example.

2024-04-30: Revised based on feedback from 4/29/24 meeting.

2024-05-08: Revised example to fix typo and add comments.

2024-05-13: Accepted with new title. Changed from "SIMD Location Descriptions" to "Add lane support for SIMD/SIMT machines."