Issue 211206.1: Add lane support for SIMD/SIMT machines
| Author: | Markus Metzger |
|---|---|
| Champion: | Markus Metzger |
| Date submitted: | 2021-12-06 |
| Date revised: | 2024-05-13 |
| Date closed: | 2024-05-13 |
| Type: | Enhancement |
| Status: | Accepted |
| DWARF Version: | 6 |
Section 2.5.1.3, p.29ff.
Implicitly vectorized code executes multiple instances of a source code loop body or of a source code kernel function simultaneously in a single sequence of instructions operating on a vector of data elements (cf. SIMD: Single Instruction Multiple Data).
This can be used to implement a SIMT (cf. Single Instruction Multiple Threads) execution model on SIMD hardware such as GPUs, e.g. for the OpenCL, SYCL, or HIP languages.
It can also be used to implement SIMD loops on SIMD hardware such as GPUs or on traditional CPU hardware with vector extensions like Intel AVX, e.g. for OpenMP or for automatically vectorized code.
In all these cases, the original source code has been written for a single thread in the case of SIMT or for a single vector element in the case of SIMD. The remainder of this text uses the term 'lane' to refer to a single SIMT thread or a single SIMD vector element.
The number of lanes inside a vectorized code region is fixed. For SIMT, this typically covers an entire function. For SIMD, this typically covers an instance of a loop body. For example, the following OpenCL kernel function vec_add is logically invoked by a separate language thread for each index in a range of indices specified when the kernel is dispatched:
```c
__kernel void vec_add (__global char *dst, __global char *src)
{
    int i = get_global_id(0);
    dst[i] += src[i];
}
```
The following code fragment dispatches the vec_add kernel:
```c
size_t global_offset[1] = { 0 };
size_t global_size[1] = { NITEMS };
size_t group_size[1] = { NLANES };

clEnqueueNDRangeKernel(queue, vec_add_kernel, 1, global_offset,
                       global_size, group_size, 0, NULL, NULL);
```
Each invocation calls get_global_id() to determine the index it is assigned to work on; the call returns a different value in each OpenCL thread.
When compiling for GPUs, the compiler may map one OpenCL thread to one lane on the SIMD hardware. To do that, the compiler implicitly vectorizes the code to match the natural number of lanes on the target hardware.
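For illustration, suppose the natural SIMD width is 8: the widened kernel then behaves as if each machine instruction processed eight work-items in lockstep. The following C sketch is conceptual only; the parameter group_base is illustrative and stands for the first global id covered by the current vector:

```c
/* Conceptual behavior of the widened kernel body (sketch only; the
   compiler emits one vector instruction rather than this loop): */
void vec_add_lanes (char *dst, char *src, int group_base)
{
    for (int lane = 0; lane < 8; ++lane) {
        int i = group_base + lane;   /* get_global_id(0) for this lane */
        dst[i] += src[i];            /* one element of the vector add  */
    }
}
```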
The same loop or the same function may be compiled with different vectorization factors, resulting in different instances with different numbers of lanes.
In the SIMD case, different instances of the same source loop may use different vectorization factors, resulting in a different number of lanes within the same function. For example, when the following C function is compiled with gcc -O2 -fopenmp on IA, the "for" statement is transformed into both a 16-wide loop instance and a 1-wide loop instance:
```c
void vec_add (char dst[], char src[], int len) {
#pragma omp simd
  for (int i = 0; i < len; ++i)
    dst[i] += src[i];
}
```
The vectorized loop instance packs 16 source elements into IA vector registers and processes all 16 elements in one machine instruction: it loads 16 adjacent elements from each input array, adds them, and stores the 16 results back.

The trip count, which is the number of times the loop iterates during execution, is not known at compile time. Therefore, the compiler also generates a 1-wide instance of the loop that processes one source element at a time. Control flows from the vectorized loop instance to the scalar loop instance when the remaining trip count falls below a threshold determined by the compiler.
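For illustration, the generated code is structured as if the compiler had rewritten the loop as follows. This is a conceptual C sketch: the inner loop stands for a single 16-wide vector instruction, and the threshold at which control transfers to the scalar instance is shown as 16 for simplicity:

```c
/* Conceptual structure of the generated code (sketch only): */
void vec_add_widened (char dst[], char src[], int len)
{
    int i = 0;
    for (; i + 16 <= len; i += 16)     /* 16-wide loop instance       */
        for (int n = 0; n < 16; ++n)   /* one vector add instruction  */
            dst[i + n] += src[i + n];
    for (; i < len; ++i)               /* 1-wide (scalar) instance    */
        dst[i] += src[i];
}
```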
When debugging the above kernel function or the above loop, the user would like to be able to inspect the variable i and the array elements dst[i] and src[i]. Since the code has been implicitly vectorized, multiple instances of the source code are executed in parallel.
Debuggers may need to focus on a single lane at a time. To map such vectorized machine code back to scalar source code, debug information must describe the location of a given variable with respect to a given lane.
DWARF-5 can describe the location of the vectorized object but it cannot describe the location of an instance of a variable inside that vectorized object. Further, type information would describe an individual instance, not the vectorized variable. DWARF-5 cannot describe the relationship between the two.
To describe the location of a source variable inside an implicitly vectorized object, we propose a new operator, DW_OP_push_lane, that allows the location of a variable to be described as a function of the lane.
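For example, if a vectorized loop keeps the base index in a scalar register r3 and lane n works on element r3 + n, as in the appendix D.17 example below, the per-lane value of the loop variable i could be described as:

```
DW_OP_bregx r3, 0    ; push the base index kept in r3
DW_OP_push_lane      ; push the current focus lane
DW_OP_plus           ; i = base + lane
DW_OP_stack_value    ; the computed value, not a memory location
```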
The same scalar source code may be compiled using different vectorization factors resulting in a different number of lanes at different locations in the machine code. Debuggers may need to show the number of lanes for the current machine code location and only allow the user to focus on a lane within those boundaries.
To describe this, we propose a new subprogram attribute, DW_AT_num_lanes, that describes the number of lanes for implicitly vectorized code.
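For example, a subprogram that is uniformly widened by a factor of 8 might carry the attribute in constant form (a sketch; a value list is used instead when the factor varies by program location, as in appendix D.17):

```
DW_TAG_subprogram
    DW_AT_name      "vec_add"
    DW_AT_num_lanes 8        ; constant form for a uniform factor
```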
Note that this applies only to implicitly vectorized code. For explicitly vectorized code, the source variables are themselves vectors and location descriptions refer to the vector object. Compilers may indicate this by describing the code region as num_lanes == 1. They would not use DW_OP_push_lane in DWARF expressions.
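For contrast, a minimal sketch of explicitly vectorized source, using a hypothetical OpenCL kernel with standard vector types and built-ins:

```c
/* a and b are themselves vectors, so their location descriptions
   refer to the whole vector object; DW_OP_push_lane is not used and
   the code region would be described with num_lanes == 1. */
__kernel void vec_add8 (__global int *dst, __global int *src)
{
    int8 a = vload8(0, dst);
    int8 b = vload8(0, src);
    vstore8(a + b, 0, dst);
}
```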
Section 2.2, p.17.
Add
| DW_AT_num_lanes | Number of implicitly vectorized lanes |

to Table 2.2.
Section 2.5.1.3, p.29.
Add
16. DW_OP_push_lane

    The DW_OP_push_lane operation pushes a lane index value of generic type, which provides the context of the lane in which the expression is being evaluated. See Section 3.3.5 on page 79.

    Producers that widen source code into vectorized machine code may use this operation to describe the location of a source variable as a function of a single lane in the widened machine code. Consumers supply the lane argument to obtain the location of the instance of the source variable that corresponds to that lane.

to the end of the list on p.33.
Section 3.3.5, p.79.
Add
SIMD instructions process multiple data elements in one instruction. The number of data elements processed by one instruction is typically referred to as the SIMD width; each individual data element is typically referred to as a SIMD lane.

When generating code for a SIMD architecture, compilers may need to implicitly widen the source code to match the SIMD width of the instruction set they are using. Source variables are widened into a vector of variables, with one instance per SIMD lane.

A subroutine that is implicitly vectorized may have a DW_AT_num_lanes attribute whose value describes the implicit vectorization factor and the corresponding number of lanes in the generated code. The value of this attribute is determined as described in Section 2.19 on page 55.

To refer to individual lanes in such vectorized code, for example to describe the location of widened source variables, producers may use the DW_OP_push_lane operation to have consumers supply the current focus lane for which to evaluate the expression. The pushed lane index must be an unsigned integer value between zero (inclusive) and the value of DW_AT_num_lanes (exclusive) at the current location. See Section 2.5.1.3 on page 29.

If the attribute is omitted, its value is defined by the ABI. If the source code has already been vectorized and is not further widened by the compiler, the value should be one.

This value applies not only to vector instructions: if a loop or function has been widened, the entire loop or function body shall be annotated with the vectorization factor.
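As an illustration outside the proposed normative text, the lane bound above amounts to the following consumer-side check (a minimal sketch; the helper num_lanes_at is hypothetical and stands for evaluating DW_AT_num_lanes at the given program counter):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical debugger helper: evaluate DW_AT_num_lanes at pc. */
extern uint64_t num_lanes_at (uint64_t pc);

/* A focus lane is valid iff 0 <= lane < DW_AT_num_lanes at pc. */
bool lane_is_valid (uint64_t lane, uint64_t pc)
{
    return lane < num_lanes_at (pc);
}
```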
Section 7.5.4, p.207.
Add
| DW_AT_num_lanes | TBD | constant, exprval, vallist |
to Table 7.5.
Section 7.7, p.223.
Add
| DW_OP_push_lane | TBD | 0 |

to Table 7.9.
Section D.17 (new)
Add
D.17 SIMD location descriptions
The following example uses a hypothetical machine with 64b scalar registers r0, r1, ..., and 256b vector registers v0, v1, ..., that supports SIMD instructions with different SIMD widths. Scalar arguments are passed in scalar registers, starting with r0 for the first argument.

Consider the source code in Figure D.73, which is implicitly widened by a vectorization factor of 8 to match the 256b vector registers of the target machine, resulting in the pseudo-code in Figure D.74.
Figure D.73: C OpenMP source code
```c
void vec_add (int dst[], int src[], int len) {
#pragma omp simd
  for (int i = 0; i < len; ++i)
    dst[i] += src[i];
}
```
Figure D.74: Pseudo Assembly code
```
.l0:    mov.64b    r3, 0            ; i = 0
.l1:    ; implicitly 8-wide vectorized loop body
        add.64b    r4, r3, 8        ; inext = i + 8
        cmp.64b    r4, r2           ; compare inext to len
        jmp.ge     .l2              ; jump to .l2 if inext >= len
        load.256b  v0, [r0+4*r3]    ; v0[n] = dst[i+n] for n in [0..7]
.l1.1:  load.256b  v1, [r1+4*r3]    ; v1[n] = src[i+n] for n in [0..7]
.l1.2:  ; add 8 elements
        add.simd-8 v0, v0, v1       ; v0[n] = v0[n] + v1[n] for n in [0..7]
        store.256b [r0+4*r3], v0    ; dst[i+n] = v0[n] for n in [0..7]
.l1.3:  mov.64b    r3, r4           ; i = inext
        jmp        .l1              ; loop back for more
.l2:    ; scalar loop body
        add.64b    r4, r3, 1        ; inext = i + 1
        cmp.64b    r4, r2           ; compare inext to len
        jmp.ge     .l3              ; jump to .l3 if inext >= len
        load.32b   r5, [r0+4*r3]    ; r5 = dst[i]
.l2.1:  load.32b   r6, [r1+4*r3]    ; r6 = src[i]
.l2.2:  ; add a single element
        add.32b    r5, r5, r6       ; r5 = r5 + r6
        store.32b  [r0+4*r3], r5    ; dst[i] = r5
.l2.3:  mov.64b    r3, r4           ; i = inext
        jmp        .l2              ; loop back for more
.l3:    return
.l4:
```
The machine code contains two instances of the source loop: one instance with SIMD width 8 at .l1, and one scalar instance at .l2 to handle any remaining elements.

This function may be described in DWARF as shown in Figure D.75.
Figure D.75: Possible DWARF description of the function in D.73 compiled into pseudo machine code in D.74.
```
DW_TAG_subprogram
    DW_AT_name      "vec_add"
    DW_AT_num_lanes .vallist.0
    ...
    DW_TAG_variable
        DW_AT_name     "i"
        DW_AT_type     int
        DW_AT_location .loclist.1
    ...

.vallist.0:
    range [.l1, .l2)
        DW_OP_lit8
    end-of-list

.loclist.1:
    range [.l0, .l1)
        DW_OP_regx r3
    range [.l1, .l2)
        DW_OP_bregx r3, 0
        DW_OP_push_lane
        DW_OP_plus
        DW_OP_stack_value
    range [.l2, .l4)
        DW_OP_regx r3
    end-of-list
```
to the end of Appendix D on page 365.
2022-05-19: Revised.
2023-02-01: Revised.
2023-04-18: Revised based on feedback from 4/17/23 meeting.
2023-05-25: Revised based on feedback from 5/24/23 GPU meeting.
2023-06-12: Revised based on email feedback.
2023-09-11: Revised examples.
2024-04-29: Needs revision to use vallist class for DW_AT_num_lanes and remove DW_OP_stack_value from loclist.0 in example.
2024-04-30: Revised based on feedback from 4/29/24 meeting.
2024-05-08: Revised example to fix typo and add comments.
2024-05-13: Accepted with new title. Changed from "SIMD Location Descriptions" to "Add lane support for SIMD/SIMT machines."