Issue 161122.1: Fixed-size variant of DW_FORM_strx

Author: Paul Robinson
Champion: Paul Robinson
Date submitted: 2016-11-22
Date revised:
Date closed:
Type: Improvement
Status: Accepted with modification
DWARF Version: 5
Section Many, pg Many
Some consumers (notably LLDB) can load debugging information faster if
the DIEs are (mostly) fixed-size.  It's possible to determine whether
a given DIE is fixed-size just by looking at the list of FORMs used by
its attributes, which are readily available in the abbreviations;
pre-parsing the abbreviations makes this a very simple task.

Key producers (at least gcc and clang) currently emit most or all
strings into the .debug_str section and use DW_FORM_strp to specify a
reference to the string.  In DWARF 5 we have introduced DW_FORM_strx;
we should expect producers to take advantage of this new form.

DW_FORM_strp is a fixed-size reference; DW_FORM_strx is a variable-size
ULEB.  Given that a whole lot of DIEs have a DW_AT_name or other string, 
this turns out to cause a whole lot of DIEs to go from being fixed-size 
in DWARF 4 to variable-size in DWARF 5.  This will likely have an 
adverse effect on loading time by LLDB and other consumers.

I've tried to quantify "a whole lot" by looking at two largish programs:
Clang (as built by GCC) and a PS4 game title (as built by Clang).  It's
straightforward to count the proportion of DIEs that are fixed-size in
DWARF 4.  If we then assume all DW_FORM_strp become DW_FORM_strx in
DWARF 5, we can estimate the proportion of fixed-size DIEs in DWARF 5.

                           DWARF 4   DWARF 5
Fixed-size DIEs in Clang:   93%        57%       
Fixed-size DIEs in game7:   89%        55%

This is a very serious consequence, so I'm proposing that we add a
fixed-size variant of DW_FORM_strx.

The next question is, how big should this be?  How many strings per CU 
do we see from these applications?

- Clang has 1659 CUs, with a maximum of 95652 unique strings used in
any one CU.  However, 97% of CUs use no more than 65535 strings.
54% use between 2^14 and 2^16 strings (so a fixed 2-byte index would
save space compared to a 3-byte ULEB).
- Game7 has 2461 CUs, with a maximum of 17093 unique strings used in
any one CU.  Therefore, 100% of CUs use no more than 65535 strings.

This data suggests a 2-byte fixed-size variant of DW_FORM_strx would
suffice for nearly all string references in these two applications.

Alternatively, we could redefine DW_FORM_strx as a fixed-size 4-byte
index. While this achieves the goal of making the form fixed-size, and
also makes it essentially impossible to overflow, it has a significant
size cost.  Looking again at my two applications, using a 2-byte form
(with escape to ULEB when needed) saves approximately 12% compared to
using a 4-byte form.

Textual changes:

This is the 'substantive' bit.

Section 7.5.5 p.217 (class string) 3rd sub-bullet
    Rewrite the bullet as follows:
    - as an indirect offset into the string table using an index into
      a table of offsets contained in the .debug_str_offsets section
      of the object file. Each index is interpreted as a zero-based
      index into this table.  There are two forms for this index, a
      fixed length two byte index (DW_FORM_strx2) and a variable length
      unsigned LEB128 index (DW_FORM_strx). The offset entries in the
      .debug_str_offsets section have the same representation as
      DW_FORM_strp values.

The rest of the changes are just mechanically adding DW_FORM_strx2 to
all the places that currently mention DW_FORM_strx.
[I had no idea there would be so many...]

Section 1.4, p.9, next to last bullet
    Add DW_FORM_strx2 to the list.

Section 3.1.1 p.65 item 13 (DW_AT_str_offsets_base)
    '(using DW_FORM_strx)'
 => '(using DW_FORM_strx or DW_FORM_strx2)'

Section 3.1.4 p.69 item 4 (DW_AT_str_offsets_base)
    '(using DW_FORM_strx)'
 => '(using DW_FORM_strx or DW_FORM_strx2)'

Section p.158 item 1 (DW_LNCT_path)
    'the form DW_FORM_strx'
 => 'the forms DW_FORM_strx and DW_FORM_strx2'

Section p.159
    Add DW_FORM_strx2 to the list.

Section 6.3.1 p.166 item 4 (opcode_operands_table)
    Add DW_FORM_strx2 to the list.

Section p.187 bullet 5 (string table)
    'DW_FORM_strp or DW_FORM_strx'
 => 'DW_FORM_strp, DW_FORM_strx, or DW_FORM_strx2'

Ibid bullet 6 (string offsets table)
    'the DW_FORM_strx form.'
 => 'the DW_FORM_strx or DW_FORM_strx2 forms.'

Section 7.3.5 p.190 3rd paragraph
    'the form DW_FORM_strx.'
 => 'the forms DW_FORM_strx or DW_FORM_strx2.'

Section 7.5.6
    Add DW_FORM_strx2 to table 7.6.

Appendix B, figure B.1
    Add DW_FORM_strx2 where we have DW_FORM_strx.

Appendix B, p.273, notes for fig B.1, note (e)
    'form DW_FORM_strx'
 => 'form DW_FORM_strx or DW_FORM_strx2'

Appendix B, p.277, notes for fix B.2, note (e)
    'form DW_FORM_strx'
 => 'form DW_FORM_strx or DW_FORM_strx2'

Appendix B, p.278, notes for fix B.2, note (eo)
    'form DW_FORM_strx'
 => 'form DW_FORM_strx or DW_FORM_strx2'

Appendix F.1, p.390, next to last paragraph
    'use DW_FORM_strx.'
 => 'use DW_FORM_strx or DW_FORM_strx2.'

Appendix F.1, p.391, next to last bullet
    '(via DW_FORM_strp or DW_FORM_strx).'
 => '(via DW_FORM_strp, DW_FORM_strx, or DW_FORM_strx).'

Appendix F.1, p.391, last bullet
    '(if form DW_FORM_strx is used).'
 => '(if forms DW_FORM_strx or DW_FORM_strx2 are used).'

Appendix F.2.3, p.400, second bullet
    'form code DW_FORM_strx'
 => 'form codes DW_FORM_strx or DW_FORM_strx2'

Figure F.6, p.401, split object example
    Could replace DW_FORM_strx with DW_FORM_strx2.

Figure F.7, p.404

Appendix F.3, p.407, 4th paragraph
    'form DW_FORM_strx'
 => 'form DW_FORM_strx or DW_FORM_strx2'


Accepted with modification 1/3/2017.
Add DW_FORM_strx[1234].