Issue 130313.1: Indirect string table (Split DWARF, part 1/5)

Author: Cary Coutant
Champion: Cary Coutant
Date submitted: 2013-03-13
Date revised:
Date closed:
Type: Enhancement
Status: Accepted with modifications
DWARF Version: 5
Section , pg

Introduction
------------

With the current DWARF format, the debug information is designed
with the expectation that it will be processed by the linker to
produce an output binary with complete debug information, and
with fully-resolved references to locations within the
application. For very large applications, however, this approach
can result in excessively long link times and excessively large
output files. Several vendors have independently developed
proprietary approaches that allow the debug information to remain
in the relocatable object files, so that the linker does not have
to process the debug information or copy it to the output file.
These approaches have all required that additional information be
made available to the debug information consumer, and that the
consumer perform some minimal amount of relocation in order to
interpret the debug info correctly. The additional information
required, in the form of load maps or symbol tables, and the
details of the relocation are not covered by the DWARF
specification, and vary with each vendor's implementation.

For more background information, see the GCC wiki page at:

    http://gcc.gnu.org/wiki/DebugFission

This is the first in a series of proposals to extend the DWARF
specification to support the use of unlinked and unrelocated
debug information.

  * This first proposal introduces a mechanism for referring to
    strings indirectly: it collects the string offsets into a new
    section and adds a new form code for this purpose.

  * The second proposal introduces a similar indirect mechanism
    for referring to relocatable addresses in the loadable
    segments of the program, collecting all such addresses in a
    single section.

  * The third proposal introduces a mechanism for referring to
    range table entries in the .debug_ranges section with a
    single relocation per compilation unit rather than one for
    each reference.

  * The fourth proposal introduces a mechanism for splitting the
    DWARF debugging information into two sets of sections. One
    set remains in the relocatable object (.o) files, and is
    linked into the final executable; the other set is written to
    a non-relocatable DWARF object (.dwo) file, and remains in
    the build tree as an independent file where the debugger can
    reference it as needed.

  * The fifth proposal introduces a package file format for
    collecting the DWARF object (.dwo) files so that the
    debugging information can be easily saved and transported for
    debugging applications outside of the build tree.


Overview
--------

This proposal adds a new form code for referencing strings via a
table of string offsets. By removing the direct string offsets
for strings placed in the .debug_str section, we can eliminate
all the relocations necessary for each occurrence of DW_FORM_strp
in a .debug_info or .debug_types section. We consolidate these
offsets into a new .debug_str_offsets section, whose contents can
be interpreted as relative offsets into the .debug_str section,
thus needing no relocation. The original DW_FORM_strp values are
replaced by an index using the new form code DW_FORM_strx. The
index is an unsigned LEB128 value, which will be, on average,
much smaller than a relocated 4- or 8-byte string offset. If a
linker chooses to combine multiple .debug_str sections and
coalesce duplicate strings, it can relocate the offsets in
.debug_str_offsets without the aid of individual relocations.
Furthermore, multiple references to the same string can now use a
small index value to refer to the same offset entry in
.debug_str_offsets, for a net savings.
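
The link-time coalescing described above can be sketched as follows.
This is a minimal illustration rather than a description of any real
linker: the function name and the in-memory representation (one
.debug_str byte string plus a list of offsets per input object) are
assumptions made for the example, and strings referenced only through
DW_FORM_strp are ignored.

```python
def merge_string_tables(inputs):
    """Merge several (.debug_str, offsets) pairs, coalescing duplicates.

    inputs: list of (debug_str_bytes, [offset, ...]) tuples, one per
            input object (a hypothetical in-memory representation).
    Returns (merged_debug_str, [rewritten_offsets_per_input, ...]).
    """
    merged = bytearray()
    seen = {}        # string contents -> offset in the merged section
    rewritten = []
    for debug_str, offsets in inputs:
        new_offsets = []
        for off in offsets:
            # Each string runs up to, but not including, its null byte.
            end = debug_str.index(b"\x00", off)
            s = bytes(debug_str[off:end])
            if s not in seen:
                seen[s] = len(merged)
                merged += s + b"\x00"
            # Only the table entry changes; the DW_FORM_strx index in
            # .debug_info stays the same, so no relocation is needed.
            new_offsets.append(seen[s])
        rewritten.append(new_offsets)
    return bytes(merged), rewritten
```

Because only the table entries are rewritten, the indexes stored in
.debug_info or .debug_types are left untouched.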

For strings that are referenced only once in a given compilation
unit, the compiler may choose to emit the string directly using
DW_FORM_string, or it may choose to use DW_FORM_strx with the
expectation that it will be coalesced at link time with other
references to that string.
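
The size trade-off between the three string forms can be estimated
with some rough arithmetic. The sketch below is illustrative only: the
function names are hypothetical, the 24-byte relocation size is an
assumption (the size of an ELF64 Rela entry; other object formats
differ), and the per-unit table header is ignored.

```python
def uleb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128."""
    if value < 0:
        raise ValueError("unsigned LEB128 cannot encode negative values")
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)   # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def string_cost(s: bytes, num_refs: int) -> int:
    """DW_FORM_string: the string and its null byte at every reference."""
    return num_refs * (len(s) + 1)


def strp_cost(s: bytes, num_refs: int, offset_size: int = 4,
              reloc_size: int = 24) -> int:
    """DW_FORM_strp: an offset plus one relocation per reference,
    and one copy of the string in .debug_str.
    reloc_size is an assumption (ELF64 Rela entry size)."""
    return num_refs * (offset_size + reloc_size) + len(s) + 1


def strx_cost(s: bytes, num_refs: int, index: int,
              offset_size: int = 4) -> int:
    """DW_FORM_strx: a ULEB128 index per reference, one table entry
    in .debug_str_offsets, and one copy of the string in .debug_str."""
    return (num_refs * len(uleb128_encode(index))
            + offset_size + len(s) + 1)
```

Under these assumptions, a 4-character string referenced once costs 5
bytes with DW_FORM_string and 10 bytes with DW_FORM_strx at index 0;
with ten references the figures become 50 and 19 bytes respectively.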

For debug information that will be linked into the final
executable (i.e., when using this mechanism independently of the
remaining parts of this proposal), each compilation unit must
provide a pointer to the base of its contribution to the
.debug_str_offsets section, using a new attribute,
DW_AT_str_offsets_base. The value of this attribute will be a
relocated pointer to the base of the .debug_str_offsets
contribution for that compilation unit, using the form code
DW_FORM_sec_offset.


Changes to the DWARF Specification
----------------------------------

Section 2.2 ("Attribute Types")

In Figure 2, add the following rows:

    Attribute name            Identifies or Specifies
    --------------            -----------------------
    DW_AT_str_offsets_base    Base offset for string offsets table

Section 3.1.1 ("Normal and Partial Compilation Unit Entries")

In the numbered list introduced by "Compilation unit entries may
have the following attributes:", add the following item:

    12. A DW_AT_str_offsets_base attribute, whose value is a
        reference. This attribute points to the first string
        offset of the compilation unit's contribution to the
        .debug_str_offsets section. Indirect string references
        (using DW_FORM_strx) within the compilation unit must be
        interpreted as indexes relative to this base.

Section 3.1.3 ("Separate Type Unit Entries")

After the paragraph beginning "A type unit entry may have a
DW_AT_language attribute," add the following paragraph:

    A type unit entry may have a DW_AT_str_offsets_base
    attribute, whose value is a reference. This attribute points
    to the first string offset of the type unit's contribution to
    the .debug_str_offsets section. Indirect string references
    (using DW_FORM_strx) within the type unit must be interpreted
    as indexes relative to this base.

Section 7.5.4 ("Attribute Encodings")

Replace the first paragraph describing the "string" class with
the following:

    A string is a sequence of contiguous non-null bytes followed
    by one null byte. A string may be represented immediately in
    the debugging information entry itself (DW_FORM_string), as
    an offset into a string table contained in the .debug_str
    section of the object file (DW_FORM_strp), or as an indirect
    offset into the string table using an index into a table of
    offsets contained in the .debug_str_offsets section of the
    object file (DW_FORM_strx). In the 32-bit DWARF format, the
    representation of a DW_FORM_strp value is a 4-byte unsigned
    offset; in the 64-bit DWARF format, it is an 8-byte unsigned
    offset (see Section 7.4). The representation of a
    DW_FORM_strx value is an unsigned LEB128 value, which is
    interpreted as a zero-based index into an array of offsets in
    the .debug_str_offsets section. The offset entries in the
    .debug_str_offsets section have the same representation as
    DW_FORM_strp values.
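
The lookup described in the replacement paragraph can be sketched as
follows. This is a minimal illustration, not a reference
implementation: the function names are hypothetical, little-endian
byte order is assumed, and str_offsets_base stands for the value of
DW_AT_str_offsets_base expressed as a byte offset into the
.debug_str_offsets section contents.

```python
import struct


def uleb128_decode(data: bytes, pos: int = 0):
    """Decode an unsigned LEB128 value; return (value, next_position)."""
    result = 0
    shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not (byte & 0x80):     # high bit clear: last byte
            return result, pos
        shift += 7


def read_strx(index: int, str_offsets: bytes, str_offsets_base: int,
              debug_str: bytes, offset_size: int = 4) -> bytes:
    """Resolve a DW_FORM_strx index to the bytes of the string.

    offset_size is 4 for 32-bit DWARF and 8 for 64-bit DWARF.
    Little-endian byte order is an assumption of this sketch.
    """
    fmt = "<I" if offset_size == 4 else "<Q"
    entry_pos = str_offsets_base + index * offset_size
    (str_off,) = struct.unpack_from(fmt, str_offsets, entry_pos)
    # The string is the run of non-null bytes starting at str_off.
    end = debug_str.index(b"\x00", str_off)
    return debug_str[str_off:end]
```

For example, with a .debug_str containing "int", "main", and "argc"
(each null-terminated) and table entries 0, 4, and 9, index 1
resolves to "main".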

In Figure 20, add the following row to the table:

    Attribute name            Value    Classes
    --------------            -----    -------
    DW_AT_str_offsets_base    [TBA]    reference

In Figure 21, add the following row to the table:

    Form name       Value    Class
    ---------       -----    -----
    DW_FORM_strx    [TBA]    string

In Chapter 7, add a new section after Section 7.24:

7.xx String Offsets Table

    Each set of entries in the string offsets table contained in
    the .debug_str_offsets section begins with a header
    containing:

        1.  unit_length (initial length)
            A 4-byte or 12-byte length containing the length of
            the set of entries for this compilation unit, not
            including the length field itself. In the 32-bit
            DWARF format, this is a 4-byte unsigned integer
            (which must be less than 0xfffffff0); in the 64-bit
            DWARF format, this consists of the 4-byte value
            0xffffffff followed by an 8-byte unsigned integer
            that gives the actual length (see Section 7.4).

        2.  version (uhalf)
            A 2-byte version identifier containing the value 5
            (see Appendix F).

        3.  padding (uhalf)

    This header is followed by a series of string table offsets.
    For the 32-bit DWARF format, each offset is 4 bytes long; for
    the 64-bit DWARF format, each offset is 8 bytes long.

    The DW_AT_str_offsets_base attribute must point to the first
    entry following the header. The entries are indexed
    sequentially from this base entry, starting from 0.
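
A reader for one contribution to the section, following the header
layout above, might look like the sketch below. It is an illustration
under stated assumptions: the function name and the returned
dictionary are hypothetical, the byte order defaults to little-endian,
and the version field is returned without being checked.

```python
import struct


def parse_str_offsets_contribution(section: bytes, pos: int = 0,
                                   byteorder: str = "<") -> dict:
    """Parse one header and its offset entries from .debug_str_offsets.

    byteorder is a struct prefix ("<" or ">"); little-endian is only
    a default chosen for this sketch.
    """
    (length,) = struct.unpack_from(byteorder + "I", section, pos)
    pos += 4
    if length == 0xFFFFFFFF:
        # 64-bit DWARF: the real length follows as an 8-byte integer.
        (length,) = struct.unpack_from(byteorder + "Q", section, pos)
        pos += 8
        offset_size = 8
    elif length >= 0xFFFFFFF0:
        raise ValueError("reserved initial length value")
    else:
        offset_size = 4
    end = pos + length            # length excludes the length field
    version, _padding = struct.unpack_from(byteorder + "HH", section, pos)
    base = pos + 4                # first entry; DW_AT_str_offsets_base
    count = (end - base) // offset_size
    fmt = byteorder + str(count) + ("I" if offset_size == 4 else "Q")
    offsets = list(struct.unpack_from(fmt, section, base))
    return {
        "version": version,
        "offset_size": offset_size,
        "base": base,
        "offsets": offsets,       # indexed from 0, as DW_FORM_strx expects
        "next": end,              # start of the following contribution
    }
```

The "base" value returned here is the section offset that
DW_AT_str_offsets_base is expected to hold for the corresponding unit.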

Section 7.25 ("Dependencies and Constraints")

Replace the first sentence with the following:

    The debugging information in this format is intended to exist
    in the .debug_abbrev, .debug_aranges, .debug_frame,
    .debug_info, .debug_line, .debug_loc, .debug_macinfo,
    .debug_pubnames, .debug_pubtypes, .debug_ranges, .debug_str,
    .debug_str_offsets, and .debug_types sections of an object
    file, or equivalent separate file or database.

Appendix A ("Attributes by Tag Value")

In Figure 42, in the table entries for DW_TAG_compile_unit,
DW_TAG_partial_unit, and DW_TAG_type_unit, add
DW_AT_str_offsets_base.

Appendix B ("Debug Section Relationships")

In Figure 43, below the box for DW_FORM_strp, add a box for
DW_FORM_strx, pointing to a new section .debug_str_offsets, which
in turn points to .debug_str. The note for the new box should
read as follows:

    .debug_str_offsets
        The value of the DW_AT_str_offsets_base attribute in the
        DW_TAG_compile_unit, DW_TAG_partial_unit, or
        DW_TAG_type_unit DIE is the offset in the
        .debug_str_offsets section of the string offsets array
        for that compilation unit or type unit.

Appendix F ("DWARF Section Version Numbers")

In Figure 97, add the following row:

    Section Name          V2    V3    V4    V5
    ------------          --    --    --    --
    .debug_str_offsets    x     x     x     5


---

Revised March 25, 2013
Revised May 26, 2013:  Change version number to 5.
April 23, 2013: Accepted with modifications.