Issue 130410.1: Accelerated access

Eric Christopher

Introduction
----------------

For named lookup, the pubnames and pubtypes sections described in
Section 6.1, "Accelerated Access" are insufficient for many needs. The
tables contain only public entries visible outside the current
compilation unit and due to insufficient specification may be
inconsistent. In addition the tables are much larger than they need to
be since they contain redundant copies of strings from the string
table. This has led to efforts to use them as a base for an external
index or to be thrown away altogether for a new index. This proposal
contains such a new index: a hash lookup table in dwarf with a fully
specified list of items that are contained in the index.

Overview
-------------

This proposal introduces an optional set of debugging sections that
contain a lookup table for accelerated access based on name. Important
properties for these tables are:

    Having a format that can be mapped into memory from disk and used as is
    Have an Extensible/Versioned table format so they can be used by
    many producers/consumers
    Contain all of the names needed for typical lookups out of the box
    Have strict rules for the contents of the tables

Because minimizing table size is important the accelerator table
format should allow the reuse of strings from common string tables so
the strings for the names are not duplicated. We also want to make
sure the table is ready to be used as-is by simply mapping the table
into memory with minimal header parsing.

Each table that is defined should have strict rules on exactly what is
in the accelerator tables and documented so clients can rely on the
content. In order to contain the items most likely to be requested by
clients we have three tables: a table of names (.debug_names), a table
of types (.debug_typenames), and a table of namespaces
(.debug_namespaces).

The .debug_names section should contain an entry for each of the following:

   DW_TAG_label, DW_TAG_inlined_subroutine, and
   DW_TAG_subprogram DIEs with address attributes (DW_AT_low_pc,
   DW_AT_high_pc, DW_AT_ranges, or DW_AT_entry_pc).

   DW_TAG_variable DIEs with DW_OP_addr or DW_OP_form_tls_address
   in the location.

For functions, the index should contain separate entries for both the
source name (DW_AT_name) and the linkage name (DW_AT_linkage_name) if
it has one.

DIEs with a DW_AT_declaration attribute are not entered into the index.

     .debug_typenames

     The .debug_typenames section should contain an entry for each of
     the following DIEs:

           DW_TAG_array_type
           DW_TAG_class_type
           DW_TAG_enumeration_type
           DW_TAG_enumerator
           DW_TAG_pointer_type
           DW_TAG_reference_type
           DW_TAG_string_type
           DW_TAG_structure_type
           DW_TAG_subroutine_type
           DW_TAG_typedef
           DW_TAG_union_type
           DW_TAG_ptr_to_member_type
           DW_TAG_set_type
           DW_TAG_subrange_type
           DW_TAG_base_type
           DW_TAG_const_type
           DW_TAG_constant
           DW_TAG_file_type
           DW_TAG_namelist
           DW_TAG_packed_type
           DW_TAG_volatile_type
           DW_TAG_restrict_type
           DW_TAG_interface_type
           DW_TAG_unspecified_type
           DW_TAG_shared_type

     Only types with a DW_AT_name attribute are included. Type DIEs
that have a DW_AT_declaration attribute are not entered into the
index.

     .debug_namespaces

     This section should contain all DW_TAG_namespace DIEs. All
anonymous namespaces should also be included as the name "(anonymous
namespace)".

The name lookups need to be fast and optimized for the kinds of
lookups that debuggers tend to do. Ideally we would like to touch as
few parts of the mapped table as possible when doing a name lookup and
be able to quickly find the name entry we are looking for, or discover
there are no matches. In the case of debuggers we optimized for
lookups that fail most of the time. When debugging across multiple .o
files in particular you can make the assumption that the symbol you're
looking for won't be in the file (.o/.dwo) that you're in and by
adding an additional level of indirection optimize for this case.

To address the requirements mentioned above we have structured the
hash tables a bit differently from a typical hash table. In this
proposal we have: a header, buckets, an array of all unique 32 bit
hash values, followed by an array of hash value data offsets, one for
each hash value, then the data for all hash values, for example:

.-------------.
|  HEADER     |
|-------------|
|  BUCKETS    |
|-------------|
|  HASHES     |
|-------------|
|  OFFSETS    |
|-------------|
|  DATA       |
`-------------'

The BUCKETS in the name tables are an index into the HASHES array. By
making all of the full 32 bit hash values contiguous in memory, we
allow ourselves to efficiently check for a match while touching as
little memory as possible. Most often checking the 32 bit hash values
is as far as the lookup goes. If it does match, it usually is a match
with no collisions. So for a table with "bucket_count" buckets, and
"hashes_count" unique 32 bit hash values, we can clarify the contents
of the BUCKETS, HASHES and OFFSETS as:

.-------------------------.
|  HEADER.magic           | 4-bytes
|  HEADER.version         | 2-bytes
|  HEADER.hash_function   | 2-bytes
|  HEADER.bucket_count    | 4-bytes
|  HEADER.hashes_count    | 4-bytes
|  HEADER.header_data_len | 4-bytes
|  header_data            | Varies
|-------------------------|
|  BUCKETS                | 4-bytes * bucket count (32 bit hash indexes)
|-------------------------|
|  HASHES                 | 4-bytes * hashes_count (32 bit hash values)
|-------------------------|
|  OFFSETS                | 4-bytes * hashes_count (32 bit offsets to
hash value data)
|-------------------------|
|  DATA                   |
`-------------------------'

The HEADER_DATA field describes, using ATOM enumerations, the contents
of the Data section. Each ATOM enumeration contains a pair of type of
data and the form in which it is encoded. This means that each section
can be flexibly described and encoded using the best means for the
data. An example from a current implementation is to use an atom that
describes the DIE offset of the entity we've hashed with an encoding
of DW_FORM_data4. This would give us:

   atom_count = 1;
   atom[0].type = DW_ATOM_die_offset
   atom[0].form = DW_FORM_data4

for a sample set of data that looks like:

.------------.
| 0x00001023 | DW_FORM_strp KeyType (.debug_str[0x0001023] => "main")
| 0x00000004 | uword HashData count
| 0x........ | DW_FORM_data4 HashData[0] DIE offset
| 0x........ | DW_FORM_data4 HashData[1] DIE offset
| 0x........ | DW_FORM_data4 HashData[2] DIE offset
| 0x........ | DW_FORM_data4 HashData[3] DIE offset
| 0xFFFFFFFF | DW_FORM_strp KeyType (end of hash chain)
`------------'

Changes to the DWARF Specification

-------------------------------------------------

Section 6.1 ("Accelerated Access")

Replace: "three different types of tables"
With: "two different types of tables"

Replace Section 6.1.1 in entirety with:
6.1.1 Name Lookup

For lookup by name, a hash table can be constructed in separate object
file sections named .debug_typenames for types, .debug_namespaces for
namespaces, and .debug_names for a variety of entities. Each section
contains the information for their respective entities that are
described in all of the compilation units in a single object file.

Contents of the Index Sections

     .debug_names

     The .debug_names section should contain an entry for each of the following:
     DW_TAG_label, DW_TAG_inlined_subroutine, and DW_TAG_subprogram DIEs 
     with address attributes (DW_AT_low_pc, DW_AT_high_pc, DW_AT_ranges, or
     DW_AT_entry_pc).

     DW_TAG_variable DIEs with DW_OP_addr or DW_OP_form_tls_address in 
     the location.

     For functions, the index should contain separate entries for both
the source name (DW_AT_name) and the linkage name (DW_AT_linkage_name)
if it has one. DIEs with a DW_AT_declaration attribute are not entered
into the index.

     .debug_typenames

     The .debug_typenames section should contain an entry for each of
the following DIEs:

           DW_TAG_array_type
           DW_TAG_class_type
           DW_TAG_enumeration_type
           DW_TAG_enumerator
           DW_TAG_pointer_type
           DW_TAG_reference_type
           DW_TAG_string_type
           DW_TAG_structure_type
           DW_TAG_subroutine_type
           DW_TAG_typedef
           DW_TAG_union_type
           DW_TAG_ptr_to_member_type
           DW_TAG_set_type
           DW_TAG_subrange_type
           DW_TAG_base_type
           DW_TAG_const_type
           DW_TAG_constant
           DW_TAG_file_type
           DW_TAG_namelist
           DW_TAG_packed_type
           DW_TAG_volatile_type
           DW_TAG_restrict_type
           DW_TAG_interface_type
           DW_TAG_unspecified_type
           DW_TAG_shared_type

     Only types with a DW_AT_name attribute are included. Type DIEs
that have a DW_AT_declaration attribute are not entered into the
index.

     .debug_namespaces

     This section should contain all DW_TAG_namespace DIEs. All
anonymous namespaces should also be included as the name "(anonymous
namespace)".

Each table begins with a header containing seven values:

    1. magic (uword)
    The header starts with a 32 bit "magic" value which must be 'HASH'
    encoded as an integer (0x48415348). This allows the detection of the
    start of the hash table and also allows the table's byte order to be
    determined so the table can be correctly extracted.

    2. version (uhalf)
    A version number. This number is specific to the name lookup table
    and is independent of the DWARF version number.

    3. hash_function (uhalf)
    The hash function enumeration that was used.

    The current values for the hash function enumerations include:

        0:DW_hash_function_djb
        This is the Daniel J. Bernstein string hash function.

    4. bucket_count (uword)
    This is the 32-bit unsigned word that represents how many buckets
    are in the table.

    5. hashes_count (uword)
    This is the 32-bit unsigned word that represents how many unique
    hash values and hash data offsets are in the table.

    6. header_data_len (uword)
    This is the number of bytes to skip to get to the hash indexes
    (buckets). Specifically this is the length of the following
    header_data field and does not include the size of the fields
    preceding the header_data_len field.

    7. header_data (variable)
    This section contains implementation specific header data with the
    definition of the contents of each data chunk. The format of this is
    described below.

Fixed Table Format

     The header is followed by the buckets, hashes, offsets, and hash
value data. The buckets are an array of 32-bit indexes into the array
of hashes. The hashes array contains all of the 32-bit hash values for
all names in the hash table. Each hash in the hashes table has a
corresponding offset into the offsets array that points to the data
for the hash value; this offset is relative to the section that the
table is within.

     The header_data field above contains a base offset and a
definition of the contents of each data chunk. To keep things
extensible, we create a list of item pairs, or Atoms, that may be
contained in the data for each name.

     Each Atom then contains a pair of type and form, encoded as:

     1. type (uhalf)
     The enumeration type listed below.

     2. form (uhalf)
     A DW_FORM that defines the exact encoding of the data for the Atom.

     The enumeration values and their meanings are:
           0: DW_ATOM_die_offset
           An offset into the debug_info section for the DIE for this name.

           1: DW_ATOM_cu_offset
           An offset into the debug_info section for CU that contains this DIE.

           2: DW_ATOM_die_tag
           The DW_TAG_* enumeration value of the entity.

           3: DW_ATOM_type_flags
           Flags for types described in the TypeFlags enumeration below.
                 The TypeFlags enumeration values and their meanings are:
                 2:DW_FLAG_type_implementation

           4: DW_ATOM_tu_die_offset
           An offset into the debug_types section for the DIE.

           5: DW_ATOM_tu_signature
           The type signature of a given type.

The complete header_data field then contains:

    1. die_offset_base (uword)
    The base DIE offset that should be added to any atoms that are
    encoded using the DW_FORM_refN or DW_FORM_ref_udata forms.

     2. atom_count (uword)
     The number of Atoms in the header.

     3. atoms (varies)
     The list of Atoms.

Data

       The data starting at the offset found in the offsets array
consists of a GROUP of index entries for all names that produce the
same 32-bit hash value. The GROUP consists of a sequence of SETS,
followed by a 0xffffffff sentinel.

       Each SET contains one or more INDEX ENTRIES for the same name.
The number of index entries in the SET is given by the entry_count
field.

       Each SET contains three fields:

       1. str_offset (uword)
       Offset into the .debug_str section.

       2. entry_count (uword)
       Size of the SET of index entries for this string.

       3. index entries (varies)
       The list of index entries.

       Each INDEX ENTRY describes one reference to the name. It
consists of a sequence of one or more ATOMS. The number of atoms per
index entry is given by the atom_count field in the header.

       Each ATOM provides a value whose meaning is given by the atom
type from the header, and whose representation is given by the form
code from the header.

Lookup
       Looking up an item in the table occurs in this order:

       (1) convert the name to a 32-bit hash value using the indicated
       hash function;

       (2) use the hash value modulo bucket_count to get an index from
       the buckets array;

       (3) use that index to find the first of the series of hash
       values in the hashes array;

       (4) search through the hashes array until you find a match for
       the hash value or you find a hash value that doesn't correspond 
       to the original bucket;

       (5) find the corresponding offset in the offsets array;

       (6) starting at that offset in the data portion, search through
       the sets of data until you find the matching string table offset
       entry, or until you find a 0xffffffff sentinel;

       (7) each set of data consists of the three fields described below;

       (8) once you've found the right set, the data contains an array
       of index entries for that name;

       (9) each index entry is a list of atoms, as described by the header_data.

Appendix D ("Examples")

Add the following section to the end of the appendix:

   D.13 Accelerated Access Example
   [TBD]

Appendix F - DWARF Section Version Numbers (informative)

Add an entry to the table for each of the sections with the number 5
for version.

---
12/17/2013: Deferred
Revised 3/17/2014.  Previous proposal http://dwarfstd.org/issues/130410.1-1..html
2014-07-17: Reopened.
11/19/2014 -- Withdrawn.

Author:	Eric Christopher
Champion:	Eric Christopher
Date submitted:	2013-04-10
Date revised:
Date closed:
Type:	Enhancement
Status:	Withdrawn
DWARF Version:	5