Issue 140323.1: Debug_line file attributes

Author: David Anderson
Champion: David Anderson
Date submitted: 2014-03-23
Date revised:
Date closed:
Type: Enhancement
Status: Rejected
DWARF Version: 5
Summary
-------
This proposes a change to .debug_line that can save 17% of the
.debug_line space in a final executable (a saving of about
1% of the executable size) and allows future modifications
to the include paths and file paths to be binary-compatible
(through the use of abbreviation-style FORM codes).

Savings: the figures assume the compiler uses DW_FORM_strp in the
line table header and the linker commonizes the strings in a new
section, .debug_line_str, when producing an executable.

If one simply sums the sizes of the objects, the extra
relocation records and the lack of any commonization mean
the objects' total size is naturally larger than with the
.debug_line of any existing DWARF version.

The figures were measured by building a version of gcc and using it
as the example object file for the calculations.

I have not implemented the notion proposed here.
Instead I used dwarfdump and some python code to
evaluate the space.

Introduction
------------
Issue 130701.1 added a useful capability to the line table's file path
list in DWARF5: the ability to attach a timestamp and length, or an MD5
hash, to the files listed in the .debug_line section.  The
'file_entry_format' it introduced is a binary incompatibility, and it does
not ensure that any future changes to the file table can be binary
compatible.

The binary incompatibility introduced meant the line table section
(.debug_line) needed a new version stamp for Version 5.

Mark Weiland proposes, in issue 140331.2, a generalization of the file names
table portion of the Line Number Program Header.  Effectively, it allows a
producer to add future extensions for each file in the table while keeping
binary compatibility, rather like this issue (140323.1).
But proposal 140331.2 ignores the include-directories
portion of the header.

Recall that the line table intentionally does not refer to other .debug*
sections, hence it can be usefully left in an object even if the object
is stripped (all other .debug sections removed).  This proposal adds
section  .debug_line_str which, if it exists, must be kept if .debug_line
is kept.

KEY OBSERVATION:
All the include-directory names, file names, and the new MD5 strings are
encoded directly in the table in DW5 as of 1 June 2014.  While this
introduces no string duplication
in the .debug_line of a single .o object, if a tool (such as the linker)
joins the .debug sections together into, say, an executable, there will be
duplication of the include directory strings and of header file names and
MD5 hashes listed in the file table.
END KEY OBSERVATION
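
As a rough illustration of the observation above, the following Python
sketch builds a single commonized string table from several per-object
path lists, the way a linker might fill .debug_line_str.  The path lists
and the function names are made up for illustration; they are not taken
from any real build or tool.

```python
# Hypothetical per-object path lists; real data would come from each .o file.
objects = [
    ["/usr/include", "stdio.h", "stdlib.h", "a.c"],
    ["/usr/include", "stdio.h", "string.h", "b.c"],
    ["/usr/include", "stdio.h", "stdlib.h", "c.c"],
]


def inline_bytes(objs):
    """Bytes used when every string is stored inline, NUL-terminated."""
    return sum(len(s.encode()) + 1 for paths in objs for s in paths)


def build_line_str(objs):
    """Return (table, offsets) with each distinct string stored only once."""
    table = bytearray()
    offsets = {}
    for paths in objs:
        for s in paths:
            if s not in offsets:
                offsets[s] = len(table)      # offset a DW_FORM_strp would hold
                table += s.encode() + b"\0"
    return bytes(table), offsets


table, offsets = build_line_str(objects)
print(inline_bytes(objects), len(table))
```

The sketch ignores the size of the offsets themselves and any suffix
sharing a real linker might do; it only shows that repeated strings
collapse to one copy each.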

This proposal goes further than 140331.2 while keeping the MD5 data that
130701.1 added for producers/consumers that want that MD5 data.
For space-saving commonization at link time, it
uses an additional string table.


Overview
--------

Instead of fixed fields for include and file paths we adopt something like
an abbreviation approach.

We allow all the include and file paths (and strings generally) to be
DW_FORM_strp or DW_FORM_string.  DW_FORM_strp refers to a new section,
.debug_line_str, so string unification for the line table is easy and
stripping all-but-line-table remains easy.

DW_FORM_strx would introduce additional dependencies in section relationships
and more complexity for a strip function, so it is not allowed here.

The DW_LNF_ codes list specifies lo_user and hi_user values so implementations
can invent additional DW_LNF_ codes without breaking binary compatibility.

-------
Difficulties:
Compilers sometimes delegate line table production to the
assembler. Changes such as those this issue (140323.1) proposes
will impact such compilers and assemblers.

Cary Coutant mentioned (22 April 2014) that the space taken
by the line table is not an important part of the space an
executable takes; other .debug sections take much more space.
That makes it unclear whether the effort achieves enough to be
worthwhile.

The issue of section interconnects is important.  Hence the
proposal refers to a new .debug_line_str section, keeping the
interconnects simple and making unification (by, for example, a
linker creating an executable) easy.

The size and time involved in relocations is significant.
Using DW_FORM_strp means every include directory or file
requires a relocation when building an executable.
Each also requires an offset-size value (typically 4 bytes)
to relocate, plus the size of the string in
.debug_line_str.
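
The trade-off just described can be put into a small cost model.  This is
only a sketch: the function names are hypothetical, the 4-byte offset is
the typical value mentioned above, and the size of the relocation records
themselves is deliberately left out.

```python
def inline_cost(length, uses):
    """Bytes when a string of `length` characters is stored inline,
    NUL-terminated, at each of its `uses` occurrences."""
    return uses * (length + 1)


def strp_cost(length, uses, offset_size=4):
    """Bytes when each occurrence is an offset and the string itself is
    stored once in a shared string section."""
    return uses * offset_size + (length + 1)


# A 12-character path used once: the offset makes strp the larger choice.
print(inline_cost(12, 1), strp_cost(12, 1))      # 13 17
# The same path shared by 85 line table headers after commonization.
print(inline_cost(12, 85), strp_cost(12, 85))    # 1105 353
```

Within a single object the offset form costs more, which matches the
remark that the unlinked objects grow; the saving appears only once the
strings are commonized across many headers.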



-------
Quick overview of the changes to the line table header:
Referring to DWARF5, section 6.2.4.


10. standard_opcode_lengths (unchanged)
11. Include directory forms.
a number N (ubyte) and
a list of N entries, each entry being
   a type designator (uleb) and a FORM code (uleb).
   type designators are
   DW_LNF_path

12. Include directories count
   uleb number of directories, #I
13. Include directories
   #I sets of data in the order presented in 11.

14. File path forms
a number N (ubyte) and
a list of N entries, each entry being
   a type designator and a FORM code.
   type designators are
   DW_LNF_path
   DW_LNF_include_index
   DW_LNF_timestamp
   DW_LNF_size
   DW_LNF_MD5

15. File_names count
   uleb number of file names entries, #F
16. File_names
   #F sets of data in the order presented in 14.
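
To make the layout of items 11-16 concrete, here is a minimal Python
sketch of an encoder for one "forms list plus entries" group.  The
DW_LNF_ values are the ones proposed in Table 7.24 of this issue and the
DW_FORM_ values are the standard ones; the function names, the choice of
DW_FORM_udata for the include index, and little-endian 4-byte offsets are
assumptions made only for this example.

```python
DW_LNF_path = 0x1
DW_LNF_include_index = 0x2
DW_FORM_string = 0x08   # inline NUL-terminated string
DW_FORM_strp = 0x0e     # offset into a string section
DW_FORM_udata = 0x0f    # unsigned LEB128 constant


def uleb128(value):
    """Encode a non-negative integer as unsigned LEB128."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)


def encode_table(forms, rows, offset_size=4):
    """forms: list of (DW_LNF code, DW_FORM code); rows: list of tuples."""
    out = bytearray([len(forms)])             # count of forms (ubyte)
    for lnf, form in forms:
        out += uleb128(lnf) + uleb128(form)   # type designator, FORM code
    out += uleb128(len(rows))                 # count of entries (uleb)
    for row in rows:
        for (_, form), value in zip(forms, row):
            if form == DW_FORM_string:
                out += value.encode() + b"\0"
            elif form == DW_FORM_strp:
                out += value.to_bytes(offset_size, "little")
            elif form == DW_FORM_udata:
                out += uleb128(value)
            else:
                raise ValueError(f"form {form:#x} not handled in this sketch")
    return bytes(out)


dirs = encode_table([(DW_LNF_path, DW_FORM_string)], [("/usr/include",)])
files = encode_table(
    [(DW_LNF_path, DW_FORM_string), (DW_LNF_include_index, DW_FORM_udata)],
    [("a.c", 0), ("stdio.h", 1)],
)
```

Here dirs corresponds to items 11-13 and files to items 14-16 for a
compilation with one include directory and two files.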


One might wonder if all the 'count' entries above could be eliminated by
specifying that the lists are terminated by a single zero byte. That does
not seem to work, as a relocated reference to .debug_line_str will likely
have a leading zero byte.
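
A short Python check shows the ambiguity.  The offset value 0x100 is a
made-up example of a relocated .debug_line_str reference, and the 4-byte
size is an assumption.

```python
offset = 0x100  # hypothetical relocated offset into .debug_line_str

little = offset.to_bytes(4, "little")   # 00 01 00 00
big = offset.to_bytes(4, "big")         # 00 00 01 00

# In either byte order the first byte of the entry is zero, so a reader
# that treats a zero byte as "end of list" would stop before this entry.
print(little.hex(), big.hex())

# The very first string in the section has offset 0, which is all zeros.
print((0).to_bytes(4, "little").hex())
```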

Space on disk
-------------
Here we compare the space using DW5 with DW_LNF_timestamp_size vs the
space with this proposal.

Taking the 4.8.2 gcc build and its xgcc executable as a hopefully
meaningful base for computation, using DW_FORM_strp and commonizing strings
would reduce the size of .debug_line by 25069 bytes (21% of .debug_line) and
the size of the executable by 1%.

Using DW_FORM_string, that is, ignoring the possible savings and just placing
the strings in .debug_line while omitting fields unused in DWARF2 .debug_line,
there would be an increase in .debug_line of 240 bytes, about 5 bytes per
compilation unit in the 85 CUs in xgcc (about 0.1% of .debug_line size).


Some further details about space on disk:
-----------------------------------------

The 'count' fields for directories and files take no more space than the
DW4 terminator NUL byte.

If DW_FORM_strp is used for the paths, and the linker is actually building
an output object with .debug_line, there is an opportunity to avoid a great
deal of duplication of path strings by unifying the strings in the output
.debug_line_str section.

xgcc build 4.8.6, standard compile i686 host and target
CU count       85
size           2528434
text           341292 13.50%
rodata         238968  9.45%
debug_info     880907 34.84%
eh_frame+_hdr   60816  2.41%
debug_abbrev    72755  2.88%
debug_line     120308  4.76%
debug_str      164531  6.51%
debug_loc      439348 17.38%
debug_ranges    63080  2.49%

gcc typically leaves the timestamp and file length 0 in the files list in every
CU's line table.

The line table header bytes (per the DW5 spec as of today) related to include
files and file names, including the zero bytes for date/time, total 30940.

The bytes for the same data (with a .debug_line_str section commonized for
these strings), including space for the offsets into the string table,
total 10448.

If you then also add in the extra space needed to set this up (per line table
header), the savings decrease to 20067 bytes.  That is 17% of the
DW5 .debug_line size.
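
The arithmetic behind these figures can be replayed in a few lines of
Python.  All inputs are the numbers quoted in this section; the per-header
overhead is not stated in the text and is simply derived here as the
difference between the gross and the net saving.

```python
header_bytes_dw5 = 30940    # path/file bytes in the DW5 headers
header_bytes_strp = 10448   # same data with commonized strings plus offsets
net_saving = 20067          # saving after per-header setup space
debug_line_size = 120308    # .debug_line size of xgcc
cu_count = 85

gross_saving = header_bytes_dw5 - header_bytes_strp
setup_overhead = gross_saving - net_saving        # derived, not measured
percent = 100 * net_saving / debug_line_size

print(gross_saving)                  # 20492
print(setup_overhead)                # 425
print(setup_overhead / cu_count)     # 5.0
print(round(percent, 1))             # 16.7
```

So the quoted 17% is the net saving rounded up, and the derived overhead
works out to 5 bytes per line table header.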


Changes to the DWARF 5 specification
------------------------------------

Section 6.2.4 The Line Number Program Header


Replace numbers 11-13 in the header with:

11. The count of the following include directory entry codes (ubyte).
12. A list of N directory_entry_codes,
   each entry code being
   a type designator (uleb) and a FORM code (uleb).
   The defined type designator is DW_LNF_path
   The FORM code may be DW_FORM_string or DW_FORM_strp
   (DW_FORM_strp in the line table header
   refers to .debug_line_str, not .debug_str).

13. The count of the number of directories (uleb).
14. include_directories (array of entries)
    An array of sets of include-directory data (the number in 13)
    with the form and order presented in directory_entry_codes (in 12).
    Entries in this sequence describe each path that was searched for included
    source files in this compilation.  (The paths include those directories specified
    explicitly by the user for the compiler to search and those the compiler
    searches without explicit direction.) Each path entry
    is either a full path name or is relative to the current directory of the
    compilation.

    The line number program assigns numbers to each of the directory
    entries in order, beginning with 1. The current directory of the
    compilation is understood to
    be the zeroth entry and is not explicitly represented.


15. The count of file entry codes (ubyte).
16. file_entry_codes  (array of entry codes)
   A list of file entry codes, each file
    entry code being a type designator and a FORM code.
   Type designators are
   DW_LNF_path
   DW_LNF_include_index (an index into the include_directories section)
   DW_LNF_timestamp
   DW_LNF_size
   DW_LNF_MD5
   The FORM code used for the DW_LNF_path should be
   DW_FORM_string or DW_FORM_strp
   (DW_FORM_strp referring to .debug_line_str).
   FORM codes for the others, DW_LNF_include_index,
   DW_LNF_timestamp, and DW_LNF_size,
   should be of a "constant" form.

17. The count of the number of file names entries (uleb).


18. The file names list, the number of sets from 17.
    Each set is in the order presented in 16.

    Entries in this sequence describe source files that contribute to the
    line number information for this compilation unit or are used in other
    contexts, such as in a declaration coordinate or a macro file inclusion.

   . Each DW_LNF_path file name is a null-terminated string
     containing the full or relative path name of a source file.
     If the entry contains a file name or relative path name, the file
     is located relative to either the compilation directory
     (as specified by the DW_AT_comp_dir attribute given
     in the compilation unit) or one of the directories listed in the
     "include_directories" section.

   . Each DW_LNF_include_index is a constant representing the
     directory index of a directory in the "include_directories" section.

   . Each DW_LNF_timestamp is a constant representing the
     (implementation-defined) time of last modification
     for the file, or 0 if not available.

   . Each DW_LNF_size is a constant representing the length in bytes
     of the file, or 0 if not available.

   . Each DW_LNF_MD5 is a 16-byte MD5 digest of the file contents.
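
A consumer for items 15-18 might look like the following Python sketch.
It is not an implementation of any existing tool: the function names are
hypothetical, offsets are assumed to be 4 bytes and little-endian, and
DW_FORM_data16 is assumed as the form carrying the 16-byte MD5 digest
(this proposal does not name a form for DW_LNF_MD5).

```python
DW_FORM_string = 0x08
DW_FORM_strp = 0x0e
DW_FORM_udata = 0x0f
DW_FORM_data16 = 0x1e   # assumption: form used for the MD5 digest


def read_uleb128(data, pos):
    """Decode an unsigned LEB128 value; return (value, next position)."""
    result = shift = 0
    while True:
        byte = data[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def read_table(data, pos=0, offset_size=4):
    """Decode entry codes, a count, and that many entries."""
    n_codes = data[pos]                     # count of entry codes (ubyte)
    pos += 1
    codes = []
    for _ in range(n_codes):
        lnf, pos = read_uleb128(data, pos)
        form, pos = read_uleb128(data, pos)
        codes.append((lnf, form))
    count, pos = read_uleb128(data, pos)    # count of entries (uleb)
    rows = []
    for _ in range(count):
        row = {}
        for lnf, form in codes:             # the FORM alone fixes the size
            if form == DW_FORM_string:
                end = data.index(b"\0", pos)
                row[lnf] = data[pos:end].decode()
                pos = end + 1
            elif form == DW_FORM_strp:
                row[lnf] = int.from_bytes(data[pos:pos + offset_size], "little")
                pos += offset_size
            elif form == DW_FORM_udata:
                row[lnf], pos = read_uleb128(data, pos)
            elif form == DW_FORM_data16:
                row[lnf] = data[pos:pos + 16]
                pos += 16
            else:
                raise ValueError(f"form {form:#x} not handled in this sketch")
        rows.append(row)
    return rows, pos


data = b"\x02\x01\x08\x02\x0f\x02a.c\x00\x00stdio.h\x00\x01"
rows, end = read_table(data)
```

Because each value is decoded from its FORM code alone, the same loop can
step over a DW_LNF_ code it does not recognize (for example one in the
lo_user..hi_user range), which is what makes later extensions
binary-compatible.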


The three near-end paragraphs in the current document for section 6.2.4,
starting with "The directory index represents" are kept as
in the present document.

The final paragraph (italic) in 6.2.4 starting with
"A compiler may generate a single null byte for the file names
field" seems to be out of place in the current DW5 document
and possibly belongs somewhere else?


Table 7.24:  Line Number Table file entry format encodings

File entry format name      Value
DW_LNF_path                 0x1
DW_LNF_include_index        0x2
DW_LNF_timestamp            0x3
DW_LNF_size                 0x4
DW_LNF_MD5                  0x5
DW_LNF_lo_user              0x2000
DW_LNF_hi_user              0x3fff

---

5/12/2014 -- Replacement proposal.
6/02/2014 -- Replacement proposal. 
7/15/2014 -- Accepted
8/19/2014 -- Reconsidered and rejected.  Replaced with 140724.1.