DWARF Standard


011228.1 David Anderson UTF-8 Clarification Clarification Accepted with modifications Dave Anderson


Dave Anderson posted several questions on 28-Dec-2001 about the
UTF-8 attribute and how it interacts with sections other than
debug_info (in which it is contained). Brender responded with
some tentative answers and further suggested that at minimum
additional italics commentary seemed warranted in several places.


Dave Anderson wrote:
>I wonder if there is a problem with the DW_AT_UTF8 attribute.
>An introduction to UTF8 is on
>in Chapter 2, General Structure, on page 12 of 26 in that pdf chapter.
>I am afflicted with ignorance on this in spite of reading
>the above info...
>One point: for all values in the range [0..127] unicode and
>ASCII are identical, which may make a difference...

By design, of course...

>Perhaps there is no problem, but I wonder.
>UTF8 contains solely 8bit characters, so ordinary
>string processing won't go haywire reading them, but
>on the other hand, I worry that in a couple cases below
>it might be *necessary* to know that
>a string is UTF8 but with no appropriate way to tell.
>I would be happy to learn my concerns are unfounded, so
>feel free to clear this up for me...
>3.1  DW_AT_UTF8 says that "all strings use UTF8" in that CU.
>This is clearly fine for
>	.debug_info
>	.debug_strp
> .debug_line
>	This has file names etc.
>	This is perhaps sensibly looked at as an 'extension' of
>	.debug_info.  Are these utf8? I guess so.
>        It's not read independent of .debug_info so this seems ok.

debug_line does have its own version number field. So, if it version 3
then I think it reasonable to expect that it depends on the use_UTF8
attribute of the compilation unit.

An italics comment to this effect seems appropriate someplace.

> .debug_macinfo
>	has strings.
>	This is perhaps sensibly looked at as an 'extension' of
>	.debug_info.  Are these utf8? I guess so.
>       It's not read independent of .debug_info so this seems ok.

debug_macinfo does not have it's own version number, so the argument
is not quite so direct as for .debug_line. Still, I would expect that
debug_macinfo is equally dependent on the unit use_UTF8 attribute.

An italics comment to this effect seems appropriate someplace.

    Aside: Of course, C (C99) has its own quite different source scheme
    for representing multibyte characters. But, since that scheme is
    defined in terms of the "C character set", which is a subset of
    7-bit ASCII (and its international equivalents), I think there is
    nothing that prevents its use in combination with UTF-8 (although
    one might hope that no producer would do so!).

>However, the following sections also have strings and
>things are  slightly less clear to me.
> .debug_frame can have an augmentation string
>	Is this utf8?

Hmmm, very interesting...

>	What if .debug_frame is used for exceptions, so exists
>	in stripped executables? .debug_frame may be
>	gone!  So then what?

Clearly the behavior of exception handling better not depend on whether
the executable is stripped or not!

If .debug_frame is gone, then the question is moot, yes?

>	Document/dwarf3 mistake or
>	quality-of-implementation issue or ???
>	.debug_frame is 'independent' of .debug_info.... No direct
>	connection to a CU in .debug_info.

Me thinks there is a real DWARF3 issue. Not sure how best to resolve it.

> .debug_pubnames has name strings
> .debug_pubtypes has name strings.
>	Till one actually has *found* the relevant
>	compilation unit then these strings cannot
>	be usefully printed?
>	Will searches work if some CUs are utf8 but not all?
>	Is this a quality of implementation issue or a
>	dwarf3 issue?
>	These *are* searched independent of .debug_info, though
>	the do reference .debug_info.
>	Document/dwarf3 mistake or quality-of-implementation issue or ???

Well, searches can work given a mixture of UTF8 and non-UTF-8; to do so
requires that the consumer look into the corresponding compilation unit
header before doing the search. Not hard, but definitely a new wrinkle.

Worth at least an italics explanation.


The email Ron responds to has a typo.
I meant to say that since .debug_frame can be present
yet .debug_info absent, what is supposed to happen?
Which Ron has noted as a key issue.

So I'm floating the following proposal for consideration.
(the existing ISSUE has no proposal)

PROPOSAL (ref draft 8/9, dwarf3):

x means the text is to orient the proposal reader and
x document editor, the
x literal proposed text below has no leading >.

xThe proposal is that the .debug_frame augmentation always
xbe ISO-IEC 10646-1:1993, never UFT8.  
xGiven the string is a compiler flag, 
xthere is no need to handle user-level names.


x4. Augmentation (string)

xAdd the following to bullet 4:

Augmentation string must use ISO/IEC 10646-1:1993, not UTF8.
The string is (thus) independent of DW_AT_use_UTF8.

x7.5.4  Attribute Encodings
xAt the end of the 'string' attribute discussion, add the following:

Because .debug_frame is useful independently of .debug_info the
augmentation string of .debug_frame must be ISO/IEC 10646-1:1993, not UTF8
(the augmentation format is thus independent of DW_AT_use_UTF8).

5/17/2005:  Accepted with modification:
  In the proposal, where it says ISO-10646, it should read
  ISO-646.  The latter is LATIN-1, the former is UCS-2/4.

  The Augmentation string will be UTF-8 encoded in 6.4.1. 
  No changed text in 7.5.4.

All logos and trademarks in this site are property of their respective owner.
The comments are property of their posters, all the rest © 2007-2017 by DWARF Standards Committee.