Issue 011228.1: UTF-8 Clarification
BACKGROUND
----------
Dave Anderson posted several questions on 28-Dec-2001 about the
UTF-8 attribute and how it interacts with sections other than
debug_info (in which it is contained). Brender responded with
some tentative answers and further suggested that at minimum
additional italics commentary seemed warranted in several places.
THE DIALOG SO FAR
-----------------
Dave Anderson wrote:
>I wonder if there is a problem with the DW_AT_UTF8 attribute.
>An introduction to UTF8 is on
>http://www.unicode.org/unicode/uni2book/u2.html
>in Chapter 2, General Structure, on page 12 of 26 in that pdf chapter.
>
>I am afflicted with ignorance on this in spite of reading
>the above info...
>
>One point: for all values in the range [0..127] unicode and
>ASCII are identical, which may make a difference...
By design, of course...
>Perhaps there is no problem, but I wonder.
>
>UTF8 contains solely 8bit characters, so ordinary
>string processing won't go haywire reading them, but
>on the other hand, I worry that in a couple cases below
>it might be *necessary* to know that
>a string is UTF8 but with no appropriate way to tell.
>
>I would be happy to learn my concerns are unfounded, so
>feel free to clear this up for me...
>
>3.1 DW_AT_UTF8 says that "all strings use UTF8" in that CU.
>
>This is clearly fine for
> .debug_info
> .debug_strp
>
> .debug_line
> This has file names etc.
> This is perhaps sensibly looked at as an 'extension' of
> .debug_info. Are these utf8? I guess so.
> It's not read independent of .debug_info so this seems ok.
debug_line does have its own version number field. So, if it version 3
then I think it reasonable to expect that it depends on the use_UTF8
attribute of the compilation unit.
An italics comment to this effect seems appropriate someplace.
> .debug_macinfo
> has strings.
> This is perhaps sensibly looked at as an 'extension' of
> .debug_info. Are these utf8? I guess so.
> It's not read independent of .debug_info so this seems ok.
debug_macinfo does not have it's own version number, so the argument
is not quite so direct as for .debug_line. Still, I would expect that
debug_macinfo is equally dependent on the unit use_UTF8 attribute.
An italics comment to this effect seems appropriate someplace.
Aside: Of course, C (C99) has its own quite different source scheme
for representing multibyte characters. But, since that scheme is
defined in terms of the "C character set", which is a subset of
7-bit ASCII (and its international equivalents), I think there is
nothing that prevents its use in combination with UTF-8 (although
one might hope that no producer would do so!).
>However, the following sections also have strings and
>things are slightly less clear to me.
>
> .debug_frame can have an augmentation string
> Is this utf8?
Hmmm, very interesting...
> What if .debug_frame is used for exceptions, so exists
> in stripped executables? .debug_frame may be
> gone! So then what?
Clearly the behavior of exception handling better not depend on whether
the executable is stripped or not!
If .debug_frame is gone, then the question is moot, yes?
> Document/dwarf3 mistake or
> quality-of-implementation issue or ???
> .debug_frame is 'independent' of .debug_info.... No direct
> connection to a CU in .debug_info.
Me thinks there is a real DWARF3 issue. Not sure how best to resolve it.
> .debug_pubnames has name strings
> .debug_pubtypes has name strings.
> Till one actually has *found* the relevant
> compilation unit then these strings cannot
> be usefully printed?
> Will searches work if some CUs are utf8 but not all?
> Is this a quality of implementation issue or a
> dwarf3 issue?
> These *are* searched independent of .debug_info, though
> the do reference .debug_info.
> Document/dwarf3 mistake or quality-of-implementation issue or ???
Well, searches can work given a mixture of UTF8 and non-UTF-8; to do so
requires that the consumer look into the corresponding compilation unit
header before doing the search. Not hard, but definitely a new wrinkle.
Worth at least an italics explanation.
RECOMMENDED ACTION
------------------
The email Ron responds to has a typo.
I meant to say that since .debug_frame can be present
yet .debug_info absent, what is supposed to happen?
Which Ron has noted as a key issue.
So I'm floating the following proposal for consideration.
(the existing ISSUE has no proposal)
PROPOSAL (ref draft 8/9, dwarf3):
x means the text is to orient the proposal reader and
x document editor, the
x literal proposed text below has no leading >.
xThe proposal is that the .debug_frame augmentation always
xbe ISO-IEC 10646-1:1993, never UFT8.
xGiven the string is a compiler flag,
xthere is no need to handle user-level names.
x6.4.1
x4. Augmentation (string)
xAdd the following to bullet 4:
Augmentation string must use ISO/IEC 10646-1:1993, not UTF8.
The string is (thus) independent of DW_AT_use_UTF8.
x7.5.4 Attribute Encodings
xAt the end of the 'string' attribute discussion, add the following:
Because .debug_frame is useful independently of .debug_info the
augmentation string of .debug_frame must be ISO/IEC 10646-1:1993, not UTF8
(the augmentation format is thus independent of DW_AT_use_UTF8).
=============================================================
5/17/2005: Accepted with modification:
In the proposal, where it says ISO-10646, it should read
ISO-646. The latter is LATIN-1, the former is UCS-2/4.
The Augmentation string will be UTF-8 encoded in 6.4.1.
No changed text in 7.5.4.