Issue 031002.2: Endianity Proposal

Author: Michael Eager
Champion: Michael Eager
Date submitted: 2003-10-02
Date revised:
Date closed:
Type: Extension
Status: Accepted with modifications
DWARF Version: 3
[ See revised proposal at end. ]

It appears that there is no way in Dwarf3 to explicitly describe the
endianity of a data type.  All data is assumed to have the
endianity of the object file (although that is also not explicit).

I've just encountered a compiler (or a proposal for one) which
supports an extension to C which would allow creation of data in
both big and little endian formats.  If you don't specify the
endianity, the default is generated.

I propose that we add the following attributes:

PROPOSAL:
    DW_AT_big_endian
    DW_AT_little_endian

These may be used in either a base type DIE, to describe the format
of the actual data, or as attributes of struct/union DIEs to indicate
the layout.

IMO, big endian format is well defined.  There are multiple definitions
for little endian format, depending on how you treat data which crosses
word boundaries.

(I had actually thought that it was already possible to describe the
endianity of data and was surprised to find it missing.)

At the end, Jason Merrill proposes:
  DW_AT_endianness is more consistent with how other attributes
  work, even if it only has big and little values

======================== DISCUSSION FOLLOWS ================
What follows are the contributions of many folks [edited lightly by
David Anderson]. Any lack of clarity here is likely due to the removal
of quoted text (for brevity here) rather than to the original authors.


-------

Chris Quenelle

> or as attributes of struct/union DIEs to indicate
> the layout.

Would it be possible to augment the dwarf information
to nail down the layout of a struct without needing
to know what the endianness is?





Brian Nettleton writes:

Since you already indicate there are potentially more than two orderings,
wouldn't it be better to have a single attribute, say DW_AT_byte_ordering,
and then enumerate the different definitions you want to support as a value
for this attribute similar to the DW_AT_accessibility or DW_AT_encoding
(there are other attributes that do this too)?




Jim Blandy  writes:
Any reason to have two attributes, instead of one DW_AT_endianness
attribute, with values DW_END_big_endian, DW_END_little_endian,
DW_END_native (i.e., determine the endianness however you would had we
omitted this info --- I think it's good that the spec always provides
quiescent values in its enums)?

We would specify DW_END_little_endian to mean that earlier bytes are
less significant than later bytes.  That rules out the weird
mixed-endian architectures which I think you have in mind, where the
16-bit words appear in less-significant-earlier order, but the bytes
within those words are in more-significant-earlier order.  If anyone
actually has an architecture like that any more, they can propose new
enum values.  If we wanted to be fastidious, we could search out weird
orderings --- I think I've heard mixed-endianness referred to as
PDP-endianness --- and include whatever we find.
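
For illustration only, the orderings described in this message can be
laid out for the 32-bit value 0x0A0B0C0D as in the sketch below.  The
helper name mixed_endian_bytes is hypothetical, and the mixed layout
follows the wording of the message (16-bit words less-significant
first, bytes within each word more-significant first); it is not taken
from any architecture manual.

```python
VALUE = 0x0A0B0C0D

def mixed_endian_bytes(value):
    """Hypothetical helper: 16-bit words stored less-significant first,
    bytes within each word stored more-significant first."""
    low_word = value & 0xFFFF
    high_word = (value >> 16) & 0xFFFF
    return low_word.to_bytes(2, "big") + high_word.to_bytes(2, "big")

little = VALUE.to_bytes(4, "little")  # earlier bytes are less significant
big = VALUE.to_bytes(4, "big")        # earlier bytes are more significant
mixed = mixed_endian_bytes(VALUE)     # neither of the two orders above

for name, layout in (("little", little), ("big", big), ("mixed", mixed)):
    print(name, layout.hex())
```

The three layouts hold the same four bytes in different positions, so
a consumer has to know which one is in use before it can recover the
value.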

Andrew Cagney writes:

> IMO, big endian format is well defined.  There are multiple definitions
> for little endian format, depending on how you treat data which crosses
> word boundaries.
> 
> (I had actually thought that it was already possible to describe the
> endianity of data and was surprised to find it missing.)

What about constructing "little-byte big-word" using the "piece" operator?



James Cownie writes:

> Any reason to have two attributes, instead of one DW_AT_endianness
> attribute, with values DW_END_big_endian, DW_END_little_endian,
> DW_END_native (i.e., determine the endianness however you would had we
> omitted this info --- I think it's good that the spec always provides
> quiescent values in its enums)?

If we're going to do it, why don't we solve the full problem ?

What we're trying to do is provide the permutation between bytes in
store and a value in a register. So, what we could do is

1) Have DW_AT_endianness which can take

   DW_FORM_data      one of the pre-defined values
                     DW_END_big, DW_END_little (this is just an
                     optimisation for the common case; they could be
                     defined using the full map mechanism below)
   or

   DW_FORM_ref       a reference to an endian definition DIE
                     as below.

2) Define DW_TAG_endian_definition (to be the target of the reference
   above) with one attribute

   DW_AT_endian_map    which would be a DW_FORM_block.

   The endian map is a byte stream in which the value of the byte at
   offset 'n' gives the power of 256 which that byte represents. 

   So, a little endian 32 bit quantity would be represented as
   {0,1,2,3}; a big-endian 32 bit quantity would be {3,2,1,0}; and we
   can also represent a middle-endian 32 bit word (cf PDP11 floating
   point) {2,3,0,1} and so on.

   Since we have a DW_FORM_block here we can generate maps for
   different length objects. This is in general necessary; an endian
   map should only be used with an object of the appropriate length
   (otherwise it may map bytes out of the object !).

   The choice of bytes as the members of the map limits us to 256 byte
   long registers; I don't think that's a problem.
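
A minimal sketch of how a consumer might apply such an endian map is
given here, assuming the map and the object have the same length.  The
function name decode_with_endian_map and the permutation check are
assumptions made for the example; they are not part of the suggestion
above.

```python
def decode_with_endian_map(data, endian_map):
    """Combine the bytes of `data` into an integer.

    endian_map[n] gives the power of 256 that the byte at offset n
    represents, as described in the text above.
    """
    if len(data) != len(endian_map):
        raise ValueError("endian map length must match object length")
    # Assumed sanity check: each power of 256 is used exactly once.
    if sorted(endian_map) != list(range(len(endian_map))):
        raise ValueError("endian map must be a permutation of 0..n-1")
    return sum(byte << (8 * power) for byte, power in zip(data, endian_map))

LITTLE_32 = bytes([0, 1, 2, 3])
BIG_32 = bytes([3, 2, 1, 0])
MIDDLE_32 = bytes([2, 3, 0, 1])

raw = bytes([0x0A, 0x0B, 0x0C, 0x0D])
print(hex(decode_with_endian_map(raw, LITTLE_32)))
print(hex(decode_with_endian_map(raw, BIG_32)))
print(hex(decode_with_endian_map(raw, MIDDLE_32)))
```

The length check reflects the remark that a map should only be used
with an object of the appropriate length.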

Michael Eager
> What about constructing "little-byte big-word" using the "piece" operator?

DW_OP_piece is an expression operator, rather than a description of the
data type.

It might be possible to create an expression which describes how to
access some data, but it would be complex, especially if you want to
do this for structures.

Michael Eager writes, in response to  Jim Blandy:

I don't think that it would be a good idea to arbitrarily say that
certain architectures are suddenly "weird" and therefore not supported.

For any particular architecture, there really is only one little endian
format and it is handled directly by the hardware.  Enumerating and
maintaining a table of the values for different little endian formats,
along with accurate descriptions is tedious and a significant
amount of work.  Implementing the variations is also significant
work.

Asking implementers to propose new values will only lead to undocumented
values when different implementers define their own values.

DW_AT_big_endian and DW_AT_little_endian are really sufficient to address
the problem of describing the data layout.  There's no need to over-
elaborate the solution.


-- 
Michael Eager writes, responding to a question about little-endian forms:
In its simplest form, little endian values have the least significant
byte at the lowest address.  Big endian values have the most significant
byte at the lowest address.

When you have values which are longer than one word or which overlap
word boundaries, different little endian architectures have made different
choices about which part of the value to place in the lower addressed word.

I'm avoiding going into more detail because it gets ugly, requires
pictures, and makes me wonder what the system designers were thinking.
I may be misremembering, but I think the VAX or PDP-11 was one where
the lower word had the lower significant values, although the bytes
were in little endian order.



Chris Quenelle


As I recall, the discussion started with the mention of a *proposal*
for a system.  For that hypothetical system, they would probably want
to implement a vendor-specific extension (big_endian/little_endian)
that would support a system that could do both at the same time.
If that works for them we could standardize it.  But it seems like
we opened a big can of worms here.  I wouldn't want to put the cart
before the horse.

I have no objection to adding a binary flag like big/little,
but I suspect a lot of implementations might not use it
because they already know that information from other sources,
like the platform the tool is running on, or ELF information
about the executable files.


Chris Quenelle 
> What about constructing "little-byte big-word" using the "piece" operator?
> 

It seems like the location description operations
(defined in the Dwarf3 doc) are powerful enough
(using DW_OP_piece) to nail down exactly where each and
every byte is (in memory or registers).

You might have to use shifts and masks to identify
each and every byte of the value.  But in theory it seems
possible.  If you started doing this for every variable,
you would want some accelerators.  Should we be looking
at an extension to the expression syntax to facilitate
more detailed info in the location expression?
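
As a rough model of the idea above, a composite location can be
thought of as an ordered list of pieces, each naming where some
consecutive bytes of the object live.  The sketch below is not a DWARF
expression evaluator: read_pieces, the (address, size) tuples and the
sample memory image are all hypothetical, and the "little-byte
big-word" layout is assumed to be a little-endian double with its two
32-bit words swapped.

```python
import struct

def read_pieces(memory, pieces):
    """Concatenate (address, size) pieces in order, loosely modelling a
    composite location built from DW_OP_piece operations."""
    return b"".join(memory[addr:addr + size] for addr, size in pieces)

# Assumed memory image: 1.0 stored with the more significant 32-bit
# word first, each word in little-endian byte order.
plain_little = struct.pack("<d", 1.0)
memory = plain_little[4:] + plain_little[:4]

# Two pieces put the words back into plain little-endian order.
pieces = [(4, 4), (0, 4)]
value = struct.unpack("<d", read_pieces(memory, pieces))[0]
print(memory.hex(), value)
```

Only two pieces are needed here because the bytes inside each word are
already in the order the consumer expects; a layout that permuted
individual bytes would need one piece per byte.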



Daniel Jacobowitz 

Don't some ARM systems use different layouts for doubles (GDB calls it
littlebyte_bigword) and 64-bit integers?


Jim Blandy 

Michael Eager 
> I don't think that it would be a good idea to arbitrarily say that
> certain architectures are suddenly "weird" and therefore not
> supported.

Oh, I didn't mean to suggest otherwise.  When I wrote:

> > We would specify DW_END_little_endian to mean that earlier bytes
> > are less significant than later bytes.  That rules out the weird
> > mixed-endian architectures which I think you have in mind,

all I meant was that DW_AT_little_endian or DW_END_little_endian would
specifically mean that earlier bytes are less significant --- that is,
it would fully specify the representation --- and that a different
enum value / attribute would be used for PDP-endianness and things
like that.

> For any particular architecture, there really is only one little endian
> format and it is handled directly by the hardware.  Enumerating and
> maintaining a table of the values for different little endian formats,
> along with accurate descriptions is tedious and a significant
> amount of work.  Implementing the variations is also significant
> work.

I think I'm misunderstanding you.  The advantage of the extension to C
that you described (similar ones were considered for GNU CC, too) is
that you can declare a type explicitly to be little-endian, and then
compile that code to run on a big-endian machine, and the compiler
will generate code to do the byte-swapping for you.  That is, one can
explicitly select byte orders that are not the standard ordering for
the target.  So it sounds like you're saying that the exact meaning of
"little endian" can be left to be specified by the target architecture
--- but a big-endian architecture won't bother to specify a
little-endian format that it doesn't use.

If DW_AT_little_endian doesn't actually specify the exact encoding,
but leaves it unstated whether you mean "earlier bytes are less
significant" or "PDP-endian", then it's not significant work to
implement --- it's impossible to implement.  You don't have enough
information to do the job.

Andrew Cagney

> 
> I'm avoiding going into more detail because it gets ugly, requires
> pictures, and makes me wonder what the system designers were thinking.
> I may be misremembering, but I think the VAX or PDP-11 was one where
> the lower word had the lower significant values, although the bytes
> were in little endian order. 

FYI, the Arm (well one variant), and TiC80 >32 bit floating-point 
formats come to mind.  It's because the bus is 32 bits and it makes for 
simpler h/w :-)


Jim Blandy 

Thinko:

Michael Eager

Really, the problem I'm trying to solve is quite simple.
It shouldn't lead to a redesign of expression evaluation.

Jim Blandy

Okay, I see more where you're coming from now.  In the case where
we're dealing with architectures that define both a little-endian and
a big-endian format, one just needs to indicate which one is in play.

My forty-two cents:

I guess I just don't agree that "little-endian" is ill-defined.  It
surprised me when you said that it was.  In my experience, I would say
that there is a well-established, unambiguous usage of "little-endian"
and "big-endian": in little-endian representations, less significant
bytes precede more significant bytes in memory; and in big-endian
representations, more significant bytes precede less significant
bytes.

I've never heard someone say "little-endian" or "big-endian", and
actually mean some mixed-endian representation.  There are certainly
examples of processors that use mixed-endian representations, both
ancient and modern.  But I've never heard these mixed-endian
representations described as "little-endian" or "big-endian" without
comment: they have always been flagged as unusual cases, and given
names like "PDP-endian" or "mixed-endian".

In that context, I think there's some value to describing the meaning
of DW_AT_big_endian and DW_AT_little_endian completely in the spec: it
makes the spec clearer, simply by being more explicit, and making less
reference to architecture-defined parameters.

Michael Eager

Jim Blandy wrote:
For architectures which support both big endian and little endian,
for example, PowerPC or ARM, then DW_AT_little_endian data is in the
same format as that used when a program is compiled little endian, and
conversely, DW_AT_big_endian is in the same format used when a program
is compiled big endian.

Currently, debuggers use the endianity specified in the ELF file
to determine whether data is big endian or little endian.  This
specification applies to all data. These two attributes would permit
a change to the granularity of the specification and allow
the big or little endian specification to apply to specified types.

I think that you are reading more into the proposal than is there.
Both byte orderings as specified by DW_AT_big_endian and
DW_AT_little_endian would be standard for the target.



Brian Nettleton

It's a mistake to think that such a compiler feature might only be limited
to architectures which support both modes.  One could certainly imagine
making use of such a feature generally.  An example might be writing code
for a network protocol (big-endian) on a little-endian architecture (x86) and
getting compiler help.  It is certainly in the realm of possibility to see
this feature expanding to compilers for targets with only a single mode.

So we shouldn't make the assumption that the feature is only defined for
architectures where both modes are defined.  Instead we should simply define
what little-endian and big-endian mean (of course in sync with the widely
understood meanings) and leave other definitions to be implementation
defined extensions.
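
The network-protocol case can be illustrated with a small sketch.  The
port number is an arbitrary sample value, and the example only shows
why a consumer needs to know the byte order of an individual field; it
does not model any particular compiler extension.

```python
import struct

port = 8080  # 0x1F90, an arbitrary sample value

# The field is stored big-endian (network order) regardless of the
# byte order of the machine that runs the code.
wire = struct.pack(">H", port)

# A consumer that assumes the little-endian default of an x86 target
# misreads the field; one told the field is big-endian does not.
misread = int.from_bytes(wire, "little")
correct = int.from_bytes(wire, "big")

print(wire.hex(), misread, correct)
```

Without a per-type indication, a debugger on such a target has no way
to tell that this field needs the second interpretation.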

-------
Jason Merrill

I think a DW_AT_endianness is more consistent with how other attributes
work, even if it only has big and little values; DW_AT_inline and
DW_AT_visibility are other examples of enumerated attributes.

==================================================================
Revised Proposal


Add "DW_AT_endianity, Specifies endianity of data" to Figure 2, pg 11.

In section 4.1 (pg 59), add the following:

  12.  A DW_AT_endianity attribute which specifies the endianity of the
       variable or constant.  The value of this attribute specifies
       an ABI-defined byte ordering for the value.  If omitted, or the
       value for this attribute is zero, the default endianity of data
       for the target architecture is assumed.

In section 5.1 (pg 61) replace "All encodings assume the representation
that is “normal” for the target architecture." with the following:

  A DW_AT_endianity attribute may be specified (see section 4.1).  If
  omitted, the encoding is in the representation that is “normal” for
  the target architecture.

Add to Figure 19 (pg 136):
   DW_AT_endianity    0x61    constant

======================

Second revised proposal:

Add "DW_AT_endianity, Specifies endianity of data" to Figure 2, pg 11.

In section 4.1, add the following:

  12.  A DW_AT_endianity attribute which specifies the endianity of the
       variable or constant.  The value of this attribute specifies
       an ABI-defined byte ordering for the value.  If omitted, or the
       value for this attribute is zero, the default endianity of data
       for the target architecture is assumed.

       There are the following predefined values for DW_AT_endianity:
           DW_END_default -- default encoding
           DW_END_big     -- big-endian encoding
           DW_END_little  -- little-endian encoding
       *These represent the default encoding formats as defined by the
       target architecture's ABI or processor definition.  The exact
       definition of these formats may differ in subtle ways for
       different architectures.*


In section 5.1 replace "All encodings assume the representation
that is “normal” for the target architecture." with the following:

  A DW_AT_endianity attribute may be specified (see section 4.1).  If
  omitted, the encoding is in the representation that is the default for
  the target architecture.

Add to Figure 19:
   DW_AT_endianity    0x61    constant

New table:
   DW_END_default    0x00
   DW_END_big        0x01
   DW_END_little     0x02
   DW_END_lo_user    0x40
   DW_END_hi_user    0x7F

Add DW_END_endian to the vendor extension list in Section 7.1.
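
A minimal sketch of how a consumer might interpret the values in the
table above follows.  The constants are copied from the proposal; the
function names byte_order_for and decode_unsigned, the use of None for
an omitted attribute, and the reduction of each format to a plain
"big" or "little" byte order are assumptions made for the example,
since the proposal leaves the exact formats to the target's ABI.

```python
DW_END_default = 0x00
DW_END_big = 0x01
DW_END_little = 0x02
DW_END_lo_user = 0x40
DW_END_hi_user = 0x7F

def byte_order_for(endianity, target_default):
    """Map a DW_AT_endianity value to "big" or "little".

    `endianity` is None when the attribute is omitted;
    `target_default` is the target's default order.
    """
    if endianity is None or endianity == DW_END_default:
        return target_default
    if endianity == DW_END_big:
        return "big"
    if endianity == DW_END_little:
        return "little"
    if DW_END_lo_user <= endianity <= DW_END_hi_user:
        # Vendor-defined values need vendor-specific knowledge.
        raise NotImplementedError("vendor-defined endianity")
    raise ValueError("unknown DW_AT_endianity value")

def decode_unsigned(data, endianity, target_default):
    """Decode an unsigned integer using the resolved byte order."""
    return int.from_bytes(data, byte_order_for(endianity, target_default))

raw = b"\x12\x34"
print(hex(decode_unsigned(raw, DW_END_big, "little")))
print(hex(decode_unsigned(raw, None, "little")))
```

The target default is passed in explicitly because it describes the
program being debugged, which need not match the machine running the
debugger.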