The problem is that types are always aligned on boundaries of their
own size (sizeof). An extreme example of what can happen when member
ordering is not taken care of:
struct a
{
    void *a;
    int b;
    void *c;
    short d;
    void *e;
    char f;
    void *g;
    char h;
};
sizeof(struct a) = 64
Here, pointers are 8 bytes and are always aligned on 8-byte boundaries.
A 1-byte char followed by an 8-byte pointer thus wastes 7 bytes of
memory. Reordering the members gives the following result:
struct b
{
    void *a;
    void *c;
    void *e;
    void *g;
    int b;
    short d;
    char f;
    char h;
};
sizeof(struct b) = 40
Here this saves 24 bytes, or 37.5% of the struct's size, with the exact
same members; in worse cases the savings can exceed 50%.
I am thus proposing to reorder the members of offending classes like
this:
1. types known to be 8 bytes long, such as double or int64
2. complex types such as classes/structs. These may contain 8-byte
types or pointers and thus are themselves 8- or 4-byte aligned.
If one KNOWS that a class contains 8-byte types, place it before ones
containing only pointers.
3. pointers, which are 8 or 4 bytes, and numeric types which are as
wide as pointers
(4. complex types which are KNOWN to contain only 4-byte members)
5. enums, which are 4 bytes
6. other 4-byte types, such as float or int32
(7. complex types which are KNOWN to contain only 2-byte members)
8. 2-byte types, such as int16
(9. complex types which are KNOWN to contain only 1-byte members or no
members at all)
10. 1-byte types, such as int8 or bool
Other things to take care of:
- Do not use PRBool for class members. Use PRPackedBool instead, or
plain C++ bool if possible.
- Place arrays, such as int a[2], next to their base type.
- Order access modifiers like this: public, protected, private per
member size.
- Inheritance can have a negative impact when using large-to-small
ordering; consider the following example:
class d
{
    int a;
};
sizeof(class d) = 4
class e : public d
{
    void* b;
    int c;
};
sizeof(class e) = 24
class f : public d
{
    int c;
    void* b;
};
sizeof(class f) = 16
I would say not to worry about this unless one KNOWS what they are doing.
All this may hurt readability when access modifiers (or #ifdefs) are
used a lot, but I believe the benefits are worth it.
Other comments/suggestions?
I don't think this rule is worth enforcing for NS_STACK_CLASS
classes/structs.
> - Order access modifiers like this: public, protected, private per
> member size.
This is pretty bad for readability, imo. Unless we allocate many
instances of a class, the sizeof issue is really not that important and
readability should trump it. So we should be ordering members by access
type before ordering by size in that case.
> Other comments/suggestions?
I'm not sure it makes sense to do this globally as a first cut. If we
have commonly-allocated classes that are wasting space, those are what
we should focus on.
-Boris
A wild idea I have is to mash up the data about the amount of wasted
space for each class with the data obtained from the bloat logs, so
that we can try to do this as a first pass for cases where the gain
(amount of space saved times the number of allocations) is maximum.
--
Ehsan
<http://ehsanakhgari.org/>
Maybe it's a completely crazy suggestion, but what about trying to
build a preprocessor (in Python or Perl) that would take .h and .cpp
files, walk the class declarations and constructors' initializer lists,
and reorder the members automatically into temporary .h and .cpp files
written to the obj dir, which would then be compiled instead of the
originals in the source directory?
-hb-
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
>
Wouldn't that hurt compile time considerably? (as if building toolkit
isn't long enough as is...)
You're right, that is completely crazy! :-)
Rob
I actually think it is. It's almost always trivial and free, so
consistency wins.
> I'm not sure it makes sense to do this globally as a first cut. If we
> have commonly-allocated classes that are wasting space, those are what
> we should focus on.
True. However, if someone volunteers to do the reordering everywhere
that it doesn't hurt readability (no extra visibility directives or
other additional syntax), they may as well do it IMHO.
Rob
It's not completely free, since PRPackedBool access is slower than
PRBool access, last I checked. Not much slower, granted.
>> I'm not sure it makes sense to do this globally as a first cut. If we
>> have commonly-allocated classes that are wasting space, those are what
>> we should focus on.
>
> True. However, if someone volunteers to do the reordering everywhere
> that it doesn't hurt readability (no extra visibility directives or
> other additional syntax), they may as well do it IMHO.
Agreed.
-Boris
Correct me if I'm wrong. As I understand it, this is what happens when
the program reads a 1-byte value:
1. read a "memory access granularity"-sized value from memory (is this
4 or 8 bytes on x64?)
0xFFFFAAFFFFFFFFFF (where the bool's byte is the AA)
2. shift so that the bool's byte is the rightmost
0x0000000000FFFFAA
3. AND with a 1-byte bitmask
0x00000000000000AA
> Robert O'Callahan wrote:
>
>> I actually think it is. It's almost always trivial and free
>
> It's not completely free, since PRPackedBool access is slower than
> PRBool access, last I checked. Not much slower, granted.
Writing's worse, of course, but especially so since XPCOM out parameters
are all PRBool*, so those have to be actual PRBools.
--
Warning: May contain traces of nuts.
It would certainly be nice if there were a script published somewhere
that could go through classes and tell you what space savings were
potentially available - I'd like to go through some of the comm-central
classes in conjunction with a bloat log and have a quick look to see if
there are any big gains to be made.
Standard8
That's what my dehydra script from https://bugzilla.mozilla.org/show_bug.cgi?id=492185
tries to do. It's not yet perfect; I just found another bug in its
calculation of vtable pointer usage.
This is only the code generated for CPUs that have no byte-sized memory
access instructions; no currently interesting architecture has that
limitation.
On x64 for instance, this
struct S { bool a, b, c, d; };
bool get(S *s) { return s->b; }
void set(S *s, bool v) { s->b = v; }
compiles to this
_Z3getP1S:
        movzbl  1(%rdi), %eax
        ret
_Z3setP1Sb:
        movb    %sil, 1(%rdi)
        ret
However, the CPU might internally be doing something similar to the
fullword load, shift and mask that you describe, especially for writes;
it depends on how the ALU-to-cache interface works. We'd have to look
through the optimization guides for each processor of interest (not just
each architecture). Personally, I would guess that for a structure
with more than one or two booleans in it, the size savings are going to
speed things up more than any hypothetical extra work for a byte access
slows them down (every L2 miss costs hundreds of cycles).
zw