The problem is that types are always aligned on boundaries of their
own size (sizeof). An extreme example of what can happen when member
ordering is not taken care of:
struct a
{
    void *a;
    int b;
    void *c;
    short d;
    void *e;
    char f;
    void *g;
    char h;
};
sizeof(struct a) = 64
Here, pointers are 8 bytes and are always aligned on 8-byte boundaries.
A 1-byte char followed by an 8-byte pointer thus wastes 7 bytes of
memory. Reordering the members gives the following result:
struct b
{
    void *a;
    void *c;
    void *e;
    void *g;
    int b;
    short d;
    char f;
    char h;
};
sizeof(struct b) = 40
Here this saves 24 bytes, or 37.5% of the struct's size, with the exact
same members; in worse cases the savings can exceed 50%.
I am thus proposing to reorder the members of offending classes like
this:
1. types known to be 8 bytes long, such as double or int64
2. complex types such as classes/structs. These may contain 8-byte
types or pointers and thus are themselves 8- or 4-byte aligned.
If one KNOWS that a class contains 8-byte types, place it before ones
containing only pointers.
3. pointers, which are 8 or 4 bytes, and numeric types which are as
wide as pointers
(4. complex types which are KNOWN to contain only 4-byte members)
5. enums, which are 4 bytes
6. other 4-byte types, such as float or int32
(7. complex types which are KNOWN to contain only 2-byte members)
8. 2-byte types, such as int16
(9. complex types which are KNOWN to contain only 1-byte members or no
members at all)
10. 1-byte types, such as int8 or bool
Other things to take care of:
- Do not use PRBool for class members. Use PRPackedBool instead, or
plain C++ bool if possible.
- Place arrays, such as int a[2], next to their base type.
- Order access modifiers like this: public, protected, private per
member size.
- Inheritance can have a negative impact when using large-to-small
ordering; consider the following example:
class d
{
    int a;
};
sizeof(class d) = 4
class e : public d
{
    void* b;
    int c;
};
sizeof(class e) = 24
class f : public d
{
    int c;
    void* b;
};
sizeof(class f) = 16
I would say not to worry about this unless one KNOWS what they are doing.
All this may hurt readability when access modifiers (or #ifdefs) are
used a lot, but I believe the benefits are worth it.
Other comments/suggestions?
I don't think this rule is worth enforcing for NS_STACK_CLASS
classes/structs.
> - Order access modifiers like this: public, protected, private per
> member size.
This is pretty bad for readability, imo. Unless we allocate many
instances of a class, the sizeof issue is really not that important and
readability should trump it. So we should be ordering members by access
type before ordering by size in that case.
> Other comments/suggestions?
I'm not sure it makes sense to do this globally as a first cut. If we
have commonly-allocated classes that are wasting space, those are what
we should focus on.
-Boris
A wild idea I have is to mash up the data about the amount of wasted
space for each class with the data obtained from the bloat logs, so
that we can try to do this as a first pass for cases where the gain
(amount of space saved times the number of allocations) is maximum.
--
Ehsan
<http://ehsanakhgari.org/>
Maybe it's a completely crazy suggestion, but what about trying to
build a preprocessor (in Python or Perl) that would take .h and .cpp
files, walk the class declarations and constructors' initializer lists,
and reorder the members automatically into temporary .h and .cpp files
written to the obj dir, which would then be compiled instead of the
originals in the source directory?
-hb-
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
>
Wouldn't that hurt compile time considerably? (as if building toolkit
isn't long enough as is...)
You're right, that is completely crazy! :-)
Rob
I actually think it is. It's almost always trivial and free, so
consistency wins.
> I'm not sure it makes sense to do this globally as a first cut. If we
> have commonly-allocated classes that are wasting space, those are what
> we should focus on.
True. However, if someone volunteers to do the reordering everywhere
that it doesn't hurt readability (no extra visibility directives or
other additional syntax), they may as well do it IMHO.
Rob
It's not completely free, since PRPackedBool access is slower than
PRBool access, last I checked. Not much slower, granted.
>> I'm not sure it makes sense to do this globally as a first cut. If we
>> have commonly-allocated classes that are wasting space, those are what
>> we should focus on.
>
> True. However, if someone volunteers to do the reordering everywhere
> that it doesn't hurt readability (no extra visibility directives or
> other additional syntax), they may as well do it IMHO.
Agreed.
-Boris
Correct me if I'm wrong. As I understand it, this is what happens when
the program reads a 1-byte value:
1. read a "memory access granularity"-sized value from memory (is this
4 or 8 bytes on x64?)
0xFFFFAAFFFFFFFFFF (where the bool's byte is the AA)
2. shift so that the bool's byte is the rightmost
0x0000000000FFFFAA
3. AND with a 1-byte bitmask
0x00000000000000AA
> Robert O'Callahan wrote:
>
>> I actually think it is. It's almost always trivial and free
>
> It's not completely free, since PRPackedBool access is slower than
> PRBool access, last I checked. Not much slower, granted.
Writing's worse, of course, but especially so since XPCOM out parameters
are all PRBool*, so those have to be actual PRBools.
--
Warning: May contain traces of nuts.
It would certainly be nice if there were a script published somewhere
that could go through classes and tell you what space savings were
potentially available - I'd like to go through some of the comm-central
classes in conjunction with a bloat log and have a quick look to see if
there are any big gains to be made.
Standard8
That's what my dehydra script from https://bugzilla.mozilla.org/show_bug.cgi?id=492185
tries to do. It's not yet perfect; I just found another bug in its
calculation of vtable pointer usage.
This is only the code generated for CPUs that have no byte-sized memory
access instructions; no currently interesting architecture has that
limitation.
On x64 for instance, this
struct S { bool a, b, c, d; };
bool get(S *s) { return s->b; }
void set(S *s, bool v) { s->b = v; }
compiles to this
_Z3getP1S:
        movzbl  1(%rdi), %eax
        ret
_Z3setP1Sb:
        movb    %sil, 1(%rdi)
        ret
However, the CPU might internally be doing something similar to the
fullword load, shift and mask that you describe, especially for writes;
it depends on how the ALU-to-cache interface works. We'd have to look
through the optimization guides for each processor of interest (not just
each architecture). Personally, I would guess that for a structure
with more than one or two booleans in it, the size savings are going to
speed things up more than any hypothetical extra work for a byte access
slows them down (every L2 miss costs hundreds of cycles).
zw