It seems that NSPR containers and strings use explicitly a 32-bite
index_type. This means that in 64-bit code, where pointers are 64-bit
wide and therefore where *offsets* need to be 64-bit wide, they are
slower than they need be. (FWIW, fixing this in my matrix library
Eigen gave me a +10% speed increase in 64-bit code).
For example:
http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsTArray.h#53
Excerpt:
class NS_COM_GLUE nsTArray_base {
public:
typedef PRUint32 size_type;
typedef PRUint32 index_type;
Another example:
http://mxr.mozilla.org/mozilla-central/source/xpcom/glue/nsStringAPI.h#58
Except:
class nsAString
{
public:
typedef PRUnichar char_type;
typedef nsAString self_type;
typedef PRUint32 size_type;
typedef PRUint32 index_type;
The right integer types to use for offsets, are size_t (unsigned) and
ptrdiff_t (signed). They are guaranteed to have the same size as
pointers, whence to give fast pointer arithmetic regardless of the
platform.
Cheers,
Benoit
Here with gcc 4.4 on linux x86-64, I get these results (using "perf" profiler):
For arrays of size 100, the version using 64bit indices is 8.6 % faster.
For arrays of size 10000, the version using 64bit indices is 11.1 % faster.
If you want to test yourself, here's a sample compilation command line
(for size 100):
g++ demo-slow-pointer-arithmetic.cpp -o x -O3 -DTEST_SIZE=100
-DTEST_REPEAT=1000000
Then profile the resulting executable, for example with perf:
perf record -f -g -- ./x
perf report -g flat --sort symbol | c++filt | head -10
# Samples: 20031338658
#
# Overhead Symbol
# ........ ......
#
52.49% [.] 0x00000000000ac2
16.00% [.] void merge_sort<int, unsigned int>(unsigned int, int*, int*)
14.73% [.] void merge_sort<int, unsigned long>(unsigned long, int*, int*)
11.68% [.] memcpy
1.54% [.] _int_free
Benoit
2010/9/26 Benoit Jacob <jacob.b...@gmail.com>:
We should in fact consider making that change. One thing to watch out
for is that at least for strings the 32-bit length may be packing with
the 32-bit flags (worth checking on this), so it's possible that strings
fit in two words on 64-bit now and would require 3 words if this change
were made. Just something to watch out for.
Note, also, that consumers by and large don't use the size_type and
index_type defines, so just changing the classes would not be enough to
get the speedup; we'd need to change the consumers too.
-Boris
ah ok. If that really matters, it is still possible to keep storing
the length as 32-bit, but use size_t offsets in the API and internally
in the implementation of string methods. One would then pay a
conversion cost when using the length.
> Note, also, that consumers by and large don't use the size_type and
> index_type defines, so just changing the classes would not be enough to get
> the speedup; we'd need to change the consumers too.
It's true that getting the full benefit from this change would require
updating the consumers too. But if some nonnegligible time is spent
inside of some expensive methods in these classes (for example in a
sort() method), then these methods could already benefit from this
change.
What are the API or ABI compatibility constraints here? Who is using
NSPR besides ourselves?
Benoit
>
> -Boris
> _______________________________________________
> dev-platform mailing list
> dev-pl...@lists.mozilla.org
> https://lists.mozilla.org/listinfo/dev-platform
>
The hot code for these classes is the consumers, generally.
> What are the API or ABI compatibility constraints here? Who is using
> NSPR besides ourselves?
Doesn't matter, since these are not NSPR classes. They're XPCOM classes.
For strings, Benjamin is the right person to ask about API/ABI compat.
For nsTArray, I don't think we have any compat issues at all, though
again worth checking with Benjamin.
-Boris
> For nsTArray, I don't think we have any compat issues at all, though
> again worth checking with Benjamin.
nsTArray cannot be safely used across library boundaries.
--
Warning: May contain traces of nuts.
> It seems that NSPR containers and strings use explicitly a 32-bite
Note: you aren't actually talking about NSPR containers at all here:
nsTArray and XPCOM strings are not NSPR. This is good, because NSPR
maintains binary compatibility and you wouldn't be able to change the struct
sizes of NSPR types.
I do think that what bz said about XPCOM strings is correct, and we should
continue to use a 32-bit size_type for strings so that they fit into two
words. There really isn't any reason to support strings longer than 32-bits
anyway.
> index_type. This means that in 64-bit code, where pointers are 64-bit
> wide and therefore where *offsets* need to be 64-bit wide, they are
> slower than they need be. (FWIW, fixing this in my matrix library
> Eigen gave me a +10% speed increase in 64-bit code).
The offsets only *need* to be 64 bits wide if you must accept arbitrary
sizes. In all of our code, we can decide that a TArray or string may not
have more than 32 bits of elements.
Are you saying that reading a 32-bit int out of the struct is slower than a
full 64-bit word? Or that doing math on a 64-bit pointer + 32-bit offset is
slower? If so, can we just expand the offset to 64 bits before doing the math?
For TArrays, I don't see any reason we can't make the size_type 64 bits,
since we don't save any space or complexity with a 32-bit type.
--BDS
OK, thanks for the explanation: I guess I was confused as I wrongly
believed that XPCOM was only, well, the COM component system.
> I do think that what bz said about XPCOM strings is correct, and we should
> continue to use a 32-bit size_type for strings so that they fit into two
> words. There really isn't any reason to support strings longer than 32-bits
> anyway.
OK. I would then like to measure how much speed we are losing in slow
pointer arithmetic there, and if there is something we can do without
affecting the sizeof(). For example, whether it would help to change
offsets to be size_t even though the length remains PRUint32.
>
>> index_type. This means that in 64-bit code, where pointers are 64-bit
>> wide and therefore where *offsets* need to be 64-bit wide, they are
>> slower than they need be. (FWIW, fixing this in my matrix library
>> Eigen gave me a +10% speed increase in 64-bit code).
>
> The offsets only *need* to be 64 bits wide if you must accept arbitrary
> sizes. In all of our code, we can decide that a TArray or string may not
> have more than 32 bits of elements.
Yes, yes, that was not my point (see below).
>
> Are you saying that reading a 32-bit int out of the struct is slower than a
> full 64-bit word?
No,
> Or that doing math on a 64-bit pointer + 32-bit offset is
> slower?
Yes, that's what I am saying. See my benchmark in the second email in
this thread.
By the way, if like me you find it weird that 50% of time is spent in
this unrecognized symbol, that seems to be operator new[]. Increasing
the size from 100 to 10000 to reduce the weight of dynamic array
allocation, I now get:
45.49% [.] void merge_sort<int, unsigned int>(unsigned int, int*, int*)
40.37% [.] void merge_sort<int, unsigned long>(unsigned long, int*, int*)
9.59% [.] memcpy
2.59% [.] 0x00000000000947
1.36% [.] main
So the version using 64bit offsets is 12.7% faster here.
> If so, can we just expand the offset to 64 bits before doing the
> math?
This is what the compiler does for us, and that explains the relative slowness.
Indeed, see attached file pointer-arithmetic.cpp. It does this:
template<typename index_type>
__attribute__((noinline))
int get_int_from_array(int* ptr, index_type index)
{
asm("#begin");
return ptr[index];
asm("#end");
}
I generate c++-demangled assembly like this:
g++ pointer-arithmetic.cpp -O3 -S -o pointer-arithmetic.s
cat pointer-arithmetic.s | c++filt > pointer-arithmetic.s++
Now we have this assembly code:
get_int_from_array<unsigned int>:
mov %esi, %esi
movl (%rdi,%rsi,4), %eax
ret
get_int_from_array<unsigned long>: movl (%rdi,%rsi,4), %eax
ret
So the difference is indeed the mov %esi, %esi which AFAIU (my asm
skills are inexistant) unpacks the offset to 64 bit.
> For TArrays, I don't see any reason we can't make the size_type 64 bits,
> since we don't save any space or complexity with a 32-bit type.
Cool, then I will have a look at implementing this change as soon as
possible (lots of emergencies to handle first in GFX land though).
Benoit
>
> --BDS
>> For TArrays, I don't see any reason we can't make the size_type 64 bits,
>> since we don't save any space or complexity with a 32-bit type.
>
> Cool, then I will have a look at implementing this change as soon as
> possible (lots of emergencies to handle first in GFX land though).
Note that I would not accept the change for Firefox 4 now. It will have to
wait for Firefox 5: we already have enough extensions being developed
against the existing 4 beta SDKs that changing it now isn't worth it.
--BDS