
objective-c: how slow?


Aurelien

Aug 31, 2001, 7:35:24 AM
Hi,

I was just wondering what the impact of messaging is on the speed of an
Objective-C program. I've read somewhere about the @SEL directive (I'm
not sure about the name), which allows often-invoked methods to be
optimized. But is there a possibility that the compiler does such
optimization right away? Is the @SEL "trick" an absolute requisite when
optimizing programs?

Does anyone know of benchmarking results? Has anyone already tested
the performance differences between programs written in C, Objective-C,
C++ and Java?

Is there a reason (performance, ...) why an Objective-C runtime and (at
least) the Foundation library shouldn't be ported to an embedded Linux?

Aurélien

Frederic Stark

Aug 30, 2001, 9:31:19 PM
Hi,

All this IMHO, of course.

Aurelien wrote:

> Hi,
>
> I was just wondering what was the impact of messaging on the speed of
> an objective-c program. I've read somewhere about the @SEL directive
> (I'm not sure about the name), which allows for optimizing often
> invoked methods. But is there a possibility that the compiler does
> such optimization right away ? Is the @SEL "trick" an absolute
> requisite when optimizing programs ?


You are probably referring to getting the IMP of a method. It is not an
absolute requisite at all, but handy to have for a very, very small
number of specific cases. You get the performance of a C function
pointer, and a lot of headaches when something goes wrong.

>
> Does someone know about potential benchmarking results ? Has someone
> already tested the performance differences between programs written in
> C, Objective-C, C++ and Java ?


Given infinite time to design and optimize, the performance is, from
best to worst:

* C
* C++
* ObjC
* Java

(Java could be faster than ObjC with an exceptionally good interpreter,
but in my experience it does not happen. YMMV)

Now, nobody has infinite time to design and optimize. ObjC enables a
more powerful form of messaging, which simplifies a lot of problems, so
more time can be spent on the design and high-level optimisation.
Performance-critical code can be (re)written in C++ or C, as ObjC
interfaces exceptionally well with C.

In many non-trivial projects, a lot of time is spent designing ways to
work around the lack of dynamism in the language, re-inventing a worse
ObjC, with worse performance.

"Premature optimisation is the root of all evil." Choosing C or C++ for
performance reasons when the problem is nicely handled by ObjC would be
the ultimate premature optimisation.

> Is there a reason (performance...) why an objective-c runtime and (at
> least) the foundation library shouldn't be ported to an embedded linux ?


I don't see any. I may be wrong, as I don't know much about embedded linux.

Cheers,

--fred

Marko Mikulicic

Aug 31, 2001, 5:15:21 PM
Frederic Stark wrote:
> Hi,
>
> All this IMHO, of course.
>
>
>
> Given an infinite time to design and optimize, the performance is, from
> best to worse:
>
> * C
> * C++
> * ObjC
> * Java
>
> (Java could be faster than ObjC with an exceptionally good interpreter,
> but, in my experience it does not happend. YMMV)


Does the current objc compiler/runtime optimize message lookups in some
way (inline caches, polymorphic inline caches, ...)?
If the answer is no, would it be very bad to allow the runtime to
dynamically modify the machine code of an objc method?

If plain objc code could be as fast as with IMP kludges, it would
be more maintainable and flexible, preserving performance in the common
case.

Marko

PS:
Polymorphic inline caches (PICs) cost a test and a direct jump in the
best case (a cache hit), and 3 or 4 tests (pipelined) plus a normal
method lookup in the worst. There could be performance issues because of
the self-modifying code (icache flushes), but a cache update should not
happen frequently.
Please refer to the Self prototype-based OO language implementation;
for further information see:
http://www.sun.com/research/self/

Doc O'Leary

Aug 31, 2001, 1:38:03 PM
In article <3B8EE8E7...@easynet.fr>, Frederic Stark
<st...@easynet.fr> wrote:

> Now, nobody have and infinite time to design and optimize. ObjC enable a
> more powerful form of messaging, which simplifies a lot of problems, so
> more time can be spent on the design and high-level optimisation.
> Performance critical code can be (re)written in C++ or C, as ObjC
> interfaces exceptionnally well with C.
>
> In many non-trivial projects, a lot of time is spent designing ways to
> work around the lack of dynamicity in the language, re-inventing a worse
> ObjC, with worse performance.
>
> "Premature optimisation is the root of all evil". Choosing C or C++ for
> performance reasons when the problem is nicely handed by ObjC would be
> the ultimate premature optimisation possible.

This is all spot on and very well said. I don't have any benchmarks
for the original poster, either, but I do have personal experience
moving a highly optimized C++ app over to ObjC and even the first pass
was faster than the C++ version. The move was made for precisely the
reasons described in the second paragraph above (sometimes it's smart
to be lazy :-), and we don't regret the choice of ObjC one bit.

Erik M. Buck

Aug 31, 2001, 4:00:48 PM
Objective-C message dispatch is 2-3 times slower than a C function call.
Since not that much time is spent in function calls themselves as opposed to
the bodies of the functions, it seldom has a huge impact. For message
sending within a loop, the message can be hand optimized into a simple
function call and have exactly the same performance as C.

NeXT used Objective-C for writing kernel-level drivers. There is no real
problem using Objective-C in a performance-sensitive embedded system.
However, it would be very difficult, if not impossible, to prove the
determinism of Objective-C message dispatch (since it uses hash tables),
so if you need determinism (i.e. guaranteed execution time measured in CPU
clock ticks) then you don't want Objective-C.

I personally have never seen Java perform even half as fast as Objective-C.
Every Java program I have used on the desktop suffered from slow, buggy
graphics and was prone to occasional halts/freezes, presumably while garbage
was collected. Every Java implementation that I have seen has suffered from
a large memory footprint for the VM, and then problems resulting from the
heap size not being large enough. (Java heap sizes are limited for security,
so that a Java program cannot crash a machine by allocating too much
memory; this is a job for the operating system, not a language, but what
can you do?)

C++ virtual member functions are a tiny bit faster than Objective-C
messages. If you use a pointer to a virtual member function (as is common
with call-backs) then the performance is identical in most cases. Using the
pointer forces the compiler to do an additional pointer dereference, similar
to what Objective-C does. I have had performance problems with C++ not
because of virtual methods but because you have to write more code, which
takes longer to execute and is less likely to be in a cache, etc. When
profiling C++, it is not uncommon to see 40% or more of the time spent in
destructors and constructors. This happens because many programmers use
simple assignment, which calls copy constructors, and pass function arguments
by value, which causes a new object to be constructed on the stack
and destructed upon return. Here is a common slow pattern in C++:

/***** Begin example *****/

// swap objects in a collection
MyClass objectA = collection[i];
MyClass objectB = collection[i+1];
collection[i] = objectB;
collection[i+1] = objectA;

/***** End example *****/


Given that the [] operators have probably been overloaded, collection[i]
turns into collection.GetObject(i), and collection[i] = objectB turns into
collection.SetObject(i, objectB).

So, how many times are the constructor and destructor for MyClass called per
line?

MyClass objectA = collection[i];   // constructor 3 times: create objectA with
                                   // the default constructor, copy the object
                                   // from the collection onto the stack for the
                                   // return, and use the copy constructor for
                                   // the assignment.
                                   // destructor 2 times: the original value is
                                   // lost due to the assignment, and the stack
                                   // value goes away after the assignment.
MyClass objectB = collection[i+1]; // constructor 3 times: same as above
                                   // destructor 2 times
collection[i] = objectB;           // constructor at least 2 times: once onto
                                   // the stack for the function call and again
                                   // for the assignment.
                                   // destructor at least once.
collection[i+1] = objectA;         // same as above

Total: at least 10 constructor calls
       at least 8 destructor calls

Each constructor or destructor call has the minimum overhead of a function
call. Constructors and destructors for all of MyClass's superclasses are also
called. It is easy to get 20 or 30 function calls out of the code above.


Now, before C++ programmers say the example is not fair because references
should be used rather than pass-by-value: first, the code above looks
perfectly reasonable and harmless to a non-C++ high priest. The same code
might be written in C or Java (except that there, people would know for sure
what the [] and = operators do).
An expert C++ programmer might write MyClass objectA(collection[i]); to
avoid the useless default constructor call. Second, tell that to the
creators of the CString class in MFC, which requires pass-by-value semantics.

Marko Mikulicic

Sep 1, 2001, 12:56:17 AM
Erik M. Buck wrote:
> Objective-C message dispatch is 2-3 times slower than a C function call.
> Since not that much time is spent in function calls themselves as opposed to
> the bodies of the functions, it seldom has a huge impact.

That depends on the programming style. In pure OOP languages like Smalltalk,
message passing IS the body of the functions.
Since Objective-C has hardwired control structures (from C) and doesn't need
to handle integers through objects, it can have bodies that directly contain
"primitive" operations, explicitly programmed (as opposed to inlined by the
compiler). The impact of a slower dispatch isn't so great, so it may be
ignored, but it may influence design and coding decisions, for example by
making you avoid too-deep message chains. (Doubling the dispatch speed will
double the acceptable chain length; as a design device, chain length should
be limited by human creativity, not by processing speed, if possible.)

This can make the code hard to maintain in big projects, when someone
breaks the rule that only instances of class X should go into a collection.

> For message
> sending within a loop, the message can be hand optimized into a simple
> function call and have exactly the same performance as C.

You lose the polymorphism.
With polymorphic inline caches you'd get the same performance as
indirect function calls, and code that still works even if one object
out of a million is of another class.

>
> NeXT used Objective-C for writing kernel level drivers. There is no real
> problem using Objective-C in a performance sensitive embedded system.
> However, it would be very difficult if not impossible to prove the
> determinism of the Objective-C message dispatch (since it uses hash tables),
> so if you need determinism (i.e guaranteed execution time measured in CPU
> clock ticks) then you don't want Objective-C.
>
> I personally have never seen Java perform even half as fast as Objective-C.
> Every Java program I have used on the desktop suffered from slow buggy
> graphics and was prone to occasional halts/freezes presumably while garbage
> was collected. Every Java implementation that I have seen as suffered from
> a large memory footprint for the VM and then problems resulting from the
> heap size not being large enough (Java heap sizes are limited for security
> so that a Java program can not crash a machine by allocating too much
> memory: (This is a job of the operating system and not a language, but what
> can you do ?)


I didn't manage to keep a Java app running for more than 10 minutes of work.
I don't know why companies like Oracle use it for their installation &
configuration tools! They even provide their own JDK bundled on the CD,
and either it doesn't run, or when it runs (it depends on the moon, it seems)
the JRE eats 1 GB of RAM for the installation (512 MB of RAM, 512 MB of swap,
I swear).

Java is really an unhappy example, though.
With the HotSpot compiler (developed as part of the Self project at Sun),
Java could in theory be very fast and responsive.
But Java is a bad crossing of two worlds, a bad balance of runtime and
compile time. I think all this was imposed by marketing rules,
because Java had to be similar to C++ and had to run safely on various
OSes, each of them with security problems and limited common memory-management
features.


Objective-C is also a hybrid, but a well-thought-out one.
The performance of the GUI is surprising. Certainly this is partly because
of the OpenStep design (it had to run on a 33 MHz 68k), but also the raw
power of objc.
There are a couple of issues I don't like, mostly
the very verbose naming of methods (I really like the at:put: stuff from
Smalltalk), but I understand that there is no way out: the signature
must be unique for a selector if it handles primitive types.
Objc has many drawbacks, many of which are rooted in C.
I'd like to put integers in collections without wrappers, use
closures for actions, gracefully handle integer overflows, have
nil be a real object, have real class objects with metaclasses, ...
But for me OpenStep matters, not objc itself; changing objc
doesn't make sense. Adding a bit of performance, maybe.

Do you think that doubling the performance of message dispatch for common
targets would speed up the GUI (responsiveness)?


Marko

Erik M. Buck

Aug 31, 2001, 8:00:11 PM
> Do you think that doubling the performance of message dispatch for common
> targets will speedup the gui (responsivness) ?
>

Probably not, when you consider that the processor time to draw the pixels of
a button completely dwarfs the processor time spent performing the action of
the button in most cases. There is no hardware acceleration for Quartz. If
we really want to speed up the GUI in OS X, I suspect the best approach would
be to double the bus bandwidth between main memory and graphics memory.

Marko Mikulicic

Sep 1, 2001, 2:45:07 AM

You're right.
MacOSX wastes (this is an opinion) memory bandwidth and computing power
(I guess menu transparency is not hardware accelerated, at least not all of it).
I was thinking of a calmer GNUstep with the standard style running on a modest
Linux box (say at 300 MHz, to be modern). I personally don't like to assume
that every year everyone will upgrade to the latest technology the market
offers. However, I was interested in the processing of events and
notifications when managing a very complex GUI, not the drawing of pixels,
which isn't really a matter of objc.
I noticed writing Qt applications (C++) that when the connections grow in
number, the application quickly becomes unresponsive (that's bad design caused
by C++) (~30k lines, mostly GUI). I used the signal/slot facilities of that
framework extensively; they are similar to notifications in GNUstep. Qt
reinvents the wheel of dynamic dispatch, which GNUstep handles natively, hence
my belief that GNUstep will not suffer the same bottleneck. I don't yet
understand the GNUstep internals very well, but it seemed to me that the
notification mechanism is widely used in the GUI code.


Marko


Marko Mikulicic

Sep 1, 2001, 2:59:15 AM
Hi Alexander,

Could you please send me your patches or upload them somewhere?
I'll try to port them to gcc 3.0.
I need more raw speed for processing huge data with gstep-db.

Thanks

Marko


Malmberg wrote:

>
> I did some hacking on gcc (2.95.2) a while back to try to make the
> message passing code more efficient and did some benchmarking. For the
> common case of a local, valid selector, calling an empty method took ~45
> cycles on a PII. After trying to make the generated code in the caller
> smaller and rewriting objc_msg_lookup* in assembly by hand (tried to
> optimize the common path through it), and making some other changes, I
> got it down to ~22 cycles. I could probably gain some more speed by
> sacrificing the smaller code, and possibly some more by more careful
> assembly.
>
> Anyway, gnustep base and gui compiled and worked fine with the new
> gcc/libobjc (well, not on the first try, but eventually). Once I get
> around to installing gcc 3.0, I'll probably try to move the changes
> over. If anyone's interested I could upload patches to 2.95.2 somewhere,
> or (given some time) change it to use the old call style and the new
> objc_msg_lookup, which is what I think would give the best performance.
>
> - Alexander Malmberg
>
>
>

Nicola Pero

Sep 1, 2001, 12:11:29 AM

> does the current objc compiler/runtime optimizes in some way
> message lookups (inline caches, polymorphic inline caches,...).

> PS:


> Polymorphic inline caches (PIC) have the cost of a test and a direct
> jump in the best case (cache hit) and 3 or 4 tests (pipelined) and a
> normal method lookup. There could be performance issues because of the
> self-modifying code (icache flushes) but a cache update should not
> happen frequently.
> Please refer to the Self prototyping OO language implementation,
> and documentation for furhter informations look at:
> http://www.sun.com/research/self/

Thanks for the pointers; I didn't know about that, and it was a very
interesting read.

Plainly, my answer to your suggestion of using inline caches in the
Objective-C runtime is that the current runtime can already message at very
high speed under multithreading, in a totally thread-safe way. The current
method cache is designed, amazingly, in such a way that it is thread-safe
without locks, and thus without incurring *any* thread-lock overhead.

My feeling reading about inline caches is that they require self-modifying
code, and for the solution to be viable you would need self-modifying code
which can run multithreaded in a thread-safe way without using locks (as
adding thread locking would kill the performance). It might actually be
possible to implement, but if it is, it is not going to be trivial.

Nicola Pero

Sep 1, 2001, 9:22:33 AM

Here is a rough thought about how we could implement an inline-cache
optimization for the Objective-C runtime, without using self-modifying
code (or rather, by using the C equivalent of self-modifying code):

you would modify the compiler so that instead of compiling

[receiver message];

into

{
  IMP __imp = objc_msg_lookup(receiver, @selector(message));
  __imp (receiver, @selector(message));
}

it would compile it into

{
  static Class __class = Nil;
  static IMP __imp = NULL;

  if (receiver->isa != __class)
    {
      __class = receiver->isa;
      __imp = objc_msg_lookup(receiver, @selector(message));
    }

  __imp (receiver, @selector(message));
}

It's a rough sketch, but I suppose something like this might actually
work! :-)

But only if you don't use multithreading. With multithreading it is not
actually going to work unless we add locks, which we can't. Any idea how
to make it work with multithreading?

I suppose we might add it to the compiler, where it could be turned on with a
certain command-line flag. That flag wouldn't work with threads, so if
you compile with the flag, you mustn't use threads. Some sort of
`thread-unsafe' optimization.

Btw, notice that a hand-made optimization done directly with IMPs in
Objective-C is faster than an inline-cache hit. So in the (extremely rare)
places where you might need terrific messaging speed (which,
reasonably speaking, *must* be in a tight loop, and so optimizable by
IMPs, because unless you send tens of millions of messages, who cares
if it takes 30% more or less time to send a message), you would want to
optimize directly with IMPs anyway.

Marko Mikulicic

Sep 1, 2001, 2:12:19 PM

Self was designed to use soft threading, with a context switch exactly at the
beginning of a method (faking the stack-overflow test in the prologue). Since
Self uses message dispatching for quasi-everything (and although massive
inlining could defeat the responsiveness, that's unlikely to happen), this
provides fine granularity and avoids locking. However, only one processor is
exploited. This may be good for an experimental system, but for production,
native threading should be used, along with a way of implementing PICs using
native threads (also, the stack-overflow stuff should be done using page
faults).
I'm working on that. When it is done for OpenSelf, maybe it could be ported
to gcc objc.

How does the runtime cache work?
Is there currently a big impact on performance when adding new methods at
runtime?
Could it support multiple inheritance, delegation, dynamic inheritance?


Marko


Marko Mikulicic

Sep 1, 2001, 4:56:50 PM
Nicola Pero wrote:
> Here is a rough thought about how we could implement an inline cache
> optimization for the Objective-C runtime, without using self-modifying
> code (or better, by using the C equivalent of self-modifying code):
>
> you would modify the compiler so that instead of compiling
>
> [receiver message];
>
> into
>
> {
> IMP __imp = objc_msg_lookup(receiver, @selector(message));
> __imp (receiver, @selector(message));
> }
>
> it would compile it into
>
> {
> static Class __class = Nil;
> static IMP __imp = NULL;
>
> if (receiver->isa != __class)
> {
> __class = receiver->isa;
> __imp = objc_msg_lookup(receiver, @selector(message));
> }
>
> __imp (receiver, @selector(message));
> }
>
> it's a rough sketch, but I suppose something like this might actually
> work! :-)
(a nil receiver must be explicitly handled)


Actually, the Portable Object Compiler (POC) does something similar
(it is a preprocessor).

But this scheme will thrash when frequent cache misses occur
(interleaved classes in an array, for example). Polymorphic inline caches
would perform far better, although a single call is a bit more expensive:

{
  static struct PIC_call_t __picdata;

  if (receiver->isa == __picdata.class1)
    {
      __picdata.imp1(receiver, @selector(message));
    }
  else if (receiver->isa == __picdata.class2)
    {
      __SWAP(__picdata.imp1, __picdata.imp2);
      __SWAP(__picdata.class1, __picdata.class2);
      __picdata.imp1(receiver, @selector(message));
    }
  else if (receiver->isa == __picdata.class3)
    {
      __SWAP(__picdata.imp2, __picdata.imp3);
      __SWAP(__picdata.class2, __picdata.class3);
      __picdata.imp2(receiver, @selector(message));
    }
  else
    {
      __picdata.imp3 = objc_msg_lookup(receiver, @selector(message));
      __picdata.class3 = receiver->isa;
      __picdata.imp3(receiver, @selector(message));
    }
}

Of course, handcrafted assembler would result in smaller code
(for example pushing arguments at the beginning, exploiting
branch delay slots if the architecture permits, and avoiding stalls
and icache misses by prefetching the target IMP while maintaining the
LRU queue, ...).

There are other ways of maintaining an LRU queue; this is also just a sketch.
It's difficult to migrate a technique from a dynamically compiled language
to a statically compiled one: much of the information present in the Self
runtime system is not cheap to obtain in objc (e.g. the profiling information
used by the inlining compiler).

>
> but only if you don't use multithreading. In multithreading, it is not
> actually going to work, unless we add locks, but we can't. Any idea how
> to make it work in multithreading ?

You could, for example, maintain profiling information
(just the count of method calls, on a per-site basis).
If a thread is preempted just before writing the incremented value,
then one method-call count will be lost, but who cares; we are talking
of millions of method calls, and if nobody calls it later then it wouldn't
have been a good candidate for a PIC anyway.
The PIC update could be done by a separate thread (the GC, for example,
while traversing stack frames for pointers to locals or parameters)
while freezing all other threads (this could be done incrementally
to minimize delays).
But this assumes some mechanisms that are part of the Smalltalk/Self runtime
and not of objc, and maybe adding them to objc would bring it back to the
same performance it has now.
However, we can be conservative in determining which method calls should be
promoted into a PIC, and PIC updates can be deferred.

>
> I suppose we might add it the compiler and it could be turned on with a
> certain command line flag. That flag wouldn't work with threads, so if
> you compile with the flag, you mustn't use threads. Some sort of
> `thread-unsafe' optimization.

It could be useful.
You might also use it in a threaded environment, with coarse handcrafted
locking and by linking with a category compiled with this flag, but that
would become dangerous and tricky in some cases.

>
> Btw, notice that a hand-made optimization done directly using IMPs in
> Objective-C is faster than an inline cache hit, so in (the extremely rare)
> places in which you might need a terrific messaging speed (which,
> reasonably speaking, *must* be in a tight loop (and so optimizable by
> IMPs) because unless you send some tons of millions of messages, who cares
> if it takes 30% more or less time to send a message), you would want to
> optimize directly using IMPs anyway.

I have this specific situation:

I have modified gstep-objc to better support custom attribute types
(money, custom calendar dates). The current code hardcoded classes to database
types, but the framework (EOF) lets the user choose a specific class for a
given attribute (of course with a compatible protocol). My modifications
require more OO, and when huge data loading is involved, the overhead of
transforming adapter records into application objects becomes visible.
The messaging is not done in a tight loop, but still most of the methods are
called on common targets.
I think this is an example of a situation where optimizing by hand is not
possible, because the framework interaction is not trivial.
Even if the targets are temporary and spatially dislocated, the size of
today's icaches is big enough to hold more than one level of the call graph,
and you also have to watch the performance implications on architectures with
different argument-passing schemes (register windows: sparc, itanium, pa-risc,
and more to come).
x86 is an unhappy base for design: it can perform amazingly, but it freezes
the development of new techniques, because the architecture imposes rules
which produce code shaped by those rules, which in turn justify new design
assumptions.
For example: we should not add more registers, because the encoding of
instructions would be larger, leaving less space on the die for the dcache
backing the stack, which performs quite well in place of registers given the
appropriate runtime hardware code transformations (e.g. in the PPro core,
which inside is much like a RISC machine)...
This indeed performs well, but for old code, and new code is pushed to be
like the old. The situation is more relaxed than in the past, thanks to
compiler development, but some assumptions still reign: function calls are
slow, keep your call graph as shallow as you can, ...

OO code can do amazing things; it's a shame that its price is so high.
Often we make a trade-off between performance and flexibility; the problem
is that when we hardcode this trade-off, we lose future developments.
If the balance is struck at runtime, the system stabilises very quickly
and then performs as if coded by hand, changing automatically when its
environment requires it.

I admit that in many cases manual optimization does very well, surely better
than PICs.

Sorry for this huge post, I'll learn to type gzipped :-)


Marko

Erik M. Buck

Sep 1, 2001, 11:16:11 AM
> you would modify the compiler so that instead of compiling
>
> [receiver message];
>
> into
>
> {
> IMP __imp = objc_msg_lookup(receiver, @selector(message));
> __imp (receiver, @selector(message));
> }
>
> it would compile it into
>
> {
> static Class __class = Nil;
> static IMP __imp = NULL;
>
> if (receiver->isa != __class)
> {
> __class = receiver->isa;
> __imp = objc_msg_lookup(receiver, @selector(message));
> }
>
> __imp (receiver, @selector(message));
> }
>
> it's a rough sketch, but I suppose something like this might actually
> work! :-)
>

As you know, the above is not thread-safe. Furthermore, the above is likely
SLOWER. The variable assignments require memory writes. The use of static
variables probably means the memory will not be in the cache. The
comparison and branch are almost as expensive as the function call, and for
the common case, when the IMP for @selector(message) is in the receiver's
method cache, you are getting effectively the same optimization. I doubt
there is ANY benefit to the above code. It probably will make things slower.
It is not thread-safe. It requires more memory and more code. It abuses
the processor's memory cache. It substantially duplicates what is already
happening in objc_msg_send.

Given:


> {
> IMP __imp = objc_msg_lookup(receiver, @selector(message));
> __imp (receiver, @selector(message));
> }

One optimization would be to inline objc_msg_lookup()
Another optimization would be to use jmp rather than a function call to
execute the IMP. The NeXTstep runtime diddles the stack and jumps to the
IMP. It is slightly faster, but it is assembly language and therefore less
portable.


Finally, as you point out, when optimization is critical, the programmer can
call the IMP directly.


Marko Mikulicic

Sep 1, 2001, 6:58:14 PM
Erik M. Buck wrote:
>>you would modify the compiler so that instead of compiling
>>
>>
>
> As you know, the above is not thread safe. Furthermore, the above is likely
> SLOWER. The variable assignments require memory writes. The use of static
> variables probably means the memory will not be in the cache.

Yes, you are right. This is why I prefer "self-modifying" code
(in Self it is not really self-modifying, because PIC updates are deferred).
Also, using a writable text segment, you could "fix up" method calls into
direct calls instead of indirect ones, which can aid the CPU's prefetch
logic.

> The
> comparison and branch is almost as expensive as the function call, and for
> the common case when the IMP for @selector(message) is in receiver's method
> cache, you are getting effectively the same optimization. I doubt there is
> ANY benefit to the above code. It probably will make things slower.

I tried it on a simple case and got these results:
- 11.31 secs for the full lookup
- 5.20 secs for the simple inline cache (Nicola's code)
- 4.20 secs for plain C
The body is a tight loop of 1e8 iterations; the method called simply returns
its argument incremented.
I know it's not a real benchmark; it's far too simplified, because it doesn't
take into account the presence of the IC in the data cache.

> It is
> not thread safe. It requires more memory and more code. It abuses
> processor memory cache. It substantially duplicates what is already
> happening in objc_msg_send.

Various levels of cache have their purposes. Duplicating is not always bad.

> One optimization would be to inline objc_msg_lookup()
> Another optimization would be to use jmp rather than a function call to
> execute the IMP. The NeXTstep runtime diddles the stack and jumps to the
> IMP.

This would be good, especially for methods with many arguments (or massive ones).

> It is slightly faster, but it is assembly language and therefore less
> portable.
>
>
> Finally, as you point out, when optimization is critical, the programmer can
> call the IMP directly.

You must know that the target will not change. To correctly handle
an object of the wrong type inside a collection you will end up
implementing an inline cache by hand anyway. This is simple to manage
when the calls are directly in the tight loop.

I don't want this thread to become something like
the everlasting fight between GC and no GC, but let me draw a
parallel with it:
the C/C++/whatever programmer argues that manual memory allocation is best
because you have control over your memory management and GC is slow,
non-deterministic, ... but when the need arises he introduces something called
a "smart pointer", which is nothing else than a reference-counting GC, which is
probably (I don't want a dispute here about that) the worst GC around these days.

In the cases where you can use IMPs directly, use them.
But if you can't (like my situation with gstep-db) you could still code
inline caches by hand, but then you'd be limited by what the C compiler
can produce (it doesn't make sense to code it in asm). The compiler
itself, however, could directly generate ICs or PICs (with the appropriate
flags of course), and it would generate them better than you as an end user
could, because they would be optimized for a given architecture (something
you shouldn't have to care about).

ObjC sits on the border: you don't notice the performance degradation because
you can cross over to the C side; when C is too rigid you can cross back to
the Smalltalk side. So you get reasonable performance and flexibility out
of both worlds, but at the price of living in a hybrid framework, where integers
are not objects, nil has nil behavior, etc.

I agree that maybe the small performance gain is not worth the effort, unless you
change coding style, but then it is not OpenStep anymore.

Please read the Self performance-related papers
(http://www.sun.com/research/self/papers/papers.html) if you have time.
They managed to run a pure OO language only twice as slow as optimized C
(on non-strictly-OO tasks) while having bounds checking, GC, integers as objects
(automatic multi-precision on overflow), full debugging support (dynamic
deoptimization), multithreading, dynamic inheritance, and many other features
beyond the scope of this thread. Many of these features are not interesting to
ObjC, but they are the reason why Self was twice as slow as C. I think ObjC
could do better.

Marko

Nicola Pero

unread,
Sep 1, 2001, 4:28:40 PM9/1/01
to

> > you would modify the compiler so that instead of compiling
> >
> > [receiver message];
> >
> > into
> >
> > {
> > IMP __imp = objc_msg_lookup(receiver, @selector(message));
> > __imp (receiver, @selector(message));
> > }
> >
> > it would compile it into
> >
> > {
> > static Class __class = Nil;
> > static IMP __imp = NULL;
> >
> > if (receiver->isa != __class)
> > {
> > __class = receiver->isa;
> > __imp = objc_msg_lookup(receiver, @selector(message));
> > }
> >
> > __imp (receiver, @selector(message));
> > }
> >
> > it's a rough sketch, but I suppose something like this might actually
> > work! :-)
> >
>
> As you know, the above is not thread safe. Furthermore, the above is likely
> SLOWER.

I actually agree with you ... it was a tentative try at imagining how you
could implement inline caching in Objective-C ... it's interesting to try
importing ideas ... and it is definitely an interesting set of ideas for
optimizing OO language messaging ... thanks Marko for pointing this out ...
but I agree with you that it looks like an optimization
designed for, and appropriate to, different environments/languages ... in the
context of ObjC, it doesn't seem to make as much sense as it does in other
environments (where it might definitely be a great idea) ... the gain in
speed (if any, and if positive :-)) looks like it would normally be so
little that it's probably not worth the effort.

Btw - because of poseAs:, the above code - which uses a locally cached
static IMP and Class - can't be used; the compiler would instead need to
generate a table per file in which to store the cached IMPs and Classes for
all method invocations in the file, so that when poseAs: is called, we
could loop through the files and update the cache tables ...

Malmberg

unread,
Sep 2, 2001, 7:17:08 PM9/2/01
to
> Hi Alexander,
>
> Could you please send me your patches or upload it somewhere ?
> I'll try to port it on gcc 3.0.
> I need more raw speed when processing huge data with gstep-db.

I've uploaded the patch to:

http://w1.423.telia.com/~u42303319/gcc-objc.diff

and the optimized functions in:

http://w1.423.telia.com/~u42303319/sendmsg_alex.S

(which goes in the libobjc/ directory). It's a bit messy and not very
portable, but it works. It should be fairly easy to change the functions
in sendmsg_alex.S so they return the function-pointer instead of jumping
to it (which would use the current call style instead of the one I use).
Really early testing seemed to show that it performed slightly better
that way, but the new call style did reduce code size of base and gui by
~10%, which may be enough to compensate.

Anyway, this discussion has made me think more about how to optimize
message passing, so I guess I'll get gcc 3.0 and start experimenting
again. I'm a bit sceptical of actually gaining any significant amount of
speed, though.

> Thanks
>
> Marko

- Alexander Malmberg

Helge Hess

unread,
Sep 3, 2001, 6:02:42 AM9/3/01
to
Chris Hanson wrote:
> There's no reason Java has to be slow -- that it is says more about
> Sun than it does about languages running on virtual machines, or
> about garbage collection, or about dynamic languages.

There are other reasons besides the VM why Java is slow. One is that large
parts of the basic classes do extensive locking for no reason.
Take a look: many 'how to write fast Java' tutorials start by
reimplementing most of the basic classes ...

Greetings
Helge
--
SKYRIX Software AG - http://www.skyrix.com
Web Application Technology for Enterprises

Malmberg

unread,
Sep 5, 2001, 8:53:55 AM9/5/01
to
OK, I've done a bunch of benchmarking and testing and have found some
interesting results. In all cases, the method/function called ignores
its arguments and returns an integer. Function prologue and epilogue is
included. I've tested by running a loop of calls 100,000,000 times on a
400MHz PII with no load. Most of the time I've taken the assembly
generated by gcc and fixed the worst issues before testing (it likes to
move the %esp value around for no good reason). I'm using gcc 2.95.2, but
that shouldn't matter.

So, in no particular order (speed in cycles/iteration, not necessarily
an integer due to overhead and pairing issues):

direct call, zero args 8.12
direct call, one arg (self) 9.20
direct call, two args (self+selector) 11.04

c++ virtual call, one arg (self) 10.52
c++ virtual call, two args (self+dummy) 12.08

all these calls are with two args (self+selector):

old obj-c lookup/call 42.72
new obj-c lookup/call (like the patch) 21.68
even more optimized obj-c 18.36
really optimized obj-c (not safe) 17.12

dynamic inline cache (in c) hit 13.08
call using IMP 11.48

'outline' cache, style 1:
level 1 hit 13.92
level 1 hit, no nil test 12.60
level 1 miss (double nil test) 24.60
level 1 miss (proper) 23.68

level 2 hit 15.36
level 2 miss (double nil test) 25.12

level 3 hit 18.08
level 3 miss (double nil test) 28.04

'outline' cache, style 2:
level 1 hit 13.40
level 1 miss (double nil test) 24.16

level 2 hit 14.96
level 2 miss (proper) 25.12

inline cache:
level 1 hit ~13


Now some comments:

Inline caches are a bad idea because:
1. They would make the code larger.
2. They're hard to do in a thread-safe way without locking (probably
impossible).
3. Modifying the code means shared libraries can't be shared between
programs. This is a problem with the other cache-systems as well, but
that could be solved (at the loss of speed) by an extra level of
indirection.

Even inlining the nil test didn't help performance (which is a bit
surprising). Having gcc load the class for the object and placing it in
a register before calling improves performance slightly, though.

With 'outline' caches, style 1, I'd create short stubs like:
	movl 4(%esp),%ecx	// get object from stack
	testl %ecx,%ecx		// check nil receiver
	jz _objc_nil_method
	movl (%ecx),%eax	// load class
	cmpl some_class,%eax
	jnz objc_msg_lookup_call_no_nil_test
	jmp the_real_method

and patch the calls to these functions. The stubs would be packed
together in some memory allocated by the runtime. This could be made
thread safe without locking since jmp and call destinations can be
updated atomically with one movl (at least I think so).

Also, these stubs can be changed dynamically. If one of the stubs sees a
lot of misses going to a particular class, the 'jnz objc_...' could be
patched to jump to the cmpl of another stub. The jmp to the function
could also be patched if the implementation changes. However, it
wouldn't be possible to change the class for a stub.

Note that in most of my testing of the caches, misses were sent to the
normal lookup/call function, so loading the class and checking for a nil
receiver is done twice, which seems to cost 1-1.5 cycles.

'Outline' caches, style 2, had stubs like:
	movl 4(%esp),%ecx
	testl %ecx,%ecx
	jz _objc_nil_method
	movl (%ecx),%eax
	cmpl some_class_1,%eax
	jz the_real_method_1
	cmpl some_class_2,%eax
	jz the_real_method_2
	// etc
	jmp objc_msg_lookup_call_no_nil_test

This is slightly faster than style 1, but you can't change the chaining
of compares.

Since it turned out that the normal lookup can be optimized down to just
~7.5 cycles slower than a direct C call (~6 cycles might be possible), I
doubt (automatic) caching will ever help. Even in the best safe cache,
the third level is almost as slow as the normal lookup. And these were
static caches; keeping stats on method calls and updating calls would
slow it down even more.

I'll get to work on making the optimized lookup work properly in all
cases, which should be fast enough. 7-8 cycles slower than a C++ call
isn't bad considering the capabilities and safe nil receiver handling.

- Alexander Malmberg

Marko Mikulicic

unread,
Sep 5, 2001, 6:34:18 PM9/5/01
to
Malmberg wrote:

>
> Inline caches are a bad idea because:
> 1. They would make the code larger.

Perhaps compiling only some methods with ICs, through a compiler flag,
could help where hard-coded IMPs are not well suited, and still
save code where ICs are not needed.

> 2. They're hard to do in a thread-safe way without locking (probably
> impossible).

Deferred updating is not possible because maintaining call statistics
would be a performance bottleneck in ObjC, so I guess ICs are not feasible
in ObjC in a thread-safe way. Any other ideas ?

> 3. Modifying the code means shared libraries can't be shared between
> programs. This is a problem with the other cache-systems as well, but
> that could be solved (at the loss of speed) by an extra level of
> indirection.

How are C++ virtual calls handled in shared libraries ?
I've heard that the biggest performance problem in the startup
of a KDE app is the copy-on-write of virtual pointer tables contained
in shared libraries. They must be updated because the shared libraries can
be mapped at different addresses.
On systems which support copy-on-write pages, ICs are possible at the cost of
some memory. In contrast to C++, the pages need not be updated at load time by
the dynamic linker, so startup time is not affected by this.

Marko Mikulicic


Malmberg

unread,
Sep 5, 2001, 7:26:11 PM9/5/01
to
> > 2. They're hard to do in a thread-safe way without locking (probably
> > impossible).
>
> Deferred update is not possible because maintaing call statistics
> is a performance bottleneck in objc, so I guess that IC are not feasible
> in objc in a thread safe way. Any other ideas ?

Well, if you really _really_ need the extra 4-5 cycles/call, how about
adding a runtime function objc_need_more_speed(@selector(foo: bar:
zot:),[SomeClass class],[SomeOtherClass class],...,NULL);. It would
generate an 'outline' stub (style 2) with those classes and change the
sarray entries for the selector in those classes to point to a function
that patches the caller to directly call the stub. You could even add an
argument that controls whether the stub should bother to correctly
handle a nil receiver or not (if you can guarantee that you'll never
call that selector with a nil receiver you can gain ~1 cycle/call).
There are a few practical problems, but I think it could be done.

Theoretically, there's recompilation.

I had a look at some of the PIC-papers, and it seems their target is
quite different from ours. As far as I can tell, they put a direct call
at ~200ns and a normal lookup at ~15000ns (~75 times slower). With the
optimized obj-c lookup, a direct call takes ~27.5ns, and a lookup
~47.5ns (1.7 times slower).

> How are C++ virtual calls handled in shared libraries ?
> I've heard that the biggest performance problem in the startup
> of a KDE app is copy-on-write of virtual pointer tables contained
> in shared libraries. They must be updated because the shared libraries can
> be mapped at different addresses.

I suppose so, though I've never looked at how that's done. I'll see if I
can find the code responsible for this.

- Alexander Malmberg

Marko Mikulicic

unread,
Sep 6, 2001, 3:38:53 AM9/6/01
to
Malmberg wrote:
>> > 2. They're hard to do in a thread-safe way without locking (probably
>> > impossible).
>>
>>Deferred update is not possible because maintaing call statistics
>>is a performance bottleneck in objc, so I guess that IC are not feasible
>>in objc in a thread safe way. Any other ideas ?
>>
>
> Well, if you really _really_ need the extra 4-5 cycles/call, how about
> adding a runtime function objc_need_more_speed(@selector(foo: bar:
> zot:),[SomeClass class],[SomeOtherClass class],...,NULL);. It would
> generate an 'outline' stub (style 2) with those classes and change the
> sarray entries for the selector in those classes to point to a function
> that patches the caller to directly call the stub. You could even add an
> argument that controls whether the stub should bother to correctly
> handle a nil receiver or not (if you can guarantee that you'll never
> call that selector with a nil receiver you can gain ~1 cycle/call).
> There are a few practical problems, but I think it could be done.

You have manual control over the PIC.
Why can't you simply have a macro generate the code in place ?
Perhaps some stubs could be reused, reducing code size.
For example: if we know that some collection contains x kinds of
objects, of which only a few really dominate (the top 3 are more than 70%, say),
it could help to use a common stub for accessing an element from the array (when
the messages sent to that object are in a tight loop, or spread across a deep
call chain, so maintaining a global per-iteration IMP is not feasible).
You could determine at runtime which classes are to be cached by running
a kind of "prepare" pass, but that's strictly application-dependent, and if some
end user really needs that, he can probably do it best himself.

Is the cost of the JMP to the stub worth it ?

I'm now convinced that implementing PICs in ObjC is not worth the effort
because of the very nature of this language. Generated code is mixed with user
code. We have only spoken of simple situations, but when the method call is
intermingled with some nasty C stuff we never know how gcc (and which compiler
backend, depending even on the version of gcc) will allocate registers and
schedule instructions. If we generate stubs in C code, gcc will handle them as
best it can, but handcrafted assembler is especially well suited to
reimplementing things like method calls, because they deviate a bit from standard
C function calls. In Self everything is generated, so a clever compiler can play
with all the factors, not only the instruction best-case duration.

> Theoretically, there's recompilation.

Sorry, what do you mean ?

>
> I had a look at some of the PIC-papers, and it seems their target is
> quite different from ours. As far as I can tell, they put a direct call
> at ~200ns and a normal lookup at ~15000ns (~75 times slower). With the
> optimized obj-c lookup, a direct call takes ~27.5ns, and a lookup
> ~47.5ns (1.7 times slower).

ObjC has a simpler lookup because it doesn't have multiple inheritance,
delegation or dynamic inheritance. Moreover, Self has to use true message sends
even for accessing instance variables (local variables too, theoretically, but
those are inlined).

Self has execution times comparable to an ObjC IC in the case of a cache hit.
It uses profiling information to update PICs and also to decide when and where
to inline a method body, saving the whole call.
I'm comparing apples to bicycles, because Self has very short methods on average
and also has to inline most control structures to be usable, while preserving
the possibility of extending basic flow control (ifTrue: and co.).
The only drawback I see in Self is that it achieves all this at the expense of
code size, which eats icache.

>>How are C++ virtual calls handled in shared libraries ?
>>I've heard that the biggest performance problem in the startup
>>of a KDE app is copy-on-write of virtual pointer tables contained
>>in shared libraries. They must be updated because the shared libraries can
>>be mapped at different addresses.
>>
>
> I suppose so, though I've never looked at how that's done. I'll see if I
> can find the code responsible for this.

Sorry, it was a rhetorical question.
The dynamic linker updates the virtual pointer tables.
The KDE guys have solved this problem in the latest release by adding a kind of
prelinking. You can find interesting info at
http://www.research.att.com/~leonb/objprelink/

Marko

Malmberg

unread,
Sep 7, 2001, 6:21:35 AM9/7/01
to
[snip]

> You have manual control over the PIC.
> Why you can't simply have a macro generate the code in place ?

You don't know what values to use for the class compare/method jump
until run-time, so you'd have to make a list of all IC:s and patch them
when the program loads.

> Is the cost of the JMP to the stub worh ?

0.5-1 cycles. The unconditional jump is cheap, and lack of conditional
call instructions means that the stub can be slightly more efficient
than the IC ('call stub;compare;jz method;jmp lookup' instead of
'compare;jmp call_stub;call lookup;jmp cont;call_stub:call
method;cont:').

> > Theoretically, there's recompilation.
>
> Sorry, what do you mean ?

Collecting stats on test runs and using the information to inline common
methods and generate efficient IC:s. Not likely to happen anytime soon.

[snip]


> Objc has a simpler lookup because it doesn't have multiple inheritance,
> delegation

Don't forwarding and DO count as delegation?

> and dynamic inheritance.

Those are language limitations. The runtime could handle all of those
(at least if I'm thinking of the same things you are) at the same speed.

- Alexander Malmberg

Marko Mikulicic

unread,
Sep 7, 2001, 9:36:18 PM9/7/01
to
Malmberg wrote:
> [snip]
>
>>You have manual control over the PIC.
>>Why you can't simply have a macro generate the code in place ?
>>
>
> You don't know what values to use for the class compare/method jump
> until run-time, so you'd have to make a list of all IC:s and patch them
> when the program loads.

The compiler could generate per-object-file information with the locations of the
ICs, and the runtime would patch them as the program loads. But this would slow
down startup if thousands of ICs are used. An alternative could be placing the
full IC code inline, but with the first instruction a jump to a patching routine,
which would restore the first instruction of the IC and fill in the values known
only at runtime.

>>Is the cost of the JMP to the stub worh ?
>>
>
> 0.5-1 cycles. The unconditional jump is cheap, and lack of conditional
> call instructions means that the stub can be slightly more efficient
> than the IC ('call stub;compare;jz method;jmp lookup' instead of
> 'compare;jmp call_stub;call lookup;jmp cont;call_stub:call
> method;cont:').

You can code the stub like that in the inline code too; you would save
0.5-1 cycles, but it would require compiler changes.
Anyway, you would be writing into the text segment, which would most
probably trigger a copy-on-write page fault.


Your method has the power/defect of patching whatever caller calls the registered
methods. Sometimes this can be good, but sometimes it could lead to unexpected
page faulting during execution and core growth.
Your solution is easier to implement, because the compiler itself doesn't need
to be changed, but it can give serious performance problems that are hard to
predict because they depend on the execution profile of the program.

>
>
>>>Theoretically, there's recompilation.
>>>
>>Sorry, what do you mean ?
>>
>
> Collecting stats on test runs and using the information to inline common
> methods and generate efficient IC:s. Not likely to happen anytime soon.

Can gprof gather useful data for this ?

>
> [snip]
>
>>Objc has a simpler lookup because it doesn't have multiple inheritance,
>>delegation
>>
>
> Doesn't forwarding and DO count as delegation?

Not really. Delegation uses the delegate to look up the method,
but the method runs with the original receiver as self.
(One example is better than a thousand words; see attachment.)

>
>>and dynamic inheritance.
>>
>
> Those are language limitations. The runtime could handle all of those
> (at least if I'm thinking of the same things you are) at the same speed.

I admit I don't know the lookup mechanism in the ObjC runtime in detail,
but I suppose that rebuilding the sarray after an inheritance change would be
very expensive. Multiple inheritance requires a resolution when the same
selector is found in several ancestors, and it also has to handle inheritance
cycles. This is what makes the lookup so expensive, and it also makes any
global selector-based caching scheme very difficult to implement.

Maybe I'm wrong. Do you think all these can be handled by the ObjC runtime ?
Can methods be added to the cache at runtime (suppose they were built at runtime
from byte-code) ? Could it be very easy to implement a RAD tool for ObjC,
or even to implement a Smalltalk or Self system on top of the ObjC runtime
(which would have good C interfacing through objc/OpenStep wrappers) ?

Marko

Marko Mikulicic

unread,
Sep 7, 2001, 9:38:24 PM9/7/01
to
I forgot the attachment.


main.m

Erik M. Buck

unread,
Sep 7, 2001, 4:23:21 PM9/7/01
to
> Can methods be added at runtime to the cache (suppose they were builded
> runtime by byte-code) ? I could be very easy to implement a RAD tool for objc,
> or even implement a Smalltalk or Self system on top of the objc runtime
> (which would have goot C interfacing though objc/openstep wrappers) ?
>

Yes. New methods are added to existing classes at runtime all the time via
dynamically loaded categories. How the methods are created is irrelevant.
See Objective-Everything http://www.tiptop.com/Objective/ and Joy
http://www.aaa-plus.com/news/ObjectiveC-as-ScriptingLanguagepressRelease.html

Adding methods to existing classes at runtime is completely routine. Adding
new classes at runtime is common. There used to be a free tool called eval
that would dynamically (just in time) recompile method implementations,
dynamically load the new implementations, and patch the runtime to use the
new versions with existing classes. In other words, you could edit the C
code of a method implementation while the program that uses the method was
running, press a button, and load the new implementation into the running
program.


Erik M. Buck

unread,
Sep 7, 2001, 4:27:50 PM9/7/01
to
> I admit I don't know in detail the lookup mechanism in the objc runtime
> but I suppose that rebuilding the sarray after a inheritance change would be
> very expensive. Multiple inheritance requres a resoulution when the same

As long as the instance variable layout does not change, dynamically
changing the super class of an existing class is one pointer operation.
Each class object has a "superclass" variable. The message cache tables are
per-class (At least in my runtime and Apple's runtime) so changing a
superclass automatically changes the method cache with no extra effort.


Richard Frith-Macdonald

unread,
Sep 7, 2001, 6:03:00 PM9/7/01
to

On Saturday, September 8, 2001, at 01:36 AM, Marko Mikulicic wrote:

>> [snip]
>>> Objc has a simpler lookup because it doesn't have multiple
>>> inheritance,
>>> delegation
>> Doesn't forwarding and DO count as delegation?
>
> Not really. Delegation uses the delegate to lookup the method
> but forwards the original receiver.
> (One example is better than a thousand words. see attachment)

Actually, what you are describing is not true delegation ...
Delegation is what is commonly done in ObjC - an object asks another
object to do a job for it.

The capability you are describing here is also possible using the
ObjC runtime (the GNUstep base library uses it in a few places), and
is called 'behaviors'. The object takes on a behavior of another
object rather than delegating to another object.

So, with delegation, an object A which does not handle a particular
message passes the message to a delegate object B, which executes
the appropriate method; whereas with behaviors, A gets the method from
B and executes it itself.

The behavior mechanism is not directly supported by the language, and
is therefore dangerous in ObjC - since you need to be careful not to
take on any behavior which would try to work with instance variables
that don't exist in your object and cause a crash. In other languages
where ivars are only accessed by methods, you just need to worry about
runtime exceptions, not actual crashes when you make a mistake.


Marko Mikulicic

unread,
Sep 8, 2001, 1:00:11 AM9/8/01
to
Richard Frith-Macdonald wrote:
>
> So, with delegation and object A which does not handle a particular
> message, passes the message to a delegate object B which executes
> the appropriate method, whereas with behaviors, A gets the method from
> B and executes it itself.

Sorry, it's only a matter of terminology. I was mistaken in calling it "true"
as opposed to "not true". I grew up with this term, and many people use it,
but it could be wrong, or at least have different meanings.

>
> The behavior mechanism is not directly supported by the language, and
> is therefore dangerous in ObjC - since you need to be careful not to
> take on any behavior which would try to work with instance variables
> that don't exist in your object and cause a crash. In other languages
> where ivars are only accessed by methods, you just need to worry about
> runtime exceptions, not actual crashes when you make a mistake.
>

I have a vision of OO languages which you certainly won't share (you would
probably be using Smalltalk, not ObjC): instance variables must not be accessed
directly, so that subclasses can override ivar access with a method (for
flexibility). The VM/runtime should optimize out the unnecessary calls, but
the flexibility must be there.
I don't want to dispute OO design here. I think I would not be using ObjC if
there were no OpenStep. I think the NeXT guys created a very comfortable
framework (although when I first saw those loong method names I felt pain)
where you can be productive from the beginning. And ObjC doesn't need any
new features for OpenStep.

Marko

Marko Mikulicic

unread,
Sep 8, 2001, 1:08:10 AM9/8/01
to

But this means that a cached lookup needs to walk the inheritance graph.
ICs cache the target where the method is actually found, so no cycles are
lost traversing the inheritance graph.
I've done some simplified testing, and it turns out that for a 10-deep
inheritance chain there is a 6% difference between calling the root and the
child (the test method returns the incremented argument). In this case ICs are
~2.3 times faster than the cached lookup.

What is the relationship between lookup speed and the number of methods in
that class ?

It should not be an issue, especially if the framework doesn't have many long
inheritance chains (they are not uncommon in Self).

I always assumed that a selector-based cache must flatten the inheritance,
because walking through it always seemed impractical to me, but as it turns out
that doesn't have a heavy impact in class-based languages. Self heavily uses
delegation, which can considerably lengthen the walk.

A couple of questions:

Why are sends to "super" not compiled as direct calls when dynamic inheritance
is not used ? (A compiler flag perhaps?)
Is this related to dynamic loading of classes ?
Perhaps super sends are a good place for ICs (you get 99.99% hits).

Do you think that the increase in code size due to PICs, and their presence in
the icache, could lead to a performance loss comparable to the gain (~2 times)
in message sends ? (The performance loss can show up somewhere else; that's my
worry.)

Marko

Malmberg

unread,
Sep 7, 2001, 6:59:31 PM9/7/01
to
[snip]

> An alternative could be placing a full IC
> code inline but with the first instruction a jump to the patching routine, which
> will restore the first instr. of the IC and update the values known only at runime.

There are threading issues with that. A full jump is 5 bytes, which is
more than can be patched atomically (or are the 80 bit FP stores
atomic?).

> > 0.5-1 cycles. The unconditional jump is cheap, and lack of conditional
> > call instructions means that the stub can be slightly more efficient
> > than the IC ('call stub;compare;jz method;jmp lookup' instead of
> > 'compare;jmp call_stub;call lookup;jmp cont;call_stub:call
> > method;cont:').
>
> You can code the stub like that in the inline code too, you would save
> 0.5-1 cycles but it would require compiler changes.

No you can't, since at some point you have to put the return address on
the stack. Manually pushing the return address costs (for whatever
reason) ~15 cycles, so that's not an option. Calling a local stub will
be no more efficient than calling the outline stub.

> Anyway you would be writing in the text segment, which would most
> probably arise in a copy on write page fault.

That will be a problem with any IC. A second level of indirection would
help a lot, but would probably cost ~0.5 cycles.

> Your solution is more easy to implement because the compiler itself needs not
> to be changed but can give serious performance problems that can be hard to
> predict because they depend of the execution profile of the program.

I don't think they'd be hard to predict. After calling a runtime
function, every call to those method implementations causes the caller
to be patched (memory[return_address-0x4]=selector->stub_adr). There'll
be a 4k page copy the first time for each page, but if you can't live
with that, you simply can't use self-modifying code (or you could load
the entire program and its libraries non-shared, which would be
predictable but cost memory and load speed).

[snip]


> > Doesn't forwarding and DO count as delegation?
>
> Not really. Delegation uses the delegate to lookup the method
> but forwards the original receiver.
> (One example is better than a thousand words. see attachment)

Forwarding doesn't do that (currently), but it would be trivial to add a
flag that tells NSInvocation to set self of the called method to the
object doing the forwarding. If that isn't fast enough, it would be
possible to add a objc_add_delegate_for_class(someClass,theDelegate)
function that added all methods in theDelegate to someClass if it didn't
already have a method for that selector. You could even do it for an
individual object. It's a bit dangerous, though, since the receiver
probably expects self to be of the correct type.

[snip]


> I admit I don't know in detail the lookup mechanism in the objc runtime

In short, each selector gets an index, and each class has an array of
implementations indexed by selector (implemented as a two-level sparse
array, but that makes no difference). The lookup uses the index to get the
implementation from the array (and if there isn't one, it goes on to do
forwarding etc.).

- Alexander Malmberg

Marko Mikulicic

unread,
Sep 8, 2001, 3:48:06 AM9/8/01
to
Malmberg wrote:
> [snip]
>
>>An alternative could be placing a full IC
>>code inline but with the first instruction a jump to the patching routine, which
>>will restore the first instr. of the IC and update the values known only at runime.
>>
>
> There are threading issues with that. A full jump is 5 bytes, which is
> more than can be patched atomically (or are the 80 bit FP stores
> atomic?).

FST (64-bit) should be atomic, at least on single processors; I don't
know whether, on SMP, Intel processors lock the bus for a full quadword
write, and there are cache-coherency issues.

You could first arrange the jump to a routine which will yield the
calling thread (it must be generated temporarily, because the return
address would be unknown).
If another thread is already executing the stub code when it is stopped,
then atomically modify the last call instruction of the stub to call
some generated routine that sets a variable and then calls the old
function, and spin on that variable. That's tricky and not worth the
effort (especially when considering page faults, see later), but this
non-cooperative locking can do the job when the updating is not done often.


>
> No you can't, since at some point you have to put the return address on
> the stack. Manually pushing the return address costs (for whatever
> reason) ~15 cycles, so that's not an option. Calling a local stub will
> be no more efficient than calling the outline stub.

I don't understand. You inline a call, and this call will push the IP.
The call will return and you go straight on.

>
>>Anyway you would be writing in the text segment, which would most
>>probably arise in a copy on write page fault.
>>
>
> That will be a problem with any IC. A second level of indirection would
> help a lot, but would probably cost ~0.5 cycles.

Yes, but you have control over where the ICs are. If the patcher is in the
sarray you never know who will be patched (potential callers can be anywhere
in the code).

>
>> Your solution is more easy to implement because the compiler itself needs not
>>to be changed but can give serious performance problems that can be hard to
>>predict because they depend of the execution profile of the program.
>>
>
> I don't think they'd be hard to predict. After calling a runtime
> function, every call to those method implementations causes the caller
> to be patched (memory[return_address-0x4]=selector->stub_adr).

Yes, but you don't know who the callers are. IC optimization is caller-based,
not target-based. Some callers can benefit from it, others can only lose.
(Imagine you have two arrays, one with 90% objects of class A and the other
with only a couple of A objects. Somewhere in the code a pass through the
second array is done, calling some method of A which patches the code,
thinking that it likes to be optimized for A, but it's wrong. The coder
predicted the first array to be the one likely to contain A objects, and so
instructed the runtime to generate an IC for calls in some places rather
than throughout the whole system.)

> There'll
> be a 4k page copy the first time for each page, but if you can't live
> with that, you simply can't use selfmodifying code (or you could load
> the entire program and its libraries non-shared, which would be
> predictable but cost memory and load speed).

You proposed to have stubs in separate regions and JMP to them. This is
perfect, because 4KB can hold a lot of stubs. It's also thread-safe, because
you can atomically change the address of a stub in the caller.

>
> [snip]
>
>>>Doesn't forwarding and DO count as delegation?
>>>
>>Not really. Delegation uses the delegate to lookup the method
>>but forwards the original receiver.
>>(One example is better than a thousand words. see attachment)
>>
>
> Forwarding doesn't do that (currently), but it would be trivial to add a
> flag that tells NSInvocation to set self of the called method to the
> object doing the forwarding. If that isn't fast enough, it would be
> possible to add a objc_add_delegate_for_class(someClass,theDelegate)
> function that added all methods in theDelegate to someClass if it didn't
> already have a method for that selector. You could even do it for an
> individual object.

You are right, Objective-C can do that. Self implements inheritance (SI and
MI) through "delegates" (the term has a different meaning in the Self and
Objective-C worlds). Unimplemented method calls are forwarded to the delegate
that responds to them; if more than one delegate responds, an exception is
raised. This is the reason Self lookup is so slow and why PICs are needed.
Self was an experimental system: they wanted to see what could be done if
some language features, turned down as too expensive or mind-blowing in
classical languages, were instead made efficient.

> It's a bit dangerous, though, since the receiver
> probably expects self to be of the correct type.

The framework is built accordingly. However, ivars should not be accessed directly.

>
> [snip]
>
>>I admit I don't know in detail the lookup mechanism in the objc runtime
>>
>
> In short, each selector gets an index, and each class has an array of
> implementations for selectors (implemented as a two-level sparse array,
> but that makes no difference). The lookup uses the index to get the
> selector from the array (and if there isn't one, it goes on to do
> forwarding etc.).

It is as I was thinking.
The thing I was not expecting is that selector maps are per class, requiring
an inheritance-graph walk for every miss.
This also slows down delegates (forwardInvocation:), which must be called
after a full cache miss (up to the root class). Is this true?

The speed of this sarray implementation leaves me breathless.

However, has anyone got benchmarks on PPC, Alpha, SPARC, or others?
It seems to me that the balance between lookup and the overall method call
on Intel is kept by the relatively slow function-call machinery (SPARC
register windows speed up parameter passing, and Alpha's return-address
register advantages leaf functions).

(I will test on Alpha and SPARC as soon as I get home from work, in a couple of weeks.)

Marko

Malmberg

unread,
Sep 8, 2001, 6:44:05 PM9/8/01
to
> But this means that a cached lookup needs to walk through the inheritance graph.
> IC-s cache the target where the method is effectively found so no cycles are
> lost traversing the inheritance graph.

I think you've misunderstood the s-arrays. The s-array for a class
includes all methods for that class, including ones that have been eg.
inherited.

> I've made some simplified testing and it turns out that for a 10 deep
> inheritance there is a 6% difference between calling the root and the child
> (the test method returns the incremented argument). In this case IC are ~2.3
> times faster than cached lookup.

In obj-c? That shouldn't be possible, but I'll check.

> What is the relationship between lookup speed and the number of methods in
> that class ?

The lookup time is constant (for all valid selectors with
implementations after the first call to a method in that class, at
least).

> Why sends to "super" are not directly called, if dynamic inheritance is not used
> ? (A compiler flag perhaps?)
> Is this related to dynamic loading of classes ?

Dynamic loading of categories and poseAs: (I suppose that would qualify
as dynamic inheritance) would break, but it should be possible to add a
compiler flag to do it.

> Perhaps super sends are a good place for ICs (you get 99.99% hits)

100% until you use poseAs:, in which case you'd get 0% hits for that
class.

> Do you think that the increase in code size due to PICs, and its presence in the
> icache, could lead to performance loss comparable to the gain (~2 times) in
> message sends ? (the performance loss can show up somewhere else, that's my worry)

I don't know, and I wouldn't want to try to find out. Things like that
are very tricky to measure.

> Marko

- Alexander Malmberg


Malmberg

unread,
Sep 8, 2001, 6:57:56 PM9/8/01
to
> I don't understand. You inline a call, and this call will push the ip. The
> call will return and you go straight off.

But the different calls will return to different places, so you'll need
jumps to merge the branches.

Outline cache:

pushl arg_2
pushl arg_1
pushl receiver
pushl selector
call stub
// done

stub:
movl 4(%esp),%eax
testl %eax,%eax
jz objc_nil_method
movl (%eax),%eax
cmpl $some_class,%eax
jz some_method
jmp real_lookup

Inline cache:

pushl arg_2
pushl arg_1
pushl receiver
pushl selector
movl receiver,%eax
testl %eax,%eax
jz handle_nil
movl (%eax),%eax
cmpl $some_class,%eax
jz handle_class
call real_lookup
jmp merge
handle_nil:
call objc_nil_method
jmp merge
handle_class:
call some_method
merge:
// done

The path for a cache hit is the same length in both versions.

[snip]


> yes. but you don't know who are the callers. IC optimization is caller based,
> not target based. Some callers can benefit from it, others can only loose

This is probably the worst drawback, and I can't see any good way to fix
it.

- Alexander Malmberg

Marko Mikulicic

unread,
Sep 9, 2001, 4:33:42 AM9/9/01
to
Malmberg wrote:

>
>>But this means that a cached lookup needs to walk through the inheritance graph.
>>IC-s cache the target where the method is effectively found so no cycles are
>>lost traversing the inheritance graph.
>>
>
> I think you've misunderstood the s-arrays. The s-array for a class
> includes all methods for that class, including ones that have been eg.
> inherited.

uh ?

Erick M. Buck wrote:

>As long as the instance variable layout does not change, dynamically
>changing the super class of an existing class is one pointer operation.
>Each class object has a "superclass" variable. The message cache tables are
>per-class (At least in my runtime and Apple's runtime) so changing a
>superclass automatically changes the method cache with no extra effort.

Perhaps something changed in the runtime. But which way?

>
>> I've made some simplified testing and it turns out that for a 10 deep
>>inheritance there is a 6% difference between calling the root and the child
>>(the test method returns the incremented argument). In this case IC are ~2.3
>>times faster than cached lookup.
>>
>
> In obj-c? That shouldn't be possible, but I'll check.

It's only 6%, possibly a mistake (although I executed the test 20 times,
with a 40-second run each). I'd expect something more for a walkthrough.

But then poseAs: (dynamic inheritance) has to rebuild the whole s-array.

Is there more than one developed runtime implementation around (for GNUstep,
I mean)?

>
>> What is the relationship between lookup speed and the number of methods in
>>that class ?
>>
>
> The lookup time is constant (for all valid selectors with
> implementations after the first call to a method in that class, at
> least).

OK, as one would expect from sparse arrays.

>>Why sends to "super" are not directly called, if dynamic inheritance is not used
>>? (A compiler flag perhaps?)
>>Is this related to dynamic loading of classes ?
>>
>
> Dynamic loading of categories and poseAs: (I suppose that would qualify
> as dynamic inheritance) would break, but it should be possible to add a
> compiler flag to do it.

I heard of some problems with poseAs:. Someone said that it works only if
you call it when no instance of that class is alive (or at least the NeXT
runtime behaved that way). Is this just a bug or an implementation feature?
If so, it does not really qualify as dynamic inheritance. Also, if the
s-array contains the flattened method table, it must be rebuilt if the
parent changes, so it can't really be used for fast dynamic inheritance
(required by some applications of DI), though of course it can be acceptable
for slowly changing reparenting.

>
>>Perhaps super sends are a good place for ICs (you get 99.99% hits)
>>
>
> 100% until you use poseAs:, in which case you'd get 0% hits for that
> class.

poseAs could update the cache.
Or you could update on every miss, since the misses are so rare.

>
>>Do you think that the increase in code size due to PICs, and its presence in the
>>icache, could lead to performance loss comparable to the gain (~2 times) in
>>message sends ? (the performance loss can show up somewhere else, that's my worry)
>>
>
> I don't know, and I wouldn't want to try to find out. Things like that
> are very tricky to measure.

I'm not asking you to do it; I was only curious about your feeling on that.
We cannot know everything; we must speculate with far less data than strictly
required.
When I have more time I will implement two-level outline PICs (for
thread-safety, mostly) and we will see what the impact on overall performance
is, just out of curiosity.

Marko

Bissell, Tim

unread,
Sep 10, 2001, 6:32:36 AM9/10/01
to
Marko M wrote:

Adding methods to existing classes at runtime is completely routine. Adding
new classes at runtime is common. There used to be a free tool called eval
that would dynamically (just in time) recompile method implementations,
dynamically load the new implementations, and patch the runtime to use the
new versions with existing classes. In other words, you could edit the C
code of a method implementation while the program that uses the method was
running, press a button, and load the new implementation into the running
program.

----

Ages ago (four, five years?) I wrote a Smalltalk-on-Objective-C-runtime
system on NeXTStep. Each Smalltalk method in a class was a call to a
function which translated parameters, pushed them on a stack, and called a
Smalltalk interpreter (based on Tim Budd's Little Smalltalk). It worked
fairly well, except that the NeXT runtime would stop working if you unloaded
a class and then reloaded a class of the same name, which I had to do to add
(Obj-C-accessible) ivars to a class.
I'll have to dig it out and see if it works better with the latest 3.0 GCC
runtime...

Tim


