== 2 of 4 ==
Date: Thurs, Sep 11 2008 9:48 am
From: "David Griswold"
On Tue, Sep 9, 2008 at 1:12 AM, prunedtree <prune...@gmail.com> wrote:
> Dave: regarding PICs, I'm pretty sure it's critical that you have a
> monomorphic send attempt followed by a megamorphic send for the case
> where a call site is nearly-monomorphic.
I'm not sure how that would work. What you suggest sounds pretty much like
a standard inline cache. With type feedback, the form of the send needs to
record the polymorphism encountered at the send site; the kind of send you
suggest can't be distinguished from a truly megamorphic send. Such a send
would also be slower for megamorphic sends, since the cache would usually
miss; that wasted time is eliminated in Strongtalk, along with the updating
of the cache. The question is: when the cache misses, what do you do? If
you don't convert such sends into the megamorphic form (with no inline
cache), how do you detect megamorphic sends?
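To make the distinction concrete, here is a toy sketch, in Python with hypothetical names (the real site is compiled code, not an object), of the Strongtalk-style send site just described: a one-element inline cache that is rewritten as a cache-less megamorphic send on its first miss.

```python
# Toy model (hypothetical names) of a send site whose one-element inline
# cache decays to a cache-less megamorphic send on a miss.

class SendSite:
    def __init__(self, selector):
        self.selector = selector
        self.cache = None           # (class, method) pair, or None if unlinked
        self.megamorphic = False    # once True, no cache is consulted
        self.lookups = 0            # full lookups performed, for illustration

    def full_lookup(self, cls):
        self.lookups += 1
        return getattr(cls, self.selector)

    def send(self, receiver):
        cls = type(receiver)
        if self.megamorphic:                        # cache-less megamorphic send
            return self.full_lookup(cls)(receiver)
        if self.cache is not None and self.cache[0] is cls:
            return self.cache[1](receiver)          # inline-cache hit
        if self.cache is None:                      # first send: link the site
            method = self.full_lookup(cls)
            self.cache = (cls, method)
            return method(receiver)
        # A miss on a linked site: without a PIC we can't tell a nearly
        # monomorphic site from a truly megamorphic one, so convert the
        # site to the megamorphic form.
        self.megamorphic = True
        return self.full_lookup(cls)(receiver)

class A:
    def m(self): return "A.m"
class B:
    def m(self): return "B.m"

site = SendSite("m")
site.send(A()); site.send(A())   # link, then hit
site.send(B())                   # miss: the site goes megamorphic
site.send(A())                   # every send now does a full lookup
```

Note the information loss: after the third send the site looks identical whether it saw two classes or two hundred, which is exactly the type-feedback problem raised above.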
I am sure you are right that slightly polymorphic sends would be slower
without PICs or an inline cache, but I have my doubts about how important
they are statistically. As you pointed out, many of them become monomorphic
after inlining and/or customization; my intuition is that the resulting
distribution is highly bimodal, dominated by monomorphic and megamorphic
sends with not much in between. As you also pointed out, since boolean
control structures are hardcoded, that eliminates the biggest source of
slightly polymorphic sends.
> In essence, this is a
> degenerate PIC of length 1, and I guess it seems natural to allow
> bigger PICs.
As I said above, I don't think it really is like a PIC, since a PIC upgrades
itself to the next higher arity send when a cache miss occurs; it doesn't
just do a megamorphic send and update the cache. Allowing bigger PICs of
variable size was, I think, a big mistake we made in Strongtalk: suddenly
you get a lot of extra complexity for very little payoff. Variable-size
PICs cause fragmentation in the PIC area, requiring compaction (which isn't
done in the current system but would eventually be necessary).
I think eliminating variable size PICs would be a great improvement. But
there are two different ways that could be done, depending on how important
sends of arity 2 are, which as I said is not clear.
Um, variable arity doesn't imply variable size. In the VisualWorks VM I use PICs and megamorphic PICs. I call the former "closed PICs" because they have a maximum arity and hence apply to a closed set of classes; I call the latter "open PICs" because they apply to an open set of classes.
When a send site becomes polymorphic I allocate a PIC that has room for 8 entries and fill in only two. The last instruction of the PIC is a jump to a routine that extends the PIC. When the third class is encountered, the jump to the extend routine is overwritten by the dispatch for the third class, and a new jump to the extend routine is written. When the extend routine is reached from a PIC with all 8 entries filled, it finds or allocates an open PIC and binds the send site to the open PIC.
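A toy rendering of that scheme, in Python with names of my own invention (the real PICs are machine code with overwritten jump instructions, and are allocated with two entries already filled; this sketch simplifies by filling from empty):

```python
# Toy sketch of the closed/open PIC scheme described above.  Only the
# 8-entry limit is taken from the text; everything else is illustrative.

MAX_CLOSED_ENTRIES = 8

class OpenPIC:
    """Per-selector cache over an open set of classes, backed by a full
    method lookup; stands in for the first-level lookup cache."""
    def __init__(self, selector):
        self.selector = selector
        self.table = {}                            # class -> method
    def dispatch(self, receiver):
        cls = type(receiver)
        if cls not in self.table:                  # cache probe failed:
            self.table[cls] = getattr(cls, self.selector)  # full lookup
        return self.table[cls](receiver)

class ClosedPIC:
    """Fixed-capacity dispatch table; the 'jump to the extend routine'
    becomes the fall-through code at the end of dispatch()."""
    def __init__(self, selector, open_pics):
        self.selector = selector
        self.entries = []                          # (class, method) pairs
        self.open_pic = None                       # set once the PIC overflows
        self.open_pics = open_pics                 # shared selector -> OpenPIC
    def dispatch(self, receiver):
        cls = type(receiver)
        for entry_cls, method in self.entries:     # the inlined class tests
            if cls is entry_cls:
                return method(receiver)
        if self.open_pic is not None:              # site already rebound
            return self.open_pic.dispatch(receiver)
        if len(self.entries) < MAX_CLOSED_ENTRIES:
            method = getattr(cls, self.selector)   # extend: add one more case
            self.entries.append((cls, method))
            return method(receiver)
        # All 8 entries filled: find or allocate the open PIC for this
        # selector and rebind the send site to it.
        self.open_pic = self.open_pics.setdefault(
            self.selector, OpenPIC(self.selector))
        return self.open_pic.dispatch(receiver)

# Drive one send site through 10 receiver classes.
classes = [type(f"C{i}", (), {"m": (lambda i: lambda self: i)(i)})
           for i in range(10)]
open_pics = {}
site = ClosedPIC("m", open_pics)
results = [site.dispatch(c()) for c in classes]
```

Since every closed PIC has the same capacity, the PIC area can be a simple pool of fixed-size blocks, which is why no fragmentation arises.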
So no fragmentation, and plenty of type information. For VisualWorks I would typically see the following distribution:

    unlinked send        33%   of all send sites
    monomorphic send     60%
    polymorphic send      6%   (2 to 8 cases)
    megamorphic send      0.6%
i.e. the ratios of monomorphic to polymorphic to megamorphic sends are very close to 100 to 10 to 1.
Adding PICs resulted in a substantial performance increase, 30% to 50% for most Smalltalk-intensive code; recompiling the system, for example, was twice as fast. But crucially, even after adding PICs, class hierarchy traversal is the most expensive operation in the execution engine, still accounting for about 5% to 10% of total execution time. Note that in the steady state *all* of this lookup activity comes from open PICs, when the probe of the first-level method lookup cache fails (closed PICs optimize out doesNotUnderstand: lookup, so in the steady state there are no lookups from doesNotUnderstand:).
So what would happen if we eliminated closed PICs from the VisualWorks engine? All the polymorphic sends would fall through to open PICs, and since polymorphic sends outnumber megamorphic ones about ten to one, lookup activity would increase roughly tenfold. Performance would suffer accordingly: if lookup is 5% of total time, then 1 = 0.95 + 0.05 becomes 0.95 + (10 * 0.05) = 1.45. Ouch!
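That back-of-the-envelope estimate can be checked directly (assuming, as above, that lookup is 5% of total time and grows tenfold):

```python
# Sanity check of the estimate above.  The 5% and 10x figures are the
# assumptions stated in the text, not measurements of this sketch.
non_lookup = 0.95   # fraction of time spent outside method lookup
lookup = 0.05       # fraction of time spent in hierarchy lookup today
factor = 10         # assumed growth in lookup work without closed PICs

new_total = non_lookup + factor * lookup
print(round(new_total, 2))   # ~1.45x the original running time
```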
So I'm convinced that closed PICs pay their way.
The maximum size of a closed PIC, on the other hand, is open to tuning. While I used 8, 6 would be perfectly fine; I saw what looked to be a 1/x distribution of arity frequency in closed PICs.
If they are not that important, which I suspect is true, then separately
allocated PICs could be eliminated entirely, and we could use a degenerate
inline 1 element PIC for the monomorphic case, which patches to a cache-less
megamorphic send on cache miss. Nothing else would be needed.
If it turns out that sends of arity 2 are too important not to optimize,
then two-element allocated PICs could be retained. That would still be much
simpler than what we have now: all PICs would be the same size, so no
fragmentation would occur and no compaction would be needed.
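A sketch of that fixed-size variant, again in Python with names of my own invention: every allocated PIC has exactly two entries, and a miss on a full PIC permanently converts the site to a cache-less megamorphic send.

```python
# Hypothetical fixed-size scheme from the paragraphs above: every
# allocated PIC has exactly two entries, so the PIC area never fragments.

class TwoEntryPIC:
    def __init__(self, selector):
        self.selector = selector
        self.entries = []           # at most two (class, method) pairs
        self.megamorphic = False    # miss on a full PIC: give up on caching

    def send(self, receiver):
        cls = type(receiver)
        if not self.megamorphic:
            for entry_cls, method in self.entries:
                if cls is entry_cls:               # PIC hit
                    return method(receiver)
            if len(self.entries) < 2:              # room left: extend
                method = getattr(cls, self.selector)
                self.entries.append((cls, method))
                return method(receiver)
            self.megamorphic = True                # arity 3+: stop caching
        return getattr(cls, self.selector)(receiver)  # full lookup

class X:
    def m(self): return "X"
class Y:
    def m(self): return "Y"
class Z:
    def m(self): return "Z"

site = TwoEntryPIC("m")
site.send(X()); site.send(Y())   # fills both entries
site.send(Z())                   # third class: site becomes megamorphic
```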
> [...] It would actually be
> interesting to benchmark Strongtalk with various settings (no PICs,
> PICs limited to various sizes). If PICs of length 1 are found to be
> enough, then the PICs code could indeed be much simpler.
In fact, that is the first experiment I tried when Strongtalk went open
source, but it didn't appear to work the easy way, and I didn't follow up on
it. I tried just lowering the constant that determines the max PIC size to
1 and 2, but I got no change at all in the benchmarks I tried, and the
inlined code structure didn't appear to change. I suspect the constant is
also hardcoded somewhere else, perhaps in the assembly code, so changing it
in one place wasn't enough :-(. But I agree that actually getting the
experiment to work shouldn't be very hard and would be very, very