One of the things I want to do prior to the 1.0 release is change how the
VM manages memory so as to reduce its footprint and improve performance in
low memory situations. I'm not ready to get started on this yet, but I
want to describe the problem and what I propose to do about it.
First, I'll explain how things currently work.
When a new thread is created, we allocate a 64KB heap for its exclusive
use. As long as there is room left on this heap (and no other thread is
waiting to garbage collect), we allocate objects on it by simply
incrementing the heap index by the size of each object and returning a
pointer to the space allocated. If the size requested is greater than
64KB, we use a different mechanism, described later.
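As a rough sketch (using invented names, not Avian's actual code), the
fast path looks something like this:

    #include <cstdint>

    // 64KB thread-local heap with bump-pointer allocation
    const unsigned HeapSize = 64 * 1024;

    struct ThreadHeap {
      uint8_t data[HeapSize];
      unsigned index;  // offset of the next free byte
    };

    // Returns a pointer to 'size' bytes, or null if the heap is full,
    // in which case the caller falls back to a minor collection:
    void* tryAllocate(ThreadHeap* heap, unsigned size) {
      if (heap->index + size > HeapSize) return 0;
      void* p = heap->data + heap->index;
      heap->index += size;
      return p;
    }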
If the size is less than or equal to 64KB, but there is no space left on
the thread-local heap, we do a minor garbage collection. First, the
thread sets a VM-wide variable indicating that it needs exclusive access
to VM-managed memory and waits until all other threads become "idle",
where "idle" means either not doing anything at all (i.e. sleeping or
blocking) or executing native code, which does not have direct access to
VM-managed memory (except for the case of GetPrimitiveArrayCritical).
Any thread which is not already idle at this point will become idle the
next time it tries to allocate memory, invoke a native method, sleep or
block. Any thread which is idle during garbage collection will block if
it tries to become active again.
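To make the handshake concrete, here is a minimal sketch of how such an
idle/active protocol might look using a mutex and condition variable; the
names and structure are mine, not the VM's actual synchronization code:

    #include <condition_variable>
    #include <mutex>

    struct Machine {
      std::mutex lock;
      std::condition_variable cond;
      bool exclusiveRequested = false; // a thread wants to collect
      int activeCount = 0;             // threads touching VM-managed memory
    };

    // Called before sleeping, blocking, or entering native code:
    void enterIdle(Machine* m) {
      std::unique_lock<std::mutex> guard(m->lock);
      --m->activeCount;
      m->cond.notify_all();  // maybe the collector can proceed now
    }

    // Called when a thread needs VM-managed memory again; blocks while
    // a collection is in progress:
    void enterActive(Machine* m) {
      std::unique_lock<std::mutex> guard(m->lock);
      while (m->exclusiveRequested) m->cond.wait(guard);
      ++m->activeCount;
    }

    // Called by the thread which needs to collect; returns once it is
    // the only active thread:
    void acquireExclusive(Machine* m) {
      std::unique_lock<std::mutex> guard(m->lock);
      m->exclusiveRequested = true;
      while (m->activeCount > 1) m->cond.wait(guard);
    }

    void releaseExclusive(Machine* m) {
      std::unique_lock<std::mutex> guard(m->lock);
      m->exclusiveRequested = false;
      m->cond.notify_all();  // let idle threads become active again
    }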
Once all other threads are idle, the original thread may proceed to do a
minor garbage collection. At this point, we visit all objects reachable
from the roots (i.e. stack references, global references, etc.) except for
any objects which are tenured, as defined below. For each object visited,
we copy it to a new heap and update any live references to that object to
point to the new copy.
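In code, the copy step might look roughly like this (a sketch with
invented names; in particular, how forwarding information is stored is an
implementation detail I'm glossing over):

    #include <cstring>

    struct Object {
      unsigned size;
      Object* forward;  // null until the object has been copied
      // ... object fields follow
    };

    Object* copyObject(Object* o, void* (*allocate)(unsigned)) {
      if (o->forward) return o->forward;  // already copied this cycle
      Object* c = (Object*)allocate(o->size);
      memcpy(c, o, o->size);  // copies forward == null into the new copy
      o->forward = c;         // leave a forwarding pointer behind
      return c;
    }

    // For each live reference found while visiting, redirect it:
    void visit(Object** ref, void* (*allocate)(unsigned)) {
      *ref = copyObject(*ref, allocate);
    }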
In addition to the thread-local heaps, we define four other heaps, each of
which is allocated as a single, contiguous block of memory:
* gen1: contains objects which survived the last collection but have
not yet survived enough collections to reach tenure
* gen2: contains objects which have survived enough collections to
reach tenure, which means they are not visited during a minor
collection. During a minor collection, any live objects found in
gen1 which have reached tenure are copied to gen2.
* nextGen1: during collection, we copy all live objects found in
either thread-local heaps or gen1 which have not reached tenure to
this heap, which will replace gen1 when we are finished
* nextGen2: during a major collection, we copy to this heap all live
objects found in gen2, plus any found in gen1 which have reached
tenure
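To summarize the roles as a data-structure sketch (the names mirror the
list above; this is not Avian's actual code):

    struct Heap;  // a contiguous block of memory plus a bump index

    struct Heaps {
      Heap* gen1;     // survivors which have not yet reached tenure
      Heap* gen2;     // tenured objects; skipped by minor collections
      Heap* nextGen1; // destination for non-tenured survivors this cycle
      Heap* nextGen2; // destination for tenured survivors (major cycles)
    };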
Before we start a minor collection, we allocate nextGen1 with a size which
represents the worst case scenario: all objects on thread-local heaps and
all objects on gen1 (except those which are about to reach tenure) are
reachable. In addition, we ensure that gen2 has enough free space to
accommodate all objects on gen1 which might reach tenure during this
collection. If there is not enough free space in gen2, we upgrade this
collection to a major collection, and thus allocate nextGen2 with the
appropriate worst-case size, plus possibly additional space to delay
future major collections. There are also other heuristic-based criteria
which may cause us to upgrade a minor collection to a major one.
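The sizing logic, as a sketch (all names are illustrative):

    struct Plan {
      unsigned nextGen1Size;
      bool major;
    };

    Plan planCollection(unsigned threadHeapBytes, // total across threads
                        unsigned gen1Bytes,
                        unsigned tenuringBytes,   // gen1 bytes about to tenure
                        unsigned gen2FreeBytes)
    {
      Plan plan;
      // Worst case: everything in the thread-local heaps and gen1
      // survives, minus what moves to gen2 instead:
      plan.nextGen1Size = threadHeapBytes + gen1Bytes - tenuringBytes;
      // Upgrade to a major collection if gen2 can't absorb the newly
      // tenured objects (other heuristics may force this as well):
      plan.major = gen2FreeBytes < tenuringBytes;
      return plan;
    }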
If there is not enough memory to allocate these heaps, either due to the
memory ceiling established when the VM was created or due to a malloc
failure, we abort the process. This is something we obviously want to
avoid. At some point, I want to replace the abort with an OutOfMemoryError,
but right now I'm more concerned with avoiding this situation entirely
when possible.
In addition to copying each live object during collection, we increment
its age. The old age is stored in a bit set attached to gen1, while the
new age is recorded in a bit set attached to nextGen1. In both cases, the
bit set contains enough bits per object to represent every age up to the
tenure threshold.
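For example, with a tenure threshold of 3, two bits per object suffice;
reading and writing an age then involves shifting and masking within the
bit set (a sketch, with invented names):

    #include <cstdint>

    const unsigned BitsPerAge = 2;  // enough for ages 0..3

    unsigned getAge(const uint64_t* bits, unsigned objectIndex) {
      unsigned bit = objectIndex * BitsPerAge;
      return (bits[bit / 64] >> (bit % 64)) & ((1 << BitsPerAge) - 1);
    }

    void setAge(uint64_t* bits, unsigned objectIndex, unsigned age) {
      unsigned bit = objectIndex * BitsPerAge;
      uint64_t mask = uint64_t((1 << BitsPerAge) - 1) << (bit % 64);
      bits[bit / 64] = (bits[bit / 64] & ~mask)
        | (uint64_t(age) << (bit % 64));
    }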
Once an object's age reaches the tenure threshold, it is copied to gen2 or
nextGen2 as described above. These heaps also each have a bit set
attached to them, with one bit per object reference. We use this bit set
at runtime to mark dirty object references to be visited during a minor
collection. This addresses the following case: if we update a reference
inside a tenured object, "A", to point to a non-tenured object, "B", and
that is the only such reference to "B", we would never visit "B" during a
garbage collection since "A" is tenured and thus not visited during a
minor collection. To fix this, we mark the reference as dirty when it is
updated so that we remember to visit it during the next minor collection,
even though the rest of the references inside "A" might not be visited.
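In other words, a store to a reference field goes through a write barrier
which might look roughly like this (a sketch; the real bookkeeping is
attached to the gen2/nextGen2 heaps as described):

    #include <cstdint>

    struct Object;

    struct Heap {
      uint8_t* base;
      unsigned size;
      uint64_t* dirtyBits;  // one bit per reference slot in the heap
    };

    bool inHeap(Heap* h, void* p) {
      return (uintptr_t)p >= (uintptr_t)h->base
        && (uintptr_t)p < (uintptr_t)(h->base + h->size);
    }

    // Write barrier: store the reference, then mark the slot dirty if
    // the containing object is tenured, so the next minor collection
    // visits it:
    void setField(Heap* gen2, Object** field, Object* value) {
      *field = value;
      if (inHeap(gen2, field)) {
        unsigned slot = ((uint8_t*)field - gen2->base) / sizeof(Object*);
        gen2->dirtyBits[slot / 64] |= uint64_t(1) << (slot % 64);
      }
    }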
The above discussion describes how the VM manages small (i.e. less than
64KB), movable objects. In addition to these, we must also manage objects
which should not be moved, either because they are so large that it is
inefficient to move them at each collection, or because they must remain
at a fixed address for other reasons. These objects are referred to as
"fixed" or "fixies" in the code and are allocated in their own blocks of
memory, independent of the heaps described above. In addition to the
space needed for such objects, we include space for a header with which we
track the age of the object and link it to other objects for the purpose
of mark-and-sweep collection, plus a footer containing a bit set to track
its internal references once it reaches tenure.
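A fixie allocation might be laid out like this (a sketch; the actual
header fields differ):

    #include <cstdint>
    #include <cstdlib>

    // layout: [FixieHeader][object body][dirty bit set, once tenured]
    struct FixieHeader {
      unsigned age;
      bool marked;        // used during mark-and-sweep
      FixieHeader* next;  // links fixies together for the sweep phase
    };

    void* allocateFixie(unsigned bodySize, unsigned referenceCount,
                        bool tenured)
    {
      // one bit per internal reference, rounded up to whole bytes:
      unsigned footerSize = tenured ? (referenceCount + 7) / 8 : 0;
      uint8_t* p = (uint8_t*)
        malloc(sizeof(FixieHeader) + bodySize + footerSize);
      // ... initialize the header and link it into the fixie list
      return p + sizeof(FixieHeader);  // the caller sees only the body
    }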
So that's how the VM currently manages memory, but there is one major
problem with this solution.
The problem is that we waste large amounts of memory by allocating
nextGen1 and nextGen2 according to the worst-case scenario (i.e. all
objects are reachable), whereas the average case requires much less memory
because many, if not most, objects are garbage. The situation is even
worse for nextGen2, since we allocate up to twice as much as we need for
the worst case so as to delay the need for future major collections.
This problem can become fatal for the VM if there is no memory left to
allocate one or both of these heaps, since we abort on this condition.
One possible solution is to break up the heaps into non-contiguous blocks,
allocating more blocks only as necessary. This is what I tried initially,
but there was a performance problem. The garbage collection code must be
able to quickly determine which heap an object belongs to, if any, or if
it is an immovable object, so we know where to copy it or how to mark it,
respectively. It's important that this check happens quickly, since it is
performed several times for every object we visit. When a heap is
composed of multiple blocks, we must check each block individually to see
if the object is within its bounds, slowing things down dramatically.
This got me thinking about how to break up the heaps while still achieving
O(1) performance for the check described above. I also wanted to somehow
eliminate the bit sets attached to gen1 and nextGen1 for tracking object
ages. These bit sets are sparsely populated and somewhat complicated to
read and write. Here's what I came up with:
Memory for movable objects is allocated in blocks, each of which is 64KB
in size and is aligned on a 64KB boundary. Each block contains a header
and one or more objects, all of which are the same age. The header
contains the heap which the block belongs to, the age of the objects it
contains, and pointers to the next and previous blocks in a linked list.
Any block containing objects which have reached the tenure threshold also
includes a footer which is a bitmask used to mark dirty references.
Objects larger than 64KB minus the header and footer size are allocated
individually as immovable objects.
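A sketch of the proposed block layout (the field names are guesses based
on the description above):

    const unsigned BlockSize = 64 * 1024;

    struct Heap;

    struct Block {
      Heap* heap;       // which heap this block currently belongs to
      unsigned age;     // shared by every object in the block
      Block* next;      // doubly-linked list per heap (and per age)
      Block* previous;
      // objects follow; blocks at the tenure threshold end with a
      // dirty-reference bit set as a footer
    };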
To determine the block which an object belongs to, and thus its age and
heap, one simply ANDs the pointer to that object with ~0xFFFF to find the
block header, since all blocks are aligned on a 64KB boundary.
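Continuing the sketch above, the lookup is a single mask operation:

    #include <cstdint>

    Block* blockOf(void* object) {
      // works because every block is aligned on a 64KB boundary:
      return (Block*)((uintptr_t)object & ~(uintptr_t)(BlockSize - 1));
    }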
Each heap is just a linked list of these blocks, or, in the case of gen1
and nextGen1, several linked lists, one for each object age between one
and the tenure threshold. Garbage collection proceeds as described above,
except that blocks are allocated dynamically as needed. Immovable objects
are collected via mark-and-sweep as before.
That's it. I've left out some details, but hopefully you get the idea.
As always, questions and comments are welcome.
Another blog I came across: http://sebastiansylvan.com/2012/12/01/garbage-collection-thoughts/
What I have been thinking lately is that, instead of a complicated object
graph, perhaps this can be simplified into an intermediate form: you strip
objects down into tuples, which are then allocated, and you have
functions/mutators which mutate the tuples. If you perform such a
transformation, how can it be mapped back to the Java memory model?
These annotations can be safely ignored by another VM such as Oracle's or
OpenJDK. In Avian, perhaps they can receive special treatment. You could
also include a verification hash stamp to ensure the tags were compiled
and verified by Avian; if they were created by another compiler, Avian
would simply ignore them as well.
Best is to keep Avian multi-language, with special optimisations for
Scala, Groovy, JRuby, and Jython, which are the major JVM languages.
The LMS concept I mention below could perhaps also be used to generate
language-specific or otherwise optimised VMs. You could profile the code
paths in the VM for a given workload and write out a VM optimised to run
that workload; that is, the Avian VM could morph and rewrite itself for
specific workloads, languages, or constraints (to repeat myself for
clarity). This is a step beyond VM modularisation and self-tuning; in this
case it would be more of a VM construction kit than a VM itself, analysing
a new language and, in some cases, the workload to optimise for. By the
way, it might be worthwhile to dig up some papers on the Transmeta Crusoe
processor: the code morphing it did was at the microcode level, but
perhaps the same idea could be used in a VM too. In the profiling or
analysis runs we do not care much about performance.
There is a research project called Lightweight Modular Staging (LMS),
along with Delite, from which we could perhaps borrow optimisations to do
HotSpot-like optimisation, VM tuning, and VM morphing.
Another thought: at class-loading time, analyse the fields and
bulk-allocate memory.
Allocate a tuple for the first object's fields, followed by the fields of
the first field, and so on for all initialised fields, breaking cycles
with a defined strategy. Then any similar block allocated for an object
created in the constructor can be freed at the last reference, and any new
objects created in the constructors which are stored in fields extend the
block. Since all the objects are in fields and are contiguous, all of them
can be deallocated when the first object is no longer referenced;
conversely, while the first object is still referenced, there is no way
this block will contain dead objects. This works provided all the objects
in the fields are only referenced within this block, which is a much
looser criterion than the annotations.
To make this more efficient, you have to separate the memory-allocation
concerns from the mutation / state-transition concerns. By analysing the
bytecode at load time you can identify memory-allocation statements and
control-flow statements, then deeply analyse the memory allocation ahead
of time: if this branch is taken, this set of objects will need
allocation; if not, that set; and so on. This can be used for bulk
allocation and deallocation.
If you take this a step further, you can strip the code into three
concerns: memory management and access, control flow, and sequential /
block execution. The stripped-out control-flow concerns can help with deep
ahead-of-time branch prediction.
Tracing can be minimised. We could also see whether there is something to
take from C4, or maybe improve on it.
Perhaps metadata could also aid optimisation: for bundles, we could save
metadata which the Avian VM can use to get an edge, while any other VM
would simply ignore it and run as usual.
Perhaps the best course is to let Avian progress toward v1.0 as it is; but
having said that, here are a few ideas.
Best is to have an Intermediate Representation (IR).
I will be using the term "instructions", though code words (where the
opcode portion of a general instruction representation is stripped out
into a CW) would be better than instructions in this case; I will use the
terms interchangeably. The VM would also have different pipelines, as in a
VLIW / tagged-VLIW implementation, with certain IR instructions going only
into a particular pipeline type. Best is to have a decoder (changeable
depending on what is being run) which pushes the IR into each of these
pipelines.
All the internal threads would be lightweight threads, but the VM would
maintain a native thread pool to execute the OS and emulation
abstractions.
Where there are native calls, there is no need to switch to native
execution if a decoder exists for those instructions, even for machine
code compiled for different architectures. (Transmeta processors could
execute any architecture; this could be done at the VM level as well, and
perhaps we can target it.)
This could be used to make seamless calls between native code, foreign
machine instructions, Java, and .NET with little overhead.
If we put a proper architecture in place, this will not be difficult to
do.
Pulling things in will complicate matters and is best left to something
like ProGuard.
Best is to block-allocate (for any local variable), recursing into each of
the fields. If the main object loses its reference, we can delete the
whole block. If the object escapes, allocate a block for the escaping
objects and compact the block; an IO controller could do the trick without
using CPU cycles.
Since we are dealing with blocks, we can reference-count large blocks and
trace only when tracing is optimal.
If you have a tuple algorithm to flatten the block, there will be no
pointer-to-pointer lookups either.
With each cycle of look-ahead and trace-through optimisation, we could at
some point execute a native application faster on the VM than it runs
without the VM.
One point to note: we could use Clang, but for this kind of programming
there is a better tool, a language called CAL, which would be better for
implementing the pipelining / pipeline-execution logic. Best is to use the
right tool for the job: C++ where C++ is better, and perhaps CAL where it
can do a great job.