Memory Management

Joel Dice

Mar 18, 2008, 10:50:43 AM
to av...@googlegroups.com
Hi all.

One of the things I want to do prior to the 1.0 release is change how the
VM manages memory so as to reduce its footprint and improve performance in
low memory situations. I'm not ready to get started on this yet, but I
want to describe the problem and what I propose to do about it.

First, I'll explain how things currently work.

When a new thread is created, we allocate a 64KB heap for its exclusive
use. As long as there is room left on this heap (and no other thread is
waiting to garbage collect), we allocate objects on it by simply
incrementing the heap index by the size of each object and returning a
pointer to the space allocated. If the size requested is greater than
64KB, we use a different mechanism, described later.

If the size is less than or equal to 64KB, but there is no space left on
the thread-local heap, we do a minor garbage collection. First, the
thread sets a VM-wide variable indicating that it needs exclusive access
to VM-managed memory and waits until all other threads become "idle",
where "idle" means either not doing anything at all (i.e. sleeping or
blocking) or executing native code, which does not have direct access to
VM-managed memory (except for the case of GetPrimitiveArrayCritical).
Any thread which is not already idle at this point will become idle the
next time it tries to allocate memory, invoke a native method, sleep or
block. Any thread which is idle during garbage collection will block if
it tries to become active again.

Once all other threads are idle, the original thread may proceed to do a
minor garbage collection. At this point, we visit all objects reachable
from the roots (i.e. stack references, global references, etc.) except for
any objects which are tenured, as defined below. For each object visited,
we copy it to a new heap and update any live references to that object to
point to the new copy.

In addition to the thread-local heaps, we define four other heaps, each of
which is allocated as a single, contiguous block of memory:

* gen1: contains objects which survived the last collection but have
not yet survived enough collections to reach tenure

* gen2: contains objects which have survived enough collections to
reach tenure, which means they are not visited during a minor
collection. During a minor collection, any live objects found in
gen1 which have reached tenure are copied to gen2.

* nextGen1: during collection, we copy all live objects found in
either thread-local heaps or gen1 which have not reached tenure to
this heap, which will replace gen1 when we are finished

* nextGen2: during a major collection, we copy to this heap all live
objects found in gen2, plus any found in gen1 which have reached
tenure

Before we start a minor collection, we allocate nextGen1 with a size which
represents the worst case scenario: all objects on thread-local heaps and
all objects on gen1 (except those which are about to reach tenure) are
reachable. In addition, we ensure that gen2 has enough free space to
accommodate all objects on gen1 which might reach tenure during this
collection. If there is not enough free space in gen2, we upgrade this
collection to a major collection, and thus allocate nextGen2 with the
appropriate worst-case size, plus possibly additional space to delay
future major collections. There are also other heuristic-based criteria
which may cause us to upgrade a minor collection to a major one.

If there is not enough memory to allocate these heaps, either due to the
memory ceiling established when the VM was created or due to a malloc
failure, we abort the process. This is something we obviously want to
avoid. At some point, I want to replace the abort with an OutOfMemoryError,
but right now I'm more concerned with avoiding this situation entirely
when possible.

In addition to copying each live object during collection, we increment
its age. The old age is stored in a bit set attached to gen1, while the
new age is recorded in a bit set attached to nextGen1. In both cases, the
bit set contains enough bits per object to represent every age up to the
tenure threshold.

Once an object's age reaches the tenure threshold, it is copied to gen2 or
nextGen2 as described above. These heaps also each have a bit set
attached to them, with one bit per object reference. We use this bit set
at runtime to mark dirty object references to be visited during a minor
collection. This addresses the following case: if we update a reference
inside a tenured object, "A", to point to a non-tenured object, "B", and
that is the only such reference to "B", we would never visit "B" during a
garbage collection since "A" is tenured and thus not visited during a
minor collection. To fix this, we mark the reference as dirty when it is
updated so that we remember to visit it during the next minor collection,
even though the rest of the references inside "A" might not be visited.

The above discussion describes how the VM manages small (i.e. less than
64KB), movable objects. In addition to these, we must also manage objects
which should not be moved, either because they are so large that it is
inefficient to move them at each collection, or because they must remain
at a fixed address for other reasons. These objects are referred to as
"fixed" or "fixies" in the code and are allocated in their own blocks of
memory, independent of the heaps described above. In addition to the
space needed for such objects, we include space for a header with which we
track the age of the object and link it to other objects for the purpose
of mark-and-sweep collection, plus a footer containing a bitset to track
its internal references once it reaches tenure.

So that's how the VM currently manages memory, but there is one major
problem with this solution.

The problem is that we waste large amounts of memory by allocating
nextGen1 and nextGen2 according to the worst-case scenario (i.e. all
objects are reachable), whereas the average case requires much less memory
because many, if not most, objects are garbage. The situation is even
worse for nextGen2, since we allocate up to twice as much as we need for
the worst case so as to delay the need for future major collections.
This problem can become fatal for the VM if there is no memory left to
allocate one or both of these heaps, since we abort on this condition.

One possible solution is to break up the heaps into non-contiguous blocks,
allocating more blocks only as necessary. This is what I tried initially,
but there was a performance problem. The garbage collection code must be
able to quickly determine which heap an object belongs to, if any, or if
it is an immovable object, so we know where to copy it or how to mark it,
respectively. It's important that this check happens quickly, since it is
performed several times for every object we visit. When a heap is
composed of multiple blocks, we must check each block individually to see
if the object is within its bounds, slowing things down dramatically.

This got me thinking about how to break up the heaps while still achieving
O(1) performance for the check described above. I also wanted to somehow
eliminate the bit sets attached to gen1 and nextGen1 for tracking object
ages. These bit sets are sparsely populated and somewhat complicated to
read and write. Here's what I came up with:

Memory for movable objects is allocated in blocks, each of which is 64KB
in size and is aligned on a 64KB boundary. Each block contains a header
and one or more objects, all of which are the same age. The header
contains the heap which the block belongs to, the age of the objects it
contains, and pointers to the next and previous blocks in a linked list.
Any block containing objects which have reached the tenure threshold also
includes a footer which is a bitmask used to mark dirty references.
Blocks larger than 64KB minus the header and footer size are allocated
individually as immovable objects.

To determine the block which an object belongs to, and thus its age and
heap, one simply ANDs the pointer to that object with ~0xFFFF to find the
block header, since all blocks are aligned on a 64KB boundary.

Each heap is just a linked list of these blocks, or, in the case of gen1
and nextGen1, several linked lists, one for each object age between one
and the tenure threshold. Garbage collection proceeds as described above,
except that blocks are allocated dynamically as needed. Immovable objects
are collected via mark-and-sweep as before.

That's it. I've left out some details, but hopefully you get the idea.
As always, questions and comments are welcome.

Suminda Dharmasena

Sep 14, 2013, 5:28:54 AM
to av...@googlegroups.com, di...@mailsnare.net
Hi,

1st impression comment. What I am saying may be tangential to what is discussed here. Perhaps you can share your thoughts on this.

We can give the memory management a helping hand. If we had the following annotations things would be easy:
  • @Contained
    • Class
      • At this level it means the fields or any of the object structure cannot leak out
      • Any getters / functions can only return clones
      • Setters / functions can only assign clones of the objects to fields
      • Only clones of fields or local variables can be passed to calls on other objects whose parameters are not @Contained
      • Any parameter passed to the function can be passed on to a function which is @Contained
      • Same effect as marking all fields @Contained
      • The rationale is that this object and any contained objects can be deleted when this object is deleted. Ideally these can be blocked together and released together. If this object is referenced by a @Contained local variable then it can be released at the last reference without having to GC. It can also be stack allocated.
    • Field
      • Cannot be leaked out, i.e., cannot be returned or passed in place of a parameter which is not marked as @Contained
      • Can only return clones
      • Can only pass clones to function parameters not marked as @Contained
      • When the current object is GCed, all @Contained fields it holds, and any objects which are also recursively @Contained, are eligible for release. If they were block allocated, the whole block can be released at once.
    • Parameter
      • Cannot become part of the object, i.e., cannot be assigned to a static or a field, and cannot be passed as a function parameter which is not marked as @Contained
      • Only clones can be assigned to fields, and only clones can be passed in place of parameters which are not @Contained
      • The rationale is that if you hold only one reference to a @Contained local variable and pass it to a function, when it returns you still have only one reference, so when the variable goes out of scope it can be safely deleted without having to GC
    • Function
      • All parameters and locally created objects do not leak, except when the function is marked @ReturnContained. @Contained local variables do not leak anyway, so they can be used within their usual constraints.
    • Local variable
      • Cannot be passed in place of a non-@Contained function parameter
  • @ContainAll
    • Marks the fields, local variables, functions and parameters of a class @Contained
  • @ContainStrict
    • All fields can only be assigned objects whose fields are recursively marked @Contained. Can be applied at the class, field or local variable level.
    • This part of the object graph can be block allocated and released at once. If there is high memory pressure then you can scavenge some free space from the block and resize it. It can also have staged release and reuse; for this, such a block can carry metadata about sub-blocks.
  • @ReturnContained
    • Can only return a @Contained local variable, a @Contained parameter, this if the object is @Contained, or an object returned from a function marked @ReturnContained
  • @ReleaseAtBlockEnd
    • Eligible for release at the end of the block. If in a loop, a newly created object can be assigned to the same memory slot. If feasible, retain the old object and progressively apply mutations to make it the new object.
  • @ReleaseAtBlockExit(pool=n)
    • Eligible for release after block exit. When in a loop: allocate a pool for the objects to be created for the local variable, create one or more object templates, and initialise the pool to these templates. Apply a mutator to a template to get the new state of the object, adjust references accordingly, and apply the constructor state transformation minus the memory allocation for pre-pooled memory. Release the pool at the end. This is useful when successive new objects assigned to the handle have less variance from the template than from the previous state of the object.
  • @ReleaseAtReturn(pool=n)
    • Similarly, pool until the function returns.
These can be inserted by a preprocessor or the compiler itself. Perhaps they are also deducible by some dynamic analysis using Delite / LMS, plus some further optimisation.

Any items without these annotations will be allocated normally and will go through the normal GC process.

Perhaps when using LLVM we might need a non-movable memory block to start with while exploring how to be more effective.




Perhaps we can get some ideas from Nimrod (http://nimrod-code.org/) also.

S

Suminda Dharmasena

Sep 14, 2013, 5:30:03 AM
to av...@googlegroups.com, di...@mailsnare.net
Also, we can get some ideas from Azul's C4 collector.

Suminda Dharmasena

Sep 14, 2013, 7:26:10 AM
to av...@googlegroups.com, di...@mailsnare.net
Another blog I came across: http://sebastiansylvan.com/2012/12/01/garbage-collection-thoughts/

What I have been thinking lately is that instead of a complicated object graph, perhaps this can be simplified into an intermediate form. You strip objects into tuples, and you have functions / mutators which mutate the tuples. If you perform such a transformation, how can it be mapped back to the Java memory model?

Joshua Warner

Sep 14, 2013, 12:37:31 PM
to av...@googlegroups.com, di...@mailsnare.net
Interesting ideas overall.  A few comments:

You have to be careful with java annotations because (in my mind at least), they shouldn't modify the functional behavior of the code itself. The VM will have to check all of the annotations individually, and then ignore (rather than throwing a VMError or something) those that fail the test.  It then has to recheck any annotations that depended on the outcome of that analysis, and so forth.  When we can't bail out early and just refuse to compile a class, such checks can quickly become expensive, at least when they fail (which, barring other measures, would be all too often).

I think there's limited benefit from allocating objects together (i.e. while still using pointers to objects).  The garbage collector still has to trace all of them to determine the liveness of objects they reference (and, in avian's gen1, to copy them).  The only way I can see to cut down on the GC's work would be to inline their fields directly in the parent classes - but this can make it expensive to swap in a different object - so we'd only want to apply this optimization in cases where the field is likely static.  This also impacts hashing and equality - so we'd have to make sure that the inlined object had them properly overloaded so it won't be noticeable.

Real benefits come if we can guarantee that whole subgraphs of objects don't need to be traced.  Immutability helps dramatically here.  Isolation, a stronger version of uniqueness that indicates the sole ingoing pointer to a subgraph in the entire VM, would also help immensely.  We might be able to then store the entire subgraph in an arena (as in arena-based memory management), and free it as one.  This paper covers these ideas well, even though it's focused on the parallelism benefits rather than the GC benefits: http://research.microsoft.com/pubs/170528/msr-tr-2012-79.pdf.

Many of these ideas would be difficult to implement in Java while maintaining compatibility. I would tend to take the approach of picking or inventing a different language that has these properties.  I've long wanted to have avian be a multi-language VM, but haven't settled on a good choice for another language that seems like it could actually be useful.

-Joshua


On 14 September 2013 05:26, Suminda Dharmasena <sirina...@gmail.com> wrote:
Another blog I came across: http://sebastiansylvan.com/2012/12/01/garbage-collection-thoughts/

What I have been thinking lately is that instead of a complicated object graph, perhaps this can be simplified into an intermediate form. You strip objects into tuples, and you have functions / mutators which mutate the tuples. If you perform such a transformation, how can it be mapped back to the Java memory model?


Suminda Dharmasena

Sep 14, 2013, 3:11:57 PM
to av...@googlegroups.com, di...@mailsnare.net
These annotations can be safely ignored in another VM like Oracle or OpenJDK.
 
In Avian perhaps you can have special treatment. You can also have a verification hash stamp to ensure these tags were compiled and verified by Avian. If they were created by another compiler then just ignore them in Avian also.
 
Best is to keep Avian multi-language, with special optimisations for Scala, Groovy, JRuby and Jython, which are the major JVM languages. Also, the LMS concept I mention below can perhaps be used to generate language-specific or optimised VMs. You can profile the code paths in the VM for a workload and write out a VM optimal for running that workload. That is, the Avian VM can morph itself and rewrite itself for specific workloads, languages or constraints, if I am to repeat myself for clarity. A step beyond VM modularisation and self-tuning. In this case it will be more of a VM constructor kit than a VM itself: analyse the new language itself and in some cases the workload optimised for it. BTW, it might be worthwhile to dig up some papers on the Transmeta Crusoe processor; the code morphing it did was at the microcode level, and perhaps this can be used in a VM also. In the profiling or analysis runs we do not care much about performance.
 
There are research projects called Lightweight Modular Staging and Delite, from which we can perhaps borrow optimisations to do HotSpot-like optimisation, VM tuning and VM morphing.
 
 
Another thought: at class loading time, analyse the fields and bulk-allocate memory.
 
Allocate a tuple for the 1st object's fields, followed by the fields of the 1st field, and so on and so forth for all initialised fields, breaking cycles with a defined strategy. Then any similar block allocated to any object created in the constructor can be freed at the last reference. Any new objects created in the constructors which are stored in fields extend the block. As all objects are in fields and are contiguous, all can be deallocated when the 1st object is not referenced any more. Also, if the 1st object is referenced, there is no way this block will have dead objects. This can be done if all the objects in the fields are only referenced within this block. Much looser criteria than the annotations.
 
To make this more efficient you have to separate the memory allocation concerns from the mutation / state transition concerns. By analysing the bytecode at load time you can identify memory allocation statements and control flow statements. Then you can deeply analyse the memory allocation beforehand: if this branch is taken, these objects will need allocation; if not, this other set; and so on and so forth. This can be used for bulk allocation and deallocation.
 
If you take this a step further you can split this into 3: memory management and access concerns, control flow concerns, and sequential / block concerns. The stripped-out control concerns can help in deep ahead-of-time branch prediction.
 
Tracing can be minimised. Also, we can see if we can take something from the C4, or maybe improve on it.
 
 
Also, perhaps something that can aid optimisation is metadata. Perhaps for bundles we can save metadata which the Avian VM can use to get an edge. Any other VM will not use it and will run as it usually does.

Joshua Warner

Sep 14, 2013, 4:26:34 PM
to av...@googlegroups.com

These annotations can be safely ignored in another VM like Oracle or OpenJDK.

As long as they're designed that way, yes.  It's just a slippery slope to do stuff like that.
 
 
In Avian perhaps you can have special treatment. You can also have a verification hash stamp to ensure these tags were compiled and verified by Avian. If they were created by another compiler then just ignore them in Avian also.

That's certainly an option.  Keep in mind, though, that avian doesn't have a java source -> java bytecode compiler, so the first time it sees the java code (right now, at least), is when it loads the code for execution.
 
 
Best is to keep Avian multi-language, with special optimisations for Scala, Groovy, JRuby and Jython, which are the major JVM languages.
 
I'd rather stay away from optimizations that target particular JVM languages, in favor of optimizations that are pretty broadly applicable (even to non-java-based languages, preferably).
 
Also, the LMS concept I mention below can perhaps be used to generate language-specific or optimised VMs. You can profile the code paths in the VM for a workload and write out a VM optimal for running that workload. That is, the Avian VM can morph itself and rewrite itself for specific workloads, languages or constraints, if I am to repeat myself for clarity. A step beyond VM modularisation and self-tuning. In this case it will be more of a VM constructor kit than a VM itself: analyse the new language itself and in some cases the workload optimised for it. BTW, it might be worthwhile to dig up some papers on the Transmeta Crusoe processor; the code morphing it did was at the microcode level, and perhaps this can be used in a VM also. In the profiling or analysis runs we do not care much about performance.

Sounds interesting, I'll have to read up on that.  My guess is that that will require having the majority of the VM written in an analyzable form, which would in turn require either a rewrite (from c++ to some higher-level language, probably), or tying ourselves to a particular c++ compiler toolchain (clang would be the obvious choice).  Otherwise, there's no way to transform the VM itself.

That'd be a lot of work.  And I'm not sure what came out the other side would be recognizable as "Avian" anymore.  Probably something closer to a meta-circular VM like pypy or Maxine.

 
There are research projects called Lightweight Modular Staging and Delite, from which we can perhaps borrow optimisations to do HotSpot-like optimisation, VM tuning and VM morphing.

My guess is most of those optimizations won't be directly applicable in avian (though I could be wrong).  They look like they all generate high-level code (in compiled languages), whereas avian's focus is on generating machine code.
 
Another thought: at class loading time, analyse the fields and bulk-allocate memory.
 
Allocate a tuple for the 1st object's fields, followed by the fields of the 1st field, and so on and so forth for all initialised fields, breaking cycles with a defined strategy. Then any similar block allocated to any object created in the constructor can be freed at the last reference. Any new objects created in the constructors which are stored in fields extend the block. As all objects are in fields and are contiguous, all can be deallocated when the 1st object is not referenced any more. Also, if the 1st object is referenced, there is no way this block will have dead objects. This can be done if all the objects in the fields are only referenced within this block. Much looser criteria than the annotations.

This sounds a lot like region-based memory management as implemented in ParaSail (https://forge.open-do.org/plugins/moinmoin/parasail/).  I think that'd be really tricky to do in Java, which is really built to use a traditional garbage collector behind the scenes.

You'd probably have much better luck designing an intermediate language that provides the necessary guarantees, and then translating Java bytecode into that - rather than trying to shoehorn a bunch of analysis routines into avian.
 
 
To make this more efficient you have to separate the memory allocation concerns from the mutation / state transition concerns. By analysing the bytecode at load time you can identify memory allocation statements and control flow statements. Then you can deeply analyse the memory allocation beforehand: if this branch is taken, these objects will need allocation; if not, this other set; and so on and so forth. This can be used for bulk allocation and deallocation.

Being able to analyze memory allocation patterns and recover useful information from it is still a very active area of research.  I don't think it's mature enough to do anything like what you're suggesting, particularly when applied directly to Java.
 
 
If you take this a step further you can split this into 3: memory management and access concerns, control flow concerns, and sequential / block concerns. The stripped-out control concerns can help in deep ahead-of-time branch prediction.

I don't understand what benefits this would confer.
 
Tracing can be minimised. Also, we can see if we can take something from the C4, or maybe improve on it.

When I last looked into Azul's technology, it looked like it required pretty low-level and intense cooperation from the operating system - which is not something Avian can rely on.
 
Also, perhaps something that can aid optimisation is metadata. Perhaps for bundles we can save metadata which the Avian VM can use to get an edge. Any other VM will not use it and will run as it usually does.

At this point, you'd probably be better off dropping Java class files as an intermediate representation, and instead adopting something more fitting to the analyses you're doing.

-Joshua 

Suminda Dharmasena

Sep 15, 2013, 5:07:56 PM
to av...@googlegroups.com
Perhaps the best is to let Avian progress toward V1 as it is. But having said that, here are a few ideas.
 
Best is to have an Intermediate Representation (IR). I will be using instructions, but code words (the opcode portion of a general instruction representation, stripped out into a CW) would be better than instructions in this case; I will be using the terms interchangeably. Also, the VM will have different pipelines, as in a VLIW / tagged-VLIW implementation. Certain IR instructions will go into a particular pipeline type only. The instructions and pipeline types will be:
  • Arithmetic logic and other operations pipeline / instructions (multiple pipelines)
  • Branch and loop pipeline / instructions
  • Memory allocation, access and dereference (planning memory allocation with minimum tracing; can be used for virtual function resolution and reference tracking; can also be broken into 3 pipelines, as each of them is important in its own right)
  • Underlying OS calls pipeline (since this is a VM we can have this. It will make things efficient, as these calls are expensive, and perhaps if this pipeline is run on a native thread then we might prevent a context switch so the cache does not get flushed too often. Must look into this further.)
  • Foreign OS abstraction layer call pipeline (will come back on this later)
The best is to have a decoder (changeable depending on what is to run) which will push the IR into each of these pipelines.
 
Sequence instructions will be stuffed into the sequence pipelines such that instructions across the multiple sequence pipelines will not have dependencies. When a branch instruction is encountered, the decoder will stuff it into the branching pipeline, tagging the point in the sequence, memory and other pipelines it corresponds to. The sequence pipelines continue to be filled with the next instructions. Also, when a memory-related operation is encountered, it is put into the memory pipeline. If an OS call is encountered, it is put into the relevant pipeline. The memory pipeline for allocation will record by which point the allocation must be final, at which point the object type must be determined (for virtual calls), at what point the memory will be referenced, and when we would lose the reference to the memory.
 
All the internal threads will be lightweight threads, but we will maintain a native thread pool to execute the OS and emulation abstractions.
 
All sequence, memory and other pipelines execute until a branch marker is hit in a particular pipeline, in which case it stalls if the branch is not determined (or does a speculative look-ahead if that can be done).
 
When all sequence pipelines block for IO, the decoder kicks in and starts stuffing new IR if the pipelines are not filled. If they are reasonably full, then forward optimisation of the code to be executed is done on selected pipelines. There will be a marker up to which forward optimisation, AOT compilation and memory requirement analysis have been done.
 
In the case of native calls, there is no need to switch to native execution if there is a decoder for these instructions - even machine code compiled for different architectures. (Transmeta processors could execute any architecture; this can be done at the VM stage also. Perhaps we can target this.) The relevant decoder needs to kick in and stuff the particular pipelines. The decoder should be able to know when a system call is made and whether it relates to the underlying OS. If it is the underlying OS, it is put into that pipeline; otherwise it is put into the OS wrapper pipeline. The calls will be made against a wrapper which handles them.
 
This can be used to have seamless calls between Native, Foreign Machine Instructions, Java and .Net with little overhead.
 
If we put up a proper architecture this will not be difficult to do.
 
Also back to memory in depth.
 
Pulling things in will complicate matters and is best left to something like ProGuard.
 
Best is to block-allocate (for any local variable), recursing into each of the fields. If the main object loses its references, we can delete the whole block. If the object escapes, allocate a block for the escaping objects and compact the block. An IO controller can do the trick without having to use CPU cycles. Since we are dealing with blocks, we can reference count large blocks and trace when tracing is optimal.
 
If you have a tuple algorithm to flatten the block, there will be no pointer-to-pointer lookup either.
 
With each cycle of look-ahead optimisation and trace-through optimisation, we can at some point execute a native application much faster on the VM than it would run without the VM.
 
One point to note: we could use Clang, but for this kind of programming there is a better tool - a language called CAL. It would be better for implementing the pipelining and pipeline execution logic. Best is to use the right tool for the job: C++ where C++ is better, and perhaps CAL where it can do a great job.
 
Suminda

Joshua Warner

Sep 15, 2013, 6:46:59 PM
to av...@googlegroups.com

Perhaps the best is to let Avian progress toward V1 as it is. But having said that, here are a few ideas.

I don't think it's necessary to wait on architecture changes, but what we do need is a smaller number of well-thought-out and actionable ideas.
 
 
Best is to have an Intermediate Representation (IR).

Avian already has an IR - albeit an ad-hoc, poorly-formalized one.  One of the first steps would be to formalize and aggressively test the existing IR.
 
I will be using Instructions, but Code Words (the op-code portion of the general instruction representation is stripped out into a CW) will be better than Instructions in this case. I will be using the terms interchangeably. Also, the VM will have different pipelines, as in a VLIW / tagged-VLIW implementation. Certain IR instructions will go into a particular pipeline type only. The best is to have a decoder (changeable depending on what to run) which will push the IR into each of these pipelines.

I'm having a hard time understanding what the concrete benefits of this system would be.  It seems like this would throw the traditional pass-based optimizer out the window, and we'd need a good reason to believe doing so will lead to net benefits.
  
All the internal threads will be lightweight threads, but will maintain a native thread pool to execute the OS and Emulation Abstractions.

I agree.  Green threads (at least as an option, if not the default) would be a cool feature.
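For concreteness, a green-thread system boils down to a user-space scheduler that multiplexes many cooperative tasks onto one carrier thread. The sketch below shows only the round-robin scheduling idea; the `GreenScheduler` and `GreenTask` names are hypothetical, not anything from Avian's codebase:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A minimal round-robin scheduler for cooperative "green" tasks.
// Each task runs one slice in step() and yields by returning true;
// returning false means the task has finished.
public class GreenScheduler {
    interface GreenTask {
        boolean step(); // run one slice; return false when finished
    }

    private final Deque<GreenTask> runnable = new ArrayDeque<>();

    void spawn(GreenTask t) { runnable.addLast(t); }

    void run() {
        while (!runnable.isEmpty()) {
            GreenTask t = runnable.removeFirst();
            if (t.step()) {          // task yielded voluntarily...
                runnable.addLast(t); // ...so requeue it for another slice
            }
        }
    }

    public static void main(String[] args) {
        GreenScheduler s = new GreenScheduler();
        StringBuilder log = new StringBuilder();
        for (char name : new char[] {'A', 'B'}) {
            s.spawn(new GreenTask() {
                int slices = 2;
                public boolean step() {
                    log.append(name);
                    return --slices > 0;
                }
            });
        }
        s.run();
        System.out.println(log); // the two tasks interleave: ABAB
    }
}
```

The interleaved output is the whole point: neither task blocks the other, and no native thread is created per task.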
 
In the case where there are native calls, there is no need to switch to native execution if a decoder is there for these instructions, even for machine code compiled for different architectures. (Transmeta processors could execute any architecture; this can be done at the VM stage also. Perhaps we can target this.)

Unless I misunderstand, it sounds like you're talking about dynamic binary translation.  Possible in theory, but one HELL of a large project, and far beyond Avian's goals.
  
This can be used to have seamless calls between native code, foreign machine instructions, Java and .NET with little overhead.

Careful claiming that this will be low-overhead.  Last I checked, the state of the art in binary translation typically runs a few tens of percent slower than a native binary - and that's when it actually translates correctly.  It's horribly complicated and error-prone.
 
If we put a proper architecture in place, this will not be difficult to do.

On the contrary, it's damn difficult to do.  Take a look at this paper: http://theory.stanford.edu/~sbansal/pubs/osdi08.pdf.
  
Pulling things in will complicate matters and is best left to something like ProGuard.

I wouldn't rule out doing simple, local and safe transformations in avian itself - like promoting a non-escaping object to stack allocation, or inlining a trivial method.
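To make the stack-allocation case concrete, here is a hedged sketch of what a scalar-replacement pass could do with a non-escaping object. The class and method names are hypothetical, and the "after" version is written out by hand to show the shape a compiler would produce internally:

```java
// Sketch of the transformation an escape-analysis pass could perform:
// a Point allocated only to be read back never escapes manhattan(),
// so the compiler may replace it with its scalar fields and avoid
// the heap allocation entirely.
public class EscapeDemo {
    static final class Point {
        final int x, y;
        Point(int x, int y) { this.x = x; this.y = y; }
    }

    // Before: allocates a Point that never leaves this frame.
    static int manhattan(int x, int y) {
        Point p = new Point(x, y);       // candidate for elimination
        return Math.abs(p.x) + Math.abs(p.y);
    }

    // After scalar replacement: the object is gone; its fields
    // live in locals/registers instead.
    static int manhattanScalarized(int x, int y) {
        int px = x, py = y;
        return Math.abs(px) + Math.abs(py);
    }

    public static void main(String[] args) {
        System.out.println(manhattan(-3, 4));            // 7
        System.out.println(manhattanScalarized(-3, 4));  // 7
    }
}
```

The two methods are observably equivalent, which is exactly what makes this a safe, local transformation.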
 
Best is to block-allocate (for any local variable), recursing into each of the fields. If the main object loses its reference we can delete the whole block. If the object escapes, allocate a block for the objects and compact the block. An IO controller can do the trick without having to use CPU cycles.

I haven't heard of any hardware that can do this sort of thing, outside of a tangential resemblance to the register-window spiller on SPARC hardware.  And if it's not in hardware, it's using CPU cycles.
 
Since we are dealing with blocks we can reference-count large blocks and trace when tracing is optimal.

I expect you'd find that most of the time this optimization will not give you anything.  There are all sorts of object graph structures that will not be conglomerated and freed together, and you'll be reduced to reference-counting individual objects.
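A small sketch of that degradation, with hypothetical names throughout: once two blocks share a child, releasing one block's count tells you nothing about the shared child, so correctness forces you back to tracking counts object by object rather than freeing whole regions at once:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of why per-block reference counting degrades: blocks that
// share children cannot be freed as a unit, because the shared child
// may still be reachable through the other block.
public class BlockRefCount {
    static final class Block {
        int refs;
        final List<Block> children = new ArrayList<>();
    }

    static void addChild(Block parent, Block child) {
        parent.children.add(child);
        child.refs++;
    }

    // Releasing a block only frees children whose count drops to zero.
    static int release(Block b) {
        int freed = 0;
        if (--b.refs == 0) {
            freed = 1;
            for (Block c : b.children) freed += release(c);
        }
        return freed;
    }

    public static void main(String[] args) {
        Block a = new Block(); a.refs = 1;  // rooted
        Block b = new Block(); b.refs = 1;  // rooted
        Block shared = new Block();
        addChild(a, shared);
        addChild(b, shared);                // shared.refs == 2
        System.out.println(release(a));     // frees only a; shared still held by b
        System.out.println(release(b));     // frees b, and then shared
    }
}
```

And this sketch doesn't even touch cycles, which plain reference counting cannot reclaim at all without a tracing backup.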
 
If you have a tuple algorithm to flatten the block, there will be no pointer-to-pointer lookups either.

Not sure what you mean by this.
 
 
With each cycle of look-ahead optimisation and trace-through optimisation, we can at some point execute a native application much faster on the VM than it runs without the VM.

Read the super-optimizer paper I linked to above.  It's possible, but only if the native compiler didn't already do a better job at optimizing.  I don't think Avian will ever compete with true optimizing compilers for quality of generated code.  That's just not the target area.
 
 
One point to note is that we can use Clang, but for this kind of programming there is a better tool: a language called CAL. This will be better for implementing the pipelined execution logic. Best is to use the right tool for the job: C++ where C++ is better, and perhaps CAL where it can do a great job.

True, but I don't want to tie Avian to any particular compiler toolchain right now.  After the bit I read about CAL (on Wikipedia, and just skimming at that), I don't see what benefit it would have over other languages like Lisp, Haskell, Erlang, ML, ...  or even C++ for that matter.

-------------

Suminda, I think you have a lot of interesting ideas, and a good grasp on some of the things that are at least _theoretically_ possible.  However, you've thrown about a lot of less-than-practical ideas, and not really explored any of them in depth.

Perhaps you can pick one or two interesting, promising, and most importantly, practical ideas that we can discuss at length.  I would also humbly suggest using Avian on a decent-size project, or maybe digging into the code and fixing a bug or adding a small feature to Avian.  That way, you'll have a much better understanding of where Avian is, where it's going, and where it needs help.

Sincerely, and with respect,
Joshua