[Brainstorming] Basis for the re-design of the compiler

0 views
Skip to first unread message

David Pineau

unread,
Jun 4, 2013, 7:50:11 PM6/4/13
to rathaxe...@googlegroups.com
Greetings, people !

As you may all know, we planned to start the re-write of the compiler in May
or in June. Unfortunately, we didn't get any news from Lionel since the last
hackathon.

Anyways, I think it's time to start thinking about how we are going to design
the new compiler. The current development status being more of a
trial-and-error kind of thing, the design is quite perfectible, and we plan to use
this opportunity to its fullest.

~~~~

First of all, we need to identify every single component or compilation step,
in order to isolate them (both for testing purposes and ease of development).
Here is the list I can identify (complete/comment it where you see fit):

Compilation steps:
 - Parsing (Takes the code in, builds an AST)
 - PlaceHolder Parsing (Identifies back-end placeholders, and parse them)
 - Introspection (Identifies C/RTX declarations, annotates type trees)
 - Type Checking (Tries to fully match -qualify- rathaxes types against
                          available interfaces)
 - Registration (Not actually a compilation step; registers
                      interfaces/templates/drivers to the cache)
 - Generation (Takes the parsed driver, generated the C code for it)

Components:
 - Tree structure (AST nodes definitions)
 - Parser
 - Cache (registers and loads a bunch of info about the saved interfaces and
   templates)
 - Linker (Uses the cache to relate templates for any other component that
   needs it)
 - Type checker (compilation step; ensures that the types are consistent with
    registered interfaces)
 -  Resolver

Currently unsolved issues (with their proposed solution):
 - Generation process does not take note properly of optional/required
   sequences.
     -> The solution is to rework the whole generation process (more about
         that afterwards)
 - We have no definite way to manage unknown types (kernel-defined types
   for instance) in the code of a template.
    -> A new syntax to declare a C type may be a useful placeHolder, which
        must be checked together when weaving the trees.

~~~~

For now, that will be all for this list. Now, allow me to describe my thoughts
about the redesign, and how I envision things. I do not necessarily believe
that each component I identified should translate to a class; but they have to
be properly identifiable in the code.

Code Organization:
Currently, the code is organized by compilation steps. A purely procedural way
of organizing code, which is half helpful, and half hindering the development
efforts.

Let me elaborate on that: It is helpful because it makes it easy to
search for code in a specific compilation step. But at the same time, it's
harmful because when we try to add/fix/modify a feature into the mix, we may
have to touch multiple files, while no documents tells which files contains all
the code for which feature.

This is why I believe that following the flow of the rewrite, we should completely
reorganize our code. I mean, we were working with a purely procedural language,
so we were limited by choice. Now that we're moving on to python. we have a
full-fledged object language at our disposal. I want your opinion on this:
~
Do you think that it may be wise to associate an object = a feature ?
That is, to use hooks within each compilation steps to run feature-specific code
when it has a meaning. This way, we could concentrate a feature's code in one
place, hopefully making it easier to add/fix/re-factor/modify.
~
This is one of my current thinking subjects, and I would really appreciate your
input on that subject.


Code Generation:
As I wrote in the previous listing, the code is currently being generated by a
kind-of forceful way. We merely recursively try to pull everything that fits into
a place-holder (pointcut, and such). This has the demerits that is does not fully
comply to our multi-platform and flexibility goals: Only include the code that is
necessary to include.

During previous hackathons we have discussed it a few times, and a new
generation model began to take form. The basic idea is to separate the
generation process into multiple steps: first a pre-selection of the templates
depending on the generation configuration; followed by a second selection,
based on what sequences are implemented and/or used in the driver's code,
and finally, only weave the final code tree by only using those selected
templates (and those that they explicitly use).

This method is not fully designed yet, but will heavily rely on the typing system
(on the side of the middle-end language), as well as the linker, that will have to
use the cache to build the proper pre-selection of templates. This means that
we have to take special care while designing the cache, the linker, and the
typing system.

~~~~

This leads me to the last part of this fat mail: In order to re-design, we must
identify every requirements for each component (and data structure ?).
This may be the part where everyone's help is most precious.

The focus here will be over the Cache and the linker, the two central components.
Since their roles are quite related and it's hard to separate, I'll simply list the
known requirements; and we'll separate them at a later time.

What I know for sure for the Cache/Linker is:
 - We need to register a bunch of data:
    > Source files (Interfaces, templates, drivers)
    > Compiled files (idem)
 - It has to be structured/indexed:
    > By Interface (for interface search and the typing system)
    > By configuration (for pre-selection by config)
    > By Template (for explicit template selection)
    > By dependency (for the resolver to easily select what's used or not)
 - We need to be able to do some structured queries, for an informational
   purpose (listing the support for a given config/template for instance):
    > List the currently registered / not registered templates for a given interface
       and a given configuration
    > List the configurations supported by the registered code for a given template
 - We need to identify multiple flavors for the cache:
    > System cache (Library of Interfaces, Templates and drivers stored
       system-wide)
    > User cache (Registered Interfaces; Templates and drivers stored in the
       user's data)


Which leads me to ask:
 - Which functionalities are "Linker" or "Cache" ?
 - Is it necessary/Useful/Doable to separate both ?



Next is the new resolver algorithm.
The aim is to get a finer resolution that does not include unused code into the
generated driver. Meaning: Generate only what you need/want.

First step, a first big selection on the cache is done through matching the
configuration against the multiple templates' constraints.

Second step, by using the cache's dependency system, and selecting
through vthe templates and sequences used/implemented (or not) in the
driver, we can get only what's used for the last step.

Finally, the last step is the part that matches the closest with the current
resolver algoritm, since it's the one that recursively descends through the
placeholders; and resolves them by using the cache/linker. The only new
part here is that it won't take everything in, but only what has been
previously selected.



The last part I want to elaborate on is my thought about how we are going
to structure our code and how we are going to work with a new language
that offers multiple object features.

I explained that I was thinking about how to easily add a new feature,
since we currently need to dive into multiple files for each feature that
we want to add, which is actually quite an efficiency sink. I want us to be
able to add new features at a much faster rate, thanks to a new and useful
design.

Sadly, I have no miracle solution; and I feel that "One object for a feature"
is kind of a stupid mistake we must avoid at all costs. Do you have any
proposition about this specific issue ? We could have for example; say...
One base class for each compilation step, which would be added to an
internal list of features within that specific compilation step's code. Then,
we could regroup the classes for each feature in a file, or something of
the likes ?

Don't hesitate commenting those Ideas. I am a mere beginner in python,
and every single piece of advice that could help is welcome; be it as a
guideline or as a design advice.


~~~~

Well, that was a bit long, but I hope it wasn't too hard to read through :)

I may have missed many things, but you are of course all welcome to fix
it by talking about it.

TL;DR: Here are my thoughts to use as a base for re-design discussions
for the soon-to-come python rewrite of rathaxes. Please read it, and let's
discuss it !

Cheers,

--
David "Joachim" Pineau,
Developer R&D at Scality

Thomas Sanchez

unread,
Jun 5, 2013, 3:25:35 AM6/5/13
to rathaxe...@googlegroups.com
Greetings from Berlin :)

2013/6/5 David Pineau <dav.p...@gmail.com>:
> Greetings, people !
>
> As you may all know, we planned to start the re-write of the compiler in May
> or in June. Unfortunately, we didn't get any news from Lionel since the last
> hackathon.
>
> Anyways, I think it's time to start thinking about how we are going to
> design
> the new compiler. The current development status being more of a
> trial-and-error kind of thing, the design is quite perfectible, and we plan
> to use
> this opportunity to its fullest.
>
> ~~~~
>
> First of all, we need to identify every single component or compilation
> step,
> in order to isolate them (both for testing purposes and ease of
> development).
> Here is the list I can identify (complete/comment it where you see fit):
>
> Compilation steps:
> - Parsing (Takes the code in, builds an AST)
> - PlaceHolder Parsing (Identifies back-end placeholders, and parse them)
> - Introspection (Identifies C/RTX declarations, annotates type trees)

Doesn't the identification (as long as the placeholder identification)
take place when you parse ?
These 3 step would be the parsing box ?

> - Type Checking (Tries to fully match -qualify- rathaxes types against
> available interfaces)
Yup (and we would lookup in the cache if some of the referenced
interfaces are already compiled ?)

> - Registration (Not actually a compilation step; registers
> interfaces/templates/drivers to the cache)
> - Generation (Takes the parsed driver, generated the C code for it)

For the generation I agree but should we already think about the tool
generation ? (I mean the makefile for example)

>
> Components:
> - Tree structure (AST nodes definitions)
> - Parser
> - Cache (registers and loads a bunch of info about the saved interfaces and
> templates)
> - Linker (Uses the cache to relate templates for any other component that
> needs it)
> - Type checker (compilation step; ensures that the types are consistent
> with
> registered interfaces)
> - Resolver

Didn't you forget the structure representing an interface too ? Maybe
something more "functionally aware" than an ast would be
simpler to manipulate. I imagine this "class" as a wrapper over an ast.

>
> Currently unsolved issues (with their proposed solution):
> - Generation process does not take note properly of optional/required
> sequences.
> -> The solution is to rework the whole generation process (more about
> that afterwards)
> - We have no definite way to manage unknown types (kernel-defined types
> for instance) in the code of a template.

Didn't the cnorm can parse them ?

> -> A new syntax to declare a C type may be a useful placeHolder, which
> must be checked together when weaving the trees.

>
> ~~~~
>
> For now, that will be all for this list. Now, allow me to describe my
> thoughts
> about the redesign, and how I envision things. I do not necessarily believe
> that each component I identified should translate to a class; but they have
> to
> be properly identifiable in the code.

Agreed.

>
> Code Organization:
> Currently, the code is organized by compilation steps. A purely procedural
> way
> of organizing code, which is half helpful, and half hindering the
> development
> efforts.
>
> Let me elaborate on that: It is helpful because it makes it easy to
> search for code in a specific compilation step. But at the same time, it's
> harmful because when we try to add/fix/modify a feature into the mix, we may
> have to touch multiple files, while no documents tells which files contains
> all
> the code for which feature.
Can't we be inspired by the way llvm and clang are organized ? I'm not
competent to tell how we should do it but maybe Lionel can help us
with that.
I think, since we will be using Python we should have clearly
identified block of features. Otherwise it would be a nightmare to
understand the call graph.

>
> This is why I believe that following the flow of the rewrite, we should
> completely
> reorganize our code. I mean, we were working with a purely procedural
> language,
> so we were limited by choice. Now that we're moving on to python. we have a
> full-fledged object language at our disposal. I want your opinion on this:
> ~
> Do you think that it may be wise to associate an object = a feature ?

Or a module.

> That is, to use hooks within each compilation steps to run feature-specific
> code
Yes
> when it has a meaning. This way, we could concentrate a feature's code in
> one
> place, hopefully making it easier to add/fix/re-factor/modify.
Sounds good.
> ~
> This is one of my current thinking subjects, and I would really appreciate
> your
> input on that subject.
>
>
> Code Generation:
> As I wrote in the previous listing, the code is currently being generated by
> a
> kind-of forceful way. We merely recursively try to pull everything that fits
> into
> a place-holder (pointcut, and such). This has the demerits that is does not
> fully
> comply to our multi-platform and flexibility goals: Only include the code
> that is
> necessary to include.

Is it really a problem at this moment ?

>
> During previous hackathons we have discussed it a few times, and a new
> generation model began to take form. The basic idea is to separate the
> generation process into multiple steps: first a pre-selection of the
> templates
> depending on the generation configuration; followed by a second selection,
> based on what sequences are implemented and/or used in the driver's code,
> and finally, only weave the final code tree by only using those selected
> templates (and those that they explicitly use).
>

Logical.

> This method is not fully designed yet, but will heavily rely on the typing
> system
> (on the side of the middle-end language), as well as the linker, that will
> have to
> use the cache to build the proper pre-selection of templates. This means
> that
> we have to take special care while designing the cache, the linker, and the
> typing system.

I know the cache is important, but in the begining if we decide to
reparse everything it may not be a problem if we can't come
up with a definitive AST in the first steps. Or do I miss something ?

>
> ~~~~
>
> This leads me to the last part of this fat mail: In order to re-design, we
> must
> identify every requirements for each component (and data structure ?).
> This may be the part where everyone's help is most precious.
>
> The focus here will be over the Cache and the linker, the two central
> components.
> Since their roles are quite related and it's hard to separate, I'll simply
> list the
> known requirements; and we'll separate them at a later time.
>
> What I know for sure for the Cache/Linker is:
> - We need to register a bunch of data:
> > Source files (Interfaces, templates, drivers)
> > Compiled files (idem)
> - It has to be structured/indexed:
> > By Interface (for interface search and the typing system)
> > By configuration (for pre-selection by config)
> > By Template (for explicit template selection)
> > By dependency (for the resolver to easily select what's used or not)

It will be fun ... It reminds me graph databases.

> - We need to be able to do some structured queries, for an informational
> purpose (listing the support for a given config/template for instance):
> > List the currently registered / not registered templates for a given
> interface
> and a given configuration
> > List the configurations supported by the registered code for a given
> template
> - We need to identify multiple flavors for the cache:
> > System cache (Library of Interfaces, Templates and drivers stored
> system-wide)
> > User cache (Registered Interfaces; Templates and drivers stored in the
> user's data)
>

We may start to thing to an actual database...

>
> Which leads me to ask:
> - Which functionalities are "Linker" or "Cache" ?
> - Is it necessary/Useful/Doable to separate both ?

Between the beginning and the end of my reading, I'm not sure anymore
of what is the cache.
Can you recall me ?

For me, it was just an optimization permitting not to compile each
time the back-end files. If i'm right,
can we start without it, and add it when we will have a more stable
compilation ?

>
>
>
> Next is the new resolver algorithm.
> The aim is to get a finer resolution that does not include unused code into
> the
> generated driver. Meaning: Generate only what you need/want.
>
> First step, a first big selection on the cache is done through matching the
> configuration against the multiple templates' constraints.
>
> Second step, by using the cache's dependency system, and selecting
> through vthe templates and sequences used/implemented (or not) in the
> driver, we can get only what's used for the last step.
>
> Finally, the last step is the part that matches the closest with the current
> resolver algoritm, since it's the one that recursively descends through the
> placeholders; and resolves them by using the cache/linker. The only new
> part here is that it won't take everything in, but only what has been
> previously selected.
>

If I understand properly, the resolver matches the code from the rtx
with the one
in the backend, BUT, in the backend we won't be using everything.
I can't see how you can detect unused code without false positive:

You will need to follow each call/reference and add what is needed. But what is
you entry point ? How can you indicate if a symbol is not required by
the kernel ?


>
>
> The last part I want to elaborate on is my thought about how we are going
> to structure our code and how we are going to work with a new language
> that offers multiple object features.
>
> I explained that I was thinking about how to easily add a new feature,
> since we currently need to dive into multiple files for each feature that
> we want to add, which is actually quite an efficiency sink. I want us to be
> able to add new features at a much faster rate, thanks to a new and useful
> design.
>
> Sadly, I have no miracle solution; and I feel that "One object for a
> feature"
> is kind of a stupid mistake we must avoid at all costs. Do you have any
> proposition about this specific issue ? We could have for example; say...
> One base class for each compilation step, which would be added to an
> internal list of features within that specific compilation step's code.
> Then,
> we could regroup the classes for each feature in a file, or something of
> the likes ?
>
> Don't hesitate commenting those Ideas. I am a mere beginner in python,
> and every single piece of advice that could help is welcome; be it as a
> guideline or as a design advice.

For python, let's say that we work with attributes instead of interfaces/class.
However, I don't have your experience in parser/compiler to answer
this question.
That's why we should see how the others have achieve this goal.

>
>
> ~~~~
>
> Well, that was a bit long, but I hope it wasn't too hard to read through :)

Hard as fuck the morning, but really interesting though.

>
> I may have missed many things, but you are of course all welcome to fix
> it by talking about it.
>
> TL;DR: Here are my thoughts to use as a base for re-design discussions
> for the soon-to-come python rewrite of rathaxes. Please read it, and let's
> discuss it !

Next time, put it in the beginning :D



--
Thomas Sanchez

David Pineau

unread,
Jun 5, 2013, 7:32:12 AM6/5/13
to rathaxe...@googlegroups.com



2013/6/5 Thomas Sanchez <thomas...@gmail.com>
Well, up 'till now, those were separate.
But I agree, they could be all done at once (at least parsing + introspection),
but it's a bit different for the placeholders. We needed to group them in an
iterable within the tree, and it may not be the easiest while parsing.
 

>  - Type Checking (Tries to fully match -qualify- rathaxes types against
>                           available interfaces)
Yup (and we would lookup in the cache if some of the referenced
interfaces are already compiled ?)

Exactly.
 

>  - Registration (Not actually a compilation step; registers
>                       interfaces/templates/drivers to the cache)
>  - Generation (Takes the parsed driver, generated the C code for it)

For the generation I agree but should we already think about the tool
generation ? (I mean the makefile for example)

Well, I think we still have so much to resolve in the driver code generation
that I don't really pay attention to the generations of the build tools yet.

But it is a good point, we must still think about it to plan for it.
 

>
> Components:
>  - Tree structure (AST nodes definitions)
>  - Parser
>  - Cache (registers and loads a bunch of info about the saved interfaces and
>    templates)
>  - Linker (Uses the cache to relate templates for any other component that
>    needs it)
>  - Type checker (compilation step; ensures that the types are consistent
> with
>     registered interfaces)
>  -  Resolver

Didn't you forget the structure representing an interface too ? Maybe
something more "functionally aware" than an ast would be 
simpler to manipulate. I imagine this "class" as a wrapper over an ast.

Was included in my "Tree structure", which is obviously a really large definition
for now. I didn't want to enter in the details too fast.


>
> Currently unsolved issues (with their proposed solution):
>  - Generation process does not take note properly of optional/required
>    sequences.
>      -> The solution is to rework the whole generation process (more about
>          that afterwards)
>  - We have no definite way to manage unknown types (kernel-defined types
>    for instance) in the code of a template.

Didn't the cnorm can parse them ?

It did, but many issues arise with the unstrict concept. Multiple times, we end up
having a decision issue about "which identifier is the type ?". It's an almost
unsolvable problem.
Maybe we should, but then I'm the same as you are: I've never read the llvm and
clang code, and don't know how it's organized, but that may be a great source of
inspiration.
It has been becoming an issue for a few months. First, it makes hard debugging
by checking the generated code (too much noise).
It is also an issue in the way we can't generate the code when an optional sequence
is not implemented. It's becoming a real blocker.
 

>
> During previous hackathons we have discussed it a few times, and a new
> generation model began to take form. The basic idea is to separate the
> generation process into multiple steps: first a pre-selection of the
> templates
> depending on the generation configuration; followed by a second selection,
> based on what sequences are implemented and/or used in the driver's code,
> and finally, only weave the final code tree by only using those selected
> templates (and those that they explicitly use).
>

Logical.

> This method is not fully designed yet, but will heavily rely on the typing
> system
> (on the side of the middle-end language), as well as the linker, that will
> have to
> use the cache to build the proper pre-selection of templates. This means
> that
> we have to take special care while designing the cache, the linker, and the
> typing system.

I know the cache is important, but in the begining if we decide to
reparse everything it may not be a problem if we can't come
up with a definitive AST in the first steps. Or do I miss something ?


Well, we could do that. But I feel that we cannot leave it out at the start, since its
too involved in each step. Maybe am I wrong ? I don't know, really.
 

>
> ~~~~
>
> This leads me to the last part of this fat mail: In order to re-design, we
> must
> identify every requirements for each component (and data structure ?).
> This may be the part where everyone's help is most precious.
>
> The focus here will be over the Cache and the linker, the two central
> components.
> Since their roles are quite related and it's hard to separate, I'll simply
> list the
> known requirements; and we'll separate them at a later time.
>
> What I know for sure for the Cache/Linker is:
>  - We need to register a bunch of data:
>     > Source files (Interfaces, templates, drivers)
>     > Compiled files (idem)
>  - It has to be structured/indexed:
>     > By Interface (for interface search and the typing system)
>     > By configuration (for pre-selection by config)
>     > By Template (for explicit template selection)
>     > By dependency (for the resolver to easily select what's used or not)

It will be fun ... It reminds me graph databases.

I thought about it. Didn't dare talk about it :p
 

>  - We need to be able to do some structured queries, for an informational
>    purpose (listing the support for a given config/template for instance):
>     > List the currently registered / not registered templates for a given
> interface
>        and a given configuration
>     > List the configurations supported by the registered code for a given
> template
>  - We need to identify multiple flavors for the cache:
>     > System cache (Library of Interfaces, Templates and drivers stored
>        system-wide)
>     > User cache (Registered Interfaces; Templates and drivers stored in the
>        user's data)
>

We may start to thing to an actual database...

Maybe, but it was how we saw the cache early one
 

>
> Which leads me to ask:
>  - Which functionalities are "Linker" or "Cache" ?
>  - Is it necessary/Useful/Doable to separate both ?

Between the beginning and the end of my reading, I'm not sure anymore
of what is the cache.
Can you recall me ?

For me, it was just an optimization permitting not to compile each
time the back-end files. If i'm right,
can we start without it, and add it when we will have a more stable
compilation ?

There's a bit of yes and a bit of no:
-> Yes because it mostly registers pre-compiled templates, and interfaces.
Maybe we could skip the "pre-compiled part"
-> No because as you noticed, it also is a kind of database about the existing
interfaces, templates and drivers (drivers may not be the first of our worries
but it will come later).
-> No because if we don't register anything, we will have to parse much more
than we will use. (We don't want to parse EVERY single
 

>
>
>
> Next is the new resolver algorithm.
> The aim is to get a finer resolution that does not include unused code into
> the
> generated driver. Meaning: Generate only what you need/want.
>
> First step, a first big selection on the cache is done through matching the
> configuration against the multiple templates' constraints.
>
> Second step, by using the cache's dependency system, and selecting
> through vthe templates and sequences used/implemented (or not) in the
> driver, we can get only what's used for the last step.
>
> Finally, the last step is the part that matches the closest with the current
> resolver algoritm, since it's the one that recursively descends through the
> placeholders; and resolves them by using the cache/linker. The only new
> part here is that it won't take everything in, but only what has been
> previously selected.
>

If I understand properly, the resolver matches the code from the rtx
with the one
in the backend, BUT, in the backend we won't be using everything.
I can't see how you can detect unused code without false positive:

You will need to follow each call/reference and add what is needed. But what is
you entry point ? How can you indicate if a symbol is not required by
the kernel ?


It's in my opinion an issue of categorizing the code.
For instance, we have recently started to identify the sequences implemented in
the front-end as "Callbacks".

Which callbacks are provided to the kernel depends on which callbacks are
required/optional+impl and present in the front-end.

The templates are mostly code provided for use to the front-end developer.

Then, annotation and dependency identification may be done at the compilation
in order to resolve the few shadow spots left. I won't say that it is enough, but in
my head it seems like a good start.
 

>
>
> The last part I want to elaborate on is my thought about how we are going
> to structure our code and how we are going to work with a new language
> that offers multiple object features.
>
> I explained that I was thinking about how to easily add a new feature,
> since we currently need to dive into multiple files for each feature that
> we want to add, which is actually quite an efficiency sink. I want us to be
> able to add new features at a much faster rate, thanks to a new and useful
> design.
>
> Sadly, I have no miracle solution; and I feel that "One object for a
> feature"
> is kind of a stupid mistake we must avoid at all costs. Do you have any
> proposition about this specific issue ? We could have for example; say...
> One base class for each compilation step, which would be added to an
> internal list of features within that specific compilation step's code.
> Then,
> we could regroup the classes for each feature in a file, or something of
> the likes ?
>
> Don't hesitate commenting those Ideas. I am a mere beginner in python,
> and every single piece of advice that could help is welcome; be it as a
> guideline or as a design advice.

For python, let's say that we work with attributes instead of interfaces/class.
However, I don't have your experience in parser/compiler to answer
this question.
That's why we should see how the others have achieve this goal.

Agreed
 

>
>
> ~~~~
>
> Well, that was a bit long, but I hope it wasn't too hard to read through :)

Hard as fuck the morning, but really interesting though.

Eh, sorry about that :p
 

>
> I may have missed many things, but you are of course all welcome to fix
> it by talking about it.
>
> TL;DR: Here are my thoughts to use as a base for re-design discussions
> for the soon-to-come python rewrite of rathaxes. Please read it, and let's
> discuss it !

Next time, put it in the beginning :D


Did you read it first ? :D
 


--
Thomas Sanchez

--
--
ML Rathaxes
www.rathaxes.org

---
You received this message because you are subscribed to the Google Groups "Rathaxes Development List" group.
To unsubscribe from this group and stop receiving emails from it, send an email to rathaxes-deve...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.





--
David Pineau,
Developer R&D at Scality

Lionel Auroux

unread,
Jul 23, 2013, 9:31:42 AM7/23/13
to rathaxes-devel
Hi, sorry I have took my time to write a response due to my fucking planning.

I will respond in a different order than joa but I will try to fill all his open topics.

1) Python + Design :

Codeworker allowed us to identify quickly the big picture of our final compiler.
So we identify the different layer of rathaxes but codeworker's expressivity guide us to do some misconception choices.

Yes, we need a correct abstraction around AST Nodes.
Python AST nodes will be instance of classes.
All dumping stuff done with codeworker will be the same with python but for debugging purpose.
Our experience allow us now to design correctly classes for all nodes representing all rathaxes statements/expressions.
Just wait the works done on cnorm and you will understand the benefits.

2) Code generation

In fact, the generation plan is define by the interface statement.

We have all dependencies:
- dependence on other interfaces (interface X : Y, Z)
- variables names in which templates could depend.
- qualifiers value in which frontend callbacks could depend.

RTI will be compiled and produced meta info that will be usefull for code production.

Interface node classe could have method to check if a front statement is reliable with its constraints.
ex: theItf.checkDevice(theDevice) ...
idem for template:
ex: theItf.checkTemplate(theTempl)

If we think brick after brick. It's easier to "unit test them".
So python design with classes support will allow us to simply write all the check functions used by type checking...

So when we read a configuration:
- we load targeted devices
- we load dependent interfaces
- interfaces check if related devices are conform to constraints, and check callbacks qualifiers constraints.
- now we have the list of which templates we need to load, only on variables constraints, not on its values.
- we select templates that have correct value constraints.

The last 2 steps could be easier if we use a graph db backend to store and query our template chunks.

3) Cache + Linker

For me this feature must be transparent.
And we could achieve this if "cache+linker" is connected to type checking system.

-First, we have a identification problem:

During type checking of one statement, we produce query like "what is X".
X exists because we previously parse it and check it. so we could get info to perform type checking.
X didn't exists so we need to load it from cache. All the problem is to qualify X to uniquely identify it.
Our cache system act like an "auto-import" in python, or like an "smart assembly" in C#.
And to achieve this we need to make the difference between parametric chunks, and specialized chunks.
I called parametric chunks, chunks identified by variables ("with" statement but only the variable names are important).
I called specialized chunks, chunks identified by values (the complete query of the "with" statement).
Specialized chunks exists only if the parametric chunks exists. And only the parametric chunks could be identified by our type system.

And so, the linker algorithm is transparent. It's because that we have a dependence inherited from the front for our generation than we load this or that chunks.
The power of our query system is central to do stuff like "load all templates type M::N when os=X, version>=Y, bus=Z, featureA=W,"

- We have also a problem of storage:

The storage of the backend library must fulfill our requirements: introspection, querying

The use of a graph database must be explored quickly to see if it could be a solution.

4) Code Organisation

Our actual model could be called a layered model. It's hard to understand what and where to change things because all is entangled.
To follow LLVM and Clang development, I know that don't really exist a perfect way to implement compiler stuff with a clear design.
I try to implement in cnorm 4 some stuff that I've seen during the EuroLLVM conference but a  C frontend is well known.

Rathaxes is another kind of compiler.

However I've thought about something that could help us.
The big challenge is to understand clearly our "sequence diagram" for all the production job. I propose for this a "pipeline model".

The basic idea is to get a complete matrix of the production chain. Each layer correspond of a feature. Each rows correspond to a step in the production.
One step is basically the same, it's just a tree transformation and/or collection of info, but our goal is to identify correctly the major transformation steps.
Each layer is process for each rows. One layer could store info during the step X and use it in the step Y.
My model isn't perfect, I need feedback and a big live brainstorming with the team to see if is viable.
I've tried to write an example in the attachment. We could throw away this idea or expand it.
We must improve the model to see all data sharing needs.
With our current implementation, joa, could you try to expand my sample and see what is stored where, what we need to share etc...
the only share zones are the BackendLibrary or the AST object in my example.
You could modify what you need to simulate the existent production code.
It could be the first refactoring step.



2013/6/5 David Pineau <dav.p...@gmail.com>



--
Cordialement,
Lionel Auroux
pipeline.py

David Pineau

unread,
Jul 24, 2013, 6:25:24 AM7/24/13
to rathaxe...@googlegroups.com



2013/7/23 Lionel Auroux <lionel...@gmail.com>
I agree about the transparency thing. In the codeworker prototype, I chose to mostly use the cache/linker for such requests, and thus requiring registering things into the cache before.
 

-First, we have a identification problem:

During type checking of one statement, we produce query like "what is X".
X exists because we previously parse it and check it. so we could get info to perform type checking.
X didn't exists so we need to load it from cache. All the problem is to qualify X to uniquely identify it.

That's here the typing is to be taken into account. The role of this step is to fully qualify every bit of code. Then, it should be uniquely identifiable.
 
Our cache system act like an "auto-import" in python, or like an "smart assembly" in C#.
And to achieve this we need to make the difference between parametric chunks, and specialized chunks.

Okay, you lost me there.
 
I called parametric chunks, chunks identified by variables ("with" statement but only the variable names are important).
I called specialized chunks, chunks identified by values (the complete query of the "with" statement).
Specialized chunks exists only if the parametric chunks exists. And only the parametric chunks could be identified by our type system.

What would be the use of those so-called parametric chunks ? I don't see why nor how it could be useful.
 

And so, the linker algorithm is transparent. It's because that we have a dependence inherited from the front for our generation than we load this or that chunks.
The power of our query system is central to do stuff like "load all templates type M::N when os=X, version>=Y, bus=Z, featureA=W,"


Why only taking the variables expressed in the with statement ?

I mean, currently, if a variable isn't expressed in the with statement, then the matching implementation is considered to match EVERY single value for this variable. Do you want to change completely this behavior ?
 

- We have also a problem of storage:

The storage of the backend library must fulfill our requirements: introspection, querying

The use of a graph database must be explored quickly to see if it could be a solution.

I'm currently exploring the structure we could apply to a graph database, and how we could use it. I'm a bit slow on this, though.
 

4) Code Organisation

Our actual model could be called a layered model. It's hard to understand what and where to change things because all is entangled.

Yup, that's my issue: Fixing a feature is not easy, code-speaking. You have sometimes to crawl through half of the base code, making it almost impossible for a novice (on the compiler) to fix anything.
 
To follow LLVM and Clang development, I know that don't really exist a perfect way to implement compiler stuff with a clear design.
I try to implement in cnorm 4 some stuff that I've seen during the EuroLLVM conference but a  C frontend is well known.

Rathaxes is another kind of compiler.

However I've thought about something that could help us.
The big challenge is to understand clearly our "sequence diagram" for all the production job. I propose for this a "pipeline model".

The basic idea is to get a complete matrix of the production chain. Each layer correspond of a feature. Each rows correspond to a step in the production.
One step is basically the same, it's just a tree transformation and/or collection of info, but our goal is to identify correctly the major transformation steps.
Each layer is process for each rows. One layer could store info during the step X and use it in the step Y.
My model isn't perfect, I need feedback and a big live brainstorming with the team to see if is viable.
I've tried to write an example in the attachment. We could throw away this idea or expand it.
We must improve the model to see all data sharing needs.
With our current implementation, joa, could you try to expand my sample and see what is stored where, what we need to share etc...
the only share zones are the BackendLibrary or the AST object in my example.
You could modify what you need to simulate the existent production code.
It could be the first refactoring step.


From what I could understand of your sample, It is quite close to what I thought about. The only thing that I cannot picture for now, is How, in this model, I can separate each feature I talked about (for instance, each type of placeholder is a different feature to me: explicit sequence/chunk call, method calls, attribute calls, pointcuts, values)

Your sample presents the steps as wide things, but I wonder if my idea would match your model. I will try to update your sample and put it in this thread for feedback/ideas.

David Pineau

unread,
Jul 24, 2013, 8:34:34 AM7/24/13
to rathaxe...@googlegroups.com



2013/7/24 David Pineau <dav.p...@gmail.com>
Alright, before trying anything out, I want to clarify some points about the model I wish to see for the compiler's rewrite.

Reading your sample, I see that your pipeline contains three main steps:
 - type_with
 - transform
 - product.

Then, you put a series of modules implementing one or more steps of the pipeline (I do not call them features, I'll explain why later) into a pipeline, which defines what is to be applied to the current Code/Tree.

For instance, the all-in-one compilation includes the modules:
 - RtiDependance (type)
 - RtiRegister (product)
 - BltCompileTemplate (type, transform, product)
 - BltRegister (product)
 - RtxConfiguration (type, transform)
 - RtxProductCode (product)

Thus, each step calls the modules in order, to manipulate the AST. Good, we can type/transform/product.

Up 'till there, I see nothing wrong with it, though the Idea to type the configuration before even registering the interfaces/templates bothers me... May be related to the fact that I didn't see the cache/linker as seamless compared to the all-in-one process.

No, what really bothers me actually, is that I cannot see how the features can be isolated from one another in this model. To me, it feels like formalizing what's implemented today in the codeworker code, but without improving it. There's this pipeline thing, with a series of steps.. Well it changes a bit from today...

For instance, let's consider the pointcut mechanism as a Feature (I want to consider each single mechanism of the language/compiler as a feature, not a bunch of them at a time) Then the associated code could be found in:
 - RtiDependence (type -> check a pointcut's type)
 - RtiRegister (product -> Register in the cache the right items to allow easy resolution of the chunks to be included in this pointcut)
 - BltCompileTemplate (type/transform/product -> obvious, right ?)
 - BltRegister (product -> Same as RtiRegister)
 - RtxProductCode (product -> Algorithm to weave a pointcut)
 
One of the issues for me today is that if I have to fix this feature, I have to go through half of the code. This model reproduces the same behavior.
Often, we have to fix/implement new features; and I feel that it may be more important than fixing a pipeline step, in terms of development efficiency and ease of entry within the compiler.

I know that either way, one type of change will be harder: Either the feature fix/updates will be harder with this pipeline model; either altering the pipeline will be harder with a feature-oriented model. What do you suggest about it ?
Reply all
Reply to author
Forward
0 new messages