The problem I face is that the pylint code doesn't actually seem to
make inferences about the code that it is analyzing. It all seems
like preparation :-)
I am writing this after briefly studying the astng code in
site-packages\logilab_astng-0.19.3-py2.6.egg\logilab\astng
and the pylint checker code in pylint-0.19.0\checkers
--- Goals
The primary goal, indeed the only goal, for the "new pylint" is to
make it the best possible platform for analysis of Python code.
Pylint, pychecker and similar tools are arguably the most important
tools in the Python world other than Python itself.
By "best possible" I mean the following:
1. It should be as fast as possible.
2. It should be as simple and easy-to-understand as possible.
3. It should be as flexible and extensible as possible.
It should be easy for other people to write their own checkers. That
means that the code must be well documented and packaged for easy use.
--- The grand strategy
1. One of the biggest surprises of my programming life was the
realization that multi-pass compiler algorithms can be significantly
*faster* than single-pass algorithms. At the time, that was a huge
shock. Now, it seems second nature to me. So the first part of the
strategy is that the new pylint will consist of multiple passes,
largely independent of each other. This will have salutary effects
everywhere.
2. In contrast to pylint, the new pylint will treat ast trees as
read-only. Early passes will convert the ast trees of the modules being
analyzed into self-contained data structures. This seems like
avoidable overhead, but it should greatly simplify later passes.
Prepasses that simplify later passes pay huge dividends.
3. At present, pylint analyzes only one file at a time. The new
pylint will handle lists of files. This will save a lot of work.
The new pylint will preprocess each file only once.
--- The passes
The new pylint will process files in approximately this way:
Pass 1: read all file-level ast's.
For each file in the list of files passed to the (new) pylint, pylint
will read the ast for that file. It takes one line of code:
    tree = ast.parse(s, filename=fn, mode='exec')
This pass creates file2astDict, a dictionary associating filenames
with ast's.
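As a sketch (read_asts is my name for the helper, not an existing
pylint API), pass 1 might look like this:

```python
import ast

def read_asts(filenames):
    """Pass 1 sketch: parse each file exactly once, building
    file2astDict, a dictionary mapping filenames to their ast's."""
    file2astDict = {}
    for fn in filenames:
        with open(fn) as f:
            s = f.read()
        # The single line of real work:
        file2astDict[fn] = ast.parse(s, filename=fn, mode='exec')
    return file2astDict
```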
Pass 2: import all files.
This pass traverses all the ast's in file2astDict, looking for import
statements. If the imported file does not already exist in
file2astDict.keys(), this pass will read the ast for that file and
create an entry in file2astDict.
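A minimal sketch of the import scan (find_imports is an illustrative
helper; it ignores relative-import subtleties and name resolution):

```python
import ast

def find_imports(tree):
    """Pass 2 sketch: collect the module names imported by one file's
    ast, so missing files can be added to file2astDict."""
    names = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            if node.module:  # 'from . import x' has module=None
                names.append(node.module)
    return names
```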
Pass 3: ast to internal data
For every ast in file2astDict.keys(), this pass will create internal
data structures that
a) contain all the info in the ast, but
b) are optimized for the task of data analysis.
This pass will create **pylint nodes** corresponding to ast nodes. I
don't want to monkey-patch the ast tree: I want a new tree that does
*exactly* what I want.
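To make the idea concrete, here is a toy conversion. PylintNode is a
stand-in: the real node class would carry far more analysis-oriented
data. Note that the new nodes get parent links, something raw ast
nodes lack:

```python
import ast

class PylintNode:
    """Sketch of a self-contained 'pylint node' mirroring one ast node."""
    def __init__(self, kind, children):
        self.kind = kind          # e.g. 'Module', 'Assign', 'Name'
        self.children = children  # converted child nodes
        self.parent = None        # back-link that the ast lacks
        for child in children:
            child.parent = self

def convert(node):
    """Pass 3 sketch: recursively copy an ast tree into a new tree of
    PylintNodes, leaving the original ast untouched (read-only)."""
    children = [convert(c) for c in ast.iter_child_nodes(node)]
    return PylintNode(node.__class__.__name__, children)
```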
Pass 4: post-process pylint nodes
Details hazy at present, but this pass would perform preliminary
analysis common to all inferences. It might massage the data in
pylint nodes to make the data easier to use later.
Pass 5: dynamic analysis
This is the fun/hard part. We want to infer the types that variables
can have, and the types that functions (and classes with __call__
members) can have. Analyzing assignment statements is the big one.
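A deliberately naive sketch of assignment analysis (real inference
would be iterative, flow-sensitive, and handle far more than literal
right-hand sides):

```python
import ast

def infer_simple_types(tree):
    """Pass 5 sketch: record the type of names assigned a literal;
    anything else is 'unknown'."""
    types = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Assign) and isinstance(node.targets[0], ast.Name):
            name = node.targets[0].id
            if isinstance(node.value, ast.Constant):
                types[name] = type(node.value.value).__name__
            else:
                types[name] = 'unknown'
    return types
```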
Pass 6: apply checkers
When all the preliminary work is complete, checking the code in
various ways should be simply a matter of checking data in the pylint
trees.
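For flavor, a toy checker in the spirit of pass 6. It knows nothing
about scopes, builtins, or imports, all of which a real checker must
handle:

```python
import ast

def check_undefined_names(tree, known=()):
    """Pass 6 sketch: flag module-level names loaded before any
    assignment. 'known' lets the caller pre-seed builtins."""
    defined, messages = set(known), []
    for stmt in tree.body:  # top-level statements in source order
        for node in ast.walk(stmt):
            if (isinstance(node, ast.Name)
                    and isinstance(node.ctx, ast.Load)
                    and node.id not in defined):
                messages.append('undefined name: %s' % node.id)
        if isinstance(stmt, ast.Assign):
            for target in stmt.targets:
                if isinstance(target, ast.Name):
                    defined.add(target.id)
    return messages
```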
As a background requirement, I see no reason for any of these passes,
including the checkers, ever to raise an exception. The present
pylint code is almost impossible to understand because one is never
sure what exceptions are being tossed around. Exceptions have global
effect. They are not a great idea when trying to simplify code. In
contrast, a multi-pass scheme does clearly defined tasks on clearly-
defined data. In such an environment, there should be no surprises.
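One way to honor the no-exceptions rule is for every pass to return
its result together with a list of problems. This toy pass shows the
contract (the result/problems tuple is my assumption, not existing
pylint practice):

```python
def divide_pass(pairs):
    """A toy pass obeying the no-surprises contract: it never raises.
    Bad inputs become entries in a problems list for the caller to
    report; good inputs still get processed."""
    results, problems = [], []
    for a, b in pairs:
        if b == 0:
            problems.append('cannot divide %r by zero' % a)
        else:
            results.append(a / b)
    return results, problems
```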
--- Good artists copy--great artists steal
My strength as a programmer is repackaging and revision. The new
design is, in effect, a massive refactoring of existing code. There is
plenty of useful code in pylint. I plan to steal as much of it as
possible :-)
--- Where do I go from here?
I started this design because I want to see how I would approach the
problem. This design will help me study the existing pylint code.
The question is, do I really intend to do an alternate pylint? I'm
not sure yet, but I'm not ruling it out. I have been looking for a
project that showcases Leo's strengths. A new pylint would do that.
More importantly, pylint can and should be improved, and at the very
least a new pylint might spur further work on the existing pylint.
The competition could be interesting :-)
Edward
P.S. A crucial part of being active is picking projects that are
small enough to keep my energy high. The last three months have been
a high-energy time for me because I've been focused on small projects:
fixing one bug at a time. The design presented above breaks a very
large task into more manageable pieces.
Better, it might be possible to fold the new design, step by step,
into the *existing* pylint. If possible, that would be a win for
everyone. Unless I completely misunderstand pylint, reading ast trees
once for each file in a *list* of files would greatly speed up the
existing pylint. So that's something to look into.
EKR
> Pass 5: dynamic analysis
>
> This is the fun/hard part. We want to infer the types that variables
> can have, and the types that functions (and classes with __call__
> members) can have. Analyzing assignment statements is the big one.
We can think of this phase as the heart of a globally optimizing
compiler. That is, this phase will have information about all files
in the files list readily available. For example, I would want to
analyze all of Leo's core files at once.
After I wrote the design, I started wondering whether some of the
ideas of the pypy project might apply. Indeed, the pypy compiler has
more detailed type information available to it than can any static
checker such as pylint. I won't rule that out, but I'm not sure it's
needed. I would settle for a pylint that could catch all obvious
blunders in function calls. Apparently the present pylint doesn't do
that.
Edward
> 1. One of the biggest surprises of my programming life was the
> realization that multi-pass compiler algorithms can be significantly
> *faster* than single-pass algorithms. At the time, that was a huge
> shock. Now, it seems second nature to me. So the first part of the
> strategy is that the new pylint will consist of multiple passes,
> largely independent of each other. This will have salutary effects
> everywhere.
I consider the anticipated preprocessing passes to be similar in
complexity to peephole optimizers in an optimizing compiler. I've
written them for the C language. It is amazing how effective and fast
peepholes can be.
I want to convert ast trees to trees of pylint nodes so that the
"peepholes" will have complete freedom to munge the tree as they
please.
In effect, only the peepholes will deal with the grungy details of
ast syntax and semantics. After the peepholes are done, the rest of
pylint will have the easiest-possible data to work with.
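For comparison, here is the shape of a classic peephole pass over a
flat instruction list, like the ones I wrote for C (the instruction
names are invented for illustration):

```python
def peephole(instructions):
    """Peephole sketch: delete push X / pop X pairs that cancel out.
    Each instruction is a (opcode, operand) tuple."""
    out = []
    for ins in instructions:
        if out and out[-1] == ('push', ins[1]) and ins[0] == 'pop':
            out.pop()  # 'push X' followed by 'pop X' is a no-op
        else:
            out.append(ins)
    return out
```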
And make no mistake: data, not code, is the key to simplifying a
program.
Edward
> I want to convert ast trees to trees of pylint nodes so that the
> "peepholes" will have complete freedom to munge the tree as they
> please.
I was mistaken. This does not seem to be a big issue. Most of the
checkers (files in the pylint-0.19.0\checkers directory) are
relatively straightforward. The tree interface that the checkers use
isn't complex, so the checkers would not be greatly simplified with a
simpler tree design.
I have found the rabbit hole that leads to Wonderland, namely the
following two checkers:
1. VariablesChecker.visit_name (checkers/variables.py) checks that
names are defined.
2. TypeChecker.visit_callfunc (checkers/typecheck.py) checks that
function calls make sense.
I would be willing to throw away almost everything but these two
tests.
visit_name is hairy code, but it is self-contained. See the
"_to_consume" logic. It does not use any inference machinery.
visit_callfunc checks that the called function or method is
**inferred** to be a callable object as follows:
    called = safe_infer(node.func)
    if called is not None and not called.callable():
        self.add_message('E1102', node=node,
                         args=node.func.as_string())
    # The E1102 message says <x> is not callable.
I am still mostly clueless about how the inference machinery works,
but now I have something specific to study, namely the call to
safe_infer(node.func). This is good progress.
Edward
P.S. Although the checkers themselves are relatively simple, the
inference machinery contained in the astng classes is anything but
simple. It may well be true that rewriting those classes to use
multiple passes would be a good thing. We shall see...
EKR
> the following design is active preparation for studying the pylint code.
I've just made contact with Sylvain Thénault and Emile Anclin at
Logilab. You can see our friendly correspondence at
http://lists.logilab.org/pipermail/python-projects/2010-February/thread.html
It looks like great minds are thinking alike about the internals of
pylint. This is exciting. pylint is an essential part of the Python
world, and almost any amount of work on improving it will be well
justified.
Edward
It's a start. It was nice to get off on the right foot by fixing a
bug that has discomfited several people.
Edward
What I like is watching people meet and figure out how they can help each other,
and setting up a (tentative) date in Paris is SO romantic.
:-]