What to use for finding as many syntax errors as possible.


Antoon Pardon

Oct 9, 2022, 6:09:45 AM
I would like a tool that tries to find as many syntax errors as possible
in a Python file. I know there is a risk of false positives when a
tool tries to recover from a syntax error and proceeds, but I would
prefer that over the current Python strategy of quitting after the first
syntax error. I just want a tool for syntax errors. No style
enforcement. Any recommendations?

--
Antoon Pardon

Avi Gross

Oct 9, 2022, 11:50:34 AM
Antoon,

There likely are such programs out there, but is there any universal agreement
on how to figure out where a new safe zone of code starts, so that error
checking can resume?

For example a file full of function definitions might find an error in
function 1 and try to find the end of that function and resume checking the
next function. But what if a function defines local functions within it?
What if the mistake in one line of code could still allow checking the next
line rather than skipping it all?

My guess is that finding 100 errors might turn out to be misleading. If you
fix just the first, many others would go away. If you spell a variable name
wrong when declaring it, a dozen uses of the right name may cause errors.
Should you fix the first or change all later ones?

Peter J. Holzer

Oct 9, 2022, 12:17:42 PM
On 2022-10-09 12:09:17 +0200, Antoon Pardon wrote:
> I would like a tool that tries to find as many syntax errors as possible in
> a python file. I know there is the risk of false positives when a tool tries
> to recover from a syntax error and proceeds but I would prefer that over the
> current python strategy of quitting after the first syntax error. I just want
> a tool for syntax errors. No style enforcements. Any recommendations?

There seems to have been increased interest in good error recovery over
the last few years. I thought I had bookmarked a bunch of projects, but the
only one I can find right now is Lezer
(https://marijnhaverbeke.nl/blog/lezer.html) which is part of the
CodeMirror (https://codemirror.net/) editor. Python is listed as a
currently supported language, so you might want to check that out.

Disclaimer: I haven't used CodeMirror, so I can't say anything about
its quality. The blog entry about Lezer was interesting, though.

hp

--
Peter J. Holzer | h...@hjp.at | http://www.hjp.at/
"Story must make more sense than reality." -- Charles Stross, "Creative writing challenge!"

Antoon Pardon

Oct 9, 2022, 1:00:19 PM


On 9/10/2022 at 17:49, Avi Gross wrote:
> My guess is that finding 100 errors might turn out to be misleading. If you
> fix just the first, many others would go away.

At this moment I would prefer a tool that reported 100 errors, which would
allow me to easily correct 10 real errors, over the python strategy which quits
after having found one syntax error.

--
Antoon.

Thomas Passin

Oct 9, 2022, 1:17:10 PM
https://stackoverflow.com/questions/4284313/how-can-i-check-the-syntax-of-python-script-without-executing-it

People seemed especially enthusiastic about the one-liner from jmd_dk.

Karsten Hilbert

Oct 9, 2022, 1:24:07 PM
But the point is: you can't (there is no way to) be sure the
9+ errors really are errors.

Unless you further constrain what sorts of errors you are
looking for and what margin of error or leeway for false
positives you want to allow.

Karsten
--
GPG 40BE 5B0E C98E 1713 AFA6 5BC0 3BEA AC80 7D4F C89B

Peter J. Holzer

Oct 9, 2022, 1:29:56 PM
On 2022-10-09 12:59:09 -0400, Thomas Passin wrote:
> https://stackoverflow.com/questions/4284313/how-can-i-check-the-syntax-of-python-script-without-executing-it
>
> People seemed especially enthusiastic about the one-liner from jmd_dk.

I don't think that one-liner solves Antoon's requirement of continuing
after an error. It uses just the normal python parser so it has exactly
the same limitations.

Some of the mentioned tools may do what Antoon wants, though.
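
A minimal sketch of that limitation, using only the built-in `compile()`. The sample source and the helper name `first_syntax_error` are made up for illustration. The source contains two independent syntax errors, but the standard parser raises on the first and never reaches the second.

```python
# Hypothetical sample with two independent syntax errors (lines 1 and 3).
SOURCE = "a = 1 +\nb = 2\nc = = 3\n"


def first_syntax_error(source, filename="<example>"):
    """Return the SyntaxError raised by the standard compiler, or None."""
    try:
        compile(source, filename, "exec")
    except SyntaxError as exc:
        return exc
    return None


err = first_syntax_error(SOURCE)
# Only one error is ever reported; the problem on line 3 stays hidden
# until the one on line 1 has been fixed.
print(f"line {err.lineno}: {err.msg}")
```

Any checker that is only a thin wrapper around `compile()`, `ast.parse()` or `py_compile` should behave the same way.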

Peter J. Holzer

Oct 9, 2022, 1:37:11 PM
On 2022-10-09 19:23:41 +0200, Karsten Hilbert wrote:
> On Sun, Oct 09, 2022 at 06:59:36PM +0200, Antoon Pardon wrote:
> > On 9/10/2022 at 17:49, Avi Gross wrote:
> > >My guess is that finding 100 errors might turn out to be misleading. If you
> > >fix just the first, many others would go away.
> >
> > At this moment I would prefer a tool that reported 100 errors, which would
> > allow me to easily correct 10 real errors, over the python strategy which quits
> > after having found one syntax error.
>
> But the point is: you can't (there is no way to) be sure the
> 9+ errors really are errors.

As a human who knows Python, in many cases you can be sure. Sometimes you
aren't sure; then you leave that one for the next iteration. No big
deal. This isn't the 1960s when you sent your punched cards in and got
the result back next week. So neither the parser nor you need to be
perfect. Just better than one error at a time.

Antoon Pardon

Oct 9, 2022, 1:51:39 PM


On 9/10/2022 at 19:23, Karsten Hilbert wrote:
> On Sun, Oct 09, 2022 at 06:59:36PM +0200, Antoon Pardon wrote:
>
>> On 9/10/2022 at 17:49, Avi Gross wrote:
>>> My guess is that finding 100 errors might turn out to be misleading. If you
>>> fix just the first, many others would go away.
>> At this moment I would prefer a tool that reported 100 errors, which would
>> allow me to easily correct 10 real errors, over the python strategy which quits
>> after having found one syntax error.
> But the point is: you can't (there is no way to) be sure the
> 9+ errors really are errors.
>
> Unless you further constrict what sorts of errors you are
> looking for and what margin of error or leeway for false
> positives you want to allow.

Look, when I was at university we had to program in Pascal, and
the compiler we used continued parsing until the end. Sure, there
were times when, after a number of reported errors, the number of
false positives became so high that it was useless trying to find the
remaining true ones, but it was still more efficient to correct the
obvious ones than to correct only the first one.

I don't need to be sure. Even the occasional wrong correction
is probably still more efficient than quitting after the first
syntax error.

--
Antoon.

Weatherby,Gerard

Oct 9, 2022, 2:05:06 PM
PyCharm.

Does a good job of separating "these are really errors" from "do you really mean that?" warnings and from "is this word spelled right?" hints.

https://www.jetbrains.com/pycharm/


MRAB

Oct 9, 2022, 2:51:49 PM
> is probably still more efficient than quitting after the first
> syntax error.
>
When I did some programming in COBOL, a single omitted "." would
completely confuse the compiler and it was best to fix that one error
and then try again.

On the other hand, Turbo Pascal would also stop on the first error and
put the cursor at the error position in the IDE, but as it compiled
quickly, it wasn't a problem. It was no slower than it would've been if
it had found multiple errors and you pressed a key to advance to the
next error.

Avi Gross

Oct 9, 2022, 3:18:56 PM
Antoon, it may also relate to an interpreter versus compiler issue.

Something like a compiler for C does not do anything except write code in
an assembly language. It can choose to keep going after an error and start
looking some more from a less stable place.

Interpreters for Python have to catch interrupts as they go and often run
code in small batches. Continuing to evaluate after an error could cause
weird effects.

So what you want is closer to a lint program that does not run code at all,
or merely writes pseudocode to a file to be run faster later.

Many languages now have blocks of code that are not really evaluated
until later. Some code is built on the fly. And some errors are not errors
at first. Many languages let you use a variable without declaring it first, or
allow it to change types. In some, the text is lazily evaluated as late as
possible.

I will say that often enough a program could report more possible errors.
Putting your code into multiple files and modules may mean you could
cleanly evaluate the code and return multiple errors from many modules as
long as they are distinct. Finding all errors is not possible if recovery
from one is not guaranteed.

Take a language that uses a semicolon to end a statement. If one is missing,
there would usually be some error, but it is often reported against something
on the next line. Your evaluator could do an experiment: add a semicolon and
try again. This might work 90% of the time, but sometimes the real error was
not ending the line with a backslash to make it continue properly, or an
indentation issue, or even a spelling error. No guarantees.

Is it that onerous to fix one thing and run it again? It was once when you
handed in punch cards and waited a day or on very busy machines.


Avi Gross

Oct 9, 2022, 3:45:33 PM
I will say that those of us, meaning me, who express reservations are not
arguing that it is a bad idea to get more info in one sweep. Many errors come in
bunches.

If I keep calling some function with the wrong number or type of arguments,
it may be the same in a dozen places in my code. The first error report may
make me search for the other places so I fix them all at once. Telling me
where some instances are might speed that a bit.
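
As a rough illustration of reporting every call site with the wrong number of arguments in one pass, here is a minimal sketch built on the standard `ast` module. The function name `check_call_arity` and the sample are invented for this example, and it rests on strong assumptions: the file must already parse, only plain module-level functions are considered, and any call or definition using keywords, `*args` or `**kwargs` is skipped rather than judged.

```python
import ast

# Hypothetical sample: add() accepts one or two positional arguments.
SAMPLE = """\
def add(a, b=0):
    return a + b

add(1)
add(1, 2, 3)
add()
"""


def check_call_arity(source):
    """Return sorted (lineno, name, n_args) for calls with a wrong arg count."""
    tree = ast.parse(source)
    limits = {}
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            a = node.args
            if a.vararg or a.kwarg or a.kwonlyargs:
                continue  # too flexible for this simple check
            total = len(a.posonlyargs) + len(a.args)
            limits[node.name] = (total - len(a.defaults), total)
    problems = []
    for node in ast.walk(tree):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)):
            continue
        if node.func.id not in limits or node.keywords:
            continue
        if any(isinstance(arg, ast.Starred) for arg in node.args):
            continue
        low, high = limits[node.func.id]
        if not low <= len(node.args) <= high:
            problems.append((node.lineno, node.func.id, len(node.args)))
    return sorted(problems)


for lineno, name, count in check_call_arity(SAMPLE):
    print(f"line {lineno}: {name}() called with {count} positional argument(s)")
```

Established linters perform far more thorough versions of this kind of check; the sketch only shows that, once a file parses, nothing forces a tool to stop at the first finding.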

As long as it is understood that further errors are a heuristic and
possibly misleading, fine.

But an error like setting the size of a fixed-length data structure to the
wrong size may result in oodles of errors about being out of range that
magically get fixed by one change. Sometimes too much info just gives you a
headache.

But a tool like you described could have uses even if imperfect. If you are
teaching a course and students submit programs, could you grade the one
with a single error higher than one with 5 errors shown imperfectly and
fail the one with 600?


Antoon Pardon

Oct 9, 2022, 3:46:39 PM


On 9/10/2022 at 21:18, Avi Gross wrote:
> Antoon, it may also relate to an interpreter versus compiler issue.
>
> Something like a compiler for C does not do anything except write code in
> an assembly language. It can choose to keep going after an error and start
> looking some more from a less stable place.
>
> Interpreters for Python have to catch interrupts as they go and often run
> code in small batches. Continuing to evaluate after an error could cause
> weird effects.
>
> So what you want is closer to a lint program that does not run code at all,
> or merely writes pseudocode to a file to be run faster later.

I just want a parser that doesn't give up on encountering the first syntax
error. Maybe do some semantic checking like checking the number of parameters.

> I will say that often enough a program could report more possible errors.
> Putting your code into multiple files and modules may mean you could
> cleanly evaluate the code and return multiple errors from many modules as
> long as they are distinct. Finding all errors is not possible if recovery
> from one is not guaranteed.

I don't need it to find all errors. As long as it reasonably accurately
finds a significant number of them.

> Is it that onerous to fix one thing and run it again? It was once when you
> handed in punch cards and waited a day or on very busy machines.

Yes I find it onerous, especially since I have a pipeline with unit tests
and other tools that all have to redo their work each time a bug is corrected.

--
Antoon.

Peter J. Holzer

Oct 9, 2022, 3:47:08 PM
On 2022-10-09 15:18:19 -0400, Avi Gross wrote:
> Antoon, it may also relate to an interpreter versus compiler issue.
>
> Something like a compiler for C does not do anything except write code in
> an assembly language. It can choose to keep going after an error and start
> looking some more from a less stable place.
>
> Interpreters for Python have to catch interrupts as they go and often run
> code in small batches. Continuing to evaluate after an error could cause
> weird effects.

I don't think this is really an issue. A python file is completely
compiled to byte code before execution starts.

It's true that a syntax error before an import prevents that import, but
since imports are usually at the start of a file, a syntax error will
only rarely prevent the import (and files intended to be imported
generally don't have weird side effects anyway).
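
A small sketch of the compile-before-execute behaviour, using the built-in `compile()` and `exec()`. The sample source and the `executed` list are invented for the demonstration. Line 1 of the sample is perfectly valid, yet it never runs because the syntax error on line 2 is detected while the whole source is being compiled.

```python
# Hypothetical sample: line 1 is valid, line 2 is a syntax error.
SOURCE = "executed.append('line 1 ran')\nx = = 2\n"

executed = []
try:
    # The whole source is compiled to a code object first...
    code = compile(SOURCE, "<example>", "exec")
    # ...and execution would only start here, after compilation succeeded.
    exec(code, {"executed": executed})
except SyntaxError as exc:
    print(f"SyntaxError on line {exc.lineno}; nothing was executed")

print(executed)
```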

One issue could be that compilers which generate executables are
generally thorough and slow, while the compilers which generate
byte-code for immediate consumption by an interpreter are generally
simple and fast. So there is more incentive for the former to discover
as many errors as possible and they are also better equipped to do this.

Antoon Pardon

Oct 9, 2022, 4:03:13 PM


On 9/10/2022 at 21:44, Avi Gross wrote:
> But an error like setting the size of a fixed length data structure to the
> right size may result in oodles of errors about being out of range that
> magically get fixed by one change. Sometimes too much info just gives you a
> headache.

So? The user of such a tool doesn't need to go through all the provided information.
If, after correcting a few errors, the user finds that the rest of the information gives
them a headache, they can just ignore it and run a new iteration.

--
Antoon Pardon

Barry

Oct 9, 2022, 4:10:16 PM


> On 9 Oct 2022, at 18:54, Antoon Pardon <antoon...@vub.be> wrote:
>
> 
>
> On 9/10/2022 at 19:23, Karsten Hilbert wrote:
>> On Sun, Oct 09, 2022 at 06:59:36PM +0200, Antoon Pardon wrote:
>>
>>> On 9/10/2022 at 17:49, Avi Gross wrote:
>>>> My guess is that finding 100 errors might turn out to be misleading. If you
>>>> fix just the first, many others would go away.
>>> At this moment I would prefer a tool that reported 100 errors, which would
>>> allow me to easily correct 10 real errors, over the python strategy which quits
>>> after having found one syntax error.
>> But the point is: you can't (there is no way to) be sure the
>> 9+ errors really are errors.
>>
>> Unless you further constrict what sorts of errors you are
>> looking for and what margin of error or leeway for false
>> positives you want to allow.
>
> Look when I was at the university we had to program in Pascal and
> the compiler we used continued parsing until the end. Sure there
> were times that after a number of reported errors the number of
> false positives became so high it was useless trying to find the
> remaining true ones, but it still was more efficient to correct the
> obvious ones, than to only correct the first one.

If it’s very fast to syntax check then one at a time is fine.
Python is very fast to syntax check, so I personally do not need the multi-error version.
My editor has a syntax check bound to a key, and it instantly shows me the syntax error.

Barry

Karsten Hilbert

Oct 9, 2022, 4:47:09 PM
On Sun, Oct 09, 2022 at 07:51:12PM +0200, Antoon Pardon wrote:

> >But the point is: you can't (there is no way to) be sure the
> >9+ errors really are errors.
> >
> >Unless you further constrict what sorts of errors you are
> >looking for and what margin of error or leeway for false
> >positives you want to allow.
>
> Look when I was at the university we had to program in Pascal and
> the compiler we used continued parsing until the end. Sure there
> were times that after a number of reported errors the number of
> false positives became so high it was useless trying to find the
> remaining true ones, but it still was more efficient to correct the
> obvious ones, than to only correct the first one.
>
> I don't need to be sure. Even the occasional wrong correction
> is probably still more efficient than quitting after the first
> syntax error.

A-ha, so you further defined your context.

Under which I can agree to the objective :-)

Best,

Chris Angelico

Oct 9, 2022, 6:24:03 PM
On Mon, 10 Oct 2022 at 06:50, Antoon Pardon <antoon...@vub.be> wrote:
> I just want a parser that doesn't give up on encountering the first syntax
> error. Maybe do some semantic checking like checking the number of parameters.

That doesn't make sense though. It's one thing to keep going after
finding a non-syntactic error, but an error of syntax *by definition*
makes parsing the rest of the file dubious. What would it even *mean*
to not give up? How should it interpret the following lines of code?
All it can do is report the error.

You know, if you'd not made this thread, the time you saved would have
been enough for quite a few iterations of "fix one syntactic error,
run it again to find the next".

ChrisA

Cameron Simpson

Oct 9, 2022, 6:45:46 PM
On 09Oct2022 21:46, Antoon Pardon <antoon...@vub.be> wrote:
>>Is it that onerous to fix one thing and run it again? It was once when
>>you
>>handed in punch cards and waited a day or on very busy machines.
>
>Yes I find it onerous, especially since I have a pipeline with unit tests
>and other tools that all have to redo their work each time a bug is
>corrected.

It is easy to get the syntax right before submitting to such a pipeline.
I usually run a linter on my code for serious commits, and I've got a
`lint1` alias which basically runs the short, fast flavour of that: a
syntax check plus the very fast, less thorough lint phase.

I say this just to ease your write/run-tests cycle.

Regarding your main request, had you considered writing your own wrapper
tool? Something which ran something like:

python -We:invalid -m py_compile your_python_file.py

If there's an error, report it, then make a new file commencing with the
next unindented line after the error, with all preceding lines
commented out (to keep the line numbers the same). Then run the check
again. Repeat until the file's empty or there are no errors.

This doesn't sound very complex.
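
One way the loop described above might look, as a minimal in-memory sketch rather than a finished tool. It calls the built-in `compile()` instead of spawning `python -m py_compile`, and the names `find_syntax_errors` and `_is_unindented` are invented for this example. Later reports can be false positives, for instance when the "next unindented line" is an `else:` or a closing bracket that belonged to the commented-out statement.

```python
# Hypothetical sample with three independent syntax errors.
SAMPLE = """\
def one():
    return 1 +

def two()
    return 2

def three():
    return 3

x = = 4
"""


def _is_unindented(line):
    """True for a non-blank line that starts in column 0."""
    return bool(line.strip()) and not line[0].isspace()


def find_syntax_errors(source, filename="<source>", max_errors=50):
    """Return (lineno, message) pairs, resuming after each syntax error."""
    lines = source.splitlines()
    errors = []
    done = 0  # number of leading lines already commented out
    while done < len(lines) and len(errors) < max_errors:
        try:
            compile("\n".join(lines) + "\n", filename, "exec")
        except SyntaxError as exc:
            # Clamp the line number so that every round makes progress.
            lineno = max(exc.lineno or 0, done + 1)
            errors.append((lineno, exc.msg))
            # Find the next unindented line after the error line.
            resume = lineno  # 0-based index of the line after the error
            while resume < len(lines) and not _is_unindented(lines[resume]):
                resume += 1
            # Comment out everything before it; line numbers stay the same.
            for i in range(done, min(resume, len(lines))):
                lines[i] = "# " + lines[i]
            done = resume
        else:
            break  # the remainder compiles cleanly
    return errors


for lineno, msg in find_syntax_errors(SAMPLE):
    print(f"line {lineno}: {msg}")
```

On the sample this should report lines 2, 4 and 10 in a single run; the message text differs between Python versions.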

Cheers,
Cameron Simpson <c...@cskk.id.au>

Thomas Passin

Oct 9, 2022, 9:13:44 PM

On 10/9/2022 1:29 PM, Peter J. Holzer wrote:
> On 2022-10-09 12:59:09 -0400, Thomas Passin wrote:
>> https://stackoverflow.com/questions/4284313/how-can-i-check-the-syntax-of-python-script-without-executing-it
>>
>> People seemed especially enthusiastic about the one-liner from jmd_dk.
>
> I don't think that one-liner solves Antoon's requirement of continuing
> after an error. It uses just the normal python parser so it has exactly
> the same limitations.

Yes, of course. Interesting, though. py_compile tends to be what I use
for a quick check. I linked to the page mostly for the other
possibilities, as you mentioned below:

avi.e...@gmail.com

Oct 10, 2022, 12:41:55 AM
Cameron,

Your suggestion makes me shudder!

Removing all earlier lines of code is often guaranteed to generate errors as
variables you are using are not declared or initiated, modules are not
imported and so on.

Removing just the line or three where the previous error happened would also
have a good chance of invalidating something.

Someone who really wants to be able to isolate large parts of their code so
that an error in one does not compromise lots of remaining code, might
build their code in small units on the level of single functions per file
and do lots of imports. They can then ask for all the files to be
pseudo-compiled to byte-code and that might provide lots of errors to look
at in one pass.
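
A minimal sketch of that one-error-per-unit idea. The helper `check_units` and the file names are hypothetical, and the sources are kept in a dictionary so the example is self-contained; a real run would read each file from disk. Each unit is compiled independently, so a syntax error in one does not hide an error in another.

```python
# Hypothetical small units, keyed by an invented file name.
UNITS = {
    "load.py": "def load(path):\n    return open(path).read()\n",
    "parse.py": "def parse(text:\n    return text.split()\n",
    "report.py": "total = = 0\n",
}


def check_units(units):
    """Compile each unit separately; return (name, lineno, message) per failure."""
    problems = []
    for name, source in units.items():
        try:
            compile(source, name, "exec")
        except SyntaxError as exc:
            # Still only the first error of each unit is reported.
            problems.append((name, exc.lineno, exc.msg))
    return problems


for name, lineno, msg in check_units(UNITS):
    print(f"{name}:{lineno}: {msg}")
```

The standard library's `compileall` module does something similar for directory trees: it keeps going after a file fails to compile and reports each failing file.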

But asking for a one-file version to find errors and somehow go past them
and look for more is more daunting but of course can be done with partial
accuracy and usefulness at best.

As an analogy, if tolerated, think of a spell-checker on a document that can
find oodles of words spelled wrong. Unfortunately, a spell corrector can
drive us nuts if it knows little about context. If it sees a word like
"reid" should it just change it to "read" or "red" or perhaps "reed" or look
to see if the real problem is it is supposed to be unified (no space) with a
word before or after? Will it know if the word appears in a context where a
language like Latin or French or German or Hungarian is being quoted and
perhaps it is spelled right, or if wrong, has other more likely corrections?

Now if you add a grammar detector, and it knows you are looking for an
adjective or a verb or a noun, it may do better.

I use Google translate quite a bit as a tool as I often have to type in
various languages and it provides a handy keyboard or lets me check if I
used the right grammar especially in languages with silly ideas that objects
can have 2 or even three genders. So putting in phrases like "this xyz" can
result in language-specific text that tells me if it is masculine or
feminine or perhaps neuter. But the reason I mention it is how often it is
WRONG. I mean many languages have multiple words that are spelled the same
but used and pronounced differently in various contexts. The English word
"read" can sound like "reed" or like "red", so the past tense sounds different, as in
"I read that book last week" versus "please read it to me now". But some
languages such as Hebrew, which generally may not show the vowels, can get
totally confused in this program, as humans often need lots of context to
figure out whether the current short word is in a context where it means
"you" (feminine and singular) and is pronounced "aht", or is a way of showing
that what follows is a direct object, loosely means "the" in a redundant way,
and is pronounced "eht". Quite a few words have three or more possible
ways to pronounce the same letters and without vowel guides need context and
sometimes some spreadsheet-like ingenuity as multiple other words are also
in limbo and once resolved can impact what other words may now mean.
Obviously adding back the vowels makes things clear so people who are used
to seeing books written that old way can get hopelessly lost reading a
modern newspaper.

End of digression, just assume I could have gone on for many pages
describing my annoyances at what Google translate does to many other
languages that show the imperfections in what is really a great and powerful
tool.

Well, parsing any program in most languages can be equally complex and
require lots of context. For example, you can often use the same identifier
as the name of a regular variable or the name of a function, and sometimes
other things such as the name of a module. They can often be disambiguated
in context. Perhaps a name followed by parentheses should be a
function call, while a name followed by :: or ::: might in that language
be required to be the name of a module/package. If followed by [ it might
need to be something indexable such as an array or list, and so on. So say
there is an error in the variable. Can the interpreter or linter figure out
what the error is and almost repair it? Can it see a variable name like
"alpXha" and note there is no such identifier in the current namespace but
there is one called "alpha" that might be the one without the X? But what if
what is missing is an open paren or maybe the matching close paren? Does it
know if the problem is a bad variable name or a bad function invocation or
one of many other possible problems? Code with a random blemish is often not
easily figured out. If I type the name of a function without parentheses, it
could be an attempt to call the function with no arguments (an error, though,
in many languages) or it could be that I want to pass the function itself as an
argument in functional programming. But if I have another variable of type
array, might it not be parentheses missing but square brackets?

The compiler or interpreter often cannot fix it, so it often tries to skip
forward till it finds something unambiguous that marks the beginning of a new
section. That might be something like an unquoted semicolon at the end of a
line or a matching close bracket. Depending on such choices, again, varying
amounts of the program may be ignored in evaluating what follows. But this
is not the same as a human speedreading or daydreaming who misses a bit here
and there and just hopes it was not crucial and that what follows probably
remains worthy and valid. I have sometimes missed something like a name and
then seen pages of pronouns like "she" and eventually give up as no more
hints arrive and I have to go back or ask someone lest a big bunch of the
text makes no sense to me.

Someone is wanting to treat code from a spelling checker perspective and
wants all possible mistakes thrown at them at once. As I pointed out, in
real life many kinds of context can matter and a really good checker might
even consult a personal list of words it has learned you want ignored, like
people's names or some abbreviations like LOL. It may even read marked-up
text in say HTML or XML or similar formats that is marked with the language
they supposedly contain and calls up a spell-checker appropriate for each
region.

But if they want a really intelligent program that recovers enough from
errors to reliably continue, maybe not easy.

They have explained and amended that they understand some of these issues
and are willing to get lots of false positives or red herrings, and that their
real goal is to have a chance to detect and maybe fix a few things per round
rather than just one. Not a bad wish. Just not a trivial wish to grant and
satisfy.

Cameron Simpson

Oct 10, 2022, 2:34:17 AM
On 10Oct2022 00:41, avi.e...@gmail.com <avi.e...@gmail.com> wrote:
>Your suggestion makes me shudder!

And fair enough too. I don't do this for me, I'm just suggesting an
approach which might bring something to Antoon's objective.

>Removing all earlier lines of code is often guaranteed to generate errors as
>variables you are using are not declared or initiated, modules are not
>imported and so on.

Antoon's interested in syntax errors.

>Removing just the line or three where the previous error happened would also
>have a good chance of invalidating something.

Doubtless. He accepts that any such resume-the-parse can bring
misleading error messages. Antoon is not expecting magic, just getting
several complaints instead of just the first syntax error.

I must admit I sympathise a bit, as one of my own major irks is command
line tools which moan about the first bad option instead of noting it
and moving on to complain about other things as well, then quitting
after the command line parse. Pure laziness a lot of the time IMO; I've
done it myself, but do like to make multiple complaints when it's
feasible.

Cheers,
Cameron Simpson <c...@cskk.id.au>

Antoon Pardon

Oct 10, 2022, 3:04:47 AM


On 10/10/2022 at 00:45, Cameron Simpson wrote:
> On 09Oct2022 21:46, Antoon Pardon <antoon...@vub.be> wrote:
>>> Is it that onerous to fix one thing and run it again? It was once
>>> when you
>>> handed in punch cards and waited a day or on very busy machines.
>>
>> Yes I find it onerous, especially since I have a pipeline with unit
>> tests
>> and other tools that all have to redo their work each time a bug is
>> corrected.
>
> It is easy to get the syntax right before submitting to such a
> pipeline.  I usually run a linter on my code for serious commits, and
> I've got a `lint1` alias which basically runs the short fast flavour of
> that which does a syntax check and the very fast less thorough lint phase.

If you have a linter that doesn't quit after the first syntax error,
please provide a link. I already tried pylint and it also quits after
the first syntax error.

--
Antoon Pardon

Cameron Simpson

Oct 10, 2022, 4:51:07 AM
On 10Oct2022 09:04, Antoon Pardon <antoon...@vub.be> wrote:
>>It is easy to get the syntax right before submitting to such a
>>pipeline.  I usually run a linter on my code for serious commits, and
>>I've got a `lint1` alias which basically runs the short fast flavour of
>>that which does a syntax check and the very fast less thorough lint
>>phase.
>
>If you have a linter that doesn't quit after the first syntax error,
>please provide a link. I already tried pylint and it also quits after
>the first syntax error.

I don't have such a linter. I did outline an approach for you to write
one of your own by wrapping an existing parser program.

I have a personal "lint" script which runs a few linters. The first
check is `py_compile` which quits at the first syntax error. The other
linters are not even tried if that fails.
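
A sketch of that kind of gate, assuming nothing about the actual script. The function name `run_lint` and the idea of passing linters in as Python callables are invented for this example. `py_compile.compile(..., doraise=True)` is the standard-library call that raises `PyCompileError` on a syntax error; note that on success it also writes a `.pyc` file under `__pycache__`.

```python
import py_compile


def run_lint(path, linters=()):
    """Syntax-check `path` first; run the other linters only if that passes.

    `linters` is a sequence of callables taking a path and returning a list
    of report strings (a hypothetical interface for this sketch).
    """
    try:
        py_compile.compile(path, doraise=True)
    except py_compile.PyCompileError as exc:
        # Stop at the first syntax error; the slower linters are skipped.
        return [f"syntax: {exc.msg.strip()}"]
    reports = []
    for linter in linters:
        reports.extend(linter(path))
    return reports
```

Calling `run_lint("module.py", [some_linter])` would then return either a single syntax report or the combined linter reports, where `module.py` and `some_linter` are placeholders.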

I do not know what your editing environment is; I'd have thought that
some IDEs should make the first syntax error very obvious and easy to go
to, and give an obvious indication of whether the file as a whole is syntactically
good/bad. If you have such, between them you could fairly easily resolve
syntax errors rapidly, perhaps rapidly enough to make up for a
stop-at-the-first-fail syntax check.

Cheers,
Cameron Simpson <c...@cskk.id.au>

Michael F. Stemper

Oct 10, 2022, 9:22:16 AM
On 09/10/2022 10.49, Avi Gross wrote:
> Anton
>
> There likely are such programs out there but are there universal agreements
> on how to figure out when a new safe zone of code starts where error
> testing can begin?
>
> For example a file full of function definitions might find an error in
> function 1 and try to find the end of that function and resume checking the
> next function. But what if a function defines local functions within it?
> What if the mistake in one line of code could still allow checking the next
> line rather than skipping it all?
>
> My guess is that finding 100 errors might turn out to be misleading. If you
> fix just the first, many others would go away. If you spell a variable name
> wrong when declaring it, a dozen uses of the right name may cause errors.
> Should you fix the first or change all later ones?

How does one declare a variable in python? Sometimes it'd be nice to
be able to have declarations and any undeclared variable be flagged.

When I was writing F77 for a living, I'd (temporarily) put:
IMPLICIT CHARACTER*3
at the beginning of a program or subroutine that I was modifying,
in order to have any typos flagged.

I'd love it if there was something similar that I could do in python.

--
Michael F. Stemper
87.3% of all statistics are made up by the person giving them.

Robert Latest

Oct 10, 2022, 1:06:58 PM
Michael F. Stemper wrote:
> How does one declare a variable in python? Sometimes it'd be nice to
> be able to have declarations and any undeclared variable be flagged.

To my knowledge, the closest to that is using __slots__ in class definitions.
Many a time have I assigned to misspelled class members until I discovered
__slots__.
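A small illustration of what `__slots__` does and does not catch. The class `Point` and its attribute names are invented for this sketch. The check happens at run time, and only for instance attributes, so it is not a declaration mechanism for ordinary variables.

```python
class Point:
    # Only these attribute names may be assigned on instances.
    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)
try:
    p.z = 5  # a misspelled or unlisted attribute name
except AttributeError as exc:
    slot_error = str(exc)
else:
    slot_error = None
```

Without `__slots__`, the assignment to `p.z` would silently create a new attribute.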

Robert Latest

Oct 10, 2022, 1:08:52 PM
Antoon Pardon wrote:
> I would like a tool that tries to find as many syntax errors as possible
> in a python file.

I'm puzzled as to when such a tool would be needed. How many syntax errors can
you realistically put into a single Python file before compiling it for the
first time?

Robert Latest

Oct 10, 2022, 1:14:32 PM
<avi.e...@gmail.com> wrote:
> Cameron,
>
> Your suggestion makes me shudder!

Me, too

> Removing all earlier lines of code is often guaranteed to generate errors as
> variables you are using are not declared or initiated, modules are not
> imported and so on.

all of which aren't syntax errors, so the method should still work. Ugly as
hell though. I can't think of a reason to want to find multiple syntax errors
in a file.

Peter J. Holzer

Oct 10, 2022, 3:33:20 PM
On 2022-10-10 09:23:27 +1100, Chris Angelico wrote:
> On Mon, 10 Oct 2022 at 06:50, Antoon Pardon <antoon...@vub.be> wrote:
> > I just want a parser that doesn't give up on encountering the first syntax
> > error. Maybe do some semantic checking like checking the number of parameters.
>
> That doesn't make sense though.

I think you disagree with most compiler authors here.

> It's one thing to keep going after finding a non-syntactic error, but
> an error of syntax *by definition* makes parsing the rest of the file
> dubious.

Dubious but still useful.

> What would it even *mean* to not give up?

Read the blog post on Lezer for some ideas:
https://marijnhaverbeke.nl/blog/lezer.html

This is in the context of an editor. But the same problem applies to
compilers. It's not very important if a compile run only takes a second
or so but even then it might be helpful to see several error messages
and not only one at a time. It becomes much more important as compile
times get longer (as an extreme[1] example, when I worked on a largeish
cobol program in the 1980s, compiling the thing took about half an hour.
I really wanted to fix *everything* before starting the compiler again.)

Marijn isn't the only person who revisited this problem recently[2].
I've read a few other blog posts and papers on that topic at about the
same time.

hp

[1] Yes, there are programs where a full compile takes much longer than
that. But you can usually get away with recompiling only a small
part, so you don't have to wait that long during normal development.
That cobol compiler couldn't do that.

[2] "Recently" means "in the last 10 years or so".

Chris Angelico

Oct 10, 2022, 5:02:51 PM
On Tue, 11 Oct 2022 at 06:34, Peter J. Holzer <hjp-p...@hjp.at> wrote:
>
> On 2022-10-10 09:23:27 +1100, Chris Angelico wrote:
> > On Mon, 10 Oct 2022 at 06:50, Antoon Pardon <antoon...@vub.be> wrote:
> > > I just want a parser that doesn't give up on encountering the first syntax
> > > error. Maybe do some semantic checking like checking the number of parameters.
> >
> > That doesn't make sense though.
>
> I think you disagree with most compiler authors here.
>
> > It's one thing to keep going after finding a non-syntactic error, but
> > an error of syntax *by definition* makes parsing the rest of the file
> > dubious.
>
> Dubious but still useful.

There's a huge difference between non-fatal errors and syntactic
errors. The OP wants the parser to magically skip over a fundamental
syntactic error and still parse everything else correctly. That's
never going to work perfectly, and the OP is surprised at this.

> > What would it even *mean* to not give up?
>
> Read the blog post on Lezer for some ideas:
> https://marijnhaverbeke.nl/blog/lezer.html
>
> This is in the context of an editor.

Incidentally, that's actually where I would expect to see that kind of
feature show up the most - syntax highlighters will often be designed
to "carry on, somehow" after a syntax error, even though it often
won't make any sense (just look at what happens to your code
highlighting when you omit a quote character). It still won't always
be any use, but you do see *some* attempt at it.

But if the OP would be satisfied with that, I rather doubt that this
thread would even have happened. Unless, of course, the OP still lives
in the dark ages when no text editor available had any suitable
features for code highlighting.

ChrisA

Cameron Simpson

Oct 10, 2022, 6:17:41 PM
On 11Oct2022 08:02, Chris Angelico <ros...@gmail.com> wrote:
>There's a huge difference between non-fatal errors and syntactic
>errors. The OP wants the parser to magically skip over a fundamental
>syntactic error and still parse everything else correctly. That's
>never going to work perfectly, and the OP is surprised at this.

The OP is not surprised by this, and explicitly expressed awareness that
resuming a parse had potential for "misparsing" further code.

I remain of the opinion that one could resume a parse at the next
unindented line and get reasonable results a lot of the time.

In fact, I expect that one could resume tokenising at almost any line
which didn't seem to be inside a string and often get reasonable
results.

I grew up with C and Pascal compilers which would _happily_ produce many
complaints, usually accurate, and all manner of syntactic errors. They
didn't stop at the first syntax error.

All you need in principle is a parser which goes "report syntax error
here, continue assuming <some state>". For Python that might mean
"pretend a missing final colon" or "close open brackets" etc, depending
on the context. If you make conservative implied corrections you can get
a reasonable continued parse, enough to find further syntax errors.
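A rough sketch of the "resume at the next unindented line" idea, under loud assumptions: the names `starts_new_chunk` and `find_syntax_errors` are hypothetical, the splitting rule is a heuristic, and it will misfire on multi-line strings or bracketed expressions whose continuation lines start in column 0. It is not an error-recovering parser; it only parses each top-level chunk separately with `ast.parse`.

```python
import ast
import re

# Unindented lines starting with these words continue an earlier statement.
CONTINUATION_WORDS = {"elif", "else", "except", "finally"}


def starts_new_chunk(line, previous_line):
    """Guess whether an unindented line begins a new top-level statement."""
    if not line.strip() or line[0] in " \t#)]}":
        return False
    if previous_line.lstrip().startswith("@"):
        return False  # keep decorators attached to their def/class
    word = re.match(r"[A-Za-z_]\w*", line)
    return not (word and word.group() in CONTINUATION_WORDS)


def find_syntax_errors(source):
    """Return (line_number, message) for each chunk that fails to parse."""
    chunks, previous = [], ""
    for number, line in enumerate(source.splitlines(keepends=True), start=1):
        if not chunks or starts_new_chunk(line, previous):
            chunks.append((number, []))
        chunks[-1][1].append(line)
        previous = line
    errors = []
    for first_line, lines in chunks:
        try:
            ast.parse("".join(lines))
        except SyntaxError as exc:
            # exc.lineno is relative to the chunk; shift it back to the file.
            errors.append((first_line + (exc.lineno or 1) - 1, exc.msg))
    return errors


sample = (
    "def a():\n"
    "    return (1\n"   # unclosed parenthesis
    "\n"
    "def b():\n"
    "    return 2\n"
    "\n"
    "def c()\n"         # missing colon
    "    return 3\n"
)
errors = find_syntax_errors(sample)  # one entry for a(), one for c()
```

Parsing `sample` as a whole would report only the first problem; the chunked version reports both, at the cost of possible false positives when the split guess is wrong.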

I remember the Pascal compiler in particular had a really good "you
missed a semicolon _back there_" mode which was almost always correct, a
nice boon when correcting mistakes.

Cheers,
Cameron Simpson <c...@cskk.id.au>

Cameron Simpson

Oct 10, 2022, 6:22:54 PM
On 09/10/2022 10.49, Avi Gross wrote:
>>My guess is that finding 100 errors might turn out to be misleading.
>>If you
>>fix just the first, many others would go away. If you spell a variable name
>>wrong when declaring it, a dozen uses of the right name may cause errors.
>>Should you fix the first or change all later ones?

Just to this, these are semantic errors, not syntax errors. Linters do
an ok job of spotting these. Antoon is after _syntax errors_.

On 10Oct2022 08:21, Michael F. Stemper <michael...@gmail.com> wrote:
>How does one declare a variable in python? Sometimes it'd be nice to
>be able to have declarations and any undeclared variable be flagged.

Linters do pretty well at this. They can trace names and their use
compared to their first definition/assignment (often - there are of
course some constructs which are correct but unclear to a static
analysis - certainly one of my linters occasionally says "possible
undefine use" to me because there may be a path to use before set). This
is particularly handy for typos, which often make for "use before set"
or "set and not used".

>I'd love it if there was something similar that I could do in python.

Have you used any lint programmes? My "lint" script runs pyflakes and
pylint.

Cheers,
Cameron Simpson <c...@cskk.id.au>

Chris Angelico

Oct 10, 2022, 6:48:26 PM
On Tue, 11 Oct 2022 at 09:18, Cameron Simpson <c...@cskk.id.au> wrote:
>
> On 11Oct2022 08:02, Chris Angelico <ros...@gmail.com> wrote:
> >There's a huge difference between non-fatal errors and syntactic
> >errors. The OP wants the parser to magically skip over a fundamental
> >syntactic error and still parse everything else correctly. That's
> >never going to work perfectly, and the OP is surprised at this.
>
> The OP is not surprised by this, and explicitly expressed awareness that
> resuming a parse had potential for "misparsing" further code.
>
> I remain of the opinion that one could resume a parse at the next
> unindented line and get reasonable results a lot of the time.

The next line at the same indentation level as the line with the
error, or the next flush-left line? Either way, there's a weird and
arbitrary gap before you start parsing again, and you still have no
indication of what could make sense. Consider:

if condition # no colon
    code
else:
    code

To actually "restart" parsing, you have to make a guess of some sort.
Maybe you can figure out what the user meant to do, and parse
accordingly; but if that's the case, keep going immediately, don't
wait for an unindented line. If you wait for a blank line followed by
an unindented line, that might help with a notion of "next logical
unit of code", but it's very much dependent on the coding style, and
if you have a codebase that's so full of syntax errors that you
actually want to see more than one, you probably don't have a codebase
with pristine and beautiful code layout.

> In fact, I expect that one could resume tokenising at almost any line
> which didn't seem to be inside a string and often get reasonable
> results.

"Seem to be"? On what basis?

> I grew up with C and Pascal compilers which would _happily_ produce many
> complaints, usually accurate, and all manner of syntactic errors. They
> didn't stop at the first syntax error.

Yes, because they work with a much simpler grammar. But even then,
most syntactic errors (again, this is not to be confused with semantic
errors - if you say "char *x = 1.234;" then there's no parsing
ambiguity but it's not going to compile) cause a fair degree of
nonsense afterwards.

The waters are a bit muddied by some things being called "syntax
errors" when they're actually nothing at all to do with the parser.
For instance:

>>> def f():
...     await q
...
  File "<stdin>", line 2
SyntaxError: 'await' outside async function

This is not what I'm talking about; there's no parsing ambiguity here,
and therefore no difficulty whatsoever in carrying on with the
parsing. You could ast.parse() this code without an error. But
resuming after a parsing error is fundamentally difficult, impossible
without guesswork.
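The distinction can be checked directly. In this sketch the parser accepts the `await` example and the `SyntaxError` only appears when the tree is compiled; the filename `"<example>"` is a placeholder.

```python
import ast

source = "def f():\n    await q\n"

# Parsing succeeds: the text matches the grammar.
tree = ast.parse(source)
parsed_ok = isinstance(tree, ast.Module)

# The complaint comes from the compiler, after parsing has finished.
try:
    compile(tree, "<example>", "exec")
except SyntaxError as exc:
    compile_error = exc.msg
else:
    compile_error = None
```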

> All you need in principle is a parser which goes "report syntax error
> here, continue assuming <some state>". For Python that might mean
> "pretend a missing final colon" or "close open brackets" etc, depending
> on the context. If you make conservative implied corrections you can get
> a reasonable continued parse, enough to find further syntax errors.

And, more likely, you'll generate a lot of nonsense. Take something like this:

items = [
    item[1],
    item2],
    item[3],
]

As a human, you can easily see what the problem is. Try teaching a
parser how to handle this. Most likely, you'll generate a spurious
error - maybe the indentation, maybe the intended end of the list -
but there's really only one error here. Reporting multiple errors
isn't actually going to be at all helpful.

> I remember the Pascal compiler in particular had a really good "you
> missed a semicolon _back there_" mode which was almost always correct, a
> nice boon when correcting mistakes.
>

Ahh yes. Design a language with strict syntactic requirements, and
it's not too hard to find where the programmer has omitted them. Thing
is.... Python just doesn't HAVE those semicolons. Let's say that a
variant Python required you to put a U+251C ├ at the start of every
statement, and U+2524 ┤ at the end of the statement. A whole lot of
classes of error would be extremely easy to notice and correct, and
thus you could resume parsing; but that isn't benefiting the
programmer any. When you don't have that kind of information
duplication, it's a lot harder to figure out how to cheat the fix and
go back to parsing.

ChrisA

Thomas Passin

Oct 10, 2022, 7:42:51 PM
The Leo editor (https://github.com/leo-editor/leo-editor) will notify
you of undeclared variables (and some syntax errors) each time you save
your (Python) file.

avi.e...@gmail.com

Oct 10, 2022, 10:09:36 PM
Michael,

A reasonable question. Python lets you initialize variables but has no
explicit declarations. Languages differ and I juggle attributes of many in
my mind and am reacting to the original question NOT about whether and how
Python should report many possible errors all at once but how ANY language
can be expected to do this well. Many others do have a variable declaration
phase or an optional declaration or perhaps just a need to declare a
function prototype so it can be used by others even if the formal function
creation will happen later in the code.

But what I meant in a Python context was something like this:

Wronk = who cares # this should fail
...
if (Wronk > 5): ...
...
Wronger = Wronk + 1
...
X = minimum(Wronk, Wronger, 12)

The first line does not parse well so you have an error. But in any case as
the line makes no sense, Wronk is not initialized to anything. Later code
may use it in various ways and some of those may be seen as errors for an
assortment of reasons, then at one point the code does provide a value for
Wronk and suddenly code beyond that has no seeming errors. The above
examples are not meant to be real but just give a taste that programs with
holes in them for any reason may not be consistent. The only relatively
guaranteed test for sanity has to start at the top and encounter no errors
or missing parts caused by anything such as I/O errors.

And I suggest there are some things sort of declared in python such as:

import numpy as np

Yes, that brings in code from a module if it works and initializes a
variable called np to sort of point at the module or its namespace or
whatever, depending on the language. It is an assignment but also a way to
let the program know things. If the above is:

import grumpy as np

Then what happens if the code tries to find a file named "grumpy" somewhere
and cannot locate it and this is considered a syntax error rather than a
run-time error for whatever reason? Can you continue when all kinds of
functionality is missing and code asking to make a np.array([1,2,3]) clearly
fails?

Many of us here are talking past each other.

Yes, it would be nice to get lots of info and arguably we may eventually
have machine-learning or AI programs a bit more like SPAM detectors that
look for patterns commonly found and try to fix your program from common
errors or at least do a temporary patch so they can continue searching for
more errors. This could result in the best case in guessing right every
time. If you allowed it to actually fix your code, it might be like people
who let their spelling be corrected and do not proofread properly and send
out something embarrassing or just plain wrong!

And it will compile or be interpreted without complaint albeit not do
exactly what it is supposed to!





Chris Angelico

Oct 10, 2022, 10:42:14 PM
On Tue, 11 Oct 2022 at 13:10, <avi.e...@gmail.com> wrote:
> If the above is:
>
> import grumpy as np
>
> Then what happens if the code tries to find a file named "grumpy" somewhere
> and cannot locate it and this is considered a syntax error rather than a
> run-time error for whatever reason? Can you continue when all kinds of
> functionality is missing and code asking to make a np.array([1,2,3]) clearly
> fails?

That's not a syntax error. Syntax is VERY specific. It is an error in
Python to attempt to add 1 to "one", it is an error to attempt to look
up the upper() method on None, it is an error to try to use a local
variable you haven't assigned to yet, and it is an error to open a
file that doesn't exist. But not one of these is a *syntax* error.

Syntax errors are detected at the parsing stage, before any code gets
run. The vast majority of syntax errors are grammar errors, where the
code doesn't align with the parseable text of a Python program.
(Non-grammatical parsing errors include using a "nonlocal" statement
with a name that isn't found in any surrounding scope, using "await"
in a non-async function, and attempting to import braces from the
future.)
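To make the distinction concrete, here is a sketch in which each of the four run-time mistakes listed above compiles without complaint and only fails when executed. The snippet labels and the nonexistent path are invented for illustration.

```python
snippets = {
    "add 1 to a string": '1 + "one"',
    "method lookup on None": "None.upper()",
    "unassigned local": "def f():\n    print(x)\n    x = 1\nf()",
    "missing file": 'open("/no/such/dir/no-such-file.txt")',
}

results = {}
for label, source in snippets.items():
    # compile() raises no SyntaxError: all four are grammatical.
    code = compile(source, "<example>", "exec")
    try:
        exec(code, {})
    except Exception as exc:
        results[label] = type(exc).__name__

# results maps the labels to TypeError, AttributeError,
# UnboundLocalError and FileNotFoundError respectively.
```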

ChrisA

avi.e...@gmail.com

Oct 10, 2022, 11:12:00 PM
Cameron, or OP if you prefer,

I think by now you have seen a suggestion that languages make choices and
highly structured ones can be easier to "recover" from errors and try to
continue than some with way more complex possibilities that look rather
unstructured.

What is the error in code like this?

A,b,c,d = 1,2,

Or is it an error at all?
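For Python specifically, the question can be answered by experiment. In this sketch the line parses (the trailing comma simply makes a two-element tuple) and the failure only shows up when the assignment runs.

```python
import ast

source = "A,b,c,d = 1,2,\n"

# No SyntaxError: the statement is grammatical.
tree = ast.parse(source)

try:
    exec(compile(tree, "<example>", "exec"), {})
except ValueError as exc:
    runtime_error = str(exc)  # not enough values to unpack (expected 4, got 2)
else:
    runtime_error = None
```

So in Python this is a run-time `ValueError`, not a syntax error.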

Many languages have no concept of doing anything like the above and some
tolerate a trailing comma and some set anything not found to some form of
NULL or uninitialized and some ...

If you look at human language, some are fairly simple and some are way too
organized. But in a way it can make sense. Languages with gender will often
ask you to change the spelling and often how you pronounce things not only
based on whether a noun is male/female or even neuter but also insist you
change the form of verbs or adjectives and so on that in effect give
multiple signals that all have to line up to make a valid and understandable
sentence. Heck, in conversations, people can often leave out parts of a
sentence such as whether you are talking about "I" or "you" or "she" or "we"
because the rest of the words in the sentence redundantly force only one
choice to be possible.

So some such annoying grammars (in my opinion) are error
detection/correction codes in disguise. In days before microphones and
speakers, it was common to not hear people well, like on a stage a hundred
feet away with other ambient noises. Missing a word or two might still allow
you to get the point as other parts of the sentence supplied such redundancies.
Many languages have similar strictures letting you know multiple times if
something is singular or plural. And I think another reason was what I call
stranger detection. People who learn some vocabulary might still not speak
correctly and be identifiable as strangers, as in spies.

Do we need this in the modern age? Who knows! But it makes me prefer some
languages over others albeit other reasons may ...

With the internet today, we are used to expecting error correction to come
for free. Do you really need one of every 8 bits to be a parity bit, which
only catches maybe half of the errors, when the internals of your computer are
relatively error free and even the outside is protected by things like
various protocols used in making and examining packets and demanding some be
sent again if some checksum does not match? Tons of checking is built in so
at your level you rarely think about it. If you get a message, it usually is
either 99.9999% accurate, or you do not have it shown to you at all. I am
not talking about SPAM but about errors of transmission.

So my analogies are that if you want a very highly structured language that
can recover somewhat from errors, Python may not be it.

And over the years as features are added or modified, the structure tends to
get more complex. And R is not alone. Many surviving languages continue to
evolve and borrow from each other and any program that you run today that
could partially recover and produce pages of possible errors, may blow up
when new features are introduced.

And with UNICODE, the number of possible "errors" in what is placed in code
for languages like Julia that allow them in most places ...



avi.e...@gmail.com

Oct 10, 2022, 11:25:09 PM
I stand corrected Chris, and others, as I pay the sin tax.

Yes, there are many kinds of errors that logically fall into different
categories or phases of evaluation of a program and some can be determined
by a more static analysis almost on a line by line (or "statement" or
"expression", ...) basis and others need to sort of simulate some things
and look back and forth to detect possible incompatibilities and yet others
can only be detected at run time and likely way more categories depending on
the language.

But when I run the Python interpreter on code, aren't many such phases done
interleaved and at once as various segments of code are parsed and examined
and perhaps compiled into block code and eventually executed?

So is the OP asking for something other than a Python Interpreter that
normally halts after some kind of error? Tools like a linter may indeed fit
that mold.

This may limit some of the objections of when an error makes it hard for the
parser to find some recovery point to continue from as no code is being run
and no harmful side effects happen by continuing just an analysis.

Time to go read some books about modern ways to evaluate a language based on
more mathematical rules including more precisely what is syntax versus ...

Suggestions?


Chris Angelico

Oct 10, 2022, 11:34:31 PM
On Tue, 11 Oct 2022 at 14:13, <avi.e...@gmail.com> wrote:
> With the internet today, we are used to expecting error correction to come
> for free. Do you really need one of every 8 bits to be a parity bit, which
> only catches maybe half of the errors...

Fortunately, we have WAY better schemes than simple parity, which was
only really a thing in the modem days. (Though I would say that
there's still a pretty clear distinction between a good message where
everything has correct parity, and line noise where half of them
don't.) Hamming codes can correct one-bit errors (and detect two-bit
errors) at a price of log2(size)+1 bits of space. Here's a great
rundown:

https://www.youtube.com/watch?v=X8jsijhllIA

There are other schemes too, but Hamming codes are beautifully elegant
and easy to understand.
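A minimal sketch of the smallest common case, Hamming(7,4): four data bits protected by three parity bits. The function names are made up for this example, and it leaves out the extra overall parity bit that the two-bit detection mentioned above requires, so it only corrects single-bit errors.

```python
def hamming74_encode(d):
    """Encode four data bits as a 7-bit codeword (positions 1..7)."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(c):
    """Correct up to one flipped bit and return the four data bits."""
    c = list(c)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    # The syndrome is the 1-based position of the bad bit, or 0 if none.
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]]


word = hamming74_encode([1, 0, 1, 1])
word[4] ^= 1                        # flip one bit "in transit"
recovered = hamming74_decode(word)  # [1, 0, 1, 1]
```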

ChrisA

Chris Angelico

Oct 10, 2022, 11:55:48 PM
On Tue, 11 Oct 2022 at 14:26, <avi.e...@gmail.com> wrote:
>
> I stand corrected Chris, and others, as I pay the sin tax.
>
> Yes, there are many kinds of errors that logically fall into different
> categories or phases of evaluation of a program and some can be determined
> by a more static analysis almost on a line by line (or "statement" or
> "expression", ...) basis and others need to sort of simulate some things
> and look back and forth to detect possible incompatibilities and yet others
> can only be detected at run time and likely way more categories depending on
> the language.
>
> But when I run the Python interpreter on code, aren't many such phases done
> interleaved and at once as various segments of code are parsed and examined
> and perhaps compiled into block code and eventually executed?

Hmm, depends what you mean. Broadly speaking, here's how it goes:

0) Early pre-parse steps that don't really matter to most programs,
like checking character set. We'll ignore these.
1) Tokenize the text of the program into a sequence of
potentially-meaningful units.
2) Parse those tokens into some sort of meaningful "sentence".
3) Compile the syntax tree into actual code.
4) Run that code.

Example:
>>> code = """def f():
...     print("Hello, world", 1>=2)
...     print(Ellipsis, ...)
...     return True
... """
>>>

In step 1, all that happens is that a stream of characters (or bytes,
depending on your point of view) gets broken up into units.

>>> import io, tokenize
>>> for t in tokenize.tokenize(io.BytesIO(code.encode()).readline):
...     print(tokenize.tok_name[t.exact_type], t.string)
...

It's pretty spammy, but you can see how the compiler sees the text.
Note that, at this stage, there's no real difference between the NAME
"def" and the NAME "print" - there are no language keywords yet.
Basically, all you're doing is figuring out punctuation and stuff.

Step 2 is what we'd normally consider "parsing". (It may well happen
concurrently and interleaved with tokenizing, and I'm giving a
simplified and conceptualized pipeline here, but this is broadly what
Python does.) This compares the stream of tokens to the grammar of a
Python program and attempts to figure out what it means. At this
point, the linear stream turns into a recursive syntax tree, but it's
still very abstract.

>>> import ast
>>> ast.dump(ast.parse(code))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[],
args=[], kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[Expr(value=Call(func=Name(id='print', ctx=Load()),
args=[Constant(value='Hello, world'), Compare(left=Constant(value=1),
ops=[GtE()], comparators=[Constant(value=2)])], keywords=[])),
Expr(value=Call(func=Name(id='print', ctx=Load()),
args=[Name(id='Ellipsis', ctx=Load()), Constant(value=Ellipsis)],
keywords=[])), Return(value=Constant(value=True))],
decorator_list=[])], type_ignores=[])"

(Side point: I would rather like to be able to
pprint.pprint(ast.parse(code)) but that isn't a thing, at least not
currently.)
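As an aside, from Python 3.9 onward `ast.dump()` accepts an `indent` argument, which gets close to a pretty-printed tree. A short sketch, using a shortened version of the example function:

```python
import ast

code = 'def f():\n    print("Hello, world", 1>=2)\n    return True\n'

# indent=4 puts each node and field on its own indented line (Python 3.9+).
pretty = ast.dump(ast.parse(code), indent=4)
print(pretty)
```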

This is where the vast majority of SyntaxErrors come from. Your code
is a sequence of tokens, but those tokens don't mean anything. It
doesn't make sense to say "print(def f[return)]" even though that'd
tokenize just fine. The trouble with the notion of "keeping going
after finding an error" is that, when you find an error, there are
almost always multiple possible ways that this COULD have been
interpreted differently. It's as likely to give nonsense results as
actually useful ones.

(Note that, in contrast to the tokenization stage, this version
distinguishes between the different types of word. The "def" has
resulted in a FunctionDef node, the "print" is a Name lookup, and both
"..." and "True" have now become Constant nodes - previously, "..."
was a special Ellipsis token, but "True" was just a NAME.)

Step 3: the abstract syntax tree gets compiled into actual runnable
code. This is where that small handful of other SyntaxErrors come
from. With these errors, you absolutely _could_ carry on and report
multiple; but it's not very likely that there'll actually *be* more
than one of them in a file. Here's some perfectly valid AST parsing:

>>> ast.dump(ast.parse("from __future__ import the_past"))
"Module(body=[ImportFrom(module='__future__',
names=[alias(name='the_past')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("from __future__ import braces"))
"Module(body=[ImportFrom(module='__future__',
names=[alias(name='braces')], level=0)], type_ignores=[])"
>>> ast.dump(ast.parse("def f():\n\tdef g():\n\t\tnonlocal x\n"))
"Module(body=[FunctionDef(name='f', args=arguments(posonlyargs=[],
args=[], kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[FunctionDef(name='g', args=arguments(posonlyargs=[], args=[],
kwonlyargs=[], kw_defaults=[], defaults=[]),
body=[Nonlocal(names=['x'])], decorator_list=[])],
decorator_list=[])], type_ignores=[])"

If you were to try to actually compile those to bytecode, they would fail:

>>> compile(ast.parse("from __future__ import braces"), "-", "exec")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "-", line 1
SyntaxError: not a chance

And finally, step 4 is actually running the compiled bytecode. Any
errors that happen at THIS stage are going to be run-time errors, not
syntax errors (a SyntaxError raised at run time would be from
compiling other code).

> So is the OP asking for something other than a Python Interpreter that
> normally halts after some kind of error? Tools like a linter may indeed fit
> that mold.

Yes, but linters are still going to go through the same process laid
out above. So if you have a huge pile of code that misuses "await" in
non-async functions, sure! Maybe a linter could half-compile the code,
then probe it repeatedly until it gets past everything. That's not
exactly a common case, though. More likely, you'll have parsing
errors, and the only way to "move past" a parsing error is to guess at
what token should be added or removed to make it "kinda work".

Alternatively, you'll get some kind of messy heuristics to try to
restart parsing part way down, but that's pretty imperfect too.

> This may limit some of the objections of when an error makes it hard for the
> parser to find some recovery point to continue from as no code is being run
> and no harmful side effects happen by continuing just an analysis.

It's pretty straightforward to ensure that no code is run - just
compile it without running it. It's still possible to attack the
compiler itself, but that is far less of a concern than running arbitrary
code. Attacks on the compiler are usually deliberate; code you don't want
to run yet might be a perfectly reasonable call to os.unlink()...
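
A minimal sketch of "compile it without running it", assuming you only
want to know whether the source is acceptable. The helper name
syntax_problem and the file name in the sample are placeholders; the
code object that compile() returns is simply thrown away and never
passed to exec().

```python
def syntax_problem(source, filename="<sketch>"):
    """Hypothetical helper: return (lineno, message) or None; never runs source."""
    try:
        # dont_inherit=True keeps this module's own __future__ flags
        # from influencing how the checked source is compiled.
        compile(source, filename, "exec", dont_inherit=True)
    except SyntaxError as exc:
        return (exc.lineno, exc.msg)
    return None

dangerous = "import os\nos.unlink('some-file-you-care-about')\n"
print(syntax_problem(dangerous))     # None: it compiles, and unlink() is never called
print(syntax_problem("x = = 1\n"))   # a (lineno, message) pair for line 1
```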

> Time to go read some books about modern ways to evaluate a language based on
> more mathematical rules including more precisely what is syntax versus ...
>
> Suggestions?
>

I'd recommend looking at Python's compile() function, the ast and
tokenize modules, and everything that they point to.
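
As a starting point for poking at the tokenize module, here is a small
sketch that lists the token types in a piece of source and hands back
any error the tokenizer raises. The helper name token_names is made
up, and exactly which problems the tokenizer reports (as opposed to
leaving them for the parser) differs between Python versions.

```python
import io
import tokenize

def token_names(source):
    """Hypothetical helper: return (type names seen, tokenizer error or None)."""
    names = []
    try:
        for tok in tokenize.generate_tokens(io.StringIO(source).readline):
            names.append(tokenize.tok_name[tok.type])
    except (tokenize.TokenError, SyntaxError) as exc:
        return names, exc
    return names, None

print(token_names("x = 1\n")[0])
# ['NAME', 'OP', 'NUMBER', 'NEWLINE', 'ENDMARKER']
print(token_names("x = (1,\n")[1])   # the unclosed bracket shows up at this level
```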

ChrisA

avi.e...@gmail.com
Oct 11, 2022, 2:42:43 AM
I think we are in agreement here, Chris. My point is that error
detection and correction are now done at levels where there is little need
for earlier, less efficient methods such as setting aside parity bits. We use
protocols like TCP and IP and layers above them and above those to maintain
the integrity of packets and sessions and forms of encryption allowing
things like authentication. There is tons of overhead, even when some is
fairly efficient, but we hardly notice it unless things go wrong.

So written language sent (as in this email/post) does not need lots of
redundancy and all the extra effort is, IMNSHO, largely wasted. If I
see a bear, I do not wish to check their genitals or DNA to determine their
irrelevant gender before asking someone to run from it. If I happen to know
the gender, as in a zoo, gender only matters for things like breeding
purposes. I do not want to memorize terms in languages that have not only
words like lion and lioness or duck and drake and goose and gander, but for
EVERYTHING in some sense so I can say the equivalent of ANIMAL-male and
ANIMAL-female with unique words. Life would be so much simpler if I could
say your dog was nice and not be corrected that it was a bitch and I used
the wrong word endings. If I really wanted to say it was a female dog, well
I could just add a qualifier. Most of the time, who cares?

The same applies to so much grammatical nonsense which is also usually
riddled with endless exceptions to the many rules. Make the languages simple
with little redundancy and thus far easier to learn.

I can say similar things about some programming languages that either have
way too many rules or too few of the right ones.

There are tradeoffs and if you want a powerful language it will likely not
be easy to control. If you want a very regulated language, you may find it
not very useful as many things are hard to do and others not possible. I know
that strongly typed languages often have to allow some method of cheating
such as unions of data types, or using a parent class as the sort of
object-type to allow disparate objects to live together. Python is far from
the most complex but as noted, it is not trivial to evaluate even the syntax
past errors.

But I admit it is fun and a challenge to learn both kinds and I spent much
of my time doing so. I like the flexibility of seeing different approaches
and holding contradictions in my mind while accepting both and yet neither!
LOL!


-----Original Message-----
From: Python-list <python-list-bounces+avi.e.gross=gmai...@python.org> On
Behalf Of Chris Angelico
Sent: Monday, October 10, 2022 11:24 PM
To: pytho...@python.org
Subject: Re: What to use for finding as many syntax errors as possible.

--