ENB: About PR #2303, scan_lines and unit tests

Edward K. Ream

unread,

Nov 4, 2021, 10:05:26 AM11/4/21

to leo-editor

In this Engineering Notebook post I'll discuss recent work on #2276 and PR #2303.

Work is almost finished. I'll merge the work into devel in a day or three.

Background

#2276 suggests adding @section-delims. Work has expanded to include:

- A thorough review of at_fast.scan_lines

- A project to cover at_fast.scan_lines with unit tests.

The coverage tests revealed that @raw has been broken for about three years!

Refactoring scan_lines

I "refactored" the sections in scan_lines to make the use of various flags more clear. The start of the main loop of scan_lines is now:

for i, line in enumerate(lines[start:]):
# Strip the line only once.
strip_line = line.strip()
if after ref:
<< handle afterref line>>
continue
if verbatim:
<< handle verbatim line >>
continue
if line == verbatim_line: # <delim>@verbatim.
verbatim = True
continue
<< finalize line >>
if not in_doc and not strip_line.startswith(sentinel):

# Faster than a regex!
body.append(line)
continue
# These three sections might clear in_doc.
<< handle @others >>
<< handle section refs >>
<< handle node_start >>
if in_doc:
<< handle @c or @code >>
else:
<< handle @ or @doc >>
if line.startswith(comment_delim1 + '@-leo'):

# Faster than a regex!
# The @-leo sentinel adds *nothing* to the text.
i += 1
break

Imo, this new "rendering" of the code is way clearer than in devel: ordering dependencies are now explicit. Very little code actually changed because of this refactoring!

Complications

@last clearly does work in the devel branch, but the first versions of the new unit tests for @last failed! This mystery was solved by having unit tests ensure that the gnx in the generated node (in the test) matches the root's gnx. Having the gnx's match makes the unit tests much stronger.

The new << final checks >> section of at_fast.scan_lines is more complex than expected. Indeed, there is now a new, hacky, root_gnx_adjusted flag. This flag is set only << handle node_start >>. Ironically, no unit test that handles gnx's correctly can exercise this code!

Summary

This post explains some of the complexities of scan_lines. Félix will likely have to understand every line of scan_lines in order to transliterate it.

scan_lines is the one and only place in Leo where section references seem essential. Without sections, the scan_lines would be too big to understand. Alas, splitting scan_lines into methods would slow down Leo's read code.

Edward

tbp1...@gmail.com

unread,

Nov 4, 2021, 10:28:32 AM11/4/21

to leo-editor

On Thursday, November 4, 2021 at 10:05:26 AM UTC-4 Edward K. Ream wrote:

... Without sections, the scan_lines would be too big to understand. Alas, splitting scan_lines into methods would slow down Leo's read code.

But is this really the case, or is this a case of premature optimization? Maybe the extra time for method calls would be a minor part of the whole time.

Even if methods would work here, it doesn't mean I would be happy with giving up sections, though. I often use them as a kind of pseudo-code to make the code more clear.

Edward K. Ream

unread,

Nov 4, 2021, 11:17:35 AM11/4/21

to leo-editor

On Thu, Nov 4, 2021 at 9:28 AM tbp1...@gmail.com <tbp1...@gmail.com> wrote:

But is this really the case, or is this a case of premature optimization?

leoPy.leo contains 411 files. Speed matters here, even a few percentage points.

Edward

Edward K. Ream

unread,

Nov 4, 2021, 3:17:43 PM11/4/21

to leo-editor

Let me attempt a better answer.

I don't know what the slowdown would be if the code used function calls instead of sections. But the performance question is moot because imo the present code is the clearest possible code.

Indeed, the << init scan_lines >> section defines a lot of state variables. It would be clumsy (and a maybe slow) to be passing these state variables to helper functions. Sure, we could use stat ivars instead...

But the main point is that the present code highlights the structure of the code pretty well. The meta pattern of the main loop is:

if << a condition >>:

<< handle the line >>

continue

or sometimes:

if << a condition >>:

<< handle the line >>

continue

else:

<< do something else >>

continue

Some sections contain both the condition and the continue statement, so the pattern isn't quite as clear as it could be. I've chosen to "render" the sections as I have to highlight conditions that are not simply pattern matches. The explicitly shown tests are those that confused me the most while I was reacquainting myself with the code.

Anyway, each handler deals with only a small subset of all the state variables, so the code, while hardly trivial, is relatively easy to understand.

Summary

I could say more, but I think I've said enough :-)

Imo, the present code is likely the fastest possible code and the clearest possible code. Hard to do better than that.

Edward

Reply all

Reply to author

Forward