ENB: About the JS importer and #1481

22 views

Skip to first unread message

Edward K. Ream

unread,

Feb 1, 2020, 6:00:51 AM2/1/20

to leo-editor

Recent revs in the js-importer branch provide a basic fix for #1481. This long Engineering Notebook post will discuss this fix, it's limitations, further problems and possible fixes for those problems.

As always, feel free to ignore. However, this post describes important aspects of the code and its design. It may more than usual interest to Leo's devs.

The immediate problem

Perfect import failed for reveal.js because i.gen_lines failed to allocate lines properly to nodes. This caused lines to appear out of order.

i.gen_lines is part of the pipeline, defined in the base Importer class. At present, the JS_Importer class (and most other importers) are subclasses of the Importer class. Iirc, all importers use i.gen_lines unchanged. The JS importer overrides i.starts_block, one of i.gen_lines's helpers.

i.gen_lines uses a parse state to determine the start and end of blocks. For Javascript, these blocks correspond (roughly) to classes and functions. Alas, there are many ways to define a class or function in JS. JS is, by far, the most difficult language to parse of all the languages handled by Leo's importers.

Perfect import failed for reveal.js because the importer mistook a line like if('function'){ as the start of a function. The present fix will usually work (Leo can now import reveal.js), but not always, as I'll now explain...

Tokenizing Javascript

You ask, how hard can it be to recognize strings like 'function' in JS? The answer is "very very hard". Tokenizing JS depends on context. In particular, it is difficult to determine whether a '/' character is the "div" arithmetic operator or the start of a regular expression! JS is, by far, the most difficult language to scan (tokenize) of all the languages handled by Leo's importers.

i.scan_line updates the parse state after carefully tokenizing each line. The JS importer method js_i.scan_line overrides i.scan_line to handle '/' properly. As you can see, it's not pretty. The bug happened because js_i.starts_block contained regex's that didn't distinguish the "function" keyword from a string containing "function". The partial fix is mostly a hack. js_i.starts_block now keeps track of whether function looks like it is in a string. But this faux tokenizing could fail if quotes (or "function") appear in a regex, as they very well might.

A proper fix would involve fully tokenizing each line, in both js_i.scan_line and in js_i.starts_block. Therefore, we want a stand-alone javascript tokenizer, written in python. Present revs include a copy of JsLex.py, but it isn't hooked up yet. The JsLex code contains a note that it doesn't handle non-ascii characters properly, so it may need substantial revision. Happily, JsLex does contain a suite of unit tests.

Other problems

The lines of reveal.js that caused the perfect import to fail are not handled very well even after perfect import succeeds. Here are the complications...

The JS importer must be completely immune from indentation, and it is. As a direct consequence, perfect import tests ignore leading whitespace. However, once lines are allocated to nodes, the importer tries to adjust nodes to make them as pleasing as possible. In particular, the importer removes common leading whitespace from all nodes. In addition, under special circumstances, the importer may try to move one or more lines from the end of one node to the start of the previous or following sibling node. The post pass part of the pipeline handles all these adjustments. It would be difficult/impossible to handle them in gen_lines.

At present, the JS importer only generates @others directives, never section references. This could be changed, but imo using @others is much better. However, @others does impose additional constraints on what the post pass can do. It's time for an extended example. Here is the gist of the code that caused perfect import to fail:

function startEmbeddedContent( element ) {
    toArray( element.querySelectorAll( 'video, audio' ) ).forEach( function( el ) {
        if('function') { // The culprit
            promise.catch( function() {
                el.addEventListener( 'play', function() {
                    el.controls = false;
                } );
            } );
        }
    }
}

To have any chance of understanding what is going on, the post pass must be disabled. Here are the results, without the post pass. Headlines are preceded by ===:

=== function startEmbeddedContent

function startEmbeddedContent( element ) {
    @others
}

=== toArray( element.querySelectorAll('video, audio')).forEach function

    toArray( element.querySelectorAll('video, audio')).forEach( function(el) {
        if('function') {
            @others
    }

=== promise.catch function

            promise.catch(function() {
                @others
            } );
        } // <=====

=== el.addEventListener('play', function

                el.addEventListener('play', function() {
                    el.controls = false;
                } );

Now you can see why a post pass is desirable.

It's "obvious" that the last line of the node "promise.catch function" belongs in the previous node. However, such a move can not be done in general!

Only the post pass has any chance of having enough data to make this adjustment. This adjustment can only be done because the node "promise.catch function" is the last node under the range of the @others in the node "toArray...". The adjustment would be invalid if the node "promise.catch function" had any following siblings.

Summary

The present code now imports reveal.js without error. However, the latest code is a hack. A proper fix entails carefully tokenizing lines in two places. I plan to use JsLex to do this. JsLex will need work to handle non-ascii characters properly. I'll do that as part of the fix for #1481.

The post pass attempts to reallocate lines to make the result more palatable. The JS importer uses @others instead of section references. This imposes constraints on possible adjustments. I'll attempt to improve the post pass as another part of #1481. This may involve a rewrite/rethink. This post has been part of the rethinking process.

The present problems with Leo's JS importer arise from well-known infelicities in JS itself. The fixes to #1481 actually show the strengths of the Leo's importer architecture. Fixes are straightforward and will be confined only to the JS_Importer class.

Edward

Edward K. Ream

unread,

Feb 1, 2020, 6:11:40 AM2/1/20

to leo-editor

On Saturday, February 1, 2020 at 5:00:51 AM UTC-6, Edward K. Ream wrote:

> This long Engineering Notebook post will discuss this fix, it's limitations, further problems and possible fixes for those problems.

After importing an @auto tree, it would be possible to convert @auto to @clean. This would allow you to adjust nodes to your complete satisfaction.

> It's "obvious" that the last line of the node "promise.catch function" belongs in the previous node. However, such a move can not be done in general!

Worse, the last line belongs after @others, not at the very end of the previous node. This might be a big ask. True, only one @others may appear in the previous node, but a completely sound approach would have to retokenize the previous node to determine where the @others is! But that's probably taking caution too far :-) In any case, a rethink is necessary.