Jupytext - Script To Split Into Child Nodes


Thomas Passin

Oct 27, 2024, 5:21:35 PM
to leo-editor
Here is a script that will take an @jupytext tree and split it out into child nodes, one node per cell.  To use, make a script button for it, select a @jupytext node, and run the script.  The script tries to create reasonable headlines for the nodes but of course it can't be perfect.

Please try it out so we can get a sense if it's going to be a useful solution. WARNING: there is NO undo capability yet so ONLY TRY IT ON A COPY of your file.

"""Move cells into child nodes of the root of an @jupytext node."""

CELL_MARKER = '# %%'
MARKER_LEN = len(CELL_MARKER)

def get_ipynb_header(notebook):
    start = notebook.find('# ---')
    end = notebook.find('# ---', 1) + 4
    header = notebook[start:end]
    return header

def optional_filter(p):
    """Return True if p is an @jupytext node."""
    return p.h.startswith('@jupytext')

def find_current_atfile(p):
    """Return the nearest @jupytext node at or above p, or None."""
    for parent in p.self_and_parents():
        if optional_filter(parent):
            return parent
    return None

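def make_headline(cell: str) -> str:
    """Return the first few words of the cell's first usable line."""
    lines = cell.split('\n')
    for line in lines[1:]:
        line = line.replace('#', '').strip()
        if not line or line.startswith('%%'):
            continue
        words = line.split()
        return ' '.join(words[:6])
    # No usable line was found: fall back to the cell marker line.
    return lines[0].strip()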

root = find_current_atfile(c.p)
if root is None:
    g.es(' No .ipynb tree found')
else:
    contents = root.b
    header = get_ipynb_header(contents)
    i0 = 0
    n = -1  # Number of children
    while True:
        i0 = contents.find(CELL_MARKER, i0)
        if i0 == -1:
            break
        i1 = contents.find(CELL_MARKER, i0 + 1)
        cell = contents[i0:i1] if i1 > -1 else contents[i0:]
        n += 1
        p0 = root.insertAsNthChild(n)
        p0.b = cell
        p0.h = make_headline(cell)
        if i1 == -1:
            break
        i0 = i1  # Continue from the next cell marker.
    root.b = header + '\n@others'
    c.redraw()
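
For anyone who wants to try the cell-splitting idea outside Leo, here is a minimal standalone sketch. It is not the script above: `split_cells` and `sample` are hypothetical names, and it assumes that a cell begins at any line that starts with `# %%`, which is stricter than the substring search the script uses.

```python
CELL_MARKER = '# %%'

def split_cells(text: str) -> list[str]:
    """Split py:percent text into cells, each starting with a marker line."""
    cells: list[str] = []
    current: list[str] = []
    in_cell = False
    for line in text.splitlines(keepends=True):
        if line.startswith(CELL_MARKER):
            if in_cell:
                cells.append(''.join(current))
            current = [line]
            in_cell = True
        elif in_cell:
            current.append(line)
        # Lines before the first marker (the header) are skipped.
    if in_cell:
        cells.append(''.join(current))
    return cells

# Assumed sample input: a tiny header, one markdown cell, one code cell.
sample = (
    "# ---\n# jupyter:\n# ---\n\n"
    "# %% [markdown]\n# # Title\n\n"
    "# %%\nprint('hello')\n"
)
cells = split_cells(sample)
print(len(cells))  # 2
```

Joining the returned cells reproduces everything from the first marker onward, so no text is lost in the split.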

HaveF HaveF

Oct 28, 2024, 3:43:23 AM
to leo-editor
Thanks Thomas, It works great!

Although `jupyter lab` displays its ToC according to the markdown heading levels in the document, I don't think it is a big problem to extract one node per cell as you did. Nodes are easy to rearrange in Leo anyway.

Thanks for your script!

[Attached image: SCR-20241028-nwcw.png]

Edward K. Ream

Oct 28, 2024, 5:00:23 AM
to leo-e...@googlegroups.com
On Sun, Oct 27, 2024 at 4:21 PM Thomas Passin <tbp1...@gmail.com> wrote:
Here is a script that will take an @jupytext tree and split it out into child nodes, one node per cell.  To use, make a script button for it, select a @jupytext node, and run the script.  The script tries to create reasonable headlines for the nodes but of course it can't be perfect.

This script, with undo, deserves to be a Leo command. If you like, please submit a PR. I'll be happy to provide advice.

Edward

Thomas Passin

Oct 28, 2024, 6:58:03 AM
to leo-editor
On Monday, October 28, 2024 at 3:43:23 AM UTC-4 iamap...@gmail.com wrote:
Thanks Thomas, It works great!

Although `jupyter lab` displays its ToC according to the markdown heading levels in the document, I don't think it is a big problem to extract one node per cell as you did. Nodes are easy to rearrange in Leo anyway.

I wasn't sure if levels of indentation were meaningful in Jupyter notebooks so I didn't handle them at this early stage.  It won't be much harder to include them.

Another feature that's sorely needed and easy to add is to put an "@nocolor" directive in markdown cells.  Since all non-code lines in a jupytext file are comments, they get colored by Leo's colorizer and for many if not most themes they are hard to read - not what one wants for pleasant editing.

So stay tuned ...
 

Thomas Passin

Oct 28, 2024, 7:04:16 AM
to leo-editor
I would for sure like help with the undo.  Also, I think this command should be made part of the import process. If there ever turns out to be a need for someone to look at the original flat jupytext node there could be another command for that.

Edward K. Ream

Oct 28, 2024, 7:39:30 AM
to leo-e...@googlegroups.com
On Mon, Oct 28, 2024 at 6:04 AM Thomas Passin  wrote:


I would for sure like help with the undo.  Also, I think this command should be made part of the import process. If there ever turns out to be a need for someone to look at the original flat jupytext node there could be another command for that.

Alright. I'll add the command in a separate PR.

Edward

Edward K. Ream

Oct 28, 2024, 9:25:10 AM
to leo-e...@googlegroups.com
On Sun, Oct 27, 2024 at 4:21 PM Thomas Passin <tbp1...@gmail.com> wrote:
Here is a script that will take an @jupytext tree and split it out into child nodes, one node per cell. 

See #4137. Expect a PR today or tomorrow.

At that time Leo 6.8.2 will be code complete, barring serious problems with recent PRs.

Edward

Thomas Passin

Oct 28, 2024, 12:31:03 PM
to leo-editor
I need some guidance on edge cases before I can get the script to indent beyond one level.

1. If the first cell has no markdown header level, it should only be indented one level, right?  There's nowhere else for it to go.
2. If the first cell has a markdown header level > 1, the indentation should just be one, right?  Again, a cell can't be indented more than one level from the previous one.
3. If a later cell has no markdown header level, at what level should it be indented?  Choices are, as best I can see:
    a. The same level as the previous cell;
    b. One indent level from the root.
4. Code cells should be indented at the same level as the previous cell, except that the minimum indent level for any cell is 1. Yes?
5. What should the script do if the markdown header level is not consistent with the indent level?  Example:

     # %% [markdown]
     # ### A level-three cell

     # %% [markdown]
     # ##### Seems to be a level-5 cell but the next cell after the level-3 can only be indented to level-4 at most.
   
6. Any other edge cases that you know about.


Edward K. Ream

Oct 28, 2024, 3:03:46 PM
to leo-e...@googlegroups.com
On Mon, Oct 28, 2024 at 11:31 AM Thomas Passin <tbp1...@gmail.com> wrote:
I need some guidance on edge cases before I can get the script to indent beyond one level.

These are all good questions. I was planning to work on this script, but I'll let you do it if you like.

I don't remember what Leo's markdown importer does. My guess is that it creates dummy nodes for missing levels.

Edward

Thomas Passin

Oct 28, 2024, 5:21:59 PM
to leo-editor
I don't mind if you do the work on the script. You will get the node levels right faster than I.  The main thing is the edge cases, because sure as anything somebody is going to produce a file that you would swear wouldn't happen.

Edward K. Ream

Oct 28, 2024, 6:30:45 PM
to leo-e...@googlegroups.com
On Mon, Oct 28, 2024 at 4:22 PM Thomas Passin <tbp1...@gmail.com> wrote:
I don't mind if you do the work on the script. You will get the node levels right faster than I.  The main thing is the edge cases, because sure as anything somebody is going to produce a file that you would swear wouldn't happen.

Alright. I'm on it.  Here's my plan:

- Start with the markdown importer.
  The importer probably handles the edge cases.
- Put only @others in the root (level 0) node.
- Put all other nodes at level 1 or higher.
- Place markdown nodes using the importer's algorithm.
- Add python nodes at the current markdown level.
- Add undo.

I expect to complete this work soon. I'll reassign #4137 to Leo 6.8.3 if problems arise. The issue is not a release blocker.

Edward

Edward K. Ream

Oct 28, 2024, 6:39:50 PM
to leo-e...@googlegroups.com
Alright. I'm on it.  Here's my plan:

- Start with the markdown importer.

Oops. That won't work directly. I forgot that .ipynb files are json, not plain text.

Thomas's script parses the json directly, but it is probably best to use jupytext to convert the .json file to pseudo-python. A short prototype script is next.

Edward

Thomas Passin

Oct 28, 2024, 7:27:34 PM
to leo-editor
Wait! The imported file in the @jupytext node is already in jupytext format, not JSON. My script walks through the jupytext string, not any JSON. You got fooled by the .ipynb extension in the filename of the @jupytext node (I've been concerned about that possibility since this at-file type was created). I don't think you want to process the original JSON because it can contain features that jupytext doesn't capture, like long data: URLs for any images that have been generated and saved in the .ipynb JSON files. jupytext files don't contain those data: URLs, as I understand it. Anyway, the nice jupytext people have already done that work for us.

What needs to be done for producing the different indentation levels is to:

1. For a markdown cell, take the first non-blank line after the cell's start marker, remove the leading "#" character (that's just a sentinel added by the jupytext conversion), and count the remaining leading "#" characters. That count is the indentation level. Code cells keep the indentation of the previous cell.

2. Translate those indentation numbers into the right indentation patterns for the new nodes.

#1 is pretty straightforward  except possibly for edge cases.
#2 will be easier and quicker if you do it instead of me.  But I'm fine with doing it myself.
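
The two steps above might be sketched roughly as follows. This is only one possible policy, not a settled design: `heading_level` and `outline_levels` are hypothetical names, a markdown cell is assumed to carry `[markdown]` on its marker line, and a cell without a heading is assumed to stay at the current level (never less than 1). A stack of open heading levels keeps any cell from landing more than one level below its parent.

```python
import re

def heading_level(cell: str) -> int:
    """Return the heading level of a markdown cell's first non-blank line, else 0."""
    lines = cell.split('\n')
    if '[markdown]' not in lines[0]:
        return 0  # Code cells carry no heading level.
    for line in lines[1:]:
        # Drop the leading '#' that jupytext adds to every markdown line.
        text = line[1:].strip() if line.startswith('#') else line.strip()
        if not text:
            continue
        m = re.match(r'(#{1,6})\s', text)
        return len(m.group(1)) if m else 0
    return 0

def outline_levels(cells: list[str]) -> list[int]:
    """Map each cell to an outline level of at least 1."""
    levels: list[int] = []
    stack: list[int] = []  # Heading levels of the currently open sections.
    for cell in cells:
        h = heading_level(cell)
        if h:
            # Close sections at the same or a deeper heading level.
            while stack and stack[-1] >= h:
                stack.pop()
            stack.append(h)
        levels.append(max(1, len(stack)))
    return levels

# A level-3 heading followed by a level-5 heading.
cells = [
    "# %% [markdown]\n# ### A level-three cell\n",
    "# %% [markdown]\n# ##### A level-five cell\n",
]
print(outline_levels(cells))  # [1, 2]
```

With this policy the markdown heading numbers only decide nesting order; the actual outline depth comes from how many sections are open, so skipped heading levels never produce a gap in the outline.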

Edward K. Ream

Oct 28, 2024, 7:51:02 PM
to leo-editor
On Monday, October 28, 2024 at 6:27:34 PM UTC-5 Thomas wrote:

Wait! The imported file in the @jupytext node is already in jupytext format, not JSON.
...
 Anyway, the nice jupytext people have already done that work for us.

Thanks! Very helpful.

What needs to be done for producing the different indentation levels is to:

I'll read about your approach after I play with your prototype script.

I need to wrap my head around the assumptions of your script.

Edward

P.S. Here is a prototype script I wrote before realizing I had best understand your script first!

g.cls()
import io
import jupytext
from leo.plugins.importers.markdown import do_import

path = r'c:\test\ekr-small-test.ipynb'

if 0: # Read the .ipynb file into raw_contents and dump it.
    with open(path, 'r') as f:
        raw_contents = f.read()
    g.printObj(raw_contents)

if 0:  # Dump the notebook in semi-readable format.
    fmt = 'py:percent'
    notebook = jupytext.read(path, fmt=fmt)
    jm = g.app.jupytextManager
    jm.dump_notebook(notebook)

if 0: # Import the contents as markdown.
    # Read the pseudo python text into contents.
    fmt = 'py:percent'
    notebook = jupytext.read(path, fmt=fmt)
    with io.StringIO() as f:
        jupytext.write(notebook, f, fmt=fmt)
        contents = f.getvalue()
    # g.printObj(contents)

    h = 'markdown-test'
    p = g.findNodeAnywhere(c, h)
    assert p, h
    p.b = ''
    p.deleteAllChildren()
    do_import(c, p, contents)
    c.redraw(p)

EKR

Edward K. Ream

Oct 28, 2024, 8:27:28 PM
to leo-editor
On Monday, October 28, 2024 at 6:27:34 PM UTC-5 Thomas wrote:

Wait! The imported file in the @jupytext node is already in jupytext format, not JSON.

 This hint made everything clear.

- The @jupytext node must contain all the imported text.
  I got confused because I had edited my target node by hand.
- The script will replace the @jupytext node.
  As a workaround I created a copy so I could rerun the script.

My .ipynb test file revealed a bug in the get_ipynb_header function. I changed:

    end = notebook.find('# ---', 1) + 4
to:
    end = notebook.find('# ---', start + 4) + 4

With this change the function returns the expected header. And now everything works!
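
A standalone variant of the same idea can be tried without Leo. This is only a sketch under assumptions: `get_header` and `HEADER_DELIM` are hypothetical names, the header is assumed to be bracketed by two `# ---` lines, and the function returns an empty string when either delimiter is missing. It uses the delimiter's length for both offsets, so the closing `# ---` is returned in full.

```python
HEADER_DELIM = '# ---'

def get_header(text: str) -> str:
    """Return the text from the opening '# ---' through the closing one, or ''."""
    start = text.find(HEADER_DELIM)
    if start == -1:
        return ''
    # Search for the closing delimiter only after the opening one.
    close = text.find(HEADER_DELIM, start + len(HEADER_DELIM))
    if close == -1:
        return ''
    return text[start:close + len(HEADER_DELIM)]

# Assumed sample: the header does not start at offset 0.
sample = (
    "@language python\n"
    "# ---\n# jupyter:\n#   kernelspec:\n#     name: python3\n# ---\n\n"
    "# %%\nx = 1\n"
)
header = get_header(sample)
print(header.splitlines()[-1])  # '# ---'
```

Starting the second search relative to `start` is what makes the function work when the header is not at the very beginning of the text.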

Thomas, your script does a creditable job already. I'll focus on your approach.

Leo's markdown importer does less well, although conceivably the importer could be re-imagined.

Edward

P.S. Here is the prototype that uses Leo's markdown importer.

import io
import jupytext
from leo.plugins.importers.markdown import do_import

# Read the .ipynb file into contents (pseudo-python)
path = r'c:\test\ekr-small-test.ipynb'
fmt = 'py:percent'
notebook = jupytext.read(path, fmt=fmt)

with io.StringIO() as f:
    jupytext.write(notebook, f, fmt=fmt)
    contents = f.getvalue()
# Use Leo's markdown importer to create an outline.
p = g.findNodeAnywhere(c, 'markdown-test')

p.b = ''
p.deleteAllChildren()
do_import(c, p, contents)
c.redraw(p)

EKR

HaveF HaveF

Oct 28, 2024, 8:33:37 PM
to leo-e...@googlegroups.com
On Tue, Oct 29, 2024 at 12:31 AM Thomas Passin <tbp1...@gmail.com> wrote:
I need some guidance on edge cases before I can get the script to indent beyond one level.


Ah, there are so many edge cases. My suggestion is to only extract nodes at the markdown heading level without smaller subnodes. That is,

``` jupytext
# %% [markdown]
# # header 1

# %%
print("python1")

# %% [markdown]
# ## header 1.1

# %%
print("python1.1")

# %% [markdown]
# ### header 1.1.1

# %% [markdown]
# ##### header 1.1.1.1.5

# %%
print("python1.1.1.1.5")
```

Only generate nodes corresponding to the markdown Headings level, that is, only generate
```
# header 1
## header 1.1
### header 1.1.1
##### header 1.1.1.1.5
```

That is, just these four nodes, which do not need to be indented to become child nodes. This should make the code simpler, with no edge cases to consider. (If every original cell were turned into a node, there might be too many nodes; if there is no markdown heading, then do nothing.)

The reason is that in jupyter lab we can already see the indentation of the markdown heading levels directly, so we don't need to copy that display form into Leo. More likely, we will reorganize the content by hand in Leo.

In addition to the above command, I think it would be enough to add a command that flattens the outline (here, the four nodes) back into the original @jupytext node without any child nodes. This is very simple: iterate over all the child nodes of the current node, append the contents of each to the end of the current node, and then delete the child nodes.


HaveF HaveF

Oct 28, 2024, 8:37:11 PM
to leo-e...@googlegroups.com
Oh, I forgot about this edge case

``` jupytext
# %% [markdown]
# ### header 1.1.1
# content balabalabala 1.1.1
# #### header 1.1.1.1
# content balabalabala 1.1.1.1

# %% [markdown]
# #### header 1.1.1.2
```

In this case, only two nodes should be generated:

```
### header 1.1.1
#### header 1.1.1.2
```
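
The headings-only rule from the two examples above could be sketched like this. It is an illustration under assumptions, not an agreed design: `first_headings` is a hypothetical name, markdown cells are assumed to carry `[markdown]` on their marker line, and only the first non-blank line of each markdown cell is examined.

```python
import re

def first_headings(text: str) -> list[str]:
    """Return the leading heading of each markdown cell that starts with one."""
    headings: list[str] = []
    want_heading = False  # True just after a markdown cell marker.
    for line in text.splitlines():
        if line.startswith('# %%'):
            want_heading = '[markdown]' in line
            continue
        if not want_heading:
            continue
        # Drop the leading '#' that jupytext adds to markdown lines.
        body = line[1:].strip() if line.startswith('#') else line.strip()
        if not body:
            continue  # Skip blank lines before the first content line.
        if re.match(r'#{1,6}\s', body):
            headings.append(body)
        want_heading = False  # Only the first content line of a cell counts.
    return headings

# The edge case above: a second heading inside the first cell is ignored.
sample = (
    "# %% [markdown]\n"
    "# ### header 1.1.1\n"
    "# content balabalabala 1.1.1\n"
    "# #### header 1.1.1.1\n"
    "# content balabalabala 1.1.1.1\n"
    "\n"
    "# %% [markdown]\n"
    "# #### header 1.1.1.2\n"
)
print(first_headings(sample))  # ['### header 1.1.1', '#### header 1.1.1.2']
```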

Thomas Passin

Oct 28, 2024, 11:59:04 PM
to leo-editor
I think we are roughly in agreement.  My plan is to look only at the indentation of the first line after the cell marker.  Any other lines in the cell with heading levels won't - by definition - change the indentation of the cell because a cell can only have one indentation level. And cells can only start where there is a cell marker.

In my experience, people will use heading levels to get a font size they want without thinking about whether semantically they are really indicating a section with a different indentation. The same problem happens with HTML where people use <h1>, <h2>, etc to get the size font they want, where they should style the heading font size with CSS instead.  I don't know if this ever happens with jupyter files but we have to assume that it will.

Thomas Passin

Oct 29, 2024, 12:11:52 AM
to leo-editor
On Monday, October 28, 2024 at 8:27:28 PM UTC-4 Edward K. Ream wrote:
On Monday, October 28, 2024 at 6:27:34 PM UTC-5 Thomas wrote:

Wait! The imported file in the @jupytext node is already in jupytext format, not JSON.

 This hint made everything clear.

- The @jupytext node must contain all the imported text.
  I got confused because I had edited my target node by hand.

I did the same thing myself as I developed the script.
 
- The script will replace the @jupytext node.

I don't think that's really needed.  After all the new nodes have been created, my script simply deletes all the text from the root node except for the jupytext header (don't forget to insert @others directives where needed). And remember to add @nocolor directives to all markdown cells so they don't get syntax colored as comments.  I didn't get around to that in my script yet.
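
The @nocolor step might look roughly like the sketch below. `add_nocolor` is a hypothetical helper, not part of the posted script. It assumes markdown cells carry `[markdown]` on their marker line, and it assumes that placing the directive at the top of the node body is acceptable. How Leo treats the directive when writing the @jupytext file is not settled in this thread.

```python
def add_nocolor(cell: str) -> str:
    """Prefix a markdown cell with @nocolor; return code cells unchanged."""
    first_line = cell.split('\n', 1)[0]
    if first_line.startswith('# %%') and '[markdown]' in first_line:
        return '@nocolor\n' + cell
    return cell

markdown_cell = "# %% [markdown]\n# # A title\n"
code_cell = "# %%\nprint('hello')\n"
print(add_nocolor(markdown_cell).splitlines()[0])  # @nocolor
print(add_nocolor(code_cell) == code_cell)  # True
```

In the splitting script this would be applied to each cell's text just before it is assigned to the new node's body.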
 
  As a workaround I created a copy so I could rerun the script.
 
Yes, I did the same by hand.  It's a nuisance. For each run I copied the @jupytext node with a file name changed to something like "original-name-test.ipynb". If I continue to work on the script I'll probably create a script in a button that creates a new node with that name, selects it, and then runs the indenting script.

Edward K. Ream

Oct 29, 2024, 5:16:13 AM
to leo-e...@googlegroups.com
On Mon, Oct 28, 2024 at 7:33 PM HaveF wrote:
On Tue, Oct 29, 2024 at 12:31 AM Thomas Passin wrote:
 
Ah, there are so many edge cases. My suggestion is to only extract nodes at the markdown heading level without smaller subnodes.

Thanks for these ideas. I'll discuss the possibilities in a new ENB.

Spoiler: Thomas's script should enhance @jupytext rather than become a separate command. Splitting the @jupytext node into an outline can be done with complete safety.

Edward

Thomas Passin

Oct 29, 2024, 8:58:59 AM
to leo-editor
On Tuesday, October 29, 2024 at 5:16:13 AM UTC-4 Edward K. Ream wrote:

Spoiler: Thomas's script should enhance @jupytext rather than become a separate command. Splitting the @jupytext node into an outline can be done with complete safety.

I agree. And doing so would put the @jupytext imports on a par with @file and @clean, which also impose structure as best they can.