Proposal: a data model and UI for formatting a proofed book


Arun Prasad

Sep 19, 2022, 1:08:56 AM
to ambuda-discuss
What is this?
A rough sketch of a data model that lets us quickly format a proofed book.

Why do this?
Our current proofing tools create a heuristic TEI file, but different kinds of manual intervention are still necessary for cases like:
- page boundaries
- mixed content (e.g. Sanskrit and commentary, Sanskrit and translation)
- extraneous content that we want to drop from the document

We also want to make this work as fast as possible.

Proposal for a data model
(For clarity, I'll use simple declarative language here, but this design is quite uncertain.)

For most cases, we can do formatting on the block level. Let's start there.

Each page will have a new column, format, which is simply a JSON blob:

{
  "version": "1",
  "items": [...]
}

Each item in items defines how to transform the raw page text into something structured. For example, page 31 of some text with three verses might be annotated like so:

"items": [
  {"scope": "meta", "type": "page-number", "value": "31" }
  {"scope": "block", "type": "verse"},
  {"scope": "block", "type": "verse"},
  {"scope": "block", "type": "verse"},
],

scope defines what the given item applies to. If a page has N distinct text blocks, items will always contain exactly N dicts with scope=block, listed in 1:1 correspondence with those blocks (items with other scopes don't count toward N).

Sample block types: verse, paragraph, skip, heading, trailer, merge-up (merge into the previous element, e.g. for paragraphs that continue across a page boundary)
Sample meta types: page-number, section-start
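
To make the 1:1 correspondence concrete, here is a rough sketch of how a renderer might pair text blocks with their block-scoped items, assuming blocks in the raw page text are separated by blank lines (the function name is made up for illustration):

def apply_items(raw_text: str, items: list[dict]) -> list[dict]:
    # Pair each text block with its scope=block annotation.
    # Assumes blocks in the raw page text are separated by blank lines.
    blocks = [b for b in raw_text.split("\n\n") if b.strip()]
    block_items = [i for i in items if i["scope"] == "block"]
    assert len(blocks) == len(block_items), "items must match the block count"
    return [
        {"type": item["type"], "content": text}
        for item, text in zip(block_items, blocks)
    ]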

We can pre-populate items with our current heuristic approach.
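
For illustration only, pre-population might look something like the sketch below; the heuristic shown (blocks with uniformly short lines are verses, everything else is a paragraph) is a stand-in for whatever our current approach actually does:

def prepopulate_items(raw_text: str) -> dict:
    # Guess an initial format blob from the raw page text; proofers can
    # then correct the guesses in the UI.
    items = []
    for block in raw_text.split("\n\n"):
        lines = [line for line in block.splitlines() if line.strip()]
        if not lines:
            continue
        guess = "verse" if all(len(line) < 60 for line in lines) else "paragraph"
        items.append({"scope": "block", "type": guess})
    return {"version": "1", "items": items}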

UI sketch
I imagine a simple two-column form, where the left side shows the raw page text and the right side is a dynamic form to which we can add rows, each of which has dropdowns for scope/type and other fields as needed. On page load, we create the right side by parsing the existing JSON, and on save, we write the form's data to the database through an API.
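
On the save side, the API handler could be as small as the sketch below (assuming a Flask-style endpoint; the route and the save_page_format helper are hypothetical):

from flask import Blueprint, jsonify, request

bp = Blueprint("proofing_format", __name__)

@bp.route("/api/pages/<int:page_id>/format", methods=["POST"])
def save_format(page_id):
    # Validate and persist the format blob submitted by the form.
    blob = request.get_json(silent=True) or {}
    if blob.get("version") != "1":
        return jsonify({"error": "unsupported format version"}), 400
    save_page_format(page_id, blob)  # hypothetical persistence helper
    return jsonify({"status": "ok"})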

Longer-term
Long-term, this combination of text and JSON structure lends itself well to an editor like ProseMirror.

Shreevatsa R

Sep 19, 2022, 8:24:55 PM
to Arun Prasad, ambuda-discuss
If I understand correctly, the question is about encoding all the other information about a proofread page, besides the actual text itself.

I think something like the above will work, but I'm not sure "format" needs to be a separate column. That is, instead of keeping the page's raw text separately, and a "format" to indicate how it should be interpreted (whether each block is a verse or paragraph or whatever), I think we can/should keep the page's data itself as a structured document:

[
{"page-number": 31},
{"type": "verse", "content": "vāgarthāv iva saṃpṛktau…"},
...
]

and so on.

We'll have to "upgrade" each page's raw plain-text contents to this structured format, but we can evolve the document format and the corresponding UI over time based on whatever needs come up, etc.
(And just as you said: on page load, populate the editor using the existing text/JSON document, and on save write it back to the database -- this "long-term" doesn't have to be very far; it's just 2 or 3 PRs away IMO.)
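
A minimal sketch of that upgrade step, assuming blocks are separated by blank lines and defaulting everything to a paragraph until someone refines the structure:

def upgrade_plain_text(raw_text: str, page_number: int) -> list[dict]:
    # Wrap raw page text in the structured-document shape shown above.
    doc = [{"page-number": page_number}]
    for block in raw_text.split("\n\n"):
        if block.strip():
            doc.append({"type": "paragraph", "content": block.strip()})
    return doc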



Kishore Chitrapu

Sep 20, 2022, 6:27:34 AM
to ambuda-discuss
Separation of concerns between data and presentation/format has advantages. It lends itself to flexibility in developing presentation layers. 

For example, we may want one topic (Gita) with multiple commentaries. Also, we may stitch audio and video renditions together with each verse. The above JSON can easily be extended to:

"items": [
  {"scope": "meta", "type": "page-number", "value": "31" }
  {"scope": "block", "type": "verse", commentaries:{"karta1": "text", "karta2":"text"}, audio:{"gayaka1":"ganam.mp3"}, video:{"pravachanam1": "sloka1.mp4"}},
  {"scope": "block", "type": "verse"},
],

With this flexibility comes a cost: the layout has to be computed dynamically for each format request, essentially re-drawing stanza boundaries on every page view. Sometimes the Python text-parsing code (e.g. readlines) may not detect a stanza boundary and will mess up the format. One idea is to go the extra mile and save the entire grid:
{"scope": "block", "type": "verse", "coordinates": {"begin": 81, "end": 160}},
{"scope": "block", "type": "verse", "coordinates": {"begin": 161, "end": 240}}

As Shreevatsa and you pointed out, the other option is to take the hit upfront and save formatted text. The text is then readily available for presentation and is cheaper in terms of resource consumption. However, any slight change to the schema in the future leads to multiple large file updates.

I feel the former is good to start with. We can always put in extra effort later and move to the latter.


 

Ashwin Ramaswami

Sep 22, 2022, 11:29:54 AM
to ambuda-discuss
Once we agree on the data model, we should use JSON Schema (https://json-schema.org/) to encode it. I actually maintain a React library that generates UIs from JSON Schemas (https://react-jsonschema-form.readthedocs.io/en/latest/), but we can probably implement something else lightweight to create the UI here.
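
As a sketch of what that might look like for the block-level proposal above (field names follow the earlier examples; this is illustrative, not a settled schema), the Python jsonschema package could then validate submitted blobs:

import jsonschema

FORMAT_SCHEMA = {
    "$schema": "https://json-schema.org/draft/2020-12/schema",
    "type": "object",
    "required": ["version", "items"],
    "properties": {
        "version": {"const": "1"},
        "items": {
            "type": "array",
            "items": {
                "type": "object",
                "required": ["scope", "type"],
                "properties": {
                    "scope": {"enum": ["meta", "block"]},
                    "type": {"type": "string"},
                    "value": {"type": "string"},
                },
            },
        },
    },
}

# Raises jsonschema.ValidationError if the blob doesn't match the schema.
jsonschema.validate(
    {"version": "1", "items": [{"scope": "block", "type": "verse"}]},
    FORMAT_SCHEMA,
)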

Shreevatsa R

Oct 2, 2022, 9:44:06 PM
to ambuda-discuss
How about a schema like this (in some mongrel pseudo-BNF notation that I can make clear if necessary!):

ProofedPage := Metadata + Block*
Metadata := {PageNumber: String, ...}
Block := Verse | Paragraph | Heading | Trailer | Skip
Verse := Line*
Paragraph := Line*
Line := (inline text)
Heading := Level + (inline text)
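
To pin down the shapes, the same grammar could also be written out as Python types (names mirror the grammar; this is purely illustrative, and the contents of Trailer and Skip are just guesses):

from dataclasses import dataclass, field
from typing import Union

Line = str  # inline text

@dataclass
class Verse:
    lines: list[Line] = field(default_factory=list)

@dataclass
class Paragraph:
    lines: list[Line] = field(default_factory=list)

@dataclass
class Heading:
    level: int
    text: Line = ""

@dataclass
class Trailer:
    text: Line = ""

@dataclass
class Skip:
    text: Line = ""

Block = Union[Verse, Paragraph, Heading, Trailer, Skip]

@dataclass
class ProofedPage:
    metadata: dict[str, str]  # e.g. {"page-number": "31"}
    blocks: list[Block] = field(default_factory=list)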

The UI is just a ProseMirror editor, which is good at precisely this :)

To make it clearer what I mean: I have a proof-of-concept implementing just something like "Page := (Paragraph|Verse)*" — I'll clean it up and hopefully send as a PR later this week for discussion, but see this screenshot (The ugly borders for now are just to demonstrate that it's aware of blocks of lines, including the distinction between paragraphs and verses, and can render them differently, e.g. indent verses):

[Screenshot: Screen Shot 2022-10-02 at 18.18.53.png (proof-of-concept editor rendering paragraph and verse blocks)]

(I'm yet to add a menu for switching between paragraph and verse etc; right now it's with keyboard shortcuts.)

The backend will receive this document as JSON, and have a document that is already structured — of course the structure could be as minimal as just a sequence of lines if that's all the initial proofreader typed; other proofreaders can edit the structure as needed.


Arun Prasad

Oct 2, 2022, 10:13:07 PM
to ambuda-discuss
I like this as a medium-term solution. My reservation for this in the short-term is that I'm still not sure what the schema will look like (though your suggestion is great!), and I'm concerned that in the process of iterating and understanding the different text shapes we have to deal with, we'll have to port all of our plain-text revisions to a v1 schema, a v2 schema, a v3 schema ... I'd love suggestions on how to navigate this. Perhaps this is even a non-problem!

A separate JSON column has its own problems long-term (e.g. if we wanted version control for it, our solution would be a clumsy copy of what we already do for text), but in the short term we can play around with it and try it out for various texts.

So if there isn't a clean way to migrate, I prefer using a separate column for the very short term, experimenting with it to find a representation that fits multiple texts and needs, then migrating to a unified representation.

Shreevatsa R

Oct 2, 2022, 11:33:03 PM
to Arun Prasad, ambuda-discuss
On Sun, 2 Oct 2022 at 19:13, Arun Prasad <aru...@gmail.com> wrote:
I like this as a medium-term solution. My reservation for this in the short-term is that I'm still not sure what the schema will look like (though your suggestion is great!), and I'm concerned that in the process of iterating and understanding the different text shapes we have to deal with, we'll have to port all of our plain-text revisions to a v1 schema, a v2 schema, a v3 schema ... I'd love suggestions on how to navigate this. Perhaps this is even a non-problem!

This is true but not a big problem IMO: add a `version_id` in the JSON, and have some code on the backend to, *when the field is requested*, convert from v0 -> v1 -> v2 -> v_latest (the version requested by the frontend).
(I already had to implement something like this to handle existing pages: a function to convert from v0 to v1, where if the field doesn't parse as JSON it's treated as plain text = version 0.)
Note that these (sequences of) conversions will happen only during the response from the backend when the page is loaded, and then when the user saves the revision, the new version will be saved back — we don't need to run a batch process to port all existing revisions; a proofread page that no one visits (or even opens in the editor but doesn't save back) will stay at its current version.
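
Roughly the sketch I have in mind (the function names and the v1/v2 shapes here are invented for illustration):

import json

def _v0_to_v1(raw: str) -> dict:
    # Version 0 is plain text: wrap it in a minimal structured shape.
    return {"version_id": 1, "blocks": [{"type": "paragraph", "content": raw}]}

def _v1_to_v2(doc: dict) -> dict:
    # Example of a later, purely additive schema change.
    doc = dict(doc, version_id=2)
    doc.setdefault("metadata", {})
    return doc

MIGRATIONS = {0: _v0_to_v1, 1: _v1_to_v2}
LATEST_VERSION = 2

def load_revision(stored: str) -> dict:
    # Upgrade a stored revision to the latest version when it is read.
    try:
        doc = json.loads(stored)
        version = doc.get("version_id", 1)
    except (json.JSONDecodeError, AttributeError):
        doc, version = stored, 0
    while version < LATEST_VERSION:
        doc = MIGRATIONS[version](doc)
        version = doc["version_id"]
    return doc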

This may still sound flaky, but it's fine AFAICT: consider that we'll have to sort of deal with this problem anyway (even if keeping plain text fields, as the meaning of the separate column keeps changing), and with JSON fields we can always do at least as well as what we could do with plain text, as we can always "flatten" the current document into plain text. Also, I think that having the schema be visually rendered (so that it's transparent what's the effect of the structure info we do have so far) can actually help iterate better / identify changes needed to the schema. (The cost is slightly more code, but if we manage to get it set up right and document it well etc, I hope it won't be significant enough to affect iteration velocity, and knowing exactly what we have could even help by giving more confidence…)
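
For the record, that flattening is a few lines at most, using the same hypothetical shape as the migration sketch above:

def flatten(doc: dict) -> str:
    # Throw structure away and recover plain page text.
    return "\n\n".join(block["content"] for block in doc.get("blocks", []))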

With the separate-column solution, I'm somewhat concerned about the two going out of sync (e.g. a stray newline or two in the plain text, or an innocuous-looking change in the block-detection heuristic, can make the existing correspondence meaningless), though I guess that's really unlikely too…

 

Shreevatsa R

Oct 9, 2022, 5:29:01 PM
to Arun Prasad, ambuda-discuss
Another thought that occurred to me is that the UI and backend are actually orthogonal: whether we keep one column or two in the backend, in the frontend we can render the result this way (use it as the UI for editing), and save the (client-side) structured document to either one or two columns as we wish.

(Of course, if we have a single document in the frontend it seems a waste to throw it away, and I remain concerned about mapping heuristically identified blocks to information about them that is stored separately, and how to make sure this stays meaningful as the text is edited and blocks are added/removed, which is another argument for storing as a single column, but in principle they are independent.)

Separately: another thing the schema needs to support is footnotes and marks (bold and italic).

