What is this?
A rough sketch of a data model that lets us quickly format a proofed book.
Why do this?
Our current proofing tools create a heuristic TEI file, but several cases still require manual intervention, such as:
- page boundaries
- mixed content (e.g. Sanskrit and commentary, Sanskrit and translation)
- extraneous content that we want to drop from the document
We also want to make this work as fast as possible.
Proposal for a data model
(For clarity, I'll use simple declarative language here, but this design is quite uncertain.)
For most cases, we can do formatting on the block level. Let's start there.
Each page will have a new database column whose format is simply a JSON blob:
```json
{
  "version": "1",
  "items": [...]
}
```
Each item in `items` defines how to transform the raw page text into something structured. For example, page 31 of some text with three verses might be annotated like so:
```json
"items": [
  {"scope": "meta", "type": "page-number", "value": "31"},
  {"scope": "block", "type": "verse"},
  {"scope": "block", "type": "verse"},
  {"scope": "block", "type": "verse"}
]
```
`scope` defines what the given item applies to. If a page has N distinct text blocks, `items` will always contain exactly N dicts with `scope: block`, and (excluding items with other scopes) they are listed in 1:1 correspondence with those blocks.
Sample block types: `verse`, `paragraph`, `skip`, `heading`, `trailer`, `merge-up` (merge into the previous element, e.g. for paragraphs that continue across a page break)
Sample meta types: `page-number`, `section-start`
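To make the block semantics concrete, here is a minimal sketch of how a formatter might apply block-scoped items to a page's raw text. The function name `apply_items` and the tuple output are illustrative assumptions, not part of the proposal; a cross-page `merge-up` (first block of a page continuing the previous page) would need the caller to merge across pages and is left out for brevity.

```python
def apply_items(blocks, items):
    """Pair each scope=block item 1:1 with a raw text block.

    `blocks` is the page's raw text split into N blocks; `items` is the
    parsed JSON annotation. Returns a list of (type, text) tuples.
    """
    block_items = [it for it in items if it["scope"] == "block"]
    # The data model guarantees exactly one scope=block item per text block.
    assert len(block_items) == len(blocks), "scope=block items must match blocks 1:1"

    out = []
    for item, text in zip(block_items, blocks):
        kind = item["type"]
        if kind == "skip":
            continue  # extraneous content we want to drop
        if kind == "merge-up" and out:
            # Merge into the previous element on this page,
            # e.g. a paragraph continuing from the previous block.
            prev_kind, prev_text = out[-1]
            out[-1] = (prev_kind, prev_text + " " + text)
            continue
        out.append((kind, text))
    return out
```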
We can pre-populate `items` with our current heuristic approach.
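Pre-population might look like the sketch below. The function name and the verse-detection regex (a trailing double danda with a verse number, written here with ASCII pipes) are placeholder assumptions; the real heuristics live in the existing proofing tools.

```python
import re

def default_items(page_number, blocks):
    """Build a first-pass JSON blob for a page: blocks that end in a
    verse marker like "... || 12 ||" become verse, everything else
    becomes paragraph. A human then corrects the guesses in the UI.
    """
    items = [{"scope": "meta", "type": "page-number", "value": str(page_number)}]
    for text in blocks:
        is_verse = bool(re.search(r"\|\|\s*\d+\s*\|\|\s*$", text))
        items.append({"scope": "block", "type": "verse" if is_verse else "paragraph"})
    return {"version": "1", "items": items}
```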
UI sketch
I imagine a simple two-column form: the left side shows the raw page text, and the right side is a dynamic form to which we can add rows, each with dropdowns for scope/type and other fields as needed. On page load, we build the right side by parsing the existing JSON; on save, we write the form data to the database through an API.
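Before the save API writes a blob to the database, it would likely validate it. A minimal sketch, assuming the scope and type vocabularies are exactly the samples listed above (the function name and error-list shape are illustrative):

```python
SCOPES = {"meta", "block"}
BLOCK_TYPES = {"verse", "paragraph", "skip", "heading", "trailer", "merge-up"}
META_TYPES = {"page-number", "section-start"}

def validate_blob(blob):
    """Return a list of error strings; an empty list means the blob is acceptable."""
    errors = []
    if blob.get("version") != "1":
        errors.append("unsupported version")
    for i, item in enumerate(blob.get("items", [])):
        scope = item.get("scope")
        if scope not in SCOPES:
            errors.append(f"item {i}: unknown scope {scope!r}")
        elif scope == "block" and item.get("type") not in BLOCK_TYPES:
            errors.append(f"item {i}: unknown block type {item.get('type')!r}")
        elif scope == "meta" and item.get("type") not in META_TYPES:
            errors.append(f"item {i}: unknown meta type {item.get('type')!r}")
    return errors
```

Keeping validation as a pure function makes it easy to run the same checks client-side in the form and server-side in the API.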
Longer-term
This combination of text and JSON structure lends itself well to a structured editor like ProseMirror.