Proposal: text database design and URL schema

Arun Prasad

Oct 8, 2022, 2:23:14 PM
to ambuda-discuss
Here I propose the following:

1. a new database schema for text and audio data.
2. a simple URL schema to access that data.

This is a first attempt. Please comment or amend it as you see fit.

For the database, I propose the following schema:

- Text is the abstract concept of a text, such as the Mahabharata.
- Edition is a specific version of a text, such as the Mahabharata critical edition.
- Section is the unit of organization: sargas, kāṇḍas, etc.
- Block is the unit of content: verses, paragraphs, etc. It has two types: TextBlock and AudioBlock.

Relationships:
- A Text has one or more Editions.
- An Edition has one or more Sections.
- A Section has one or more Blocks.

Rough schemas:

- Text:
  - id: int
  - slug: str
  - title: str
  - language: str (Sanskrit, Hindi, English, ...)
  - type: TextType (an enum)
  - default_edition_id: foreign_key(Edition.id)
- Edition:
  - id: int
  - slug: str
  - title: str
  - language: str
  - structure: JSON string (see notes below)
  - text_id: foreign_key(Text.id)
- Section:
  - id: int
  - slug: str
  - title: str
  - edition_id: foreign_key(Edition.id)
- Block:
  - id: int
  - slug: str
  - title: str
  - type: BlockType (an enum)
    - a block of type TextBlock has an extra text column, "content", which stores an XML blob.
    - a block of type AudioBlock has an extra string column, "media", which stores a UUID referring to an item on our media server.
  - edition_id: foreign_key(Edition.id)
  - section_id: foreign_key(Section.id)
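
To make the rough schemas above concrete, here is a minimal sketch using Python's built-in sqlite3 module. Table and column names follow the proposal. The SQL types, the NOT NULL and CHECK constraints, and the choice to keep both block types in one table with nullable "content" and "media" columns are assumptions made for illustration only.

```python
import sqlite3

# Sketch only: names follow the proposal; types and constraints are assumptions.
SCHEMA = """
CREATE TABLE text (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL UNIQUE,
    title TEXT NOT NULL,
    language TEXT NOT NULL,
    type TEXT NOT NULL,                     -- TextType enum, stored as a string
    default_edition_id INTEGER REFERENCES edition(id)
);
CREATE TABLE edition (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL,
    title TEXT NOT NULL,
    language TEXT NOT NULL,
    structure TEXT NOT NULL DEFAULT '[]',   -- JSON string for the section hierarchy
    text_id INTEGER NOT NULL REFERENCES text(id)
);
CREATE TABLE section (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL,
    title TEXT NOT NULL,
    edition_id INTEGER NOT NULL REFERENCES edition(id)
);
CREATE TABLE block (
    id INTEGER PRIMARY KEY,
    slug TEXT NOT NULL,
    title TEXT NOT NULL,
    type TEXT NOT NULL CHECK (type IN ('text', 'audio')),  -- BlockType enum
    content TEXT,   -- XML blob, used by text blocks
    media TEXT,     -- media-server UUID, used by audio blocks
    edition_id INTEGER NOT NULL REFERENCES edition(id),
    section_id INTEGER NOT NULL REFERENCES section(id)
);
"""


def create_db(path=":memory:"):
    """Create the four tables and return the open connection."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn
```

Text.default_edition_id is left nullable here so that a Text row can be inserted before its first Edition exists.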

Notes on edge cases:
- Many texts will have just one edition or just one section. Even so, we will store them in this schema.
- Some texts have hierarchical sections (e.g. the Ramayana). Rather than manage this hierarchy relationally, just store a JSON blob of sections in Edition.structure that arranges them hierarchically as needed. By doing so, we avoid having to deal with "grand-sections" or "great-grand-sections" in our database.
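
As an illustration of what Edition.structure might hold, here is a small sketch. The JSON shape (a nested list of objects with "slug", "title", and "children" keys) is a hypothetical choice, not something the proposal fixes. In this sketch only the leaves correspond to Section rows, and the nesting exists only in the JSON.

```python
import json

# Hypothetical shape for Edition.structure; the proposal does not define one.
structure_json = """
[
  {"slug": "1", "title": "Balakanda", "children": [
    {"slug": "1.1", "title": "Sarga 1", "children": []},
    {"slug": "1.2", "title": "Sarga 2", "children": []}
  ]},
  {"slug": "2", "title": "Ayodhyakanda", "children": [
    {"slug": "2.1", "title": "Sarga 1", "children": []}
  ]}
]
"""


def leaf_slugs(nodes):
    """Return the slugs of the leaf sections in reading order."""
    slugs = []
    for node in nodes:
        children = node.get("children", [])
        if children:
            slugs.extend(leaf_slugs(children))
        else:
            slugs.append(node["slug"])
    return slugs


structure = json.loads(structure_json)
print(leaf_slugs(structure))  # ['1.1', '1.2', '2.1']
```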

For URLs, I propose the following schema:

Some notation: $text means a text's slug, $block means a block's slug, etc. (A slug is a human-readable ID suitable for use in a URL.) Using this notation, I suggest addressing the data above as follows:

A text: /texts/$text
An edition: /texts/$text:$edition
A section: /texts/$text:$edition/$section
A block: /texts/$text:$edition/$block

$section and $block use a simple numbering scheme:
- Sections are numbered in order: 1, 2, 3, ...
- For hierarchical sections, we use 1.1, 1.2, 1.3, ... 2.1, 2.2, 2.3, ...
- Blocks are numbered according to their section:
  - Generally, we use $section.1, $section.2, ...
  - For header (atha ...) and footer (iti ...) elements, we can use @header and @footer
  - For paragraphs with no clear numbering, we use $section.1a, $section.1b, ... where "1" is the slug of the previous verse. If no such verse exists, use $section.a, $section.b, ...

Notes:
- If ":$edition" is omitted, we will use the text's default edition.

Using the Mahabharata as an example:

/texts/mahabharata (points to Text)
/texts/mahabharata:bori-1966 (points to Edition)
/texts/mahabharata/1.1 (points to Section 1.1 using the default edition)
/texts/mahabharata:bori-1966/1.1 (points to Section 1.1 using the specified edition)
/texts/mahabharata:bori-1966/1.1.1 (points to Block 1.1.1 using the specified edition)
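
The examples above could be parsed along these lines. This is a minimal sketch: the names TextRef and parse_text_path are hypothetical, and the section-or-block part is returned uninterpreted, since a slug such as "1.1" may name a section in one text and a block in another. Resolving it would need the edition's data.

```python
from typing import NamedTuple, Optional


class TextRef(NamedTuple):
    text: str
    edition: Optional[str]  # None means "use the text's default edition"
    ref: Optional[str]      # section or block slug, left uninterpreted


def parse_text_path(path: str) -> TextRef:
    """Split /texts/$text[:$edition][/$ref] into its parts."""
    prefix = "/texts/"
    if not path.startswith(prefix):
        raise ValueError(f"not a text URL: {path!r}")
    head, _, ref = path[len(prefix):].partition("/")
    text, _, edition = head.partition(":")
    if not text:
        raise ValueError(f"missing text slug: {path!r}")
    return TextRef(text, edition or None, ref or None)


print(parse_text_path("/texts/mahabharata:bori-1966/1.1.1"))
```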

By using "," and "-", we can specify multiple blocks at once:

/texts/ramayanam:baroda-1960/1.1.1-1.1.10 (first 10 verses of the Ramayana, Baroda edition)
/texts/meghadutam/1.1,1.3 (first and third verses of the Meghaduta, default edition)
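
A rough sketch of how "," and "-" selectors might be expanded into individual slugs. The function name expand_selector is hypothetical. It assumes a range only varies in its last numeric component, and it generates slugs arithmetically, so lettered slugs such as 1.1a that fall inside a range would be missed. A real implementation would more likely select blocks by their stored order.

```python
from typing import List


def expand_selector(selector: str) -> List[str]:
    """Expand e.g. '1.1.1-1.1.3,1.2.5' into a flat list of slugs."""
    slugs = []
    for part in selector.split(","):
        start, dash, end = part.partition("-")
        if not dash:
            slugs.append(start)
            continue
        start_prefix, _, start_n = start.rpartition(".")
        end_prefix, _, end_n = end.rpartition(".")
        # Assumption: both ends share a prefix and end in a plain number.
        if start_prefix != end_prefix or not (start_n.isdecimal() and end_n.isdecimal()):
            raise ValueError(f"unsupported range: {part!r}")
        low, high = int(start_n), int(end_n)
        if low > high:
            raise ValueError(f"range runs backwards: {part!r}")
        for n in range(low, high + 1):
            slugs.append(f"{start_prefix}.{n}" if start_prefix else str(n))
    return slugs


print(expand_selector("1.1,1.3"))  # ['1.1', '1.3']
```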

Arun Prasad

Oct 8, 2022, 2:27:23 PM
to ambuda-discuss
In addition:

For text relationships, I propose the following tables:

TextRelationship:
  - parent_id: foreign_key(Text.id)
  - child_id: foreign_key(Text.id)
  - relationship: TextRelation (enum). Values include "commentary," "translation," "recording"

BlockRelationship:
  - parent_id: foreign_key(Block.id)
  - child_id: foreign_key(Block.id)

These schemas are simple and highly flexible. I expect most relationships will be one-to-many, but I'm sure a few texts might fit more naturally in a many-to-many mold. The main advantage of using a new table instead of (say) a `parent_id` on Text is that it keeps our data uniform and predictable while giving us the flexibility to build a graph of Sanskrit content.
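
For illustration, here is a minimal sqlite3 sketch of TextRelationship and a lookup over it. The composite primary key, the CHECK constraint, the trimmed-down text table, and the "example-..." slugs are all assumptions for the example. BlockRelationship would follow the same pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text (id INTEGER PRIMARY KEY, slug TEXT NOT NULL UNIQUE);
CREATE TABLE text_relationship (
    parent_id INTEGER NOT NULL REFERENCES text(id),
    child_id INTEGER NOT NULL REFERENCES text(id),
    relationship TEXT NOT NULL
        CHECK (relationship IN ('commentary', 'translation', 'recording')),
    PRIMARY KEY (parent_id, child_id, relationship)
);
""")
# Placeholder rows; the child slugs are invented for the example.
conn.executemany(
    "INSERT INTO text (id, slug) VALUES (?, ?)",
    [(1, "meghadutam"), (2, "example-commentary"), (3, "example-translation")],
)
conn.executemany(
    "INSERT INTO text_relationship VALUES (?, ?, ?)",
    [(1, 2, "commentary"), (1, 3, "translation")],
)


def related_texts(conn, parent_slug, relationship):
    """Return slugs of texts related to the parent in the given way."""
    rows = conn.execute(
        """
        SELECT child.slug
        FROM text_relationship AS rel
        JOIN text AS parent ON parent.id = rel.parent_id
        JOIN text AS child ON child.id = rel.child_id
        WHERE parent.slug = ? AND rel.relationship = ?
        ORDER BY child.slug
        """,
        (parent_slug, relationship),
    ).fetchall()
    return [slug for (slug,) in rows]


print(related_texts(conn, "meghadutam", "commentary"))  # ['example-commentary']
```

Because the relationship lives in its own table, a many-to-many case (one commentary covering two texts, say) is just one more row.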

Kishore Chitrapu

Oct 9, 2022, 11:50:57 AM
to ambuda-discuss
Consider adding "version" in Edition schema. Generally speaking, we have the luxury of immutable data sources. The version, origination time, and source fingerprint (cryptographic hash of the original text) will be useful to preserve the authenticity of the texts. Also, any value in putting a cover page pic here? IMO any visual data to the reader to connect the Edition to their previous interaction and a sense of authenticity improves UX.
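
A source fingerprint of the kind described could be computed roughly as follows. This is only a sketch: SHA-256 and hashing the raw source bytes (with no normalization) are assumptions, and the function name is hypothetical.

```python
import hashlib


def source_fingerprint(source_bytes: bytes) -> str:
    """Return a hex SHA-256 digest of the original source file's bytes."""
    return hashlib.sha256(source_bytes).hexdigest()


original = "<text>placeholder source</text>".encode("utf-8")
fingerprint = source_fingerprint(original)

# Any later change to the source yields a different fingerprint.
assert source_fingerprint(original) == fingerprint
assert source_fingerprint(original + b" ") != fingerprint
```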

Shreevatsa R

Oct 9, 2022, 5:33:48 PM
to ambuda-discuss
One reservation I have about this schema is that it merely captures in digital form the editions that already exist as print books, rather than accommodating what is further possible digitally. That is, it's not clear whether the "edition" needs to be higher-level than sections and blocks: it may be more meaningful to talk of a single block (verse, paragraph) as existing in multiple pāṭhāntaras, and a specific printed "edition" as making choices for each block (and also the sequence of blocks within a section). (This would also mean not having to store each block multiple times when two different editions are mostly identical.) I think looking at some of the existing digital-humanities projects for how they handle editions may be interesting.


Arun Prasad

Oct 27, 2022, 11:06:05 PM
to ambuda-discuss
Forgive the late follow-up. As one point of comparison, generally the Perseus project doesn't maintain multiple editions. Do you know of any projects that do?

> accommodating what is further possible digitally

I like what you suggest, but it also seems to me that the data model would be much more complex. That is, instead of the structure above:

relationships: Text --> Section --> Block
columns:
- Block.content (XML blob column)

we would have something like:

relationships: Text --> Section --> Block --> AbstractBlock
columns:
- Block.transform (transformation rules -- JSON?)
- AbstractBlock.content (complex XML blob column that encodes all readings)

Here transform is potentially tightly coupled to MultiformBlock.content, as we might need to update the corresponding transform logic depending on the variant readings (पाठs) being added.

I think we can get a similar effect more simply by doing the following:

relationships: Text --> Section --> Block --> AbstractBlock
columns:
- Block.content (XML blob column)
- Block.abstract_block_id (pointer to an AbstractBlock record)
- AbstractBlock.content (complex XML blob column that encodes all readings)

Then for a given AbstractBlock we can store whatever data we like. At the same time, our main rendering path stays simple because we can just read Block.content as normal.
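
A small sketch of this simpler arrangement, using plain dataclasses rather than database tables. The field names follow the columns listed above, while the helper functions and the sample XML strings are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AbstractBlock:
    id: int
    content: str  # XML that encodes all known readings


@dataclass
class Block:
    id: int
    slug: str
    content: str  # XML for this edition's reading
    abstract_block_id: Optional[int] = None  # optional pointer to an AbstractBlock


def render(block: Block) -> str:
    """Main rendering path: read Block.content and nothing else."""
    return block.content


def parallel_blocks(block: Block, all_blocks: List[Block]) -> List[Block]:
    """Other blocks that point at the same AbstractBlock."""
    if block.abstract_block_id is None:
        return []
    return [
        b for b in all_blocks
        if b.abstract_block_id == block.abstract_block_id and b.id != block.id
    ]


shared = AbstractBlock(id=1, content="<readings>placeholder</readings>")
a = Block(id=10, slug="1.1.1", content="<lg>reading A</lg>", abstract_block_id=shared.id)
b = Block(id=20, slug="1.1.1", content="<lg>reading B</lg>", abstract_block_id=shared.id)
c = Block(id=30, slug="1.1.2", content="<lg>unlinked</lg>")

print(render(a))  # <lg>reading A</lg>
print([blk.id for blk in parallel_blocks(a, [a, b, c])])  # [20]
```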

> This would also mean not having to store each block multiple times when two different editions are mostly identical.

What is the cost of duplicate storage? Storage of text data at our scale is essentially free.

Arun

Arun Prasad

Oct 28, 2022, 11:16:18 AM
to ambuda-discuss
Sorry, "MultiformBlock" should be "AbstractBlock."


> Consider adding "version" in Edition schema. Generally speaking, we have the luxury of immutable data sources. The version, origination time, and source fingerprint (cryptographic hash of the original text) will be useful to preserve the authenticity of the texts.

Agreed 100%

> Also, any value in putting a cover page pic here? IMO any visual data to the reader to connect the Edition to their previous interaction and a sense of authenticity improves UX.

Makes sense, but I'd like to think about it more. In particular, we'll be supporting thousands of texts eventually, and showing thousands of images naively will hurt usability. Perhaps an image just on the text page (and not the index)?

Arun

Shreevatsa R

Oct 28, 2022, 12:44:35 PM
to Arun Prasad, ambuda-discuss
• Yes, good point that we can ignore the cost of duplicate storage.

• To start with, we could also simply do what you originally proposed (without any AbstractBlock), or even treat different editions as parallel texts. (And identify parallel versions later.)

• I don't know offhand of projects that maintain multiple editions, though I'm sure several do. (I searched for the buzzwords "digital humanities" with "[critical] editions" and found this large catalogue of links to projects, though a couple of minutes of clicking didn't turn up a good example.) Suhas, or someone actually working in academia, may know of a few. See for example saktumiva.org, which has an example of four editions of a certain text. (And I've heard that other people are using it too…)

• Another point: sections may have multiple levels: the Rāmāyaṇa has kāṇḍas, themselves organized into sargas. The Mahābhārata has 18 parvas, each with multiple (I don't know what they're called), themselves grouped into groupings also called "parva"s IIUC. But it may be that only these very large texts have these multiple levels, so it's not worth "edge-case poisoning" just to accommodate them.
