Kate Subtitle

Rachelle Shriver

Aug 5, 2024, 1:12:20 AM

to spirpurmahead

This is not a Xiph codec, though it may be embedded in Ogg alongside Xiph codecs such as Vorbis and Theora. As such, please do not assume that Xiph has anything to do with this, much less any responsibility for it.

Text and images can be carried and animated by a Kate stream. Most of the time, a Kate stream will be multiplexed with audio/video to carry subtitles, song lyrics (with or without karaoke data), etc, though this is optional.

Series of curves (splines, segments, etc) may be attached to various properties (text position, font size, etc) to create animated overlays. This allows scrolling or fading text to be defined. This can even be used to draw arbitrary shapes, so hand drawing can also be represented by a Kate stream.

Example uses of Kate streams are movie subtitles for Theora videos, either text based, as may be created by ffmpeg2theora, or image based, as created by Thoggen (patching needed), and lyrics, as created by oggenc, from vorbis-tools.

As I was adding support for Theora, Speex and FLAC to some software of mine, I found myself wanting to have song lyrics accompanying Vorbis audio. Since Vorbis comments are limited to the headers, one can't add them in the stream as they are sung, so another multiplexed stream would be needed to carry them.

Each Kate packet starts with a one-byte type. A type with the MSB set (ie, between 0x80 and 0xff) indicates a header packet, while a type with the MSB cleared (ie, between 0x00 and 0x7f) indicates a data packet. All header packets then have the Kate magic, from byte offset 1 to byte offset 7 ("kate\0\0\0"). Note that this applies only to header packets: data packets do not contain the Kate signature.
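As a rough illustration, the classification rule above could be sketched as follows in Python. This is not libkate code: the function name and the returned strings are made up for the example, and only the type byte and magic described above are checked.

```python
# Magic carried by every header packet at byte offsets 1..7 (7 bytes).
KATE_MAGIC = b"kate\0\0\0"


def classify_packet(packet: bytes) -> str:
    """Return 'header', 'data' or 'invalid' for a raw Kate packet.

    Hypothetical helper for illustration only.
    """
    if not packet:
        return "invalid"  # there is no type byte to look at
    packet_type = packet[0]
    if packet_type & 0x80:
        # MSB set: header packet, which must carry the Kate magic.
        if packet[1:8] == KATE_MAGIC:
            return "header"
        return "invalid"
    # MSB cleared: data packet, which carries no signature.
    return "data"
```

For instance, `classify_packet(b"\x80kate\0\0\0")` is classified as a header, while any packet whose first byte is below 0x80 is treated as data without further checks.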

When embedded in Ogg, the first packet in a Kate stream (always packet type 0x80, the id header packet) must be placed on a separate page. The corresponding Ogg packet must be marked as beginning of stream (BOS). All subsequent header packets must be on one or more pages. Subsequently, each data packet must be on a separate page.

Category is currently loosely defined, and I haven't yet found a nice way to present it in a generic way, but it is meant for automatic classification of the various multiplexed Kate streams (eg, to recognize that some streams are subtitles (in a set of languages), and some others are commentary (in a possibly different set of languages), etc).

Decoding is done in a way similar to libvorbis. First, initialize a kate_info and a kate_comment structure. Then, read headers by calling kate_decode_headerin. Once all headers have been read, a kate_state is initialized for decoding using kate_decode_init, and kate_decode_packetin is called repeatedly with data packets. Events (eg, text) can be retrieved via kate_decode_eventout.

Encoding is also done in a way similar to libvorbis. First, initialize a kate_info and a kate_comment structure, and fill them out as needed. kate_encode_headers will create Ogg packets from those. Then, kate_encode_text is called repeatedly for all the text events to add. When done, calling kate_encode_finish will create an end of stream packet.

Here, all Ogg packets are sent to kate_high_decode_packetin, which does the right thing (header/data classification, decoding, and event retrieval). Note that you do not get access to the comments directly using this, but you do get access to the kate_info via events.

The number of bits that each part of a granulepos (base and offset) occupies is variable, and each stream may choose how many bits to dedicate to each. The kate_info structure for a stream holds that information in the granule_shift field, so each part may be reconstructed from a granulepos.

The timestamp T of a given Kate packet is split into a base B and an offset O, and these are stored in the granulepos of that packet. The split is done such that B is the time of the earliest event still active at the time, and O is the time elapsed between B and T. Thus, T = B + O. This mimics the way Theora stores its own timestamps in granulepos, where the base acts as a keyframe, and the offset acts as the position of an inter frame from the previous keyframe. Since Kate allows time overlapping events, however, the choice of the base to use is slightly more complex, as it may not be the starting time of the previous event.
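To make the split concrete, here is a minimal sketch in Python. It assumes a Theora-like layout, with the base in the high bits and the offset in the low granule_shift bits; the text above does not spell out the bit layout, so treat that as an assumption. The function names are invented for the example, and all times are in granule units.

```python
from typing import Tuple


def pack_granulepos(base: int, offset: int, granule_shift: int) -> int:
    """Store T = base + offset as (base << granule_shift) | offset.

    Assumed layout: base in the high bits, offset in the low bits.
    """
    if base < 0 or offset < 0:
        raise ValueError("base and offset must be non-negative")
    if offset >= (1 << granule_shift):
        raise ValueError("offset does not fit in granule_shift bits")
    return (base << granule_shift) | offset


def unpack_granulepos(granulepos: int, granule_shift: int) -> Tuple[int, int]:
    """Recover (base, offset) from a granulepos."""
    base = granulepos >> granule_shift
    offset = granulepos & ((1 << granule_shift) - 1)
    return base, offset


def granule_time(granulepos: int, granule_shift: int) -> int:
    """Timestamp T = B + O, in granule units."""
    base, offset = unpack_granulepos(granulepos, granule_shift)
    return base + offset
```

The range check in `pack_granulepos` shows the space limitation that granule_shift imposes: an offset that needs more than granule_shift bits cannot be represented.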

Kate data packets (data packet type 0) include timing information (start time, end time, and time of the earliest event still active). All these are stored as 64-bit values at the rate defined by the granule rate, so they do not suffer from the granule_shift space limitation.

The Kate bitstream format includes motion definitions, originally for karaoke purposes, but which can be used for more general purposes, such as line based drawing, or animation of the text (position, color, etc).

Motions are defined by means of a series of curves (static points, segments, splines (Catmull-Rom, Bezier, and B-splines)). A 2D point can be obtained from a motion for any timestamp during the lifetime of a text. This can be used for moving a marker in 2D above the text for karaoke, or to use the x coordinate to color text when the motion position passes each letter or word, etc. Motions have attached semantics so the client code knows how to use a particular motion. Predefined semantics include text color, text position, etc.

Since a motion can be composed of an arbitrary number of curves, each of which may have an arbitrary number of control points, complex motions can be achieved. If the motion is the main object of an event, it is even possible to have an empty text, and use the motion as a virtual pencil to draw arbitrary shapes. Even on-the-fly handwritten subtitles could be done this way, though this would require a lot of control points, and could not be used with text-to-speech.

It is also possible for motions to be discontinuous: simply insert a curve of 'none' type. While the timestamp lies within such a curve, no 2D point will be generated. This can be used to temporarily hide a marker, for instance.

It is worth mentioning that pauses in the motion can be trivially included by inserting, at the right time and for the right duration, a simple linear interpolation curve with only two equal points, both equal to the position the motion is supposed to pause at.
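The following Python sketch illustrates how a point might be obtained from such a motion. It is a simplification, not the libkate algorithm: the curve representation (a kind, a duration and a list of points) is invented for the example, and only 'none' curves and two-point linear segments are handled, which is enough to show discontinuities and pauses.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]
# Hypothetical representation: (kind, duration, control points).
Curve = Tuple[str, float, List[Point]]


def evaluate_motion(curves: List[Curve], t: float) -> Optional[Point]:
    """Return the 2D point of the motion at time t, or None if hidden."""
    start = 0.0
    for kind, duration, points in curves:
        if duration > 0 and start <= t < start + duration:
            if kind == "none":
                return None  # discontinuity: no point is generated
            if kind != "linear":
                raise ValueError("only 'none' and 'linear' are sketched")
            (x0, y0), (x1, y1) = points
            u = (t - start) / duration  # 0 at curve start, 1 at its end
            return (x0 + u * (x1 - x0), y0 + u * (y1 - y0))
        start += duration
    return None  # t lies outside the lifetime of the motion


motion = [
    ("linear", 2.0, [(0.0, 0.0), (1.0, 0.0)]),  # move right
    ("linear", 1.0, [(1.0, 0.0), (1.0, 0.0)]),  # pause: two equal points
    ("none", 1.0, []),                          # marker hidden
    ("linear", 2.0, [(1.0, 0.0), (1.0, 1.0)]),  # move down
]
```

With this motion, the marker stays at (1.0, 0.0) for any timestamp in the pause curve, and no point is returned while the timestamp lies within the 'none' curve.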

Kate defines a set of predefined mappings so that each decoder user interprets a motion in the same way. A mapping is coded on 8 bits in the bitstream, and the first 128 values are reserved for Kate, leaving 128 for application specific mappings, to avoid constraining creative uses of that feature. Predefined mappings include frame (eg, 0-1 points are mapped to the size of the current video frame), or region, to scale 0-1 to the current region. This allows curves to be defined without knowing in advance the pixel size of the area they should cover.
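As an illustration of what such mappings do, here is a small Python sketch. The string names, the function names and the region layout (x, y, width, height) are assumptions made for the example, not the library's identifiers or numeric codes. Offsetting by the region origin is also my reading of "scale 0-1 to the current region".

```python
from typing import Tuple

Point = Tuple[float, float]


def is_application_mapping(code: int) -> bool:
    """Of the 256 possible 8-bit codes, the first 128 are reserved for Kate."""
    if not 0 <= code <= 255:
        raise ValueError("a mapping is coded on 8 bits")
    return code >= 128


def map_point(point: Point, mapping: str,
              frame_size: Tuple[float, float],
              region: Tuple[float, float, float, float]) -> Point:
    """Turn a 0-1 motion point into pixel coordinates."""
    x, y = point
    if mapping == "frame":
        width, height = frame_size
        return (x * width, y * height)
    if mapping == "region":
        rx, ry, rw, rh = region  # assumed layout: origin, then size
        return (rx + x * rw, ry + y * rh)
    raise ValueError("unknown mapping: " + mapping)
```

The same curve can thus be reused unchanged on a 640x480 and a 1920x1080 frame, since only the frame size passed to `map_point` differs.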

For uses which require more than two coordinates (eg, text color, where 4 (RGBA) values are needed), Kate predefines the semantics text_color_rg and text_color_ba, so a 4D point can be obtained using two different motions.

Since attaching motions to text position, etc, makes it hard for the client to keep track of everything (doing the interpolation, etc), the library supplies a tracker object, which handles the interpolation of the relevant properties. Once initialized with a text and a set of motions, the client code can give the tracker a new timestamp, and get back the current text position, text color, etc.

Using a tracker is not necessary if one wants to use the motions directly, or just ignore them, but it makes life easier, especially considering that the order in which motions are applied does matter (to be defined formally, but the current source code is informative at this point).

Though this is not a feature of the bitstream format, I have created a text file format to describe a series of events to be turned into a Kate bitstream. At its minimum, the following is a valid input to the encoder:

Motions, regions and styles can be declared in a definitions block to be reused by events, or can be defined inline. Defining those in the definitions block places them in a header so they can be reused later, saving space. However, they can also be defined in each event, so they will be sent with the event. This allows them to be generated on the fly (eg, if the bitstream is being streamed from a realtime input).

Please note that the Kate file format is fully separate from the Kate bitstream format. The difference between the two is similar to the difference between a C source file and the resulting object file, when compiled.

Note that the format is not based on XML for a very parochial reason: I very much dislike editing XML by hand, as it's really hard to read. XML is really meant for machines to generically parse text data in a shared syntax but with possibly unknown semantics, and I need those text representations to be easily editable.

This also implies that there could be an XML representation of a Kate stream, which would be useful if one were to make an editor that worked on a higher level than the current all-text representation, and it is something that might very well happen in the future, in parallel with the current format.

When seeking to a particular time in a movie with subtitles, we may end up at a point where a subtitle has been started, but has not been removed yet. Pure streaming doesn't have this problem, as it remembers the subtitle being issued (as opposed to, say, Vorbis, for which all data valid now is decoded from the last packet). With Kate, a text string valid now may have been issued long ago.
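The seeking problem can be illustrated with a short Python sketch. The Event type and function names are hypothetical and the times are arbitrary; the point is only that the events valid at the seek target may have started well before it, so a player has to look back at least to the earliest event still active.

```python
from typing import List, NamedTuple, Optional


class Event(NamedTuple):
    """Hypothetical event record: active for start <= t < end."""
    start: float
    end: float
    text: str


def active_events(events: List[Event], t: float) -> List[Event]:
    """Events that should be on screen at time t."""
    return [e for e in events if e.start <= t < e.end]


def earliest_active_start(events: List[Event], t: float) -> Optional[float]:
    """Start time of the earliest event still active at t, if any."""
    return min((e.start for e in active_events(events, t)), default=None)


events = [
    Event(0.0, 30.0, "issued long ago"),
    Event(10.0, 12.0, "already removed"),
    Event(20.0, 25.0, "issued recently"),
]
```

Seeking to t = 21 in this example requires the event issued at t = 0 as well as the one issued at t = 20, even though the first one was sent long before the seek target.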

The actual text in events may include simple HTML-like markup (at the moment, the allowed markup is the same as the one Pango uses, but more markup types may be defined in the future). It is also possible to ask libkate to remove this markup if the client prefers to receive plain text without the markup.

A header field defines the language (if any) used in the stream (this can be overridden in a data packet, but this is not relevant to this point). At the moment, my test code uses ISO 639-1 two letter codes, but I originally thought to use RFC 3066 tags. However, matching a language to a user selection may be simpler for user code if the language encoding is kept simple. At the moment, I tend to favor allowing both two letter tags (eg, "en") and secondary tags (like "en_EN"), as RFC 3066 tags can be quite complex, but I welcome comments on this.
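To show why simple tags make matching easy for user code, here is one possible matching policy sketched in Python. The policy itself (compare only the primary part, ignoring case and accepting either "_" or "-" as separator) is an assumption for illustration, not something libkate is stated to do, and the function names are invented.

```python
import re


def primary_subtag(tag: str) -> str:
    """Return the lowercased part of a tag before the first '-' or '_'."""
    return re.split(r"[-_]", tag.strip().lower(), maxsplit=1)[0]


def language_matches(stream_language: str, wanted: str) -> bool:
    """True if the stream language and the user's choice share a primary tag.

    A stream with no language set (empty string) never matches.
    """
    stream_primary = primary_subtag(stream_language)
    wanted_primary = primary_subtag(wanted)
    return bool(stream_primary) and stream_primary == wanted_primary
```

Under this policy a user asking for "en" would be offered a stream tagged "en_EN", while a stream with no language set would have to be selected by other means.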