Of kits, documents and sub-documents

Yves Savourel

Jun 4, 2010, 3:21:34 PM

to okapi...@googlegroups.com

One of the things we need to resolve for M8 is what kind of events a "kit" like the XLIFF-TKit, IDZ, or a full project inside an XLIFF document should send.

Currently we have this:

One properties file -> start-doc, etc.
One DOCX -> start-doc for the DOCX itself, then start-sub-document for each extracted internal file
One XLIFF -> start-doc for the XLIFF doc, then one start-sub-doc for each <file>

The problem arises with the new packages, for example the xliff-tkit.

It could send:
- Start-doc upon starting the kit
- Then start-sub-doc for each extracted file

The problem: if it reads an XLIFF document or a DOCX, we would get a start-sub-doc within a sub-doc.
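
To make the nesting problem concrete, here is a minimal sketch in Python. The event names are plain strings modelled on this thread, and `docx_events`, `kit_events` and `max_sub_doc_depth` are hypothetical helpers, not part of the Okapi API. It assumes a naive kit reader that wraps each extracted file in a sub-doc and drops that file's own start-doc.

```python
from typing import Iterable, List

# Hypothetical event names modelled on the thread; not Okapi's actual API.
START_DOC, END_DOC = "start-doc", "end-doc"
START_SUB, END_SUB = "start-sub-doc", "end-sub-doc"


def docx_events(parts: Iterable[str]) -> List[str]:
    """Events for a DOCX read on its own: one start-doc, one sub-doc per inner part."""
    events = [START_DOC]
    for _ in parts:
        events += [START_SUB, END_SUB]
    events.append(END_DOC)
    return events


def kit_events(files: Iterable[List[str]]) -> List[str]:
    """Naive kit reader: one start-doc for the kit, one sub-doc per extracted file."""
    events = [START_DOC]
    for file_events in files:
        events.append(START_SUB)
        # Drop the inner file's own start-doc/end-doc, keep everything else.
        events += [e for e in file_events if e not in (START_DOC, END_DOC)]
        events.append(END_SUB)
    events.append(END_DOC)
    return events


def max_sub_doc_depth(events: Iterable[str]) -> int:
    """Deepest nesting of start-sub-doc events in the stream."""
    depth = deepest = 0
    for event in events:
        if event == START_SUB:
            depth += 1
            deepest = max(deepest, depth)
        elif event == END_SUB:
            depth -= 1
    return deepest


standalone = docx_events(["word/document.xml", "word/footnotes.xml"])
in_kit = kit_events([standalone])
```

Read on its own, the DOCX never nests sub-docs; read through the kit, the same file produces a sub-doc inside a sub-doc.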

Does that mean we need to send start-group instead of start-doc when extracting ODT and DOCX?

I think we need to take a step back and look at what each file is. To me a DOCX and a properties file are basically input units that should correspond to start-doc.

Something like the tkit is a different animal: it's a bunch of input files organized into a single storage. But we should be able to work with those even when they are unpacked. So, to me, a tkit is really a batch of documents rather than a document.

Maybe the solution is to send start-batch for the kit-type input and modify the pipeline driver to allow starting from that animal as well as from a set of listed inputs?
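
A rough sketch of that idea, in Python rather than Okapi's Java and under stated assumptions: `Kit` and `run_batch` are hypothetical names, and the driver is reduced to a generator of `(event type, resource name)` pairs. The point is only that a kit and a plain list of inputs yield the same per-document events.

```python
from dataclasses import dataclass
from typing import Iterator, List, Tuple, Union

Event = Tuple[str, str]  # (event type, resource name)


@dataclass
class Kit:
    """Hypothetical stand-in for a tkit: a named container of input files."""
    name: str
    files: List[str]


def run_batch(source: Union[Kit, List[str]]) -> Iterator[Event]:
    """Start a batch either from a kit or from a list of inputs."""
    if isinstance(source, Kit):
        batch_name, files = source.name, source.files
    else:
        batch_name, files = "<listed inputs>", list(source)
    yield ("start-batch", batch_name)
    for path in files:
        # Each file in the batch is a normal input with its own start-doc.
        yield ("start-doc", path)
        yield ("end-doc", path)
    yield ("end-batch", batch_name)


from_kit = list(run_batch(Kit("demo.tkit", ["a.docx", "b.properties"])))
from_list = list(run_batch(["a.docx", "b.properties"]))
```

Apart from the batch name, both streams are identical, so downstream steps would not need to know whether the files came packed or unpacked.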

Just thinking aloud...
-ys

Jim Hargrave

Jun 4, 2010, 3:47:36 PM

to okapi...@googlegroups.com

Are we also considering XLIFF and PO as tkits?

Jim

Message has been deleted

Yves Savourel

Jun 4, 2010, 4:15:31 PM

to okapi...@googlegroups.com

> Are we also considering XLIFF and PO as tkits?

Good question :)

I'd say PO is just a normal input, just like TS.

XLIFF is more tricky. It's even more of a problem when you have an XLIFF inside a tkit.

It's really a thin wrapper around one or more <file>: There is actually almost no info in <xliff>, everything is set in <file>.

Maybe we could treat ODT/DOCX like one <file> and use <group> for the different types of source inside.

-ys

Jim Hargrave

Jun 4, 2010, 4:29:20 PM

to okapi...@googlegroups.com

> Maybe we could treat ODT/DOCX like one <file> and use <group> for the different types of source inside.

I like this. The real distinction is if the file being processed is
"standalone" or simply a sub-file of the parent (=can't stand alone).

If we use SUB_DOCUMENT for the standalone files inside tkits and XLIFF,
and GROUP for the stuff inside ODT/DOCX, then this makes good
conceptual sense to me.
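
As a small illustration of that rule (a sketch only: the `Container` enum, the `STANDALONE_CONTAINERS` set and `start_event_for_member` are hypothetical, and the event names are just strings):

```python
from enum import Enum


class Container(Enum):
    TKIT = "tkit"
    XLIFF = "xliff"
    DOCX = "docx"
    ODT = "odt"


# Assumption: members of these containers are complete files that can stand alone.
STANDALONE_CONTAINERS = {Container.TKIT, Container.XLIFF}


def start_event_for_member(container: Container) -> str:
    """Pick the start event for a file found inside the given container."""
    if container in STANDALONE_CONTAINERS:
        return "START_SUBDOCUMENT"
    # Inner parts of ODT/DOCX cannot stand alone, so they become groups.
    return "START_GROUP"
```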

Jim

Message has been deleted

Jim Hargrave

Jun 4, 2010, 5:05:30 PM

to okapi...@googlegroups.com

I don't like the START_DOCUMENT to START_SUBDOCUMENT transformation
:-) If we changed the event types would we still need to do this
transformation?

On Fri, Jun 4, 2010 at 2:51 PM, Sergei <vasil...@gmail.com> wrote:
> START_DOCUMENT holds a lot of information about the filtered document.
> When it's transformed to START_SUBDOCUMENT like with ODT/DOCX, that
> precious info is lost.
>
> IMHO if we process "sub-documents" with separate filters, we should
> preserve everything the filter sets in the events it generates.
>
> So maybe the idea of having a layered START_DOCUMENT, holding everything
> the *sub-filter* has written, is not that bad?

Message has been deleted

Yves Savourel

Jun 4, 2010, 5:13:24 PM

to okapi...@googlegroups.com

But the issue is that the documents within ODT/DOCX-type files could be parsed separately too. That's why we currently change start-doc to start-sub-doc. We would just change start-doc to start-group instead.

I guess it boils down (at least for DOCX-type files) to whether or not we should treat the inner files as 'document'.

All this does not resolve the case of XLIFF <file> though.


Yves Savourel

Jun 4, 2010, 5:15:03 PM

to okapi...@googlegroups.com

This layer idea is intriguing.
I'm trying to visualize it... I'll have to chew on it for a while.

-----Original Message-----
From: okapi...@googlegroups.com [mailto:okapi...@googlegroups.com] On Behalf Of Sergei
Sent: Friday, June 04, 2010 3:09 PM
To: okapi-devel
Subject: [okapi-devel] Re: Of kits, documents and sub-documents

> If we changed the event types would we still need to do this
> transformation?

Yes, but changing START_SUBDOCUMENT to START_GROUP. With layered
START_DOCUMENT we actually don't have to transform events and lose
info.

Jim Hargrave

Jun 4, 2010, 5:21:04 PM

to okapi...@googlegroups.com

Layers (maybe embeddings?) are worth looking at. I will vote for
anything that gets rid of the transformation hack :-)

J

On Fri, Jun 4, 2010 at 3:09 PM, Sergei <vasil...@gmail.com> wrote:
>> If we changed the event types would we still need to do this
>> transformation?
>

Yves Savourel

Jun 4, 2010, 5:24:38 PM

to okapi...@googlegroups.com

We may still have to deal with transformation, for example when we have to use a JavaScript filter inside an HTML file. I assume we would make that a group, but the sub-filter used would initially work as it does for a separate document.
I guess I'm still trying to grasp how layering would work there.

Yves

Jun 7, 2010, 11:42:33 PM

to okapi-devel

Fredrik brought me up to speed with the part of the discussion I missed
about the Document/SubDocument issue.
So if I understand correctly the idea would be to:

- Remove the start/endSubDocument events

- Have nested start/endDocument events instead, with additional
metadata corresponding to the level/layer

- And the startDocument would have a reference to the parent document
(not just its parent event)

Did I get it right?

I guess it does sound interesting: It’s true that the sub-document
level is rather artificial and also doesn’t generalize very well.

From the refactoring viewpoint:

- a Start/EndDocument of level 1 would be like the Start/EndSubDocument
today, so we could probably replace things without too much code drama.

- it would save us some conversion from startdocument to
startsubdocument or startgroup like we do today.

- the steps would have to be changed a bit: many do things on
StartDocument, so we would have to make sure the level is checked.
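
A minimal sketch of what such a nested event and a level-aware step could look like. It is written in Python for brevity, and the `StartDocument` class here, its `parent` field and `CountTopLevelStep` are illustrative assumptions, not the existing Okapi resource or step classes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StartDocument:
    """Hypothetical nested start-document event with a reference to its parent document."""
    name: str
    parent: Optional["StartDocument"] = None

    @property
    def level(self) -> int:
        # Level 0 is a top-level input; each enclosing document adds one.
        return 0 if self.parent is None else self.parent.level + 1


class CountTopLevelStep:
    """Example step that must now check the level before acting on StartDocument."""

    def __init__(self) -> None:
        self.count = 0

    def handle_start_document(self, event: StartDocument) -> None:
        if event.level == 0:
            self.count += 1


odf = StartDocument("report.odt")
content = StartDocument("content.xml", parent=odf)

step = CountTopLevelStep()
for event in (odf, content):
    step.handle_start_document(event)
```

Here `content.level` is 1, which plays the role of today's sub-document, and the step counts only the top-level input.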

However, from the tkit/xliff viewpoint: How does it resolve the issue?
We would be getting the same events now (start/end documents) but with
different layers: for example, on a normal ODF: one startdoc-L0 for
the ODF, one startdoc-L1 for inner files. The same files from the
tkit: one startdoc-L0 for the kit(?), one startdoc-L1 for the start of
the ODF, etc. We would still have a difference between processing from
a normal batch and processing from a tkit, no?

I see Sergei's notion of changing the layer depending on the
container. But how would this work concretely? The container reader
would do the change?
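
One way to picture the container reader doing the change, as a sketch under assumptions: events are simplified to `(type, name, level)` tuples, and `odf_filter` and `kit_reader` are hypothetical functions, not existing filters.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str, int]  # (event type, resource name, level)


def odf_filter(name: str) -> Iterator[Event]:
    """Hypothetical ODF filter: level 0 for the ODF, level 1 for its inner files."""
    yield ("start-doc", name, 0)
    for part in ("content.xml", "styles.xml"):
        yield ("start-doc", part, 1)
        yield ("end-doc", part, 1)
    yield ("end-doc", name, 0)


def kit_reader(kit_name: str, inner_streams: Iterable[Iterable[Event]]) -> Iterator[Event]:
    """Hypothetical kit reader: re-emits each inner stream one level deeper."""
    yield ("start-doc", kit_name, 0)
    for stream in inner_streams:
        for kind, name, level in stream:
            yield (kind, name, level + 1)
    yield ("end-doc", kit_name, 0)


standalone = list(odf_filter("report.odt"))
from_kit = list(kit_reader("demo.tkit", [odf_filter("report.odt")]))

# Level at which content.xml starts in each case.
level_standalone = next(lv for k, n, lv in standalone if k == "start-doc" and n == "content.xml")
level_from_kit = next(lv for k, n, lv in from_kit if k == "start-doc" and n == "content.xml")
```

The same inner file starts at level 1 when the ODF is a normal input and at level 2 when it comes from the kit, which is the difference described above.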

I'm probably missing something...
-ys

Jim Hargrave

Jun 8, 2010, 1:31:36 PM

to okapi...@googlegroups.com

After thinking about this some more...

As much as I love unification and generalization I'm not sure it's
appropriate in this case. Maybe we need specialization - we need
disambiguation. What if we added some special events (and resources?)
that map perfectly to (1) tkits as files (2) embedded files inside
ODF/DOCX? We still use START/END document for all documents that need
to be filtered and get rid of sub-document.

If we go this route the advantage as I see it is the ability to
process without larger context (Dan's concern yesterday). No need to
keep up with levels or interpret a stack.

Jim

Yves Savourel

Jun 9, 2010, 7:20:35 AM

to okapi...@googlegroups.com

> If we go this route the advantage as I see it is the
> ability to process without larger context (Dan's
> concern yesterday). No need to keep up with levels
> or interpret a stack.

I'm not sure adding new events would help for XLIFF/TKit: for example, a TKit with XLIFF would have the same problem of recurrent StartXYZ events: once for the TKit, once for each XLIFF, no?

I could see a "meta-document" level for tkit/idz/etc. but the problem always goes back to the XLIFF input: some can be "document", others "meta-document".

One thing that seems clearer now is that using <file> for sub-documents (which is what the inner files of ODF/DOCX seem to be) is not a good idea. It results in several <file> elements in XLIFF for a single input. If we play with joining/splitting large XLIFF documents, we could end up with the <file> elements of one input no longer together, and not be able to merge back. Maybe the inner files should be mapped to <group>. The question then is: should that be done at the XLIFF level, or at the filter level (instead of sub-document)?
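
To show the shape of that mapping, here is a small sketch that builds an XLIFF 1.2 skeleton with one <file> per input and one <group> per inner file. `build_xliff` is a hypothetical helper written in Python with the standard `xml.etree.ElementTree` module; it is not how the Okapi XLIFF writer works, the `datatype` and `source-language` values are placeholders, and namespace declarations and translation units are left out.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_xliff(inputs: Dict[str, List[str]]) -> ET.Element:
    """Map each input to one <file>; map its inner parts (if any) to <group> elements."""
    xliff = ET.Element("xliff", version="1.2")
    for original, parts in inputs.items():
        file_el = ET.SubElement(
            xliff,
            "file",
            {"original": original, "source-language": "en", "datatype": "x-undefined"},
        )
        body = ET.SubElement(file_el, "body")
        for part in parts:
            # One group per inner file, instead of one <file> per inner file.
            ET.SubElement(body, "group", resname=part)
    return xliff


doc = build_xliff({
    "report.docx": ["word/document.xml", "word/footnotes.xml"],
    "ui.properties": [],
})
print(ET.tostring(doc, encoding="unicode"))
```

With this layout a DOCX stays inside a single <file>, so splitting or joining the XLIFF at <file> boundaries cannot separate its inner parts.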

Maybe our XLIFF should have a one-to-one match between an input (a "normal" file, not a TKit) and a <file>, except (possibly) for XLIFF inputs that themselves have several <file> elements, but that is a special case.

Maybe the solution is to have two filters for the XLIFF documents: one treating it as meta-document, the other as simple input?

-ys

Jim Hargrave

Jun 9, 2010, 1:50:55 PM

to okapi...@googlegroups.com

I might not be fully appreciating the real problem - but it seems that
if we have accurate events that tell us exactly what we are processing
(tkit start/end, "meta" document, real file etc.) then we can make
the right choices. No need for two filters, only one filter that sends
the correct events.

I do agree about sub-document - it should go. Group might be a better
event - but again is this too general? If I get a group in a step I
would have to study the context pretty carefully to know what kind of
group. Why not save all the step writers this extra work and simply
create a more specific event?
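
A sketch of what that could mean for a step writer, with hypothetical event names (`START_KIT`, `START_EMBEDDED_FILE` and the rest are invented for illustration and are not existing Okapi event types):

```python
from enum import Enum, auto
from typing import Iterable


class EventType(Enum):
    START_KIT = auto()
    END_KIT = auto()
    START_DOCUMENT = auto()
    END_DOCUMENT = auto()
    START_EMBEDDED_FILE = auto()
    END_EMBEDDED_FILE = auto()


def count_documents(events: Iterable[EventType]) -> int:
    """A step that only cares about real documents needs no stack and no level check."""
    return sum(1 for event in events if event is EventType.START_DOCUMENT)


stream = [
    EventType.START_KIT,
    EventType.START_DOCUMENT,        # a DOCX inside the kit
    EventType.START_EMBEDDED_FILE,   # an inner part of the DOCX
    EventType.END_EMBEDDED_FILE,
    EventType.END_DOCUMENT,
    EventType.START_DOCUMENT,        # a properties file inside the kit
    EventType.END_DOCUMENT,
    EventType.END_KIT,
]
```

The step decides from the event type alone; it never has to inspect what encloses the event.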

Jim

Dan Higinbotham

Jun 9, 2010, 1:56:30 PM

to okapi...@googlegroups.com

I'd rather have a new event type than overload Group -- the context could be hard to pin down.


Jim Hargrave

Jun 10, 2010, 7:10:01 PM

to okapi...@googlegroups.com

I agree with Dan about the context. The ultimate solution shouldn't
include any kind of stack or level id that *every* step must
process/decode in order to do simple things. An Event should be
relatively unambiguous.

I think Sergei had an idea about some type of Event converter step.
Maybe for our short term project we could use something like this.
Convert START_FILE to START_BATCH for all tkits we need to process.
Then each tkit is processed as its own batch with each file in the
tkit getting a proper START_FILE/END_FILE.
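
Roughly, such a converter could look like the sketch below. The assumptions are loud ones: events are reduced to `(type, name)` pairs, `is_tkit` guesses from a made-up `.tkit` extension, and `convert_kit_events` is a hypothetical name, not Sergei's actual proposal.

```python
from typing import Iterable, Iterator, Tuple

Event = Tuple[str, str]  # (event type, resource name)


def is_tkit(name: str) -> bool:
    # Assumption for this sketch only: kits are recognised by a ".tkit" extension.
    return name.endswith(".tkit")


def convert_kit_events(events: Iterable[Event]) -> Iterator[Event]:
    """Rewrite the kit's own START_FILE/END_FILE as START_BATCH/END_BATCH."""
    for kind, name in events:
        if is_tkit(name) and kind == "START_FILE":
            yield ("START_BATCH", name)
        elif is_tkit(name) and kind == "END_FILE":
            yield ("END_BATCH", name)
        else:
            # Files inside the kit keep their proper START_FILE/END_FILE.
            yield (kind, name)


incoming = [
    ("START_FILE", "demo.tkit"),
    ("START_FILE", "a.docx"),
    ("END_FILE", "a.docx"),
    ("END_FILE", "demo.tkit"),
]
converted = list(convert_kit_events(incoming))
```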

But I may have mutilated the idea :-) We are closing on the end of
the week and we will need some kind of solution for our short term
project. Just trying to stir the pot again.

Jim
