Currently we have this:
One property file -> start-doc, etc.
One DOCX -> start-doc for the DOCX itself, then start-sub-document for each extracted internal file
One XLIFF -> start-doc for the XLIFF doc, then one start-sub-doc for each <file>
The problem arises with the new packages, for example the xliff-tkit.
It could send:
- Start-doc upon starting the kit
- Then start-sub-doc for each extracted file
The problem: if it reads an XLIFF document or a DOCX we would get start-sub-doc within a sub-doc.
Does that mean we need to send start-group instead of start-doc when extracting ODT and DOCX?
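To make the nesting concrete, here is a minimal sketch in Python. It is only a model of the event stream, not the real framework API: the `Ev` enum and the functions `docx_events`, `kit_events` and `max_sub_doc_depth` are hypothetical names made up for this illustration. It assumes a kit demotes each contained DOCX to a sub-document, as described above.

```python
from enum import Enum, auto


class Ev(Enum):
    """Hypothetical stand-ins for the events discussed in this thread."""
    START_DOC = auto()
    END_DOC = auto()
    START_SUB_DOC = auto()
    END_SUB_DOC = auto()


def docx_events(parts, as_sub_doc=False):
    """Events for one DOCX: a wrapper plus one sub-doc per internal part.

    When the DOCX is read from inside a kit (as_sub_doc=True), its own
    wrapper is demoted from a document to a sub-document.
    """
    if as_sub_doc:
        start, end = Ev.START_SUB_DOC, Ev.END_SUB_DOC
    else:
        start, end = Ev.START_DOC, Ev.END_DOC
    yield start
    for _ in parts:
        yield Ev.START_SUB_DOC
        yield Ev.END_SUB_DOC
    yield end


def kit_events(docx_files):
    """Events for a kit: start-doc for the kit, then each DOCX as a sub-doc."""
    yield Ev.START_DOC
    for parts in docx_files:
        yield from docx_events(parts, as_sub_doc=True)
    yield Ev.END_DOC


def max_sub_doc_depth(events):
    """Deepest level of sub-doc nesting seen in an event stream."""
    depth = deepest = 0
    for ev in events:
        if ev is Ev.START_SUB_DOC:
            depth += 1
            deepest = max(deepest, depth)
        elif ev is Ev.END_SUB_DOC:
            depth -= 1
    return deepest


parts = ["word/document.xml", "word/footnotes.xml"]
standalone_depth = max_sub_doc_depth(docx_events(parts))
in_kit_depth = max_sub_doc_depth(kit_events([parts]))
```

With these assumptions a standalone DOCX never nests sub-docs, while the same DOCX inside a kit produces a sub-doc within a sub-doc.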
I think we need to take a step back and look at what each file is. To me a DOCX and a properties file are basically input units that should correspond to start-doc.
Something like the tkit is a different animal: it's a bunch of input files organized into a single storage. But we should be able to work with those even when they are unpacked. So, to me, a tkit is really a batch of documents rather than a document.
Maybe the solution is to send start-batch for the kit-type input and modify the pipeline driver to allow starting from that animal as well as from a set of listed inputs?
Just thinking aloud...
-ys
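A rough sketch of the start-batch idea, in Python for brevity. Everything here is an assumption for illustration: `batch_events` and `run_driver` are made-up names, the kit is modelled as a plain dict, and the event names are just strings, not the real pipeline driver or its events.

```python
def batch_events(inputs):
    """Wrap one start-doc/end-doc pair per input in a single batch."""
    yield ("START_BATCH", None)
    for name in inputs:
        yield ("START_DOC", name)
        yield ("END_DOC", name)
    yield ("END_BATCH", None)


def run_driver(source):
    """Start a batch from either a list of inputs or a kit.

    A kit is modelled (hypothetically) as {"kit": name, "files": [...]}.
    Its files become the batch items, so the kit itself never shows up
    as a document.
    """
    inputs = source["files"] if isinstance(source, dict) else source
    return list(batch_events(inputs))


listed = run_driver(["a.docx", "b.properties"])
from_kit = run_driver({"kit": "job.tkit", "files": ["a.docx", "b.properties"]})
```

In this model the steps downstream see the same events whether the files were listed one by one or came packed in a kit, which is the point of treating the kit as a batch.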
Jim
Good question :)
I'd say PO is just a normal input, just like TS.
XLIFF is more tricky. It's even more of a problem when you have an XLIFF inside a tkit.
It's really a thin wrapper around one or more <file>: There is actually almost no info in <xliff>, everything is set in <file>.
Maybe we could treat ODT/DOCX like one <file> and use <group> for the different types of source inside.
-ys
I like this. The real distinction is whether the file being processed is
"standalone" or simply a sub-file of the parent (=can't stand alone).
If we use SUB_DOCUMENT for the standalone files inside tkits, xliff
and GROUP for the stuff inside ODT/DOCX then this makes good
conceptual sense to me.
Jim
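The rule proposed above could be sketched like this (Python, illustration only). The sets and the function name `inner_start_event` are hypothetical, and the event names are plain strings rather than real framework constants.

```python
# Containers whose entries are standalone files (assumption from the thread).
STANDALONE_CONTAINERS = {"tkit", "xliff"}
# Formats whose inner files cannot stand alone (assumption from the thread).
COMPOSITE_FORMATS = {"odt", "docx"}


def inner_start_event(container_type):
    """Pick the start event for something found inside a container."""
    kind = container_type.lower()
    if kind in STANDALONE_CONTAINERS:
        return "START_SUBDOCUMENT"
    if kind in COMPOSITE_FORMATS:
        return "START_GROUP"
    raise ValueError(f"unknown container type: {container_type!r}")


choices = {kind: inner_start_event(kind) for kind in ("tkit", "xliff", "odt", "docx")}
```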
On Fri, Jun 4, 2010 at 2:51 PM, Sergei <vasil...@gmail.com> wrote:
> START_DOCUMENT holds a lot of information about the filtered document.
> When it's transformed to START_SUBDOCUMENT like with ODT/DOCX, that
> precious info is lost.
>
> IMHO if we process "sub-documents" with separate filters, we should
> preserve all that filter sets in the events it generates.
>
> So maybe the idea to have layered START_DOCUMENT with all that *sub-
> filter* has written, is not that bad?
I guess it boils down (at least for DOCX-type files) to whether or not we should treat the inner files as 'documents'.
All this does not resolve the case of XLIFF <file> though.
-------------------
Sent: Friday, June 04, 2010 3:06 PM
To: okapi...@googlegroups.com
Subject: Re: [okapi-devel] Re: Of kits, documents and sub-documents
-----Original Message-----
From: okapi...@googlegroups.com [mailto:okapi...@googlegroups.com] On Behalf Of Sergei
Sent: Friday, June 04, 2010 3:09 PM
To: okapi-devel
Subject: [okapi-devel] Re: Of kits, documents and sub-documents
> If we changed the event types would we still need to do this
> transformation?
Yes, but changing START_SUBDOCUMENT to START_GROUP. With layered
START_DOCUMENT we actually don't have to transform events and lose
info.
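A small sketch of the difference, in Python and under stated assumptions: the classes `StartDocument` and `StartSubDocument`, their fields (`encoding`, `filter_id`), and the helpers `flatten` and `deepest_layers` are all hypothetical, chosen only to show what "losing info" versus "layering" could mean.

```python
from dataclasses import dataclass


@dataclass
class StartDocument:
    """Hypothetical resource: carries what the filter knows about the input."""
    name: str
    encoding: str
    filter_id: str


@dataclass
class StartSubDocument:
    """Hypothetical resource: only a name survives the transformation."""
    name: str


def flatten(inner):
    """Current approach (as modelled here): demote the inner document."""
    return StartSubDocument(name=inner.name)


def deepest_layers(events):
    """Layered approach: keep every open StartDocument on a stack.

    Returns a snapshot of the stack at its deepest point.
    """
    stack, deepest = [], []
    for kind, payload in events:
        if kind == "START_DOCUMENT":
            stack.append(payload)
            if len(stack) > len(deepest):
                deepest = list(stack)
        elif kind == "END_DOCUMENT":
            stack.pop()
    return deepest


outer = StartDocument("report.docx", "UTF-8", "hypothetical_openxml")
inner = StartDocument("word/document.xml", "UTF-8", "hypothetical_xml")
lost = flatten(inner)
layers = deepest_layers([
    ("START_DOCUMENT", outer),
    ("START_DOCUMENT", inner),
    ("END_DOCUMENT", None),
    ("END_DOCUMENT", None),
])
```

The flattened resource no longer knows the encoding or which filter produced it, while the layered stack keeps the inner filter's data intact at the cost of tracking levels.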
J
On Fri, Jun 4, 2010 at 3:09 PM, Sergei <vasil...@gmail.com> wrote:
>> If we changed the event types would we still need to do this
>> transformation?
>
As much as I love unification and generalization I'm not sure it's
appropriate in this case. Maybe we need specialization - we need
disambiguation. What if we added some special events (and resources?)
that map perfectly to (1) tkits as files (2) embedded files inside
ODF/DOCX? We still use START/END document for all documents that need
to be filtered and get rid of sub-document.
If we go this route the advantage as I see it is the ability to
process without larger context (Dan's concern yesterday). No need to
keep up with levels or interpret a stack.
Jim
I'm not sure adding new events would help for XLIFF/TKit: for example, a TKit with XLIFF would have the same problem of recurring StartXYZ events: once for the TKit, once for each XLIFF, no?
I could see a "meta-document" level for tkit/idz/etc. but the problem always goes back to the XLIFF input: some can be "document", others "meta-document".
One thing that seems clearer now is that using <file> for sub-documents (which is what the inner files of ODF/DOCX seem to be) is not a good idea. It results in several <file> elements in XLIFF for a single input. If we play with joining/splitting large XLIFF documents, we could end up with the <file> elements of one input no longer together, and not be able to merge back. Maybe the inner files should be mapped to <group>. The question then is: should that be done at the XLIFF level, or at the filter level (instead of sub-document)?
Maybe our XLIFF should have a one-to-one match between an input (a "normal" file, not a TKit) and a <file>, except (possibly) for XLIFF inputs that themselves have several <file> elements, but that is a special case.
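As an illustration of the one-input-one-<file> mapping, here is a sketch using Python's standard xml.etree.ElementTree. The function name `build_xliff`, the "en" source language, the "x-docx" datatype value and the sample part names are assumptions, and the XLIFF namespace is left out to keep it short.

```python
import xml.etree.ElementTree as ET


def build_xliff(input_name, parts):
    """One <file> for the whole input, one <group> per inner part.

    parts maps an inner part name to its list of source segments.
    """
    xliff = ET.Element("xliff", version="1.2")
    file_el = ET.SubElement(xliff, "file", {
        "original": input_name,
        "source-language": "en",   # assumption for the example
        "datatype": "x-docx",      # assumption for the example
    })
    body = ET.SubElement(file_el, "body")
    unit_id = 0
    for part_name, segments in parts.items():
        group = ET.SubElement(body, "group", resname=part_name)
        for text in segments:
            unit_id += 1
            unit = ET.SubElement(group, "trans-unit", id=str(unit_id))
            ET.SubElement(unit, "source").text = text
    return xliff


doc = build_xliff("report.docx", {
    "word/document.xml": ["Hello", "World"],
    "word/footnotes.xml": ["A footnote"],
})
xml_text = ET.tostring(doc, encoding="unicode")
```

Because everything from one input stays inside a single <file>, splitting or joining at <file> boundaries cannot separate the parts of one DOCX.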
Maybe the solution is to have two filters for the XLIFF documents: one treating it as meta-document, the other as simple input?
-ys
I do agree about sub-document - it should go. Group might be a better
event - but again is this too general? If I get a group in a step I
would have to study the context pretty carefully to know what kind of
group. Why not save all the step writers this extra work and simply
create a more specific event?
Jim
I think Sergei had an idea about some type of Event converter step.
Maybe for our short term project we could use something like this.
Convert START_FILE to START_BATCH for all tkits we need to process.
Then each tkit is processed as its own batch with each file in the
tkit getting a proper START_FILE/END_FILE.
But I may have mutilated the idea :-) We are closing in on the end of
the week and we will need some kind of solution for our short term
project. Just trying to stir the pot again.
Jim
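A possible shape for such a converter step, sketched in Python. This is a guess at the idea, not Sergei's actual design: the `PROMOTE` table and `convert_kit_events` are hypothetical, events are modelled as (kind, name) tuples, and it assumes the kit reader emits START_FILE for the kit and START_SUBDOCUMENT for each file inside it.

```python
# Each kit-level event is promoted one level up (assumed mapping).
PROMOTE = {
    "START_FILE": "START_BATCH",
    "END_FILE": "END_BATCH",
    "START_SUBDOCUMENT": "START_FILE",
    "END_SUBDOCUMENT": "END_FILE",
}


def convert_kit_events(events):
    """Rewrite a kit's event stream so the kit becomes its own batch.

    Events with no entry in PROMOTE (text units, etc.) pass through
    unchanged.
    """
    for kind, name in events:
        yield (PROMOTE.get(kind, kind), name)


kit_stream = [
    ("START_FILE", "job.tkit"),
    ("START_SUBDOCUMENT", "a.xlf"),
    ("TEXT_UNIT", "tu1"),
    ("END_SUBDOCUMENT", "a.xlf"),
    ("END_FILE", "job.tkit"),
]
converted = list(convert_kit_events(kit_stream))
```

Applied only to kit inputs, each file in the kit ends up with a proper START_FILE/END_FILE pair inside a START_BATCH/END_BATCH wrapper.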