Okapi as a (web) service

Stefan Pries

Jan 11, 2011, 9:28:47 AM
to okapi-devel
One problem is becoming more and more frustrating:

The idea of using Rainbow to set up filter configurations and
pipelines with steps and their configurations has one big
disadvantage: Rainbow's project file is currently not interchangeable
between systems.

The source files, custom filter configurations, and maybe even files
referenced in step configurations (for example, SRX files for
segmentation) will most likely not be available on other systems. Even
if you transferred them along with the Rainbow project file, you would
have to adjust the file paths manually.

Just to recall:
Our main use cases for Okapi are
* text extraction from source files into XLIFF plus skeleton output
(optionally with additional segmentation and TM leveraging)
* the other way round (translated XLIFF + skeleton --> translated
files)

In Ontram, the users who initiate translation jobs usually don't have
the technical knowledge to set up filter configurations and pipelines.
So we'd like other users to set up the required configurations for
each type of file set (we call these filter profiles).

Those filter profiles would be stored somewhere and be available in a
drop-down menu (or whatever) when jobs are created. Okapi would then
be used to execute the pre-defined pipeline on the files just
provided.

The reasons why we decided to use Rainbow for setting up
configurations were:
* We would rather not re-create an interface to do all the
configuration when there already is an existing one (Rainbow)
* Rainbow is a great, comprehensive tool to play around with and try
out various settings and instantly see the results

Maybe we could extend the Rainbow configuration file or create a new
configuration format that Rainbow could export? I think it would be
great to have this information saved/included:
a) files matching a pattern X (e.g. *.htm) will be processed using the
filter configuration Y
b) all filter configurations from a) that were created by the user
c) all pipeline step configuration stuff that's currently an external
reference (SRX files)
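Sketched as XML, those three pieces might look something like the fragment below. All element and attribute names here are invented for illustration; this is not an existing Okapi or Rainbow format.

```xml
<!-- Hypothetical batch-configuration format covering a), b) and c) -->
<batchConfiguration>
  <!-- a) pattern-to-filter-configuration mapping -->
  <mapping pattern="*.htm" filterConfig="okf_html@custom"/>
  <!-- b) the user-created filter configuration, embedded -->
  <filterConfig id="okf_html@custom"><![CDATA[
    ... contents of the custom .fprm file ...
  ]]></filterConfig>
  <!-- c) step resources that are currently external references -->
  <resource id="segmentation.srx"><![CDATA[
    ... contents of the SRX file ...
  ]]></resource>
</batchConfiguration>
```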

What do you think?
Any comments are greatly appreciated.

Cheers,
Stefan

Yves Savourel

Jan 11, 2011, 11:53:47 AM
to okapi...@googlegroups.com
If I understand correctly, the problem you are running into is that the Rainbow project file (.rnb) doesn't hold some of the data needed to perform the process; for example, you have external files (SRX rules, filter configurations, etc.).

I'm not sure how your overall file structure is set up, but maybe the following can help:

In many of our cases we have to move all the files for a project from one place to another. To avoid path problems, we work with relative paths.

For example if you have the following:

C:/project/p1
C:/project/p1/files/*.html

The first thing we do is save a Rainbow project in C:/project/p1. The root of the input files and the filter parameters folder are both set to <auto> by default and will use the folder where the project file is as the root.

Adding the *.html files will make them relative to C:/project/p1.

We put any custom filter configurations or SRX files in C:/project/p1 as well.

The ${ProjDir} and ${rootDir} variables can be used in many places to refer to the same root (C:/project/p1) and to specify SRX files, output paths, etc. from there.
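As an illustration only (this is not Okapi's actual code), resolving such variables against the project root boils down to a simple substitution:

```java
// Sketch of how Rainbow-style path variables can be resolved against a
// project root. The variable names mirror the ${rootDir} / ${ProjDir}
// convention described above; the class and method are hypothetical.
public class PathVariables {

    // Replace both variables with the folder holding the project file
    public static String resolve(String path, String projectDir) {
        return path.replace("${rootDir}", projectDir)
                   .replace("${ProjDir}", projectDir);
    }

    public static void main(String[] args) {
        System.out.println(resolve("${rootDir}/rules.srx", "C:/project/p1"));
        // C:/project/p1/rules.srx
    }
}
```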

Maybe that can help.


> ... create a new configuration format that Rainbow could export?


> I think it would be great to have this information saved/included:
> a) files matching a pattern X (e.g. *.htm) will be processed using
> the filter configuration Y
> b) all filter configurations from a) that were created by the user
> c) all pipeline step configuration stuff that's currently an external
> reference (SRX files)

So: some general purpose file that would hold all the different parts needed to run a process.

- The input files and their associated configuration IDs would be easy.
- The filter configuration files used would also be easy.
- The pipeline steps would be OK, except for any referenced files (like the SRX): because each step is independent, a tool like Rainbow cannot know what its external references are. So the step itself would have to provide a way to serialize any external file it references. Not impossible, but tricky.

Maybe a better understanding of your environment would help in seeing what the best options are?

Here is what I get so far: you are creating a Rainbow project locally from various pieces of information:
- input files residing locally,
- SRX rules from a server,
- filter configurations from a server,
- step parameters also from a server.

Then you send that project, along with the referenced files, to the server and execute it there?
Or am I missing something?

Cheers,
-ys


Jim

Jan 11, 2011, 12:35:42 PM
to okapi...@googlegroups.com
The key difference is that Rainbow is an interactive tool without many batch-processing features, whereas the server needs to be a batch processor with minimal configuration. Both approaches have advantages, but maybe this means that using the Rainbow config file is not the right way to go. Unless, that is, it also makes sense to add more batch-processing features to Rainbow, such as preconfigured resource files (SRX etc.) and file-pattern rules that trigger the proper configuration. You could even extend these rules to define which pipelines are applied to the files.

Jim


Stefan Pries

Jan 12, 2011, 6:42:14 AM
to okapi-devel
> Maybe a better understanding of your environment would help in seeing what the best options are?

Here's what we would like to be able to do:

1. Create a Rainbow project with
- Some input files (for testing purposes only) residing locally
- SRX rules also residing locally
- a customized filter configuration that was created locally (e.g.
okf_...@copy-of-default.fprm, which is saved in the user's
directory by default)
2. Test that configuration locally
3. Transfer that configuration (pipeline, SRX file, filter
configuration file) to a server
4. Make the server execute that configuration on any given set of
files (not just the files used in 1.)

I guess "batch processing feature" is the perfect name for what we
would need. The Rainbow project file does not seem to be the right way
to go, as it saves a whole project and not only a configuration or a
process to be executed (it's a PROJECT file after all, not a
configuration file).

For a batch processing feature we would still need all the
configuration to be serializable (including SRX files referenced in
step parameters). Otherwise the configuration could not be created on
one system and be used on another.

That solution could also be useful outside of server environments.
Imagine someone localizing web-based trainings (WBTs) from a single
vendor. Those WBTs may consist of HTML pages, localizable JavaScript
files, and Flash XML files. Instead of setting up a Rainbow project
for every WBT, it would be possible to set up one batch process and
use it for all of the WBTs. (That would only work for similarly
structured WBTs, of course.)
Those batch configurations could be exchanged between translators and
localization engineers, as well.

Yves Savourel

Jan 12, 2011, 10:42:24 AM
to okapi...@googlegroups.com
> 1. Create a Rainbow project with
> - Some input files (for testing purposes only) residing locally
> - SRX rules also residing locally
> - a customized filter configuration that was created locally (e.g.
> okf_...@copy-of-default.fprm, which is saved in the user's
> directory by default)
> 2. Test that configuration locally
> 3. Transfer that configuration (pipeline, SRX file, filter
> configuration file) to a server
> 4. Make the server execute that configuration on any given set of
> files (not just the files used in 1.)

OK, I see the process better now. Thanks.

I think the best way to solve this is like you said: some kind of file that holds everything, or at least every part that belongs to the reusable configuration.

The most difficult thing to solve will probably be the file references in the step parameters. Outside code cannot really know what the references are or how to serialize them.

I'm not sure if it's doable, but it seems we would almost need an additional method that can provide the parameters plus the de-referenced file content. But it would mean mixing file formats, etc. Not sure how that could work.

Another possible direction could be to have some kind of "harvester" routine that explores a pipeline and manages to gather the referenced files. Based on extension, etc., maybe that can be done... although I doubt it. It would be very difficult to tell the difference between input and output paths, for example.

Or we could create a configuration package based on the Rainbow project and rely on variables like ${rootDir}... maybe that is more doable.

I'll keep thinking about it. There's got to be a solution.

-yves


Stefan Pries

Jan 13, 2011, 7:13:22 AM
to okapi-devel
> I think the best way to solve this is like you said: some kind of file that holds everything, or at least every part that belongs to the reusable configuration.
>
> The most difficult thing to solve will probably be the file references in the step parameters. Outside code cannot really know what the references are or how to serialize them.
>
> I'm not sure if it's doable, but it seems we would almost need an additional method that can provide the parameters plus the de-referenced file content. But it would mean mixing file formats, etc. Not sure how that could work.

Right, mixing file formats could produce quite a mess.

> Another possible direction could be to have some kind of "harvester" routine that explores a pipeline and manages to gather the referenced files. Based on extension, etc., maybe that can be done... although I doubt it. It would be very difficult to tell the difference between input and output paths, for example.

I agree, that would be very difficult to do automatically. Steps could
also have a directory as a parameter. Imagine a leverage step that
grabs all *.tmx files from a specified directory. How would you tell
whether it's an input or an output directory?

Maybe the step parameters could be extended in some way. What if each
parameter had a type, like
- Boolean
- String
- File
- URL

So there would be a parameter class with the fields name, type, and
value. Files and URLs could be serialized easily. The step parameters
could then no longer be serialized in a properties-like (key/value)
way. Could a simple XML structure be an alternative here?
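A minimal sketch of that parameter class, using invented names (TypedParameter, Type) rather than existing Okapi types:

```java
import java.util.Base64;

// Sketch of the typed-parameter idea: each parameter carries a name, a
// type, and a value, so a serializer can tell FILE parameters apart from
// plain strings and inline the file content. All names are hypothetical.
public class TypedParameter {
    public enum Type { BOOLEAN, STRING, FILE, URL }

    public final String name;
    public final Type type;
    public final String value;

    public TypedParameter(String name, Type type, String value) {
        this.name = name;
        this.type = type;
        this.value = value;
    }

    // A FILE parameter can be serialized with its content inlined;
    // base64 keeps binary data safe inside an XML element.
    public String serialize(byte[] fileContent) {
        if (type == Type.FILE && fileContent != null) {
            return "<param name=\"" + name + "\" type=\"FILE\">"
                 + Base64.getEncoder().encodeToString(fileContent)
                 + "</param>";
        }
        return "<param name=\"" + name + "\" type=\"" + type + "\">"
             + value + "</param>";
    }

    public static void main(String[] args) {
        System.out.println(new TypedParameter("useSrx", Type.BOOLEAN, "true")
                .serialize(null));
        // <param name="useSrx" type="BOOLEAN">true</param>
    }
}
```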

That would also make it easier to mix file formats without producing
one big mess. The Rainbow project file is XML already, and the
parameters would be stored as XML in that file, too. Couldn't external
references be stored as CDATA then?

> Or we could create a configuration package based on the Rainbow project and rely on variables like ${rootDir}... maybe that is more doable.

Sounds like a good plan, too. We could work with that solution, I
guess.

Please correct me if I'm wrong: in step parameters, the difficulty is
to distinguish variables from plain text, right? If you wanted to use
a text replacement step to replace "${rootDir}" with some other text,
it would be hard to tell whether ${rootDir} should be treated as plain
text or as a variable.

Cheers,
Stefan

Yves Savourel

Jan 13, 2011, 9:16:40 AM
to okapi...@googlegroups.com
I'm not sure extending StepParameter will help, because it's for the runtime parameters.
But Stefan, I think you touched on the right solution: we would need an annotation to simply mark up the configuration parameters that need to be de-referenced. A harvester can then go through the pipeline, get the parameters of each step, and save the files in a configuration file (XML with CDATA would do well).
We could even avoid looking at what the file is and save it as base64 binary if needed. But obviously it may be handier to save the real format when we can.
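A rough sketch of the annotation-plus-harvester idea, with all names invented for illustration (none of these are existing Okapi types):

```java
import java.lang.annotation.*;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

// Sketch: step parameters that reference external files are marked with
// an annotation, and a harvester walks the parameter object's fields by
// reflection to collect the paths that must be packaged.
public class Harvester {

    @Retention(RetentionPolicy.RUNTIME)
    @Target(ElementType.FIELD)
    public @interface ExternalFileRef { }

    // Example step parameters: only srxPath is an external reference
    public static class SegmentationParams {
        @ExternalFileRef
        public String srxPath = "${rootDir}/rules.srx";
        public boolean segmentSource = true;
    }

    // Collect all annotated file references from one step's parameters
    public static List<String> harvest(Object stepParams)
            throws IllegalAccessException {
        List<String> refs = new ArrayList<>();
        for (Field f : stepParams.getClass().getFields()) {
            if (f.isAnnotationPresent(ExternalFileRef.class)) {
                refs.add((String) f.get(stepParams));
            }
        }
        return refs;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(harvest(new SegmentationParams()));
        // [${rootDir}/rules.srx]
    }
}
```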

-ys
