Is it possible to set the targetEncoding in Longhorn (e.g. to UTF-16LE) somehow?

Axel Becher

Mar 13, 2026, 10:21:55 AM
to okapi-users
Hello Okapi Team,

We would like to define the output/target encoding in Longhorn, just as is possible in Rainbow. In Rainbow, the setting is saved in the manifest as "targetEncoding", but with Longhorn there seems to be no way to set this value. We would expect this to be possible via the BCONF or as a parameter in the REST call, just like "targets" for the target locale.

Are there "hidden" parameters somewhere, or what is the reason the targetEncoding is always set to "UTF-8"? From a user standpoint it would be nicer to retrieve the file in the encoding in which it was submitted, as long as that encoding is supported.

Thanks a lot for any information or advice,
Axel / MittagQI

Marc Mittag

Mar 13, 2026, 11:03:55 AM
to okapi-users

Dear all,

To add to Axel's question: we are trying to do the following with Okapi Longhorn:

  1. Convert a UTF-16LE txt file to XLIFF (the XLIFF should be UTF-8 encoded)
  2. Translate the XLIFF
  3. Convert the XLIFF back to a UTF-16LE txt file containing the translation

Step 1 and 2 work.

Step 3 does not work. You always get back the txt-file in utf-8.
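
As a client-side stopgap for step 3, the UTF-8 file that comes back can be transcoded after download. The sketch below is plain Python and uses no Okapi or Longhorn API; the function name `transcode` and the sample text are made up for illustration.

```python
def transcode(data: bytes, source_encoding: str = "utf-8-sig",
              target_encoding: str = "utf-16-le") -> bytes:
    """Decode bytes from one encoding and re-encode them in another."""
    # "utf-8-sig" decodes plain UTF-8 and also strips a leading BOM if present.
    return data.decode(source_encoding).encode(target_encoding)


# Stand-in for the merged file as it is returned today (UTF-8 bytes).
returned = "Grüße".encode("utf-8")
fixed = transcode(returned)
```

Python's "utf-16-le" codec writes no byte order mark, so whether the consuming system expects one would need to be checked separately.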

The reason is that the manifest file Longhorn creates for the UTF-16LE file contains this:

inputEncoding="UTF-16LE" targetEncoding="UTF-8" 

regardless of what we send to Longhorn as parameters in the pipeline or in the fprm file, which are packaged in the bconf.

There seems to be no way to influence the targetEncoding that Longhorn writes to the manifest.

If we manually edit the manifest, set targetEncoding="UTF-16LE", and then convert the XLIFF back to txt with Longhorn, we get a UTF-16LE txt file, as we want.
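
The manual manifest edit described above could be scripted roughly as follows. This is only a sketch: it relies solely on the two attributes quoted in this message appearing side by side in that order, and the `<doc ... />` wrapper in the sample is a made-up placeholder, not the real manifest layout.

```python
import re

# Matches the attribute pair quoted in the thread, in that order.
# Nothing else about the manifest structure is assumed.
_PAIR = re.compile(r'(inputEncoding="([^"]*)"\s+targetEncoding=")[^"]*(")')


def align_target_encoding(manifest_text: str) -> str:
    """Set targetEncoding to the value of the inputEncoding next to it."""
    return _PAIR.sub(lambda m: m.group(1) + m.group(2) + m.group(3),
                     manifest_text)


# Hypothetical sample line; only the two attributes come from the thread.
sample = '<doc inputEncoding="UTF-16LE" targetEncoding="UTF-8" />'
patched = align_target_encoding(sample)
```

A real script would more likely parse the manifest as XML; the regular expression is used here only because the full structure is not shown in the thread.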

Is there any way to get Longhorn to set targetEncoding in the manifest to something other than UTF-8?

Why does Longhorn not use the encoding it detected for the source file as the targetEncoding (it correctly detects that the encoding is UTF-16LE)?

Is that a gap in the current Longhorn implementation? Because with Rainbow you can achieve what you want, if you set the encoding like I did in this screenshot:

Thank you very much in advance for any help!!!

best

Marc

Chase Tingley

Mar 13, 2026, 8:41:13 PM
to Marc Mittag, okapi-users
My suspicion is that when longhorn initializes the pipeline, it never sets the outputEncoding parameter on the step that ends up generating the package.  However, it will take some time to figure out why that is.  I do suspect this is just oversight.

Marc Mittag

Mar 14, 2026, 5:06:30 AM
to Chase Tingley, okapi-users

Hi Chase,

thank you very much for your answer!

Since you write "just oversight" it sounds like a bug, right?

What would be the expected way to send the output encoding to Longhorn, if it worked?

In Rainbow, the output encoding setting does not seem to be part of the pipeline or the filter settings, but of the document properties, as shown in my screenshot below.

So it does not surprise me that putting it in the pipeline or in the fprm has no effect.

I am asking what the right fix would be here. We would then ask someone to fix it and open a PR.

My impression is that, since it does not seem to be part of the filter settings or the pipeline, the right place would be to pass it as a separate REST parameter, in addition to how the source and target language codes are passed.

Would that be the right way?

best

Marc

Chase Tingley

Mar 15, 2026, 2:29:04 PM
to Marc Mittag, okapi-users
Yes, I think it was just a bug -- the team that originally built longhorn probably never needed the feature.  I agree that setting it via REST call when setting up the longhorn project makes sense.

It's currently possible to pass parameters to override individual step params when you add a bconf to the project. I don't think that can be used to solve this problem already (although it would be good to double-check), but maybe that mechanism could be expanded to set global pipeline properties like output encoding. If that's too clunky, a new call would also be a fine approach.


Marc Mittag

Mar 15, 2026, 6:21:21 PM
to Chase Tingley, okapi-users

Hi Chase,

thank you very much for your answer!

Would there also be a reasonable way to add it to the pipeline as a parameter somehow?

That would be easier for us to implement, given the way we integrate Okapi.

But I guess that is not how Okapi "thinks" in this regard?

Or would it make sense to add it to RawDocumentToFilterEventsStep as params (one for input and one for output encoding)?

best

Marc

Chase Tingley

Mar 15, 2026, 6:54:17 PM
to Marc Mittag, okapi-users
Hi Marc,

When you say "add it to the pipeline", do you mean include it in the .pln data somehow?  I think this is probably possible, although I don't know exactly how you're integrating.


Marc Mittag

Mar 15, 2026, 7:00:08 PM
to Chase Tingley, okapi-users

Hi Chase,

yes, that is the direction I'm thinking of.

Claude already suggested that this would work ;-)

Which, of course, it does not.

But this is what it suggested as part of the .pln file:

<step class="net.sf.okapi.steps.common.FilterEventsToRawDocumentStep">
    <param name="encoding">UTF-16LE</param>
    <param name="targetEncoding">UTF-16LE</param>
</step>

Would something like that make sense? For us it would be easiest, because we would not need to change anything in our code to use it.

If yes, we would develop (or commission) a bugfix for Okapi (Longhorn and Rainbow) that goes in that direction.
I would suggest that specifying this in the pipeline for Rainbow would override the settings in the document properties, like here

best

Marc


Chase Tingley

Mar 16, 2026, 6:57:40 PM
to Marc Mittag, okapi-users
OUTPUT_ENCODING is one of the step parameters, but step parameters come from two places:
- Some are per-step
- Some come from the BatchItemContext, which is created per-input document

OUTPUT_ENCODING (I think) is in the latter camp.

Looking at the longhorn code (ProjectUtils#executeProject()), it ends up calling PipelineWrapper#execute() in base okapi, which ends up here:


This sets the target encoding for a given document pipeline to either the target encoding of that input document (if specified), or else the project target encoding (which defaults to UTF-8).
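
The fallback described in the previous sentence can be sketched as below. This is not the Okapi code, just a plain-Python restatement of the rule; the function and parameter names are made up for illustration.

```python
from typing import Optional

PROJECT_DEFAULT = "UTF-8"  # the project default named in this thread


def resolve_target_encoding(document_target: Optional[str],
                            project_target: str = PROJECT_DEFAULT) -> str:
    """Prefer the document's own target encoding, else the project's."""
    return document_target if document_target else project_target


# No per-document value is set, so the project default wins.
unset_case = resolve_target_encoding(None)
# A per-document value, if it could be supplied, would take precedence.
explicit_case = resolve_target_encoding("UTF-16LE")
```

Under this rule, a UTF-16LE input ends up with a UTF-8 target whenever neither value has been set explicitly, which matches the behavior reported earlier in the thread.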

So it seems like the natural way to support this in longhorn would be to do one or both of these things:
  • When creating a project, allow the project target encoding to be set
  • When adding an input document, allow the target encoding for that document to be set
In either case, there is probably a bit of work needed to actually implement this, since longhorn tracks its state entirely in the files it writes out on disk.

Marc Mittag

Mar 17, 2026, 6:35:42 AM
to Chase Tingley, okapi-users

Hi Chase,

thank you for your answer!

The more I learn about how it works and should work, the more I have the feeling that the main problem is the following bug:

  • Currently, if you pass a UTF-16LE txt file to Longhorn, it recognizes it as UTF-16LE and sets the input encoding in the manifest accordingly, which is fine.
  • But then it sets the output encoding to UTF-8. I would assume that the default (if the user has not specified anything) should be that the output encoding equals the input encoding.
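
The proposed default could look roughly like this. It is a sketch of the intended behavior only, not a patch against Longhorn; the function name is hypothetical, and the dictionary keys simply reuse the two manifest attribute names quoted earlier in the thread.

```python
from typing import Dict, Optional


def manifest_encodings(detected_input: str,
                       requested_output: Optional[str] = None) -> Dict[str, str]:
    """Return both encoding attributes, mirroring the input by default."""
    return {
        "inputEncoding": detected_input,
        # Proposed default: fall back to the input encoding, not to UTF-8.
        "targetEncoding": requested_output or detected_input,
    }


attrs = manifest_encodings("UTF-16LE")
```

An explicit output encoding, should one ever be passed, would still win over the mirrored default.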

Could you and the Okapi dev team agree to this?

For us, that would fix the problem for now. We would then contribute the fix.

Being able to set the output encoding to something different from the input encoding is not something I see us needing in the foreseeable future. Therefore I would go with the basic bugfix, which in my view makes sense anyway, even if it later becomes possible to pass the output encoding to Longhorn somehow.

If you agree, I think we do not need to take part in the dev meeting this afternoon.

best

Marc
