Issue #1355: Tikal and Rainbow when used in command line produce escaped characters in the converted XLF (okapiframework/okapi)

matijakovac

May 9, 2024, 4:52:52 AM
to okapi-...@googlegroups.com
New issue 1355: Tikal and Rainbow when used in command line produce escaped characters in the converted XLF
https://bitbucket.org/okapiframework/okapi/issues/1355/tikal-and-rainbow-when-used-in-command

Matija Kovač:

Hi,
I’m building a Python app around the Okapi Framework so that I can integrate it into my project.
I set up a script that uses Tikal to convert files into XLF first, only to realize that the conversion output escapes the “<” character on some nested elements, but not all.
I thought it might be an issue with Tikal alone, so I added Rainbow to my app too, but the output is the same (in hindsight, of course it is, since they’re both using the same OpenXML filter utility).

My test .docx file contains some inline formatting to exercise the conversion process: some of the words are bold, italic, etc.
As expected, this results in embedded subelements of the <source> element in the translation units. All good so far.

However, it looks like the <run> element is always returned as `&lt;run1>`, with the left “<” escaped while the right “>” is not.
I can of course fix this in Python when storing the file, but then the file refuses to merge back again.

I also tried with HTML files, where the escaped tags appear in the <sup> element, since there is no <run> element.
`<source xml:lang="en">This text is <bpt id="1">&lt;b></bpt>bold<ept id="1">&lt;/b></ept>.</source>`

This inconsistent handling of embedded subelements in the XML structure, or rather their tag representation is of course leading to issues down the line when it comes to positioning tags through the MT process, as well as other issues.

Is this expected behavior? If so, may I know why, and how I am expected to deal with it?
If not, how can it be fixed?

Here’s an example of my code, using Rainbow with the TranslationKitCreation:


```python
import os
import subprocess
import uuid

# Assumed imports: the names used below match Flask's API.
from flask import current_app, jsonify, request


def rainbow_convert_to_xlf():
    # View function; is_extension_supported() is defined elsewhere in the app.
    if 'file' not in request.files:
        return jsonify({'error': 'No file part in the request'}), 400

    file = request.files['file']
    if file.filename == '':
        return jsonify({'error': 'No file selected'}), 400

    if not is_extension_supported(file.filename):
        return jsonify({'error': 'Unsupported file extension'}), 400

    source_lang = request.form.get('source_lang', 'en')  # Default to English if not provided
    target_lang = request.form.get('target_lang', 'en')  # Default to English if not provided

    # Store each upload in its own uniquely named folder.
    folder_name = str(uuid.uuid4())
    upload_folder_path = os.path.join(current_app.config['UPLOAD_FOLDER'], folder_name)
    os.makedirs(upload_folder_path, exist_ok=True)
    input_file_path = os.path.join(upload_folder_path, file.filename)
    file.save(input_file_path)

    xlf_filename = file.filename + '.xlf'
    output_file_path = os.path.join(upload_folder_path, xlf_filename)

    rainbow_command = [
        './rainbow/rainbow.sh',
        '-x', 'TranslationKitCreation',
        '-sl', source_lang,
        '-tl', target_lang,
        input_file_path,
        '-o', output_file_path
    ]

    try:
        # check=True raises CalledProcessError on a non-zero exit code.
        result = subprocess.run(rainbow_command, check=True, capture_output=True, text=True)
        print("Rainbow Command Output:", result.stdout)  # Debug output stdout
        print("Rainbow Command Error:", result.stderr)  # Debug output stderr
        return jsonify({
            'message': 'File converted to XLF successfully',
            'xlf_file_url': f'http://{request.host}/get/{folder_name}/{xlf_filename}'
        }), 200
    except subprocess.CalledProcessError as e:
        current_app.logger.error(f"Rainbow CLI failed: {e.stderr}")
        return jsonify({
            'error': 'Failed to convert file',
            'message': e.stderr
        }), 500
```


Many thanks for building all these amazing tools and for helping me out with this issue.

yves.s...@gmail.com

May 11, 2024, 8:49:41 AM
to matijakovac, okapi-...@googlegroups.com
Hi Matija,

The character '<' being escaped in entries such as
`<source xml:lang="en">This text is <bpt id="1">&lt;b></bpt>bold<ept id="1">&lt;/b></ept>.</source>`
is expected: In the XLIFF document, the "<b>" is part of the XML content rather than an actual tag.
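
A quick way to check this is to run the fragment through an XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree` (the variable names are only illustrative): after parsing, the content of `<bpt>` is the plain text `<b>`, and un-escaping `&lt;` by hand makes the fragment ill-formed, which would explain why a "fixed" file no longer merges.

```python
import xml.etree.ElementTree as ET

# The <source> element from the HTML example, copied verbatim.
escaped = (
    '<source xml:lang="en">This text is <bpt id="1">&lt;b></bpt>'
    'bold<ept id="1">&lt;/b></ept>.</source>'
)

source = ET.fromstring(escaped)
bpt = source.find("bpt")
ept = source.find("ept")

# Once parsed, the escape is gone: each inline code is ordinary text content.
print(repr(bpt.text))
print(repr(ept.text))

# Replacing "&lt;" with "<" turns that text into a real <b> tag that is
# closed by </bpt>, so the fragment is no longer well-formed XML.
unescaped = escaped.replace("&lt;", "<")
try:
    ET.fromstring(unescaped)
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)
```

In other words, any XML-aware reader sees the original markup intact, so the escaping only needs attention if the file is processed as raw text.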

> This inconsistent handling of embedded subelements in the XML structure, or rather their tag representation is of course
> leading to issues down the line when it comes to positioning tags through the MT process, as well as other issues.

It should not be a problem if the MT tool you are using supports XLIFF.
For example, the MT connectors provided with Okapi make sure inline codes are submitted in a way that is compatible with the given MT engine.
That said, the MT engine itself may not be perfect at handling inline codes.
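
To illustrate the general idea only (this is a sketch under my own assumptions, not how the Okapi connectors are implemented), a script could swap each inline code for a neutral token before sending the text to an engine and keep a map to restore the codes afterwards. The function name `to_placeholders` and the `{bpt1}` token format are hypothetical.

```python
import xml.etree.ElementTree as ET


def to_placeholders(source_xml):
    """Return (text, codes) with each inline code replaced by a token."""
    source = ET.fromstring(source_xml)
    parts = [source.text or ""]
    codes = {}
    for child in source:
        # Hypothetical token format such as "{bpt1}"; not an Okapi convention.
        token = "{%s%s}" % (child.tag, child.get("id"))
        codes[token] = child.text or ""
        parts.append(token)
        # The tail is the text that follows the inline element.
        parts.append(child.tail or "")
    return "".join(parts), codes


segment = (
    '<source xml:lang="en">This text is <bpt id="1">&lt;b></bpt>'
    'bold<ept id="1">&lt;/b></ept>.</source>'
)
text, codes = to_placeholders(segment)
print(text)
print(codes)
```

Note that in a complete XLIFF 1.2 file the elements are in the XLIFF namespace, so `child.tag` would carry a namespace prefix that this sketch does not strip.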

Cheers,
-ys