> On May 30, 2020, at 10:33, Emanuel Berg via Users list for the GNU Emacs text editor <
help-gn...@gnu.org> wrote:
>
> Can't we compile a list of what the commercial CATs
> offer? M Helary and Mr Abrahamsen?
× commercial → ○ professional, if you don't mind :)
OmegaT is very much a professional tool and certainly not a "commercial" one.
Here is my idea, based on 20 years of practice but otherwise not very technically informed:
1) CAT tools extract translatable contents from various file formats into an easy-to-handle format, and put the translated contents back into the original format. That way the translator does not have to worry *too much* about the idiosyncrasies of the original format.
→ File filters are a core part of a CAT tool *but*, as was suggested in the thread, it is possible to rely on an external filter that outputs contents in a standard localization "intermediate" format (the current "industry" standards are PO and XLIFF). Such filters provide export and import functions so that the translated files can be converted back to the original format.
File filters can also accept rules for excluding non-translatable text from the output (the current standard for such rules is ITS).
The PO format can be handled by po4a (perl), translate-toolkit (python) and the Okapi Framework tools (java).
XLIFF has the Okapi Framework, OpenXLIFF (electron/node) and the translate-toolkit. All are top-notch, pro-grade free software, and in the case of Okapi and OpenXLIFF they have been developed by people who participated in the standardization process (XLIFF/TMX/SRX/ITS/TBX, etc.)
→ emacs could rely on such external filters and only specialize in one "intermediate" format. The po-mode already does that for PO files.
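To make this concrete, here is a minimal sketch of what relying on an external filter could look like from Emacs, assuming po4a is installed and the source is a plain text file (`my-cat-extract-to-po' is a hypothetical name, not an existing command). Going the other way would be a similar call to po4a-translate.

  ;; Minimal sketch, assuming po4a is installed: an external filter
  ;; (po4a-gettextize) extracts translatable text from a plain text
  ;; file into PO, which po-mode can then edit.
  (defun my-cat-extract-to-po (master po)
    "Run po4a-gettextize on MASTER (a plain text file) and visit PO."
    (interactive "fMaster file: \nFOutput PO file: ")
    (call-process "po4a-gettextize" nil "*po4a*" nil
                  "-f" "text" "-m" (expand-file-name master)
                  "-p" (expand-file-name po))
    (find-file po))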
2) Once the text is extracted, it needs to be segmented. Basic ("no") segmentation usually means paragraph-based segmentation. Paragraphs are defined differently depending on the original format (one or two line breaks in a text file, a block tag in XML-based formats, etc.).
Fine-grained segmentation is obtained with a set of language-specific regexes that includes break rules and no-break rules. A simple example for English: break after a "period followed by a space", but don't break after "Mr. ".
→ File filters usually handle the segmentation part based on user specifications. Once the file is segmented into the intermediate format, it is not structurally trivial to "split" or "merge" segments, because the tool needs to remember what will go back into the original file structure.
→ emacs could rely on the external filters to handle the segmentation.
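For illustration, here is a toy elisp version of such break/no-break rules, with one break rule (period followed by a space) and one no-break exception ("Mr. "); real tools use ordered, per-language rule sets, which is what the SRX standard describes:

  ;; Toy sketch of segmentation with one break rule and one no-break
  ;; exception; not a real SRX implementation.
  (defun my-cat-segment (text)
    "Split TEXT into sentence segments using naive English rules."
    (let ((pos 0) (start 0) (segments '()))
      (while (string-match "\\. " text pos)
        (let ((end (match-end 0)))
          ;; no-break rule: don't split right after "Mr. "
          (if (string-match-p "Mr\\.\\'" (substring text start (1- end)))
              (setq pos end)
            (push (substring text start (1- end)) segments)
            (setq start end pos end))))
      (push (substring text start) segments)
      (nreverse segments)))

  ;; (my-cat-segment "Hello Mr. Smith. How are you?")
  ;; ⇒ ("Hello Mr. Smith." "How are you?")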
3) The real strength of a CAT tool shows where it helps the translator handle all the resources needed in the translation. Let me list potential resources:
- Legacy translations, called "translation memories" (TMs), usually kept in multilingual "aligned" files where a given segment has equivalents in various languages. Translated PO files can be used as TMs; the XML standard is TMX.
- Glossaries, usually in a similar but simpler format: sometimes plain TSV, sometimes CSV; the XML-based standard is TBX.
- Internal translations, which are produced by the translator while translating: each translated segment adds to the project "memory".
- Dictionaries are a more global form of glossaries, usually monolingual, format varies.
- External files, either local documents or web documents, in various formats, usually monolingual (otherwise they would be aligned and used as TMs).
→ Each resource format needs a way to be parsed, stored, fetched, and recycled efficiently during translation.
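As an example of the simplest case, a two-column TSV glossary (source TAB target, an assumed layout for illustration) can be loaded into a hash table and scanned against the current segment; the names are placeholders, not a real API:

  ;; Minimal sketch: load a TSV glossary into a hash table and look up
  ;; its terms in a segment.
  (defun my-cat-load-glossary (file)
    "Return a hash table mapping source terms to targets from TSV FILE."
    (let ((table (make-hash-table :test #'equal)))
      (with-temp-buffer
        (insert-file-contents file)
        (dolist (line (split-string (buffer-string) "\n" t))
          ;; lines without exactly two tab-separated fields are skipped
          (pcase (split-string line "\t")
            (`(,src ,dst) (puthash src dst table)))))
      table))

  (defun my-cat-glossary-hits (segment table)
    "Return (TERM . TRANSLATION) pairs from TABLE found in SEGMENT."
    (let (hits)
      (maphash (lambda (src dst)
                 (when (string-match-p (regexp-quote src) segment)
                   (push (cons src dst) hits)))
               table)
      hits))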
4) Usually the process is the following:
- the translator "enters" a segment
- the tool displays "matches" from the resources that correspond relatively closely to the segment contents
- the translator inserts or modifies the matches
- when no matches are produced the translator enters a translation from scratch
- the translator can add glossary items to the project glossary
- the new translation is added to the "internal" memory set
- the translator moves to the next segment
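In elisp terms, the loop could look roughly like this sketch, where `my-cat-best-matches' stands for a fuzzy-matching function like the one sketched under 5) below (all the names here are placeholders):

  ;; Rough sketch of the per-segment loop. MEMORY is a list of
  ;; (SOURCE . TARGET) pairs; each new translation feeds back into it.
  (defun my-cat-translate-segments (segments memory)
    "Prompt for a translation of each of SEGMENTS, recycling MEMORY."
    (dolist (seg segments)
      (let* ((matches (my-cat-best-matches seg memory)) ; TM lookup
             (default (cddr (car matches)))             ; best target, if any
             (translation (read-string
                           (format "Translate: %s\n> " seg)
                           default)))
        ;; the new pair becomes part of the "internal" memory
        (push (cons seg translation) memory)))
    memory)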
5) The matching is usually based on some sort of Levenshtein distance algorithm. The "tokens" used in the "distance" calculation are usually produced by language-specific tokenizers (the Lucene tokenizers are quite popular).
The better the match, the more efficient the tool is at helping the translator recycle resources. The matching process/quality is where tools profoundly differ (OmegaT is generally considered to have excellent quality matches, sometimes better than expensive commercial tools).
Some tools offer "context" matches, where the previous and next segments are also taken into account; some offer "subsegment" matches, where significant subparts can match even when the whole segment doesn't; etc.
The matching process must sometimes apply to extremely big resources (like the many millions of lines of multilingual TMs in the EU legal corpora) and must thus be able to handle the data quickly regardless of the set size.
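Emacs 27 ships `string-distance', which computes a character-level Levenshtein distance, so a naive version of the matching is easy to sketch (real tools compute the distance over language-aware tokens rather than raw characters, and index big TMs instead of scanning them linearly):

  ;; Naive fuzzy matching over a list of (SOURCE . TARGET) pairs,
  ;; using the built-in `string-distance' (Emacs 27+).
  (defun my-cat-similarity (a b)
    "Return the similarity of strings A and B as a percentage."
    (let ((len (max (length a) (length b))))
      (if (zerop len) 100
        (round (* 100 (- 1.0 (/ (float (string-distance a b)) len)))))))

  (defun my-cat-best-matches (segment memory &optional threshold)
    "Return (SCORE SOURCE . TARGET) entries from MEMORY, best first."
    (let ((threshold (or threshold 70))
          results)
      (dolist (pair memory)
        (let ((score (my-cat-similarity segment (car pair))))
          (when (>= score threshold)
            (push (cons score pair) results))))
      (sort results (lambda (x y) (> (car x) (car y))))))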
6) Goodies that are time savers include:
- history based autocompletion
- glossary/TM/dictionary based autocompletion
- MT services access
- shortcuts that auto insert predefined text chunks
- spell-checking/grammar checking
- QA checks against glossary terms, completeness/length of the translation, integrity of the format structure, numbers used, etc. (QA checks are also available as external processes in some of the solutions mentioned above, or related solutions.)
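The numbers check, for instance, fits in a few lines of elisp (a naive sketch; real tools also handle number formatting differences between languages):

  ;; Naive QA check: numbers present in SOURCE but missing from TARGET.
  (defun my-cat-check-numbers (source target)
    "Return a list of numbers found in SOURCE but not in TARGET."
    (let* ((extract (lambda (s)
                      (let ((pos 0) nums)
                        (while (string-match "[0-9]+\\(?:[.,][0-9]+\\)?" s pos)
                          (push (match-string 0 s) nums)
                          (setq pos (match-end 0)))
                        nums)))
           (target-nums (funcall extract target))
           missing)
      (dolist (n (funcall extract source))
        (unless (member n target-nums)
          (push n missing)))
      missing))

  ;; (my-cat-check-numbers "Chapter 3, page 42" "Chapitre 3, page 24")
  ;; ⇒ ("42")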
> I'll read thru this thread tomorrow (today)
> God willing but I don't understand everything, in
> particular examples would be nice to get the exact
> meaning of the desired functionality...
Go ahead and ask if you have questions.
> With examples we can also see if Emacs already can do
> it. And if not: Elisp contest :)
:)
> Some features are probably silly, we don't have to
> list or do them, or everything in the CATs, just what
> really makes sense and is useful on an every-day basis.
A lot of the heavy-duty tasks can be handled by external processes.
> When we are done, we put it in the wiki or in a pack.
>
> We can't have that Emacs doesn't have a firm grip on
> this issue. Because translation is a very common task
> with text!
>
> Also, let's compile a list of what Emacs already has
> to this end. It doesn't matter if some of that stuff
> already appears somewhere else, modularity is
> our friend.
:)