PO files


Alessandro Falappa

Mar 25, 2021, 6:04:39 AM
to Group: okapi-devel
Hello all,
  I worked some more on the PO escaping topic, starting with a more thorough reading of the current code. Here are my findings:
  • PO file to events extraction is performed by the POFilter; PO events to file merge is performed by GenericFilterWriter with a GenericSkeletonWriter and a POEncoder.
  • A separate POWriter class is also used to write PO files from events and is used in Rainbowkit PO and Transifex package writers.
  • POFilter uses the code finder and a set of default rules to intercept C printf format specifiers (e.g. %-6.3f), basic Java MessageFormat placeholders (e.g. {0}) and some escape combinations (e.g. \r, \n, \t, \f, \a, \b, \v and \r\n), but the user can disable the code finder or change the set of default rules.
  • Code finder rules do not intercept escaped quotes \" and escaped backslash \\.
  • POFilter does not decode escaped characters after stitching together the content of the msgstr/msgid quoted lines. This was probably done to let the code finder operate on the raw content, so escaped quotes and backslashes pass through undecoded.
  • POEncoder does not perform encoding of backslashes, quotes, newlines, tabs and other control characters. This mirrors the POFilter behavior and works for escape combinations intercepted by the default code finder rules.
  • Since escaped backslashes and quotes are neither intercepted nor decoded, it is easy to break the format, for example by inserting an unescaped quote or backslash character in the target for a language.
  • From my runs the POEncoder is almost always called with a TEXT EncoderContext. The encode(String) method is called when there are no codes, while the encode(char) method is called when there are codes. In the latter case the encoder cannot detect whether a quote or backslash character was escaped, as it has no way to "look" at the previous character.
Given these findings I am unsure how to proceed and therefore ask for suggestions/guidance.
I could:
  1. Implement decoding of escaped combinations in the POFilter and conversely encode those combinations in the POEncoder
  2. Add code finder rules to intercept escaped quotes and escaped backslashes
Escaping only the quotes and backslashes that are not already escaped in the POEncoder encode(String) method, as I did in my previous PR, is not enough to get correct behaviour in all cases.
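
Option 1 can be sketched as a pair of symmetric helpers. This is a minimal illustration, not the Okapi implementation: the class name PoEscapeSketch and its methods are hypothetical, and the set of handled escapes (\\, \", \n, \t, \r) is an assumption chosen to keep the example short.

```java
// Hypothetical helper, not part of Okapi: decodes on extraction, encodes on merge.
final class PoEscapeSketch {
    private PoEscapeSketch() {}

    /** Decodes the escapes found inside a PO quoted string (filter side). */
    static String unescape(String raw) {
        StringBuilder out = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c != '\\' || i + 1 == raw.length()) {
                out.append(c); // ordinary char, or a trailing lone backslash kept as-is
                continue;
            }
            char next = raw.charAt(++i);
            switch (next) {
                case 'n':  out.append('\n'); break;
                case 't':  out.append('\t'); break;
                case 'r':  out.append('\r'); break;
                case '"':  out.append('"');  break;
                case '\\': out.append('\\'); break;
                default:   out.append('\\').append(next); // unknown escape: keep verbatim
            }
        }
        return out.toString();
    }

    /** Encodes decoded text back into PO quoted-string form (encoder side). */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length() + 8);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '\\': out.append("\\\\"); break;
                case '"':  out.append("\\\""); break;
                case '\n': out.append("\\n");  break;
                case '\t': out.append("\\t");  break;
                case '\r': out.append("\\r");  break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }
}
```

Because the encoder in this sketch always sees decoded text, it never has to guess whether a character was already escaped. Unknown escapes such as \a are passed through verbatim here, so they would not survive a round trip; a real filter would need to decide how to treat them.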

What do you think?

Regards,


Alessandro Falappa
Senior Java Developer

jim

Mar 25, 2021, 11:54:30 AM
to okapi...@googlegroups.com
Thank you for digging into this.

My vote would be for #1. Before applying any codefinder rules the PO filter should have already decoded the content.

Jim 

Yves Savourel

Mar 25, 2021, 12:42:23 PM
to okapi...@googlegroups.com

+1.

This would give a clean extracted string even when the CodeFinder is not used.

-ys

Alessandro Falappa

Mar 26, 2021, 11:11:27 AM
to Group: okapi-devel


Alessandro Falappa
Senior Java Developer

jim

Mar 26, 2021, 11:56:24 AM
to okapi...@googlegroups.com
Alex,

I'm getting a PO unit test failure with newlines - probably an OS difference (Linux in my case). I'll take a look at it and push a fix.

testEscapesAmongAlreadyEscaped

Jim

jim

Mar 26, 2021, 1:48:24 PM
to okapi...@googlegroups.com
Alex,

I updated the POWriter to use the new POEncoder. I don't believe the embedded encoder did anything special.

We still get one test failure. But I'm not 100% sure the expected string is correct. However, this may be a bug with strings that have a mixture of escaped and unescaped chars. Can you take a look?

Escape/Unescape will make your head spin :-)

@Test @Ignore("Check expected string to make sure it is correct")
public void testEscapesAmongAlreadyEscaped () {
   String snippet = ""
      + "msgid \"' \\\\ \" \\\\\\\"\r"
      + "msgstr \"' \\\\ \" \\\\\\\"\r\r";
   String expected = ""
      + "msgid \"' \\\\ \\\" \\\\\\\\\"\r"
      + "msgstr \"' \\\\ \\\" \\\\\\\\\"\r\r";
   String result = rewrite(getEvents(snippet, locEN, locFR), locFR);
   assertEquals(header.replace('\n', '\r')+expected, result);
}

Alessandro Falappa

Mar 29, 2021, 3:51:12 AM
to Group: okapi-devel
Hello Jim,
  I haven't touched POWriter and its test class in my PR as I thought it was already correctly handling quotes and their escaping.

I will update my fork and have a look. By the way, I am on Linux too.

Regards,


Alessandro Falappa
Senior Java Developer

Alessandro Falappa

Mar 29, 2021, 6:05:23 AM
to Group: okapi-devel
Hi all,
  looking at this test failure unveiled a "corner case" bug in the POFilter (it currently chops a trailing lone backslash after an escaped backslash, e.g. msgstr "\\\") and also suggests that the filter should perhaps be stricter about unescaped quotes and backslashes in the input. I am currently revisiting the logic in the filter to throw an OkapiBadFilterInputException if lone (unescaped) quotes or backslashes are found in the input. The filter could also be made a bit more tolerant and accept unescaped quotes and backslashes to let the POEncoder escape them on output. What would be the best behaviour? Are malformed PO files frequent in the wild?
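
The stricter behaviour could look roughly like the following check on the raw content between the enclosing quotes. It is only a sketch: StrictPoCheckSketch is a hypothetical name, the content is assumed to be already stripped of its delimiting quotes, and IllegalArgumentException stands in for OkapiBadFilterInputException so the example stays self-contained.

```java
// Hypothetical strict check, not the actual POFilter logic.
final class StrictPoCheckSketch {
    private StrictPoCheckSketch() {}

    /**
     * Rejects raw PO string content that contains an unescaped quote
     * or a backslash with nothing after it.
     */
    static void validate(String raw) {
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '"') {
                throw new IllegalArgumentException("Unescaped quote at index " + i);
            }
            if (c == '\\') {
                if (i + 1 == raw.length()) {
                    throw new IllegalArgumentException("Dangling backslash at index " + i);
                }
                i++; // skip the character that the backslash escapes
            }
        }
    }
}
```

With this check, the content \\\ from the example above is rejected because of its dangling third backslash, while \\ \" passes.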

With regard to the POWriter failing unit test, replacing its logic with a POEncoder is not a backward-compatible change. Previously the POWriter corrected unescaped quotes and backslashes to always produce a correct PO file, while the POEncoder always escapes quotes and backslashes, relying on the unescaping done by the POFilter. To let the POWriter handle events coming from non-escaping-aware sources, I would revert the use of the POEncoder in POWriter.
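
The difference between the two strategies can be illustrated as follows. This is a simplified sketch and not the actual POWriter or POEncoder code: the class and method names are hypothetical, only quotes and backslashes are handled, and the list of characters treated as "already escaped" after a backslash is an assumption.

```java
// Hypothetical comparison of the two escaping strategies.
final class TwoEscapersSketch {
    private TwoEscapersSketch() {}

    /** Always escapes: correct only if the input was fully decoded first. */
    static String alwaysEscape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\\' || c == '"') {
                out.append('\\');
            }
            out.append(c);
        }
        return out.toString();
    }

    /** Corrects only what is not already escaped: tolerant of raw input. */
    static String escapeIfNeeded(String raw) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (c == '\\') {
                boolean validPair = i + 1 < raw.length()
                        && "\\\"ntr".indexOf(raw.charAt(i + 1)) >= 0;
                if (validPair) {
                    out.append(c).append(raw.charAt(++i)); // keep existing escape pair
                } else {
                    out.append("\\\\"); // lone backslash: escape it
                }
            } else if (c == '"') {
                out.append("\\\""); // bare quote: escape it
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }
}
```

On the raw content of the failing test, ' \\ " \\\, escapeIfNeeded yields ' \\ \" \\\\ (the expected string in the test), whereas alwaysEscape doubles every backslash and is only appropriate once the filter has decoded the input.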

Regards,


Alessandro Falappa
Senior Java Developer
On Fri, Mar 26, 2021 at 18:48 jim <jhargr...@gmail.com> wrote:

jim

Mar 29, 2021, 11:56:04 AM
to okapi...@googlegroups.com
>>The filter could also be made a bit more tolerant and accept unescaped quotes and backslashes to let the POEncoder escape them on output,

How do other tools and PO editors handle this issue? IMHO, any PO content not properly escaped should be flagged as malformed (OkapiBadFilterInputException) and fixed. But if other popular tools are lenient - then we have to do the same. Our policy has normally been to be lenient on parsing, but strict on writing.

>>With regard to the POWriter failing unit test replacing its logic with a POEncoder is not a backward compatible change.

I'd like to have a single Encoder if possible to avoid confusion. Maybe we could enhance the logic of the POEncoder. We do have the ability to send in parameters if needed.

Jim

Alessandro Falappa

Mar 29, 2021, 12:12:50 PM
to Group: okapi-devel
Hello Jim,
It is lenient on reading and strict on writing as you indicate. The POWriter test should now pass as well.
Note however that escaping is not performed when writing the msgid parts as those go into the skeleton and don't get passed to the POEncoder (see added unit test UnescapedRewrite).

Let me know what you think.

Regards,


Alessandro Falappa
Senior Java Developer

jim

Mar 29, 2021, 12:17:52 PM
to okapi...@googlegroups.com, Yves, Mihai Nita, Chase Tingley
Thanks!

Do you guys on the CC list have any thoughts on PO format and escaping? Do most tools accept "malformed" or unescaped content? Should the POFilter do the same?

Jim