Bug#1017742: po4a: Please provide marker for double strings during po4a-gettextize

Helge Kreutzmann

Aug 19, 2022, 3:20:04 PM

Package: po4a
Version: 0.67-2
Severity: wishlist

Recently, po4a-gettextize started adding spaces to the end of strings
during gettextisation if the strings occur multiple times in the master
file (or translation).

In production, multiple identical strings in the original file are
stored only once in the po file (as they are translated the same way).

Therefore, translators need to review those strings carefully and
remove from the po file the entries that have this final space
added. In the process, they need to choose the most appropriate
translation to keep.

Currently, this can be quite difficult, as the string can occur
multiple times and the trailing space(s) are difficult to spot in
large files. It is also hard to write a good regular expression
to search for these trailing spaces.
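
Until such a marker exists, the affected entries can be located with a
small script rather than a single regular expression. The following is a
minimal Python sketch, not part of po4a or gettext: the function name
`msgids_with_trailing_space` is hypothetical, and it assumes the extra
whitespace sits at the very end of the msgid string, as described above.
It handles multi-line msgids but does not interpret escape sequences.

```python
def msgids_with_trailing_space(po_text):
    """Return (line_number, msgid) pairs whose msgid ends in a space or tab.

    Assumption: po_text is the raw content of a po file, and a msgid is
    written as `msgid "..."` optionally followed by `"..."` continuation lines.
    """
    hits = []
    fragments, start = None, 0  # fragments is None while outside a msgid

    # The extra empty line guarantees that the last msgid gets checked.
    for number, line in enumerate(po_text.splitlines() + [""], start=1):
        stripped = line.strip()

        # Continuation line of the msgid currently being collected.
        if (fragments is not None and len(stripped) >= 2
                and stripped.startswith('"') and stripped.endswith('"')):
            fragments.append(stripped[1:-1])
            continue

        # Any other line ends the msgid: check what was collected.
        if fragments is not None:
            msgid = "".join(fragments)
            if msgid.endswith((" ", "\t")):
                hits.append((start, msgid))
            fragments = None

        # Start collecting a new msgid (msgid_plural is deliberately skipped).
        if stripped.startswith("msgid "):
            rest = stripped[len("msgid "):].strip()
            fragments, start = [rest[1:-1]], number

    return hits
```

Calling it on the contents of a po file yields the line number and text of
each suspicious msgid, which can then be compared against the entry without
the trailing space.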

Therefore, I kindly ask whether you could additionally (!) mark those
strings with a suitable translator comment (e.g. "Potential duplicate
string, review and consolidate"). This would allow users to run msggrep(1)
with the option -C to filter them out for review.

Thanks for considering.

-- System Information:
Debian Release: bookworm/sid
APT prefers testing
APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel taint flags: TAINT_UNSIGNED_MODULE
Locale: LANG=de_DE.UTF-8, LC_CTYPE=de_DE.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to de_DE.UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)

Versions of packages po4a depends on:
ii gettext 0.21-6
ii libpod-parser-perl 1.65-1
ii libsgmls-perl 1.03ii-37
ii libsyntax-keyword-try-perl 0.27-1
ii libyaml-tiny-perl 1.73-1
ii opensp 1.5.2-13+b2
ii perl 5.34.0-5

Versions of packages po4a recommends:
ii liblocale-gettext-perl 1.07-4+b2
ii libterm-readkey-perl 2.38-2
ii libtext-wrapi18n-perl 0.06-9
ii libunicode-linebreak-perl 0.0.20190101-1+b4

po4a suggests no packages.

-- no debconf information

--
Dr. Helge Kreutzmann deb...@helgefjell.de
Dipl.-Phys. http://www.helgefjell.de/debian.php
64bit GNU powered gpg signed mail preferred
Help keep free software "libre": http://www.ffii.de/

Martin Quinson

Aug 20, 2022, 4:50:03 AM

----- On 19 Aug 22, at 21:13, Helge Kreutzmann deb...@helgefjell.de wrote:

> Package: po4a
> Version: 0.67-2
> Severity: wishlist
>
> Recently, po4a-gettextize started adding spaces to the end of strings
> during gettextisation if the strings occur multiple times in the master
> file (or translation).
>
> In production, multiple identical strings in the original file are
> stored only once in the po file (as they are translated the same way).
>
> Therefore, translators need to review those strings carefully and
> remove from the po file the entries that have this final space
> added. In the process, they need to choose the most appropriate
> translation to keep.
>
> Currently, this can be quite difficult, as the string can occur
> multiple times and the trailing space(s) are difficult to spot in
> large files. It is also hard to write a good regular expression
> to search for these trailing spaces.
>
> Therefore, I kindly ask whether you could additionally (!) mark those
> strings with a suitable translator comment (e.g. "Potential duplicate
> string, review and consolidate"). This would allow users to run msggrep(1)
> with the option -C to filter them out for review.
>
> Thanks for considering.

Your bug report makes me realize that the current behavior is completely broken. I was hoping for the next msgmerge to merge the alternative translations of the same msgid, but I now realize that they will most probably be dropped in the process.

Mmm. Let me think about it. I'll try to reimplement this so that the careful review you describe is done automatically during gettextization.

Many thanks for reporting,
Mt

Helge Kreutzmann

Aug 20, 2022, 11:30:04 AM

Hello Martin,

On Sat, Aug 20, 2022 at 10:43:23AM +0200, Martin Quinson wrote:
> ----- On 19 Aug 22, at 21:13, Helge Kreutzmann deb...@helgefjell.de wrote:
>
> > Package: po4a
> > Version: 0.67-2
> > Severity: wishlist
> >
> > Recently, po4a-gettextize started adding spaces to the end of strings
> > during gettextisation if the strings occur multiple times in the master
> > file (or translation).
> >
> > In production, multiple identical strings in the original file are
> > stored only once in the po file (as they are translated the same way).
> >
> > Therefore, translators need to review those strings carefully and
> > remove from the po file the entries that have this final space
> > added. In the process, they need to choose the most appropriate
> > translation to keep.
> >
> > Currently, this can be quite difficult, as the string can occur
> > multiple times and the trailing space(s) are difficult to spot in
> > large files. It is also hard to write a good regular expression
> > to search for these trailing spaces.
> >
> > Therefore, I kindly ask whether you could additionally (!) mark those
> > strings with a suitable translator comment (e.g. "Potential duplicate
> > string, review and consolidate"). This would allow users to run msggrep(1)
> > with the option -C to filter them out for review.
> >
> > Thanks for considering.
>
> Your bug report makes me realize that the current behavior is completely broken. I was hoping for the next msgmerge to merge the alternative translations of the same msgid, but I now realize that they will most probably be dropped in the process.

Yes, that is what I noticed as well. (We just went through it at
manpages-l10n).

> Mmm. Let me think about it. I'll try to reimplement this so that the careful review you describe is done automatically during gettextization.

Thanks!

Greetings

Helge

Martin Quinson

Aug 23, 2022, 7:00:04 AM

Package: po4a
tag 1017742 fixed-upstream
thanks

Hello,

I just fixed the gettextization process, and I think it's now much better. The
approach I introduced in the last version was indeed really suboptimal.

Now, if I gettextize these two files (presented side by side):

```
# hello   | # HELLO
          |
## hello  | ## SUBTITLE
          |
hello     | SAMPLE PARAGRAPH.
```

I get a pot file with a single msgid:

```
#, fuzzy, markdown-text, no-wrap
msgid "hello"
msgstr ""
"#-#-#-#-# file1:1 (type Title #) #-#-#-#-#\n"
"HELLO\n"
"#-#-#-#-# file1:3 (type: Title ##) #-#-#-#-#\n"
"SUBTITLE\n"
"#-#-#-#-# file1:5 (type: Plain text) #-#-#-#-#\n"
"SAMPLE PARAGRAPH."
```
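
To review the alternatives collected in such an entry, the merged msgstr can
be split at the marker lines. This is a minimal Python sketch, assuming the
marker layout is exactly as in the sample above (`#-#-#-#-# <label> #-#-#-#-#`
on a line of its own) and that the msgstr has already been unescaped into
plain text. `split_alternatives` is a hypothetical helper, not a po4a function.

```python
import re

# One marker line as it appears in the sample: "#-#-#-#-# <label> #-#-#-#-#".
# Assumption: the layout is inferred from the sample only and may differ
# in other po4a versions.
_MARKER = re.compile(r"^#-#-#-#-# (.*) #-#-#-#-#$")


def split_alternatives(msgstr):
    """Split a merged msgstr into (label, translation) pairs."""
    alternatives = []
    label, lines = None, []

    for line in msgstr.split("\n"):
        match = _MARKER.match(line)
        if match:
            # A new marker closes the previous alternative, if any.
            if label is not None:
                alternatives.append((label, "\n".join(lines)))
            label, lines = match.group(1), []
        elif label is not None:
            lines.append(line)

    # Close the last alternative.
    if label is not None:
        alternatives.append((label, "\n".join(lines)))
    return alternatives
```

Each pair holds the location label and the candidate translation, so the
translator can pick the one to keep as the final msgstr.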

I think it's much, much better than the previous behavior. Thanks again for
reporting the issue; I'm happy with this new version :)

Bye, Mt.

--
For an independent, transparent and rigorous evaluation!
I support the Inria Evaluation Committee (Commission d'Évaluation de l'Inria).

Helge Kreutzmann

Aug 23, 2022, 12:00:04 PM

Hello Martin,

On Tue, Aug 23, 2022 at 12:46:03PM +0200, Martin Quinson wrote:
> I get a pot file with a single msgid:
>
> ```
> #, fuzzy, markdown-text, no-wrap
> msgid "hello"
> msgstr ""
> "#-#-#-#-# file1:1 (type Title #) #-#-#-#-#\n"
> "HELLO\n"
> "#-#-#-#-# file1:3 (type: Title ##) #-#-#-#-#\n"
> "SUBTITLE\n"
> "#-#-#-#-# file1:5 (type: Plain text) #-#-#-#-#\n"
> "SAMPLE PARAGRAPH."
> ```
>
> I think it's much, much better than the previous behavior. Thanks again for
> reporting the issue; I'm happy with this new version :)

That indeed looks familiar and is a really good solution.

Thanks for your hard work.

(And now back to translating the updated documentation ...)

Greetings

Helge
