[sharutils] unwieldy msgids, unnecessary reformatting

Benno Schulenberg

Jan 9, 2013, 12:18:47 PM

to bug-gn...@gnu.org

Hello maintainer(s) of sharutils,

In the recent releases of sharutils, the help-text fragments for
'shar' first disappeared from the POT file, and then in the next
release reappeared as one giant text. This is terribly inconvenient
for translators. Please revert to the subdivision of the text into
manageable chunks, preferably exactly as it was, or otherwise
into chunks of roughly three to five lines -- it will also make most
of the old translations valid again without any work having to be
done by the translators.

Please also subdivide the printing of larger texts into several print
statements that each encompass a single paragraph. It is much
easier for translators. For example, the now 36-line help text for
'uuencode' contains near the end three paragraphs of running text,
which recur identically a bit further on as a separate text -- if
the text or its paragraphs had been gettextized separately, the
second occurrence wouldn't cost translators any extra work.

(I would provide a patch, but it would be so huge you probably
wouldn't be allowed to accept it without a copyright assignment.)

(Also, there are several typos in some of the new msgids, for example
"create a shell archiv", but that is a minor issue compared to the
reformatting of the strings into enormous texts.)

Regards,

Benno


Benno Schulenberg

Jan 13, 2013, 4:05:25 PM

to Bruce Korb, bug-gn...@gnu.org

(I've reincluded the list in the CC.)

On Thu, Jan 10, 2013, at 20:47, Bruce Korb wrote:
> On 01/09/13 09:18, Benno Schulenberg wrote:
> > In the recent releases of sharutils, the help-text fragments for
> > 'shar' first disappeared from the POT file, and then in the next
> > release reappeared as one giant text.
>

> The disappearance was a mistake. The reappearance was the result
> of not having good guidance on how to do it best. Some years back,
> I solicited some help but didn't get a response.
>
> With short fragments, it is easier to translate, but these short
> fragments get woven into usage text in ways that are disparaged
> by the docs I've read for i18n text.

With "fragment" I mean always a whole line, or a small group of
whole lines. Where is this discouraged in i18n documents?

> Specifically, little bits
> of the usage are emitted with the expectation that a consistent
> amount of horizontal space is used. That works IFF the source
> language is the display language.

When a certain indentation needs to be maintained, this is the
responsibility of the translator. Half of the time I use a slightly
different indentation than the original, and use it consistently.

> My solution for this woven text problem is to build a version
> without a combined usage text, print the help with bit-at-a-time
> text and suck that output into a combined usage string and
> rebuild. In the rebuild, the short strings will never be
> used.

In the source code I see for example:

static char const shar_opt_strs[10449] =" [[[enormous string]]] ";

and afterward a hundred or so definitions like:

#define INPUT_FILE_LIST_DESC (shar_opt_strs+1777)

In my opinion this is madness... If you want to add a space
or a word somewhere, you have to figure out and change fifty
indexes by hand... !

> But that is the cause of the next problem:

>
> > Please also subdivide the printing of larger texts into several print
> > statements that each encompass a single paragraph. It is much
> > easier for translators. For example, the now 36-line help text
>

> A possible solution for this is to fiddle the code so that it
> detects a localized environment and requests a paragraph at a time.
> I'd have to jigger something to split up the message the same
> way for the extraction code. The gettext program extracts the
> text from uuencode-opts.c in a function bogus_function() that
> does not get compiled. The actual string lives in a character
> array uuencode_opt_strs[4248] defined above. I could also put
> it directly into a .pot file and bypass the text extraction.
>
> Suggestions would be very, very welcome. FYI, there are other
> projects that use AutoOpts, most notably NTP and GNUtls.
> But a simple, consistent way of getting it to work for all
> AutoOpts clients would be nice, and probably save i18n effort
> should any more of them get internationalized.

Is it AutoOpts that requires that the help text be provided as a
single huge character array? Or do you need these indexes only
because you decided to use something like:

static tOptDesc optDesc[OPTION_CT] = { [[[bunch of stuff]]] }

It all looks horribly complicated and intertwined to me. Indexes
and lengths are something that the program should figure out,
not ever the programmer.

> http://www.gnu.org/software/autogen/autoopts.html

>
> > 'uuencode' contains near the end three paragraphs of running text,
> > which reoccur identically a bit further on as a separate text -- if
> > the text or its paragraphs had been gettextized separately, the
> > second occurrence wouldn't cost translators any extra work.
>

> Since there is only one source for both texts, getting this to
> work depends upon coming up with a paragraph splitting algorithm
> that would split out an exactly matching paragraph. A desirable
> goal, but might not be easy to do. Please suggest an algorithm
> while I try to puzzle one out, too. (e.g. separate on every
> double newline and every line starting with white space.
> Does that yield something more usable?)

_If_ the help text needs to be a single huge string, why not "add"
(concatenate) many small gettextized strings? Something like:

hugestring = _("shar (GNU sharutils) 4.13.3\n")
+ _("Copyright (C) 1994-2013 Free Software Foundation, Inc., all rights reserved.\n")
+ _("This is free software. It is licensed for use, modification and\n")
+ ...

I have no idea how to actually do this, but it should be possible,
and then let the program itself work out what the lengths of all
these substrings are (_if_ you actually need the indexes).
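
One way the concatenation suggested above might look in C is sketched below. It is an illustration only, not how sharutils or AutoOpts builds its usage text: `build_usage`, `usage_parts`, `translate_fn` and `no_translation` are hypothetical names, and the spacing inside the option lines is invented. Each self-contained fragment is marked with the conventional `N_()` macro (xgettext would need `--keyword=N_` to extract it), is translated at run time through a caller-supplied function, and the program itself records where each fragment starts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Conventional gettext marker: tags a literal for extraction without
 * translating it at the point of definition. */
#define N_(s) s

/* Hypothetical translator hook; a real program would pass gettext(). */
typedef const char *(*translate_fn)(const char *msgid);

/* Self-contained fragments, one msgid each (wording based on the thread). */
static const char *const usage_parts[] = {
    N_("Specifying file encoding methodology:\n"),
    N_("  -M, --mixed-uuencode   decide uuencoding for each file\n"),
    N_("  -B, --uuencode         treat all files as binary\n"),
};
enum { USAGE_PART_CT = sizeof usage_parts / sizeof usage_parts[0] };

/* Translate each fragment, append it to buf, and record where it starts.
 * Returns the total length, or (size_t)-1 if buf is too small. */
size_t build_usage(translate_fn tr, char *buf, size_t bufsize,
                   size_t offsets[USAGE_PART_CT])
{
    size_t used = 0;

    assert(tr != NULL && buf != NULL && bufsize > 0);
    for (size_t i = 0; i < USAGE_PART_CT; i++) {
        const char *text = tr(usage_parts[i]);
        size_t len = strlen(text);

        if (len >= bufsize - used)      /* no room for text plus the NUL */
            return (size_t)-1;
        offsets[i] = used;              /* the program computes the index */
        memcpy(buf + used, text, len);
        used += len;
    }
    buf[used] = '\0';
    return used;
}

/* Stand-in translator that returns the msgid unchanged, so the sketch
 * runs without a message catalog. */
const char *no_translation(const char *msgid)
{
    return msgid;
}
```

With a message catalog installed, the call would presumably be `build_usage(gettext, buf, sizeof buf, offsets)` after the usual `setlocale` and `textdomain` setup. The offsets then follow the translated lengths automatically, so nobody maintains them by hand.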

Benno

> > (I would provide a patch, but it would be so huge you probably
> > wouldn't be allowed to accept it without a copyright assignment.)
>

> Thank you for considering it, but as you have gathered by now,
> any patch to bogus_function() would not be very useful.

>
> > (Also, there are several typos in some of the new msgids, for example
> > "create a shell archiv", but that is a minor issue compared to the
> > reformatting of the strings into enormous texts.)
>

> I have to figure out when I need to delete the last character and
> when not, and then adjust some Scheme code to do it correctly.
>
> Thank you for your well considered response! Regards, Bruce


Bruce Korb

Jan 13, 2013, 5:07:15 PM

to Benno Schulenberg, bug-gn...@gnu.org

Hi Benno,

On 01/13/13 13:05, Benno Schulenberg wrote:
>
> (I've reincluded the list in the CC.)

I omitted the entire list because I'm expecting a somewhat boring
discussion. I'd be more interested in translator feedback, because
you-all are more directly affected.

> On Thu, Jan 10, 2013, at 20:47, Bruce Korb wrote:
>> With short fragments, it is easier to translate, but these short
>> fragments get woven into usage text in ways that are disparaged
>> by the docs I've read for i18n text.

The "short fragments" are the long option names and the short
(~40 character) description, e.g. here is the real source that
describes the "level-of-compression" option:

> flag = {
> name = level-of-compression;
> value = g;
> arg-type = number;
> arg-name = LEVEL;
> arg-range = '1->9';
> arg-default = 9;
> descrip = 'pass @file{LEVEL} for compression';
> doc = <<- _EODoc_
> Some compression programs allow for a level of compression. The
> default is @code{9}, but this option allows you to specify something
> else. This value is used by @command{gzip}, @command{bzip2} and
> @command{xz}, but not @command{compress}.
> _EODoc_;
> };

The option line in long usage appears as:
    -g, --level-of-compression=num  pass LEVEL for compression
That string does not appear anywhere in the source.
It gets pulled together and formatted from the "g", "level-of-compression",
"number", "1->9" and "pass @file{LEVEL} for compression" strings.
So in order to make something that is translatable, I create
a program, emit the help, capture that help text and
put it into the final program. Those strings plus the "doc" string
show up in man pages and texi docs.
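
To make the "pulled together and formatted" step concrete, here is a minimal sketch of assembling one usage line from separate pieces. It is not the AutoOpts implementation: `struct opt_piece`, `format_usage_line` and the 27-column width are assumptions made for illustration. It does show why translating the pieces separately is fragile, since the column layout lives in the format string rather than in any translatable text.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical, simplified option descriptor; not the AutoOpts tOptDesc. */
struct opt_piece {
    char        flag;        /* short option character, e.g. 'g' */
    const char *name;        /* long option name */
    const char *arg_abbrev;  /* argument type as shown in usage, or NULL */
    const char *descrip;     /* one-line description */
};

/* Assemble one usage line from its pieces.  Returns what snprintf
 * returns: the length the complete line needs, excluding the NUL. */
int format_usage_line(const struct opt_piece *o, char *buf, size_t bufsize)
{
    char left[64];

    assert(o != NULL && buf != NULL);
    if (o->arg_abbrev != NULL)
        snprintf(left, sizeof left, "--%s=%s", o->name, o->arg_abbrev);
    else
        snprintf(left, sizeof left, "--%s", o->name);

    /* The fixed column width is the layout assumption that only holds
     * when the pieces are shown in the language they were written in. */
    return snprintf(buf, bufsize, "  -%c, %-27s %s\n",
                    o->flag, left, o->descrip);
}
```

Capturing the output of such a routine, as described above, yields a complete line that a translator can treat as one unit.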

>> Specifically, little bits
>> of the usage are emitted with the expectation that a consistent
>> amount of horizontal space is used. That works IFF the source
>> language is the display language.
>
> When a certain indentation needs to be maintained, this is the
> responsibility of the translator. Half of the time I use a slightly
> different indentation than the original, and use it consistently.
>
>> My solution for this woven text problem is to build a version
>> without a combined usage text, print the help with bit-at-a-time
>> text and suck that output into a combined usage string and
>> rebuild. In the rebuild, the short strings will never be
>> used.
>
> In the source code I see for example:
>
> static char const shar_opt_strs[10449] =" [[[enormous string]]] ";

That is intermediary source. I obviously do not hand edit a 10K string.

> In my opinion this is madness... If you want to add a space
> or a word somewhere, you have to figure out and change fifty
> indexes by hand... !

At the top of that file, you will see:
> /* -*- buffer-read-only: t -*- vi: set ro:
> *
> * DO NOT EDIT THIS FILE (shar-opts.c)
> *
> * It has been AutoGen-ed January 11, 2013 at 11:39:24 AM by AutoGen 5.17.2pre7

so if you want to add a space, do not do it in that file.

> Is it AutoOpts that requires that the help text be provided as a
> single huge character array?

AutoOpts only requires the strings associated with each option and
the program as a whole. On the theory that gluing all these strings
together would be untranslatable and/or sometimes not yield an
aesthetically pleasing help string, I provided a way of overriding
the computation of the usage text by providing _as an alternative_
the entire usage text as a single string. What I am proposing here
is emitting this long usage a paragraph at a time.

>> Since there is only one source for both texts, getting this to
>> work depends upon coming up with a paragraph splitting algorithm
>> that would split out an exactly matching paragraph. A desirable
>> goal, but might not be easy to do. Please suggest an algorithm
>> while I try to puzzle one out, too. (e.g. separate on every
>> double newline and every line starting with white space.
>> Does that yield something more usable?)
>
> _If_ the help text needs to be a single huge string, why not "add"
> (concatenate) many small gettextized strings?

The pieces of the help text are derived from too many sources.
Gluing together little strings is strongly discouraged for
translatable text. Therefore, I am suggesting the splitting up
of the monster string according to a well-defined algorithm,
viz. start a new "paragraph" whenever a non-empty line is
preceded by two line breaks or a non-empty line starts with
a few space characters. I *think* that yields something wieldy.
I could also split them one string per line. That is likely
somewhat easier for me, but seems like it would make the
translation task a bit harder. e.g. there would be no guarantee
that every line would be unique and the same line of text might
translate differently in different contexts. I do think splitting
on "paragraphs" would make the translation effort easier, but I
would take whatever suggestion you make.

> hugestring = _("shar (GNU sharutils) 4.13.3\n")
> + _("Copyright (C) 1994-2013 Free Software Foundation, Inc., all rights reserved.\n")
> + _("This is free software. It is licensed for use, modification and\n")
> + ...
>
> I have no idea how to actually do this, but it should be possible,
> and then let the program itself work out what the lengths of all
> these substrings are (_if_ you actually need the indexes).

The indexes are a relatively unimportant implementation detail.
In order to produce libraries that minimize the number of fixups
required at load time, I produced some functions that assemble
massive text strings and #define-d values that reference that huge
table. I could also make static global strings that go by the
name used in the #define. That eliminates all the offset stuff,
but then the link/loader has more fixup work to do.
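
The two layouts being compared can be sketched side by side. This is an illustration with made-up strings and offsets, not an excerpt from the generated shar-opts.c; the load-time cost difference is the claim made in the message and is not measured here. The `_Static_assert` is an addition of this sketch, showing one way a build could notice when a packed string changes without its offsets being regenerated.

```c
#include <assert.h>
#include <string.h>

/* Style 1: one packed array; each name is an offset into it. */
static const char opt_strs[] =
    "input-file-list\0"
    "level-of-compression\0"
    "pass LEVEL for compression";
#define INPUT_FILE_LIST_NAME      (opt_strs + 0)
#define LEVEL_OF_COMPRESSION_NAME (opt_strs + 16)
#define LEVEL_OF_COMPRESSION_DESC (opt_strs + 37)

/* Compile-time guard: a change in the packed text alters the total size
 * and stops the build until the offsets are regenerated. */
_Static_assert(sizeof opt_strs == 64, "opt_strs changed; regenerate offsets");

/* Style 2: one named object per string, no offsets to maintain. */
const char *const level_of_compression_desc = "pass LEVEL for compression";

/* Returns 1 when both styles name the same text. */
int styles_agree(void)
{
    /* The byte before an offset must be the previous string's NUL. */
    assert(LEVEL_OF_COMPRESSION_DESC[-1] == '\0');
    return strcmp(LEVEL_OF_COMPRESSION_DESC, level_of_compression_desc) == 0;
}
```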

Benno Schulenberg

Jan 14, 2013, 6:07:50 AM

to Bruce Korb, bug-gn...@gnu.org

Hi Bruce,

On Sun, Jan 13, 2013, at 23:07, Bruce Korb wrote:
> I omitted the entire list because I'm expecting a somewhat boring
> discussion. I'd be more interested in translator feedback, because
> you-all are more directly affected.

It may be boring, but one never knows whether someone else
has a good idea on how to handle this. And it is good to have
the discussion archived somewhere.

> The "short fragments" are the long option names and the short
> (~40 character) description, [...]
>
> the option line in long usage appears as:
> -g, --level-of-compression=num pass LEVEL for compression
> That string does not appear anywhere in the source.
> It gets pulled together and formatted from the "g", "level-of-compression",
> "number", "1->9" and "pass @file{LEVEL} for compression" strings.
> So in order to make something that is translatable, I create
> a program, emit the help, capture that help text and
> put it into the final program.

Alright, now I understand. Indeed it is not good to offer those
fragments for translation, but it would be perfectly okay to offer
each complete option description, for example the above:

"-g, --level-of-compression=num pass LEVEL for compression"

> AutoOpts only requires the strings associated with each option and
> the program as a whole. On the theory that gluing all these strings
> together would be untranslatable and/or sometimes not yield an
> aesthetically pleasing help string, I provided a way of overriding
> the computation of the usage text by providing _as an alternative_
> the entire usage text as a single string. What I am proposing here
> is emitting this long usage a paragraph at a time.

Paragraphs are good, whole sentences are good, a complete option
plus its description (like the above) is good, a full doc string is good,
an entire usage summary is good -- anything that is self-contained
and doesn't break a sentence is good.

> The pieces of the help text are derived from too many sources.
> Gluing together little strings is strongly discouraged for
> translatable text. Therefore, I am suggesting the splitting up
> of the monster string according to a well defined algorithm.
> viz. start a new "paragraph" whenever a non-empty line is
> preceded by two line breaks or a non-empty line starts with
> a few space characters. I *think* that yields something wieldy.

If you split off non-empty lines that start with some whitespace,
you would fragment multiline option descriptions. That is not good.
Some examples of which text fragments I would like to see:

N_("Specifying file encoding methodology:\n")

N_(" -M, --mixed-uuencode decide uuencoding for each file\n")

N_(" -B, --uuencode treat all files as binary\n"
" - an alternate for mixed-uuencode\n")

N_(" -T, --text-files treat all files as text\n"
" - an alternate for mixed-uuencode\n")

N_("Options are specified by doubled hyphens and their name or by a single\n"
"hyphen and the flag character.\n")

N_("If no 'file's are specified, the list of input files is read from a\n"
"standard input. Standard input must not be a terminal.\n")

((By the way, this "- an alternate for mixed-uuencode" phrase is unneeded,
in my opinion. A help text should be concise, and that these options are
alternates is already clear by grouping them under the same heading.))

> I could also split them one string per line. That is likely
> somewhat easier for me, but seems like it would make the
> translation task a bit harder. e.g. there would be no guarantee
> that every line would be unique and the same line of text might
> translate differently in different contexts.

One string per line will be fine, as long as it does not split up sentences.
What would be best: the smallest elements that you can make that are
self-contained.

Regards,

Benno

Bruce Korb

Jan 14, 2013, 7:40:02 PM

to Benno Schulenberg, bug-gn...@gnu.org

On 01/14/13 03:07, Benno Schulenberg wrote:
>
> Hi Bruce,
>
> On Sun, Jan 13, 2013, at 23:07, Bruce Korb wrote:
>> the option line in long usage appears as:
>> -g, --level-of-compression=num pass LEVEL for compression

> Alright, now I understand. Indeed it is not good to offer those
> fragments for translation, but it would be perfectly okay to offer
> each complete option description, for example the above:

> N_(" -B, --uuencode treat all files as binary\n"
> " - an alternate for mixed-uuencode\n")

Exactly my intention, actually. A line starting with more than 8 spaces
would be considered an extension of the previous "hanging indent" line.
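
The splitting rule as discussed so far could be sketched as follows. This is one reading of the rule, not code from AutoGen: `chunk_length` and `indent_of` are hypothetical names, a line holding only spaces is treated as blank, and blank lines are attached to the chunk that precedes them (an assumption, since the thread does not say where they go).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of leading spaces on the line starting at p. */
static size_t indent_of(const char *p)
{
    size_t n = 0;

    while (p[n] == ' ')
        n++;
    return n;
}

/* Length in bytes of the first chunk of text.  A chunk ends just before
 * a non-empty line that either
 *   (a) follows a blank line, or
 *   (b) is indented by 1 to 8 spaces (a new option line).
 * A line indented by more than 8 spaces continues the current chunk. */
size_t chunk_length(const char *text)
{
    const char *line = text;
    int prev_blank = 0;

    assert(text != NULL);
    while (*line != '\0') {
        const char *eol = strchr(line, '\n');
        const char *next = eol ? eol + 1 : line + strlen(line);
        size_t ind = indent_of(line);
        int blank = (line[ind] == '\n' || line[ind] == '\0');

        /* The first line always belongs to the chunk being measured. */
        if (line != text && !blank
            && (prev_blank || (ind >= 1 && ind <= 8)))
            break;
        prev_blank = blank;
        line = next;
    }
    return (size_t)(line - text);
}
```

Calling `chunk_length` repeatedly and advancing by its result walks the whole usage text. A heading, each option together with its hanging-indent continuation, and each running paragraph come out as separate pieces.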

> ((By the way, this "- an alternate for mixed-uuencode" phrase is unneeded,
> in my opinion. A help text should be concise, and that these options are
> alternates is already clear by grouping them under the same heading.))

I agree and disagree both, without a strong commitment to either.
Let's pick up on that later, since that is not strictly speaking
a translation issue, but rather a "what makes sense" issue.

One other potentially translatable piece of text is the long option name.
I know that no other programs translate them, but it turns out to be
trivial to do so. So in C locale:

my-prog --first-option

could be used thus (in Spanish, my second language):

my-prog --opcion-primero

My AutoOpts code doesn't care because it just looks up whatever bytes
appear after the double hyphen and before any '=' character. That
does present problems with languages that embed the 0x3D character in
their representations, but certainly all the Roman alphabet based
languages *could* have their long options localized. I've already
implemented the capability, so my question boils down to:

Is it too far over the top? :)
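
A minimal sketch of that byte-wise lookup is shown below. It is not the AutoOpts code: `struct long_opt` and `find_long_opt` are hypothetical, only exact matches are handled, and accepting both the original and the localized name is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry: the C-locale name plus an optional
 * translated alias. */
struct long_opt {
    const char *name;        /* name in the C locale */
    const char *local_name;  /* translated name, or NULL if none */
};

/* Find the option named by arg ("--name" or "--name=value").
 * Returns the table index, or -1 if arg is not a known long option. */
int find_long_opt(const struct long_opt *tbl, size_t count, const char *arg)
{
    assert(tbl != NULL && arg != NULL);
    if (strncmp(arg, "--", 2) != 0)
        return -1;
    arg += 2;

    /* Compare only the bytes before '=' (or the end of the string). */
    size_t len = strcspn(arg, "=");

    for (size_t i = 0; i < count; i++) {
        const char *cands[2] = { tbl[i].name, tbl[i].local_name };

        for (int k = 0; k < 2; k++) {
            if (cands[k] != NULL && strlen(cands[k]) == len
                && memcmp(cands[k], arg, len) == 0)
                return (int)i;
        }
    }
    return -1;
}
```

Because the comparison is on raw bytes, the name's encoding does not matter to the lookup itself. The only byte with special meaning is '=' (0x3D), which is the limitation mentioned above.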

Thank you - Bruce

Benno Schulenberg

Jan 15, 2013, 6:06:02 AM

to Bruce Korb, bug-gn...@gnu.org

Hello Bruce,

On Tue, Jan 15, 2013, at 1:40, Bruce Korb wrote:
> On 01/14/13 03:07, Benno Schulenberg wrote:
> > Indeed it is not good to offer those
> > fragments for translation, but it would be perfectly okay to offer
> > each complete option description, for example the above:
>
> > N_(" -B, --uuencode treat all files as binary\n"
> > " - an alternate for mixed-uuencode\n")
>
> Exactly my intention, actually. A line starting with more than 8 spaces
> would be considered an extension of the previous "hanging indent" line.

Okay, very good.

> One other potentially translatable piece of text is the long option name.
> I know that no other programs translate them, but it turns out to be
> trivial to do so. So in C locale:
>
> my-prog --first-option
>
> could be used thus (in Spanish, my second language):
>
> my-prog --opcion-primero
>
> my AutoOpts code doesn't care because it just looks up whatever bytes
> appear after the double hyphen and before any '=' character. That
> does present problems with languages that embed the 0x3D character in
> their representations, but certainly all the Roman alphabet based
> languages *could* have their long options localized. I've already
> implemented the capability, so my question boils down to:
>
> Is it too far over the top? :)

Yes, that is completely over the top. :)

The only other command-line program I know of that translates part
of its interface is 'parted'. Only its (interactive) commands
are translatable, not its options nor the keyword arguments to
the commands. But it will recognize both the original commands
and the translated ones.

I wouldn't make the program options translatable, it would only
lead to confusion -- for makers when they start getting bug reports
in Polish or Bulgarian, for users when they must first translate
a command found on the net to their localized option names (and
how to figure out which is which when you can't see them side by
side?). It should even be impossible to translate the option names
-- they should be included in the explanatory help strings, but any
changes or typos in them (in those strings) should not affect their
recognition.

(It would be a nice feature if AutoOpts would detect such typos
in option names (in the help text) and emit an "ohoh".)

Regards,

Benno


Bruce Korb

Jan 15, 2013, 1:10:37 PM

to Benno Schulenberg, bug-gn...@gnu.org

Hi,

On 01/15/13 03:06, Benno Schulenberg wrote:
>> Is it too far over the top? :)
> Yes, that is completely over the top. :)

It was amusing to do 6-8 years ago...

> (It would be a nice feature if AutoOpts would detect such typos
> in option names (in the help text) and emit an "ohoh".)

Doing that inside the library would be too difficult,
*especially* with the long option usage text being split
up into paragraphs. Instead, a program could be built
that examines each untranslated string for "keywords"
(option names) and checks that each keyword is also in the
translated text. That would likely be several days of
effort I don't have, though. :(
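
The core of such a checker might be as small as the sketch below. It covers only the comparison step and assumes that the msgid/msgstr pairs have already been read from a .po file by some other means; `missing_keyword` is a hypothetical name. Passing the option names with their leading "--" is a choice made here to reduce accidental substring matches.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first option name that occurs in msgid but
 * not in msgstr, or -1 when every such name survived translation. */
int missing_keyword(const char *msgid, const char *msgstr,
                    const char *const names[], size_t count)
{
    assert(msgid != NULL && msgstr != NULL);
    for (size_t i = 0; i < count; i++) {
        if (strstr(msgid, names[i]) != NULL
            && strstr(msgstr, names[i]) == NULL)
            return (int)i;
    }
    return -1;
}
```

A plain substring test like this cannot tell a deliberate rewording from a typo, so its output would be a list of strings to review rather than a list of definite errors.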

Cheers - Bruce
