Printing UTF-8 characters

Farhan Khan

Feb 1, 2018, 2:00:38 AM

Hi everyone,

Is there a standard way to render historically non-printable UTF-8
characters that will work across all terminals? I am trying to modify a
standard FreeBSD utility that may occasionally work with characters in
other languages. On some terminals, specifically FreeBSD running in
VirtualBox, I see question marks rather than the expected character. I
wonder whether this is the proper way to display such non-printable
characters or not.

I am not the most versed in encoding standards, so pardon any mistakes I
might have made.

Thanks,
--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE
_______________________________________________
freebsd...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-hackers
To unsubscribe, send any mail to "freebsd-hacke...@freebsd.org"

Matthias Apitz

Feb 1, 2018, 6:10:47 AM

On Thursday, February 01, 2018 at 01:15:34 AM -0500, Farhan Khan wrote:

> Hi everyone,
>
> Is there a standard way to render historically non-printable UTF-8
> characters that will work across all terminals? I am trying to modify a
> standard FreeBSD utility that may occasionally work with characters in
> other languages. On some terminals, specifically FreeBSD running in
> VirtualBox, I see question-marks rather than the expected character. I
> wonder if this is the proper way to display such non-printable characters
> or no?

Not sure what you mean by 'historically non-printable UTF-8'. UTF-8 is
an encoding form (one of several) that represents Unicode code points as
bytes. If you want to "print" them to paper or PDF, there are ways to
write them with PostScript and, with the correct font support, bring them
into human-readable form. If you want to "display" these UTF-8 bytes, you
need terminal software with UTF-8 support, for example the port
x11/rxvt-unicode, and fonts for the code point ranges you want to display.
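
To make the relationship between code points and bytes concrete, here is
a minimal C sketch of a UTF-8 encoder. The name utf8_encode is
hypothetical (it is not a libc function and does not come from this
thread); the bit layout it follows is the one UTF-8 defines.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: encode one Unicode code point as UTF-8.
 * Stores 1 to 4 bytes in out and returns how many were written,
 * or 0 if cp is a surrogate or lies beyond U+10FFFF.
 */
size_t
utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp <= 0x7F) {                   /* 0xxxxxxx (ASCII) */
        out[0] = (unsigned char)cp;
        return (1);
    }
    if (cp <= 0x7FF) {                  /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return (2);
    }
    if (cp >= 0xD800 && cp <= 0xDFFF)   /* surrogates are not encodable */
        return (0);
    if (cp <= 0xFFFF) {                 /* 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return (3);
    }
    if (cp <= 0x10FFFF) {               /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return (4);
    }
    return (0);
}
```

With this layout U+00E9 (é) becomes C3 A9, U+2709 (✉) becomes E2 9C 89,
and U+1F4F1 (📱) becomes F0 9F 93 B1, so the number of bytes depends on
the value of the code point.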

Btw: Can you display my signature line correctly? There is a UTF-8 encoded
code point for a mobile telephone :-)

matthias
--
Matthias Apitz, ✉ gu...@unixarea.de, ⌂ http://www.unixarea.de/ 📱 +49-176-38902045
Public GnuPG key: http://www.unixarea.de/key.pub

Farhan Khan

Feb 1, 2018, 10:47:37 AM

Sorry, that was a poorly phrased question on my part. Let me try again.
I am trying to make text align in columns in a terminal. My
understanding is that characters above 0x7E are 3 bytes in length. A
modern terminal will render that as either a single question-mark or
the character itself, making terminal column alignment easy. But how
would an older terminal display a 3-byte character? I am worried that
would render as 3 question marks and throw off column alignment. If
so, is there a proper way to perform alignment for both newer and
older terminals?

I am reading this email in Gmail, so those characters render properly
for me :)

Thanks,

--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

Conrad Meyer

Feb 1, 2018, 3:22:20 PM

You've said a number of things about UTF-8 that appear to be mistaken.
Start here: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

Matthias Apitz

Feb 1, 2018, 3:28:15 PM

On Thursday, 1 February 2018 21:18:10 CET, Conrad Meyer <c...@freebsd.org>
wrote:

> You've said a number of things about UTF-8 that appear to be mistaken.
> Start here:

> ...

You are top posting, which is messing up things, and you are not very clear
about who said something wrong.

matthias

--
Sent from my Ubuntu phone
http://www.unixarea.de/

Bakul Shah

Feb 1, 2018, 11:03:11 PM

On Thu, 01 Feb 2018 10:42:36 -0500 Farhan Khan <kha...@gmail.com> wrote:
> Sorry, that was a poorly phrased question on my part. Let me try again.
> I am trying to make text align in columns in a terminal. My
> understanding is that characters above 0x7E are 3 bytes in length. A
> modern terminal will render that as either a single question-mark or
> the character itself, making terminal column alignment easy. But how
> would an older terminal display a 3-byte character? I am worried that
> would render as 3 question marks and throw off column alignment. If
> so, is there a proper way to perform alignment for both newer and
> older terminals?

UTF-8 can use up to 4 bytes to encode a Unicode code point,
depending on the script.
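
The length of each sequence can be read off its first byte. A minimal
sketch in C, where utf8_seq_len is a hypothetical name rather than an
existing libc function:

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of bytes in the UTF-8 sequence that
 * starts with lead byte b, or 0 if b cannot start a sequence.
 */
size_t
utf8_seq_len(unsigned char b)
{
    if (b < 0x80)
        return (1);     /* 0xxxxxxx: ASCII */
    if (b >= 0xC2 && b <= 0xDF)
        return (2);     /* U+0080 to U+07FF */
    if (b >= 0xE0 && b <= 0xEF)
        return (3);     /* U+0800 to U+FFFF */
    if (b >= 0xF0 && b <= 0xF4)
        return (4);     /* U+10000 to U+10FFFF */
    /* 0x80-0xBF are continuation bytes; 0xC0, 0xC1, 0xF5-0xFF are invalid */
    return (0);
}
```

So a character above 0x7F is not always 3 bytes: Latin letters with
diacritics typically take 2, most CJK and Indic characters take 3, and
emoji take 4.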

For what you want, you can use OpenOffice-like programs that
understand Unicode and can do complex text layout. Normal
terminal programs, which typically use monospace (fixed-width) fonts,
are simply not capable of what you want. The assumption that
one char means one rectangular cell on the screen is too
deeply woven into them. Particularly for Indic languages this
just doesn't work: you may have N Unicode code points, each of
which requires 3 bytes, that all together map to a single glyph.
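
For the simpler cases, POSIX provides wcswidth(3), which reports how many
columns a wide-character string is expected to occupy. It does not solve
the complex-layout problem described above, and what it reports can still
differ from what a particular terminal draws. The sketch below assumes
the caller has already run setlocale(LC_CTYPE, "") in a UTF-8 locale;
display_width is a hypothetical name.

```c
/* Must precede the includes; some libcs (e.g. glibc) only declare
 * wcswidth() when an X/Open feature macro is defined. */
#define _XOPEN_SOURCE 700
#include <stdlib.h>
#include <wchar.h>

/*
 * Hypothetical helper: columns that the multibyte string s is expected
 * to occupy under the current LC_CTYPE locale. Returns -1 if s is not
 * valid in that locale, contains a non-printable character, or memory
 * cannot be allocated.
 */
int
display_width(const char *s)
{
    size_t n = mbstowcs(NULL, s, 0);    /* wide chars needed, sans NUL */
    if (n == (size_t)-1)
        return (-1);

    wchar_t *w = malloc((n + 1) * sizeof(*w));
    if (w == NULL)
        return (-1);

    mbstowcs(w, s, n + 1);              /* convert, including the NUL */
    int cols = wcswidth(w, n);          /* -1 if any char is non-printable */
    free(w);
    return (cols);
}
```

Because printf's %s field width counts bytes rather than columns, a
caller aligning text would typically print the string and then emit
(field width - display_width(s)) spaces itself.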

Farhan Khan

Jun 19, 2018, 9:40:49 PM

Hi all,

To follow up on my earlier, poorly asked question from a few months
back: how do I determine whether the terminal is capable of printing
UTF-8 encoded strings and/or Unicode in general?
The obvious answer is to check the LANG variable via getenv(3), but
what if you are using "en_US.UTF-8" vs "en_GB.UTF-8"? Should I just
check for the string "UTF-8" in the LANG variable?

My concern is printing characters above 0x7F on terminals/encodings
that are not capable of displaying them, resulting in unusual
behavior.
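
One possible fallback for that case, sketched here as an assumption
rather than an established FreeBSD convention, is to emit a single '?'
per non-ASCII character instead of one per byte, which keeps column
counts stable. The helper name ascii_fallback is hypothetical, and the
input is assumed to be valid UTF-8.

```c
#include <stddef.h>

/*
 * Hypothetical helper: copy the UTF-8 string in to out, replacing each
 * non-ASCII character with a single '?'. Continuation bytes (10xxxxxx)
 * are skipped, so a 3-byte character yields one '?' rather than three.
 * out receives at most outsz bytes including the terminating NUL.
 * Returns the number of bytes written, not counting the NUL.
 */
size_t
ascii_fallback(const char *in, char *out, size_t outsz)
{
    size_t n = 0;

    if (outsz == 0)
        return (0);
    for (; *in != '\0' && n + 1 < outsz; in++) {
        unsigned char c = (unsigned char)*in;

        if (c < 0x80)
            out[n++] = (char)c;     /* ASCII passes through */
        else if ((c & 0xC0) != 0x80)
            out[n++] = '?';         /* lead byte: one placeholder */
        /* otherwise a continuation byte: emit nothing */
    }
    out[n] = '\0';
    return (n);
}
```

This assumes one column per code point, which is wrong for combining
marks and double-width characters, so it is only a rough approximation.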

Thanks,

--
Farhan Khan
PGP Fingerprint: B28D 2726 E2BC A97E 3854 5ABE 9A9F 00BC D525 16EE

Conrad Meyer

Jun 19, 2018, 10:50:49 PM

You want LC_CTYPE.
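
For context: under POSIX, the character-handling locale comes from
LC_ALL if it is set and non-empty, otherwise LC_CTYPE, otherwise LANG,
so looking at LANG alone can give the wrong answer. The sketch below only
illustrates that precedence; ctype_locale_name is a hypothetical name,
and real code would normally let setlocale(LC_CTYPE, "") do the lookup.

```c
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper: name of the locale that governs character
 * handling, using the POSIX precedence LC_ALL, LC_CTYPE, LANG.
 * Empty values count as unset. Falls back to "C" when none is set
 * (the actual default is implementation-defined).
 */
const char *
ctype_locale_name(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };

    for (size_t i = 0; i < sizeof(vars) / sizeof(vars[0]); i++) {
        const char *v = getenv(vars[i]);

        if (v != NULL && *v != '\0')
            return (v);
    }
    return ("C");
}
```

With LANG=C and LC_CTYPE=en_GB.UTF-8, for example, this returns
"en_GB.UTF-8" even though LANG does not mention UTF-8.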

Farhan Khan

Jun 20, 2018, 12:25:03 AM

Thanks Conrad!

I looked up exactly how locale(1) works. Similar to what you suggested,
locale(1) essentially does this:

setlocale(LC_ALL, "");            /* adopt the locale from the environment */
charset = nl_langinfo(CODESET);   /* codeset name, e.g. "UTF-8" */

The result ends up in 'charset'.
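
Wrapped into a function, that approach might look like the sketch below.
The name locale_is_utf8 is hypothetical, and the exact-match comparison
against "UTF-8" is an assumption about how the platform spells its
codeset name. It sets only LC_CTYPE, which is the category that
determines the codeset.

```c
#include <langinfo.h>
#include <locale.h>
#include <string.h>

/*
 * Hypothetical helper: switch LC_CTYPE to the given locale ("" means
 * "take it from the environment") and report whether its codeset is
 * UTF-8. Returns 1 for UTF-8, 0 otherwise, including when the locale
 * is unknown to the system.
 */
int
locale_is_utf8(const char *locale)
{
    if (setlocale(LC_CTYPE, locale) == NULL)
        return (0);     /* unsupported locale: assume not UTF-8 */

    /* Assumes the platform reports the codeset exactly as "UTF-8". */
    return (strcmp(nl_langinfo(CODESET), "UTF-8") == 0);
}
```

A utility would call locale_is_utf8("") once at startup and choose
between UTF-8 output and an ASCII fallback based on the result.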

Thanks!