Using UTF8 on the "down low"


Kevin Toppenberg

Mar 29, 2024, 12:11:45 PM
to Hardhats
Hey all, this post is a bit of a cross post.  I have been working on this over in the EVERYTHING MUMPS group, but thought I'd post an overview here in case not everyone goes there.

I wanted to be able to output Unicode characters to my terminal.  YottaDB supports UTF8, but Sam H has recommended AGAINST enabling UTF8, and I'm sure he has good reasons for this.  But just because we don't want the database, strings, etc. in YottaDB to be in multi-byte mode doesn't mean that we can't send the right bytes to the terminal to get the desired things drawn on the screen.

I especially wanted to get better line drawing.  I know that traditional VT100+ terminals have a line-drawing mode.  It switches to an alternate character set in which the codes normally used for lowercase letters (roughly 96-126) are drawn as line segments.  I know that the KIDS package installer uses this.  The problem is that about half the time it works, and half the time all those lines show up as "q"s and other letters, which drives me crazy (why is it not displaying correctly today?  Who knows?).  A real solution will use characters that are unambiguous, not sharing numbers with some other character, and that means Unicode.

Most terminals these days understand UTF8 encoding and the Unicode characters being represented in the UTF8 stream.  On my Mac iTerm, it works natively.  In Putty, I went into Settings > Window > Translation and selected UTF8 as the remote character set.  Then I had to pick a font that contains appropriate glyphs for each Unicode character.  The Courier font, for example, is an old font with glyphs only for 1-255 (I think), but Courier New has many additional Unicode characters, including many line-drawing chars (though not all of them).  I settled on the DejaVu Sans Mono font, which has all of them.

To look at fonts in Windows, I used the Character Map application.  Select the Advanced view checkbox, ensure Character set is Unicode, and then set Group by to Unicode Subrange.  This will pop up a dialog that allows one to pick groups like 'Box Drawing' elements, Math, etc.

Before I started working on this, I was kind of lumping UTF8 and Unicode together.  But I have it a bit clearer in my mind now.  Unicode assigns each character a number (a codepoint), and UTF8 is one way of encoding that number as bytes.  Unicode character $2547 is a line-drawing character named "Down Light and Up Horizontal Heavy".  It takes 14 bits to represent $2547, whereas traditional ASCII is 7 bits (typically extended to a full 8 bits).  So how to send out 14 bits of information in a way that is backwards compatible?  That is where UTF8 comes in.  It splits up the bits over 1-4 bytes, depending on how many bits are needed for the Unicode character number.  For $2547 that works out to 3 bytes: $E2 $95 $87.
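
To make the arithmetic concrete, here is a small sketch in Python (chosen only because it is easy to run anywhere; it is not the M code discussed in this post, and the helper name utf8_length is made up for illustration).  It counts the bits in codepoint $2547 and picks the number of UTF8 bytes from the payload capacities shown in the Wikipedia bit diagram: 7, 11, 16 and 21 bits for 1, 2, 3 and 4 bytes.

```python
def utf8_length(codepoint):
    """Number of bytes UTF-8 needs for one codepoint (hypothetical helper)."""
    if codepoint < 0 or codepoint > 0x10FFFF:
        raise ValueError("not a Unicode codepoint")
    if codepoint <= 0x7F:      # up to 7 bits:  0xxxxxxx
        return 1
    if codepoint <= 0x7FF:     # up to 11 bits: 110xxxxx 10xxxxxx
        return 2
    if codepoint <= 0xFFFF:    # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return 3
    return 4                   # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx


cp = 0x2547
print(cp.bit_length())   # 14
print(utf8_length(cp))   # 3
```

Fourteen bits do not fit in the 11-bit payload of the 2-byte form, so the 3-byte form (4 + 6 + 6 = 16 payload bits) is the smallest one that works.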

So how to write out the Unicode character?  If I try WRITE $CHAR(Number), YottaDB will not output anything if number > 255, unless UTF8 mode is enabled.  And as above, that is not recommended.  But if one uses WRITE *NUM, it will output that byte without any filter etc.  And if one writes out 1-4 bytes that comply with UTF8 encoding, then the terminal will combine those bytes back into a large character number, such as $2547, and then display the corresponding Unicode character.

I wrote code that takes a Unicode Codepoint (character number) and returns a sequence of bytes that can be written out via WRITE *n.

The function is  GETUTF8, which can be seen here, in the TMGMISC routine:  https://github.com/kdtop/TMGLIB/blob/master/TMGMISC.m
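
For anyone who wants to see the general shape of such a function without opening the routine, here is a minimal Python sketch of the same idea.  It is not a translation of GETUTF8; the name utf8_bytes and all of its details are assumptions for illustration.  It uses shifts and masks on the integer rather than string conversions.

```python
def utf8_bytes(codepoint):
    """Return the UTF-8 byte values (each 0-255) for one codepoint.

    Illustrative sketch only; not the GETUTF8 code from TMGMISC.
    """
    if codepoint < 0 or codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("not an encodable Unicode codepoint")
    if codepoint <= 0x7F:
        return [codepoint]            # plain ASCII, one byte, no prefix
    if codepoint <= 0x7FF:
        count, prefix = 2, 0xC0       # 110xxxxx
    elif codepoint <= 0xFFFF:
        count, prefix = 3, 0xE0       # 1110xxxx
    else:
        count, prefix = 4, 0xF0       # 11110xxx
    out = []
    # Peel off 6 bits at a time, lowest first, for the 10xxxxxx bytes.
    for _ in range(count - 1):
        out.append(0x80 | (codepoint & 0x3F))
        codepoint >>= 6
    out.append(prefix | codepoint)    # the remaining bits sit beside the prefix
    out.reverse()
    return out


print([hex(b) for b in utf8_bytes(0x2547)])   # ['0xe2', '0x95', '0x87']
```

Each value in the returned list is a single byte, so it is the kind of number that could be handed to WRITE *n one at a time.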

I then wrote a function called UTF8WRITE in TMGSTUTL which takes Codepoint and gets bytes and writes them via WRITE *n.  It can be seen here:  https://github.com/kdtop/TMGLIB/blob/master/TMGSTUTL.m
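
The same "get the bytes, then write them raw" pattern can be sketched outside of M.  The Python below is only an analogy, not the UTF8WRITE code: the function name utf8_write and the optional stream parameter are assumptions, and it leans on Python's built-in encoder instead of a hand-written one.  The point it illustrates is that the bytes go to the terminal through a byte-level stream, bypassing any text-mode character handling.

```python
import sys


def utf8_write(codepoint, stream=None):
    """Write one codepoint to a binary stream as raw UTF-8 bytes.

    Illustrative sketch only; not the UTF8WRITE code from TMGSTUTL.
    """
    if stream is None:
        stream = sys.stdout.buffer    # byte-level view of standard output
    stream.write(chr(codepoint).encode("utf-8"))
    stream.flush()


# Example use (heavy top edge of a box, then a newline):
#   for cp in (0x250F, 0x2501, 0x2513, 0x0A):
#       utf8_write(cp)
```

Whether the result looks right still depends on the terminal being in UTF8 mode and the font having the glyphs.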

I have next taken all the Codepoints for line drawing characters, which have names like "Down Light and Up Horizontal Heavy" and written code that parses it all into a data structure, so that I will be able to easily find a character that has, for example, a heavy line going DOWN and LEFT, but light going UP and RIGHT.
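
As a rough idea of what such a parse can look like, here is a Python sketch that reads the official names from the standard unicodedata module.  It is not the data structure described above; parse_box_name, build_index and the (up, down, left, right) key layout are all made up for illustration.  It assumes the names are built from clauses joined by AND, each holding direction words plus at most one weight word, and it deliberately skips anything containing other words (DOUBLE, DASH, ARC, DIAGONAL and so on).

```python
import unicodedata

# Each direction word maps to the arm(s) of the character it describes.
DIRECTIONS = {
    "UP": ("up",), "DOWN": ("down",), "LEFT": ("left",), "RIGHT": ("right",),
    "VERTICAL": ("up", "down"), "HORIZONTAL": ("left", "right"),
}
WEIGHTS = {"LIGHT": "light", "HEAVY": "heavy"}
PREFIX = "BOX DRAWINGS "


def parse_box_name(name):
    """Turn 'DOWN LIGHT AND UP HORIZONTAL HEAVY' into {arm: weight}.

    Returns None for names this sketch does not handle.
    """
    arms = {}
    weight = None
    for clause in name.split(" AND "):
        words = clause.split()
        found_weights = [WEIGHTS[w] for w in words if w in WEIGHTS]
        dirs = [w for w in words if w in DIRECTIONS]
        if len(found_weights) + len(dirs) != len(words) or len(found_weights) > 1:
            return None               # unknown word, or two weights in one clause
        if found_weights:
            weight = found_weights[0]  # otherwise inherit from the previous clause
        if weight is None or not dirs:
            return None
        for d in dirs:
            for arm in DIRECTIONS[d]:
                arms[arm] = weight
    return arms


def build_index():
    """Map (up, down, left, right) weights to a codepoint in U+2500..U+257F."""
    index = {}
    for cp in range(0x2500, 0x2580):
        name = unicodedata.name(chr(cp), "")
        if not name.startswith(PREFIX):
            continue
        arms = parse_box_name(name[len(PREFIX):])
        if arms:
            key = tuple(arms.get(a) for a in ("up", "down", "left", "right"))
            index[key] = cp
    return index


index = build_index()
# Heavy going DOWN and LEFT, light going UP and RIGHT:
cp = index[("light", "heavy", "heavy", "light")]
print(hex(cp), unicodedata.name(chr(cp)))
```

An arm that is absent from a character shows up as None in the key, so a plain horizontal light line is (None, None, "light", "light").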

Getting this going was quite the journey!  I got into a long discussion with Google Bard about how to do this.  It was really helpful, except IT SUCKS AT MATH!!  The UTF8 encoding will take 21 bits, for example, and split them over 4 bytes.  It kept miscounting total bits, getting the appropriate prefix wrong, insisting that 4+4+6 = 15 bits, etc.  I would point out its error.  It would apologize and try again, and get it wrong again.  But sometimes just having a starting point is helpful.  The bit diagram at Wikipedia finally helped me understand how to get it working:  https://en.wikipedia.org/wiki/UTF-8  Because I had to do bit splitting etc., I had to write several functions for HEX->BIN, BIN->HEX, HEX->DEC, DEC->HEX, DEC->BIN, BIN->DEC.  I think that YottaDB has some utility functions to do some of this, but I couldn't readily find them, so I recreated them.
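
To show the string-based bit splitting worked through by hand, here is a short Python sketch.  The six helper names are made up for illustration and are not the M functions mentioned above; they simply wrap Python's built-in base conversions.  The example at the bottom walks the 3-byte case for $2547: 16 payload bits split 4 + 6 + 6 behind the prefixes 1110, 10 and 10.

```python
def dec_to_bin(n, width=0):
    """Decimal -> binary string, left-padded with zeros to width."""
    return format(n, "b").zfill(width)

def bin_to_dec(bits):
    return int(bits, 2)

def dec_to_hex(n):
    return format(n, "X")

def hex_to_dec(digits):
    return int(digits, 16)

def hex_to_bin(digits, width=0):
    return dec_to_bin(hex_to_dec(digits), width)

def bin_to_hex(bits):
    return dec_to_hex(bin_to_dec(bits))


bits = hex_to_bin("2547", 16)             # pad to the 16 payload bits
encoded = ["1110" + bits[:4], "10" + bits[4:10], "10" + bits[10:]]
print(bits)                               # 0010010101000111
print([bin_to_hex(b) for b in encoded])   # ['E2', '95', '87']
```

The 4-byte case follows the same pattern with 21 payload bits split 3 + 6 + 6 + 6 behind the prefixes 11110, 10, 10 and 10.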

I'm not going to post my actual M code here, or a demonstration of all the Unicode characters that show up in my terminal.  Yesterday Google thought I was a spammer and kept deleting my posts, and I think posting all that stuff may have been part of the trigger.  But it does work.

I had a lot of help from everyone over in the other EVERYTHING MUMPS message board, so lots of credit to them.

Kevin

tl;dr: If you want to output Unicode characters to the terminal, but don't want to turn on global UTF8 mode in YottaDB, this shows how to do it.


Nancy Anthracite

Mar 29, 2024, 12:49:19 PM
to Hardhats, Kevin Toppenberg
Thanks Kevin! I will file this away for when I need it or someone else needs
it and I can point them at it!

--
Nancy Anthracite