Hey all, this post is a bit of a cross post. I have been working over on EVERYTHING MUMPS group, but thought I'd post an overview here in case not everyone goes there.
I wanted to be able to output Unicode characters to my terminal. YottaDB supports UTF8, but Sam H has recommended AGAINST enabling UTF8, and I'm sure he has good reasons for this. But just because we don't want the database, strings etc in YottaDB to be in multi-byte mode, doesn't mean that we can't send signals to the terminal to be able to get the desired things drawn on the terminal screen.
I especially wanted to get better line drawing. I know that traditional VT100+ has a line-drawing mode. It swaps out the character set so that some ordinary character codes (the lowercase-letter range, as far as I can tell) are drawn as lines. I know that the KIDS package installer uses this. The problem is that about half the time it works, and half the time all those lines look like "Q"s etc., which drives me crazy (why is it not displaying correctly today? Who knows?). A real solution will use characters that are unambiguous, not sharing numbers with some other character, and that means Unicode.
Most terminals these days understand UTF8 encoding and the Unicode characters being represented in the UTF8 stream. On my Mac iTerm, it works natively. In PuTTY, I went into Settings > Window > Translation and selected UTF8 as the remote character set. Then I had to pick a font that contains appropriate glyphs for each Unicode character. The Courier font, for example, is an old font with glyphs for only characters 1-255 (I think), but Courier New has many additional Unicode characters, including many line-drawing chars (though not all of them). I settled on the DejaVu Sans Mono font, which has all of them.
To look at fonts in Windows, I used the Character Map application. Select the Advanced view checkbox, make sure Character set is Unicode, and then set Group by to Unicode Subrange. This pops up a dialog that lets you pick groups like 'Box Drawing', Math, etc.
Before I started working on this, I was kind of lumping UTF8 and Unicode together. But I have it a bit clearer in my mind now. Unicode character $2547 is a line-drawing character named "Down Light and Up Horizontal Heavy". It takes 14 bits to represent $2547 (binary 10 0101 0100 0111), whereas traditional ASCII is 7 bits (typically extended to a full 8 bits). So how to send out 14 bits of information in a way that is backwards compatible? That is where UTF8 comes in. It splits up the bits over 1-4 bytes, depending on how many actual bits are needed for the Unicode character number.
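To make the "1-4 bytes depending on the bits needed" idea concrete, here is a minimal sketch in Python (not M, and not the code from this project). The function name utf8_length is made up for illustration. It only counts the significant bits in a code point and picks the matching UTF8 tier.

```python
def utf8_length(code_point):
    """Return how many bytes UTF-8 needs for a code point (hypothetical helper)."""
    bits = code_point.bit_length()  # number of significant bits in the value
    if bits <= 7:
        return 1   # 0xxxxxxx
    if bits <= 11:
        return 2   # 110xxxxx 10xxxxxx
    if bits <= 16:
        return 3   # 1110xxxx 10xxxxxx 10xxxxxx
    if bits <= 21:
        return 4   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    raise ValueError("code point is too large for UTF-8")
```

A code point such as 0x2547 lands in the 3-byte tier, while plain ASCII stays at 1 byte, which is what keeps UTF8 backwards compatible.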
So how to write out the Unicode character? If I try WRITE $CHAR(Number), YottaDB will not output anything if number > 255, unless UTF8 mode is enabled. And as above, that is not recommended. But if one uses WRITE *NUM, it will output that byte without any filter etc. And if one writes out 1-4 bytes that comply with UTF8 encoding, then the terminal will combine those bytes back into a large character number, such as $2547, and then display the corresponding Unicode character.
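The same "send raw bytes and let the terminal reassemble them" idea can be sketched outside of M. The Python below is only an illustration with a made-up function name (write_raw). It assumes a UTF8-capable terminal and uses the three bytes E2 95 87, which are the UTF8 encoding of $2547.

```python
import sys

def write_raw(byte_values, stream=None):
    """Write the given byte values unchanged, with no character-set translation.

    `stream` is any binary stream; it defaults to the terminal's raw stdout.
    """
    if stream is None:
        stream = sys.stdout.buffer
    stream.write(bytes(byte_values))
    stream.flush()
```

Calling write_raw([0xE2, 0x95, 0x87]) should draw the box character on a UTF8 terminal. In M, the matching call would presumably be along the lines of WRITE *226,*149,*135 (the same three bytes in decimal).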
I wrote code that takes a Unicode code point (character number) and returns a sequence of bytes that can be written out via WRITE *n.
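For anyone who wants to see the shape of such a function, here is a rough sketch in Python. It is not the M code described here, just an illustration of the standard UTF8 bit layout, and the name utf8_bytes is invented. It peels off 6 bits at a time for the continuation bytes and puts what is left into the lead byte.

```python
def utf8_bytes(code_point):
    """Return the UTF-8 encoding of a code point as a list of byte values."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a valid Unicode code point")
    if code_point < 0x80:
        return [code_point]            # plain ASCII: one byte, unchanged
    if code_point < 0x800:
        prefix, extra = 0xC0, 1        # 110xxxxx + 1 continuation byte
    elif code_point < 0x10000:
        prefix, extra = 0xE0, 2        # 1110xxxx + 2 continuation bytes
    else:
        prefix, extra = 0xF0, 3        # 11110xxx + 3 continuation bytes
    tail = []
    for _ in range(extra):
        # Each continuation byte is 10xxxxxx: the low 6 bits of what remains.
        tail.insert(0, 0x80 | (code_point & 0x3F))
        code_point >>= 6
    # Whatever bits are left go into the lead byte after its prefix.
    return [prefix | code_point] + tail
```

For $2547 this gives the byte values 0xE2, 0x95, 0x87 (226, 149, 135 in decimal), which are the numbers one would hand to WRITE *n one at a time.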
I have next taken all the code points for line-drawing characters, which have names like "Down Light and Up Horizontal Heavy", and written code that parses it all into a data structure, so that I will be able to easily find a character that has, for example, a heavy line going DOWN and LEFT, but light going UP and RIGHT.
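As a sketch of what that parsing might look like, here is a Python version (again, not the actual M code, and every name in it is invented for illustration). It assumes the names come from Python's unicodedata module and only handles the plain LIGHT/HEAVY characters. Names with other words in them (double lines, dashes, arcs, diagonals) are skipped.

```python
import unicodedata

# Which arms each direction word stands for.
ARMS = {"UP": ["up"], "DOWN": ["down"], "LEFT": ["left"], "RIGHT": ["right"],
        "VERTICAL": ["up", "down"], "HORIZONTAL": ["left", "right"]}
WEIGHTS = ("LIGHT", "HEAVY")

def parse_box_name(name):
    """Turn e.g. 'Down Light and Up Horizontal Heavy' into {arm: weight}."""
    words = [w for w in name.upper().split()
             if w not in ("BOX", "DRAWINGS", "AND")]
    if not words or any(w not in ARMS and w not in WEIGHTS for w in words):
        return None                      # a name this sketch does not handle
    arms, pending, weight = {}, [], None
    weight_first = words[0] in WEIGHTS   # e.g. 'LIGHT DOWN AND RIGHT'
    for w in words:
        if w in WEIGHTS:
            weight = w.lower()
            for arm in pending:          # weight closes the preceding arms
                arms[arm] = weight
            pending = []
        elif weight_first:
            for arm in ARMS[w]:          # weight applies to the arms after it
                arms[arm] = weight
        else:
            pending.extend(ARMS[w])
    return None if pending or not arms else arms

def build_index():
    """Map each parsed arm/weight combination to its code point."""
    index = {}
    for cp in range(0x2500, 0x2580):     # the Box Drawing block
        arms = parse_box_name(unicodedata.name(chr(cp), ""))
        if arms:
            index[frozenset(arms.items())] = cp
    return index

def find_char(index, **arms):
    """Look up a code point by arms, e.g. down='heavy', left='heavy'."""
    return index.get(frozenset(arms.items()))
```

With the index built once, find_char(index, down="heavy", left="heavy", up="light", right="light") returns the code point for that combination, or None if no such character exists.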
Getting this going was quite the journey! I got into a long discussion with Google Bard about how to do this. It was really helpful, except IT SUCKS AT MATH!! The UTF8 encoding will take 21 bits, for example, and split them over 4 bytes. Bard kept miscounting total bits, getting the appropriate prefix wrong, insisting that 4+4+6 = 15 bits, etc. I would point out its error. It would apologize and try again, and get it wrong again. But sometimes just having a starting point is helpful. This bit diagram at Wikipedia finally helped me understand how to get it working:
https://en.wikipedia.org/wiki/UTF-8
Because I had to do bit splitting etc., I had to write several functions for HEX->BIN, BIN->HEX, HEX->DEC, DEC->HEX, DEC->BIN, BIN->DEC. I think that YottaDB has some utility functions to do some of this, but I couldn't readily find them, so I recreated them.
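For comparison, here is roughly what that set of six conversions looks like in Python, where the built-in int() and format() do the heavy lifting. This is only an illustrative sketch with made-up function names, not a translation of the M routines.

```python
def hex_to_dec(h):
    return int(h, 16)

def dec_to_hex(n):
    return format(n, "X")                 # uppercase hex, no prefix

def dec_to_bin(n, width=0):
    return format(n, "b").zfill(width)    # optionally left-pad with zeros

def bin_to_dec(b):
    return int(b, 2)

def hex_to_bin(h, width=0):
    return dec_to_bin(hex_to_dec(h), width)

def bin_to_hex(b):
    return dec_to_hex(bin_to_dec(b))
```

The optional width argument is there because bit splitting is easier when the binary string is padded to a fixed length, for example 16 digits for a 3-byte UTF8 sequence.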
I'm not going to post samples of code, or a demonstration of all the Unicode characters that show up in my terminal. Yesterday Google thought I was a spammer and kept deleting my posts, and I think posting all that stuff may have been part of the trigger. But it does work.
I had a lot of help from everyone over in the other EVERYTHING MUMPS message board, so lots of credit to them.
Kevin
tl;dr: If you want to output Unicode characters to the terminal, but don't want to turn on global UTF8 mode in YottaDB, this shows how to do it.