Converting UTF-8 to ASCII

Dave Graham @ STORIS

Apr 30, 2021, 11:23:59 AM
to Pick and MultiValue Databases
We have an interface to a vendor that provides their product catalog via JSON. The UniData JSON parser makes incredibly quick work of traversing and extracting the required data from these 25-30MB files (under a minute or so), so we are really happy with that.

The issue is that the vendor insists on sending the data in UTF-8. Of course, like any other good U2 system, we're ASCII/Text based so that causes everything to fall on its face.

So, does anyone know of a simple way to convert that UTF-8 file to ASCII/TEXT without:
A) telling the vendor to change it - already asked and they won't
B) receiving the data on another server and converting it there - that won't fly
C) writing a string parser in BASIC - parsing 30MB of data character by character is one thing, but we have to do this several times a day, so performance is critical.

geneb

Apr 30, 2021, 11:33:42 AM
to Pick and MultiValue Databases
Dave, there's a program called "konwert" that might be what you're looking
for. Here's the manpage for it:
http://manpages.ubuntu.com/manpages/bionic/man1/konwert.1.html

g.

--
Proud owner of F-15C 80-0007
http://www.f15sim.com - The only one of its kind.
http://www.diy-cockpits.org/coll - Go Collimated or Go Home.
Some people collect things for a hobby. Geeks collect hobbies.

ScarletDME - The red hot Data Management Environment
A Multi-Value database for the masses, not the classes.
http://scarlet.deltasoft.com - Get it _today_!

Pete Schellenbach

Apr 30, 2021, 12:02:53 PM
to Pick and MultiValue Databases
Dave -

I know that UniVerse has an international mode (NLS?), which I assume supports UTF-8/Unicode. Does UniData not support this feature?

Is it possible to trick the JSON parser into thinking the incoming data is in a different encoding, like Latin-1, instead of UTF-8? UTF-8 consists of normal 8-bit bytes, and by design the byte values used for the MV delimiters (SM, AM, VM, SVM) never occur in valid UTF-8, so there is no conflict with MV data. Latin-1 encoding is pretty much interchangeable with binary data, so you should be able to store the data in your MV database without issue. The problem will be interpreting any characters > CHAR(127): accented characters and special symbols will appear in your Latin-1 view of the data as junk. And remember, when converting between encodings, as geneb recommended, data can be lost, since UTF-8 encodes a very large character set while ASCII is limited to 128 characters (256 for an 8-bit extended set).
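
The points in this paragraph can be checked with a few lines of Python (used here only because it runs anywhere; it is not part of any U2 toolchain). This is a sketch under assumptions: the sample text and the `MV_MARKS` name are made up for illustration, and the mark values are the conventional 251 to 255 rather than anything taken from UniData documentation.

```python
# Conventional MultiValue delimiter bytes (assumed): TM=251, SVM=252, VM=253, AM=254, SM=255.
MV_MARKS = {251, 252, 253, 254, 255}

sample = "Café table, 30″ × 60″"   # hypothetical catalog text
utf8_bytes = sample.encode("utf-8")

# Valid UTF-8 never contains bytes 0xF5-0xFF, so no mark byte can appear.
has_mark = any(b in MV_MARKS for b in utf8_bytes)

# Reading the same bytes as Latin-1 never fails (every byte maps to some
# character), but each non-ASCII character becomes two or three junk characters.
latin1_view = utf8_bytes.decode("latin-1")

print(has_mark)
print(repr(latin1_view))
print(len(sample), len(latin1_view))
```

The Latin-1 view is lossless at the byte level (encoding it back to Latin-1 returns the original UTF-8 bytes), which is why storing it is safe even though it looks wrong on screen.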

Pete

jorge jjcsf

Apr 30, 2021, 12:06:41 PM
to mvd...@googlegroups.com
What about using the iconv library?

D3 Linux DataBASIC:
execute '!/usr/bin/iconv -f UTF-8 -t CP850 ':file.in:' > ':file.out capturing result returning error

Change CP850 if you use a different character set.

On D3 Windows it is the same.
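
For comparison, here is a rough Python equivalent of that conversion. It is a sketch, not iconv itself: the function name `convert_file`, the file names, and the choice to replace unmappable characters with `?` are all assumptions made for illustration. As far as I know, GNU iconv by default stops with an error at the first character it cannot convert, so the replacement policy below is a deliberate difference.

```python
import os
import tempfile

def convert_file(src_path, dst_path, target="cp850", chunk_chars=1 << 20):
    """Stream a UTF-8 file into `target`, writing '?' for unmappable characters."""
    with open(src_path, "r", encoding="utf-8", newline="") as src, \
         open(dst_path, "w", encoding=target, errors="replace", newline="") as dst:
        while True:
            chunk = src.read(chunk_chars)   # read in chunks so 30MB files stay cheap
            if not chunk:
                break
            dst.write(chunk)

# Tiny self-contained demonstration using temporary files (hypothetical names).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "catalog.json")
dst_path = os.path.join(workdir, "catalog.cp850.json")
with open(src_path, "w", encoding="utf-8", newline="") as f:
    f.write('{"name": "Sofá", "note": "\u3041"}')

convert_file(src_path, dst_path)
with open(dst_path, "rb") as f:
    converted = f.read()
print(converted)
```

The accented letter survives because CP850 contains it; the Japanese character does not exist in CP850 and comes out as a question mark.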

Jorge Jareño
687817220
jjc...@gmail.com





Dave Graham @ STORIS

Apr 30, 2021, 1:25:36 PM
to Pick and MultiValue Databases
Thanks for the great suggestions! I see I forgot a few additional restrictions though...
-This has to work on AIX, Windows and Linux at the same time using the same tool (the interface/command line structure can be different but the underlying tool cannot)
-This has to work on UniData. D3, UniVerse and other solutions are great, thought-provoking ideas, but if they don't work on UniData they ultimately will not fly.

GeneB - your idea is likely the best of breed, but requiring an external tool is something we try to shy away from because of support requirements. We would need to install this on 400+ systems and then install it on every new system we sell. Not to mention keeping change logs, PCI compliance updates and so on. That's a really big lift. If it is the only way, we'll do it, but the costs get to be really, REALLY high (think thousands of man hours per year) to deal with this type of solution.

Pete - Thanks for your input. Always been a fan of your knowledge (Accu-Plot anyone?). I'm working with Rocket to try to find a way to use their NLS features to do some kind of code page translation within the transmit/receive functions or the UDO parsing itself. We'll see if they can do something magical.

Jorge - That would be the simplest and easiest solution, wouldn't it? Regrettably, we are on a UniData platform and that library doesn't exist in our world.

Keep those suggestions coming folks!  

DaveG

Gerd Förthmann

Apr 30, 2021, 4:09:48 PM
to mvd...@googlegroups.com

I guess as long as your data only contains numbers and letters of the English alphabet, you don't need to convert UTF-8 to ASCII at all, because those characters are encoded as identical single bytes.
If it contains anything else, you will have the same problem all "good old" ASCII-based systems have, since you simply cannot convert all characters. The German ß in extended ASCII, for instance, is a value mark if I remember correctly.
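
Both halves of this can be checked quickly. The sketch below is Python, with an invented sample string, and it assumes the conventional mark values (251 to 255) and ISO-8859-1 as the "extended ASCII" in question. Under those assumptions ß sits at 223 and does not collide, while ü, ý, þ and friends do; other 8-bit code pages place these characters elsewhere.

```python
# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
ascii_text = "SKU 12345: Oak Dining Table"   # hypothetical catalog value
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Conventional MultiValue mark bytes (assumed values).
MARK_NAMES = {251: "TM", 252: "SVM", 253: "VM", 254: "AM", 255: "SM"}

# Which ISO-8859-1 characters share a byte value with a mark?
collisions = {bytes([code]).decode("latin-1"): name
              for code, name in MARK_NAMES.items()}

eszett_code = "ß".encode("latin-1")[0]

print(same_bytes)
print(collisions)
print(eszett_code)
```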

Well and when it comes to Arabic, Chinese and other languages that work just fine in UTF-8 how are you going to convert those to ASCII?

I don't want to hijack another thread but holding data in ASCII is another one of the reasons Pick has failed to catch on. If you only do business in English speaking countries you may get away with ASCII otherwise forget it!

And don't get me started on the claims of UTF-8 compatible Pick databases. D3, for instance, may store the data in UTF-8 format and display it correctly too, but BASIC still treats strings as if they were all 8-bit ASCII, so two-byte UTF-8 characters count as 2, not 1 as they should. Unless Rocket finally fixed that.
The conversion of the system from ASCII to UTF-8 wasn't a big deal really but then we actually ended up using a subroutine swapping special characters of nearly every European alphabet back and forth to determine the actual length of a string in D3 running in UTF-8 mode.
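
The length problem is easy to reproduce outside D3. The following Python sketch is not the subroutine described above; it shows one common alternative, which is to count only the bytes that start a character and skip UTF-8 continuation bytes. The function name `utf8_char_count` is made up for illustration.

```python
def utf8_char_count(raw: bytes) -> int:
    """Count characters in UTF-8 data by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for b in raw if (b & 0xC0) != 0x80)

word = "Straße"
raw = word.encode("utf-8")

print(len(raw))              # byte length, what a byte-oriented length function sees
print(utf8_char_count(raw))  # number of code points
```

This counts code points, so a letter followed by a separate combining accent would still count as two.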

My advice nevertheless: get rid of ASCII and convert your Pick systems to UTF-8 instead of trying to do it the other way round.

Kevin King

Apr 30, 2021, 4:52:46 PM
to mvd...@googlegroups.com
Dave, you're looking to take UTF-8 and convert it to text?  If that's the goal, some characters simply won't translate.  This is where the iconv solution actually makes some sense, but I also understand your hesitancy to add external tools and platform dependencies.

So if you ran into a \u3041, what would you want to do with it?
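
One possible answer, sketched in Python: strip accents where a plain base letter exists and substitute a placeholder for everything else. The function name `to_ascii` and the `?` placeholder are assumptions for illustration; whether silent substitution is acceptable depends on which fields the characters appear in.

```python
import unicodedata

def to_ascii(text: str, placeholder: str = "?") -> str:
    """Strip accents where a base letter exists; substitute everything else."""
    decomposed = unicodedata.normalize("NFKD", text)
    out = []
    for ch in decomposed:
        if ord(ch) < 128:
            out.append(ch)            # already ASCII
        elif unicodedata.combining(ch):
            continue                  # drop the accent, keep the base letter
        else:
            out.append(placeholder)   # no ASCII equivalent, e.g. U+3041
    return "".join(out)

print(to_ascii("Café \u3041"))
```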



jorge jjcsf

Apr 30, 2021, 5:27:58 PM
to mvd...@googlegroups.com
I apologize. Maybe I did not explain myself clearly; I'm not an expert.

The iconv command is not part of the MV database (D3, UniData, UniVerse, etc.).
It is a library of the operating system where the MV database is installed, and all operating systems have it.

You just have to run it.

In D3 I do it with:
execute ! Command
In UniVerse, UniData, etc. it will be very similar, I guess.

MarZ

May 2, 2021, 8:16:15 PM
to Pick and MultiValue Databases
Hey Dave, how did you go...
My choice would be Jorge's suggestion.
I believe "libiconv" is present on all *nixs.
There is also an implementation for msWin.

Brian Speirs

May 2, 2021, 9:45:55 PM
to Pick and MultiValue Databases
Converting data is all well and good ... if it is possible. As Kevin notes above, some (many) characters that can be present in Unicode simply won't be available in a given code page. And even if every character can be translated to one that exists in some code page, you need a single code page that contains ALL of the characters that are being sent.

The number of characters available in Unicode is around 144,000 (https://en.wikipedia.org/wiki/Unicode), while the number of characters available within a code page is around 219 (256 - 32 control characters - 5 mark characters).

So, that is really the first issue that you need to address. What are the characters being sent, and can you find a (single) code page that contains all of those characters? What happens if the range of characters being sent increases in future?
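
That first question can be answered mechanically. The Python sketch below collects the non-ASCII characters in a piece of text and reports which candidate code pages can encode all of them. The function names, the sample text, and the list of candidates are assumptions for illustration; a real check would read the vendor file.

```python
def non_ascii_chars(text: str) -> set:
    """Return the distinct characters outside the 7-bit ASCII range."""
    return {ch for ch in text if ord(ch) > 127}

def covering_code_pages(chars, candidates=("latin-1", "cp1252", "cp850", "cp437")):
    """Return the candidate code pages that can encode every character in `chars`."""
    ok = []
    for name in candidates:
        try:
            "".join(chars).encode(name)
        except UnicodeEncodeError:
            continue                  # at least one character is missing here
        ok.append(name)
    return ok

sample = 'Sofá, price €499'           # hypothetical catalog text
found = non_ascii_chars(sample)
print(sorted(found))
print(covering_code_pages(found))
```

An empty result means no single candidate covers the data, which is the situation Brian is warning about.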

You may be able to get away with conversion for now, but is that going to be the long-term solution?

Cheers,

Brian

Dave Graham @ STORIS

May 3, 2021, 8:24:52 AM
to Pick and MultiValue Databases
Thanks to everyone who replied. My sincere apologies to Jorge - I had no idea that the iconv library was not a component of D3! Based on the name I made an assumption and was very wrong. I need to go look into this more as it may very well be the perfect solution.

To all of you who have expressed concern that converting UTF-8 to ASCII is a temporary solution at best: I agree. I also agree that this is one of the things that has held the PICK/U2 system back from being what it could have been. But that's water under the bridge now.

My firm would have to do rather a lot of re-coding to convert our database to completely accept UTF-8, I think. But if we want to get another 35 years out of the database, it may be a requirement to do so. Gotta think about that..

The good news is that, as a stopgap solution, converting the data is reasonable. For the time being. The data is in U.S. English with a few special characters. Those special characters are in fields we don't care about so again, for the time being, we're safe.

It's interesting to me that we are given solutions by the environment providers and then we complain about our unwillingness to implement them. We could have made this a non-issue when we implemented Arabic and Chinese but we chose not to.

DaveG

frosty

May 4, 2021, 7:36:10 PM
to Pick and MultiValue Databases
Jorge, this is excellent -- I never knew this iconv command was baked into linux.
Even works in my macOS.
Thank you!
¡Gracias!

jorge jjcsf

May 5, 2021, 4:00:26 AM
to mvd...@googlegroups.com
Thanks frosty #:-))

I'm happy that my small contribution is useful to you.

The iconv command was included in every Linux version I have used.

For Windows, download the library.

iconv -l  -> prints the list of all supported character set encodings

Jorge Jareño
687817220
jjc...@gmail.com



