libicu ucsdet_detect in 9.2?

Joachim Tuchel

Nov 8, 2019, 4:11:13 AM
to VA Smalltalk
Dear VAST team,

In Seth's ESUG talk about VAST 9.2 there was a slide mentioning the new Application EsCodePageUtilities and talking about libicu integration with the VM.

Does this mean we'll get some of the functionality of libicu wrapped in Smalltalk? I'm especially interested in ucsdet_detect or ucsdet_detectAll. I can't find any reference to these names in the ECAP 9.2 build.

How hard would it be to wrap these on my own, and will these be available to a VAST on both Linux and Windows?

Any hints?

Joachim

Mariano Martinez Peck

Nov 8, 2019, 8:06:30 AM
to VA Smalltalk
On Fri, Nov 8, 2019 at 6:11 AM Joachim Tuchel <jtu...@objektfabrik.de> wrote:
Dear VAST team,

In Seth's ESUG talk about VAST 9.2 there was a slide mentioning the new Application EsCodePageUtilities and talking about libicu integration with the VM.


Hi Joachim,

The EsCodePageUtilities is ready for 9.2 and you can find it in ECAP. 
 
Does this mean we'll get some of the functionality of libicu wrapped in Smalltalk? I'm especially interested in ucsdet_detect or ucsdet_detectAll. I can't find any reference to these names in the ECAP 9.2 build.


Yes, the idea is to wrap ICU from Smalltalk and that's probably one of the first things we will start with in 9.3. But not for 9.2. 
 
How hard would it be to wrap these on my own, and will these be available to a VAST on both Linux and Windows?


If I were you I would just wait for 9.3. We can do an ECAP as soon as we have ICU wrapped. 
 
Any hints?


Not me. Maybe Seth has something to add.

Best,

--
Mariano Martinez Peck
Software Engineer, Instantiations Inc.

Joachim Tuchel

Nov 8, 2019, 9:19:39 AM
to VA Smalltalk
Mariano,

thanks for this info. It's great to see Unicode making its way into VAST.

I am not sure I can wait for 9.3, though, so I might have to implement some poor man's UTF-8 detection for at least some letters... The scope of my use case is relatively narrow...


Joachim

Mariano Martinez Peck

Nov 8, 2019, 1:30:06 PM
to VA Smalltalk
On Fri, Nov 8, 2019 at 11:19 AM Joachim Tuchel <jtu...@objektfabrik.de> wrote:
Mariano,

thanks for this info. It's great to see Unicode making its way into VAST.

I am not sure I can wait for 9.3, though, so I might have to implement some poor man's UTF-8 detection for at least some letters... The scope of my use case is relatively narrow...


If it is UTF-8 ONLY detection AND the stream has a UTF-8 BOM, then it's very easy to implement. You just need to read the first 3 bytes and check whether they are the UTF-8 BOM (EF BB BF).
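
A minimal sketch of that BOM check, written in Python rather than Smalltalk purely for illustration. The function name `has_utf8_bom` is hypothetical; the three BOM bytes are the standard UTF-8 signature, which Python exposes as `codecs.BOM_UTF8`.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 byte order mark."""
    # codecs.BOM_UTF8 is b'\xef\xbb\xbf'; comparing a 3-byte slice also
    # handles inputs shorter than 3 bytes without raising an error.
    return data[:3] == codecs.BOM_UTF8


if __name__ == "__main__":
    print(has_utf8_bom(codecs.BOM_UTF8 + b"Hello"))
    print(has_utf8_bom(b"Hello"))
```

A missing BOM says nothing either way: most UTF-8 files are written without one, so this check can only confirm UTF-8, never rule it out.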

If you don't have a BOM or you need to detect all possible encodings, did you think about just running a UnixProcess with 'file' or a similar Unix command that gives you exactly that? Sure, this is not cross-platform...
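
As a rough sketch of the external-command idea (in Python, not via UnixProcess, and assuming the widely used `file` implementation with its `--brief` and `--mime-encoding` options is installed), one could write the bytes to a temporary file and ask `file` for its guess. The helper name `detect_with_file` is hypothetical, and it returns `None` when no `file` executable is found.

```python
import os
import shutil
import subprocess
import tempfile
from typing import Optional


def detect_with_file(data: bytes) -> Optional[str]:
    """Ask the external `file` command for the charset of `data`.

    Returns the charset name that `file` reports (for example "utf-8"),
    or None if no `file` executable is on the PATH.
    """
    if shutil.which("file") is None:
        return None  # e.g. a stock Windows machine

    # Write the bytes to a temporary file that `file` can inspect.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(data)
        path = tmp.name
    try:
        result = subprocess.run(
            ["file", "--brief", "--mime-encoding", path],
            capture_output=True,
            text=True,
            check=True,
        )
        return result.stdout.strip()
    finally:
        os.unlink(path)  # always clean up the temporary file


if __name__ == "__main__":
    print(detect_with_file("Grüße aus Stuttgart\n".encode("utf-8")))
```

The result is still a guess made by `file`, so it is best treated as a hint rather than a guarantee.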
 
Cheers, 

Joachim Tuchel

Nov 9, 2019, 1:31:29 AM
to VA Smalltalk
Hi Mariano,


thanks for the tips. If I could rely on a BOM, things would be easy, as you say. Unfortunately, I can't. The streams I have to handle come from sources that do not respect any standards (i.e. German banks) ;-)

IIUC, 'file' will only look for magic numbers, right? There are no magic numbers in these files. They are really just text files that may or may not be encoded in UTF-8. The only chance to find out is to look for the first occurrence of a UTF-8-encoded character and then guess that the file might be UTF-8. The fact that I receive these files as uploads from the browser would mean I have to save them to disk to use 'file'. Plus, 'file' is not available on Windows. So I really look forward to 9.3 ;-)
For now, I'll brush up my limited Stream-reading knowledge and implement a naive search for UTF-8-encoded German umlaut sequences in the uploads. Far from perfect, I know. Lots of "but that won't work for X (like French characters)".
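
A minimal sketch of such a naive search, in Python rather than Smalltalk for illustration only. The names `contains_utf8_umlaut` and `guess_encoding` and the ISO-8859-1 fallback are assumptions, not anything from VAST or libicu. The second function is a slightly different technique from the umlaut search: it simply tries a strict UTF-8 decode, which also covers non-German characters.

```python
# UTF-8 byte sequences for the German umlauts and sharp s,
# e.g. "ä" -> b'\xc3\xa4', "ß" -> b'\xc3\x9f'.
UMLAUT_SEQUENCES = tuple(ch.encode("utf-8") for ch in "äöüÄÖÜß")


def contains_utf8_umlaut(data: bytes) -> bool:
    """Naive check: does any UTF-8-encoded umlaut sequence occur in data?"""
    return any(seq in data for seq in UMLAUT_SEQUENCES)


def guess_encoding(data: bytes, fallback: str = "iso-8859-1") -> str:
    """Guess "utf-8" if data is valid UTF-8, otherwise return the fallback.

    The fallback name is an assumption about what non-UTF-8 uploads use.
    Pure ASCII input is reported as "utf-8", since ASCII is a subset of it.
    """
    try:
        data.decode("utf-8")  # strict decoding raises on invalid sequences
    except UnicodeDecodeError:
        return fallback
    return "utf-8"


if __name__ == "__main__":
    upload = "Überweisung an Müller".encode("utf-8")
    print(contains_utf8_umlaut(upload), guess_encoding(upload))
```

The umlaut search can give false positives, because the same two bytes are also valid ISO-8859-1 text (they would display as "Ã¤" and similar). The strict decode is harder to fool, since typical ISO-8859-1 umlaut bytes are not valid UTF-8 on their own.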


Anyway, it's great that Instantiations has this area on their radar and that we'll get a libicu-based solution in 9.3! Thanks for that, keep up the good work!

Joachim