unicode progress


Steve Bush

Dec 5, 2010, 7:04:38 PM
to Exodus Users
A lot of work has been done in the last month by Alex and myself
mainly on internationalisation issues. The new version 10.12.1 is
available for download from http://code.google.com/p/exodusdb/

A major change is the move to different field mark characters, and this
requires you to re-initialise your database, or at least to do
"deletefile dict.MD".

Field marks etc. are now U+02F8-02FF instead of 0xF8-0xFF, in order not
to conflict with some accented Latin characters.

* a new "locale" argument on osopen/osread/oswrite supports i/o in
whatever locales are supported by the system
* exodus provides a "utf8" locale that will work everywhere (important
for windows)
* regular expressions (in swap and match) now handle unicode
* multiple database connections supported
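The clash that motivated the change can be seen directly: the old mark bytes are real characters in Latin-1, while the new code points are not. A quick Python sketch (illustrative only, not part of Exodus itself):

```python
# The old marks 0xF8-0xFF are real characters in Latin-1, so they
# collide with accented text; the new marks U+02F8-U+02FF do not,
# and each still encodes to only two bytes in UTF-8.
old_marks = [bytes([b]).decode('latin-1') for b in range(0xF8, 0x100)]
print(old_marks)  # ['ø', 'ù', 'ú', 'û', 'ü', 'ý', 'þ', 'ÿ']

new_marks = [chr(cp) for cp in range(0x02F8, 0x0300)]
assert all(len(m.encode('utf-8')) == 2 for m in new_marks)
```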

Steve Bush

Dec 6, 2010, 4:53:49 AM
to Exodus Users
Perhaps I should explain why I chose unicode code points 02F8-02FF to handle field marks/attribute marks etc.

1. They are two bytes in UTF-8 (same goes for all unicode 0100-07FF)
2. They are real but virtually unused unicode code points that Exodus can safely decide to never support.
http://en.wikibooks.org/wiki/Unicode/Character_reference/0000-0FFF
3. They are visible in most browsers although rarely elsewhere.

Why not use unicode Private Use Area codes?

1. PUA unicode characters result in 3 bytes in UTF-8, and I did not want to resort to illegal UTF-8 bytes such as 0xFE for AM/FM etc.
2. They are randomly in use for other purposes, so they are not predictable.

Why not use unicode 05F8-05FF since they are unused/undefined?
1. They might become defined/used in the future by characters that we cannot ignore.

Why not use illegal UTF-8 bytes such as 0xFE for FM/AM etc?
1. In order to be able to transfer Exodus data in UTF-8 through external parties which might not accept illegal UTF-8.
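Each of the byte-length and legality claims above can be checked directly; a quick Python verification (illustrative only):

```python
# 1. U+0100-U+07FF, which includes the new marks U+02F8-U+02FF,
#    encode to exactly two bytes in UTF-8.
assert len('\u02f8'.encode('utf-8')) == 2
assert len('\u07ff'.encode('utf-8')) == 2

# 2. Private Use Area code points (U+E000-U+F8FF) need three bytes.
assert len('\ue000'.encode('utf-8')) == 3

# 3. The bytes 0xFE and 0xFF never occur in valid UTF-8, so strict
#    decoders reject data containing them.
try:
    b'\xfe'.decode('utf-8')
except UnicodeDecodeError:
    print('0xFE rejected as invalid UTF-8')
```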


--
You received this message because you are subscribed to the Google Groups "Exodus Users" group.
To post to this group, send email to exodus...@googlegroups.com.
To unsubscribe from this group, send email to exodus-users...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/exodus-users?hl=en.


Ashley Chapman

Dec 6, 2010, 6:02:55 AM
to exodus...@googlegroups.com
Interesting ideas.

I've been using the PICK delimiters within the Anji text editor, and that means I have illegal UTF-8 bytes in the text.  My original workaround was to allow the user to switch between Unicode and Latin-1 (ISO 8859-1).  This scheme let the user have full visibility of either UTF-8 or the delimiters, but not both at the same time.  It works well for storing the data in the database, but not so well when the data travels around to browsers etc.

As I rewrite that bit, I'll store meta-data for the database to indicate what delimiters are used natively, and convert to your scheme for display/editing.

Time to come up with a name for this scheme.  Perhaps EDE for Exodus Delimited Encoding?


Taking this further, I often hit problems with data that already contains delimiters that I want to store within an already existing record structure.  I have to convert all of the delimiters so they suit the containing record.  This is normal, and I'm sure you've hit it before.  It's getting more frequent now that I'm processing data from XML sources that often has 5 (or more) levels of nesting.

I've been contemplating an alternative approach, where there are just three delimiters.  One to indicate the start of a lower level, one to separate data items, and a third to indicate a return to a higher level.  This should allow an infinite number of levels, and remove the need for delimiter conversion as I described above.  I've not taken this any further yet, but would be interested in hearing your thoughts on it.
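A minimal sketch of the three-delimiter idea in Python (the marker code points here are invented for illustration, and a naive version like this cannot distinguish an empty level from a level containing a single empty string):

```python
# Three markers: DOWN opens a deeper level, SEP separates items at the
# current level, UP returns to the level above. Marker values are
# hypothetical.
DOWN, SEP, UP = '\u02f8', '\u02f9', '\u02fa'

def encode(value):
    """Serialise nested lists of strings into one flat string."""
    if isinstance(value, list):
        return DOWN + SEP.join(encode(v) for v in value) + UP
    return value

def decode(text):
    """Parse the flat string back into nested lists."""
    stack, root, buf = [], None, None   # buf None = no pending scalar
    for ch in text:
        if ch == DOWN:
            level = []
            if stack:
                stack[-1].append(level)
            else:
                root = level
            stack.append(level)
            buf = ''
        elif ch == SEP:
            if buf is not None:
                stack[-1].append(buf)
            buf = ''
        elif ch == UP:
            if buf is not None:
                stack[-1].append(buf)
            stack.pop()
            buf = None                  # a completed sub-list, not a scalar
        else:
            buf = (buf or '') + ch
    return root

# Round-trips to any depth, with no delimiter conversion needed.
record = ['1', '2', ['3a', '3b', ['3c1', ['3c2a', '3c2b'], '3c3'], '3d'], '4', '5']
assert decode(encode(record)) == record
```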
--
Ashley Chapman

Steve Bush

Dec 6, 2010, 6:42:52 AM
to exodus...@googlegroups.com
Your three markers idea seems technically workable to me. It means you can freely embed stuff to any number of levels like XML without worries.

It reminds me of lisp ... where open bracket means start a deeper level, space means start another item on the current level and close bracket means up a level. ( 1 2 ( 3a 3b ( 3c1 ( 3c2a 3c2b) 3c3 ) 3d ) 4 5 )

Parsing it might even be easier than a typical extract(x,,f,v,s) and using it as a general format for exodus mv strings would be very possible and interesting but would take rather a lot of time to recode everything at the moment.

Visually debugging such strings might be harder unless you have an indenting viewer of some kind.
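Such a viewer could be very small; a Python sketch, assuming a flat string of DOWN/SEP/UP markers with hypothetical marker code points:

```python
# An indenting viewer for three-marker strings: print one item per
# line, indented by its nesting depth. Marker values are invented.
DOWN, SEP, UP = '\u02f8', '\u02f9', '\u02fa'

def dump(text, indent='  '):
    """Render a three-marker string with one item per line."""
    depth, buf, lines = 0, '', []
    for ch in text:
        if ch in (DOWN, SEP, UP):
            if buf:
                lines.append(indent * depth + buf)
                buf = ''
            depth += (ch == DOWN) - (ch == UP)
        else:
            buf += ch
    if buf:
        lines.append(indent * depth + buf)
    return '\n'.join(lines)

sample = DOWN + '1' + SEP + DOWN + 'a' + SEP + 'b' + UP + SEP + '2' + UP
print(dump(sample))
#   1
#     a
#     b
#   2
```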

Ashley Chapman

Dec 23, 2010, 3:02:45 AM
to exodus...@googlegroups.com
On 6 December 2010 11:42, Steve Bush <neosys.com@gmail.com> wrote:
Your three markers idea seems technically workable to me. It means you can freely embed stuff to any number of levels like XML without worries.

I've just used it in a simple XML parser, and it seems okay for that.  It will probably be slow with big arrays, but that's an issue I'll leave for now and return to when I get the time, or when a better idea comes along.
 

It reminds me of lisp ... where open bracket means start a deeper level, space means start another item on the current level and close bracket means up a level. ( 1 2 ( 3a 3b ( 3c1 ( 3c2a 3c2b) 3c3 ) 3d ) 4 5 )

LISP = Let's Insert Some Parentheses :-)  <joke from 1981>

Parsing it might even be easier than a typical extract(x,,f,v,s) and using it as a general format for exodus mv strings would be very possible and interesting but would take rather a lot of time to recode everything at the moment.

Yep, let's leave it for now.  Perhaps an alternative datastructure when some future development calls for it.


