Progress and Unicode support

Donovan Hide

Dec 14, 2011, 9:00:56 AM

to superfa...@googlegroups.com

Hi,

have made very good progress with the ExtJS-based UI and have attached
a screenshot showing the document browsing page, which runs entirely
off the RESTful JSON API, i.e. there are no dynamic web pages on the
server side.

However, while starting to work with JSON and in-browser processing of
results, I've started to notice some bugs that occur when dealing with
documents containing non-ASCII characters, i.e. Unicode. Because UTF-8
characters vary in length, some of the document offsets demonstrate a
skidding effect when the preceding characters are non-ASCII. I always
knew this would be a feature that needed to be implemented, but had
forgotten that JavaScript has native Unicode strings. For the
in-browser hashing that is required for the anonymous search feature
of the Churnalism browser extension to function correctly, the
client-side hashes have to exactly match the output of the server-side
hashing. Thus, I need to handle Unicode server-side at this point.
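
To illustrate the skid, here is a minimal Python sketch (not the SFM
code; the function names and sample strings are made up for the
example). It shows how a UTF-8 byte offset drifts away from the
character offset once a non-ASCII character appears earlier in the
text, and how a browser-side index can differ again because JavaScript
strings count UTF-16 code units.

```python
def codepoint_offset(data: bytes, byte_offset: int) -> int:
    """Convert a UTF-8 byte offset into a code point offset.

    Hypothetical helper: counts only lead bytes, since UTF-8
    continuation bytes always have the bit pattern 10xxxxxx.
    """
    return sum(1 for b in data[:byte_offset] if (b & 0xC0) != 0x80)


def utf16_offset(text: str, cp_offset: int) -> int:
    """Convert a code point offset into a UTF-16 code unit offset.

    JavaScript string indices count UTF-16 code units, so characters
    outside the Basic Multilingual Plane take two units each.
    """
    return len(text[:cp_offset].encode("utf-16-le")) // 2


text = "café au lait"          # "é" takes two bytes in UTF-8
data = text.encode("utf-8")

byte_pos = data.index(b"au")   # offset measured in bytes
char_pos = text.index("au")    # offset measured in code points
skid = byte_pos - char_pos     # one extra byte per earlier "é"

clef_text = "𝄞 au"             # U+1D11E lies outside the BMP
clef_char_pos = clef_text.index("au")
clef_js_pos = utf16_offset(clef_text, clef_char_pos)
```

Whichever unit is chosen (bytes, code points or UTF-16 code units),
both sides have to window and hash the text in the same unit for the
hashes to agree.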

Currently, just trying to make a sensible decision about which
Unicode library to use:

http://utfcpp.sourceforge.net/

or

http://icu-project.org/apiref/icu4c/classUnicodeString.html

This is definitely a worthwhile thing to do, as it does open up SFM to
global usage :)

Bit unsure whether to push the current version to GitHub before or
after implementing this feature.

Cheers,
Donny.

Documents.png

Tom Lee

Dec 14, 2011, 9:39:18 AM

to superfa...@googlegroups.com

My not-very-educated opinion is that it's worth pushing any stable release to GitHub, even with known limitations. Is there any specific worry that's making you hold back, other than the concern that Unicode use cases will fail?

Donovan Hide

Dec 14, 2011, 11:13:18 AM

to superfa...@googlegroups.com

Hi Tom,

the other considerations about pushing to the master branch are that
the other tabs in the interface are not yet formatted using ExtJS and
the browser extension is not yet coded to deal with the JSON search
results. It would be great to get some feedback, so I am keen to push,
but other users might be confused by the unfinished presentation of
the other tabs.

If I do push, I can deploy to the demo instance to help you evaluate
the new interface.

In hindsight, I probably should have branched this piece of work, but
that might be a bit complex to fix now...

Cheers,
Donny.

Tom Lee

Dec 15, 2011, 11:09:31 AM

to superfa...@googlegroups.com

I'd say go ahead and push -- maybe just tag the previous commit before doing so. Inconsistencies in the UI don't seem too worrisome to me, and the browser extension isn't yet of importance to anyone but us.

Donovan Hide

Dec 15, 2011, 11:27:59 AM

to superfa...@googlegroups.com

Hi Tom,

managed to fix the Unicode skid bug, so will push this evening. Still
a slight issue that long Unicode sections get split because of
erroneous whitespace detection, but the vast majority of fragments are
correct.