Proposal for fixing encoding issues in SemanticSpaceExplorer


tat...@gmail.com

Jun 11, 2014, 8:21:42 AM
to s-space-re...@googlegroups.com
Hi,

I would like to propose a small fix to SemanticSpaceExplorer (SSE) and explain the motivation for it.

I recently found a problem with using SemanticSpaceExplorer on Windows when the Semantic Space contains non-ASCII characters and therefore requires Unicode.
Here is the initial description of my problem: https://groups.google.com/forum/#!topic/s-space-users/GF71NKIvqdU
In short: when a word containing special characters is passed as an argument to a command such as "get-neighbors", the word you type does not match the corresponding word in the Semantic Space, which is stored in Unicode.

After some experiments and research on the web, I learned that it's not possible to reliably input Unicode characters via the Windows command line or PowerShell. Please find here relevant threads discussing this problem:

Fortunately, SemanticSpaceExplorer also has an option to read commands from a file, using the "-f" argument.
However, SSE currently opens this file with the following code:
    commandsToExecute = new BufferedReader(new FileReader(
            options.getStringOption("executeFile")));

That causes Java to decode the file with the default system file.encoding, which on Windows is not a Unicode encoding but a regional code page.
One way to solve that is to set the environment variable JAVA_TOOL_OPTIONS to "-Dfile.encoding=UTF8". However, that would be confusing for users without an IT background, and my primary concern is to let such users browse my Semantic Space using SSE.
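
The mismatch described above can be reproduced without SSE. The following is a minimal sketch, not SSE code: it decodes the UTF-8 bytes of a word with a single-byte charset, which is roughly what FileReader does when the default charset is a legacy code page. ISO-8859-1 stands in for a Windows code page here only because every JVM is guaranteed to provide it; the class name, method name and sample word are illustrative assumptions.

```java
import java.nio.charset.StandardCharsets;

public class EncodingMismatchDemo {

    // Decodes the UTF-8 bytes of a word with the wrong (single-byte) charset,
    // imitating a reader that falls back to a legacy default encoding.
    static String decodeWithWrongCharset(String word) {
        byte[] utf8Bytes = word.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A Polish word with three non-ASCII letters, written with Unicode
        // escapes so that the encoding of this source file does not matter.
        String stored = "\u017c\u00f3\u0142w";
        String misread = decodeWithWrongCharset(stored);

        // Each non-ASCII letter takes two bytes in UTF-8, so the misread word
        // is longer and no longer equals the word stored in the space.
        System.out.println("stored length:  " + stored.length());
        System.out.println("misread length: " + misread.length());
        System.out.println("equal: " + stored.equals(misread));
    }
}
```

Because the misread string differs from the stored one, a lookup keyed on the stored word cannot succeed, which matches the behavior reported for "get-neighbors".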

Therefore, a better option would be to make a small change to the SSE code and replace the code above with something like this:
    BufferedReader in = new BufferedReader(
            new InputStreamReader(
                    new FileInputStream(options.getStringOption("executeFile")),
                    <encoding from additional parameter>));
This would also mean adding an SSE argument for the encoding of the command file, and possibly another for the encoding of the command output file (the file defined with the "-s" option), although I need to look into the output issue further.
I would be happy to implement all of the required changes. Please let me know what you think.
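
As a rough, self-contained illustration of the idea above (not the actual SSE change), the reader could be opened like this. The charsetName parameter is hypothetical: it stands for the value of the proposed new option, and falling back to UTF-8 when it is absent is an assumption of this sketch.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CommandFileReaderSketch {

    // Opens the command file with an explicit charset instead of the JVM default.
    // charsetName is hypothetical: it stands for the value of a new SSE option.
    // null means the option was not given, in which case UTF-8 is used.
    static BufferedReader openCommandFile(String path, String charsetName)
            throws IOException {
        Charset charset = (charsetName == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(charsetName);
        return new BufferedReader(
                new InputStreamReader(new FileInputStream(path), charset));
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: CommandFileReaderSketch <file> [charset]");
            return;
        }
        String charsetName = (args.length > 1) ? args[1] : null;
        // Echo the commands to show that they were decoded line by line.
        try (BufferedReader in = openCommandFile(args[0], charsetName)) {
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);
            }
        }
    }
}
```

Charset.forName throws an exception for an unknown or malformed charset name, so a real option handler would probably want to catch that and report it to the user.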

Regards,
Marcin

David Jurgens

Jun 11, 2014, 4:40:52 PM
to s-space-re...@googlegroups.com
Hi Marcin, 

> Fortunately, SemanticSpaceExplorer also has an option to read commands from a file, using the "-f" argument.
> However, SSE currently opens this file with the following code:
>     commandsToExecute = new BufferedReader(new FileReader(
>             options.getStringOption("executeFile")));
>
> That causes Java to decode the file with the default system file.encoding, which on Windows is not a Unicode encoding but a regional code page.
> One way to solve that is to set the environment variable JAVA_TOOL_OPTIONS to "-Dfile.encoding=UTF8". However, that would be confusing for users without an IT background, and my primary concern is to let such users browse my Semantic Space using SSE.

Out of curiosity, what is the default encoding?

> Therefore, a better option would be to make a small change to the SSE code and replace the code above with something like this:
>     BufferedReader in = new BufferedReader(
>             new InputStreamReader(
>                     new FileInputStream(options.getStringOption("executeFile")),
>                     <encoding from additional parameter>));
> This would also mean adding an SSE argument for the encoding of the command file, and possibly another for the encoding of the command output file (the file defined with the "-s" option), although I need to look into the output issue further.
> I would be happy to implement all of the required changes. Please let me know what you think.

A few things come to mind.  First, would it just be easier to have SSE default to utf8 unless otherwise specified?  This would remove the need for a patch.  If someone is using some format other than utf8, they can just use the -Dfile.encoding option.  

Second, instead of adding more arguments to the SSE, would it be possible to just create a .BAT file on Windows (or .sh file on Linux) which automatically launches SSE for your end-users with the -Dfile.encoding=utf8 variable?  In this way, users don't even have to know they're running Java.

The biggest issue I see with adding a new argument to SSE is that we're duplicating the existing JVM parameter functionality.  

Would either of the two options work for you?

  Thanks,
  David

 


--
You received this message because you are subscribed to the Google Groups "Semantic Space Research - Development" group.
To unsubscribe from this group and stop receiving emails from it, send an email to s-space-research...@googlegroups.com.
To post to this group, send email to s-space-re...@googlegroups.com.
Visit this group at http://groups.google.com/group/s-space-research-dev.
For more options, visit https://groups.google.com/d/optout.

tat...@gmail.com

Jun 12, 2014, 9:29:14 AM
to s-space-re...@googlegroups.com
Hi David,

Windows uses regional code pages, e.g. 1250 or 1252. There is also a Windows UTF-8 code page, number 65001, but it has many commonly reported flaws, and it didn't work properly on my machine.

Defaulting SSE to UTF8 would be enough for me. It sounds good. However, why do you say it would remove the need for a patch? You would still need to make small changes to the SSE code.

There is a problem with the "-Dfile.encoding=utf8" parameter.
As I understand it, not all Java versions support it as a parameter to the "java" command.
I also tested it on my machine, and it didn't work.

Only setting JAVA_TOOL_OPTIONS to "-Dfile.encoding=utf8" does the job reliably.
Still, passing a parameter to SSE seems more convenient for a user than changing an environment variable. Changing JAVA_TOOL_OPTIONS could also affect other applications, so users shouldn't change it permanently.
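
Whether a given way of passing the setting actually took effect can be checked from inside the JVM. The small program below is only a diagnostic sketch with made-up class and method names: it prints the file.encoding system property next to the charset that classes such as FileReader fall back to.

```java
import java.nio.charset.Charset;

public class DefaultEncodingCheck {

    // Reports the raw file.encoding property together with the charset that
    // FileReader and similar classes use when no charset is given.
    static String describeDefaultEncoding() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describeDefaultEncoding());
    }
}
```

Running it once with the option on the "java" command line and once with JAVA_TOOL_OPTIONS set should show which of the two is honored by a particular Java installation.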

Defaulting SSE to utf8 sounds best to me. However, I think SSE should also let the user choose a different encoding when required, so a parameter would be necessary.
What do you think?

Thanks,
Marcin

David Jurgens

Jun 12, 2014, 6:32:33 PM
to s-space-re...@googlegroups.com

> Defaulting SSE to UTF8 would be enough for me. It sounds good. However, why do you say it would remove the need for a patch? You would still need to make small changes to the SSE code.

It seems easier to change the default file encoding for SSE than to introduce new parameters for the command line.
 

> There is a problem with the "-Dfile.encoding=utf8" parameter.
> As I understand it, not all Java versions support it as a parameter to the "java" command.
> I also tested it on my machine, and it didn't work.
>
> Only setting JAVA_TOOL_OPTIONS to "-Dfile.encoding=utf8" does the job reliably.
> Still, passing a parameter to SSE seems more convenient for a user than changing an environment variable. Changing JAVA_TOOL_OPTIONS could also affect other applications, so users shouldn't change it permanently.
>
> Defaulting SSE to utf8 sounds best to me. However, I think SSE should also let the user choose a different encoding when required, so a parameter would be necessary.
> What do you think?

I'm less excited about adding all this functionality to support an issue with the Windows terminals.  However, I would like SSE to "just work" so it's a potential work-around.  

A quick-and-dirty GUI version of the SSE seems like a better investment of development resources for people encountering the problem, in my opinion.

  Thanks,
  David
 

Marcin Tatjewski

Jun 13, 2014, 7:02:22 PM
to s-space-re...@googlegroups.com
For me a good fix for now would be to default the file encoding for SSE to UTF8 as you proposed.
What do you think?

Regards,
Marcin


David Jurgens

Jun 14, 2014, 8:50:32 PM
to s-space-re...@googlegroups.com
Hi Marcin,

  I've just checked in a fix to the fix-encoding branch of the S-Space Package.  This fix automatically detects the encoding of the command file (rather than fixing it to utf-8).  Would you mind checking out this branch and seeing if SSE commands work as expected for you now?
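
The thread does not say how the fix-encoding branch detects the encoding. Purely as an illustration of one common heuristic, and not as a description of the actual fix, the sketch below tries a strict UTF-8 decode and falls back to another charset if the bytes are not valid UTF-8. All names in it are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingGuessSketch {

    // Returns UTF-8 if the bytes form valid UTF-8, otherwise the given fallback.
    static Charset guessCharset(byte[] content, Charset fallback) {
        try {
            // REPORT makes the decoder throw instead of silently substituting
            // replacement characters for invalid byte sequences.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(content));
            return StandardCharsets.UTF_8;
        } catch (CharacterCodingException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "\u017c\u00f3\u0142w".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(guessCharset(utf8, StandardCharsets.ISO_8859_1));
        System.out.println(guessCharset(latin1, StandardCharsets.ISO_8859_1));
    }
}
```

A heuristic like this cannot tell two single-byte code pages apart, so the fallback charset still has to come from somewhere, for example the platform default.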

  Thanks,
  David



Marcin Tatjewski

Jun 16, 2014, 5:14:35 AM
to s-space-re...@googlegroups.com
Thanks a lot David. I'll check that soon.

Marcin



Marcin Tatjewski

Oct 10, 2014, 11:44:54 AM
to s-space-re...@googlegroups.com, marcin.t...@gmail.com
Hi David,

Your fix works correctly. I tested it on Windows with input files in UTF-8, which previously caused crashes. Thanks a lot.

I was wondering how to solve another problem: redirecting output to a file, which is especially useful when the "-f" option is used to read commands from a file.
On Windows, redirecting output with ">" garbles all non-ASCII characters, because the output goes through the Windows console.
Maybe it would be worth adding an extra option that sends command output directly to a file?
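
To make the suggestion concrete, here is a minimal sketch of writing results straight to a file in UTF-8, so that the text never passes through the console. It is not SSE code: the class name, the method, the default file name and the sample words are all assumptions made for illustration.

```java
import java.io.IOException;
import java.io.PrintWriter;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class Utf8OutputSketch {

    // Writes each result line to the given file as UTF-8, bypassing the console.
    static void writeResults(Path outputFile, List<String> lines) throws IOException {
        try (PrintWriter out = new PrintWriter(
                Files.newBufferedWriter(outputFile, StandardCharsets.UTF_8))) {
            for (String line : lines) {
                out.println(line);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // The default file name is arbitrary; pass a path as the first argument.
        Path target = Paths.get(args.length > 0 ? args[0] : "sse-output.txt");
        // Made-up sample words, written with Unicode escapes.
        writeResults(target, Arrays.asList("\u017c\u00f3\u0142w", "je\u017c"));
        System.out.println("wrote " + target.toAbsolutePath());
    }
}
```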

Regards,
Marcin
