SemanticSpaceExplorer - error reading semantic space after character encoding conversion - "java.io.IOError: java.io.StreamCorruptedException: invalid stream header:"


Marcin Tatjewski

Oct 17, 2014, 10:51:27 AM
to s-spac...@googlegroups.com
Hello,

I constructed a semantic space in text format, encoded in UTF-8. Then I converted its character encoding from UTF-8 to the Windows-1250 code page.
After this conversion I can no longer read the semantic space with SemanticSpaceExplorer. I get "java.io.IOError: java.io.StreamCorruptedException: invalid stream header:".
I understand that the 4-byte stream header that precedes the number of rows and dimensions in the first row is now corrupted. How can I make my space readable by SemanticSpaceExplorer in the changed encoding?
Or what is the easiest way to fix the problem by changing the source code?

Regards,
Marcin

Marcin Tatjewski

Oct 17, 2014, 10:55:56 AM
to s-spac...@googlegroups.com
Of course, I tried deleting those 4 header bytes, but that does not help.

Marcin Tatjewski

Oct 17, 2014, 11:30:04 AM
to s-spac...@googlegroups.com
This is the error stack:
java.io.IOError: java.io.StreamCorruptedException: invalid stream header: 61737973
java.io.IOError: java.io.StreamCorruptedException: invalid stream header: 61737973
        at edu.ucla.sspace.util.SerializableUtil.load(SerializableUtil.java:165)
        at edu.ucla.sspace.common.SemanticSpaceIO.loadInternal(SemanticSpaceIO.java:271)
        at edu.ucla.sspace.common.SemanticSpaceIO.load(SemanticSpaceIO.java:225)
        at edu.ucla.sspace.common.SemanticSpaceIO.load(SemanticSpaceIO.java:186)
        at edu.ucla.sspace.tools.SemanticSpaceExplorer.execute(SemanticSpaceExplorer.java:256)
        at edu.ucla.sspace.tools.SemanticSpaceExplorer.execute(SemanticSpaceExplorer.java:204)
        at edu.ucla.sspace.tools.SemanticSpaceExplorer.main(SemanticSpaceExplorer.java:779)
Caused by: java.io.StreamCorruptedException: invalid stream header: 61737973
        at java.io.ObjectInputStream.readStreamHeader(Unknown Source)
        at java.io.ObjectInputStream.<init>(Unknown Source)
        at edu.ucla.sspace.util.SerializableUtil.load(SerializableUtil.java:159)
        ... 6 more
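
For reference, the value printed after "invalid stream header:" is the first four bytes of the file, which ObjectInputStream read as two big-endian shorts. The sketch below decodes that value as ASCII and prints the header that Java serialization expects instead. The class and method names are made up for illustration; only ObjectStreamConstants comes from the JDK.

```java
import java.io.ObjectStreamConstants;
import java.nio.charset.StandardCharsets;

class StreamHeaderCheck {
    // Render the four bytes of a big-endian int as ASCII text.
    static String headerAsAscii(int header) {
        byte[] bytes = {
            (byte) (header >>> 24), (byte) (header >>> 16),
            (byte) (header >>> 8), (byte) header
        };
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    // The 32-bit header ObjectInputStream expects: the stream magic
    // followed by the stream version, both stored as shorts.
    static int expectedSerializationHeader() {
        return ((ObjectStreamConstants.STREAM_MAGIC & 0xFFFF) << 16)
            | (ObjectStreamConstants.STREAM_VERSION & 0xFFFF);
    }

    public static void main(String[] args) {
        int reported = 0x61737973; // value copied from the stack trace above
        System.out.println("file starts with: " + headerAsAscii(reported));
        System.out.printf("serialization expects: %08X%n",
            expectedSerializationHeader());
    }
}
```

The reported bytes are ordinary ASCII letters, so the file begins with plain text rather than with a serialized Java object.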

It seems to me that the SemanticSpaceIO.getFormat() method incorrectly decides that my space is in the format SSpaceFormat.SERIALIZE instead of SSpaceFormat.TEXT.
What exactly does the format "SERIALIZE" mean in the context of your spaces? It's not described in the FileFormats wiki.

Regards,
Marcin

David Jurgens

Oct 17, 2014, 11:40:09 AM
to s-spac...@googlegroups.com
Hi Marcin,

  The SERIALIZE type is the case where the SemanticSpace object was serialized to bytes using the standard Java serialization.  We added this at some point but must not have updated the wiki.  I suppose the serialization could be generating an initial byte sequence that matches the two bytes used by the internal SemanticSpace types, though this seems unlikely.  Can you load the object using Java serialization and see if that works?  It certainly seems like there is a bug somewhere in our code for performing this type of operation.

  Thanks,
  David



Marcin Tatjewski

Oct 20, 2014, 6:49:02 AM
to s-spac...@googlegroups.com
Hi David,

Thanks for your response. However, I don't fully understand your point.
My semantic space was originally produced in the text format. It was not serialized; it was plain text. It worked correctly with SemanticSpaceExplorer. Despite being in text format, it had a four-byte header before the row and column counts in the first row. Was that already an error?
Afterwards I changed the character encoding of this space from UTF-8 to Windows code page 1250. Now it no longer loads properly in SemanticSpaceExplorer, as I described above. The problem seems to be caused by the encoding conversion of the byte header.

The getFormat() method of SemanticSpaceIO identifies it as serialized, and parsing crashes as I pointed out above.
As you can see below, getFormat() identifies as serialized every space whose first character is anything other than 's'. Is this the desired behaviour?
How can I force SemanticSpaceExplorer to ignore the header?

static SSpaceFormat getFormat(File sspaceFile) throws IOException {
    DataInputStream dis = new DataInputStream(
        new BufferedInputStream(new FileInputStream(sspaceFile)));
    // read the expected header
    char header = dis.readChar();
    if (header != 's') {
        dis.close();
        return SSpaceFormat.SERIALIZE;
    }
    char encodedFormatCode = dis.readChar();
    int formatCode = encodedFormatCode - '0';
    dis.close();
    return (formatCode < 0 || formatCode > SSpaceFormat.values().length)
        ? SSpaceFormat.SERIALIZE
        : SSpaceFormat.values()[formatCode];
}
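
To make the check in getFormat() concrete: DataInputStream.readChar() consumes two bytes and combines them big-endian into one char, so two readChar() calls account for a four-byte header. The sketch below is an illustration only. It assumes the header was written with DataOutputStream.writeChar(), which the thread does not confirm, and the class and method names are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class HeaderCharDemo {
    // Read the first char the same way the quoted getFormat() does.
    static char firstChar(byte[] fileStart) {
        try (DataInputStream dis =
                 new DataInputStream(new ByteArrayInputStream(fileStart))) {
            return dis.readChar();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical header: 's' plus a format digit, each written with
    // writeChar (assumption, not taken from the S-Space source).
    static byte[] hypotheticalHeader(int formatOrdinal) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream dos = new DataOutputStream(bytes)) {
            dos.writeChar('s');
            dos.writeChar('0' + formatOrdinal);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] intact = hypotheticalHeader(0);
        // First four bytes reported in the stack trace: 61 73 79 73.
        byte[] reported = {0x61, 0x73, 0x79, 0x73};
        System.out.println("intact header length: " + intact.length);
        System.out.println("intact first char is 's': " + (firstChar(intact) == 's'));
        System.out.println("reported first char is 's': " + (firstChar(reported) == 's'));
    }
}
```

With the bytes from the stack trace, the first char is not 's', so the quoted method takes the SERIALIZE branch.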

Thanks,
Marcin 

Marcin Tatjewski

Oct 20, 2014, 7:44:04 AM
to s-spac...@googlegroups.com
To avoid this problem I kept the header in UTF-8 and converted the rest of the file to Windows code page 1250.
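
A minimal sketch of that workaround follows. It assumes the header is exactly four bytes, that everything after it is UTF-8 text, and that the JRE provides the windows-1250 charset. The class name, method name, and command-line arguments are made up for illustration.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

class HeaderPreservingTranscoder {
    // Assumption: the binary header occupies the first four bytes.
    static final int HEADER_BYTES = 4;

    // Copy the header bytes untouched and re-encode only the text after them.
    static byte[] transcode(byte[] original, Charset from, Charset to) {
        if (original.length < HEADER_BYTES) {
            throw new IllegalArgumentException("file is shorter than the header");
        }
        String body = new String(
            original, HEADER_BYTES, original.length - HEADER_BYTES, from);
        byte[] converted = body.getBytes(to);
        // Start from the header, then append the re-encoded body.
        byte[] result = Arrays.copyOf(original, HEADER_BYTES + converted.length);
        System.arraycopy(converted, 0, result, HEADER_BYTES, converted.length);
        return result;
    }

    // Usage: java HeaderPreservingTranscoder <input file> <output file>
    // Reads the whole file into memory, so it only suits files that fit in RAM.
    public static void main(String[] args) throws IOException {
        byte[] input = Files.readAllBytes(Paths.get(args[0]));
        byte[] output = transcode(
            input, StandardCharsets.UTF_8, Charset.forName("windows-1250"));
        Files.write(Paths.get(args[1]), output);
    }
}
```

Characters that have no windows-1250 equivalent are replaced with '?' by String.getBytes(), so the input should be checked if that matters.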