Error trying to run example python code

Christopher Kedzie

May 14, 2014, 1:12:18 PM
to stream...@googlegroups.com
Hi,

I am trying to run the example python script iterating-over-tokens.py and am getting the following error:
Traceback (most recent call last):
  File "iterating-over-tokens.py", line 16, in <module>
    for si in streamcorpus.Chunk(path=sys.argv[1]):
  File "/home/kedz/projects/envs/trec/local/lib/python2.7/site-packages/streamcorpus-0.3.30-py2.7.egg/streamcorpus/_chunk.py", line 392, in __iter__
    (msg.version, self.message().version))
streamcorpus._chunk.VersionMismatchError: read msg.version = 0 != 1 = message().version):


I was able to build, test, and install the package.
The test suite reports 36 tests passed and 2 skipped.

I am running the script on this file, which I took from a previous post on this Google group:
MAINSTREAM_NEWS-393-839f04b6bd4e90a5f284c91c43d58b60-f2ed7aa60c5e2999de9585c982756edd.sc

Thanks!
Chris

John R. Frank

May 14, 2014, 6:41:04 PM
to Christopher Kedzie, stream...@googlegroups.com
> I am trying to run the example python script iterating-over-tokens.py
> and am getting the following error:
>
> Traceback (most recent call last):
>   File "iterating-over-tokens.py", line 16, in <module>
>     for si in streamcorpus.Chunk(path=sys.argv[1]):
>   File "/home/kedz/projects/envs/trec/local/lib/python2.7/site-packages/streamcorpus-0.3.30-py2.7.egg/streamcorpus/_chunk.py", line 392, in __iter__
>     (msg.version, self.message().version))
> streamcorpus._chunk.VersionMismatchError: read msg.version = 0 != 1 = message().version):


Hi Chris,

Your code needs to pass the old version of the thrift message to the Chunk
constructor, like this:

https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_pipeline/_local_storage.py#L63


Here are the three available versions:

https://github.com/trec-kba/streamcorpus-pipeline/blob/master/streamcorpus_pipeline/_local_storage.py#L29


This year's corpus, which we will release next week, is the first to be in
v0_3_0. Last year's corpus is in v0_2_0.
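
As a rough sketch of what that could look like in a script (this is not the code at the links above): the snippet assumes that the streamcorpus package exports classes named StreamItem_v0_1_0, StreamItem_v0_2_0 and StreamItem_v0_3_0, and that Chunk accepts a `message` keyword. Both should be checked against the installed version. The helper names message_class_name and open_chunk are made up for illustration.

```python
# Assumed version labels; only v0_2_0 and v0_3_0 are named in this thread.
KNOWN_VERSIONS = ("v0_1_0", "v0_2_0", "v0_3_0")


def message_class_name(version):
    """Map a corpus version label to the assumed StreamItem class name."""
    if version not in KNOWN_VERSIONS:
        raise ValueError("unknown StreamItem version: %r" % (version,))
    return "StreamItem_" + version


def open_chunk(path, version="v0_2_0"):
    """Open a chunk file with the message class for the given version.

    Assumes streamcorpus.Chunk takes a `message` keyword argument.
    """
    # Imported here so the module loads even where streamcorpus is absent.
    import streamcorpus

    message = getattr(streamcorpus, message_class_name(version))
    return streamcorpus.Chunk(path=path, mode="rb", message=message)
```

If those assumptions hold, replacing the loop header in iterating-over-tokens.py with `for si in open_chunk(sys.argv[1], "v0_2_0"):` should read last year's corpus.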


Let us know if this works for you. We can improve the example if
you have any more questions.

John

Beate Baier Biribakken

Jan 27, 2016, 7:08:32 AM
to streamcorpus, christophe...@gmail.com
Hi John, I'm trying to familiarize myself with streamcorpus so that I can run a set of .sc files from the TREC KBA corpora through CoreNLP, and I too am having trouble running the example script iterating-over-tokens.py:

streamcorpus/py$ python examples/iterating-over-tokens.py
Traceback (most recent call last):
  File "examples/iterating-over-tokens.py", line 16, in <module>
    for si in streamcorpus.Chunk(path=sys.argv[1]):
IndexError: list index out of range

To my frustration, I receive errors on all the examples I've tried, including:

streamcorpus/py$ streamcorpus_dump --show-all input.sc
Traceback (most recent call last):
  File "/usr/local/bin/streamcorpus_dump", line 9, in <module>
    load_entry_point('streamcorpus==0.3.56', 'console_scripts', 'streamcorpus_dump')()
  File "/usr/local/lib/python2.7/dist-packages/streamcorpus/dump.py", line 775, in main
    _dump(fpath, args)
  File "/usr/local/lib/python2.7/dist-packages/streamcorpus/dump.py", line 142, in _dump
    for num, si in enumerate(Chunk(path=fpath, mode='rb')):
  File "/usr/local/lib/python2.7/dist-packages/streamcorpus/_chunk.py", line 381, in __iter__
    for msg in self.read_msg_impl():
  File "/usr/local/lib/python2.7/dist-packages/streamcorpus/_chunk.py", line 474, in read_msg_impl
    (msg.version, self.message().version))
streamcorpus._chunk.VersionMismatchError: read msg.version = 0 != 1 = message().version):
# input.sc is a decompressed and renamed sc.xz-file from the TREC KBA corpora. 


I've run python setup.py thrift, which I assumed was enough to get streamcorpus up and running, but judging from Chris' post, you should also build, test, and install. I'm not sure whether any of these steps are included in the setup script. I'm also not sure whether the links in your answer above are still valid, as the files they point to have been updated since the answer was posted.

Do you know what's causing these errors? Are there any additional steps that need to be taken beyond running the setup script to make streamcorpus run properly?

- Beate

John R. Frank

Jan 27, 2016, 7:40:37 AM
to Beate Baier Biribakken, streamcorpus, christophe...@gmail.com
Hi Beate,

Yes, that's a frustrating bug. It's actually a defect in the latest
versions of thrift. You have to go back to thrift==0.9.2.

With pip you can do this by:

pip uninstall thrift -y && pip install thrift==0.9.2
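
To confirm which thrift ended up in the active environment afterwards, a small check like the following may help. It is only a sketch: the helper names installed_version and check_thrift_pin are hypothetical, and the only fact it encodes is the 0.9.2 pin mentioned above.

```python
PINNED_THRIFT = "0.9.2"  # the version recommended in this thread


def installed_version(dist_name):
    """Return the installed version string of a distribution, or None."""
    try:
        # Available on Python 3.8 and later.
        from importlib.metadata import PackageNotFoundError, version
    except ImportError:
        # Fallback for older interpreters such as Python 2.7 (needs setuptools).
        import pkg_resources

        try:
            return pkg_resources.get_distribution(dist_name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


def check_thrift_pin(found, pinned=PINNED_THRIFT):
    """Describe how the installed thrift version compares with the pin."""
    if found is None:
        return "thrift is not installed"
    if found != pinned:
        return "thrift %s found; this thread recommends %s" % (found, pinned)
    return "thrift %s matches the recommended pin" % found
```

Running `print(check_thrift_pin(installed_version("thrift")))` inside the same environment that runs streamcorpus shows whether the pin took effect.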


Here are details:

https://issues.apache.org/jira/browse/THRIFT-3175?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15056830#comment-15056830

John

Beate Baier Biribakken

Jan 27, 2016, 8:41:09 AM
to John R. Frank, streamcorpus
Hi John, and thank you for your quick response. I followed your suggestion, but unfortunately I get exactly the same errors as before. I should add that I also get errors when trying to compile streamcorpus/java/src/test/ReadThrift.java:

streamcorpus/java/src/test$ javac ReadThrift.java 
ReadThrift.java:3: error: package org.apache.thrift.protocol does not exist
import org.apache.thrift.protocol.TBinaryProtocol;
                                 ^
ReadThrift.java:4: error: package org.apache.thrift.transport does not exist
import org.apache.thrift.transport.TIOStreamTransport;
                                  ^
ReadThrift.java:5: error: package org.apache.thrift.transport does not exist
import org.apache.thrift.transport.TTransport;
                                  ^
ReadThrift.java:6: error: package streamcorpus does not exist
import streamcorpus.StreamItem;
                   ^
ReadThrift.java:21: error: cannot find symbol
            TTransport transport = new TIOStreamTransport(new BufferedInputStream(new FileInputStream("test-data/john-smith-tagged-by-lingpipe-0.sc")));
            ^
  symbol:   class TTransport
  location: class ReadThrift
ReadThrift.java:21: error: cannot find symbol
            TTransport transport = new TIOStreamTransport(new BufferedInputStream(new FileInputStream("test-data/john-smith-tagged-by-lingpipe-0.sc")));
                                       ^
  symbol:   class TIOStreamTransport
  location: class ReadThrift
ReadThrift.java:22: error: cannot find symbol
            TBinaryProtocol protocol = new TBinaryProtocol(transport);
            ^
  symbol:   class TBinaryProtocol
  location: class ReadThrift
ReadThrift.java:22: error: cannot find symbol
            TBinaryProtocol protocol = new TBinaryProtocol(transport);
                                           ^
  symbol:   class TBinaryProtocol
  location: class ReadThrift
ReadThrift.java:26: error: cannot find symbol
                final StreamItem item = new StreamItem();
                      ^
  symbol:   class StreamItem
  location: class ReadThrift
ReadThrift.java:26: error: cannot find symbol
                final StreamItem item = new StreamItem();
                                            ^
  symbol:   class StreamItem
  location: class ReadThrift
10 errors

I even tried copying the .sc file to the same folder as ReadThrift.java. streamcorpus.jar is located in streamcorpus/java.

John R. Frank

Jan 27, 2016, 8:47:18 AM
to Beate Baier Biribakken, streamcorpus
Hi Beate,

If you send me a link to the chunk file you are trying to run, I can
verify that it runs correctly for me. It's generally easiest to verify
data in python.

If you are not running in a python virtualenv, then you might want to try
using one instead. It's generally best to pass --no-site-packages if you
have an old version of virtualenv.

Beate Baier Biribakken

Jan 27, 2016, 9:05:36 AM
to John R. Frank, streamcorpus
Hi John, 

Here are the files:

The .sc file was renamed to input.sc. I would appreciate it if you could verify that these files can be read by streamcorpus, if only to rule out the input files as a cause of the errors.
I don't currently use virtualenv, but I will look into it now.

John R. Frank

Jan 27, 2016, 9:30:27 AM
to Beate Baier Biribakken, streamcorpus
This command works fine:

streamcorpus_dump arxiv-19-6e8e8db427c9cd4497f6695ceb65bd51-08c60f0b7955ac0841be2bf26ceed179.sc.xz --smart --version v0_2_0

Perhaps you weren't using the v0_2_0 version of the thrift file?

jrf

Beate Baier Biribakken

Jan 27, 2016, 10:16:24 AM
to John R. Frank, streamcorpus
Hurray, it runs! Thank you so much for your help, John :) I hadn't heard of the --version option. 
Are there any tweaks I can make to run the example Python scripts in streamcorpus/py/examples or streamcorpus/examples/py?
Neither python <path-to-python-file> nor python <path-to-python-file> --version v0_2_0 seems to work, not even in a virtualenv.

John R. Frank

Jan 27, 2016, 11:04:30 AM
to Beate Baier Biribakken, streamcorpus

The good and bad of Thrift is that it rigidly specifies what fields mean.
If you want to change a field's meaning, e.g. go from int16 to int32, then
you either have to pick a new slot number, possibly with the same or a
different attribute name, or make an entirely new message class. We made
the switch from v0_2_0 to v0_3_0 the latter way. The former might have
been better. These version numbers are specific to the KBA data, and are not
related to the other versioning issue you were having with thrift==0.9.2.

The result is that there are *three* different message classes that you
might have to use when deserializing data. This means compiling your
thrift classes from a *different* thrift file. They are listed here [1].
The Python tooling handles this for you to some degree. The --version
flag only exists on streamcorpus_dump [2]. The different classes for the
different versions are pre-built in the Python module [3].

If you want to do this in another language, e.g. Java, you'll have to
handle the compiling from the right thrift file in [1].
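
For the example scripts, which have no --version flag of their own, one possible workaround is a small wrapper along these lines. It is a sketch under assumptions, not the shipped example: it assumes that streamcorpus exports StreamItem_v0_1_0, StreamItem_v0_2_0 and StreamItem_v0_3_0 and that Chunk takes a `message` keyword, and build_parser and main are illustrative names. Taking the path as a required argument also produces a usage message when no file is given, instead of the IndexError from sys.argv[1] seen earlier in this thread.

```python
import argparse

# Assumed version labels, matching the three thrift files mentioned above.
VERSIONS = ("v0_1_0", "v0_2_0", "v0_3_0")


def build_parser():
    """Build a parser with a required chunk path and an optional --version."""
    parser = argparse.ArgumentParser(
        description="Iterate over the StreamItems in a chunk file.")
    parser.add_argument("path", help="path to a chunk file")
    parser.add_argument(
        "--version", choices=VERSIONS, default="v0_3_0",
        help="StreamItem version the chunk was written with")
    return parser


def main(argv=None):
    """Count the items in a chunk, read with the requested message class."""
    args = build_parser().parse_args(argv)

    # Imported here so that --help works even without streamcorpus installed.
    import streamcorpus

    # Assumption: the classes are exposed as StreamItem_<version>.
    message = getattr(streamcorpus, "StreamItem_" + args.version)
    count = 0
    for si in streamcorpus.Chunk(path=args.path, mode="rb", message=message):
        count += 1
    print("%d stream items read as %s" % (count, args.version))
```

Calling `main()` at the bottom of a script would then allow something like `python wrapper.py input.sc --version v0_2_0`, where wrapper.py is a placeholder file name.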


[1] https://github.com/diffeo/streamcorpus/tree/master/if

[2] https://github.com/diffeo/streamcorpus/blob/master/py/src/streamcorpus/dump.py#L728

[3] https://github.com/diffeo/streamcorpus/blob/master/py/src/streamcorpus/package_globals.py#L51-L55

jrf


--
______________________________
John R. Frank <j...@diffeo.com>
mobile: +1-617-899-2066
http://diffeo.com