KBA corpus: how to process the whole corpus on one machine in AWS

John R. Frank

Apr 17, 2013, 10:18:18 PM

to stream...@googlegroups.com

All users of KBA corpus,

The corpus stripped of non-English texts is 99% complete. I will post
when the last few hundred chunks finish. You can list the directories like this:

s3cmd ls s3://aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/

Previously, I posted the suggestion that instead of downloading the whole
corpus, you can process it in EC2. Just to check that this is easy and
cheap, we ran the attached C++ example of deserializing the corpus in this
simple python and shell script pipeline:

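import sys
from subprocess import Popen, PIPE

# Excerpt: chunk_count (a counter starting at 0) and gpg_dir (the GnuPG home
# directory) are assumed to be defined earlier in the full script.
for line in sys.stdin:
    chunk_count += 1

    # Each input line is a chunk path relative to the public-datasets bucket.
    url = 'http://s3.amazonaws.com/aws-publicdatasets/' + line.strip()

    # Download, decrypt, decompress, and hand the chunk to the C++ program.
    cmd = '(wget -O - %s | gpg --homedir %s --no-permission-warning --trust-model always --output - --decrypt - | xz --decompress | ./streamcorpus-counter) 2>> subprocess_errors.log' % (url, gpg_dir)

    child = Popen(cmd, stdout=PIPE, shell=True)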

Running this in an EC2 cc2.8xlarge by splitting the list of 2.2M input
paths and feeding into GNU parallel like this:

ls paths/x???? | ./parallel -j 32 --eta "cat {} | python filter-streamitems-cpp.py 1> {}.speed.log 2> {}.errors.log {}" &> parallel.log &
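
As a rough illustration of the splitting step, here is a minimal Python sketch that writes a list of chunk paths into fixed-size pieces named x0000, x0001, and so on, so that a glob like paths/x???? matches them. The function name split_paths, the shard size, and the file naming are assumptions chosen to fit the command above; this is not the script that was actually used.

```python
import os


def split_paths(lines, out_dir, shard_size):
    """Write `lines` into files of at most `shard_size` lines each.

    Files are named x0000, x0001, ... (hypothetical naming that matches
    the x???? glob). Returns the list of file paths written, in order.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []
    for start in range(0, len(lines), shard_size):
        name = "x%04d" % (start // shard_size)
        shard_path = os.path.join(out_dir, name)
        with open(shard_path, "w") as shard:
            for line in lines[start:start + shard_size]:
                shard.write(line.rstrip("\n") + "\n")
        written.append(shard_path)
    return written
```

Each resulting file can then be fed to one worker process. Smaller pieces make the work easier to balance across workers, while larger pieces mean fewer process start-ups.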

The pipeline runs at 123 MB/sec, or roughly 10 TB/day, of compressed, encrypted chunk files.

An EC2 machine of that size costs $2.10/hr, so you can process the whole corpus
for about $50, assuming your initial filtering technique is fast.
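
The arithmetic behind these figures can be checked with a short Python sketch. It assumes decimal units (1 TB = 10**6 MB) and a flat hourly price. The data volume it prints is only what the stated throughput and budget imply, not a corpus size given in this post.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def tb_per_day(mb_per_sec):
    """Convert a throughput in MB/sec to TB/day (decimal: 1 TB = 10**6 MB)."""
    return mb_per_sec * SECONDS_PER_DAY / 1e6


def hours_for_budget(budget_usd, usd_per_hour):
    """Number of instance-hours a budget buys at a flat hourly price."""
    return budget_usd / usd_per_hour


rate = tb_per_day(123)              # throughput quoted in the post
hours = hours_for_budget(50, 2.10)  # budget and hourly price quoted in the post
volume = rate * hours / 24          # TB processed in that many hours

print("%.1f TB/day, %.1f hours, %.1f TB" % (rate, hours, volume))
# prints: 10.6 TB/day, 23.8 hours, 10.5 TB
```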

Doing it with the Python code generated by Thrift is more than 100x slower.
I'd expect Java to be about as fast as the C++.

jrf

streamcorpus-counter.cpp

Dai Zhang

Apr 20, 2013, 7:33:01 AM

to stream...@googlegroups.com

Hi John,

Some errors occur when I process the streamcorpus in Java with the example ReadThrift.java.

The system displays:

org.apache.thrift.transport.TTransportException
    at org.apache.thrift.transport.TIOStreamTransport.read(TIOStreamTransport.java:132)
    at org.apache.thrift.transport.TTransport.readAll(TTransport.java:84)
    at org.apache.thrift.protocol.TBinaryProtocol.readAll(TBinaryProtocol.java:378)
    at org.apache.thrift.protocol.TBinaryProtocol.readByte(TBinaryProtocol.java:264)
    at org.apache.thrift.protocol.TBinaryProtocol.readFieldBegin(TBinaryProtocol.java:228)
    at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1493)
    at streamcorpus.StreamItem$StreamItemStandardScheme.read(StreamItem.java:1)
    at streamcorpus.StreamItem.read(StreamItem.java:1326)
    at test.ReadThrift.main(ReadThrift.java:27)

It seems that something goes wrong at the line item.read(protocol); every time it reads the last item of each chunk.

We would like to know how to fix that.

Deeply appreciate your kindness and patience.

On Thursday, April 18, 2013 at 10:18:18 AM UTC+8, John R. Frank wrote:

John R. Frank

Apr 23, 2013, 6:34:14 PM

to stream...@googlegroups.com, Dai Zhang

Dai Zhang, thank you for prompting us to improve the exception handling in
the Java example for iterating over StreamItems:

https://github.com/trec-kba/streamcorpus/commit/9c764982cff95a206738525078b77237676e3d09
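
The stack trace earlier in this thread shows the exception being raised while readFieldBegin was reading the first byte of a new item, which is what one would expect when the reader reaches the end of a chunk file exactly at an item boundary. In that situation the end of input is the normal way for the loop to finish rather than a failure. The Python sketch below illustrates that pattern with a stand-in format (a 4-byte length prefix followed by a payload) instead of real Thrift. The names read_exact, iter_records, and frame, and the framing itself, are hypothetical; this is not the streamcorpus API and not the change made in the commit above.

```python
import io
import struct


def read_exact(stream, size):
    """Read exactly `size` bytes, or raise EOFError if the stream runs out."""
    data = stream.read(size)
    if len(data) < size:
        raise EOFError("wanted %d bytes, got %d" % (size, len(data)))
    return data


def iter_records(stream):
    """Yield payloads until the stream ends cleanly at a record boundary."""
    while True:
        header = stream.read(4)
        if header == b"":
            return  # clean end of input: no more records
        if len(header) < 4:
            raise EOFError("truncated record header")
        (length,) = struct.unpack(">I", header)
        # Running out of bytes in the middle of a record is a real error,
        # so the EOFError from read_exact is allowed to propagate.
        yield read_exact(stream, length)


def frame(payload):
    """Encode one payload in the stand-in length-prefixed format."""
    return struct.pack(">I", len(payload)) + payload


chunk = io.BytesIO(frame(b"item-1") + frame(b"item-2"))
print(list(iter_records(chunk)))
# prints: [b'item-1', b'item-2']
```

The design point is the distinction between the two cases: running out of input before a record starts ends the iteration quietly, while running out part-way through a record is still reported as an error.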

EVERYONE, please note this example that shows the many useful pieces of
metadata in each StreamItem. The example is in Python. If someone would
like to contribute a Java example, we would appreciate it:

https://github.com/trec-kba/streamcorpus/blob/master/examples/py/iterating-over-tokens.py

John

On Tue, 23 Apr 2013, Dai Zhang wrote:

> Thank you for your response, John.
> I just ran the code in streamcorpus/java/src/test/ReadThrift.java without any modification, using the streamitem chunk john-smith-tagged-by-lingpipe-0.sc in "test-data". It throws an error when
> it reads the last item in the test file. The error occurs at line 27.
>
> Many thanks.
>
>
>
> 2013/4/23 John R. Frank <j...@mit.edu>
> Hi Dai,
>
> Sorry that Java thrift is causing this problem. We would be happy to help you. Could you please send us the code and a streamitem chunk file that exhibit this error?
>
> If practical for you, it would help us if you could fork this repo and make your example in your own forked repo. Then, we can pull fixes as we help you debug this issue.
>
> https://github.com/trec-kba/streamcorpus/tree/master/java
>
>
> John
>
>
> --
> ___________________________
> John R. Frank <j...@mit.edu>
> mobile: +1-617-899-2066

> --
> Dai Zhang （张岱）
> Tel:18811595161 School of Information and Communication Engineering
> Beijing University of Posts and Telecommunications,
> Beijing 100876,
> P.R.China
>
>

Morteza Shahriari Nia

May 3, 2013, 5:01:15 PM

to stream...@googlegroups.com

John,

I'm trying to replicate the above shell call on EC2 and evaluate the performance versus your benchmark. How can I access filter-streamitems-cpp.py?

Regards,

Morteza Shahriari Nia
http://mshahriarinia.com/

John R. Frank

May 16, 2013, 9:53:13 PM

to Morteza Shahriari Nia, stream...@googlegroups.com

Morteza, sorry for the delay. I just posted the full python script that
we used to generate an inverse mapping of stream_id to chunk path.

https://github.com/trec-kba/streamcorpus/commit/3a69efc1ad3f4187753f6798e65b4e853e4cfbc4

We will post this reverse index file soon.
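
For readers unfamiliar with the idea, a mapping from stream_id to chunk path can be sketched in a few lines of Python. The input shape (pairs of a chunk path and the stream_ids found in it), the tab-separated output, and all identifiers below are made up for illustration; the format of the actual reverse index file is not described in this thread.

```python
def build_reverse_index(chunks):
    """Map each stream_id to the chunk path that contains it.

    `chunks` is an iterable of (chunk_path, stream_ids) pairs.
    """
    index = {}
    for chunk_path, stream_ids in chunks:
        for stream_id in stream_ids:
            index[stream_id] = chunk_path
    return index


def format_index(index):
    """Render the index as sorted 'stream_id<TAB>chunk_path' lines."""
    return "".join(
        "%s\t%s\n" % (stream_id, path) for stream_id, path in sorted(index.items())
    )


# Made-up identifiers, for illustration only.
example = [
    ("dir-a/chunk-1", ["100-aaa", "101-bbb"]),
    ("dir-b/chunk-2", ["200-ccc"]),
]
index = build_reverse_index(example)
print(index["200-ccc"])
# prints: dir-b/chunk-2
```

With such an index, looking up a single item only requires fetching the one chunk that holds it instead of scanning the corpus.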

Let us know if you have further questions on this.

jrf

