> I have never seen the .sc file format. I have tried to open this type of
> file as plain text, but it is still unreadable. Can you give me some
> advice on using Java to process this file format?
We invented the file extension ".sc" to represent "streamcorpus", which is
a Thrift message format defined here:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus.thrift
Thrift is a serialization format that is very efficient for
reading/writing data from files or over network connections. It was
developed at Facebook and is similar to protocol buffers ("protobuf") from
Google. Thrift is self-delimiting, so after you compile the thrift
classes for a given programming language, you can simply call .read() on a
file or byte stream to get fully instantiated class instances.
These class instances have the corpus data and NLP tagging metadata.
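In Java, that read loop looks roughly like the sketch below. This is not
the official example, just a minimal illustration: it assumes libthrift is
on your classpath and that you generated the streamcorpus classes from the
.thrift file into a Java package named streamcorpus (adjust the package
name to whatever your thrift compiler emitted):

import java.io.BufferedInputStream;
import java.io.FileInputStream;

import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TIOStreamTransport;
import org.apache.thrift.transport.TTransportException;

import streamcorpus.StreamItem;  // generated from streamcorpus.thrift

public class ReadChunk {
    public static void main(String[] args) throws Exception {
        // wrap the flat file of concatenated StreamItem messages
        TIOStreamTransport transport = new TIOStreamTransport(
            new BufferedInputStream(new FileInputStream(args[0])));
        TBinaryProtocol protocol = new TBinaryProtocol(transport);
        try {
            while (true) {
                // each read() consumes exactly one self-delimiting
                // StreamItem message from the stream
                StreamItem si = new StreamItem();
                si.read(protocol);
                System.out.println(si.stream_id);
            }
        } catch (TTransportException expectedAtEndOfFile) {
            // Thrift signals end-of-file by throwing TTransportException
        }
    }
}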
Here is a simple Java example:
https://github.com/trec-kba/streamcorpus/tree/master/java
https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java
NOTE: the Java example does *not* yet fully illustrate all of the data in
the corpus. To see all that is available in the KBA 2013 corpus, please
see the example pasted below, which is Python (read it as pseudocode if
you prefer):
https://github.com/trec-kba/streamcorpus/blob/master/examples/py/iterating-over-tokens.py
'''
Example of using the python streamcorpus library to iterate over
documents and then iterate over tokens. The output of this example is
a sequence of labels for tokens.
This serves as documentation both for the Thrift message structures in
streamcorpus v0_2_0 and for the data in the KBA 2013 corpus:
https://github.com/trec-kba/streamcorpus/blob/master/if/streamcorpus.thrift
http://s3.amazonaws.com/aws-publicdatasets/trec/kba/index.html
For example, you can run it like this:

python iterating-over-tokens.py ../../test-data/john-smith-tagged-by-lingpipe-0.sc
or, you can try a Chunk from the KBA 2013 streamcorpus:
## download a chunk from the KBA 2013 corpus
wget http://s3.amazonaws.com/aws-publicdatasets/trec/kba/kba-streamcorpus-2013-v0_2_0-english-and-unknown-language/2012-11-07-18/MAINSTREAM_NEWS-393-839f04b6bd4e90a5f284c91c43d58b60-f2ed7aa60c5e2999de9585c982756edd.sc.xz.gpg
## decrypt with key from NIST
gpg MAINSTREAM_NEWS-393-839f04b6bd4e90a5f284c91c43d58b60-f2ed7aa60c5e2999de9585c982756edd.sc.xz.gpg > MAINSTREAM_NEWS-393-839f04b6bd4e90a5f284c91c43d58b60-f2ed7aa60c5e2999de9585c982756edd.sc.xz
## run this program on it
python iterating-over-tokens.py MAINSTREAM_NEWS-393-839f04b6bd4e90a5f284c91c43d58b60-f2ed7aa60c5e2999de9585c982756edd.sc.xz
'''
## import the python module that wraps the streamcorpus.thrift
## interfaces.  We will only use the Chunk convenience interface,
## which reads StreamItem messages out of flat files.
import streamcorpus

## use the sys module to access command-line arguments
import sys
## iterate over StreamItem messages in a flat file
### In other languages, you may need to create a read loop like this:
### https://github.com/trec-kba/streamcorpus/blob/master/java/src/test/ReadThrift.java
for si in streamcorpus.Chunk(path=sys.argv[1]):
    ## iterate over the sentences map for each tagger
    #
    # taggers are identified by tagger_id strings, e.g. 'lingpipe' or
    # 'serif' or 'stanford'
    #
    for tagger_id in si.body.sentences:
        for sentence in si.body.sentences[tagger_id]:

            ## iterate over the tokens in each sentence
            for token in sentence.tokens:

                ## iterate over the labels map
                target_ids = []
                for annotator_id in token.labels:
                    for label in token.labels[annotator_id]:
                        ## collect the target_id for each label
                        target_ids.append( label.target.target_id )

                ## token has many useful properties:
                #
                # token.token is the word itself
                #
                # token.mention_id is a number that distinguishes
                # multi-token named entity mentions
                #
                # token.equiv_id is a number that identifies
                # within-doc coref chains of mentions to the same
                # entity
                #
                # token.entity_type is the type of the entity
                if token.entity_type is not None:
                    ## we could just display the integer of
                    ## token.entity_type, however it is easier to read
                    ## if we pull the name out of the EntityType
                    ## enumeration:
                    entity_type_name = streamcorpus.EntityType._VALUES_TO_NAMES[token.entity_type]
                else:
                    entity_type_name = ' '

                print '\t'.join(map(str, [token.mention_id, token.equiv_id, entity_type_name, token.token, target_ids]))
                ## In some corpora, e.g. the KBA 2013 corpus, the
                ## target_id is a full URL, such as the URL used by
                ## the author of the HTML from which the text was
                ## extracted.

                ## In most corpora, StreamItem.body.clean_html
                ## contains a properly structured UTF-8 version of
                ## the original HTML in StreamItem.body.raw.  The
                ## clean_visible is constructed from clean_html by
                ## replacing all bytes that are part of HTML tags
                ## with ' ', so that the byte offsets remain the
                ## same.  This allowed us to construct
                ## hyperlink_labels for the tokens.
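                ## because the byte offsets line up, the two strings
                ## have equal length whenever both are set
                ## (illustrative check, not part of the original script):
                #assert len(si.body.clean_html) == len(si.body.clean_visible)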
    ## some documents have NER data stored only as raw_tagging output
    ## and have not been unpacked into thrift structures.  Specifically,
    ## the NER data from KBA 2012 has been preserved in the KBA 2013
    ## streamcorpus here:
    #StreamItem.body.taggings['stanford'].raw_tagging
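    ## for example, you could list whatever raw taggings are present
    ## like this (illustrative snippet, assuming body.taggings is set):
    #for tagger_id in si.body.taggings:
    #    print tagger_id, len(si.body.taggings[tagger_id].raw_tagging or '')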
    print
    if si.body.clean_visible:
        num_bytes = len(si.body.clean_visible)
        num_chars = len(si.body.clean_visible.decode('UTF-8'))
        print '%d bytes %d characters in StreamItem(%r).body.clean_visible for %s' % (num_bytes, num_chars, si.stream_id, si.abs_url)
    print '-------'
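One last practical note for Java: after gpg decryption, the KBA 2013
chunk files are still xz-compressed (.sc.xz), so you must decompress
before the Thrift read loop, either on disk with "xz --decompress" or
in-stream. For example, with the XZ for Java library (org.tukaani:xz),
one way to open the stream looks like this sketch:

import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.InputStream;

import org.tukaani.xz.XZInputStream;

public class OpenXzChunk {
    // returns a stream of raw StreamItem bytes, ready to be wrapped in
    // TIOStreamTransport for the read loop shown earlier
    public static InputStream open(String path) throws Exception {
        return new XZInputStream(
            new BufferedInputStream(new FileInputStream(path)));
    }
}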