how to implement CPL and text corpus


Mehjabin Rahman

May 12, 2017, 9:37:49 AM5/12/17
to NELL: Never-Ending Language Learner
Hi,
I am reading your publication on Coupled Semi-Supervised Learning for
Information Extraction. I am having trouble understanding how you implemented CPL and which tools you used. Also, how can I get the text corpus? I need this for my MSc thesis.
Thank you.

Bryan Kisiel

May 12, 2017, 12:27:01 PM5/12/17
to NELL: Never-Ending Language Learner
Hi Mehjabin,

At the time of that paper, we called that component "CBL" and its
operation is described in detail in
https://rtw.ml.cmu.edu/papers/cbl-sslnlp09.pdf. We have tweaked that
algorithm over time, but haven't published anything so detailed about the
modern variant. The closest thing, if you're interested, is a brief
synopsis in the "Implementation" section of
https://rtw.ml.cmu.edu/papers/carlson-aaai10.pdf.
The program itself is written from scratch in Java, and doesn't use any
tools beyond some standard utilities like log4j.

The text corpus, ClueWeb09, is available from
https://lemurproject.org/clueweb09/. We are not allowed to redistribute
it ourselves. However, the CPL component itself operates off of
cooccurrence statistics derived from the text. A copy of these statistics
that would have been in use at the time of the WSDM10 paper can be found at
http://rtw.ml.cmu.edu/wk/all-pairs-2010-02-11-gz/ -- there's a brief
README describing the format.

The series of Hadoop jobs that produces these is also written from
scratch in Java. Here again there are some 3rd-party packages in use, such as
for extracting text from HTML, but I'd say the only significant one is
CoreNLP, which is used primarily for sentence detection and POS-tagging of
tokens so that we can detect noun phrases and filter out noisy patterns.

If you need more details, please feel free to ask.

bki...@cs.cmu.edu

Mehjabin Rahman

Apr 8, 2018, 3:08:01 AM4/8/18
to NELL: Never-Ending Language Learner

Hello, can you please help me by providing the source code of the CPL algorithm? I am having trouble implementing it.
Thank you

Mehjabin Rahman

Jun 6, 2018, 2:34:51 PM6/6/18
to NELL: Never-Ending Language Learner


Hello, so far you have helped me a lot. I would like a clarifying answer about the following:
suppose my pattern consists of "arg1, arg2 players". How do I match this pattern against the corpus and extract new seed instances?

Thanks

bkisiel

Jun 11, 2018, 4:08:31 PM6/11/18
to NELL: Never-Ending Language Learner

Hi Mehjabin,

What we did is implement a preprocessing step that extracts all (arg1, context, arg2) tuples from the corpus.  For instance, we might find a sentence like, "Toyota is releasing a new Camry for 2018."  In this case, we might extract the tuple ("Toyota", "is releasing a new", "Camry").  Then we collect all such tuples into one very large matrix where each row is an (arg1, arg2) pair, each column is a context, and the value in that cell of the matrix is the number of times we found that tuple in the corpus.
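That row/column/count structure can be sketched in a few lines; this is an illustrative Python sketch (NELL itself is written in Java), with made-up tuples standing in for real corpus extractions:

```python
from collections import defaultdict

# Hypothetical toy tuples of the form (arg1, context, arg2), as might be
# produced by the preprocessing step described above.
tuples = [
    ("Toyota", "is releasing a new", "Camry"),
    ("Toyota", "is releasing a new", "Camry"),
    ("Toyota", "has released a new", "Camry"),
    ("Honda", "is releasing a new", "Civic"),
]

# Sparse "matrix": row = (arg1, arg2) pair, column = context,
# cell value = number of times that tuple was seen in the corpus.
cooccurrence = defaultdict(lambda: defaultdict(int))
for arg1, context, arg2 in tuples:
    cooccurrence[(arg1, arg2)][context] += 1

# Looking up a row gives all contexts observed for that pair, with counts.
print(dict(cooccurrence[("Toyota", "Camry")]))
# {'is releasing a new': 2, 'has released a new': 1}
```

Looking up a row like ("Toyota", "Camry") then answers the question in the next paragraph: it returns every context seen with that pair, weighted by frequency.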

So if we had a pair like ("Toyota", "Camry"), then we would look in that row and find all of the contexts we had extracted.  One of them would be "is releasing a new", probably with a large value, since that would have been mentioned in many automotive news articles over many years.  Probably also "has released a new", "will release a new", and also things like "is issuing a recall for" or "is redesigning the", and so on.

The trick of course is how to find all such tuples given a sentence.  We've used various approaches with various languages and source corpora, but what we've been using for the public NELL run based on ClueWeb starts by POS-tagging all of the sentences.  Then we have basically a set of hand-tuned heuristics to identify things that could be an arg1 or arg2 (e.g. one or more nouns in a row, allowing a determiner before the first noun, maybe an adjective followed by a noun, etc.).  We consider any pair of arguments sufficiently close to each other within a sentence to be a candidate pair.  Then the POS tags of the intervening words are filtered according to another set of heuristics (e.g. argument, verb, argument would be a simple obvious match, but we allow for many more complicated patterns, and also apply some filters to reduce noise).  Any candidate pair then becomes an extracted tuple by taking the sequence of words between the two arguments to be the context.
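A minimal Python sketch of that pipeline, assuming Penn Treebank style tags and with the noun-run and intervening-verb rules drastically simplified (NELL's actual heuristics, in Java, are far more elaborate):

```python
# One POS-tagged sentence as (word, tag) pairs; the tags follow the
# Penn Treebank convention (NN* = nouns, VB* = verbs, DT, JJ, ...).
tagged = [
    ("Toyota", "NNP"), ("is", "VBZ"), ("releasing", "VBG"),
    ("a", "DT"), ("new", "JJ"), ("Camry", "NNP"),
    ("for", "IN"), ("2018", "CD"), (".", "."),
]

def noun_phrase_spans(tagged):
    """Find spans of one or more nouns in a row (a crude NP heuristic)."""
    spans, i = [], 0
    while i < len(tagged):
        if tagged[i][1].startswith("NN"):
            j = i
            while j < len(tagged) and tagged[j][1].startswith("NN"):
                j += 1
            spans.append((i, j))
            i = j
        else:
            i += 1
    return spans

def extract_tuples(tagged, max_gap=5):
    """Pair up adjacent NPs that are close enough; require a verb in the
    gap (a stand-in for NELL's tag filters); gap words become the context."""
    spans = noun_phrase_spans(tagged)
    out = []
    for (s1, e1), (s2, e2) in zip(spans, spans[1:]):
        gap = tagged[e1:s2]
        if 0 < len(gap) <= max_gap and any(t.startswith("VB") for _, t in gap):
            arg1 = " ".join(w for w, _ in tagged[s1:e1])
            arg2 = " ".join(w for w, _ in tagged[s2:e2])
            context = " ".join(w for w, _ in gap)
            out.append((arg1, context, arg2))
    return out

print(extract_tuples(tagged))
# [('Toyota', 'is releasing a new', 'Camry')]
```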

Obviously, this could be extended such that the context could include words that are not between the two arguments, or we could use a dependency parse rather than a sequence of POS tags, or all sorts of other techniques.  But that basic approach is what we use for the main English run of NELL that uses ClueWeb as its source corpus.

Does that answer your question?

bki...@cs.cmu.edu

Mehjabin Rahman

Jul 11, 2018, 7:39:45 AM7/11/18
to NELL: Never-Ending Language Learner
Hello,

Thanks for your help. I have a simple question: to execute CPL, I have to POS-tag my corpus first, right?

Thank you

bkisiel

Jul 18, 2018, 10:38:21 AM7/18/18
to NELL: Never-Ending Language Learner
Hi Mehjabin,

If you want to do it the way that we do it for NELL, then yes, you would need to POS-tag the corpus first.

But really you can use any method you want to extract all (NP, context) and (NP, context, NP) tuples from each sentence or document.  We tried using dependency parses rather than POS-tagged sentences once, for instance, but it wasn't clear that the results were better.  In any case, you'd need a way to identify which sequences of words count as an NP, so some sort of word-detection and noun-phrase-detection would be needed.  Possibly you could do that with a big list of noun phrases to match against, for instance.
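The list-matching alternative mentioned at the end could be sketched like this in Python; the `known_nps` set and the longest-match-first rule are illustrative assumptions, not NELL's implementation:

```python
# Hypothetical list of known noun phrases to match against (lowercased).
known_nps = {"toyota", "camry", "new york city", "camry hybrid"}

def find_nps(tokens, max_len=4):
    """Scan left to right, preferring the longest known NP at each
    position, and return (start, end, matched_np) spans."""
    found, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            cand = " ".join(tokens[i:i + n]).lower()
            if cand in known_nps:
                found.append((i, i + n, cand))
                i += n
                break
        else:
            i += 1  # no known NP starts here; move on
    return found

tokens = "Toyota is releasing a new Camry".split()
print(find_nps(tokens))
# [(0, 1, 'toyota'), (5, 6, 'camry')]
```

The trade-off versus POS tagging is that this only ever finds phrases already on the list, so it cannot discover new noun phrases from the corpus.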

Mehjabin Rahman

Aug 7, 2018, 10:44:32 AM8/7/18
to NELL: Never-Ending Language Learner
Hi, bkisiel,
So far you helped me a lot. I am really thankful to you. Would you please explain elaborately how you use "hand-tuned heuristics to identify things that could be an arg1 or arg2 (e.g. one or more nouns in a row, allowing a determiner before the first noun, maybe adjective followed by a noun, etc.)"

Thanks

bkisiel

Aug 23, 2018, 12:56:11 PM8/23/18
to NELL: Never-Ending Language Learner
Hi Mehjabin,

The system in use for English is kind of complicated; it was the product of a number of people who were on the project before I joined, and who did a lot of hand-tweaking to try to produce low-noise output.  I might have a summary explanation sitting in my email outbox from years ago, but either way I'd have to do some digging through email or source code to try to capture the net effect of the pipeline.

While exploring the use of CPL with other sufficiently English-like languages (e.g. Spanish, Portuguese), we took a simplified approach of defining a list of POS tag regexps.  This results in somewhat noisier output, but the CPL (and NELL) of today is less sensitive to noisy input than 10 years ago, and starting out with a more recall-heavy all-pairs extraction turns out to be useful.  Then, if there are very common sorts of noise, special hand-tweaked filters can be added as needed.

So, for instance, two of the regexps for the Portuguese category matrix are:
L,/V /CN
L,/V /CN /PREP /CN
The first one means "take a sequence of nouns to the left (L) of a verb (V) followed by a common noun (CN)"; that sequence of nouns would be the arg1, and the verb and common noun would be the pattern for a single (arg1, pattern) extraction.  The second one looks for a sequence of nouns followed by a verb, then a common noun, then a preposition (PREP), then another common noun.  I believe we have it set to accept a run of up to 5 nouns in a row to count as an arg1 or arg2, but I'd have to double-check that.
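A rough Python sketch of applying the "L,/V /CN" category regexp to one tagged sentence; the Portuguese example sentence, its tags, and the 5-noun cap are illustrative assumptions (tags loosely follow the LX-Parser tagset: PNM = proper noun, V = verb, CN = common noun; NELL's matcher is written in Java and is more general):

```python
def match_L_V_CN(tagged):
    """Apply the 'L,/V /CN' category regexp: a run of nouns immediately
    to the left (L) of a verb (V) followed by a common noun (CN).
    Returns (arg1, pattern) extractions."""
    tags = [t for _, t in tagged]
    out = []
    for i in range(1, len(tags) - 1):
        if tags[i] == "V" and tags[i + 1] == "CN":
            # Collect the run of nouns ending just before the verb,
            # capped at 5 nouns in a row.
            j = i
            while j > 0 and tags[j - 1] in ("CN", "PNM") and i - j < 5:
                j -= 1
            if j < i:
                arg1 = " ".join(w for w, _ in tagged[j:i])
                pattern = " ".join(w for w, _ in tagged[i:i + 2])
                out.append((arg1, pattern))
    return out

# Invented example sentence: "Toyota lança carro" -> /PNM /V /CN
print(match_L_V_CN([("Toyota", "PNM"), ("lança", "V"), ("carro", "CN")]))
# [('Toyota', 'lança carro')]
```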

There are about a hundred of these regexps all told, and I forget where they came from, but it turns out you can get most of the good patterns in an English-like language with about a half-dozen of the most common constructions.  You can then go back by hand and look at the output from the POS tagger to find sequences of POS tags that are commonly missed and expand your set of regexps; or, if you find one of your regexps is too noisy, you can remove it and replace it with a set of more specific regexps.  It's one of those things that just takes some time for somebody to sit down and fiddle with until the output looks pretty good.

Regexps for relations are similar.  Here are two very general ones, again for Portuguese.  (This tagset comes from the LX-Parser package btw):
/V /CJ
/V /DA /CN
The first one looks for a run of noun phrases to be arg1, then a verb followed by a conjunction (CJ) as the pattern, and then another run of noun phrases to be arg2.  The second one does the same thing, but this time the pattern should be a verb followed by a definite article (DA) followed by a common noun.
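The relation case can be sketched the same way; here is an illustrative Python version of the "/V /DA /CN" relation regexp, again with an invented Portuguese sentence and tags loosely following the LX-Parser tagset (this is an assumption-laden sketch, not NELL's code):

```python
def match_rel_V_DA_CN(tagged):
    """Apply the '/V /DA /CN' relation regexp: a run of nouns as arg1,
    then verb + definite article + common noun as the context pattern,
    then another run of nouns as arg2."""
    nouns = ("CN", "PNM")
    tags = [t for _, t in tagged]
    out = []
    for i in range(1, len(tags) - 3):
        if tags[i:i + 3] == ["V", "DA", "CN"] \
                and tags[i - 1] in nouns and tags[i + 3] in nouns:
            # Expand the noun runs on both sides, capped at 5 tokens each.
            j = i
            while j > 0 and tags[j - 1] in nouns and i - j < 5:
                j -= 1
            k = i + 3
            while k < len(tags) and tags[k] in nouns and k - (i + 3) < 5:
                k += 1
            arg1 = " ".join(w for w, _ in tagged[j:i])
            ctx = " ".join(w for w, _ in tagged[i:i + 3])
            arg2 = " ".join(w for w, _ in tagged[i + 3:k])
            out.append((arg1, ctx, arg2))
    return out

# Invented example: "Toyota lançou o modelo Camry" -> /PNM /V /DA /CN /PNM
sent = [("Toyota", "PNM"), ("lançou", "V"), ("o", "DA"),
        ("modelo", "CN"), ("Camry", "PNM")]
print(match_rel_V_DA_CN(sent))
# [('Toyota', 'lançou o modelo', 'Camry')]
```

Each match yields an (arg1, context, arg2) tuple that would feed the co-occurrence matrix described earlier in the thread.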

I'm not all that much of an NLP guy myself, and I'm drawing a blank on suggesting where to look, but there are studies out there that identify the most common constructions in various languages.  Those might be a helpful starting point if you don't want to just look at a bunch of POS-tagged sentences and try to eyeball out a list of the 10 or 20 most common things you see, and then look at the output on another batch of sentences to see what was commonly missed and what picked up a lot of noise.
