finding names in Common Crawl
From: Lisa Green <l...@commoncrawl.org>
Date: Mon, 20 Aug 2012 15:07:58 -0700 (PDT)
Local: Mon, Aug 20 2012 6:07 pm
Subject: finding names in Common Crawl

Has everyone seen this piece by Mat
Kelcey? http://matpalm.com/blog/2012/08/18/finding_names_in_common_crawl/

finding names in common crawl <http://matpalm.com/blog/2012/08/18/finding_names_in_common_crawl>
August 18, 2012 at 08:00 PM | categories: common-crawl <http://matpalm.com/blog/tag/common-crawl>,
quick-hack <http://matpalm.com/blog/tag/quick-hack>, nlp <http://matpalm.com/blog/tag/nlp>,
nltk <http://matpalm.com/blog/tag/nltk>, noun-phrases <http://matpalm.com/blog/tag/noun-phrases>

the central offering from common crawl <http://commoncrawl.org/> is the raw
bytes they've downloaded and, though this is useful for some people, a lot
of us just want the visible text of web pages. luckily they've done this
extraction as a part of post processing the crawl and it's freely available
too!

getting the data

the first thing we need to do is determine which segments of the crawl are
valid and ready for use (the crawl is always ongoing)

$ s3cmd get s3://aws-publicdatasets/common-crawl/parse-output/valid_segments.txt
$ head -n3 valid_segments.txt
1341690147253
1341690148298
1341690149519

given these segment ids we can look up the related textData objects.

if you just want one, grab its name using something like ...

$ s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/ 2>/dev/null \
 | grep textData | head -n1 | awk '{print $4}'
s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/textData-00000

but if you want the lot you can get the listing with ...

$ cat valid_segments.txt \
 | xargs -I{} s3cmd ls s3://aws-publicdatasets/common-crawl/parse-output/segment/{}/ \
 | grep textData | awk '{print $4}' > all_valid_segments.tsv

( note: this listing is roughly 200,000 textData files and takes a while to
fetch )
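
if you'd rather post process the listing in python, the grep / awk step can be sketched roughly as below. this is only a minimal sketch: it assumes each `s3cmd ls` line carries the object path as its fourth whitespace separated field (which is what the `awk '{print $4}'` above relies on), and the function name and the sample date / size values are made up for illustration.

```python
def text_data_paths(listing_lines):
    """Return the s3 paths of textData objects from `s3cmd ls` output lines."""
    paths = []
    for line in listing_lines:
        fields = line.split()
        # roughly `grep textData | awk '{print $4}'`, but only checking the path field
        if len(fields) >= 4 and "textData" in fields[3]:
            paths.append(fields[3])
    return paths


# made up listing lines; only the layout (date, time, size, path) matters here
sample = [
    "2012-07-08 01:23 1234 s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/textData-00000",
    "2012-07-08 01:23 5678 s3://aws-publicdatasets/common-crawl/parse-output/segment/1341690147253/something-else",
]
for path in text_data_paths(sample):
    print(path)
```

lines with fewer than four fields (directory entries, for example) are simply skipped.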

each textData file is a hadoop sequence file, the key being the crawled
url and the value being the extracted visible text.

to have a quick look at one you can get hadoop to dump the sequence file
contents with ...

$ hadoop fs -text textData-00000 | less
http://webprofessionals.org/intel-to-acquire-mcafee-moving-into-onlin...       Web Professionals
Professional association for web designers, developers, marketers, analysts and other web professionals.
Home
...
The company’s share price has fallen about 20 percent in the last five years, closing on Wednesday at $19.59 a share.
Intel, however, has been bulking up its software arsenal. Last year, it bought Wind River for $884 million, giving it a software maker with a presence in the consumer electronics and wireless markets.
With McAfee, Intel will take hold of a company that sells antivirus software to consumers and businesses and a suite of more sophisticated security products and services aimed at corporations.

( note: the visible text is broken into *one line* per block element from
the original html. as such the value in the key/value pairs includes
carriage returns and, in something like less, is displayed as
separate lines )
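
because of this a text dump can't be split into records one line at a time. a rough way to regroup it, sketched below, is to treat a line as the start of a new record only when it looks like `url<TAB>text`. this is a heuristic and rests on two assumptions: that the dump separates key and value with a tab, and that no line of page text itself starts with a url followed by a tab. the function name is made up.

```python
def split_records(dump_lines):
    """Group the lines of a sequence file text dump into (url, text) pairs."""
    records = []
    url, text_lines = None, []
    for line in dump_lines:
        line = line.rstrip("\n")
        key, tab, rest = line.partition("\t")
        if tab and key.startswith(("http://", "https://")):
            # a new key/value pair starts here; flush the previous one
            if url is not None:
                records.append((url, "\n".join(text_lines)))
            url, text_lines = key, [rest]
        elif url is not None:
            # continuation line: one more block element of the current page
            text_lines.append(line)
    if url is not None:
        records.append((url, "\n".join(text_lines)))
    return records
```

working on the sequence file directly (rather than on the text dump) avoids the guesswork, since there the key/value boundaries are explicit.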

extracting noun phrases

now that we have some text, what can we do with it? one thing is to look
for noun phrases, and the quickest, simplest way is to use something like the
python natural language toolkit <http://nltk.org/>. it's certainly not the
fastest to run, but for most people it would be the quickest to get going.

extract_noun_phrases.py<https://github.com/matpalm/common-crawl-quick-hacks/blob/master/findi...> is
an example of doing sentence then word tokenisation, pos tagging and
finally noun chunk phrase extraction.

given the text ...

Last year, Microsoft bought Wind River for $884 million. This makes it the largest software maker with a presence in North Kanada.

it extracts the noun phrases ...

Microsoft
Wind River
North Kanada
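
to give a feel for the last step (chunking), here is a minimal sketch that joins runs of consecutive proper noun tokens into phrases. it is not the linked script, which may well use a different chunk grammar. to keep it free of model downloads it starts from already tagged tokens; with nltk these would typically come from `nltk.pos_tag(nltk.word_tokenize(sentence))`. the tags below were assigned by hand and a real tagger may disagree.

```python
def proper_noun_phrases(tagged_tokens):
    """Join runs of consecutive proper noun tokens (NNP / NNPS) into phrases."""
    phrases, current = [], []
    for word, tag in tagged_tokens:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        else:
            # any other tag ends the current run
            if current:
                phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# hand tagged version of the first example sentence (tags are an assumption)
tagged = [
    ("Last", "JJ"), ("year", "NN"), (",", ","), ("Microsoft", "NNP"),
    ("bought", "VBD"), ("Wind", "NNP"), ("River", "NNP"), ("for", "IN"),
    ("$", "$"), ("884", "CD"), ("million", "CD"), (".", "."),
]
print(proper_noun_phrases(tagged))  # ['Microsoft', 'Wind River']
```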

to run this at larger scale we can wrap it in a simple streaming job

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
 -input textDataFiles \
 -output counts \
 -mapper extract_noun_phrases.py \
 -reducer aggregate \
 -file extract_noun_phrases.py
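
for the `aggregate` reducer to sum anything, the mapper has to emit records in the form it expects: a `LongValueSum:` prefix on the key, a tab, then the value to add. a minimal sketch of that output side is below. the function names are made up, and `extract_phrases` is a placeholder for whatever noun phrase extraction you plug in; how the linked script does this is not shown here.

```python
import io
import sys


def aggregate_records(phrases):
    """Format phrases as records for the streaming `aggregate` reducer."""
    # the LongValueSum prefix asks the reducer to sum the values per key
    return ["LongValueSum:%s\t1" % phrase for phrase in phrases]


def run_mapper(lines, extract_phrases, out=sys.stdout):
    """Write one aggregate record per phrase found in each input line."""
    for line in lines:
        for record in aggregate_records(extract_phrases(line)):
            out.write(record + "\n")


# tiny local check with a stand in extractor that keeps capitalised words
demo_out = io.StringIO()
run_mapper(
    ["Intel bought McAfee"],
    lambda text: [w for w in text.split() if w[:1].isupper()],
    demo_out,
)
print(demo_out.getvalue(), end="")
```

in a real streaming job the mapper would be called as `run_mapper(sys.stdin, ...)`, writing to stdout.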

running it across a small 50mb sample of textData files, the top noun phrases
extracted are ...

rank  phrase     freq
1     Posted     10094
2     November   9597
3     February   9553
4     Copyright  8929
5     September  8726
6     January    8709
7     April      8434
8     August     8307
9     October    7963
10    December   7963
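
a ranking like this is just the job output sorted by frequency. a minimal sketch, assuming the output is available as `phrase<TAB>count` lines (the exact output layout of the job is an assumption, and the function name is made up):

```python
def top_phrases(count_lines, n=10):
    """Rank `phrase<TAB>count` lines by count, highest first."""
    counts = []
    for line in count_lines:
        # split on the last tab so the phrase is kept whole
        phrase, _, freq = line.rstrip("\n").rpartition("\t")
        counts.append((phrase, int(freq)))
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:n]


# three rows taken from the results in this post, deliberately out of order
rows = ["November\t9597", "Posted\t10094", "February\t9553"]
for rank, (phrase, freq) in enumerate(top_phrases(rows), start=1):
    print(rank, phrase, freq)
```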

this is not terribly interesting; the main thing going on here is
that these are being extracted from the boilerplate of the pages. one
tough problem when dealing with visible text on a web page is that it might
be visible but that doesn't mean it's relevant to the actual content of
the page. here we see 'posted' and 'copyright'; we're just extracting the
chrome of the page.

check out the full list of values with freq >= 20 here<https://github.com/matpalm/common-crawl-quick-hacks/blob/master/findi...>;
there are some more interesting ones a bit later.

notes

so it's fun to look at noun phrases but i've actually glossed over some key
details here

   - not filtering on english text first generates a *lot* of "noise".
   "G úûv ÝT M", "U ŠDú T" and "Y CKdñˆô" are not terribly interesting english
   noun phrases.
   - running this at scale you'd probably want to change from streaming and
   start using an in process java library like the stanford parser <http://nlp.stanford.edu/software/lex-parser.shtml>
   - when it comes to actually doing named entity recognition it's a bit
   more complex. there's a wavii blog post <http://blog.wavii.com/2012/08/16/bush-is-back/>
   from manish <https://twitter.com/mkbubba> that talks a bit more about
   it.
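
on the first point, even a very crude filter removes noise like the examples above. the sketch below keeps a phrase only if most of its non space characters are plain ascii letters. it is a stand in for proper language detection, not a replacement: the function name and the 0.8 threshold are arbitrary assumptions, and it will happily pass other languages that are written mostly in unaccented latin letters.

```python
import string


def looks_like_english(text, min_ascii_letter_ratio=0.8):
    """Crude check: are most non space characters plain ascii letters?"""
    non_space = [c for c in text if not c.isspace()]
    if not non_space:
        return False
    ascii_letters = sum(1 for c in non_space if c in string.ascii_letters)
    return ascii_letters / len(non_space) >= min_ascii_letter_ratio


for phrase in ["Wind River", "G úûv ÝT M", "U ŠDú T", "Y CKdñˆô"]:
    print(repr(phrase), looks_like_english(phrase))
```

only "Wind River" passes here; the three noise phrases fall well below the threshold.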


 