Commoncrawl mapreduce jobs using PHP how-to


fightsw...@gmail.com

Apr 7, 2013, 6:25:19 AM
to common...@googlegroups.com
Hi,
For those not familiar with Java, Python, etc., I wrote a short how-to article about using Hadoop streaming with PHP scripts.
Maybe it will help somebody ;)
http://www.fightswithbytes.com/2013/04/05/sample-wordcount-streaming-job-using-php-on-commoncrawl-dataset/

Miso

Dave Lester

Apr 14, 2013, 1:24:44 PM
to common...@googlegroups.com
Awesome, I'm sure this example will help a number of folks. I went ahead and linked to it in the Helpful Guides section of the Common Crawl wiki: https://commoncrawl.atlassian.net/wiki/display/CRWL/Helpful+Guides+and+Links

Cheers,
Dave



--
You received this message because you are subscribed to the Google Groups "Common Crawl" group.
To unsubscribe from this group and stop receiving emails from it, send an email to common-crawl...@googlegroups.com.
To post to this group, send email to common...@googlegroups.com.
Visit this group at http://groups.google.com/group/common-crawl?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

Lisa Green

Apr 14, 2013, 1:56:19 PM
to common...@googlegroups.com
Miso

Thanks for sharing this! We are collecting tutorials for an improved "Get Started" page on the Common Crawl website and would love to include your example.

Thank you,
Lisa

Usman Shahid

Jun 6, 2016, 11:58:30 PM
to Common Crawl
Does anyone have a copy of this? It appears that the site has been compromised. Neither Python nor Java is my area of expertise (and I'm sure that's true for many), so having some sort of library for this in PHP would be incredibly helpful.

Greg Lindahl

Jun 7, 2016, 12:11:02 AM
to common...@googlegroups.com
Here's an archived copy of the page:

https://web.archive.org/web/20160430094419/http://www.fightswithbytes.com/2013/04/05/sample-wordcount-streaming-job-using-php-on-commoncrawl-dataset/

Did you look around the CommonCrawl website to see if there's a
tutorial there for PHP?

-- greg

Sebastian Nagel

Jun 7, 2016, 10:56:24 AM
to common...@googlegroups.com
Hi,

the tutorial from fightswithbytes is referenced here:
http://forums.phpfreaks.com/topic/284999-php-and-common-crawl-on-aws-amazon-web-services/
Maybe there are useful comments.

We are in the process of updating the examples and tutorials on the Common Crawl website.
Unfortunately, I'm not aware of newer examples written in PHP.

Sebastian

Paulius Rimavičius

Jun 27, 2016, 9:43:17 AM
to Common Crawl
Hi,

I just wrote an article on my experience using Common Crawl with PHP on Amazon AWS.

In the article, I explain the basics of how to start using Common Crawl with PHP.

Paulius

James Chmielinski

Jul 11, 2016, 9:16:04 PM
to Common Crawl
Hi Everyone,


Does anyone know how to use Common Crawl to extract LinkedIn profiles for recruiting? I have a search engine for finding talent and wanted to see whether this would be a good solution.

Please contact me at Ja...@Veruca.io or reply to this thread.  I have a budget for this project if anyone is interested in helping me out.

Thanks!

James Chmielinski

Ivan Habernal

Jul 12, 2016, 2:27:10 AM
to Common Crawl
Hi James,

 
> Does anyone know how to use Common Crawl in order to extract LinkedIn profiles for recruiting? I have a search engine to find talent and I wanted to see if this was a good solution....?

James Chmielinski

Jul 12, 2016, 2:30:48 AM
to common...@googlegroups.com
Hi Ivan,

Help me.  How can I learn more about doing this?  I'm not that technical.

Are you free for a Skype?

James

Ivan Habernal

Jul 12, 2016, 2:49:25 AM
to Common Crawl
Hi James,
 
> Help me. How can I learn more about doing this? I'm not that technical.
> Are you free for a Skype?

No, I'm sorry.

I think your topic is related to scraping a particular website rather than a "shallow" crawl of CommonCrawl. Maybe you might find someone in their respective forums...

All the best,

Ivan

Sebastian Nagel

Jul 12, 2016, 2:58:02 AM
to Common Crawl
Hi James, hi Ivan,

The problem is simple: LinkedIn does not allow any bots, except explicitly whitelisted ones,
to crawl most of its site:
   User-agent: *
   Disallow: /

See https://www.linkedin.com/robots.txt and the notice there about how to get
whitelisted. Common Crawl currently has no agreement with LinkedIn that would
allow it to crawl their content.
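
As a rough illustration of what those two rules mean for a crawler, here is a minimal sketch using Python's standard urllib.robotparser module. It is not how Common Crawl's crawler is implemented. CCBot is Common Crawl's user agent; ExampleBot, the whitelist rules, and the profile URL are made-up names used only for this example.

```python
from urllib.robotparser import RobotFileParser

# The two rules quoted above, as they would appear in a robots.txt file.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
"""

# Hypothetical variant with one whitelisted bot ("ExampleBot" is made up).
WHITELIST_TXT = """\
User-agent: ExampleBot
Allow: /

User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network access is needed here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://www.linkedin.com/in/some-profile"  # made-up URL
    print("CCBot, catch-all rules:", is_allowed(ROBOTS_TXT, "CCBot", url))
    print("ExampleBot, whitelisted:", is_allowed(WHITELIST_TXT, "ExampleBot", url))
    print("CCBot, not whitelisted:", is_allowed(WHITELIST_TXT, "CCBot", url))
```

Under the catch-all rules, any bot without its own entry is refused every path, which is why those pages do not end up in the crawl.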

Best,
Sebastian

James Chmielinski

Jul 13, 2016, 3:10:09 PM
to common...@googlegroups.com
What other sources via common crawl are available to aggregate talent profiles and insights about those candidates?

This data has to be available somewhere, right? Any ideas?

I still have budget for developing this project.  

It would help me build this: http://veruca.io

Anyone interested?

James

Live Long + Prospect,  


James Chmielinski, CEO + Cofounder
