Create Map/Reduce jobs


David Gkogkritsiani

Jun 4, 2013, 10:33:20 AM
to chenn...@googlegroups.com

Hi Folks,
I want to create two MapReduce jobs. (I have already written the first one, but it doesn't do what I want.) In the first job, I want to export the titles (<title>...</title>) of all URLs whose pages contain 3 or more "a". I have stored the contents of each URL in a separate file, and each file is stored locally on my HDD in text format.
In the second job, I want to search the pages for a word (e.g. "car") and display the URLs whose pages contain that word.
I am attaching the MapReduce code that I have started writing.

In essence:
For the first functionality, I want to extract and display, for each file, the title (<title>...</title>) of every URL that has 3 or more "a", together with the file name.
For the second functionality, I want to display a given word (e.g. "car") for each URL whose content contains it, together with the file name.

Thanks in advance!

Ashwanth Kumar

Jun 4, 2013, 9:51:40 PM
to chenn...@googlegroups.com
Does the second job depend on the output of the first job? 

For 1, you can solve this with a map-only job; you don't need a full MapReduce job. First count the occurrences of "<a>", and if the count meets the required threshold, emit the <title> from the page. Also, given that each page is stored in its own file (which is really not recommended; read this blog post on the small files problem on HDFS), you will end up with as many mappers as there are files on HDFS, and each mapper will receive an entire file as its input, since splits are made per block or per individual file. (With the default TextInputFormat, map() is still called once per line, so you need a whole-file input format to see the full page in a single call.)

PS: Read this for why you should not use regex for HTML parsing. Use a library like JSoup instead; it makes your life much easier.
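
To make the map-only idea for job 1 concrete, here is a minimal sketch of the per-page logic in plain Java. It is only an illustration under a few assumptions: the class name TitleFilter, the method names, and the tab-separated output are all made up for this example, "3 or more a" is read as three or more opening <a> tags, and the whole page is assumed to arrive as one string. To keep the sketch free of dependencies it uses simple string scanning; in a real job a proper HTML parser such as JSoup would be the safer choice.

```java
import java.util.Optional;

// Hypothetical helper: the logic a map-only job could run once per page.
final class TitleFilter {
    // Assumed threshold, taken from the "3 or more" in the question.
    static final int MIN_ANCHORS = 3;

    // Case-insensitive indexOf that never changes string length or indices.
    static int indexOfIgnoreCase(String s, String needle, int from) {
        for (int i = Math.max(from, 0); i + needle.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, needle, 0, needle.length())) {
                return i;
            }
        }
        return -1;
    }

    // Counts opening anchor tags: "<a>" or "<a" followed by whitespace.
    // Tags such as <abbr> and closing </a> tags are not counted.
    static int countAnchors(String html) {
        int count = 0;
        int idx = indexOfIgnoreCase(html, "<a", 0);
        while (idx >= 0) {
            int next = idx + 2;
            if (next < html.length()) {
                char c = html.charAt(next);
                if (c == '>' || Character.isWhitespace(c)) {
                    count++;
                }
            }
            idx = indexOfIgnoreCase(html, "<a", next);
        }
        return count;
    }

    // Returns the trimmed text of the first <title> element, if there is one.
    static Optional<String> extractTitle(String html) {
        int open = indexOfIgnoreCase(html, "<title", 0);
        if (open < 0) return Optional.empty();
        int start = html.indexOf('>', open);
        if (start < 0) return Optional.empty();
        int end = indexOfIgnoreCase(html, "</title", start);
        if (end < 0) return Optional.empty();
        return Optional.of(html.substring(start + 1, end).trim());
    }

    // Emits "fileName<TAB>title" only when the page has enough anchors.
    static Optional<String> mapRecord(String fileName, String html) {
        if (countAnchors(html) < MIN_ANCHORS) return Optional.empty();
        return extractTitle(html).map(title -> fileName + "\t" + title);
    }
}
```

In a Hadoop job, something like mapRecord would be called from the mapper's map() method, and calling setNumReduceTasks(0) on the Job makes it map-only, so whatever the mappers emit is written straight to the output.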




--

Ashwanth Kumar / ashwanthkumar.in
