Architectural setup for Hadoop processed data to be served to front end


kumar vaibhav

Oct 7, 2014, 10:07:44 AM
to chenn...@googlegroups.com

So this question is regarding serving Hadoop processed data to the front end of some web app. I need some recommendations on setting up such architecture.

Details: I will be doing a lot of text analysis (sentiment analysis, feature extraction, theme extraction, etc.). Data will be ingested into HDFS once a week, maybe more often, but definitely not daily. I don't think firing an MR job on the click of a button and then showing the processed data on the web app is an appropriate approach. What should the middle layer be, such that at any point in time only the most recent processed data is shown to the user? HBase (or some other NoSQL)?

Please note that I am new to Hadoop and NoSQL (but that should not be an issue) :)

Should I be looking into the Lambda architecture, or should I just use Spark and forget about any middle layer, or should I be using Storm? It would be awesome if the community could direct me to a simple working example of something similar to what I am asking above.
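To make the "only the most recent processed data" requirement concrete, here is the kind of pattern I am imagining (all names here are made up, and plain local directories stand in for the real HDFS output or NoSQL table): each weekly run publishes its results into a fresh timestamped directory and then repoints a `current` symlink at it, so the web app only ever reads through `current`.

```shell
#!/bin/sh
# Sketch only: local directories stand in for the real HDFS output
# or NoSQL table. All paths and names are made up for illustration.
set -e

BASE=/tmp/analysis_results            # hypothetical results area
RUN="$BASE/run_$(date +%s)_$$"        # one directory per weekly batch
mkdir -p "$RUN"

# In the real setup this file would come from the MR job's output
# (e.g. via `hadoop fs -getmerge`); here we fake a processed result
# so the sketch is runnable end to end.
echo "sentiment=positive themes=pricing,support" > "$RUN/summary.txt"

# Repoint "current" at the newest run. The web app always reads
# $BASE/current/..., so users only ever see the latest batch.
ln -sfn "$RUN" "$BASE/current"

cat "$BASE/current/summary.txt"
```

Run it once a week after each batch finishes and the front end never has to know about individual runs; for a strictly atomic swap one would create a fresh symlink and rename it over the old one instead of `ln -sfn`.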

Thanks & Regards, 

Vaibhav

Subrata Biswas

Oct 7, 2014, 1:12:45 PM
to chenn...@googlegroups.com
What I feel is that the approach below may work for you; I do not know the volume of output data from your MapReduce job.
First, package your MapReduce code as a .jar if it is written in Java (this is not required for Pig or Hive scripts).
Then create a shell script which will:
- launch the MapReduce job;
- move the output from HDFS into the Linux file system using the hadoop get command (make sure you do not forget to merge all the output part files into a single file while moving them, e.g. with hadoop fs -getmerge);
- record the result file in a local MySQL table or some other metadata store, so the result files can be listed in the web app UI, perhaps by job name, date, or an alias name. On clicking one, the app picks up and displays the data in the web UI (you need to write the business logic for how to display it).

Second approach: if you are going to process a small amount of data, or stream data in real time, pass it through Storm on its way into HDFS, and make sure you have the proper business logic in your Storm bolt to validate the data you are trying to process. If on any breach you want to alert in the web UI or on mobile, that is server-side push rather than the traditional web-app response to a request; you can achieve it using WebSocketServlet.

Not sure if this works for you; it is hard to say without proper info on your POC or project.

Regards
Subrata

--
You received this message because you are subscribed to the Google Groups "Hadoop Users Group (HUG) Chennai" group.

kumar vaibhav

Oct 8, 2014, 2:08:15 AM
to chenn...@googlegroups.com
Thanks for your reply, Subrata. The amount of data processed will be large: gigabytes every week. The amount of output data, IMO, can easily be put in some RDBMS or NoSQL; no issue with that. Passing the incoming data straight to Storm doesn't seem like a good idea. I have been thinking along the same lines as you suggested. I will keep you posted about the decisions made and why they were made.

Thanks again for your reply. Looking for more responses.

Subrata Biswas

Oct 9, 2014, 1:31:24 AM
to chenn...@googlegroups.com
Thanks. Sure, please do not forget to share the approach you end up following.
I have used both approaches in some of my past projects.

Regards
Subrata