Thank you. Great help.
> If you want the results to come out instantly Map/Reduce is not the proper
> choice. Map/Reduce is designed for batch processing. It can do small
> batches, but the overhead of launching the Map/Reduce jobs can be very high
> compared to the amount of processing you are doing. I personally would
> look into using either Storm, S4, or some other realtime stream processing
> framework. From what you have said it sounds like you probably want to use
> Storm, as it can be used to guarantee that each event is processed once and
> only once. You can also store your results into HDFS if you want, perhaps
> through HBase, if you need to do further processing on the data.
> --Bobby Evans
> On 5/22/12 5:02 AM, "Zhiwei Lin" <zhiwei...@gmail.com> wrote:
> Hi Robert,
> Thank you.
> > How quickly do you have to get the result out once the new data is added?
> If possible, I hope to get the result instantly.
> > How far back in time do you have to look for BBBB from the occurrence of
> > bbbb?
> The time window is not constant. It depends on the last occurrence of BBBB
> before bbbb. So, I need to look up the history to find the last BBBB
> in this case.
> > Do you have to do this for all combinations of values or is it just a small
> > subset of values?
> I think this depends on the time of the last occurrence of BBBB in the
> history. If BBBB occurs rarely, then older data has to be taken into
> account.
> Definitely, I think HDFS is a good place to store the data I have (the
> daily log is over 1 GB). But I am not sure whether Map/Reduce can help with
> the problem I described.
> Zhiwei
> On 21 May 2012 22:07, Robert Evans <ev...@yahoo-inc.com> wrote:
> > Zhiwei,
> > How quickly do you have to get the result out once the new data is added?
> > How far back in time do you have to look for BBBB from the occurrence of
> > bbbb? Do you have to do this for all combinations of values or is it just
> > a small subset of values?
> > --Bobby Evans
> > On 5/21/12 3:01 PM, "Zhiwei Lin" <zhiwei...@gmail.com> wrote:
> > I have a large volume of streaming log data. Each data record contains a
> > timestamp, which is very important to the analysis.
> > For example, I have data format like this:
> > (1) 20:30:21 01/April/2012 AAAAA.............
> > (2) 20:30:51 01/April/2012 BBBB.............
> > (3) 21:30:21 01/April/2012 bbbb.............
> > Moreover, new data comes every few minutes.
> > I have to calculate the probability of the occurrence of "bbbb" given the
> > occurrence of "BBBB" (where BBBB occurs earlier than bbbb). So, it is
> > really time-dependent.
> > I wonder if Hadoop is the right platform for this job? Is there any
> > package available for this kind of work?
> > Thank you.
> > Zhiwei
> --
> Best wishes.
> Zhiwei
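
For anyone finding this thread later, here is a minimal sketch of the per-event bookkeeping discussed above, independent of whichever framework delivers the events. It is not from the original posters and rests on several assumptions: records arrive in timestamp order, each line looks like "20:30:51 01/April/2012 BBBB..." without the "(1)" numbering, and the probability is estimated as the fraction of BBBB occurrences that are followed by at least one bbbb before the next BBBB. The class name FollowEstimator and its methods are hypothetical. Note that %B parses English month names only under an English or C locale.

```python
from datetime import datetime


class FollowEstimator:
    """Running estimate of P(target follows trigger), one record at a time."""

    def __init__(self, trigger="BBBB", target="bbbb"):
        self.trigger = trigger
        self.target = target
        self.trigger_count = 0         # number of trigger events seen
        self.followed_count = 0        # triggers followed by a target
        self.last_trigger_time = None  # timestamp of the most recent trigger
        self.pending = False           # latest trigger not yet followed

    def update(self, line):
        """Consume one record. Returns the gap (timedelta) back to the last
        trigger when the record is a target and a trigger was seen, else None."""
        time_part, date_part, payload = line.split(None, 2)
        ts = datetime.strptime(time_part + " " + date_part, "%H:%M:%S %d/%B/%Y")
        if payload.startswith(self.trigger):
            self.trigger_count += 1
            self.last_trigger_time = ts
            self.pending = True
            return None
        if payload.startswith(self.target) and self.last_trigger_time is not None:
            if self.pending:
                # Count each trigger at most once, however many targets follow.
                self.followed_count += 1
                self.pending = False
            return ts - self.last_trigger_time
        return None

    def probability(self):
        """Current estimate, or None before any trigger has been seen."""
        if self.trigger_count == 0:
            return None
        return self.followed_count / self.trigger_count


est = FollowEstimator()
lines = [
    "20:30:21 01/April/2012 AAAAA.............",
    "20:30:51 01/April/2012 BBBB.............",
    "21:30:21 01/April/2012 bbbb.............",
    "22:00:00 01/April/2012 BBBB.............",
]
gaps = [est.update(line) for line in lines]
print(gaps[2], est.probability())
```

Because only the last BBBB timestamp and two counters are kept, the state per BBBB/bbbb pair is constant in size, so there is no need to rescan the history when new data arrives. If a different definition of the probability is wanted (for example, bbbb within a fixed window after BBBB), the update rule would have to change accordingly.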