Stream data processing
From: Zhiwei Lin <zhiwei...@gmail.com>
Date: Mon, 21 May 2012 21:01:33 +0100
Local: Mon, May 21 2012 4:01 pm
Subject: Stream data processing

I have a large volume of streaming log data. Each record contains a
timestamp, which is very important to the analysis.
For example, I have data format like this:
(1) 20:30:21 01/April/2012    AAAAA.............
(2) 20:30:51 01/April/2012    BBBB.............
(3) 21:30:21 01/April/2012    bbbb.............

Moreover, new data comes every few minutes.
I have to calculate the probability of the occurrence of "bbbb" given the
occurrence of "BBBB" (where BBBB occurs earlier than bbbb), so the analysis
is really time-dependent.
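
As a rough illustration of the calculation described above, here is a minimal Python sketch. It rests on assumptions that are not in the original post: records look like "HH:MM:SS DD/Month/YYYY    LABEL" with an English month name and no leading index, and "probability of bbbb given BBBB" is read as the fraction of BBBB events that are followed by a bbbb before the next BBBB. The names parse_record and conditional_probability are hypothetical.

```python
from datetime import datetime

def parse_record(line):
    """Split 'HH:MM:SS DD/Month/YYYY   LABEL' into (timestamp, label)."""
    time_part, date_part, label = line.split(None, 2)
    # %B expects a full month name in the current locale (English assumed).
    stamp = datetime.strptime(f"{time_part} {date_part}", "%H:%M:%S %d/%B/%Y")
    return stamp, label.strip()

def conditional_probability(records, cause="BBBB", effect="bbbb"):
    """Assumed definition: share of `cause` events followed by `effect`
    before the next `cause` occurs."""
    causes = 0
    followed = 0
    pending = False  # True while the latest cause has not been followed yet
    for _, label in sorted(records):  # process in timestamp order
        if label == cause:
            causes += 1
            pending = True
        elif label == effect and pending:
            followed += 1
            pending = False
    return followed / causes if causes else None

lines = [
    "20:30:21 01/April/2012    AAAAA",
    "20:30:51 01/April/2012    BBBB",
    "21:30:21 01/April/2012    bbbb",
    "22:00:00 01/April/2012    BBBB",
]
records = [parse_record(line) for line in lines]
result = conditional_probability(records)
print(result)
```

Other definitions (for example, requiring bbbb within a fixed time window) would change only the loop body.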

I wonder whether Hadoop is the right platform for this job. Is there any
package available for this kind of work?

Thank you.

Zhiwei


 
From: Robert Evans <ev...@yahoo-inc.com>
Date: Mon, 21 May 2012 16:07:40 -0500
Local: Mon, May 21 2012 5:07 pm
Subject: Re: Stream data processing

Zhiwei,

How quickly do you have to get the result out once the new data is added?  How far back in time do you have to look for BBBB from the occurrence of bbbb?  Do you have to do this for all combinations of values or is it just a small subset of values?

--Bobby Evans



 
From: Zhiwei Lin <zhiwei...@gmail.com>
Date: Tue, 22 May 2012 11:02:40 +0100
Local: Tues, May 22 2012 6:02 am
Subject: Re: Stream data processing

Hi Robert,
Thank you.
> How quickly do you have to get the result out once the new data is added?
If possible, I hope to get the result instantly.

> How far back in time do you have to look for BBBB from the occurrence of
> bbbb?
The time window is not constant. It depends on the last occurrence of BBBB
before bbbb, so I need to look through the history to find that BBBB.
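
To make the lookup concrete, here is a small Python sketch of finding the last BBBB before a given bbbb. It assumes the BBBB timestamps are already held in a sorted list in memory, which is an illustration only and not something stated in this thread; last_before is a hypothetical helper name.

```python
from bisect import bisect_left
from datetime import datetime

def last_before(sorted_times, when):
    """Return the latest timestamp strictly before `when`, or None."""
    i = bisect_left(sorted_times, when)
    return sorted_times[i - 1] if i else None

# Assumed in-memory history of BBBB timestamps, kept in ascending order.
cause_times = [
    datetime(2012, 4, 1, 20, 30, 51),
    datetime(2012, 4, 1, 22, 0, 0),
]
effect_time = datetime(2012, 4, 1, 21, 30, 21)  # when bbbb occurred

previous = last_before(cause_times, effect_time)
gap = effect_time - previous  # variable-length window back to the last BBBB
print(previous, gap)
```

The binary search keeps each lookup logarithmic in the number of stored BBBB events, however far back the last one lies.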

> Do you have to do this for all combinations of values or is it just a small
> subset of values?
I think this depends on when BBBB last occurred in the history. If BBBB
occurs rarely, then older data has to be taken into account.

HDFS is definitely a good place to store the data I have (the daily log is
over 1 GB), but I am not sure whether Map/Reduce can help with the problem
described above.

Zhiwei


--

Best wishes.

Zhiwei


 
From: Robert Evans <ev...@yahoo-inc.com>
Date: Tue, 22 May 2012 08:52:18 -0500
Local: Tues, May 22 2012 9:52 am
Subject: Re: Stream data processing

If you want the results to come out instantly, Map/Reduce is not the right choice.  Map/Reduce is designed for batch processing.  It can handle small batches, but the overhead of launching the map/reduce jobs can be very high compared to the amount of processing you are doing.  I personally would look into Storm, S4, or some other real-time stream processing framework.  From what you have said, it sounds like you probably want Storm, as it can be used to guarantee that each event is processed once and only once.  You can also store your results in HDFS, perhaps through HBase, if you need to do further processing on the data.
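
To illustrate the difference from a batch job, here is a minimal Python sketch of per-event processing. It is plain Python, not Storm's or S4's API; in Storm, logic of this kind would typically live inside a bolt. The class name PairTracker is hypothetical, events are assumed to arrive in timestamp order, and the estimate is assumed to be the fraction of BBBB events followed by a bbbb before the next BBBB.

```python
class PairTracker:
    """Keeps running counts so each new event is handled in constant time."""

    def __init__(self, cause="BBBB", effect="bbbb"):
        self.cause = cause
        self.effect = effect
        self.causes = 0              # number of cause events seen so far
        self.followed = 0            # causes later followed by an effect
        self.last_cause_time = None  # timestamp of the most recent cause
        self.pending = False         # latest cause not yet followed

    def process(self, stamp, label):
        """Update state with one event; return the current estimate or None."""
        if label == self.cause:
            self.causes += 1
            self.last_cause_time = stamp
            self.pending = True
        elif label == self.effect and self.pending:
            self.followed += 1
            self.pending = False
        return self.followed / self.causes if self.causes else None

tracker = PairTracker()
for stamp, label in [
    ("20:30:21", "AAAAA"),
    ("20:30:51", "BBBB"),
    ("21:30:21", "bbbb"),
]:
    estimate = tracker.process(stamp, label)
    print(label, estimate)
```

Because only a few counters are kept, no history needs to be rescanned when a new record arrives, which is the property a batch Map/Reduce job lacks.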

--Bobby Evans



 
From: Zhiwei Lin <zhiwei...@gmail.com>
Date: Tue, 22 May 2012 14:58:44 +0100
Local: Tues, May 22 2012 9:58 am
Subject: Re: Stream data processing

Hi Bobby,

Thank you. Great help.

Zhiwei


--

Best wishes.

Zhiwei


 