Auto-failover chain fails / Acks / Event ordering

Yun Huang Yong

Jul 1, 2010, 8:15:57 AM
to flume...@cloudera.org
hi all,

I'm trying to get a relatively simple setup going on 2 physical hosts:
baklava: node + master
ramen: node

Both machines running Ubuntu 10.04 (lucid) & CDH3.

Logical node config:
agent1 : text("/home/yun/flume-src") | autoE2EChain;
collector1 : collectorSource | text("/tmp/flume-test-output");

Master status:
http://www.mooh.org/public/flume/20100701-flumemaster.html


Why does the autoE2EChain fail with a "no collectors" message?


If I use an agentSink("baklava.mooh.org") with agent1 then I'm able to
transport events. However if I stop & restart the node process on
either machine it seems all previous events are replayed, even those
already sent previously.

Could someone explain the ack behaviour?


Also, in my initial goofing around with a console source where I typed a
message every few minutes, I notice that in the replay behaviour above
the events arrive in groups, and out of order. Is this intentional? I
would've expected that events from a single source should arrive in
source submission order.

Lastly... how do I get rid of the extraneous entries for "collector",
"baklava.mooh.org" and "ramen.mooh.org" in the Node status?

Thanks in advance :)
yun

Jonathan Hsieh

Jul 1, 2010, 9:41:14 AM
to y...@nomitor.com, flume...@cloudera.org
Hi Yun, 

I've put answers inline.  Let us know if you have more questions, or if other things are unclear!

- Jon

On Thu, Jul 1, 2010 at 5:15 AM, Yun Huang Yong <y...@nomitor.com> wrote:
hi all,

I'm trying to get a relatively simple setup going on 2 physical hosts:
 baklava: node + master
 ramen: node

Both machines running Ubuntu 10.04 (lucid) & CDH3.

Logical node config:
 agent1 : text("/home/yun/flume-src") | autoE2EChain;
 collector1 : collectorSource | text("/tmp/flume-test-output");

Master status:
 http://www.mooh.org/public/flume/20100701-flumemaster.html


Why does the autoE2EChain fail with a "no collectors" message?


The auto*Chains automatically figure out host/port combinations for nodes that use 'autoCollectorSource's. To get the behavior you want, use autoCollectorSource instead of collectorSource. Also, since you are using E2E reliability mode, you should use a collectorSink (it takes care of the acks after events are delivered to it). So collector1 should be set to:

collector1: autoCollectorSource | collectorSink("file:///tmp/flume-test-output","fileprefix") ;

Over time, there should be several files in the collector machine's /tmp/flume-test-output dir with names fileprefix-xxxx where xxxx has datetime-like information.
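For reference, the full pair of logical node configs for your setup would then look something like this (a sketch reusing your paths and hostnames; the file prefix is just an example):

 agent1 : text("/home/yun/flume-src") | autoE2EChain;
 collector1 : autoCollectorSource | collectorSink("file:///tmp/flume-test-output", "fileprefix");

With autoCollectorSource on collector1, the autoE2EChain has a collector it can resolve to, so the "no collectors" failure should go away.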
 

If I use an agentSink("baklava.mooh.org") with agent1 then I'm able to
transport events.  However if I stop & restart the node process on
either machine it seems all previous events are replayed, even those
already sent previously.

Could someone explain the ack behaviour?

You must use a collectorSink when using an E2E sink. Here's why:

E2E sinks add 'Ack messages' to the streams of events. Ack messages are checked, processed, and removed by the collectorSink. The collector sink is responsible for figuring out, and eventually telling the agents, that the messages have been successfully received. You are receiving replayed data because the sink you have chosen, text, isn't aware of the ack messages and thus cannot tell the E2E agent that it has successfully received the data. The E2E agent is paranoid about losing data and will replay the data until it gets confirmation of receipt!
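To make the contrast concrete, here is a sketch of the two collector variants (reusing the paths from your config, with agent1 left as agentSink("baklava.mooh.org")):

 replays on restart, because text() ignores the ack messages:
  collector1 : collectorSource | text("/tmp/flume-test-output");

 acks handled, so the agent gets confirmation and stops resending:
  collector1 : collectorSource | collectorSink("file:///tmp/flume-test-output", "fileprefix");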
 


Also, in my initial goofing around with a console source where I typed a
message every few minutes, I notice that in the replay behaviour above
the events arrive in groups, and out of order.  Is this intentional?  I
would've expected that events from a single source should arrive in
source submission order.

Arriving in groups is on purpose. Arriving out of order is acceptable behavior and can be mitigated. When you use the collectorSink, we give you the ability to bucket messages that arrive within a particular time period, regardless of the order in which they are eventually sent to the collector.

Here's an example:

collector1: autoCollectorSource | collectorSink("file:///tmp/foo/%Y-%m-%d/%H/%M", "foodata");

This would write: 
data from 7/1/2010 6:33 am to /tmp/foo/2010-07-01/06/33/foodata-xxx and 
data from 7/1/2010 6:34 am to /tmp/foo/2010-07-01/06/34/foodata-xxx.

The %Y corresponds to year, %m to month, etc. You can see more details here: http://archive.cloudera.com/cdh/3/flume/UserGuide.html#_output_bucketing
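As another sketch (assuming the standard date escapes described in the bucketing docs above), dropping the %M component buckets by hour instead of by minute:

 collector1: autoCollectorSource | collectorSink("file:///tmp/foo/%Y-%m-%d/%H", "foodata");

which would put everything from 7/1/2010 6:00-6:59 am under /tmp/foo/2010-07-01/06/foodata-xxx.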
 
Lastly... how do I get rid of the extraneous entries for "collector",
"baklava.mooh.org" and "ramen.mooh.org" in the Node status?


Right now the extraneous node statuses will live on in the DECOMMISSIONED state. We want to make sure you know what has happened to these nodes, and my first thought is that we don't want them to automatically disappear.  Maybe we could give you the ability to make them "go away" (which would also let the system know that you have seen the effect).  Does that seem like a reasonable idea?  (If so, go to http://issues.cloudera.com and file a bug/feature request!)
 
Thanks in advance :)
yun



--
// Jonathan Hsieh (shay)
// j...@cloudera.com

Jonathan Hsieh

Jul 1, 2010, 11:35:40 AM
to y...@nomitor.com, flume...@cloudera.org
Yun,

Response below.

On Thu, Jul 1, 2010 at 7:18 AM, Yun Huang Yong <y...@nomitor.com> wrote:
Thanks for the explanations, everything now works as expected.


On 1/07/2010 11:41 PM, Jonathan Hsieh wrote:
> Right now the extraneous node statuses will live on in the DECOMMISSIONED
> state. We want to make sure you know what has happened to these nodes
> and my first thought is that we don't want them to automatically
> disappear.  Maybe we could give you the ability to make them "go away"
> (which would also let the system know that you have seen the effect).
>  Does that seem like a reasonable idea?  (If so, go to
> http://issues.cloudera.com and file a bug/feature request!)

Ah ok, that makes sense, will file a feature request.

Is this also the source of these messages:

2010-07-02 00:13:22,187 INFO com.cloudera.flume.agent.LivenessManager:
Logical Node 'ramen.mooh.org' not configured on master

They're appearing every 5 seconds which is cluttering the log somewhat. :)

Thanks!
yun

That message every 5 seconds is a vestige of a previous, less general design.  The DECOMMISSIONED state is relatively new in Flume, and it probably makes sense to make that message go away, or to back off and log it less frequently, for user-decommissioned nodes and for nodes in general.

Thanks,
Jon.