running Flume on windows

Jorge

Aug 4, 2010, 1:39:51 PM
to Flume Users
Hello!

I would like to use Flume to store logs from my .NET server. The
easiest way seems to be
to log to a file, and then tail this file with the Flume agent.

Because I am a performance hound, I would like to skip the disk
activity, and send the message directly to the
Flume agent. Would it be possible to send the log via a named pipe? I
suppose this would involve some coding, which I would be happy to do.
Here is an example of how to do this:

http://v01ver-howto.blogspot.com/2010/04/howto-use-named-pipes-to-communicate.html


Also, regarding running the startup scripts on windows, there is a
port of bash shell for windows, so this should work,
with a couple of changes to the file paths.

Cheers,
Aaron



Jonathan Hsieh

Aug 4, 2010, 8:55:05 PM
to Jorge, Flume Users
Aaron,

The example you posted looks like it takes data from Java land and feeds it to Windows/.NET land (and sends a response back). However, it looks like it can support going the other way (.NET -> Java) as well, using the Windows pipe naming convention "\\.\pipe\testpipe".

My first thought is that, if your application is already writing to the named pipe, you can probably just hook up a Flume 'file' source or 'tail' source using the named pipe's name as its argument. There may also be some issues with '\n' vs. '\r\n' line endings. This of course needs to be tested.
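
To make the '\n' vs. '\r\n' concern concrete, here is a minimal Python sketch of the kind of normalization a reader on the receiving end of the pipe might need. It is not Flume code; the function name `split_events` and the UTF-8 default are assumptions made for illustration only.

```python
from typing import List


def split_events(chunk: bytes, encoding: str = "utf-8") -> List[str]:
    """Split raw bytes into one string per log line.

    Accepts both Unix ('\n') and Windows ('\r\n') line endings so that
    a Windows writer does not leave a stray '\r' on every event.
    """
    text = chunk.decode(encoding)
    # Collapse Windows line endings to Unix ones before splitting.
    text = text.replace("\r\n", "\n")
    lines = text.split("\n")
    # A chunk that ends with a newline yields a trailing empty string; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return lines
```

A reader that splits only on '\n' would otherwise hand every event to the next stage with a trailing carriage return.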

A while ago, we did some testing running Flume using Cygwin as our "unix environment" on Windows. Some vestiges remain in our startup scripts. These have not been tested for a long time, however. I think it might be a bit much to expect Cygwin to be installed on all production machines if it isn't there already. Ideally, we'd do something more Windows-native.

We've started a JIRA here: https://issues.cloudera.org/browse/FLUME-107  It's a good place to put more requirements and follow progress once we have a design. The mailing list is probably a good place to hash out an initial design.

Jon.
--
// Jonathan Hsieh (shay)
// j...@cloudera.com

Aaron Boxer

Aug 4, 2010, 11:13:20 PM
to Jonathan Hsieh, Flume Users
Thanks, Jonathan. Yes, the example is going the wrong way, but it
looks pretty straightforward to reverse it. Currently, my app writes
log messages as xml text blobs to a SQL Server database.
I am not using pipes at the moment, but thought this might be the
easiest way of communicating
between a .NET app and the Flume Java agent, without going to disk.

I understand there is a C# thrift client. This sounds like a good
solution. Would this client talk to a local Flume agent, or to a
remote Flume node?

Also, I need the highest level of reliability. At this level, would
the Flume agent be writing to disk before sending the message to the
remote node? Also, if the Flume agent is "tailing" a file, would it
skip the write to disk, because the message is already persisted in
the file?

Regarding cygwin, no, I would rather avoid installing that.
But, there is a bash shell for windows; shouldn't that work as a
lightweight solution?

So, to recap, I need the highest level of reliability, and I want to
avoid unnecessary writes to disk.

Thanks again.

Jonathan Hsieh

Aug 5, 2010, 10:56:07 PM
to Aaron Boxer, Flume Users
Aaron,

Answers inline.

Jon.

On Wed, Aug 4, 2010 at 8:13 PM, Aaron Boxer <box...@gmail.com> wrote:
Thanks, Jonathan.  Yes, the example is going the wrong way, but it
looks pretty straightforward to reverse it. Currently, my app writes
log messages as xml text blobs to a SQL Server database.
I am not using pipes at the moment, but thought this might be the
easiest way of communicating
between a .NET app and the Flume Java agent, without going to disk.


Sounds reasonable.
 
I understand there is a C# thrift client. This sounds like a good
solution. Would this client talk to a local Flume agent, or to a
remote Flume node?

It could do either, depending on how reliable you need the data to be and whether you are willing to put load on the log-generating machine. If you are really concerned about reliability, it would make sense to have the agent local and use its write-ahead log (WAL). If you are concerned about load, I'd go best effort to a remote Flume node. Something in between would use the disk failover mode to a local agent, which only writes to disk if it detects a problem one hop downstream.
 
Also, I need the highest level of reliability. At this level, would
the Flume agent be writing to disk before sending the message to the
remote node? Also, if the Flume agent is "tailing" a file, would it
skip the write to disk, because the message is already persisted in
the file?


For the highest reliability mode, yes, it would be writing to local disk.

I think if you were tailing the named pipe, there would be no initial write to disk. After that, it depends on the mode: the most reliable mode would always write to disk, the mid-level mode would write to disk only on detected failures, and the best-effort mode would never write to disk.
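
The three cases above can be summarized as a small decision function. This is only a Python model of the behavior described in this message, not Flume's implementation; the `Mode` enum and `writes_to_disk` are hypothetical names introduced for the sketch.

```python
from enum import Enum


class Mode(Enum):
    """The three reliability levels discussed above (labels are illustrative)."""
    END_TO_END = "end-to-end (write-ahead log)"
    DISK_FAILOVER = "disk failover"
    BEST_EFFORT = "best effort"


def writes_to_disk(mode: Mode, downstream_ok: bool) -> bool:
    """Return True if the agent would write an event to local disk."""
    if mode is Mode.END_TO_END:
        # Most reliable mode: every event goes to the WAL first.
        return True
    if mode is Mode.DISK_FAILOVER:
        # Mid-level mode: disk is used only when the next hop has failed.
        return not downstream_ok
    # Best effort: never touches the disk, so events can be lost.
    return False
```

Read this way, the trade-off is disk I/O on every event (end-to-end) versus disk I/O only during an outage (disk failover) versus none at all (best effort).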
 
Regarding cygwin, no, I would rather avoid installing that.
But, there is a bash shell for windows; shouldn't that work as a
lightweight solution?


Can you give me a pointer to the one you are thinking about? Another option would be to do something more Windows-native (something like PowerShell or .bat scripts?).

Aaron Boxer

Aug 6, 2010, 11:24:25 AM
to Jonathan Hsieh, Flume Users
Thanks, Jonathan.

I think I would prefer having a local flume agent with WAL, and use
named pipe to communicate with it from .NET server.

Can you please walk me through a configuration of Flume that would be
fail safe: i.e. could recover from failure of
any node?

For starters, I would have two hosts with local flume agents, writing
to a flume master. Would I need a second master?
How would it work?


Regarding windows bash,
UnxUtils is a Windows distribution of common unix commands, including
sh. (http://unxutils.sourceforge.net/)
One just has to proceed with caution when putting them on the path,
because some Windows-only commands, like find, have the same names.

So, a quick and dirty solution would be to run sh.exe, and change some
of the paths in the startup script.
A better one would be to rewrite it in Perl.

Thanks again!

Aaron

Jonathan Hsieh

Aug 8, 2010, 2:12:34 PM
to Aaron Boxer, Flume Users
Aaron,

The high-level plan sounds reasonable. More below.

On Fri, Aug 6, 2010 at 8:24 AM, Aaron Boxer <box...@gmail.com> wrote:
Thanks, Jonathan.

I think I would prefer having a local flume agent with WAL, and use
named pipe to communicate with it from .NET server.

Can you please walk me through a configuration of Flume that would be
fail safe: i.e. could recover from failure of
any node?

For starters, I would have two hosts with local flume agents, writing
to a flume master. Would I need a second master?
How would it work?
 

Let's assume that the physical machines are named node1, node2, and collector. (A Flume master is different from a Flume collector -- masters generally aren't in the data path.)

node1: source |  agentE2ESink("collector");
node2: source | agentE2ESink("collector");
collector: collectorSource | collectorSink("hdfs://nn/path/to/", "fileprefix-");

If the collector or name node goes down, each node has a copy in its local WAL and will eventually recover. If node1 goes down and then comes back up, it will recover and replay the data in its WAL. If node2 goes down and never comes back up, some messages may be lost (but since node2 is down, it's not producing anything new).
 
You could have two collectors. In this situation, if a collector goes down, the nodes can fail over to the other collector. The 0.9.0 version has some alpha code that automatically manages this.

node1: source | autoE2EChain;
node2: source | autoE2EChain;
collector1: autoCollectorSource | collectorSink(...);
collector2: autoCollectorSource | collectorSink(...);

The auto chains will be randomly assigned to autoCollectorSources. Let's say node1 gets assigned to collector1 and node2 gets assigned to collector2. Then let's say collector1 goes down: node1 will fail over to collector2. Now let's say both collector1 and collector2 go down: node1 and node2 have WALs and store the data. If collector1 comes back up, node1 and node2 will eventually retry and start delivering to collector1.
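
The failover and retry behavior described above can be modeled in a few lines of Python. This is a toy simulation, not Flume code: the `Collector` and `Node` classes are hypothetical, the WAL is an in-memory queue standing in for an on-disk log, and an event counts as delivered as soon as a collector accepts it, which is a simplification of a real end-to-end acknowledgement.

```python
from collections import deque


class Collector:
    """A stand-in for a collector that can be switched up or down."""

    def __init__(self, name):
        self.name = name
        self.up = True
        self.received = []

    def deliver(self, event):
        if not self.up:
            raise ConnectionError(self.name + " is down")
        self.received.append(event)


class Node:
    """A stand-in for an agent with a WAL and an ordered list of collectors."""

    def __init__(self, collectors):
        self.collectors = list(collectors)  # assigned collector first
        self.wal = deque()                  # in-memory model of the on-disk WAL

    def send(self, event):
        # Log the event first, then try to push everything that is pending.
        self.wal.append(event)
        self.flush()

    def flush(self):
        """Deliver pending events in order; return True if the WAL is drained."""
        while self.wal:
            if not self._try_deliver(self.wal[0]):
                return False  # all collectors down: keep the data, retry later
            self.wal.popleft()
        return True

    def _try_deliver(self, event):
        # Try the assigned collector, then fail over to the others in order.
        for collector in self.collectors:
            try:
                collector.deliver(event)
                return True
            except ConnectionError:
                continue
        return False
```

With `node1 = Node([collector1, collector2])`, taking collector1 down routes new events to collector2; taking both down leaves events in `node1.wal` until a later `flush()` succeeds.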


Regarding windows bash,
UnxUtils is a Windows distribution of common unix commands, including
sh. (http://unxutils.sourceforge.net/)
One just has to proceed with caution when putting them on the path,
because some Windows-only commands, like find, have the same names.

So, a quick and dirty solution would be to run sh.exe, and change some
of the paths in the startup script.
A better one would be to rewrite it in Perl.

Hm, Perl (or Python, which Cloudera prefers) might be a reasonable idea. It seems to be a fairly heavyweight solution, though. I'd lean toward something that is standard on all the Windows boxen, or something really small.

Aaron Boxer

Aug 9, 2010, 11:09:53 PM
to Jonathan Hsieh, Flume Users
Thanks, Jonathan. It's helpful to see how the pieces would fit together.

For the startup script, Perl or Python might be overkill, but it would
save you from
maintaining two different scripts, which might be worth the effort.
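
As a rough illustration of why a single script could serve both platforms, here is a Python sketch that builds the java command line with the platform's own classpath separator. Everything specific in it is an assumption: the `conf` and `lib` directory layout, the `build_command` helper, and whatever main class name is passed in; none of it is taken from the actual Flume startup scripts.

```python
import os


def build_command(flume_home, main_class, args, pathsep=os.pathsep):
    """Build a java command line that works on both Windows and Unix.

    pathsep defaults to the platform's classpath separator
    (';' on Windows, ':' on Unix), which is the main difference a
    shell script and a .bat script would otherwise each hard-code.
    """
    # Assumed layout: configuration in <home>/conf, jars in <home>/lib.
    conf_dir = os.path.join(flume_home, "conf")
    jars = os.path.join(flume_home, "lib", "*")  # JVM expands the wildcard
    classpath = pathsep.join([conf_dir, jars])
    return ["java", "-cp", classpath, main_class] + list(args)
```

The returned list can be handed to `subprocess.call` unchanged on either platform, so there is no shell quoting or path rewriting to maintain twice.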

Cheers,
Aaron
