Hello!
I'm running some benchmarks with Flume in order to choose our future log shipping system to a Hadoop cluster. So far my setup seems to work fine when I send some logs through this platform.
Lab setup:
* A machine acts as a client; we would like to ship its syslog file to the Hadoop cluster. An rsyslog daemon on it is configured to forward matching lines to a local Flume instance as follows:
if $programname startswith 'Mytag' then @@localhost:20514
It's a TCP connection to localhost, so it should be reliable...
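For completeness, I understand that plain @@ forwarding in rsyslog can still drop messages once its small in-memory action buffer fills up while the receiver is slow; a disk-assisted action queue would avoid that. A sketch of what I mean (the work directory and queue file name are placeholders of my own, not from our current setup):

```
$WorkDirectory /var/spool/rsyslog
$ActionQueueType LinkedList
$ActionQueueFileName flumeq
$ActionQueueMaxDiskSpace 1g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
if $programname startswith 'Mytag' then @@localhost:20514
```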
* The local Flume agent listens with a syslog source and forwards the logs to a pool of Flume servers over the Avro protocol:
agent.sources = source1
agent.channels = channel1
agent.sinks = host1 host2 host3 host4
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = host1 host2 host3 host4
agent.sinkgroups.g1.processor.type = load_balance
agent.sinkgroups.g1.processor.selector = round_robin
agent.sinkgroups.g1.processor.backoff = true
agent.sources.source1.type = syslogtcp
agent.sources.source1.port = 20514
agent.sources.source1.host = 127.0.0.1
agent.sources.source1.channels = channel1
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 3500000
agent.channels.channel1.transactionCapacity = 3500000
agent.sinks.host1.channel = channel1
agent.sinks.host1.type = avro
agent.sinks.host1.hostname = 10.29.2.199
agent.sinks.host1.port = 44445
agent.sinks.host2.channel = channel1
agent.sinks.host2.type = avro
agent.sinks.host2.hostname = 10.29.2.200
agent.sinks.host2.port = 44445
agent.sinks.host3.channel = channel1
agent.sinks.host3.type = avro
agent.sinks.host3.hostname = 10.29.2.201
agent.sinks.host3.port = 44445
agent.sinks.host4.channel = channel1
agent.sinks.host4.type = avro
agent.sinks.host4.hostname = 10.29.2.202
agent.sinks.host4.port = 44445
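One thing I wondered about in the config above: a memory channel keeps events only in RAM, so anything buffered is lost if the agent dies, and events can be refused if the channel fills while all sinks are backed off. For reference, a file channel variant of this first tier would look like this (the directory paths are placeholders, not from our setup):

```
agent.channels.channel1.type = file
agent.channels.channel1.checkpointDir = /var/lib/flume/checkpoint
agent.channels.channel1.dataDirs = /var/lib/flume/data
agent.channels.channel1.transactionCapacity = 10000
```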
And the target Flume servers are configured to write to HDFS:
agent1.channels.memory-channel.type = file
agent1.channels.memory-channel.checkpointDir = /tmp/flume_checkpoint
agent1.channels.memory-channel.checkpointInterval = 1000
agent1.channels.memory-channel.transactionCapacity = 150000
agent1.sources=source1
agent1.sources.source1.bind=0.0.0.0
agent1.sources.source1.channels=memory-channel
agent1.sources.source1.port=44445
agent1.sources.source1.type=avro
agent1.sinks.hdfs-sink.channel = memory-channel
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /data/in
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.rollSize = 128000000
agent1.sinks.hdfs-sink.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.idleTimeout = 300
agent1.channels = memory-channel
agent1.sources = source1
agent1.sinks = hdfs-sink
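To see whether events are dropped between source and sink, I believe Flume's built-in HTTP monitoring can be enabled when starting each agent; counters such as EventPutSuccessCount, EventTakeSuccessCount and ChannelSize are then exposed as JSON on the chosen port. The launch flags would be something like this (the port number is arbitrary, and the config file name is a placeholder):

```
flume-ng agent -n agent1 -f flume.conf \
  -Dflume.monitoring.type=http \
  -Dflume.monitoring.port=34545
```

The metrics should then be readable at http://<agent-host>:34545/metrics on each tier, which would let me compare what each channel received against what it delivered.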
Unfortunately, when I send exactly 100 000 lines, some of the data is lost:
wc -l loglines.log
10000 loglines.log
for i in {0..9}; do
  cat loglines.log | while read line; do
    logger -p local7.info -t Mytag "$line"; sleep 0.01
  done
done
On the HDFS side I always lose about 20-30% of the lines, yet I don't see any Java exceptions in either the Flume or the HDFS logs.
I'm a beginner with Hadoop software and Flume, so maybe I'm missing something obvious... Could you please help me find the cause of this issue?
Regards,
Smana