Hello!
I'm running some benchmarks with Flume in order to choose our future log shipping system to a Hadoop cluster. So far my setup seems to work fine when I send some logs through this platform.
Lab setup:
* A machine acts as a client; we would like to ship its syslog file to the Hadoop cluster. An rsyslog daemon on it is configured to forward matching lines to a local Flume instance as follows:
if $programname startswith 'Mytag' then @@localhost:20514
It's a TCP connection to localhost, so it should be reliable...
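For completeness, I understand that plain @@ forwarding in rsyslog can still drop messages once its small in-memory action buffer fills up while the receiver is slow; a disk-assisted action queue would avoid that. A sketch of what I mean (the work directory and queue file name are placeholders of my own, not from our current setup):

```
$WorkDirectory /var/spool/rsyslog
$ActionQueueType LinkedList
$ActionQueueFileName flumeq
$ActionQueueMaxDiskSpace 1g
$ActionResumeRetryCount -1
$ActionQueueSaveOnShutdown on
if $programname startswith 'Mytag' then @@localhost:20514
```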
* The local Flume agent listens with a syslog source and forwards the logs to a pool of Flume servers over the Avro protocol:
agent.sources = source1
agent.channels = channel1
agent.sinks = host1 host2 host3 host4
agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = host1 host2 host3 host4
agent.sinkgroups.g1.processor.type = load_balance
agent.sinkgroups.g1.processor.selector = round_robin
agent.sinkgroups.g1.processor.backoff = true
agent.sources.source1.type = syslogtcp
agent.sources.source1.port = 20514
agent.sources.source1.host = 127.0.0.1
agent.sources.source1.channels = channel1
agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 3500000
agent.channels.channel1.transactionCapacity = 3500000
agent.sinks.host1.channel = channel1
agent.sinks.host1.type = avro
agent.sinks.host1.hostname = 10.29.2.199
agent.sinks.host1.port = 44445
agent.sinks.host2.channel = channel1
agent.sinks.host2.type = avro
agent.sinks.host2.hostname = 10.29.2.200
agent.sinks.host2.port = 44445
agent.sinks.host3.channel = channel1
agent.sinks.host3.type = avro
agent.sinks.host3.hostname = 10.29.2.201
agent.sinks.host3.port = 44445
agent.sinks.host4.channel = channel1
agent.sinks.host4.type = avro
agent.sinks.host4.hostname = 10.29.2.202
agent.sinks.host4.port = 44445
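One thing I wondered about in the config above: a memory channel keeps events only in RAM, so anything buffered is lost if the agent dies, and events can be refused if the channel fills while all sinks are backed off. For reference, a file channel variant of this first tier would look like this (the directory paths are placeholders, not from our setup):

```
agent.channels.channel1.type = file
agent.channels.channel1.checkpointDir = /var/lib/flume/checkpoint
agent.channels.channel1.dataDirs = /var/lib/flume/data
agent.channels.channel1.transactionCapacity = 10000
```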
And the target Flume servers are configured to write to HDFS:
agent1.channels.memory-channel.type = file
agent1.channels.memory-channel.checkpointDir = /tmp/flume_checkpoint
agent1.channels.memory-channel.checkpointInterval = 1000
agent1.channels.memory-channel.transactionCapacity = 150000
agent1.sources=source1
agent1.sources.source1.bind=0.0.0.0
agent1.sources.source1.channels=memory-channel
agent1.sources.source1.port=44445
agent1.sources.source1.type=avro
agent1.sinks.hdfs-sink.channel = memory-channel
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /data/in
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.rollSize = 128000000
agent1.sinks.hdfs-sink.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.idleTimeout = 300
agent1.channels = memory-channel
agent1.sources = source1
agent1.sinks = hdfs-sink
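To see whether events are dropped between source and sink, I believe Flume's built-in HTTP monitoring can be enabled when starting each agent; counters such as EventPutSuccessCount, EventTakeSuccessCount and ChannelSize are then exposed as JSON on the chosen port. The launch flags would be something like this (the port number is arbitrary, and the config file name is a placeholder):

```
flume-ng agent -n agent1 -f flume.conf \
  -Dflume.monitoring.type=http \
  -Dflume.monitoring.port=34545
```

The metrics should then be readable at http://<agent-host>:34545/metrics on each tier, which would let me compare what each channel received against what it delivered.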
Unfortunately, when I send exactly 100 000 lines, some of the data is lost:
wc -l loglines.log
10000 loglines.log
for i in {0..9}; do
  cat loglines.log | while read line; do
    logger -p local7.info -t Mytag "$line"; sleep 0.01
  done
done
On the HDFS side I always lose about 20-30% of the lines, yet I don't see any Java exceptions in either the Flume or the HDFS logs.
I'm a beginner with Hadoop software and Flume, so maybe I'm missing something obvious... Could you please help me find the cause of this issue?
Regards,
Smana