Rsyslog to HDFS through Flume servers: Data lost :(


Smain Kahlouch

Feb 5, 2015, 10:07:56 AM
to flume...@googlegroups.com
Hello!

I'm running some benchmarks with Flume in order to choose our future log-shipping system for a Hadoop cluster. The setup seems to work fine when I send a small stream of logs through it.

Lab setup:

* One machine acts as a client; we would like to ship its syslog file to the Hadoop cluster.
Its rsyslog daemon is configured to forward to a local Flume instance as follows:

if $programname startswith 'Mytag' then @@localhost:20514

It's a TCP connection to localhost, so it should be reliable...
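(One caveat worth noting: even over localhost TCP, rsyslog can silently discard messages when the receiver stalls or restarts, because the default in-memory action queue is small. If that matters for the test, a disk-assisted action queue is a possible safeguard; this is only a sketch using rsyslog's legacy queue directives, and the queue file name `flumeq` is a placeholder of mine:)

```
$ActionQueueType LinkedList          # in-memory list, spills to disk when full
$ActionQueueFileName flumeq          # setting a file name enables disk assistance
$ActionQueueMaxDiskSpace 1g          # cap on the on-disk queue
$ActionQueueSaveOnShutdown on        # persist queued messages across restarts
$ActionResumeRetryCount -1           # retry forever instead of discarding
if $programname startswith 'Mytag' then @@localhost:20514
```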

* The local Flume instance listens with a syslog source and forwards the logs to a pool of Flume servers over the Avro protocol:

agent.sources = source1
agent.channels = channel1
agent.sinks = host1 host2 host3 host4

agent.sinkgroups = g1
agent.sinkgroups.g1.sinks = host1 host2 host3 host4
agent.sinkgroups.g1.processor.type = load_balance
agent.sinkgroups.g1.processor.selector = round_robin
agent.sinkgroups.g1.processor.backoff = true

agent.sources.source1.type = syslogtcp
agent.sources.source1.port = 20514
agent.sources.source1.host = 127.0.0.1
agent.sources.source1.channels = channel1

agent.channels.channel1.type = memory
agent.channels.channel1.capacity = 3500000
agent.channels.channel1.transactionCapacity = 3500000

agent.sinks.host1.channel = channel1
agent.sinks.host1.type = avro
agent.sinks.host1.hostname = 10.29.2.199
agent.sinks.host1.port = 44445
agent.sinks.host2.channel = channel1
agent.sinks.host2.type = avro
agent.sinks.host2.hostname = 10.29.2.200
agent.sinks.host2.port = 44445
agent.sinks.host3.channel = channel1
agent.sinks.host3.type = avro
agent.sinks.host3.hostname = 10.29.2.201
agent.sinks.host3.port = 44445
agent.sinks.host4.channel = channel1
agent.sinks.host4.type = avro
agent.sinks.host4.hostname = 10.29.2.202
agent.sinks.host4.port = 44445

And the target Flume servers are configured to write to HDFS:

agent1.channels.memory-channel.type = file
agent1.channels.memory-channel.checkpointDir = /tmp/flume_checkpoint
agent1.channels.memory-channel.checkpointInterval = 1000
agent1.channels.memory-channel.transactionCapacity = 150000

agent1.sources.source1.bind = 0.0.0.0
agent1.sources.source1.channels = memory-channel
agent1.sources.source1.port = 44445
agent1.sources.source1.type = avro

agent1.sinks.hdfs-sink.channel = memory-channel
agent1.sinks.hdfs-sink.type = hdfs
agent1.sinks.hdfs-sink.hdfs.path = /data/in
agent1.sinks.hdfs-sink.hdfs.fileType = DataStream
agent1.sinks.hdfs-sink.hdfs.rollSize = 128000000
agent1.sinks.hdfs-sink.hdfs.rollInterval = 0
agent1.sinks.hdfs-sink.hdfs.rollCount = 0
agent1.sinks.hdfs-sink.hdfs.idleTimeout = 300

agent1.channels = memory-channel
agent1.sources = source1
agent1.sinks = hdfs-sink

Unfortunately, when I send exactly 100,000 lines (the 10,000-line file below, replayed 10 times), some of the data is lost.

wc -l loglines.log
10000 loglines.log

for i in {0..9}; do
    cat loglines.log | while read line; do
        logger -p local7.info -t Mytag "$line"
        sleep 0.01
    done
done

On the HDFS side I always lose about 20-30% of the lines.
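To quantify the loss, I compare the number of lines that land on HDFS (counted with something like `hdfs dfs -cat /data/in/* | wc -l`) against the number sent. A small shell sketch of that arithmetic, where the received count is a hypothetical value standing in for the real HDFS count:

```shell
#!/bin/sh
sent=100000      # 10 passes over the 10,000-line file
received=72000   # hypothetical; in practice: hdfs dfs -cat /data/in/* | wc -l
loss=$(( (sent - received) * 100 / sent ))
echo "lost ${loss}% of the lines"
```

With the hypothetical count above this prints "lost 28% of the lines", which matches the 20-30% range I observe.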


I don't see any Java exceptions in either the Flume or the HDFS logs.
I'm a beginner with Hadoop software and Flume, so maybe I'm missing something...
Could you please help me find the cause of this issue?

Regards,
Smana


Smain Kahlouch

Feb 10, 2015, 11:25:38 AM
to flume...@googlegroups.com
OK, I don't know why, but it seems to be caused by the HDFS sink.
I've changed my configuration to write directly to the local filesystem instead, and I have no data loss.

agent1.channels = file-channel
agent1.channels.file-channel.type = file
agent1.channels.file-channel.checkpointDir = /tmp/flume_checkpoint
agent1.channels.file-channel.checkpointInterval = 1000
agent1.channels.file-channel.transactionCapacity = 150000

agent1.sources = source1
agent1.sources.source1.type = syslogtcp
agent1.sources.source1.host = 127.0.0.1
agent1.sources.source1.port = 20515
agent1.sources.source1.channels = file-channel

agent1.sinks = file-sink
agent1.sinks.file-sink.type = file_roll
agent1.sinks.file-sink.channel = file-channel
agent1.sinks.file-sink.sink.directory = /var/log/flume/local

Could you please help me configure the HDFS sink properly?

Regards,
Smana