Hello guys,
I'm trying to run Secor on my production environment, but I'm facing some issues.
Please see below my current configuration.
secor.kafka.topic_filter=^(oneTopic|topicTwo|thirdTopic|topic4)$
secor.consumer.threads=1
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=200
kafka.consumer.timeout.ms=10000
kafka.rebalance.max.retries=10
kafka.rebalance.backoff.ms=8000
kafka.socket.receive.buffer.bytes=
kafka.fetch.message.max.bytes=
secor.max.file.size.bytes=200000000
secor.max.file.age.seconds=600
I'm running 3 Secor processes, each on a separate machine (m1.xlarge). Each of my topics has 3 partitions. After the initial rebalance, each process owns one partition of each topic:
- Machine 1: oneTopic/0, topicTwo/0, thirdTopic/0, topic4/0
- Machine 2: oneTopic/1, ...
- Machine 3: oneTopic/2, ..., topic4/2
After a few hours, the Kafka lag (logSize - offset) is close to ZERO, but I'm losing many events. How do I know that?
I run the following command and then count the number of lines in /tmp/output:
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic oneTopic --from-beginning | grep '{"date":"2015-01-15' > /tmp/output
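One thing worth double-checking on the Kafka side of the comparison: grepping for the literal prefix '{"date":"2015-01-15' only matches events where "date" happens to be the first field of the JSON object, so any events serialized with a different field order would be silently excluded from the count. A minimal sketch of a more tolerant counter (the "date" field name is taken from the grep pattern above; everything else here is an assumption about the event format):

```python
import json
import sys


def count_events_for_date(lines, date_prefix):
    """Count events whose 'date' field starts with the given day.

    More robust than grepping for the raw prefix '{"date":"...', since it
    tolerates field reordering and whitespace in the JSON.
    """
    count = 0
    for line in lines:
        try:
            event = json.loads(line)
        except ValueError:
            continue  # skip malformed lines rather than miscounting
        if str(event.get("date", "")).startswith(date_prefix):
            count += 1
    return count


if __name__ == "__main__":
    # e.g. ./kafka-console-consumer.sh ... | python count_by_date.py 2015-01-15
    print(count_events_for_date(sys.stdin, sys.argv[1]))
```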
I compare this count with the count produced by a Spark job that counts the records in my S3 bucket (I'm pretty sure that code is OK).
These values *sometimes* differ a lot from each other (7~25%). When I reset the topic offset and run it again, sometimes it works.
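For concreteness, the percentages above are just the gap between the two counts relative to the Kafka-side count, e.g. (a trivial sketch, function name is mine):

```python
def relative_gap(kafka_count, s3_count):
    """Fractional difference between the Kafka-side and S3-side counts,
    with the Kafka-side count treated as the reference."""
    if kafka_count == 0:
        return 0.0 if s3_count == 0 else float("inf")
    return abs(kafka_count - s3_count) / float(kafka_count)


# e.g. 930,000 records on S3 vs 1,000,000 consumed from Kafka
# is a 7% discrepancy: relative_gap(1000000, 930000) == 0.07
```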
Have any of you faced similar problems?
Have you performed this kind of check?
Thanks in advance for your help.
Regards,
-- Flávio Barata