Secor losing events


bar...@chaordicsystems.com

Jan 16, 2015, 9:08:56 AM
to secor...@googlegroups.com
Hello guys,

I'm trying to run Secor in my production environment, but I'm facing some issues.
Please see my current configuration below.

secor.kafka.topic_filter=^(oneTopic|topicTwo|thirdTopic|topic4)$
secor.consumer.threads=1
zookeeper.session.timeout.ms=30000
zookeeper.sync.time.ms=200
kafka.consumer.timeout.ms=10000
kafka.rebalance.max.retries=10
kafka.rebalance.backoff.ms=8000
kafka.socket.receive.buffer.bytes=
kafka.fetch.message.max.bytes=
secor.max.file.size.bytes=200000000
secor.max.file.age.seconds=600

I'm running 3 Secor processes, each on a separate machine (m1.xlarge). Each of my topics has 3 partitions. After the initial rebalance, each process owns one partition of each topic:
- Machine 1: oneTopic/0, topicTwo/0, thirdTopic/0, topic4/0
- Machine 2: oneTopic/1, ...
- Machine 3: oneTopic/2, ..., topic4/2

After a few hours, the Kafka lag (logSize - offset) is close to ZERO, but I have lost many events. How do I know that?

I run the following command and then count the number of lines:
./kafka-console-consumer.sh --zookeeper localhost:2181 --topic oneTopic --from-beginning | grep '{"date":"2015-01-15' > /tmp/output
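(The Kafka-side count is then just the line count of that dump, i.e. wc -l /tmp/output.)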

I compare this count with the output of a Spark job that counts the number of records in my S3 bucket (I'm pretty sure this code is OK).
These values *sometimes* differ a lot from each other (7~25%). When I reset the topic offsets and run again, it sometimes works.

Have any of you faced similar problems?
Do you perform this kind of check?

Thanks in advance for your help.
Regards,
-- Flávio Barata

Pawel Garbacki

Jan 16, 2015, 12:05:25 PM
to bar...@chaordicsystems.com, secor...@googlegroups.com
Did you try running LogFileVerifier on the S3 logs? It will tell you if there are gaps in the offset ranges (indicating data loss).

Note that Secor comes with end-to-end tests verifying that all messages go through. As a general piece of design advice, in a production environment you may want to set up an audit topic where you post fake messages alongside the real ones (or track posted message counts in something like TSDB) on the producer end, and compare that against the output.
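To make that concrete, here is a rough sketch of the heartbeat variant (the topic name secorAudit, the broker address, and the one-minute interval are only illustrative, and the audit topic would have to match your secor.kafka.topic_filter so that Secor picks it up):

i=0
while true; do
  # post one numbered fake message per minute to the audit topic
  echo "{\"audit_seq\":$i,\"ts\":$(date +%s)}" |
    ./kafka-console-producer.sh --broker-list localhost:9092 --topic secorAudit
  i=$((i+1))
  sleep 60
done

Any missing audit_seq value in the files Secor uploads to S3 then points at loss in the pipeline itself, independently of the lag metric and of your own counting method.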

Thanks,
-pawel



Flávio Santos

Jan 19, 2015, 7:43:03 AM
to Pawel Garbacki, secor...@googlegroups.com
Hi Pawel,

Thanks for your reply. Yes, I tried LogFileVerifier:

barata@instance01:/mnt/secor$ java -ea -Dlog4j.configuration=log4j.prod.properties -Dconfig=secor.prod.backup.properties -cp "secor-0.1-SNAPSHOT.jar:lib/*" com.pinterest.secor.main.LogFileVerifierMain -t oneTopic -q
2015-01-19 12:30:50,514 [main] (com.pinterest.secor.main.LogFileVerifierMain:99) ERROR Log file verifier failed
java.lang.RuntimeException: Message count of 293737 in topic oneTopic partition 0 does not agree with adjacent offsets 9744343 and 10058777
at com.pinterest.secor.tools.LogFileVerifier.verifyCounts(LogFileVerifier.java:143)
at com.pinterest.secor.main.LogFileVerifierMain.main(LogFileVerifierMain.java:93)

Am I doing something wrong? Just so you know, instance01 is not the owner of partition 0. I don't know whether LogFileVerifier can be executed from any of my instances, but I think it is supposed to work anyway.
The message parser configured in the secor.prod.backup.properties file is:
secor.message.parser.class=com.pinterest.secor.parser.OffsetMessageParser

I look forward to hearing back from you.
Thanks in advance!

Regards,
-- Flávio Barata

Pawel Garbacki

Jan 21, 2015, 1:59:03 AM
to Flávio Santos, secor...@googlegroups.com
LogFileVerifier reads data from S3 and does not talk to Kafka, so it doesn't really matter where you invoke it from. I honestly don't know what's going on here. Does this happen consistently across many offset ranges (i.e., what happens if you set -start_offset to different offset values recorded in the output S3 file names)? Is there anything interesting in the Secor logs?
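For example, something along these lines (same invocation as before; the offset here is just one of the values from your error message, but you could plug in any offset appearing in the S3 file names):

java -ea -Dlog4j.configuration=log4j.prod.properties -Dconfig=secor.prod.backup.properties -cp "secor-0.1-SNAPSHOT.jar:lib/*" com.pinterest.secor.main.LogFileVerifierMain -t oneTopic -start_offset 9744343 -q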

Flávio Santos

Jan 21, 2015, 7:50:39 AM
to Pawel Garbacki, secor...@googlegroups.com
Hi Pawel,

That's really weird. I'm currently running Secor for a single topic on 3 different instances (distinct machines) without problems. I'm doing some stress tests by killing and restarting some of the instances, and there have been no problems so far.
I don't know whether the problem is related to consuming multiple topics at once (my filter regex is "^(oneTopic|topicTwo|thirdTopic|topic4)$") or whether it only appears when Secor is overloaded.

I have some errors in my logs, but I believe they are more or less normal. The first is a "Connection refused" to ZooKeeper. The second is:
2015-01-21 03:05:44,637 [ConsumerFetcherThread-CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13-0-1011623593] (kafka.utils.Logging$class:83) WARN  Reconnect due to socket error: null
2015-01-21 03:05:44,681 [ConsumerFetcherThread-CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13-0-1011623593] (kafka.utils.Logging$class:68) INFO  [ConsumerFetcherThread-CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13-0-1011623593], Stopped 
2015-01-21 03:05:44,686 [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13_watcher_executor] (kafka.utils.Logging$class:68) INFO  [ConsumerFetcherThread-CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13-0-1011623593], Shutdown completed
2015-01-21 03:05:44,687 [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13_watcher_executor] (kafka.utils.Logging$class:68) INFO  [ConsumerFetcherManager-1421697880780] All connections stopped
2015-01-21 03:05:44,687 [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13_watcher_executor] (kafka.utils.Logging$class:68) INFO  [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13], Cleared all relevant queues for this fetcher
2015-01-21 03:05:44,687 [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13_watcher_executor] (kafka.utils.Logging$class:68) INFO  [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13], Cleared the data chunks in all the consumer message iterators
2015-01-21 03:05:44,687 [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13_watcher_executor] (kafka.utils.Logging$class:68) INFO  [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13], Committing all offsets after clearing the fetcher queues
2015-01-21 03:05:44,687 [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13_watcher_executor] (kafka.utils.Logging$class:68) INFO  [CONSUMER-GROUP_ec2-XX-XX-XX-XX.compute-1.amazonaws.com_12810_13], Releasing partition ownership

Thanks for your consideration.

Regards,

--
Flávio R. Santos

Pawel Garbacki

Jan 23, 2015, 11:53:31 AM
to Flávio Santos, secor...@googlegroups.com
Hmm, we do run Secor with multi-topic settings but have never seen this behavior. If you have time, you could modify the end-to-end test to post to multiple topics and try to capture this scenario in a reproducible case.

bar...@chaordicsystems.com

Feb 4, 2015, 8:28:11 AM
to secor...@googlegroups.com, bar...@chaordicsystems.com
Hello Pawel and others,

I spent some time investigating the problem and I have some extra info.

I'm currently running Secor with a single topic (6,000 messages per minute) using 3 m1.xlarge instances, and everything has been going fine since last week.
However, when I run another, completely independent Secor instance (also an m1.xlarge) to consume a higher-volume topic (24k mpm), I start losing more than 30% of my events.

I came to the conclusion that the problem is indeed related to Secor being overloaded. I'm using the default OffsetMessageParser, and the logs (level=INFO) don't show errors (I grep'ed case-insensitively for "exception" and "error"). Moreover, files are not accumulating in the Secor local path, so they are being uploaded to S3 and deleted locally.

Could you please share the number of messages per minute (mpm) you handle in your production environment?
I suspect there is a try/catch somewhere hiding errors, but I really don't know. Do you have any hints before I start debugging the Secor code?

Thank you very much in advance.

Regards,
-- Flávio Barata

Pawel Garbacki

Feb 4, 2015, 2:22:28 PM
to Flávio Santos, secor...@googlegroups.com
Hi Flavio,

Thanks for looking into this. We have several topics with ~100K mpm. Each of them is configured with 60 partitions, and they are handled by 6 Secor boxes.

HTH,
-pawel