Crash when writing to HDFS with a large number of Match configs


ke...@bellycard.com

Apr 18, 2013, 5:22:38 PM
to flu...@googlegroups.com
Hey everyone,

I'm currently using fluentd to pass our user events data into HDFS. We are running into a buffer overflow crash - https://gist.github.com/kevinreedy/833137fa94d8fca5595f.

Here is what our setup looks like (a rough ASCII diagram is at the bottom of this message):
 - our rails based API writes to fluentd running on the same machine
 - fluentd on the API machines buffer into RAM and forward data to log collector machines running fluentd
 - the log collector machines buffer onto disk and then write to HDFS using the WebHDFS plugin, with httpfs

Here is our fluentd config on API Machines: https://gist.github.com/kevinreedy/cfe2072a60a920c49c70 
Notes:
 - We are using type copy, as we eventually plan to forward to two sets of collectors - one for HDFS and another for S3.
 - For troubleshooting purposes, we are only writing to collector01.bellycard.com. We will have failover back in place once we fix the problems.
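For reference, a minimal sketch of what the API-side config looks like. The tag pattern, port, and flush interval here are placeholders, not our real values; the real config is in the gist above:

```
<source>
  type forward
  port 24224
</source>

# copy with a single store for now; a second <store> for S3 would be added later
<match belly.**>
  type copy
  <store>
    type forward
    buffer_type memory
    flush_interval 5s
    <server>
      host collector01.bellycard.com
      port 24224
    </server>
  </store>
</match>
```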

The collector fluentd config is more complex, as we have a large number of different events coming through:
 - The main config file is just "include config.d/*". In that directory, we have one file with a source definition, and about 300 files, each with a single match definition 
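To make the layout concrete, here is roughly what that directory structure looks like. The event name and HDFS host below are hypothetical stand-ins; each of the ~300 files follows the same shape:

```
# /etc/td-agent/td-agent.conf
include config.d/*

# config.d/source.conf
<source>
  type forward
  port 24224
</source>

# config.d/signup.conf (one of ~300 similar files)
<match belly.signup>
  type webhdfs
  host namenode.example.com
  port 14000
  httpfs true
  path /log/signup/access.%Y%m%d.log
  buffer_type file
  buffer_path /var/log/td-agent/buffer/signup
</match>
```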

When fluentd runs on our collector machine, it crashes with "*** buffer overflow detected ***: /usr/lib/fluent/ruby/bin/ruby terminated" and "[info]: process finished code=134". The full log output and crash is at https://gist.github.com/kevinreedy/833137fa94d8fca5595f. If we limit the number of match definitions, we are able to run fine, but I don't know exactly where the breaking point is.

Other notes:
 - fluentd is installed via treasure data's apt repository
 - td-agent version is 1.1.12-1
 - hdfs gems installed are webhdfs-0.5.1 and fluent-plugin-webhdfs-0.1.2
 - ulimit -n 32768 is in our init scripts


I have a workaround in mind, but I would much rather do all of the sorting in fluentd before we get to HDFS.
 - have fluentd drop all logs into a single folder on HDFS
 - write a map/reduce job that runs every 5 or 10 minutes to sort the fluentd data by event name and copy it into its corresponding directory
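The core of that sorting job would just be bucketing records by event name. A toy sketch of the map step in Python (the record layout and the "event" field name are assumptions about our log format, not anything fluentd-specific):

```python
import json
from collections import defaultdict

def partition_by_event(lines):
    """Group JSON log lines by their event name (assumed 'event' field),
    so each bucket can be written to its corresponding HDFS directory."""
    buckets = defaultdict(list)
    for line in lines:
        record = json.loads(line)
        buckets[record["event"]].append(line)
    return dict(buckets)

logs = [
    '{"event": "checkin", "user_id": 1}',
    '{"event": "signup", "user_id": 2}',
    '{"event": "checkin", "user_id": 3}',
]
print(partition_by_event(logs))
```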

My other crazy idea is to fork the webhdfs plugin to make a single connection to HDFS but write to multiple files. 


Does anyone have any ideas on the crash or where to go from here? 

Thanks everyone!
Kevin

Here's that bad ascii art diagram of our setup, if it's easier to visualize

+-------------+
| API Machine | (12 machines) (AWS Linux)
|-------------|
| Rails       |
|   |         |
| FluentD     |
+------+------+
       |
       | (FluentD Protocol)
       |
+------v------------+
| Collector Machine | (Ubuntu 12.04)
|-------------------|
| FluentD           |
+------+------------+
       |
       | (httpfs Protocol)
       |
+------v-----------+
| Hadoop Name Node | (Ubuntu 12.04)
|------------------|
| HDFS             |
+------------------+

Satoshi Tagomori

Apr 19, 2013, 3:06:46 AM
to flu...@googlegroups.com
Hi Kevin,

I'm worried about the number of httpfs server threads.
If your httpfs server (Tomcat) is from CDH4, it runs with Tomcat's default configuration,
which allows only a few threads, so many parallel requests may fail at the httpfs server.

A buffer overflow in Ruby is very curious. It might be a Ruby bug, but we may be able to work
around the problem with a configuration change.

Here is what I think you can try, to work around the current trouble:

* check httpfs server logs and status

Can you operate HDFS over the httpfs server while fluentd is in trouble?
If not, the httpfs server itself may be the problem. Check the httpfs server logs.

* use webhdfs protocol

In our environment, the httpfs server was a traffic bottleneck and caused a great many I/O failures.
For heavy traffic, the webhdfs protocol is better than httpfs.

* use fluent-plugin-forest (to reduce number of <match> directives in configuration)

> If we limit the number of match definitions, we are able to run fine,
> but I don't know exactly where the breaking point is.

How many <match> directives are in your configuration?
As far as I know, configurations with hundreds of <match> directives have not been well tested.

To reduce fluentd's match overhead, fluent-plugin-forest is available.
fluent-plugin-forest uses its own internal match mechanism instead of fluentd's.


<match tag.x.*>
  type forest
  subtype webhdfs
  <template>
  # webhdfs settings
  </template>
</match>

<match tag.y.x*>
  type forest
  subtype webhdfs
  <template>
  # ....
  </template>
</match>
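As I understand it, forest also expands a ${tag} placeholder inside the template, so the hundreds of per-event match files could collapse into a single directive along these lines (host, port, and path are placeholders):

```
<match belly.**>
  type forest
  subtype webhdfs
  <template>
    host namenode.example.com
    port 14000
    httpfs true
    path /log/${tag}/access.%Y%m%d.log
    buffer_type file
    buffer_path /var/log/td-agent/buffer/${tag}
  </template>
</match>
```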

> My other crazy idea is to fork the webhdfs plugin to make a single connection to HDFS but write to multiple files. 

I'm the developer of fluent-plugin-webhdfs.
If you seriously need such a feature, I'd welcome patches rather than forks.

tagomoris

On Friday, April 19, 2013 at 6:22:38 AM UTC+9, ke...@bellycard.com wrote:

Kevin Reedy

Apr 21, 2013, 2:04:01 PM
to flu...@googlegroups.com
Hey Satoshi,

Thanks for the quick response!

I looked through httpfs logs, and the only thing in them at the time of the fluentd crashes is "2013-04-18 21:32:38,695 INFO httpfsaudit: [/] filter [-]"

The reason we are having fluentd connect to httpfs instead of webhdfs is that we have two HDFS name nodes with failover controllers. I don't see any documented way for fluentd to connect to both name nodes in this case. Please point me in the right direction if I've missed something there. Our traffic into HDFS is less than 5GB a day, but it will of course grow over time.

I also just brought up a second fluentd collector (identical configuration and hardware). It was able to start and connect to HDFS with no crash, but it also has no traffic coming at it. I'm going to point some traffic at it shortly to see what happens.

Thanks for pointing me to fluent-plugin-forest. I'm going to put that in place today as well. I'll send an update later today with results of the new fluentd collector and/or using the forest plugin.

Thanks!

Kevin Reedy - Infrastructure Architect
ke...@bellycard.com | C: 708.421.9526 
www.bellycard.com | @kevinreedy


--
You received this message because you are subscribed to the Google Groups "Fluentd Google Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fluentd+u...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
 
 

ke...@bellycard.com

Apr 22, 2013, 12:21:56 PM
to flu...@googlegroups.com
It looks like the problem had nothing to do with httpfs or the large match configs. I believe there was some corrupt data sitting in the disk buffer; once the buffer was cleared, everything has been stable. Thanks again for pointing me at the forest plugin, it made our configuration much, much simpler.

Thanks!
Kevin