MDEV-14047, MyRocks and open_files_limit


Sergey Petrunia

Nov 4, 2017, 11:07:16 AM
to MyRocks - RocksDB storage engine for MySQL
Hello,

This is based on https://jira.mariadb.org/browse/MDEV-14047 .

The main observations of that MDEV are:

1. MyRocks requires file descriptors to operate. The more data one has, the
more files will need to be open (This is expected I guess).

2. If MyRocks runs out of open files, a "too many open files" error will happen
inside RocksDB, which will cause mysqld to crash.

3. With default settings on a modern OS, mysqld will limit itself to 4096 open
file descriptors, which will cause it to crash once the data size reaches about
250G.

4. In order to work with larger datasets, one must adjust open_files_limit in
my.cnf.

4.1 ... and NOT rocksdb_max_open_files, which is -1 ("unlimited") by default.
The value of rocksdb_max_open_files is clipped to the OS limit in
RocksDB::Open(); however, that does not help to avoid the crash.
I am not sure if
- it doesn't help at all, or
- it does not help because there are other uses of file descriptors inside
mysqld, so RocksDB code gets a "too many open files" error before it has opened
rocksdb_max_open_files files.


Questions:

Q1: I assume #2 and #3 are expected behavior and not considered bugs by
either MyRocks or RocksDB teams?

Q2: On one hand, crashing the server creates a bad user experience. On the
other hand, is it acceptable to assume that those who want to load >250G have
read the tuning guide and have set open_files_limit?

Q3: Question for 4.1: is it possible to avoid the crash by adjusting
rocksdb_max_open_files to take into account file descriptor use by other parts
of the server (subtract @@max_connections?)

Or would a better way be for RocksDB to handle the "too many open files" error
by closing some of the files it has open and retrying? I am not sure how hard it
is to implement this logic.

BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog


MARK CALLAGHAN

Nov 4, 2017, 11:27:28 AM
to Sergey Petrunia, MyRocks - RocksDB storage engine for MySQL
This should be a github issue.

ulimit -a for me shows 30,000 on Ubuntu Server and 1,000 on Ubuntu Desktop -- both Ubuntu 16.04. I thought low values for this were a remnant of the old days of Unix when you had to recompile SunOS to change such things. That too low value of 1,000 is going to be a problem for things in mysqld other than RocksDB. So we are not alone in causing pain.

I wish mysqld didn't crash on this error.

Do we log an error message in RocksDB::Open when the value is clipped?

--
You received this message because you are subscribed to the Google Groups "MyRocks - RocksDB storage engine for MySQL" group.
To unsubscribe from this group and stop receiving emails from it, send an email to myrocks-dev+unsubscribe@googlegroups.com.
To post to this group, send email to myroc...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/myrocks-dev/20171104150712.GE17659%40pslp2.
For more options, visit https://groups.google.com/d/optout.



--
Mark Callaghan
mdca...@gmail.com

Sergey Petrunia

Nov 5, 2017, 7:14:21 AM
to MARK CALLAGHAN, MyRocks - RocksDB storage engine for MySQL
Hi Mark,
On Sat, Nov 04, 2017 at 08:27:27AM -0700, MARK CALLAGHAN wrote:
> This should be a github issue.
>
> ulimit -a for me shows 30,000 on Ubuntu Server and 1,000 on Ubuntu Desktop
> -- both Ubuntu 16.04. I thought low values for this were a remnant of the
> old days of Unix when you had to recompile SunOS to change such things.
> That too low value of 1,000 is going to be a problem for things in mysqld
> other than RocksDB. So we are not alone in causing pain.


The limits seem to be set by mysqld, not the OS. On my Ubuntu Desktop, the
default environment has

ulimit -a ## Soft limit:
open files (-n) 1024

ulimit -a -H ## Hard limit:
open files (-n) 65536

When I start mysqld from it and examine /proc/$PID/limits, I get:

FB/MySQL-5.6:
Limit                     Soft Limit           Hard Limit           Units
Max open files            5000                 5000                 files

MariaDB 10.2:
Limit                     Soft Limit           Hard Limit           Units
Max open files            4162                 4162                 files
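
For anyone repeating this check, the "Max open files" row can be pulled out of /proc/<pid>/limits with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the Linux procfs layout shown above.

```python
def parse_max_open_files(limits_text):
    """Return (soft, hard) from the 'Max open files' row of /proc/<pid>/limits text.

    A value of 'unlimited' is returned as None.
    """
    row_name = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(row_name):
            # The row name contains spaces, so strip it before splitting columns.
            fields = line[len(row_name):].split()
            soft, hard = fields[0], fields[1]
            return tuple(None if v == "unlimited" else int(v) for v in (soft, hard))
    raise ValueError("no 'Max open files' row found")


def read_max_open_files(pid):
    """Read the limits of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_max_open_files(f.read())
```

Calling read_max_open_files() with the mysqld PID should report the limits mysqld ended up with, which may differ from what ulimit shows in the shell that started it.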


== What kind of limit is rocksdb_max_open_files? ==

I ran another experiment. I put

rocksdb-max-open-files=60

into my.cnf, started the server, and loaded 12G of data, which resides in 154
SST files.

Watching /proc/PID/fd/* and looking at the number of open files in
$datadir/.rocksdb, I see that *most of the time*, the number of open RocksDB
files is below the limit.

But I saw that 71 files were open at one point:
https://gist.github.com/spetrunia/17b98bfccb6f0c0721b6423f900aa1f4
(this probably happened during compaction; unfortunately I don't have SHOW
ENGINE ROCKSDB STATUS from that moment)
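
The observation method described here (looking at /proc/PID/fd and counting the entries that point into $datadir/.rocksdb) could be scripted roughly as follows. This is a sketch, not the script that was actually used: the function names are hypothetical, and it relies on the Linux convention that each entry in /proc/<pid>/fd is a symlink to the open file.

```python
import os


def count_paths_under(paths, directory):
    """Count how many of `paths` are files located inside `directory`."""
    # The trailing separator keeps "/data/.rocksdb2/x" from matching "/data/.rocksdb".
    prefix = os.path.join(os.path.abspath(directory), "")
    return sum(1 for p in paths if p.startswith(prefix))


def open_file_targets(pid):
    """Yield the targets of the symlinks in /proc/<pid>/fd (Linux only)."""
    fd_dir = f"/proc/{pid}/fd"
    for name in os.listdir(fd_dir):
        try:
            yield os.readlink(os.path.join(fd_dir, name))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
```

Running count_paths_under(open_file_targets(pid), rocksdb_dir) in a loop while the load is in progress gives a rough trace of how many RocksDB files are open at each moment.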

So

* It seems MyRocks could prevent itself from crashing by adjusting
rocksdb_max_open_files to be something like

current_OS_limit - max_connections - some_grace_amount


* There is a bit of uncertainty about some_grace_amount because RocksDB
sometimes opens more than rocksdb_max_open_files files.
(I saw 70 of 60... is it 10 files more? or 15% more?)


* TokuDB complains if one starts it while not having much disk space. Perhaps
MyRocks could also complain to stderr if it is started with a small OS limit on
open files?
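
As a worked example of the current_OS_limit - max_connections - some_grace_amount idea, here is a small sketch. The function name is hypothetical, and the grace amount is a free parameter because the thread leaves its size open.

```python
def suggested_rocksdb_max_open_files(os_limit, max_connections, grace):
    """Compute current_OS_limit - max_connections - some_grace_amount.

    Raises ValueError if nothing is left over for RocksDB, which is the case
    where a startup warning would be most useful.
    """
    budget = os_limit - max_connections - grace
    if budget < 1:
        raise ValueError(
            f"open files limit {os_limit} leaves no descriptors for RocksDB"
        )
    return budget


# 4162 is the MariaDB 10.2 limit quoted above. max_connections=151 (the stock
# default) and grace=64 are assumptions chosen only for illustration.
print(suggested_rocksdb_max_open_files(4162, 151, 64))  # prints 3947
```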

Sergey Petrunia

Nov 5, 2017, 7:18:17 AM
to MARK CALLAGHAN, MyRocks - RocksDB storage engine for MySQL
On Sat, Nov 04, 2017 at 08:27:27AM -0700, MARK CALLAGHAN wrote:
> I wish mysqld didn't crash on this error.
>
> Do we log an error message in RocksDB::Open when the value is clipped?

No. The details are here:

https://jira.mariadb.org/browse/MDEV-14047?focusedCommentId=102337&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-102337

RocksDB code has this logic: "if db_options.max_open_files is greater
than the current OS soft limit, reduce it",

but nothing is printed to stderr,

and the value of the @@rocksdb_max_open_files system variable does not reflect
the clipping.
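
To make the gap concrete, here is a sketch of clipping that also reports what it did. It is not RocksDB's code: the function name and the message text are invented, and treating -1 as "exceeds any finite limit" follows the description of the default earlier in this thread.

```python
def clip_max_open_files(requested, soft_limit):
    """Clip `requested` to `soft_limit` and say so.

    Returns (effective_value, warning); warning is None when nothing was
    clipped. A request of -1 ("unlimited") is treated as exceeding any
    finite limit.
    """
    if requested == -1 or requested > soft_limit:
        warning = (
            f"rocksdb_max_open_files={requested} exceeds the soft limit on "
            f"open files ({soft_limit}); using {soft_limit} instead"
        )
        return soft_limit, warning
    return requested, None
```

A caller could write the warning to the error log and store the effective value back into the system variable, which would address both points.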

MARK CALLAGHAN

Nov 5, 2017, 12:33:13 PM
to Sergey Petrunia, MyRocks - RocksDB storage engine for MySQL
I misread the output on my Ubuntu desktop install. The hard limit is what you state above. Please file an issue for MyRocks. I think we can improve this.

--
Mark Callaghan
mdca...@gmail.com

George O. Lorch III

Nov 6, 2017, 1:15:44 PM
to myroc...@googlegroups.com

We encountered the same issue. In fact, it was Peter Zaitsev's first MYR bug report. There are some other similar issues with MyRocks and realistic/valid defaults and min/max for some sys_vars. This issue is a difficult one because there is a fundamental problem with the MySQL server code. On startup it establishes what the open_files_limit is, but then provides no 'reservation' scheme that the various parts of the server code, plugins, and storage engines could use to share this value. As a result, each plugin/engine has to guess what it can take from the open_files_limit and can easily exceed the limit.

For reference, from the rocksdb wiki :

max_open_files -- RocksDB keeps all file descriptors in a table cache. If number of file descriptors exceeds max_open_files, some files are evicted from table cache and their file descriptors closed. This means that every read must go through the table cache to lookup the file needed. Set max_open_files to -1 to always keep all files open, which avoids expensive table cache calls.
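
The eviction behaviour in that wiki excerpt can be pictured with a toy model. Everything here is an assumption for illustration: the class name is invented, the "handles" are just strings, and least-recently-used eviction is a guess at the policy, not a statement about RocksDB's actual table cache.

```python
from collections import OrderedDict


class ToyTableCache:
    """Toy model of a table cache bounded by max_open_files (-1 = unbounded)."""

    def __init__(self, max_open_files):
        self.max_open_files = max_open_files
        self.open_files = OrderedDict()  # file name -> fake handle, oldest first
        self.opens = 0  # how many times a file had to be (re)opened

    def get(self, name):
        if name in self.open_files:
            self.open_files.move_to_end(name)  # mark as most recently used
            return self.open_files[name]
        self.opens += 1
        handle = f"fd-for-{name}"
        self.open_files[name] = handle
        if self.max_open_files != -1 and len(self.open_files) > self.max_open_files:
            self.open_files.popitem(last=False)  # "close" the least recently used
        return handle
```

With max_open_files=-1 nothing is ever evicted, so the number of open descriptors grows with the number of files touched, which is where the OS limit comes in.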

Our initial fix was pretty simple: never allow rocksdb_max_open_files to exceed the open_files_limit. That was a little short-sighted, as we now know that RocksDB can still exceed this limit when the workload demands that files are held open for real work and not just held open within the cache. This also doesn't account for files needed by the server (.frm, binlog, InnoDB, TokuDB). So we are now considering limiting rocksdb_max_open_files to some 50% of the open_files_limit if the user-specified value (or the -1 default) exceeds the open_files_limit, and allowing any explicit user-set value to be anywhere between 0 and open_files_limit to allow educated tuning.
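
The rule under consideration can be written down as a short sketch. The function name is hypothetical and the 50% figure is taken at face value from the paragraph above; this is not the patch itself.

```python
def capped_rocksdb_max_open_files(requested, open_files_limit):
    """Apply the proposed rule for choosing rocksdb_max_open_files.

    - -1 (the default) or any value above open_files_limit: use half the limit.
    - an explicit value between 0 and open_files_limit: keep it as given.
    """
    if requested == -1 or requested > open_files_limit:
        return open_files_limit // 2
    if requested < 0:
        raise ValueError("rocksdb_max_open_files must be -1 or non-negative")
    return requested
```

With the FB/MySQL-5.6 limit of 5000 mentioned earlier, the default of -1 would become 2500, while an explicit setting of 4000 would be left alone.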

This is of course a simple band-aid for a much bigger issue within the MySQL code, but it should allow us to provide a product that works out of the box, without users needing to know that they have to tune everything to their environment in order to avoid a crash. To quote Peter Zaitsev - "My main point in this case - I install Percona Server with RocksDB in the default install and I run trivial workload (sysbench) and it crashes. This is not acceptable user behavior. We should not expect users to understand details such as file descriptor limits.". I agree with this 100%.

-- 
George O. Lorch III
Senior Software Engineer, Percona
US/Arizona (GMT -7)
skype: george.ormond.lorch.iii

MARK CALLAGHAN

Nov 6, 2017, 1:48:39 PM
to George O. Lorch III, MyRocks - RocksDB storage engine for MySQL
I have been hurt by this too and lost days of performance tests when MyRocks crashed. So I want us to avoid the crash, but I don't think we can expect users to never understand the impact of max_open_files. There are more than a few my.cnf options they have to understand when they care about performance. Hopefully MyRocks doesn't increase that list too much, although we have work to do in that regard.

Do we have an open issue for this?

--
Mark Callaghan
mdca...@gmail.com

George O. Lorch III

Nov 6, 2017, 2:13:04 PM
to myroc...@googlegroups.com

I/we have an umbrella task in MYR for sensible defaults and limits, with the intention of sending the resulting changes to fb-mysql via pull request. This issue is included in that set. It does make sense to have a GitHub issue for this, so I created one here: https://github.com/facebook/mysql-5.6/issues/758. We should probably move any more spit-balling on the technical merits and possible fixes over to that issue.

Tuning for performance and out-of-the-box operation are two different tasks that do intersect. Out-of-the-box settings are expected to always just work, even at the cost of leaving some performance on the table, if that prevents "likely to hit" known problems such as this one. This is where a good tuning guide comes in, one that covers the "Top <n> options to tune MyRocks to your hardware and workload". All tuning must take into account things that exist 'outside' MyRocks, such as whether users are experimenting with multiple engines, replication, etc. For example, we wouldn't want someone to try to tune MyRocks in a vacuum on a machine where they have tuned InnoDB to take 80% of available memory. We are working on such a guide to include with our documentation as we understand and learn more about what is important.

-- 
George O. Lorch III
Senior Software Engineer, Percona
US/Arizona (GMT -7)
skype: george.ormond.lorch.iii