Rocksdb_no_file_closes in MariaDB MyRocks


Jonas Krauss

Nov 19, 2018, 1:58:31 PM
to MyRocks - RocksDB storage engine for MySQL
Hi all,

I am trying to put a MariaDB MyRocks database into production and am currently facing problems with the memory usage of MyRocks. I haven't been able to identify one specific cause, but the following links contained some useful information:

https://github.com/facebook/rocksdb/issues/3216

https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB

When we put the database into production (an OLTP-ish workload) on a machine with 64G of RAM and a block cache configured at 16G, the mysqld process slowly accumulates more and more memory until swap is used. This is obviously not suitable for production.

Currently I am wondering about a particular statistic, the number of opened and closed files. In particular, why does Rocksdb_no_file_closes stay at zero at all times while Rocksdb_no_file_opens slowly increases over time? Does anybody know why that is? Could it be responsible for the steady memory increase?

I experimented with the parameter rocksdb_max_open_files: it was set to -1, and I also tried 1536, 512 and 128. The lower the value, the faster Rocksdb_no_file_opens increases (which seems right). However, I see no change in Rocksdb_no_file_closes at any setting.
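
A minimal sketch of how the two counters can be watched side by side, assuming a local mysql client with credentials available (the interval is arbitrary):

# show both tickers once
mysql -e "SHOW GLOBAL STATUS LIKE 'Rocksdb_no_file_%'"

# re-run every 60 seconds to see how fast they move
watch -n 60 "mysql -e \"SHOW GLOBAL STATUS LIKE 'Rocksdb_no_file_%'\""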

Running MariaDB 10.2.18.

Here are the RocksDB-related settings from my.cnf:

# MyRocks
plugin-load-add                    = ha_rocksdb.so
default-storage-engine                = rocksdb
default-tmp-storage-engine            = MyISAM
transaction-isolation                = READ-COMMITTED
rocksdb_unsafe_for_binlog            = 1 # enables statement based replication
rocksdb_datadir                    = /var/local/mysql/rocksdb
rocksdb_wal_dir                    = /var/mysql_logs/rocksdb
rocksdb_tmpdir                    = /var/mysql_logs/rocksdb

rocksdb_flush_log_at_trx_commit            = 0
rocksdb_use_direct_io_for_flush_and_compaction    = 0
rocksdb_use_direct_reads            = 0

rocksdb_max_open_files                = 512
rocksdb_max_background_jobs            = 8
rocksdb_max_total_wal_size            = 4G
rocksdb_block_size                = 64K
rocksdb_block_cache_size            = 4G
rocksdb_table_cache_numshardbits        = 6
rocksdb_new_table_reader_for_compaction_inputs    = 1
rocksdb_compaction_readahead_size        = 4M
#rocksdb_db_write_buffer_size            = 0 # max write buffer across all column families, zero = disabled

# rate limiter
rocksdb_bytes_per_sync                = 4M
rocksdb_wal_bytes_per_sync            = 4M
rocksdb_rate_limiter_bytes_per_sec        = 80M # MB/s. Increase if you're running on higher spec machines

# triggering compaction if there are many sequential deletes
rocksdb_compaction_sequential_deletes_count_sd    = 1
rocksdb_compaction_sequential_deletes        = 199999
rocksdb_compaction_sequential_deletes_window    = 200000

# read free replication
#rocksdb_rpl_lookup_rows=0

rocksdb_default_cf_options=write_buffer_size=128m;max_write_buffer_number=4;target_file_size_base=256m;max_bytes_for_level_base=2560m;target_file_size_multiplier=2;level0_file_num_compaction_trigger=4;level0_slowdown_writes_trigger=10;level0_stop_writes_trigger=15;compression=kSnappyCompression;bottommost_compression=kZlibCompression;compression_opts=-14:1:0;block_based_table_factory={cache_index_and_filter_blocks=1;filter_policy=bloomfilter:10:false;whole_key_filtering=1};level_compaction_dynamic_level_bytes=true;optimize_filters_for_hits=true;compaction_pri=kMinOverlappingRatio


Yoshinori Matsunobu

Nov 19, 2018, 2:06:00 PM
to Jonas Krauss, MyRocks - RocksDB storage engine for MySQL
How many SST files do you have on your instance? SST files are stored under rocksdb_datadir.
Could you try rocksdb_max_open_files=-1 and see if the leak reproduces? We generally recommend keeping the number of SST files under control (fewer than 65536) and setting rocksdb_max_open_files=-1.
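
A quick sketch for counting them, assuming the rocksdb_datadir from the my.cnf above:

ls /var/local/mysql/rocksdb/*.sst | wc -l    # number of SST files
du -sh /var/local/mysql/rocksdb              # total size on disk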

- Yoshinori


Jonas Krauss

Nov 19, 2018, 4:18:26 PM
to MyRocks - RocksDB storage engine for MySQL
We have 1742 SST files, totaling 345G across five levels. That should be okay, right?

As I wrote, with rocksdb_max_open_files set to -1 we unfortunately also saw the steady increase in memory usage; it was actually still at the default (-1) when we first noticed the behavior.

Do you have any information on why Rocksdb_no_file_closes stays at zero?
 

Yoshinori Matsunobu

Nov 19, 2018, 4:40:46 PM
to Jonas Krauss, MyRocks - RocksDB storage engine for MySQL
We use over 30k SST files with rocksdb_max_open_files=-1 (though not on MariaDB) and don't have any memory leak issues.
- How much memory (RSZ) did mysqld grow to? Did it eventually cause an OOM?
- Which memory allocator did you use? We recommend jemalloc.
- Is it possible for you to test with Percona Server and see if the memory leak reproduces?
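
A sketch of one way to check which allocator mysqld is actually running with (run as root or the mysql user, assuming a single mysqld process; glibc malloc shows nothing here, while jemalloc/tcmalloc show up as mapped libraries):

grep -E 'jemalloc|tcmalloc' /proc/$(pidof mysqld)/maps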


- Yoshinori


Sergey Petrunia

Nov 19, 2018, 8:04:35 PM
to Jonas Krauss, MyRocks - RocksDB storage engine for MySQL
Hi Jonas,
On Mon, Nov 19, 2018 at 01:18:26PM -0800, Jonas Krauss wrote:
> We have 1742 SST files, with a total of 345G in five levels, should be okay?
>
> As I wrote, setting rocksdb_max_open_files to -1 unfortunately also saw the
> steady increase in memory usage, I actually left this at the default (-1)
> when we recognized the observed behavior.
>
> Do you have some info why rocksdb_no_file_closes is staying zero?
>
I'm looking at the source code:

* Rocksdb_no_file_closes is counted inside RocksDB; MariaDB only passes the counter value through to the user.

* In RocksDB, the only mentions of NO_FILE_CLOSES are:

include/rocksdb/statistics.h|144| NO_FILE_CLOSES,
include/rocksdb/statistics.h|387| {NO_FILE_CLOSES, "rocksdb.no.file.closes"},
java/rocksjni/portal.h|3204| case rocksdb::Tickers::NO_FILE_CLOSES:
java/rocksjni/portal.h|3408| return rocksdb::Tickers::NO_FILE_CLOSES;
java/src/main/java/org/rocksdb/TickerType.java|262| NO_FILE_CLOSES((byte) 0x2F),

I don't see a single place where the ticker is incremented.
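
For reference, the search was essentially the following (a sketch, assuming a checkout of the RocksDB sources):

grep -rn "NO_FILE_CLOSES" --include="*.h" --include="*.cc" --include="*.java" .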

Let's take some other counter, NUMBER_DB_SEEK_FOUND. The search does find calls
to increment its value:
db/db_iter.cc|1302| RecordTick(statistics_, NUMBER_DB_SEEK_FOUND);
db/db_iter.cc|1353| RecordTick(statistics_, NUMBER_DB_SEEK_FOUND);
db/db_iter.cc|1397| RecordTick(statistics_, NUMBER_DB_SEEK_FOUND);
db/db_iter.cc|1443| RecordTick(statistics_, NUMBER_DB_SEEK_FOUND);
include/rocksdb/statistics.h|138| NUMBER_DB_SEEK_FOUND,
include/rocksdb/statistics.h|383| {NUMBER_DB_SEEK_FOUND, "rocksdb.number.db.seek.found"},
...

I've also checked
* the version of RocksDB that is used by facebook/mysql-5.6
* the latest revision of RocksDB on github.

It's all the same - I don't see any calls to increment NO_FILE_CLOSES. This is
probably a bug in RocksDB. But I guess it's not related to the high memory
usage issue you're observing.



--
BR
Sergei
--
Sergei Petrunia, Software Developer
MariaDB Corporation | Skype: sergefp | Blog: http://s.petrunia.net/blog


Jonas Krauss

Nov 20, 2018, 3:37:55 AM
to MyRocks - RocksDB storage engine for MySQL
Sergei, thanks for this information; it is very helpful for ruling this out as a cause.

Yoshinori:
- How much memory (RSZ) did mysqld use up to? Did it cause OOM eventually?
Virtual memory exceeded 80G and swap was used up to ~50% of the available 32G before I shut down the server. So an OOM did not actually occur, but I did not want to risk data loss by waiting longer. Swappiness is set to 1; I am not sure whether that has an influence. Reserved memory was in the 60G range; unfortunately I do not have the exact number, but I think it was higher than the available ~62G. I will run another test today and observe closely. Will post the findings here.
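
A minimal sketch for logging mysqld memory use over time during such a test (ps reports VSZ and RSS in KiB; the interval and log path are arbitrary):

while true; do
    echo "$(date +%F_%T) $(ps -o vsz=,rss= -p "$(pidof mysqld)")" >> /tmp/mysqld_mem.log
    sleep 60
done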

- What memory allocator did you use? We recommend jemalloc.
I have not looked into that so far, but I am willing to give it a try. I do not have experience switching the malloc, but I will probably figure it out. Will test this and let you know.

- Is it possible for you to test with Percona Server and see if the memory leak reproduces?
I do not think this is possible, as we use GTID-based replication, which is incompatible as far as I recall. I could switch replication to traditional mode and try, but I would only do that as a last resort after trying other options first.



MARK CALLAGHAN

Nov 20, 2018, 11:09:53 AM
to jkra...@gmail.com, myroc...@googlegroups.com
Thank you for a nice bug report.

I filed a bug for the problem Sergey reported with NO_FILE_CLOSES - https://github.com/facebook/rocksdb/issues/4700

At the risk of being pedantic, I am not sure this is a leak; it might just be much more memory used than expected. And there are potentially two problems:
1) memory from keeping all files open (rocksdb_max_open_files=-1), which can be fixed by setting that to a value larger than 0.
2) extra memory lost to malloc fragmentation. The workaround for this is to use jemalloc or tcmalloc: with glibc malloc, RSS for MyRocks is usually 2X larger than with jemalloc/tcmalloc, because RocksDB is an allocator fragmentation stress test. See http://smalldatum.blogspot.com/2018/04/myrocks-malloc-and-fragmentation-strong.html.
The allocator can be set in my.cnf - https://dev.mysql.com/doc/refman/8.0/en/mysqld-safe.html#option_mysqld_safe_malloc-lib

I would prefer that you confirm whether 1) and 2) above resolve the problem before considering this a memory leak.
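
A sketch of the my.cnf change for 2), assuming Debian and the stock jemalloc package; the library path is an assumption and may differ per distribution (check with dpkg -L libjemalloc1). If the server is started via systemd rather than mysqld_safe, an LD_PRELOAD override in the unit file achieves the same thing.

# in my.cnf:
#   [mysqld_safe]
#   malloc-lib = /usr/lib/x86_64-linux-gnu/libjemalloc.so.1   # path is an assumption

# after a restart, verify jemalloc is mapped into the process:
grep jemalloc /proc/$(pidof mysqld)/maps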




--
Mark Callaghan
mdca...@gmail.com

MARK CALLAGHAN

Nov 20, 2018, 11:11:10 AM
to jkra...@gmail.com, myroc...@googlegroups.com
Forgot to mention that lsof can be used to confirm that rocksdb_max_open_files works as expected.
http://manpages.ubuntu.com/manpages/xenial/man8/lsof.8.html
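
For example, a sketch counting the .sst files mysqld currently holds open:

lsof -p "$(pidof mysqld)" | grep -c '\.sst'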
--
Mark Callaghan
mdca...@gmail.com

Jonas Krauss

Nov 20, 2018, 11:26:31 AM
to mdca...@gmail.com, myroc...@googlegroups.com
Hi all,

thanks again for the very helpful comments.

We are currently running the server in production with rocksdb_max_open_files=512, rocksdb_block_cache_size=4G and jemalloc. Memory usage stays around 8G-9G, which is fine considering we still have other storage engines such as InnoDB claiming their own share of memory.

My feeling is that the problem was caused by the default memory allocator that ships with Debian. I will restart the server later today or tomorrow with the original settings (rocksdb_max_open_files=-1 and rocksdb_block_cache_size=16G) while continuing to use jemalloc. I will post the results here too, but I am confident this solves the issue.

MARK CALLAGHAN

Nov 20, 2018, 11:35:51 AM
to jkra...@gmail.com, myroc...@googlegroups.com
That is good news. One thing I forgot to ask is whether you were concerned about VSZ or RSS when looking at mysqld memory usage. With jemalloc, VSZ can be much larger than RSS, but RSS is the metric to watch. It took me some time to adjust to that. From an older post:
* glibc malloc has high VSZ and high RSS
* jemalloc has high VSZ and low RSS
* tcmalloc has low VSZ and low RSS

See http://smalldatum.blogspot.com/2015/10/myrocks-versus-allocators-glibc.html

I assume this is a result of how tcmalloc and jemalloc return memory to the OS.
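
A one-off check showing both numbers for mysqld (a sketch; ps reports VSZ and RSS in KiB):

ps -o pid,vsz,rss,comm -p "$(pidof mysqld)"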




--
Mark Callaghan
mdca...@gmail.com

Jonas Krauss

Nov 21, 2018, 7:36:50 AM
to MyRocks - RocksDB storage engine for MySQL
Mark, I was concerned about the RSS; it was growing beyond the available RAM. I had noticed that the VSZ was already exceeding the available RAM by far.

We have had the server running with a 16G block cache and max open files = -1 since 9am this morning (Europe/Berlin tz), and RSS stays at around 21G :) So I would say it is almost certain that the default memory allocator was the reason for the steady increase in reserved memory, and this issue can be considered solved by switching to jemalloc.

Thanks again for your help @all!