IO considerations in OpenDremel


Camuel Gilyadov

Mar 18, 2011, 3:33:02 PM
to opend...@googlegroups.com
OpenDremel aims at processing data at a few gigabytes per second, assuming the processing is light (simple data filtering).

Naive file IO in Java will underperform by a factor of ~100 due to redundant buffering and data copying, as well as object life-cycle management overhead (creating objects and then garbage-collecting them; when you aim to process billions of objects per second, that overhead adds up). So a few well-known tricks will be necessary: mmap to achieve zero-copy IO, and working with data directly through mapped byte arrays (and word, dword, and qword arrays) to avoid object life-cycle management overhead.

Before we reinvent the wheel :) let's invest some time studying what other Apache (or other) projects have already done on the issue. From my brief checks: the Hadoop/HBase/HDFS file IO strategy has so far been inconclusive, I need to find more time for that. David, do you have a clue how HDFS gets data off disk into RAM? Cassandra research revealed a strategy of memory-mapping (my guess) and using JNA to call mlockall and probably madvise. David is on top of it now; if it turns out to be a nice, independent piece of code, we will just move the whole module verbatim into Dremel.

So, what are the tricks, in simple words:

1. Leverage the 64-bit address space and access files just as RAM regions.
2. Leverage the huge RAM modern servers have and the OS's caching mechanism, so that all free memory caches files transparently (or mostly transparently).
3. Hint to the OS (with the madvise syscall) which read-ahead strategy is best, using JNA for example (JNA is humane JNI).
4. Instruct the OS not to evict the JVM's own pages from RAM. When going crazy with memory-mapping every file, the OS may evict fragments of the JVM, and then "bad things" will happen. This is achieved with the mlockall syscall, also called via JNA.
5. Care must be taken not to lose portability with such tricks; once again, the Cassandra code could be consulted for that.
6. The goal is a multi-gigabyte scan rate.
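Tricks 1 and 2 can be sketched in plain Java without any native calls: map the file once, then scan it through absolute dword reads, with no per-record object creation and no copy into Java-side buffers. The file and its layout below are made up for the demo.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MmapScan {
    public static void main(String[] args) throws IOException {
        // Write a small test file of 32-bit integers (a stand-in for a tablet).
        Path file = Files.createTempFile("tablet", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "rw")) {
            for (int i = 0; i < 1024; i++) raf.writeInt(i);
        }
        // Map the file and scan it word by word: no read() calls, no
        // per-record allocation, no extra copy into a Java byte array.
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            int words = buf.remaining() / 4;
            for (int i = 0; i < words; i++) {
                sum += buf.getInt(i * 4); // direct dword access, zero-copy
            }
        }
        System.out.println(sum); // sum of 0..1023 = 523776
        Files.delete(file);
    }
}
```

Tricks 3 and 4 (madvise, mlockall) have no stdlib equivalent and would still need JNA or similar.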

Haven't looked yet into other projects, but any reference will be most welcome.

Some relevant threads:








Constantine Peresypkin

Mar 18, 2011, 7:54:32 PM
to dremel
A quick look at the Cassandra code reveals not so many tricks we can use:

1. Locking and unlocking of memory regions; I suppose it's used only for the code itself.
2. posix_fadvise is used to notify the OS about the access pattern in advance.
3. mmap (which should be used anyway, and AFAIK Java has all the facilities for that).

I think that locking the JVM is not that important; Cassandra uses it because they map whole files and need random access.
The nature of Dremel dictates "sequential access only", therefore we can achieve it by mmapping only local column blocks (data + levels) and not the whole column.
Maybe there is no need to mmap the "level" blocks/files at all.
I understand that we will need to scan local files much bigger than local RAM, therefore the amount of caching we put there is really irrelevant.
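The block-level mapping described above can be sketched like this, assuming block offsets and lengths come from some tablet block index (hypothetical here): `FileChannel.map` takes an offset and a length, so one column block can be mapped without mapping the whole column file.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class MapColumnBlock {
    // Map just one block of a column file; in a real tablet the offset and
    // length would come from the block index (made up for this demo).
    static MappedByteBuffer mapBlock(FileChannel ch, long offset, long length)
            throws IOException {
        return ch.map(FileChannel.MapMode.READ_ONLY, offset, length);
    }

    public static void main(String[] args) throws IOException {
        Path col = Files.createTempFile("column", ".dat");
        try (RandomAccessFile raf = new RandomAccessFile(col.toFile(), "rw")) {
            for (int i = 0; i < 256; i++) raf.writeInt(i); // 1024-byte file
        }
        try (FileChannel ch = FileChannel.open(col)) {
            // Map only the second 256-byte block (ints 64..127), not the file.
            MappedByteBuffer block = mapBlock(ch, 256, 256);
            System.out.println(block.getInt(0)); // first int in the block = 64
        }
        Files.delete(col);
    }
}
```

The JVM handles page alignment of the offset internally, so block boundaries need not be page-aligned.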
> http://www.mail-archive.com/common-...@hadoop.apache.org/msg02456.html
> https://issues.apache.org/jira/browse/CASSANDRA-1214

Camuel Gilyadov

Mar 18, 2011, 10:18:07 PM
to opend...@googlegroups.com, Constantine Peresypkin
Well, we see a few possible caching strategies:

1. When we have a 100% dedicated physical server for the Tableton component, there is no point in conserving RAM, and mapping whole files is a winning strategy, in the expectation that some tablets will get cached and reused. If none are, nothing bad happens from overusing mmap.

2. Constantine, you are right that a typical OpenDremel deployment will scan huge amounts of data, but that does not necessarily mean the amount of data to be scanned per tenant doesn't fit in the RAM of a single server. It all depends on how many servers are in the cluster. I think in the majority of typical deployments it will fit in RAM, and by a large margin. We aim for interactive querying and sub-second response (or just a few seconds at worst). Even with multi-GB/sec scan rates, the scanned data per tenant per query is less than 10 GB in the worst case. Modern servers have a few tens of GB of RAM nowadays. So caching data in all available free memory could be a highly beneficial strategy for repetitive queries: many tablets will end up being served from RAM on subsequent queries.

3. When Dremel shares its server with other software and it is obvious that heavy cache thrashing will occur (the active dataset is much larger than free RAM), going crazy with mmap may be counterproductive, and managing a smaller cache internally may be a much better idea. In this case all files would be mapped into the same window in memory: each time, only some fragment of the required file is mapped into the window, and the cache is managed explicitly in pre-allocated memory only.

Which strategy we should choose?

I think it is safe to assume that the server is 100% dedicated to Tableton, that the data to be scanned per query per node per tenant fits entirely and easily into the RAM of a single server, and that most queries are repetitive. Taking all these assumptions, locking JVM memory and going fully crazy with mmapping all data files until we run out of virtual address space is the way to go.

What do you think?

David, any inputs here?

Constantine Peresypkin

Mar 19, 2011, 4:05:41 AM
to dremel
I think we can work with a configurable "decision support max RAM allowance" parameter or something.
Setting it to various values may give some insight into what will happen on real nodes.

And after speaking with David yesterday, I think we can also try to implement I/O inside the compiled jobs (joblets) themselves.
This way we can avoid calling methods like "get next slice" for each and every slice; that's a lot of overhead.

Constantine

San Nguyen Hong

Mar 19, 2011, 4:19:59 AM
to opend...@googlegroups.com
Hi all,

Actually, my new version of the compiler works directly with David's tablet I/O, without using TabletIterator. Please check the latest compiler code (run CompilerImpl to test).

San.


David Gruzman

Mar 19, 2011, 4:23:31 AM
to opend...@googlegroups.com
Hi,
I see the picture as follows:
a) We need high-speed sequential access to the tablet's data. I have done a benchmark reading 100 MB of data from a file cached by the OS. A regular FileInputStream gave 1 GB/sec, while Cassandra's MappedFileDataInput gave 2 GB/sec on my i7 920.
b) As pointed out above, our access is always sequential. I think read-ahead is default operating system behaviour, but to test real performance we would need a high-performance drive or RAID capable of providing 2 GB/sec.
c) Caching strategy, I think, is less critical, since a tablet is expected to be much smaller than the RAM. A tablet is planned to be 128 MB. We can expect that a dedicated Tableton server will scan one tablet per core at a time. I think that 128 MB of RAM per core dedicated to tablet caching is a very reasonable requirement.
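A rough harness in the spirit of the benchmark in (a), not David's actual code: the same OS-cached file read once through FileInputStream and once through a mapped buffer. Absolute numbers depend on the machine; the point is that the mapped path avoids the copy into the Java-side array.

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class ScanBenchmark {
    static final int SIZE = 16 << 20; // 16 MB test file

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("bench", ".dat");
        Files.write(file, new byte[SIZE]); // also warms the page cache

        // Stream read: every byte is copied from the page cache into the
        // Java-side array on each read() call.
        long t0 = System.nanoTime();
        long total = 0;
        try (FileInputStream in = new FileInputStream(file.toFile())) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) total += n;
        }
        long streamNs = System.nanoTime() - t0;

        // Mapped read: the page cache is accessed in place, no extra copy.
        t0 = System.nanoTime();
        long sum = 0;
        try (FileChannel ch = FileChannel.open(file)) {
            MappedByteBuffer map = ch.map(FileChannel.MapMode.READ_ONLY, 0, SIZE);
            for (int i = 0; i < SIZE; i += 8) sum += map.getLong(i);
        }
        long mmapNs = System.nanoTime() - t0;

        System.out.println(total == SIZE && sum == 0); // sanity check
        System.out.printf("stream: %d ms, mmap: %d ms%n",
                streamNs / 1_000_000, mmapNs / 1_000_000);
        Files.delete(file);
    }
}
```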

David

Constantine Peresypkin

Mar 19, 2011, 4:52:54 AM
to dremel
> Actually, my new version of compiler work directly with tablet I/O

Yep, sorry, I've seen it, but didn't look into the code.
Anyway, the main advantage of getting Tableton code inside the compiled Janino job/query should come from not calling any iteration methods, save for someBuffer[i] or something similar.
I doubt it can be achieved right now...



Constantine Peresypkin

Mar 19, 2011, 4:56:05 AM
to dremel
Did you read with various read buffer sizes?

Constantine Peresypkin

Mar 19, 2011, 5:02:04 AM
to dremel
And one more point: Cassandra reads and writes with native endianness.
I have a feeling that if you use non-native endianness, performance will be the same as FileInputStream.
Native endianness means that data created on a little-endian machine cannot be read on a big-endian one = no portability.
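A small illustration of the portability point: the same four bytes decode to different integers under the two byte orders, so data written with nativeOrder() on x86 would be misread on a big-endian machine unless converted first.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class Endianness {
    public static void main(String[] args) {
        // The same four bytes, read as an int under each byte order.
        byte[] raw = {0x01, 0x00, 0x00, 0x00};

        int big = ByteBuffer.wrap(raw).order(ByteOrder.BIG_ENDIAN).getInt();
        int little = ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).getInt();

        System.out.println(big);     // 0x01000000 = 16777216
        System.out.println(little);  // 0x00000001 = 1

        // nativeOrder() reports which of the two this machine uses; files
        // written with it are only readable on machines with the same order.
        System.out.println(ByteOrder.nativeOrder() == ByteOrder.LITTLE_ENDIAN
                || ByteOrder.nativeOrder() == ByteOrder.BIG_ENDIAN);
    }
}
```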


Camuel Gilyadov

Mar 19, 2011, 7:29:28 AM
to opend...@googlegroups.com
I think that for performance reasons the QP (I liked "joblet" to denote a single-tablet QP) must be a piece of Java code that doesn't call anything. Everything is inlined... Hey, the code is generated anyway, why care about modularity? The computer doesn't care about modularity. Of course, "new" is not allowed either.

Also, if we develop this idea further, we will see that all the code in OpenDremel, even when run outside Janino, is query-compile-time, i.e. code that runs once per query. And all code that is query-run-time, i.e. code running per-tablet rather than per-query, runs entirely inside the Janino executor.

So... the Tableton and dataset APIs should actually return closure-like code pieces rather than do any real work on the caller thread.

The only work done at query-run-time may be mmapping tablet files into RAM.
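A hand-written stand-in, not real generated output, for what such an inlined joblet body might look like: one tight loop over the mapped tablet, no iterators, no `new`, the filter predicate inlined directly into the loop.

```java
import java.nio.ByteBuffer;

public class GeneratedJoblet {
    // What the query compiler might emit for "SELECT COUNT(*) WHERE value > 100":
    // a single loop with no per-record allocation and no virtual calls,
    // only absolute buffer reads (which the JIT intrinsifies).
    static long scan(ByteBuffer tablet, int records) {
        long count = 0;
        for (int i = 0; i < records; i++) {
            int value = tablet.getInt(i * 4); // column value, read in place
            if (value > 100) count++;         // predicate inlined, no call
        }
        return count;
    }

    public static void main(String[] args) {
        // Stand-in for a mapped tablet: 200 ints with values 0..199.
        ByteBuffer tablet = ByteBuffer.allocate(200 * 4);
        for (int i = 0; i < 200; i++) tablet.putInt(i * 4, i);
        System.out.println(scan(tablet, 200)); // values 101..199 pass: 99
    }
}
```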

What do you think?

--
Sent from my mobile device

Camuel Gilyadov

Mar 19, 2011, 7:41:01 AM
to opend...@googlegroups.com
Regarding endianness: portability should mean the ability to convert endianness to match the Tableton server once, as a batch job, and not on every query scan.

Camuel Gilyadov

Mar 19, 2011, 7:51:02 AM
to opend...@googlegroups.com
Regarding configurable buffer sizes, David, Constantine, et al.

As far as I know, mmap just uses all free OS memory as a buffer cache for mmapped files. There is no way to control it in any sensible manner. Am I right? So by configuration we can only control the address space, not the allocated memory itself. Moreover, memory the OS uses for mmap is reported as "free", so it is not allocated to any process or system in the normal sense of the word.

So the question becomes: do we want to limit the address space used for mmapping and make it configurable? Otherwise we can just mmap all files; the 64-bit address space will allow this.

My suggestion is to just do the same thing the Cassandra folks are doing, or even better, to reuse their code together with the configuration.

David, regarding the 128 MB buffer-per-core strategy, I cannot see how it can be efficient. If the system has 32 GB of RAM and 8 cores, why not use the extra RAM to cache tablets? As I already wrote, on repetitive queries the gain will be substantial, because all the required tablets will already be found in memory, even for a few tenants.

All this of course doesn't apply to Metaxa, where we are already discussing the future full-blown Dremel architecture.

Constantine Peresypkin

Mar 19, 2011, 8:54:38 AM
to dremel


On Mar 19, 1:51 pm, Camuel Gilyadov <cam...@gmail.com> wrote:
> Regarding configurable buffer sizes, David, Constantine, et. al.
>
> As far as I know mmap just uses all free memory of OS as buffer-cache for
> mmaping files. There is no way to control it in any sensible manner. Am I
> right?

Java direct ByteBuffers are limited to 2 GB on any platform, AFAIK.
Therefore, if you want to map more, you need to wrap them in a class that accepts long offsets.
Anyway, just as a hint: I could not allocate even a 1024 MB byte buffer on a 4 GB 32-bit machine (just tested).
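One possible shape for such a long-offset wrapper, sketched with tiny segments for the demo: the file is mapped as an array of buffers (each under the 2 GB limit), and a long offset is split into a segment index and an offset within the segment. Multi-byte reads that straddle a segment boundary are not handled in this sketch.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

public class LongMappedFile {
    // Map a file as fixed-size segments and address it with a long offset,
    // working around the 2 GB (Integer.MAX_VALUE) limit of a single
    // MappedByteBuffer. Segment size is tiny here only for the demo.
    private final MappedByteBuffer[] segments;
    private final long segmentSize;

    LongMappedFile(FileChannel ch, long segmentSize) throws IOException {
        this.segmentSize = segmentSize;
        long size = ch.size();
        int n = (int) ((size + segmentSize - 1) / segmentSize);
        segments = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long off = i * segmentSize;
            segments[i] = ch.map(FileChannel.MapMode.READ_ONLY, off,
                    Math.min(segmentSize, size - off));
        }
    }

    byte get(long offset) {
        // Split the long offset into (segment, offset within segment).
        return segments[(int) (offset / segmentSize)]
                .get((int) (offset % segmentSize));
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("big", ".dat");
        byte[] data = new byte[1000];
        for (int i = 0; i < data.length; i++) data[i] = (byte) (i % 127);
        Files.write(file, data);
        try (FileChannel ch = FileChannel.open(file)) {
            LongMappedFile map = new LongMappedFile(ch, 256); // 256-byte segments
            System.out.println(map.get(700L)); // data[700] = 700 % 127 = 65
        }
        Files.delete(file);
    }
}
```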


> So by configuration we can only control the address-space not the
> allocated memory itself. Moreover, memory OS uses for mmap is reflected as
> "free" so it is not allocated to any process or system in normal sense of
> the word.

Yep, this is also a good way to avoid the GC collecting your tablet data.

> So the question becomes do we want to limit address-space used for mmaping
> and make it configurable? Otherwise we can just mmap all files... 64b
> address space will allow this.
>
> My suggestion is just do the same thing Cassandra folks are doing.... even
> better to reuse their code together with the configuration.

This is also interesting.
I've tested Cassandra just now, and for a sufficiently large dataset (512 MB only) it behaves exactly the same as a plain MappedByteBuffer (which is no wonder, because it has exactly that buffer inside).
And for the read pattern "4-byte size + n bytes of data" it gets me around 180 MB/sec on a cached file, which is an OK, expected figure.

And this is why I want to see how David got his 2 GB/sec.



Constantine Peresypkin

Mar 19, 2011, 9:14:27 AM
to dremel
OK, I see now.
If the block size for one read is big enough, it goes over 1 GB/sec.
It reaches max speed at 8-16 KB blocks.
This means that the compiled query should absolutely read from mmapped files by itself; otherwise we will need to manage intermediate caches/buffers of 8 KB or more.



Camuel Gilyadov

Mar 19, 2011, 9:17:03 AM
to opend...@googlegroups.com
On Sat, Mar 19, 2011 at 2:54 PM, Constantine Peresypkin <pconst...@gmail.com> wrote:


On Mar 19, 1:51 pm, Camuel Gilyadov <cam...@gmail.com> wrote:
> Regarding configurable buffer sizes, David, Constantine, et. al.
>
> As far as I know mmap just uses all free memory of OS as buffer-cache for
> mmaping files. There is no way to control it in any sensible manner. Am I
> right?

Java direct ByteBuffers are limited to 2Gb, on any platform AFAIK.
Therefore if you want to map more you need to wrap them in superclass
that accepts long offsets.
Anyway, just as a hint: I could not allocate even 1024Mb byte buffer
on 4Gb 32bit machine anyway (just tested).

That's not a problem, since tablets are 64 MB, 128 MB, or 256 MB, always smaller than 2 GB. And no file contains data from multiple tablets, so all the files to be mapped are significantly smaller than 2 GB. So you just keep a list of mapped files....

My suggestion is just to mmap all available files, create a list of these pre-mmapped files, and not waste any more time mid-query on mapping operations. Simple, elegant, and efficient; it automatically leverages all installed RAM, in a configuration-free fashion. You cannot ask for more :). The only problem is that it competes with other processes for free RAM and can thrash the caches of the JVM and of other processes. For the JVM that is a solved problem with mlockall. For other processes, I don't know; maybe there is some way to control it through Linux-only devops, like limiting the process and giving it a quota.
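A sketch of the map-everything-up-front idea, with hypothetical names and directory layout: every tablet file under a directory is mapped once at startup, and queries afterwards only index into the resulting table. Note that in Java the mappings stay valid even after the channels are closed.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.stream.Stream;

public class TabletMapper {
    // Map every tablet file under a directory up front, so that query-time
    // access never performs an mmap call. Names and layout are made up.
    static Map<Path, MappedByteBuffer> mapAll(Path dir) throws IOException {
        Map<Path, MappedByteBuffer> tablets = new HashMap<>();
        try (Stream<Path> files = Files.list(dir)) {
            for (Path p : (Iterable<Path>) files::iterator) {
                try (FileChannel ch = FileChannel.open(p)) {
                    // Tablets are well under 2 GB, so one buffer per file works.
                    tablets.put(p, ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size()));
                }
            }
        }
        return tablets; // mappings remain valid after the channels close
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("tablets");
        Files.write(dir.resolve("t0.dat"), new byte[]{42});
        Files.write(dir.resolve("t1.dat"), new byte[]{7});
        Map<Path, MappedByteBuffer> tablets = mapAll(dir);
        System.out.println(tablets.size());
        System.out.println(tablets.get(dir.resolve("t0.dat")).get(0));
    }
}
```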

Constantine Peresypkin

Mar 19, 2011, 10:27:10 AM
to opend...@googlegroups.com
We will have another problem when mmapping too many files: the page mapping table in the kernel will go wild.
If the files are 128 MB in size, we will need 1024 mmaps to cover 128 GB, and multiply that by one file per level...
I suppose mapping the files in 2 GB segments is the way to go on 64-bit; we can also just use a uint64 offset and map everything as one big array.
I think these guys use that approach: http://www.terracotta.org/bigmemory

Camuel Gilyadov

Mar 19, 2011, 11:10:06 AM
to opend...@googlegroups.com
If you are talking about the TLB, it doesn't matter how many files there are: in the TLB, every 4 KB page occupies an entry. Though you can configure the OS page size, I think we won't go that far for sure. A mmapped file of course occupies another entry in some kernel structure, maybe a structure holding file descriptors, the one controllable with "ulimit -n".

Regarding the TLB, I think it will do OK with terabytes of mapped file contents, since it is multi-level, with hardware support inside modern processors, etc.; let's not worry about that.

Oleg, can you provide some input on whether there are any problems with having, say, 10 TB of memory-mapped file contents? Will the TLB be OK? Is there a chance of exploding any kernel structures?

Constantine Peresypkin

Mar 19, 2011, 11:53:27 AM
to dremel
I'm talking about memory protection techniques. I haven't checked yet, but I remember something vague about context switches when you have many mappings.
If the next process has no such mappings, all the mapping tables will be evicted and reinstalled on the next switch back to your application.
But I don't quite remember which architectures use what strategy; it could be that ia64 is free of such behavior.



Constantine Peresypkin

Mar 19, 2011, 12:50:47 PM
to dremel