How does Alluxio react when out of memory? (Alluxio/HDFS/Spark)


Alexandros Sympetheros

Mar 29, 2016, 1:17:46 PM
to Alluxio Users
Hello,
I am new to the Alluxio framework and would like some feedback if possible. I've gone through the documentation but haven't formed a clear picture yet.

Using: Alluxio 1.0.0 + HDFS 2.7.2 + Spark 1.6.1

I'm having trouble understanding how Alluxio works with HDFS when intermediate files are larger than the available Alluxio RAM and blocks get evicted.

Suppose I am running a local installation of HDFS + Alluxio (1 GB on the worker) + Spark.

If I load a 1.5 GB file into Alluxio from HDFS (using the loadMetadata call), do some computation on the file (in my case I converted a TPC-H .tbl table into Parquet format), and try to store the result on Alluxio with a spark-submit job, it crashes with missing blocks. I gather this is because everything cannot fit in memory, so some older blocks are evicted. A sketch of the job is below.
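Here is a minimal sketch of the kind of job I mean, against Spark 1.6 and Alluxio 1.0 (it assumes the Alluxio client jar is on the classpath; the master address, paths, and the three-column schema are placeholders rather than my actual setup):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TblToParquet {
  // Placeholder three-column schema; the real TPC-H tables have more columns.
  case class Rec(c0: String, c1: String, c2: String)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("tbl-to-parquet"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Read the 1.5 GB .tbl file through Alluxio (backed by HDFS).
    // TPC-H .tbl files are pipe-delimited.
    val df = sc.textFile("alluxio://localhost:19998/tpch/lineitem.tbl")
      .map(_.split('|'))
      .map(a => Rec(a(0), a(1), a(2)))
      .toDF()

    // Writing the result back to Alluxio is the step that fails with
    // missing blocks once the worker's 1 GB store is full.
    df.write.parquet("alluxio://localhost:19998/tpch/lineitem.parquet")

    sc.stop()
  }
}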

If I do the same as above but store the result in HDFS instead, it works: Alluxio reads the 1.5 GB through its 1 GB of RAM in blocks, and the result is eventually written to HDFS.

For the above, is there some way to write the result to Alluxio? For example, some way for the older blocks to be temporarily written to HDFS and then reloaded into RAM?


Thanks very much for your help! I have gone through the site documentation, so any external links explaining this would be greatly appreciated.

I will continue running tests to see how Alluxio behaves in practice, for example when doing a join on large tables where neither table fits in memory and the result doesn't fit either.

Two notes:

I have read about WriteType (CACHE_THROUGH, MUST_CACHE, ...), but I'm more interested in what happens when the intermediate results don't fit in memory (see the configuration sketch after these notes).

I have read about tiered storage, but the same problem just scales up: with large enough input data, no tier has space and the data has to be saved to HDFS again.
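For the WriteType note, this is the kind of configuration I mean, assuming the standard alluxio.user.file.writetype.default client property applies here:

// A sketch assuming the alluxio.user.file.writetype.default property;
// it must be set before the Alluxio client is first initialized in the
// JVM. On a cluster it would normally go into alluxio-site.properties
// or the executors' -D JVM options rather than into code.
System.setProperty("alluxio.user.file.writetype.default", "CACHE_THROUGH")

As far as I understand, CACHE_THROUGH persists data to HDFS as it is written, but my question is really about the data not fitting in the worker's memory in the first place.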

Yupeng Fu

Mar 30, 2016, 2:30:03 AM
to Alexandros Sympetheros, Alluxio Users
Hi Alexandros,

At the moment, Alluxio needs the file to fit fully in memory. This is because, for the sake of correctness, Alluxio has to hold the locks on all of a file's blocks until the file is fully written to HDFS.

However, one workaround is to split the file into logical parts yourself; in your case, that means splitting your Spark end result into multiple partitions (a sketch follows). In fact, doing this may also speed up your job, since the parts can be processed on multiple machines in parallel.
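For example, reusing the df from the sketch in your first message (the partition count and output path are only illustrative):

// Rough sketch of the workaround: repartition the end result so that
// each output part file stays well under the worker's 1 GB store.
val partitioned = df.repartition(8)
partitioned.write.parquet("alluxio://localhost:19998/tpch/lineitem.parquet")

Each partition is written as its own file under the output path, so no single file has to hold all the blocks at once.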

We'll also explore a feature in the future that optionally stores a file in parts, one block per part.

Hope this helps.

Cheers,

