Hello,
I am new to the Alluxio framework and would like some feedback if
possible. I've gone through the documentation but haven't formed a
clear picture yet.
Using: Alluxio 1.0.0 + HDFS 2.7.2 + Spark 1.6.1
I'm having trouble understanding how Alluxio works with HDFS when
intermediate files are larger than the available Alluxio RAM and
blocks get evicted.
Suppose I am running a local installation of HDFS + Alluxio (1 GB on the worker) + Spark.
If I load a 1.5 GB file into Alluxio from HDFS (using the loadMetadata
call), do some computation on it (in my case, converting a TPC-H .tbl
table into Parquet format), and try to store the result in Alluxio with
a spark-submit job, it crashes with missing-block errors. I gather this
is because everything cannot fit in memory and some older blocks are evicted.
If I do the same as above but store the result in HDFS instead, it works:
Alluxio reads the 1.5 GB through its 1 GB of RAM block by block, and the
result is eventually written to HDFS.
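To make the setup concrete, here is a simplified sketch of the kind of
Spark 1.6 (Scala) job I'm running; the schema, column indices and the
alluxio://localhost:19998 paths are only illustrative:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object TblToParquet {
  // Only a few of the lineitem columns, to keep the sketch short.
  case class LineItem(orderKey: Long, partKey: Long, quantity: Double)

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("TblToParquet")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Pipe-delimited TPC-H table, already visible in Alluxio via loadMetadata,
    // read through the alluxio:// scheme (Alluxio client jar on the classpath).
    val lineItems = sc.textFile("alluxio://localhost:19998/tpch/lineitem.tbl")
      .map(_.split('|'))
      .map(f => LineItem(f(0).toLong, f(1).toLong, f(4).toDouble))
      .toDF()

    // Writing the Parquet result back to Alluxio is the step that fails with
    // missing blocks; pointing this path at hdfs://... instead works.
    lineItems.write.parquet("alluxio://localhost:19998/tpch/lineitem.parquet")

    sc.stop()
  }
}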
For the above, is there some way to write the result to Alluxio? For
example, some way for the older blocks to be temporarily written to HDFS
and then reloaded into memory?
Thanks very much for your help! I have gone through the site documentation, so any external links explaining this would be greatly appreciated.
I will continue running tests to see how Alluxio behaves in practice, for
example when joining two large tables that don't both fit in memory and
whose result doesn't fit either.
Two notes:
I have read about WriteType (CACHE_THROUGH, MUST_CACHE, ...), but I am
more interested in what happens when the intermediate results don't fit
in memory. (I've put a rough sketch of how I would set this after these notes.)
I have read about tiered storage, but the same problem simply scales
with larger input data: eventually no tier has space available and the
data would need to be saved to HDFS again.
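For completeness, this is roughly how I would try to force CACHE_THROUGH
from the Spark side, by passing the Alluxio client property as a JVM
option; I took the property name from the Alluxio documentation, so I'm
not certain this is exactly the right way to do it for 1.0.0:

import org.apache.spark.{SparkConf, SparkContext}

object CacheThroughExample {
  def main(args: Array[String]): Unit = {
    // Ask the Alluxio client on the executors to use CACHE_THROUGH, so every
    // write goes to both Alluxio memory and HDFS instead of memory only
    // (MUST_CACHE). Property name taken from the Alluxio docs.
    val conf = new SparkConf()
      .setAppName("TblToParquetCacheThrough")
      .set("spark.executor.extraJavaOptions",
        "-Dalluxio.user.file.writetype.default=CACHE_THROUGH")
    // The driver-side equivalent is normally passed on the spark-submit command
    // line: --conf "spark.driver.extraJavaOptions=-Dalluxio.user.file.writetype.default=CACHE_THROUGH"
    val sc = new SparkContext(conf)
    // ... same conversion job as above ...
    sc.stop()
  }
}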