I'm trying to run TeraSort on Spark (master branch) with Alluxio, using HDFS as the UFS. It works fine up to roughly a 100GB dataset. Beyond that, TeraSort stops processing during stage 4 of the sort with these types of messages:
WARN TaskSetManager: Lost task 86.1 in stage 4.0 (TID 771, 10.1.20.201): java.io.IOException: Failed to cache: Unable to request space from worker
at alluxio.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:337)
at alluxio.client.file.FileOutStream.write(FileOutStream.java:293)
ERROR logger.type (BlockWorkerClientServiceHandler.java:requestSpace) - Failed to request 1048500 bytes for block: 136650424320
alluxio.exception.BlockDoesNotExistException: TempBlockMeta not found for blockId 136,650,424,320
at alluxio.worker.block.BlockMetadataManager.getTempBlockMeta(BlockMetadataManager.java:264)
at alluxio.worker.block.TieredBlockStore.requestSpaceInternal(TieredBlockStore.java:573)
at alluxio.worker.block.TieredBlockStore.requestSpace(TieredBlockStore.java:250)
at alluxio.worker.block.BlockWorker.requestSpace(BlockWorker.java:557)
2016-06-02 11:56:22,493 ERROR logger.type (FileSystemMaster.java:loadMetadataIfNotExistAndJournal) - Failed to load metadata for path: /SparkBench/Terasort/Output/_temporary/0/task_201606021150_0004_r_000085
These errors seem to indicate that Alluxio is out of space (the "Failed to load metadata" part aside). However, I set ALLUXIO_WORKER_MEMORY_SIZE to 400GB (the machine has 512GB of physical memory), and only 100GB is in use (from TeraGen). So I'm not sure why Alluxio would say it can't find space.
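For reference, the relevant line in my conf/alluxio-env.sh looks like the sketch below (other settings omitted; the surrounding values are placeholders, not my full config):

```shell
# conf/alluxio-env.sh (excerpt, illustrative)
# Worker ramdisk size -- set to 400GB out of 512GB physical memory.
export ALLUXIO_WORKER_MEMORY_SIZE=400GB
```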
Questions:
* Could I be missing an important Alluxio configuration parameter? It seems Alluxio should be able to grant that amount of space, given there is still 300+GB of Alluxio ramdisk available when the failures start occurring.
* Do you have any suggestions for how to start debugging/understanding why this problem is happening?
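For context, the only triage I've done so far is tallying the "Failed to request" errors per block id in the worker log, to see whether the same blocks fail repeatedly. A sketch of that (the log path and sample lines below are placeholders, not real output from my cluster):

```shell
# Hypothetical sketch: count "Failed to request" errors per block id.
# worker.log.sample stands in for logs/worker.log on a real worker.
log=worker.log.sample
cat > "$log" <<'EOF'
ERROR logger.type (BlockWorkerClientServiceHandler.java:requestSpace) - Failed to request 1048500 bytes for block: 136650424320
ERROR logger.type (BlockWorkerClientServiceHandler.java:requestSpace) - Failed to request 1048500 bytes for block: 136650424320
ERROR logger.type (BlockWorkerClientServiceHandler.java:requestSpace) - Failed to request 524250 bytes for block: 136650424321
EOF
# The block id is the last field of each matching line; tally occurrences.
grep 'Failed to request' "$log" | awk '{print $NF}' | sort | uniq -c | sort -rn
rm -f "$log"
```

In my case the same handful of block ids show up over and over, which is part of why the "TempBlockMeta not found" message confuses me.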