Debugging Alluxio fails to request bytes for block (during Spark Terasort)?


Tim B

Jun 2, 2016, 3:03:52 PM
to Alluxio Users
Hi,

I'm trying to run terasort using Spark (master) with Alluxio and HDFS as the UFS. It works fine up until about a 100GB dataset. At that point terasort stops processing during stage 4 of the sort with these types of messages:

WARN TaskSetManager: Lost task 86.1 in stage 4.0 (TID 771, 10.1.20.201): java.io.IOException: Failed to cache: Unable to request space from worker
at alluxio.client.file.FileOutStream.handleCacheWriteException(FileOutStream.java:337)
at alluxio.client.file.FileOutStream.write(FileOutStream.java:293)

 ERROR logger.type (BlockWorkerClientServiceHandler.java:requestSpace) - Failed to request 1048500 bytes for block: 136650424320
alluxio.exception.BlockDoesNotExistException: TempBlockMeta not found for blockId 136,650,424,320
at alluxio.worker.block.BlockMetadataManager.getTempBlockMeta(BlockMetadataManager.java:264)
at alluxio.worker.block.TieredBlockStore.requestSpaceInternal(TieredBlockStore.java:573)
at alluxio.worker.block.TieredBlockStore.requestSpace(TieredBlockStore.java:250)
at alluxio.worker.block.BlockWorker.requestSpace(BlockWorker.java:557)

2016-06-02 11:56:22,493 ERROR logger.type (FileSystemMaster.java:loadMetadataIfNotExistAndJournal) - Failed to load metadata for path: /SparkBench/Terasort/Output/_temporary/0/task_201606021150_0004_r_000085

These errors seem to indicate that Alluxio doesn't have space, except for the "Failed to load metadata" part. However, I set ALLUXIO_WORKER_MEMORY_SIZE to 400GB (512GB physical memory) and only 100GB is being used (from teragen). So I'm not sure why Alluxio would say it can't find space.
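
For reference, a worker size like this is typically set in conf/alluxio-env.sh along the lines of the sketch below; the HDFS namenode address is a placeholder, and the ALLUXIO_UNDERFS_ADDRESS line is an assumed setting rather than something quoted from this setup:

 export ALLUXIO_WORKER_MEMORY_SIZE=400GB
 # assumed UFS setting; point it at your HDFS namenode
 export ALLUXIO_UNDERFS_ADDRESS=hdfs://namenode:9000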

Questions:
* Could I be missing an important configuration parameter in Alluxio? It seems Alluxio should be able to request that amount of space, given there is still 300+GB of Alluxio ramdisk available when the failures start occurring.

* Do you have any suggestions for how to start debugging/understanding why this problem is happening?


- Which version of Alluxio are you using?
I'm using master (pulled 6/1)
- Are you running with tiered storage? What is your configuration?
I'm only using memory.
- What is your OS version?
Ubuntu 14.04
- What is your Java version?
Oracle Java 8

Thanks,
Tim

Gene Pang

Jun 2, 2016, 4:18:30 PM
to Alluxio Users
Hi Tim,

How many machines are in your cluster? Are the spark tasks co-located with the Alluxio workers? Also, while the jobs are running, if you take a look at the master web UI (on port 19999), what does the status page for the workers look like?

Thanks,
Gene

Tim B

Jun 2, 2016, 6:20:49 PM
to Alluxio Users

On Thursday, June 2, 2016 at 1:18:30 PM UTC-7, Gene Pang wrote:
Hi Tim,

How many machines are in your cluster?

Just one node, with 512GB of physical memory, with ALLUXIO_WORKER_MEMORY_SIZE set to 400g
 
Are the spark tasks co-located with the Alluxio workers?

Yes, the Spark master/worker and the Alluxio master/worker are co-located.
 
Also, while the jobs are running, if you take a look at the master web UI (on port 19999), what does the status page for the workers look like?

 
104GB of the worker's 400GB was in use, leaving it about 74% free. I didn't see anything obvious in the web UI.

Since I can reproduce this pretty consistently, I'm going to try digging into the code and see if I can understand why this happens.

Thanks,
Tim

Gene Pang

Jun 3, 2016, 11:34:51 AM
to Alluxio Users
Hi Tim,

Do you know how much memory on the machine is being used at the time the exceptions occur? How much memory is free according to "free -m" or "vmstat -s -S M"?
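
(For example, run these on the worker node while the job is in the failing stage:)

 free -m
 vmstat -s -S M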

Thanks,
Gene

Tim B

Jun 8, 2016, 2:49:17 PM
to Alluxio Users
Hi Gene,

Sorry for the late reply. Yes, roughly 30% (~150GB) of the memory is free when the exceptions occur.

I'm not sure if it's relevant, but Spark is using a lot of temp-dir space. At the time of the exception it has used 80G. This seems odd to me (I'm still trying to get up to speed on Spark too), because the memory parameters I've set for Spark are:

 spark.driver.memory              300g
 spark.executor.memory            300g
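
(For what it's worth, Spark writes shuffle and spill files to local disk regardless of executor memory, so heavy temp-dir usage during the sort stage is expected. The location is controlled by spark.local.dir; the line below is only a sketch with a placeholder path, not a value taken from this setup.)

 spark.local.dir                  /data/spark-tmp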


I'm still trying to understand why we are hitting this issue, but grokking the code is taking a bit of time for me.

Thanks,
Tim

Bin Fan

Jun 8, 2016, 3:59:49 PM
to Tim B, Alluxio Users
Hi Tim and Gene,

I think the error message "Failed to load metadata for path: /SparkBench/Terasort/Output/_temporary/0/task_201606021150_0004_r_000085" is unrelated to insufficient space (it looks scary, but it typically just means the file does not already exist in the under file system, HDFS in your case).

The real exception, "TempBlockMeta not found for blockId", is probably due to a timeout setting (which is set a bit low in Alluxio 1.0).

Tim, could you try out Alluxio 1.1 (http://www.alluxio.org/download/), which was just released a few days ago? I believe we have fixed a bunch of issues like this.

- Bin

Tim B

Jun 13, 2016, 12:37:14 AM
to Alluxio Users, bis...@gmail.com
Thanks!

I tried adjusting different timeout values. If I increase either of two timeout parameters by 10x, my terasort app finishes without any errors. I'm sure there are still inefficiencies in my stack that I need to fix, but I just wanted to say thanks for the help.
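
(The specific properties aren't named above. Purely as an illustration, worker-side timeouts of this kind live in conf/alluxio-site.properties and would be raised roughly as below; the property name and value are assumptions about a typical Alluxio 1.x setup, not necessarily what was changed here.)

 # Assumed example: how long the worker keeps a client's session (and its
 # TempBlockMeta) alive without hearing a heartbeat; shown at ~10x a common default.
 alluxio.worker.session.timeout.ms=100000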

Tim

Gene Pang

Jun 14, 2016, 10:53:04 AM
to Alluxio Users, bis...@gmail.com
Thanks for providing your solution!

Also, for which version did you change those configuration values?

Thanks,
Gene

Tim B

Jun 14, 2016, 12:34:29 PM
to Alluxio Users, bis...@gmail.com
This was on master after 1.1.0 (pulled Jun 8th in the afternoon PST).

Bin Fan

Jun 14, 2016, 1:40:45 PM
to Tim B, Alluxio Users
Glad terasort works for you now.

- Bin
