Dockerizing Tachyon - Cannot complete a file without all the blocks committed

374 views
Skip to first unread message

ralph.s...@gmail.com

unread,
Apr 1, 2016, 5:37:28 AM4/1/16
to Alluxio Users
Hi, 

I've looked into dockerizing Alluxio and have come across an issue with writing a file into it.

Say I create a simple text file and copy it into Alluxio, i.e.

$ echo "Hello world" > a
$ ./alluxio fs copyFromLocal a /a

This will result in an exception:
alluxio.exception.BlockInfoException: Cannot complete a file without all the blocks committed

when using the docker container.

Actually the file gets created but it has 0 bytes in size, and in the FS browser the file's contents reads: "The requested file is not complete yet."

The same works if I run Alluxio bare metal, though.

I'm suspecting this is some networking issue, as I was partially able to reproduce this behavior on bare-metal by adding firewall rules to block port 29998 etc, but I didn't get the exception. Any ideas as to what might be the cause of this?


Regards,
Ralph

ralph.s...@gmail.com

unread,
Apr 1, 2016, 5:45:50 AM4/1/16
to Alluxio Users
Obviously I didn't take great care when posting this, so title should have read Dockerizing Alluxio and also I'm using version 1.0.1. Sorry :)

Gene Pang

unread,
Apr 1, 2016, 9:46:55 AM4/1/16
to Alluxio Users
Hi Ralph,

Thanks for this investigation!

Are you running just a single container, with all the Alluxio components? Also, could you take a look at the worker logs and the master logs?

Thanks,
Gene

Ralph Schlosser

unread,
Apr 1, 2016, 10:00:28 AM4/1/16
to Alluxio Users
Hey Gene, 

yeah it's a single container, basic configuration (essentially what you get after bootstrap-conf).

What confuses me is that, apparently I can create the file, just not write data to it.

Can you elaborate on what the ports do:

19999 - master web ui
19998 - ?
29999 - data port?
29998 - ?
30000 - worker web ui

master.log:
---
2016-04-01 09:29:27,367 ERROR logger.type (MetricsConfig.java:loadConfigFile) - Error loading metrics configuration file.
2016-04-01 09:29:27,374 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: Starting leader master.
2016-04-01 09:29:27,375 INFO  logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2016-04-01 09:29:27,379 INFO  logger.type (AbstractMaster.java:start) - BlockMaster: journal checkpoint does not exist, nothing to process.
2016-04-01 09:29:27,379 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: /alluxio-1.0.0/journal/BlockMaster/checkpoint.data.tmp
2016-04-01 09:29:27,379 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 0 Next journal sequence number: 1
2016-04-01 09:29:27,422 INFO  logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: /alluxio-1.0.0/journal/BlockMaster/checkpoint.data.tmp
2016-04-01 09:29:27,422 INFO  logger.type (JournalWriter.java:close) - Renamed checkpoint file /alluxio-1.0.0/journal/BlockMaster/checkpoint.data.tmp to /alluxio-1.0.0/journal/BlockMaster/checkpoint.data
2016-04-01 09:29:27,422 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2016-04-01 09:29:27,422 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2016-04-01 09:29:27,430 INFO  logger.type (JournalWriter.java:openCurrentLog) - Opened current log file: /alluxio-1.0.0/journal/BlockMaster/log.out
2016-04-01 09:29:27,434 INFO  logger.type (MountTable.java:add) - Mounting /alluxio-1.0.0/underFSStorage at /
2016-04-01 09:29:27,435 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: Starting leader master.
2016-04-01 09:29:27,435 INFO  logger.type (JournalWriter.java:completeAllLogs) - Marking all logs as complete.
2016-04-01 09:29:27,435 INFO  logger.type (AbstractMaster.java:start) - FileSystemMaster: journal checkpoint does not exist, nothing to process.
2016-04-01 09:29:27,435 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Creating tmp checkpoint file: /alluxio-1.0.0/journal/FileSystemMaster/checkpoint.data.tmp
2016-04-01 09:29:27,436 INFO  logger.type (JournalWriter.java:getCheckpointOutputStream) - Latest journal sequence number: 0 Next journal sequence number: 1
2016-04-01 09:29:27,443 INFO  logger.type (JournalWriter.java:close) - Successfully created tmp checkpoint file: /alluxio-1.0.0/journal/FileSystemMaster/checkpoint.data.tmp
2016-04-01 09:29:27,443 INFO  logger.type (JournalWriter.java:close) - Renamed checkpoint file /alluxio-1.0.0/journal/FileSystemMaster/checkpoint.data.tmp to /alluxio-1.0.0/journal/FileSystemMaster/checkpoint.data
2016-04-01 09:29:27,443 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Deleting all completed log files...
2016-04-01 09:29:27,443 INFO  logger.type (JournalWriter.java:deleteCompletedLogs) - Finished deleting all completed log files.
2016-04-01 09:29:27,578 INFO  server.Server (Server.java:doStart) - jetty-7.x.y-SNAPSHOT
2016-04-01 09:29:27,590 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.s.ServletContextHandler{/metrics/json,null}
2016-04-01 09:29:27,647 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.w.WebAppContext{/,file:/alluxio-1.0.0/core/server/src/main/webapp/},/alluxio-1.0.0/core/server/src/main/webapp
2016-04-01 09:29:27,666 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChann...@0.0.0.0:19999
2016-04-01 09:29:27,667 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Master Web service started @ 0.0.0.0/0.0.0.0:19999
2016-04-01 09:29:27,668 INFO  logger.type (AlluxioMaster.java:startServing) - Alluxio Master version 1.0.0 started @ localhost/127.0.0.1:19998 
2016-04-01 09:29:29,751 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=516a731fe2cc, rpcPort=29998, dataPort=29999, webPort=30000} id: 1
2016-04-01 09:29:29,765 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo( ID: 1, mWorkerAddress: WorkerNetAddress{host=516a731fe2cc, rpcPort=29998, dataPort=29999, webPort=30000}, TOTAL_BYTES: 1073741824, mUsedBytes: 0, mAvailableBytes: 1073741824, mLastUpdatedTimeMs: 1459502969764, mBlocks: [ ] )
2016-04-01 09:30:27,927 INFO  logger.type (JournalWriter.java:openCurrentLog) - Opened current log file: /alluxio-1.0.0/journal/FileSystemMaster/log.out
2016-04-01 09:33:20,643 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.0) is trying to connect with FileSystemMasterClient master @ localhost/127.0.0.1:19998
2016-04-01 09:33:20,647 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterClient master @ localhost/127.0.0.1:19998
---

worker.log:
---
2016-04-01 09:29:29,448 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker was created!
2016-04-01 09:29:29,473 ERROR logger.type (MetricsConfig.java:loadConfigFile) - Error loading metrics configuration file.
2016-04-01 09:29:29,622 INFO  server.Server (Server.java:doStart) - jetty-7.x.y-SNAPSHOT
2016-04-01 09:29:29,634 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.s.ServletContextHandler{/metrics/json,null}
2016-04-01 09:29:29,689 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.w.WebAppContext{/,file:/alluxio-1.0.0/core/server/src/main/webapp/},/alluxio-1.0.0/core/server/src/main/webapp
2016-04-01 09:29:29,713 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChann...@0.0.0.0:30000
2016-04-01 09:29:29,714 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Worker Web service started @ 0.0.0.0/0.0.0.0:30000
2016-04-01 09:29:29,715 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.0) is trying to connect with BlockMasterWorker master @ localhost/127.0.0.1:19998
2016-04-01 09:29:29,720 INFO  logger.type (AbstractClient.java:connect) - Client registered with BlockMasterWorker master @ localhost/127.0.0.1:19998
2016-04-01 09:29:29,770 INFO  logger.type (AlluxioWorker.java:start) - Started worker with id 1
2016-04-01 09:29:29,770 INFO  logger.type (AlluxioWorker.java:start) - Alluxio Worker version 1.0.0 started @ 516a731fe2cc/172.17.0.2:29998
2016-04-01 09:29:30,770 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.0) is trying to connect with FileSystemMasterWorker master @ localhost/127.0.0.1:19998
2016-04-01 09:29:30,771 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ localhost/127.0.0.1:19998
2016-04-01 09:29:30,771 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.0) is trying to connect with FileSystemMasterWorker master @ localhost/127.0.0.1:19998
2016-04-01 09:29:30,772 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterWorker master @ localhost/127.0.0.1:19998
---

--
You received this message because you are subscribed to a topic in the Google Groups "Alluxio Users" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/alluxio-users/lxb_44xLkGI/unsubscribe.
To unsubscribe from this group and all its topics, send an email to alluxio-user...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Jiří Šimša

unread,
Apr 1, 2016, 10:17:36 AM4/1/16
to Ralph Schlosser, Alluxio Users
Hi Ralph,

19998 is master RPC port
29998 is worker RPC port

(You can find details here).

--
You received this message because you are subscribed to the Google Groups "Alluxio Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to alluxio-user...@googlegroups.com.

For more options, visit https://groups.google.com/d/optout.



--
Jiří Šimša

Ralph Schlosser

unread,
Apr 1, 2016, 10:35:24 AM4/1/16
to Alluxio Users
Hi Jiří,

thanks for this, should have looked at the docs first!

In my case, both master and worker run within the confines of the same container, so I guess it's not actually necessary to expose 29998 to the outside, because they are able to connect internally (inside the conainer).

Anyway, that's what it looks like in the logs.

What other cause could trigger the exception I'm seeing?


---
if (!fileInode.isPersisted() && blockInfoList.size() != blockIdList.size()) {
throw new BlockInfoException("Cannot complete a file without all the blocks committed");
}
---

so I'm wondering what would cause the file node to not be persisted, or the block info size not be updated correctly...

Bizarrely, if I log into the running container, e.g.

---
sudo docker exec -ti <container> /bin/bash
---

and then run ./alluxio fs copyFromLocal a /a from inside the container it works!


Regards,
Ralph

Jiří Šimša

unread,
Apr 1, 2016, 10:58:08 AM4/1/16
to Ralph Schlosser, Alluxio Users
Hi Ralph,

I agree with you that it is surprising to see that the same command succeeds when you run it from within the container and fails otherwise. I am not deeply familiar with Docker, so I am not sure what is the difference between these two modes of running a command.

The process of creating a file in Alluxio works something like this:

- the alluxio client tells the alluxio master it would like to create a file, the initial file metadata is created, but the file is marked as incomplete
- the alluxio client then proceeds to write the data to alluxio workers (and possibly to the backend storage for certain write types such as CACHE_THROUGH)
- the alluxio client then tells alluxio master to complete the file

The fact that the completion fails suggests that the file data was not written successfully.

My best guess for the behavior you are seeing is that the client (i.e. your copyFromLocal command) cannot access the worker when not executed from inside the container. Is there user.log in the logs directory. If so, what is its content? Any client errors should be surface there. You could also add additional log statements into FileOutStream.java if needed.

Hope this helps.

Best,

Ralph Schlosser

unread,
Apr 1, 2016, 11:17:32 AM4/1/16
to Alluxio Users
Hi Jiří,

Thanks for your explanation that makes it a lot clearer. Actually I didn't realize there was a client log too! :-)

I think you are right about the client not being able to access the worker:
---
2016-04-01 10:30:28,387 ERROR logger.type (AlluxioShell.java:run) - Error running copyFromLocal a /a
java.io.IOException: alluxio.exception.BlockInfoException: Cannot complete a file without all the blocks committed
        at alluxio.client.file.FileOutStream.close(FileOutStream.java:173)
        at com.google.common.io.Closer.close(Closer.java:206)
        at alluxio.shell.command.CopyFromLocalCommand.copyPath(CopyFromLocalCommand.java:170)
        at alluxio.shell.command.CopyFromLocalCommand.copyFromLocal(CopyFromLocalCommand.java:135)
        at alluxio.shell.command.CopyFromLocalCommand.run(CopyFromLocalCommand.java:78)
        at alluxio.shell.AlluxioShell.run(AlluxioShell.java:182)
        at alluxio.shell.AlluxioShell.main(AlluxioShell.java:66)
Caused by: alluxio.exception.BlockInfoException: Cannot complete a file without all the blocks committed
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
        at alluxio.exception.AlluxioException.from(AlluxioException.java:72)
        at alluxio.AbstractClient.retryRPC(AbstractClient.java:324)
        at alluxio.client.file.FileSystemMasterClient.completeFile(FileSystemMasterClient.java:130)
        at alluxio.client.file.FileOutStream.close(FileOutStream.java:171)
        ... 6 more
---

Further digging in the code shows that retryRpc() above calls this:

---
mClient.completeFile(path.getPath(), options.toThrift());
---
alluxioClient --RPC--->alluxioWorker 

Jiří Šimša

unread,
Apr 1, 2016, 12:04:19 PM4/1/16
to Ralph Schlosser, Alluxio Users
Hi Ralph,

As you correctly pointed out, the exception you are seeing is the result of the completeFile RPC failing. However, this is a master RPC.

This exception is thrown because not all the blocks were committed. Are you using the default Alluxio settings (in particular, alluxio.user.file.writetype.default=MUST_CACHE)? You can check the setting by going to the Alluxio master web UI (running on port 19999) and navigating to the system configuration tab.

Is there anything else in the user.log? Could you add a log message into LocalBlockOutStream#close() and RemoteBlockOutStream#close() to check which one (if any) is called. This method is called when a block written by a client is committed.

Best,

Agam Jain

unread,
May 4, 2016, 5:55:16 AM5/4/16
to Alluxio Users, ralph.s...@gmail.com
Hello,
i am facing a very similar problem and have not been able to fix it for a while now.
we are using hortonworks 2.2.4.2 - spark 1.2. and alluxio 1.0.1

we have a spark job which has a saveAsTextFile("alluxio://host:port/path"). there is some output from the job being written to some workers (as you can see in the attached screenshot) but it is crashes giving this output : http://pastebin.com/ZZPbr8a9

interesting thing is, the spark job when run in local mode runs file, but when run in yarn-cluster it fails.
2016-04-01 09:29:27,666 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChannelConnector@0.0.0.0:19999
2016-04-01 09:29:27,667 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Master Web service started @ 0.0.0.0/0.0.0.0:19999
2016-04-01 09:29:27,668 INFO  logger.type (AlluxioMaster.java:startServing) - Alluxio Master version 1.0.0 started @ localhost/127.0.0.1:19998 
2016-04-01 09:29:29,751 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=516a731fe2cc, rpcPort=29998, dataPort=29999, webPort=30000} id: 1
2016-04-01 09:29:29,765 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo( ID: 1, mWorkerAddress: WorkerNetAddress{host=516a731fe2cc, rpcPort=29998, dataPort=29999, webPort=30000}, TOTAL_BYTES: 1073741824, mUsedBytes: 0, mAvailableBytes: 1073741824, mLastUpdatedTimeMs: 1459502969764, mBlocks: [ ] )
2016-04-01 09:30:27,927 INFO  logger.type (JournalWriter.java:openCurrentLog) - Opened current log file: /alluxio-1.0.0/journal/FileSystemMaster/log.out
2016-04-01 09:33:20,643 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.0) is trying to connect with FileSystemMasterClient master @ localhost/127.0.0.1:19998
2016-04-01 09:33:20,647 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterClient master @ localhost/127.0.0.1:19998
---

worker.log:
---
2016-04-01 09:29:29,448 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker was created!
2016-04-01 09:29:29,473 ERROR logger.type (MetricsConfig.java:loadConfigFile) - Error loading metrics configuration file.
2016-04-01 09:29:29,622 INFO  server.Server (Server.java:doStart) - jetty-7.x.y-SNAPSHOT
2016-04-01 09:29:29,634 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.s.ServletContextHandler{/metrics/json,null}
2016-04-01 09:29:29,689 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.w.WebAppContext{/,file:/alluxio-1.0.0/core/server/src/main/webapp/},/alluxio-1.0.0/core/server/src/main/webapp
2016-04-01 09:29:29,713 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChannelConnector@0.0.0.0:30000
Screenshot from 2016-05-04 15:18:25.png

Ralph Schlosser

unread,
May 4, 2016, 7:27:46 AM5/4/16
to Alluxio Users
Hi, 

so you're running Alluxio via YARN?

Unfortunately I haven't been able to investigate my docker issues any further & put in the log statements Jiří suggested, perhaps give that a try?

Just in case someone wants to take a closer look and get to the bottom of this, I'm sharing my docker files:


There is a chance that I'll get back to this later this month.


Regards,
Ralph

2016-04-01 09:29:27,666 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChann...@0.0.0.0:19999
2016-04-01 09:29:27,667 INFO  logger.type (UIWebServer.java:startWebServer) - Alluxio Master Web service started @ 0.0.0.0/0.0.0.0:19999
2016-04-01 09:29:27,668 INFO  logger.type (AlluxioMaster.java:startServing) - Alluxio Master version 1.0.0 started @ localhost/127.0.0.1:19998 
2016-04-01 09:29:29,751 INFO  logger.type (BlockMaster.java:getWorkerId) - getWorkerId(): WorkerNetAddress: WorkerNetAddress{host=516a731fe2cc, rpcPort=29998, dataPort=29999, webPort=30000} id: 1
2016-04-01 09:29:29,765 INFO  logger.type (BlockMaster.java:workerRegister) - registerWorker(): MasterWorkerInfo( ID: 1, mWorkerAddress: WorkerNetAddress{host=516a731fe2cc, rpcPort=29998, dataPort=29999, webPort=30000}, TOTAL_BYTES: 1073741824, mUsedBytes: 0, mAvailableBytes: 1073741824, mLastUpdatedTimeMs: 1459502969764, mBlocks: [ ] )
2016-04-01 09:30:27,927 INFO  logger.type (JournalWriter.java:openCurrentLog) - Opened current log file: /alluxio-1.0.0/journal/FileSystemMaster/log.out
2016-04-01 09:33:20,643 INFO  logger.type (AbstractClient.java:connect) - Alluxio client (version 1.0.0) is trying to connect with FileSystemMasterClient master @ localhost/127.0.0.1:19998
2016-04-01 09:33:20,647 INFO  logger.type (AbstractClient.java:connect) - Client registered with FileSystemMasterClient master @ localhost/127.0.0.1:19998
---

worker.log:
---
2016-04-01 09:29:29,448 INFO  logger.type (FileUtils.java:createStorageDirPath) - Folder /mnt/ramdisk/alluxioworker was created!
2016-04-01 09:29:29,473 ERROR logger.type (MetricsConfig.java:loadConfigFile) - Error loading metrics configuration file.
2016-04-01 09:29:29,622 INFO  server.Server (Server.java:doStart) - jetty-7.x.y-SNAPSHOT
2016-04-01 09:29:29,634 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.s.ServletContextHandler{/metrics/json,null}
2016-04-01 09:29:29,689 INFO  handler.ContextHandler (ContextHandler.java:startContext) - started o.e.j.w.WebAppContext{/,file:/alluxio-1.0.0/core/server/src/main/webapp/},/alluxio-1.0.0/core/server/src/main/webapp
2016-04-01 09:29:29,713 INFO  server.AbstractConnector (AbstractConnector.java:doStart) - Started SelectChann...@0.0.0.0:30000
---

William Callaghan

unread,
Aug 31, 2016, 9:53:17 AM8/31/16
to Alluxio Users
Ralph,

I also ran into networking issues such as yours when first experimenting with dockerizing Alluxio. I found the following fixed the issue:
docker run -dit --net=host --cap-add=SYS_ADMIN --security-opt apparmor:unconfined <DOCKER_IMAGE> /bin/bash

Here is a link describing some of the capabilities that Docker removes by default (hence we add back SYS_ADMIN): https://opensource.com/business/14/9/security-for-docker

Jiří Šimša

unread,
Nov 18, 2016, 12:59:26 PM11/18/16
to Alluxio Users
William, thank you for sharing your fix.

Ralph, did William's fix work for you?

Best,
Reply all
Reply to author
Forward
0 new messages