Dockerizing Anduril components


Christian Frech

Mar 30, 2015, 5:15:47 AM
to andur...@googlegroups.com
I am toying around with Anduril and Docker and was wondering if the Anduril workflow engine should provide some additional functionality to allow running components inside containers.

Here is the idea:

  • When instantiating components, there is an additional annotation named "@docker" that is assigned the name of a docker image, inside which the component should execute (e.g. "anduril/HTSeqBam2Count"). This could maybe default to "anduril/<component-name>" or "anduril/<bundle-name>".
  • If the @docker annotation is present, the Anduril engine prefixes the execution command with a user-defined docker string followed by the value of @docker (i.e. "<docker_prefix> <image_name> <command>").
  • The docker prefix can be provided either globally via the 'anduril run' command line parameter, or made host-specific via hosts.conf.
  • Docker prefixes can be used in both local and remote execution mode. In remote execution mode, the docker prefix is sandwiched between the SSH call and the execution command, e.g. "ssh myserver docker run anduril/HTSeqBam2Count <command>". (This actually works; I have run my workflows with it.)
  • Docker images for components will be provided by component developers and made publicly available via Docker Hub. With Docker Hub, docker images get installed "automagically" on execution hosts if not present.
  • Not every component needs to have its own docker image. A component can run in "classic mode" (without using docker at all), inside a generic docker image provided by Anduril (e.g. bundle-specific images), or inside its own component-specific image provided by the component developer. Note that the granularity of the association between component and image is completely up to the component and workflow developers.
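
The command assembly described in the list above can be sketched in a few lines of shell. This is only an illustration of the "<docker_prefix> <image_name> <command>" idea, not Anduril code: the function name build_command and its argument order are made up for this example, and the function prints the resulting command line instead of running it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Anduril): print the command line that
# would be executed for one component instance.
#   $1 = docker prefix (e.g. "docker run"), empty for "classic mode"
#   $2 = value of the proposed @docker annotation, may be empty
#   $3 = SSH call for remote execution (e.g. "ssh myserver"), may be empty
#   remaining arguments = the component's execution command
build_command() {
    docker_prefix=$1
    image=$2
    ssh_call=$3
    shift 3
    cmd="$*"
    # Only prefix the command if both a prefix and an image were given.
    if [ -n "$docker_prefix" ] && [ -n "$image" ]; then
        cmd="$docker_prefix $image $cmd"
    fi
    # In remote mode the docker prefix sits between the SSH call and the command.
    if [ -n "$ssh_call" ]; then
        cmd="$ssh_call $cmd"
    fi
    printf '%s\n' "$cmd"
}

# Reproduces the remote example from the list above.
build_command "docker run" "anduril/HTSeqBam2Count" "ssh myserver" "<command>"
```

Leaving the image empty yields the plain command, which corresponds to "classic mode".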
This solution will provide several advantages over the status quo:
  • Ease of installation. No need for the user to install any third-party software on any of their execution hosts (except docker of course). Docker images are automatically downloaded and fired-up upon execution of the workflow. Deploying an existing pipeline on a user's system will therefore become very easy.
  • Guaranteed execution. Because a component will be tested with a specific version of a docker image, its proper execution on the user's system can be guaranteed by the developer.
  • Version control. Components can no longer fail because the wrong version of a third-party software is installed. Also, version conflicts can no longer occur. For example, if component A needs python2 and component B needs python3, they can still run side-by-side. Thus, components become completely independent of each other, at least with respect to their execution environment.
  • Increased reproducibility. Because, with the above solution, a workflow now defines not only the execution steps but also its execution environment (via the @docker tags), it is guaranteed that re-execution of a workflow somewhere else or later in time yields exactly the same results.
  • Control over resource allocation. Via docker prefixes a sysadmin can precisely control how many resources (CPU, memory) are granted to Anduril on an execution host. Sysadmins could therefore be less reluctant to provide computing resources to Anduril deployments.
  • Slim base system. Installation of Anduril itself can be kept very slim, because software required to execute workflows (e.g. LaTeX, R packages) comes only with the components that actually need it.
  • Easier cloud deployments. Because of the above, elastic cloud deployments of Anduril workflows will presumably become easier. But I admit that I have not completely thought this through yet.
To be clear, dockerized execution can already be accomplished with the current functionality of Anduril, for example by providing docker prefixes via 'prefix' execution mode or via customized RemoteExecute strings (I actually tried both and it works). However, the main functionality currently missing in Anduril (at least I could not find it) is component-specific command prefixes. Without them, there is no elegant way to specify which component should run inside which execution environment (i.e. which docker image).

Marko suggested that one could think of 'hijacking' the @host annotation feature for this, i.e. we point @host to a virtual Docker computing host that defines its custom RemoteExecute command to fire up docker. The main problem with this approach is that Anduril miscounts the total available computing slots, because it is unaware that multiple virtual Docker hosts share the same physical host. Thus, physical hosts might get overallocated with jobs by the Anduril engine.

I am fairly new to Anduril, so it is quite possible that the solution proposed above is problematic, or that I am overlooking the obvious and all of this is already possible within Anduril's current feature set.

Anyways, I am open to suggestions and I would be happy to see a lively discussion on this topic.

Ville Rantanen

Mar 30, 2015, 5:27:30 AM
to andur...@googlegroups.com
Hi,  

thank you for the feedback.

We are now in the late stages of polishing the upcoming Anduril 2.0 version, which is already runnable, but there are still some points we want to clear up and do better.

In Anduril2, there is an annotation named "userDefined" which could be used for this purpose, combined with a prefix script.
In addition, all of the annotations are written to the _command file. That means that every component knows which annotations it was given.
For example, I have started to apply the "CPU" annotation in components that can multithread internally.

We do not expect the Anduril1 branch to receive this functionality, as we are trying to make the Anduril2 branch the working default.

Your ideas are exciting, and I will bring them to the other developers.

Ville Rantanen

Mar 30, 2015, 9:31:23 AM
to andur...@googlegroups.com
Just to continue a bit: in Anduril2, you can have a prefix script that reads all the annotations. Here's an example of how to do it:

#!/bin/bash

# read _command file:
for (( i=1; i<=$#; i++ ))
do  if [[ "${!i}" == */_command ]]
    then export M_CPU=$( grep ^metadata.cpu= "${!i}" | sed s,^metadata.cpu=,, )
         export M_MEMORY=$( grep ^metadata.memory= "${!i}" | sed s,^metadata.memory=,, )
         export M_USERDEFINED=$( grep ^metadata.userDefined= "${!i}" | sed s,^metadata.userDefined=,, )
    fi
done

# establish defaults, if needed:
[[ -z "$M_CPU" ]] && export M_CPU=0
[[ -z "$M_MEMORY" ]] && export M_MEMORY=0

# Do something with the values:
# ...

# Run the component: ( R-components get their code through stdin, hence the 'cat -' )
cat - | "$@"
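
As one possible way to fill in the "Do something with the values" step, the annotation values could be turned into resource options for docker run. The sketch below rests on assumptions the thread does not confirm: that M_CPU is a number of cores, that M_MEMORY is given in megabytes, and that 0 means "no limit requested". The function name docker_limits is made up; --cpus and --memory are regular docker run options, although --cpus only exists in Docker releases newer than those available when this thread was written.

```shell
#!/bin/sh
# Sketch only: translate annotation values into `docker run` resource options.
# Assumptions: $1 = number of cores, $2 = memory in megabytes, 0 = no limit.
docker_limits() {
    cpu=${1:-0}
    memory=${2:-0}
    opts=""
    if [ "$cpu" != "0" ]; then
        opts="$opts --cpus=$cpu"
    fi
    if [ "$memory" != "0" ]; then
        opts="$opts --memory=${memory}m"
    fi
    # Strip the leading space, if any, and print the option string.
    printf '%s\n' "${opts# }"
}

# Uses the variables exported by the prefix script, defaulting to "no limit".
docker_limits "${M_CPU:-0}" "${M_MEMORY:-0}"
```

The printed options would go between "docker run" and the image name in whatever docker command the prefix script builds.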



Christian Frech

Mar 31, 2015, 5:03:53 PM
to andur...@googlegroups.com
That looks useful. To make it more generic, maybe Anduril could allow arbitrary annotation names (i.e. not only CPU, MEMORY, USER_DEFINED), which would then be accessible via the command file? The drawback would be that typos in annotation names could no longer be detected by the parser. Maybe an even better solution would be to define the allowed annotation names somewhere, so that they can then be used in command calls.

Ville Rantanen

Apr 1, 2015, 1:45:01 AM
to andur...@googlegroups.com
We'll see about that. Note that with userDefined you can of course pass a parameter list such as "par1=value1, par2=value2". The bash function library in Anduril has a function (stringtomap) that parses such a list.
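
To make the list format concrete, here is a small stand-in parser. It is not Anduril's stringtomap, whose implementation is not shown in this thread; the function name lookup_key is invented for the example, and it only looks up a single key in a comma-separated "key=value" list.

```shell
#!/bin/sh
# Stand-in for illustration; this is NOT Anduril's stringtomap.
# Prints the value stored under key $2 in a "k1=v1, k2=v2" list given as $1.
# The body runs in a subshell, so the IFS and globbing changes do not leak.
lookup_key() (
    list=$1
    wanted=$2
    IFS=','
    set -f                 # do not glob-expand the pieces
    set -- $list           # split the list at commas
    for pair in "$@"; do
        # Drop blanks in front of the key ("a=1, b=2" style lists).
        pair=${pair#"${pair%%[! ]*}"}
        key=${pair%%=*}
        value=${pair#*=}
        if [ "$key" = "$wanted" ]; then
            printf '%s\n' "$value"
        fi
    done
)

lookup_key "par1=value1, par2=value2" par2
```

Values that themselves contain commas are not supported by this sketch.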

Also, as in the previous example, a prefix script can read the inputs, outputs, and parameters of the component instance. If you are developing a new component that uses docker, you can of course pass the image name as a parameter.

The greatest difference between using parameters and annotations is that changing an annotation will not trigger a re-execution, whereas changing a parameter will.



Ville Rantanen

Apr 2, 2015, 2:49:58 AM
to andur...@googlegroups.com

A reminder: userDefined is only present in the Anduril 2.x branch, which is still in beta. Anduril 2.x uses Scala as the language to build the network, so the syntax for annotations had to change.

In an Anduril Scala script you would use the annotation like this:

import anduril.tools._

object myDockerizedPipeline {
    val seed = Randomizer(columns=1, rows=5, distribution="normal", mean=0)
    seed._userDefined="docker=anduril/some_folder,var2=somevalue"
}


The prefix script looks like this (note that I'm not actually running docker here, just making the variable $docker available for use):
#!/bin/bash
. $ANDURIL_HOME/lang/bash/generic.sh
for (( i=1; i<=$#; i++ ))
do  if [[ "${!i}" == */_command ]]
    then export M_USERDEFINED=$( grep ^metadata.userDefined= "${!i}" | sed s,^metadata.userDefined=,, )
    fi
done
stringtomap "$M_USERDEFINED"
echo Docker: $docker
echo var2: $var2
cat - |  "$@"



The output looks like this:
Warning: unconnected outports at: seed
Executing seed (anduril.tools.Randomizer) (run.6534.script:12)
[STDOUT seed] Docker: anduril/some_folder
[STDOUT seed] var2: somevalue
Done. No errors occurred.
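
To go one step further and actually start a container, the last line of the prefix script could call a small wrapper like the one below. This is a sketch under assumptions: the function name run_in_image is invented, the image is expected to contain everything the component needs, and the bind mounts that make the component's working directories visible inside the container (docker run -v) are left out because they depend on the local setup.

```shell
#!/bin/sh
# Sketch: run a command inside the image named by $1, or directly if $1 is
# empty. DOCKER can be overridden (e.g. DOCKER=echo) to inspect the command
# line on a machine without docker.
run_in_image() {
    image=$1
    shift
    if [ -z "$image" ]; then
        "$@"        # classic mode: no container
    else
        # -i keeps stdin open so code piped in via 'cat -' still arrives;
        # --rm removes the container once the command has finished.
        "${DOCKER:-docker}" run --rm -i "$image" "$@"
    fi
}

# In the prefix script, the last line would then become:
#   cat - | run_in_image "$docker" "$@"
```

Because an empty image name falls through to direct execution, instances without a docker entry in userDefined keep running as before.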




In the Anduril 1.x branch, there is no such annotation. But if you can infer the docker image name from the instance name or the component name, you can make it work: both are written to the _command file.
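
A possible naming rule for that inference is sketched below, following the "anduril/<component-name>" default suggested at the start of the thread. The function name image_for_component is made up, and reading the component name out of the _command file is left out because the thread does not show which key holds it. The name is folded to lowercase because Docker rejects repository names containing uppercase letters.

```shell
#!/bin/sh
# Hypothetical mapping from a component name to a docker image name.
image_for_component() {
    # Docker repository names must be lowercase, so fold the case first.
    printf 'anduril/%s\n' "$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')"
}

image_for_component HTSeqBam2Count
```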


Ville Rantanen

Apr 23, 2015, 2:44:17 PM
to andur...@googlegroups.com
BTW, thanks for bringing the topic of Docker.io to the table. We looked at it earlier already, but no real measures were taken.

I've experimented a little with docker, and I can see the potential in using the prebuilt tools in docker images. There are some issues concerning our own analysis environment, however, mostly related to running things as root, NFS mounts, etc.
Some of them I've settled by adding a layer that installs LDAP authentication in the docker image, so the processes can be run as exactly the same user who started docker.

I have some tools that help using docker with anduril already brewing in my mind, but that's waiting for testing and more use cases. 

Christian Frech

Apr 25, 2015, 11:13:06 AM
to andur...@googlegroups.com
No problem, great to hear that you are experimenting with docker as well. I had similar issues with authentication, which I solved by giving the user inside the container the same UID and GID as the user who started docker. But your LDAP solution is probably cleaner, so maybe you can post it here.
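
One way to get a similar effect, not necessarily the one used here, is the --user option of docker run, which starts the container process with a given UID and GID. The helper name same_user_opt below is made up for the example; it only prints the option for the calling user.

```shell
#!/bin/sh
# Print the `docker run` option that makes the container process run with
# the UID and GID of the user calling this function.
same_user_opt() {
    printf '%s %s:%s\n' --user "$(id -u)" "$(id -g)"
}

same_user_opt
```

Files written to bind-mounted directories then belong to the calling user rather than to root. The UID may have no passwd entry inside the image, which some tools do not handle well.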

BTW, I found an (interim) solution for component-specific docker images in Anduril 1.2, which works at least for my own components. All my components define an 'execPrefix' parameter that, if set, will be placed in front of all bash commands that require the specialized environment. This works, but the prefix script you posted is probably the way to go in Anduril 2+.
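
Roughly, the execPrefix idea looks like the sketch below. It is a simplified illustration rather than the actual component code: how the parameter value reaches the script is Anduril-specific and not shown, so execPrefix is treated as a plain shell variable, and the helper name with_prefix is made up.

```shell
#!/bin/sh
# Sketch of a component script honoring an 'execPrefix' parameter.
# Assumption: the parameter value is available as a shell variable.
execPrefix=${execPrefix:-}

with_prefix() {
    # Intentionally unquoted: the prefix is a whitespace-separated command
    # fragment such as "docker run --rm -i some/image". If it is empty,
    # the command simply runs without a prefix.
    $execPrefix "$@"
}

# Only commands that need the specialized environment go through the helper:
#   with_prefix samtools index in.bam
```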

Christian Frech

Apr 29, 2015, 3:45:15 AM
to andur...@googlegroups.com
There is an interesting related discussion going on over at reddit about interchangeable bioinformatics software containers:
http://www.reddit.com/r/bioinformatics/comments/344z5y/bioboxes_a_standard_for_creating_interchangable/