Integrating Pydoop and Seal


Luca Pireddu

Mar 13, 2012, 11:44:19 AM
to cloudb...@googlegroups.com
Hello list!

I've just sent a pull request to Brad in the hope of integrating the
installation of Seal and its dependency Pydoop into CloudBioLinux.

Seal (http://biodoop-seal.sf.net) is a suite of programs to process
high-throughput sequencing data that run on the Hadoop MapReduce
framework. Pydoop (http://pydoop.sf.net) is a Python API that allows
one to write Python programs that run directly on the Hadoop
framework.

Roman Valls Guimera and I spent some hours working at a recent
hackathon, organized under the SeqAhead EU COST action for software
for NGS, to integrate this software into CloudBioLinux. The result of
that effort is a new "seal flavour" that installs the aforementioned
software properly on Scientific Linux (an rpm-based distribution). We
haven't yet put any effort into supporting deb-based distributions.

I wanted to discuss an issue we encountered. Although Scientific
Linux (SL) is an rpm-based distribution and uses yum, the names of its
packages differ from the ones used in RedHat distributions. This
caused problems for us, as the packages that CloudBioLinux (CBL) tried
to install on SL to satisfy Seal's dependencies did not exist. We didn't
find a good solution and ended up specifying the required packages
using the SL names. Thus, the installation procedure will most likely
fail on any RedHat distro.

Is there a better solution to this problem? From what I saw, the
problem may really lie in the level of abstraction CBL currently
provides: it stops at the level of the package manager, while maybe it
should go a little higher and abstract at the distribution level. Any
opinions on this? What might be the best way to get this new flavour
working on all supported distributions?

Thanks for the input!

Luca

Brad Chapman

Mar 13, 2012, 9:37:24 PM
to Luca Pireddu, cloudb...@googlegroups.com

Luca;
Thanks much for this contribution. I'll integrate it
tomorrow. Having Seal in addition to the updated Pydoop installation on
the main distribution will be great.

As you've noticed, the RPM side of CloudBioLinux lags. The
focus has been on Ubuntu/Debian where there are more pre-built
packages. All of the RPM side has been on CentOS, hence the ugly .bashrc
manipulation to get more up to date g++.

So, this could definitely use additional abstraction. If you're a
Scientific Linux user and want to tackle this we'd be very happy to
accept patches to make it cleaner.

Thanks again,
Brad

Roman Valls

Mar 16, 2012, 4:29:21 AM
to cloudb...@googlegroups.com
Thanks Luca for the writeup !

Brad, talking about abstractions on this, would it make sense to have
some sort of "package name resolver" mechanism ? For example:

python-devel: python-dev, python-devel

And have some code iterating and failing (or succeeding) to install
those package names ?
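
A minimal sketch of the resolver idea above, in Python. The table layout, the function name and the `try_install` callback are all assumptions made for illustration; none of them exist in CloudBioLinux. The callback stands in for whatever would actually call yum or apt-get and report success.

```python
# Hypothetical resolver table: generic name -> candidate names, tried in order.
CANDIDATES = {
    "python-devel": ["python-dev", "python-devel"],
}


def install_first_available(generic_name, try_install, candidates=CANDIDATES):
    """Try each candidate name in order and return the one that installed.

    try_install is a caller-supplied function (an assumption of this sketch)
    that returns True when the package installed and False otherwise.
    Names missing from the table are tried as-is.
    """
    for name in candidates.get(generic_name, [generic_name]):
        if try_install(name):
            return name
    raise RuntimeError("no installable package found for %r" % generic_name)


# Stand-in installer that pretends to be an rpm-based system.
available_on_rpm = {"python-devel"}
chosen = install_first_available("python-devel", lambda n: n in available_on_rpm)
print(chosen)
```

The cost of this approach is one failed install attempt per wrong candidate, which may be slow against a real package manager.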

Then, we could get rid of packages-yum.yml and packages-deb.yml, which
could become unmaintainable in the future when different
distributions name packages differently (e.g., Scientific Linux vs
CentOS, or even Debian vs Ubuntu vs Linux Mint, to name a few major distros).

Otherwise we will end up with a structure similar to:

packages-yum.yaml
packages-yum-scientific-linux.yaml
...

Which could be a bit of a mess to update, imho :-/ What makes more sense
to you ?

Thanks !
Roman

Brad Chapman

Mar 16, 2012, 7:26:25 AM
to Roman Valls, cloudb...@googlegroups.com

Roman;

> Brad, talking about abstractions on this, would it make sense to have
> some sort of "package name resolver" mechanism ? For example:
>
> python-devel: python-dev, python-devel
>
> And have some code iterating and failing (or succeeding) to install
> those package names ?

That's how the Debian port works: packages-debian.yaml specifies renames
of the packages specified in packages.yaml.

We used this approach when the package repositories are similar, and the
separate configuration approach when they are more divergent. There's a
tradeoff either way: extra YAML files are simpler to update for a single
distribution, but then you end up with a ton of them if you support
multiple distributions.
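
The rename approach can be sketched roughly as below. This is only an illustration: the real layout of packages.yaml and packages-debian.yaml is not shown in this thread, so the plain list and dict used here, the package names, and the "None means drop" convention are assumptions.

```python
def apply_renames(packages, renames):
    """Return the package list with distribution-specific names substituted.

    A rename value of None means the package has no equivalent on this
    distribution and is dropped (a convention assumed for this sketch).
    """
    resolved = []
    for name in packages:
        target = renames.get(name, name)
        if target is not None:
            resolved.append(target)
    return resolved


# Illustrative data only; not taken from the real configuration files.
base = ["python-dev", "libxml2-dev", "sun-java6-jdk"]
debian_renames = {"sun-java6-jdk": "openjdk-6-jdk"}
print(apply_renames(base, debian_renames))
```

One base list plus a small rename table per distribution stays manageable while the repositories are similar; once most entries need a rename, a separate file is probably simpler.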

If someone wants to regularly maintain another image outside of Ubuntu
we'd be happy to go with whatever works best for them. The goal isn't
to build a meta-package-manager and support every distribution out
there, but to find a way to practically provide up-to-date images
with bioinformatics software. So whatever approach supports this the
best is cool with me.

Brad

Roman Valls Guimera

Mar 17, 2012, 6:48:36 AM
to Brad Chapman, cloudb...@googlegroups.com
+1 on your reasoning, Brad. I just had a look at how you merged in Luca's pull request, good stuff :)

Enis Afgan

Mar 18, 2012, 6:35:45 PM
to cloudb...@googlegroups.com, Brad Chapman
Hi guys,
I wanted to take this opportunity to see if anyone has any opinions on a topic somewhat related to the package name resolution - failed package/library installations.
Currently, if a part of the configuration fails, the whole script tends to fail. What about maintaining a run-time failed component list/file that has the same structure as the original files? Then, the script could continue with the configuration and after it's all complete, it should be faster to rerun just the part that's failed. There may be dependency issues that arise when an installation continues after a failed build/install, but those packages/libraries could be added to the failed list and attempted again later.
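
A rough sketch of the failed-component list idea, assuming the configuration can be viewed as a mapping from build stage to item names. The stage names, the `install` callback and the function name are hypothetical and not part of the existing fabfile.

```python
def install_all(config, install):
    """Install every item in config, collecting failures instead of aborting.

    config maps a stage name (e.g. "packages", "libraries", "custom") to a
    list of item names, and install(stage, item) raises on failure. Both the
    layout and the callback are assumptions of this sketch.
    Returns a dict with the same shape as config, holding only failed items.
    """
    failed = {}
    for stage, items in config.items():
        for item in items:
            try:
                install(stage, item)
            except Exception:
                failed.setdefault(stage, []).append(item)
    return failed


def fake_install(stage, item):
    # Stand-in installer: one package can never be found.
    if item == "sun-java":
        raise RuntimeError("package not found")


config = {"packages": ["gcc", "sun-java"], "custom": ["bwa"]}
failed = install_all(config, fake_install)
print(failed)

# Because the result has the same shape as the input, a second pass can be
# run over just the failures.
still_failed = install_all(failed, fake_install)
```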

Overall, the configuration process continues pretty nicely without failures, but occasionally I hit a timeout or a package cannot be found (e.g., sun java) and the whole thing fails, requiring a 2 hour restart. So this seemed like a possible way to minimize those.

At this point I just wanted to bring up this somewhat related issue and see what others think: is this worth the effort, and what would be the best approach for doing it?

Thanks,
Enis

Brad Chapman

Mar 19, 2012, 7:34:02 AM
to Enis Afgan, cloudb...@googlegroups.com

Enis;

> I wanted to take this opportunity to see if anyone has any opinions on a
> topic somewhat related to the package name resolution - failed
> package/library installations.
> Currently, if a part of the configuration fails, the whole script tends to
> fail.

The fail-fast idea is by design, but I agree that restarts following
failures are too slow. A lot of the "quick-checks" before reinstalling
can drag on. Thanks for bringing up this issue.

> What about maintaining a run-time failed component list/file that has
> the same structure as the original files? Then, the script could continue
> with the configuration and after it's all complete, it should be faster to
> rerun just the part that's failed.

The approach I take to restarting after failures is to manually start at
the section where it left off. The fabfile has a number of different
target points to build individual sections. The downside is that this
requires some knowledge of the build process and where you were at.

What do you think about automating that process by having a checkpoint
file that lists completed sections? Then the process could read this
file and start at the appropriate section. Re-running part of any
individual section won't be especially slow.
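
To make the checkpoint idea concrete, here is a minimal sketch. It assumes each build section can be expressed as a (name, function) pair and that the checkpoint is a plain text file with one completed section name per line; the file format and all names below are hypothetical.

```python
import os
import tempfile


def run_with_checkpoints(sections, checkpoint_path):
    """Run (name, function) pairs in order, skipping names already recorded."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = set(line.strip() for line in handle if line.strip())
    ran = []
    for name, func in sections:
        if name in done:
            continue
        func()  # an exception here stops the run, keeping fail-fast behaviour
        with open(checkpoint_path, "a") as handle:
            handle.write(name + "\n")
        ran.append(name)
    return ran


# Stand-in sections: "libraries" fails on the first run only.
state = {"fail_libraries": True}


def packages():
    pass


def libraries():
    if state["fail_libraries"]:
        raise RuntimeError("timeout")


def custom():
    pass


sections = [("packages", packages), ("libraries", libraries), ("custom", custom)]
checkpoint = os.path.join(tempfile.mkdtemp(), "completed_sections.txt")

try:
    run_with_checkpoints(sections, checkpoint)
except RuntimeError:
    pass  # first run dies in "libraries"; "packages" is already recorded

state["fail_libraries"] = False
resumed = run_with_checkpoints(sections, checkpoint)
print(resumed)
```

The second call skips "packages" and resumes at the section that failed.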

Brad

Enis Afgan

Mar 19, 2012, 7:07:16 PM
to Brad Chapman, cloudb...@googlegroups.com
Hi Brad,

> > What about maintaining a run-time failed component list/file that has
> > the same structure as the original files? Then, the script could continue
> > with the configuration and after it's all complete, it should be faster to
> > rerun just the part that's failed.
>
> The approach I take to restarting after failures is to manually start at
> the section where it left off. The fabfile has a number of different
> target points to build individual sections. The downside is that this
> requires some knowledge of the build process and where you were at.

I've been doing the same, but that requires human attention and knowledge: the process can die 5 minutes after I leave the computer and, although there is another hour of work left, none of it will get done until I restart the process...

> What do you think about automating that process by having a checkpoint
> file that lists completed sections? Then the process could read this
> file and start at the appropriate section. Re-running part of any
> individual section won't be especially slow.

I feel there would still be the same issue of having to manually restart the process vs. it just continuing and doing what it can. Then, later one could come back and see what's failed and/or restart.
I guess the process I suggested of keeping up with what's failed would require also keeping up with that stage of the build process (ie, package, library, or custom) so that may require some special handling in the code (or creating temp files parallel to the current config files) so it may be a bit more work than I'm hoping for...

Enis

Brad Chapman

Mar 19, 2012, 8:39:05 PM
to Enis Afgan, cloudb...@googlegroups.com

Enis;

> I feel there would still be the same issue of having to manually restart
> the process vs. it just continuing and doing what it can.

That's a good point. It definitely requires some manual attention in the
case of errors.

> Then, later one
> could come back and see what's failed and/or restart.
> I guess the process I suggested of keeping up with what's failed would
> require also keeping up with that stage of the build process (ie, package,
> library, or custom) so that may require some special handling in the code
> (or creating temp files parallel to the current config files) so it may be
> a bit more work than I'm hoping for...

It would be a bit of work to cleanly fail and report like that. I've
tried this path in the past with projects and not been happy with it
since the code tends to degenerate into a ton of error handling for
special cases.

If there are some high level cases where we can catch and retry which
would help with the process I'd be happy to add those. For real
failures the best solution is likely automated test runs so we can
detect and fix problems early.
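
As one possible shape for a high-level "catch and retry", here is a small sketch. The `TransientError` class and the `retry` helper are hypothetical; deciding which real failures (timeouts, mirror hiccups) count as retryable is left open.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, such as a timeout."""


def retry(func, attempts=3, delay=0.0, retry_on=(TransientError,)):
    """Call func until it succeeds or the attempts are used up.

    Only exceptions listed in retry_on are retried; anything else propagates
    immediately, so real failures still fail fast.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(delay)


# Stand-in download that times out twice before succeeding.
calls = {"count": 0}


def flaky_download():
    calls["count"] += 1
    if calls["count"] < 3:
        raise TransientError("timed out")
    return "ok"


result = retry(flaky_download, attempts=3)
print(result, calls["count"])
```

Keeping the retry at one or two well-chosen call sites avoids the spread of special-case error handling described above.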

Let me know if any of this sounds useful. Happy to accept patches on
this whichever way you decide to go,
Brad

Enis Afgan

Mar 19, 2012, 11:47:23 PM
to Brad Chapman, cloudb...@googlegroups.com
OK. I suggest we just table this for now because it's looking more complex than hoped for and, currently, I don't have the time to work on it at that level...

agbiotec

Apr 19, 2012, 1:26:58 PM
to cloudb...@googlegroups.com
Hi guys,

I had a question in regards to distros / environment where the deployment takes place, and though
a bit different from what Luca presented below, I thought to post here so I wouldn't start a new thread.

Is there a part of the env. dictionary in the whole cloudbiolinux fab framework, where the arch of
the VM where the deployment takes place is noted ? I am currently writing a deployment script that
will work under the install_custom directive and it needs to pull in a couple of binaries (unfortunately
they are not available via apt) which are pre-compiled for 32-bit or 64-bit arch.

One more quick one (I guess I am lazy to google it): is there a way to make apt pull older versions
of packages ? For example would it work by just putting "clustalw-0.1.1" (circa 2008) in the .yaml package
listing files so that the apt-get commands in the fabric scripts work without modifications by just passing
the name of the older package to the apt-get command ?


 cheers,

Ntino

Brad Chapman

Apr 19, 2012, 9:00:00 PM
to agbiotec, cloudb...@googlegroups.com

Ntino;

> Is there a part of the env. dictionary in the whole cloudbiolinux fab
> framework, where the arch of
> the VM where the deployment takes place is noted ? I am currently writing a
> deployment script that
> will work under the install_custom directive and it needs to pull in a
> couple of binaries (unfortunately
> they are not available via apt) which are pre-compiled for 32-bit or 64-bit
> arch.

Good idea. We previously had a hack for this in the packages but I
swapped it into a global env variable you can use:

https://github.com/chapmanb/cloudbiolinux/commit/109907c155640f3cffe31ce3dfc13ab7c60a1ac3

Hope this works for what you need.
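
For illustration, picking a pre-compiled binary by architecture could look something like the sketch below. The name of the env variable added in the commit is not given in this message, so the sketch takes a `uname -m` style string as a plain argument instead; the function names, URL and file naming scheme are all made up.

```python
def binary_suffix(machine):
    """Map a `uname -m` style string to a 32-bit or 64-bit download suffix."""
    if machine in ("x86_64", "amd64"):
        return "64bit"
    if machine in ("i386", "i486", "i586", "i686"):
        return "32bit"
    raise ValueError("unrecognised architecture: %r" % machine)


def binary_url(base_url, tool, machine):
    """Build a download URL using a hypothetical <tool>-<suffix>.tar.gz scheme."""
    return "%s/%s-%s.tar.gz" % (base_url.rstrip("/"), tool, binary_suffix(machine))


print(binary_url("http://example.org/downloads", "mytool", "x86_64"))
```

Inside an install_custom function the machine string would come from the remote host (for example the output of `uname -m`), or from the env variable mentioned above.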

> One more quick one (I guess I am lazy to google it): is there a way to
> make apt pull older versions
> of packages ? For example would it work by just putting "clustalw-0.1.1"
> (circa 2008) in the .yaml package
> listing files so that the apt-get commands in the fabric scripts work
> without modifications by just passing
> the name of the older package to the apt-get command ?

The fabric scripts pass the name directly to apt-get install, but I
don't think specifying specific package versions works like that with
apt. I'm not an expert on this, but I think the way to force it is to
manually download the deb and do a 'dpkg -i' on it. Sorry, I'm not much
of an apt expert. Does anyone else have any tricks for this?

Brad

Tim Booth

Apr 20, 2012, 4:56:39 AM
to cloudb...@googlegroups.com, agbiotec
Hi,

> > One more quick one (I guess I am lazy to google it): is there a way to
> > make apt pull older versions
> > of packages ? For example would it work by just putting "clustalw-0.1.1"
> > (circa 2008) in the .yaml package
> > listing files so that the apt-get commands in the fabric scripts work
> > without modifications by just passing
> > the name of the older package to the apt-get command ?
>
> The fabric scripts pass the name directly to apt-get install, but I
> don't think specifying specific package versions works like that with
> apt. I'm not an expert on this, but I think the way to force it is to
> manually download the deb and do a 'dpkg -i' on it. Sorry, I'm not much
> of an apt expert. Does anyone else have any tricks for this?

The syntax is simple and is virtually what you guessed, just with an "="
sign:

sudo apt-get install clustalw=0.1.1

But of course this will only work if the relevant package version is
somewhere that is accessible to APT, and if the old .deb installs
without conflicts on the newer system. The default Ubuntu repositories
will only ever have the latest version of the package. The Bio-Linux
repo sometimes has older versions but in general we clean them out.
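
Since the package name is passed straight through to apt-get install, a "name=version" entry might work unchanged. As a hedged sketch, a small helper could validate such an entry before building the command; the helper is hypothetical and not part of the existing fabric scripts.

```python
def apt_install_command(spec):
    """Build an apt-get argument list from "name" or "name=version".

    Only the shape of the spec is checked; whether that version is actually
    available to APT can only be known on the target machine.
    """
    name, sep, version = spec.partition("=")
    if not name or (sep and not version):
        raise ValueError("malformed package spec: %r" % spec)
    return ["apt-get", "install", "-y", spec]


print(" ".join(apt_install_command("clustalw=0.1.1")))
```

Running `apt-cache policy clustalw` on the target system lists the versions APT currently knows about, which is a quick way to check whether a pinned version can be installed at all.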

How were you planning to make the old packages available to the system?

Cheers,

TIM

--
Tim Booth <tbo...@ceh.ac.uk>
NERC Environmental Bioinformatics Centre

Centre for Ecology and Hydrology
Maclean Bldg, Benson Lane
Crowmarsh Gifford
Wallingford, England
OX10 8BB

http://nebc.nerc.ac.uk
+44 1491 69 2705

