Answering your questions in a slightly different order:
On 08:21 Tue 13 Nov, Bart van Deenen wrote:
> Another thing; disco has over a hundred open bugs on github, many of them
> are months (or years) old. Is closing these a manpower issue, or has the
> project spread into many independent clones, where everyone fixes just
> their own bugs?
Yes, this is a manpower issue. Otoh, our team at Nokia is hiring, and
Disco hackers are most welcome! There are certainly clones, but
afaict, folks frequently contribute fixes and improvements back.
Also, the 0.5 model was actually motivated by looking for ways to fix
some of the existing bugs.
> Looking through the current documentation, i notice both disco.worker and
> disco.worker.classic. I've also seen some mention in this list of the
> future version 0.5 having an incompatible api change. We use disco in our
> company (Spilgames.com) for aggregating events from large amounts of web
> clients, but I find that there is not a lot of momentum in using disco. To
> me that seems to be mostly because of the limited number of tutorials and
> generally not so much overview documentation (you end up at the python
> class documentation pretty quickly).
>
> We will have to develop skills in this company to really use map-reduce
> frameworks anyway, and we're not a Java shop (PHP, Python and Erlang
> mostly), so there is not a lot of enthusiasm for Hadoop. I like the
> compact code-base of disco, and the ease with which you can get something
> going. I also have confidence in its performance.
>
> We are absolutely willing to contribute quite some effort into writing
> tutorials and other documentation, and we will make those available to the
> project, but I'd like to know a bit about the api roadmap.
> Is the 'classic' pattern on the way to obsolescence? Should we really
> focus on the 'new' mechanism.
More tutorials and documentation would certainly help, especially of
the cookbook kind. Any doc contributions will be very gratefully
merged!
There will be some changes in 0.5, but the goal is to support as much
of the current API as possible. 0.5 allows a more flexible 'pipeline'
approach to computation [1]; this flexibility means that the current
'classic' map-reduce API can essentially be supported. There might
need to be slightly different ways of doing the same thing, so minor
code changes _might_ be needed, but you can still do map-reduce style
processing. But the goal is definitely to minimize any code changes
required.
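To make the idea a bit more concrete, here is a toy sketch in plain
Python of how a classic map-reduce job can be seen as a two-stage
pipeline. It does not use Disco at all, and the names (run_pipeline,
run_stage, SPLIT, GROUP_LABEL) are made up for illustration; they are
not the 0.5 API, which is defined in [1].

```python
from collections import defaultdict

# Hypothetical grouping modes, for this sketch only (not Disco's API).
SPLIT = "split"              # one task per input record
GROUP_LABEL = "group_label"  # one task per distinct label


def run_stage(grouping, fn, labelled_inputs):
    """Run one stage over (label, value) pairs; return (label, value) outputs."""
    if grouping == SPLIT:
        tasks = [(label, [value]) for label, value in labelled_inputs]
    elif grouping == GROUP_LABEL:
        groups = defaultdict(list)
        for label, value in labelled_inputs:
            groups[label].append(value)
        tasks = sorted(groups.items())
    else:
        raise ValueError("unknown grouping: %r" % (grouping,))

    outputs = []
    for label, values in tasks:
        outputs.extend(fn(label, values))
    return outputs


def run_pipeline(stages, inputs):
    """Feed the outputs of each stage into the next one."""
    data = list(inputs)
    for grouping, fn in stages:
        data = run_stage(grouping, fn, data)
    return data


def map_words(label, lines):
    # 'map': emit (word, 1) for every word; the word becomes the label.
    for line in lines:
        for word in line.split():
            yield word, 1


def reduce_counts(word, counts):
    # 'reduce': sum the counts collected under one label.
    yield word, sum(counts)


# Classic map-reduce expressed as a two-stage pipeline.
wordcount = [(SPLIT, map_words), (GROUP_LABEL, reduce_counts)]

result = run_pipeline(wordcount, [(0, "a b a"), (1, "b c")])
```

The point is only that 'map' and 'reduce' become ordinary stages that
differ in how their inputs are grouped, so a pipeline could also have
more than two stages or use other groupings.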
The Erlang support for the pipeline model for 0.5 is basically done
[2] and lightly tested, but needs serious pounding in a large cluster.
The Python user library needs work, both to natively exploit the new
model, and to support as much of the current 'classic' API on top of
the new model. But much of the documentation for the 'classic' API
should ideally still hold once that API is ported to the new model.
In addition, the Web UI obviously also needs work to show appropriate
and useful job information.
This means that 0.5 is still a ways away; in the meantime, the
discoproject github master branch will still point to the stable 0.4
line.
Note that the changes are mainly targeting the compute portion of
Disco; there are no major changes in the roadmap for the DDFS storage
layer. The main things to be done for DDFS are known [3] and well
specified; it is primarily an issue of time and manpower to get them
done.
If you want to play around with the new pipeline model now, and
don't mind using OCaml, you can already do so [4].
[1] https://github.com/pmundkur/disco/blob/devel/scheduler/master/include/pipeline.hrl
[2] https://github.com/pmundkur/disco/commits/devel/scheduler
[3] https://github.com/discoproject/disco/wiki/DDFS-Evolution
[4] https://github.com/pmundkur/odisco/tree/devel/pipeline
--
prashanth