Is SGE still the best scheduler for Kaldi and maybe other ML tools?


Josef G. Bauer

Jun 29, 2019, 1:36:12 AM
to kaldi...@googlegroups.com
Hello,

I've been using Kaldi now for about 6 years with a "company internal" scheduler on some non-Ubuntu distribution. It mostly went fine.

Now I want / need to switch to Ubuntu for which the "company internal" scheduler is not available.

My natural choice is gridengine from Ubuntu packages (SGE).

But there are opinions against this:

* SGE is not actively developed, so probably no features will be added in the future

* there might be more "modern" schedulers that have additional features and are more "future-proof"


Questions:

* Is my assumption correct that upcoming security issues in SGE will be taken care of?

* What about other ML tools --- is SGE a suboptimal choice for them?


Looking forward to hearing your opinion.


Greetings

Josef


PS: Kaldi rules!

Daniel Povey

Jun 29, 2019, 2:01:56 PM
to kaldi-help
SGE is definitely not actively developed.  Of the schedulers that have that type of interface (based on submitting jobs), probably the best maintained is slurm.  Its architecture is ugly though.  There is nothing in that space that I would wholeheartedly recommend.

There are all kinds of fancy new things that aim to be a modern replacement for those traditional schedulers (things like Apache Spark), but none of them are easily compatible with Kaldi, as the way they work is very different from what Kaldi expects.  Kaldi would have to be reworked a bit to work with that, I think.

Most ML tools do not interact directly with any schedulers other than (in some cases) MPI, which of course is not a scheduler but a parallel processing library that you would normally use in conjunction with a scheduler.

In most cases, for most tools, your best bet would just be to reserve one machine with several GPUs.  It will rarely be worth it, except for big organizations, to go beyond that.  And for scheduling you'll have to figure it out yourself; most machine learning tools don't have any concept of interacting with a scheduler.

Dan


--
Go to http://kaldi-asr.org/forums.html find out how to join
---
You received this message because you are subscribed to the Google Groups "kaldi-help" group.
To unsubscribe from this group and stop receiving emails from it, send an email to kaldi-help+...@googlegroups.com.
To post to this group, send email to kaldi...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/kaldi-help/CAP-zZaiq5wvq%3D4pFviO2kC7cA17fRnYxeiP16Wf6wYAXFkhZLw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.

Vishal Kumar

Jul 1, 2019, 1:42:34 AM
to kaldi-help
Hi Dan,
I am also facing similar issues while setting up Grid Engine.
I have been facing issues for 2 weeks and it is still not picking up jobs. The jobs are queued and waiting. There is no help available, and there is no accurate, up-to-date tutorial either. It would be of great help if you could point to a tutorial or something. Also, please move Kaldi to some other scheduler.
Thanks

Daniel Povey

Jul 1, 2019, 4:46:30 PM
to kaldi-help
Kaldi is actually independent of the scheduler; it will work with any scheduler with a similar design, like slurm, pbs, etc.  You just have to use the appropriate wrapper script, e.g. slurm.pl or pbs.pl, by changing cmd.sh.
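
As a rough illustration of what "changing cmd.sh" can look like, here is a minimal sketch. The variable KALDI_SCHEDULER and the function kaldi_wrapper_for are hypothetical names made up for this example; queue.pl, slurm.pl, pbs.pl and run.pl are the wrapper scripts that come with Kaldi's utils directory. Any per-job options (memory, GPUs) depend on your site and are left out here.

```shell
# Sketch of a cmd.sh that picks the Kaldi wrapper script per scheduler.
# KALDI_SCHEDULER and kaldi_wrapper_for are hypothetical names for this example.
kaldi_wrapper_for() {
  case "$1" in
    sge)   echo "queue.pl" ;;   # Sun GridEngine
    slurm) echo "slurm.pl" ;;
    pbs)   echo "pbs.pl" ;;
    *)     echo "run.pl" ;;     # no scheduler: run everything on the local machine
  esac
}

wrapper="$(kaldi_wrapper_for "${KALDI_SCHEDULER:-local}")"
export train_cmd="$wrapper"
export decode_cmd="$wrapper"
echo "train_cmd=$train_cmd decode_cmd=$decode_cmd"
```

The recipes only read $train_cmd and $decode_cmd, so the rest of the scripts should not need to change when the scheduler does.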



Josef G. Bauer

Jul 4, 2019, 12:05:26 PM
to kaldi...@googlegroups.com, Daniel Povey
Hello Dan,

regardless of the quality of the post from Vishal, which I have my own
opinion about, I'd like to add the following link to this thread:

https://kaldi-asr.org/doc/queue.html

"Kaldi is designed to work best with software such as Sun
GridEngine or other software that works on a similar
principle;"


And ... thanks a lot for your thoughts on the choice of scheduler!


Greetings

Josef

Daniel Povey

Jul 4, 2019, 12:15:15 PM
to Josef G. Bauer, kaldi-help
If you are having trouble setting up GridEngine it may be easier to just run Kaldi on a big server and reduce --nj to the number of virtual cores for most jobs.
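
A small sketch of the "reduce --nj" idea, assuming a single Linux (or similar) server with no scheduler. The function name cap_nj and the requested value 40 are made up for illustration; getconf _NPROCESSORS_ONLN reports the number of online (virtual) cores.

```shell
# Cap the requested number of parallel jobs at the number of online cores.
# cap_nj is a hypothetical helper, not part of Kaldi.
cap_nj() {
  requested="$1"
  cores="$(getconf _NPROCESSORS_ONLN)"
  if [ "$requested" -gt "$cores" ]; then
    echo "$cores"
  else
    echo "$requested"
  fi
}

nj="$(cap_nj 40)"
echo "using --nj $nj"
```

The resulting value would then be passed as --nj to the individual recipe steps, with run.pl set as the command in cmd.sh.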

GridEngine is definitely not easy to set up, but the newer things in this space are quite problematic too.  They make all kinds of promises but in practice it doesn't tend to be as easy as they say; and they tend to require the entire software setup to be built in their framework, which is a lot of work.  At one point I even considered taking over maintenance of the Sun GridEngine project, but ran into objections from the guy who was allegedly maintaining it (but, in practice, wasn't).  Anyway it would have been too much work, so probably a good thing.

Dan


Daniel Povey

Jul 7, 2019, 8:15:01 PM
to Josef G. Bauer, kaldi-help
BTW, if an SGE job is waiting, you may be able to find out why by doing
qstat -j <job-id>
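
A minimal sketch of how that check might be wrapped, assuming the standard SGE client tools are installed on the submit host. The function name why_waiting is made up for this example; whether qstat -j prints a "scheduling info" section depends on how the scheduler is configured.

```shell
# Show why an SGE job is still waiting.
# why_waiting is a hypothetical helper; qstat is the standard SGE client command.
why_waiting() {
  job_id="$1"
  if ! command -v qstat >/dev/null 2>&1; then
    echo "qstat not found; run this on an SGE submit host" >&2
    return 1
  fi
  qstat -j "$job_id"   # job details; look for the "scheduling info" section, if present
  qstat -f             # full queue listing, including the state of each queue instance
}
```

Usage would be something like `why_waiting 12345`, with the job id taken from the plain `qstat` output.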

Mortaza Doulaty

Jul 9, 2019, 5:55:37 AM
to kaldi-help
Another option (in case you don't have an on-premise SGE cluster) would be using CycleCloud. It offers 1-click deployment of shared file systems (NFS, BeeGFS, GlusterFS) and SGE clusters (as well as slurm) in the cloud. It's provider agnostic and works on Azure and AWS and possibly others.
It works perfectly fine with Kaldi using SGE. You may need to customise a bit beyond that 1-click setup, but it's quite easy and the learning curve is not that steep, especially if you have used/maintained SGE clusters and shared FSs before.
The only overhead is the main node that runs the CycleCloud software (you can use a cheap VM for that); it provides a web interface as well as a CLI. Once you set up your file system and cluster you don't really need to use that interface.

You can define multiple queues, each having a certain type of VM - you can have single/multi GPU queues, you name it!

So almost all of the scenarios in which you can use Kaldi are easy to set up in CycleCloud. It's also elastic, meaning that when you submit a job, it adds nodes as you defined, and once the computation is over, it deallocates the VMs, back to zero, with no extra cloud fees. I have been using this setup for a while now and am quite happy with it (still not as convenient as having a large on-premise SGE cluster, but the elasticity is really nice as you can easily get thousands of cores if needed).
You can also use spot instances (cheaper, interruptible VMs that some cloud providers offer), and with a bit of SGE magic you can automatically reschedule jobs that are interrupted on spot instances, so you can decrease your cloud spending a lot without worrying about resubmitting the failed jobs.

Jan Trmal

Jul 9, 2019, 7:00:09 AM
to kaldi...@googlegroups.com
Interesting, I didn't know about this. 
Y.



Josef G. Bauer

Jul 13, 2019, 4:19:36 AM
to dpo...@gmail.com, kaldi-help
Hello Dan,

thanks, but I believe that you misunderstood me a bit. I don't have
problems setting up SGE.

I'm currently only trying to figure out what would be the best
scheduler for a redesigned cluster. I totally trust your opinion.

I haven't really looked into the Kaldi tutorial on installing SGE but I
expect that it will be --- as always --- quite straightforward given
these instructions.

Given the HW of the cluster that is to be reconfigured (relatively large
number of nodes with relatively small number of cores each) I think that
a scheduler is required. I very much believe that the future of CPU
based stuff is basically using single nodes with >= 128 cores. My
expectation is that this is more efficient, as it avoids the overhead of a
scheduler. But even then a scheduler would be handy to reserve a node
using an interactive session. Basically the same thing for GPU based stuff.


> At one point I even considered taking over
> maintenance of the Sun GridEngine project ...

How come I kind of "knew" this?

Josef G. Bauer

Jul 13, 2019, 8:55:20 AM
to dpo...@gmail.com, kaldi-help
Hello Dan,

thanks, but I believe you did misunderstand me a bit.

I don't have trouble configuring SGE. I'm currently trying to decide on
a scheduler for a cluster that will be reconfigured. Here I totally trust
your opinion.

I haven't really looked at the Kaldi "tutorial" on SGE. I expect SGE
configuration based on that information to be straightforward, as with
Kaldi in general.

The cluster to be reconfigured has "many" nodes with relatively "few"
CPU cores per node. So I think some kind of scheduler is necessary. I
believe the future should be nodes with e.g. >= 128 CPU cores that can
completely host a training. I expect this can be more efficient, as it avoids
the overhead of a scheduler. But even in this approach a scheduler is
handy for "reserving" a node using an interactive session. Basically the same
thing for GPU based training.


> At one point I even considered taking over maintenance of the Sun
> GridEngine project ...

Actually I suspected this. ;)


Thx and best regards

Josef



Jan Trmal

Jul 13, 2019, 9:28:35 AM
to kaldi-help, Dan Povey
In my experience, the best one is the one your people (or you) know how to configure properly and the users know how to use, especially if you are not setting it up for one specific workload (for example kaldi-only or tensorflow-only or a cfd-software-only).
y.


Daniel Povey

Jul 13, 2019, 12:42:15 PM
to Jan Trmal, kaldi-help
If I were starting fresh, I would probably go with slurm myself.  Its architecture is ugly but I think it's the most popular and best supported of the "grid-engine-like" clustering systems.  

