I have trained an nnet3 model with about 30M parameters, whose backbone is conv*2-[tdnn*3-gru]*3. I run forced alignment on a machine with 132 CPU cores. The RTF is 0.2 when nj is set to 1, and the RTF/CPU is also 0.2 when nj is set to 10.
Additionally, aligning 5 hours of data takes about the same time whether nj is set to 10 or 20. And when I set nj to 120 to align 3000 hours of data, it took 40 hours in total, which works out to 1.6 RTF/CPU (40 h * 120 jobs / 3000 h). The expected RTF/CPU is 0.2.
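
To make the numbers easier to check, here is a small sketch of how I am computing RTF/CPU. It assumes each of the nj jobs keeps exactly one core busy for the whole run, and the function names are just ones I made up for illustration, not anything from Kaldi.

```python
def rtf_per_cpu(wall_hours, num_jobs, audio_hours):
    """CPU-hours spent per hour of audio, assuming one busy core per job."""
    return wall_hours * num_jobs / audio_hours


def expected_wall_hours(audio_hours, rtf, num_jobs):
    """Wall-clock time if every job ran at the given per-CPU RTF."""
    return audio_hours * rtf / num_jobs


# Numbers from the 3000-hour run with nj=120.
observed_rtf = rtf_per_cpu(wall_hours=40, num_jobs=120, audio_hours=3000)
ideal_hours = expected_wall_hours(audio_hours=3000, rtf=0.2, num_jobs=120)
slowdown = observed_rtf / 0.2

print(f"observed RTF/CPU: {observed_rtf:.2f}")
print(f"wall time expected at RTF/CPU 0.2: {ideal_hours:.1f} h")
print(f"slowdown versus expectation: {slowdown:.1f}x")
```

By this calculation the run should have taken about 5 hours instead of 40, so it is roughly 8 times slower than linear scaling would predict.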