tl;dr: it depends on the DAG, but improved ILP is likely possible (if difficult), and there could be room for multi-core parallelism as well.

As I understand it, we're talking about a long computation applied to short input vectors. If the computation can be applied to many input vectors at once, independent of each other, then all levels of parallelism (multiple instructions, multiple cores, multiple sockets, multiple nodes) can be used. This is data-parallelism, which is great! However, it doesn't sound like this is the case.

It sounds like you're thinking of building a DAG of these CSEs and trying to use task-parallelism over independent parts of it (automatically, using sympy or theano or what have you). The tension here is going to be between locality and parallelism: how much compute hardware can you spread your data across without losing the nice cache performance that your small input vectors gain you? I'd bet that going off-socket is way too wide. Modern multi-core architectures have core-local L2 and L1 caches, so if your input data fits nicely into L2 and your DAG doesn't split into independent local pieces, you probably won't get anything out of multiple cores. Your last stand is single-core parallelism (instruction-level parallelism), which sympy et al. may or may not be well equipped to influence.

To start, I'd recommend that you take a look at your DAGs and try to figure out how large the independent chunks are. Then estimate the amount of instruction-level parallelism you get when you run in 'serial' (which you can do with flop-counting). If your demonstrated ILP is less than your independent chunk size, then at least improved ILP should be possible. Automatically splitting up these DAGs and expressing them in a low-level enough way to affect ILP is a considerable task, though. (There's a rough sketch of this chunk-size estimate after my signature.)

To see if multi-core parallelism is worth it, you need to estimate how many extra L3 loads you'd incur by spreading your data over multiple L2s. I don't have great advice for that; maybe someone else here does. The good news is that if your problem has this level of locality, then you can probably get away with emitting C code with pthreads or even openmp. Just bear in mind the thread creation/annihilation overhead (standing thread pools are your friend) and pin the threads to cores.

Good luck,
Max
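A minimal sketch of the chunk-size and flop-count estimate above, assuming the computation is available as sympy expressions; the `exprs` below are stand-ins for the real inputs:

    import sympy as sp
    from collections import Counter

    x, y = sp.symbols('x y')
    # Stand-in expressions; substitute the real computation here.
    exprs = [sp.sin(x)*sp.cos(y) + x**2, sp.sin(x)*sp.cos(y) - y**2]

    replacements, _ = sp.cse(exprs)
    temps = {sym for sym, _ in replacements}

    # Dependency DAG over the CSE temporaries: z_i -> the temporaries it reads.
    deps = {sym: expr.free_symbols & temps for sym, expr in replacements}

    # Longest-path depth of each temporary (cse output is topologically ordered).
    level = {}
    for sym, _ in replacements:
        level[sym] = 1 + max((level[d] for d in deps[sym]), default=0)

    # The width of each level is an upper bound on the independent chunk size.
    print("width per level:", dict(Counter(level.values())))

    # Rough serial op count, for the flop-counting ILP estimate.
    print("approx op count:", sum(expr.count_ops() for _, expr in replacements))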
If you think about what the DAG would look like, your 'stacks' are like horizontal layers in the graph. The width of each layer (the length of each stack) gives an upper bound on the speedup, but it doesn't tell the whole story: you need a way to deal with data locality.

For example, let's look at stack #3. You have 8 independent expressions, so it would seem like you should be able to use 8 pieces of computational hardware (call each a core). However, z_6, z_11, and z_19 all depend on z_5. Therefore, either z_6, z_11, and z_19 need to be computed local to z_5, or z_5 needs to be copied somewhere else. The copying is much more expensive than the computing (50-100 cycles [1]), so if you only have 3 things that depend on z_5, you're going to want to just compute them all on the same core as z_5.

The complication is that z_5 and z_10 also share a dependency, z_4, so they too should be computed locally. Now we have to compute everything that depends on z_5 or z_10 on the same core. If we never break locality anywhere, we won't expose any parallelism. This is the tension: copies are expensive, but without them we can't expose any parallelism and will be stuck with one core. This is why we really need to build a DAG, not just stacks, and then try to break it into chunks with the fewest edges between them. The number of chunks is the amount of available parallelism, and the number of crossing edges is the number of copies. (A toy version of this edge-counting is sketched after my signature.)

Fortunately, even if the DAGs are strongly connected and you're stuck with one core, there is still ILP. In a nutshell: each core can actually do a couple of operations at the same time. The core uses a single cache, so the data is local and doesn't require copies. The compiler is supposed to figure out ILP for you, but you might be able to help it out using all the extra information sympy/theano knows about your computation.

Max
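A toy version of the edge-counting described above, again assuming sympy expressions as input; the round-robin split and the example expressions are placeholders for a real partitioner and the real computation:

    import sympy as sp

    x, y = sp.symbols('x y')
    # Stand-in expressions; substitute the real computation here.
    exprs = [sp.sin(x)*sp.cos(y) + x**2, sp.sin(x)*sp.cos(y) - y**2]

    replacements, _ = sp.cse(exprs)
    temps = {sym for sym, _ in replacements}
    deps = {sym: expr.free_symbols & temps for sym, expr in replacements}

    def count_copies(chunk):
        # Producer->consumer edges that cross chunk boundaries: each one is
        # a value that must move between cores instead of staying in the
        # producer's cache.
        return sum(1 for sym, ds in deps.items()
                     for d in ds if chunk[d] != chunk[sym])

    # Strawman two-core split: alternate temporaries between cores. A real
    # partitioner would search for the split with the fewest crossing edges.
    chunk = {sym: i % 2 for i, (sym, _) in enumerate(replacements)}
    print("cross-core copies for this split:", count_copies(chunk))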