But I'm wondering how to write C code for such a complicated matrix operation. If it were possible to write Python code inside the SourceModule, I feel the above code could easily be parallelized.
Please give your opinion on this and, if possible, some hints on how to parallelize the above code in PyCUDA.
Thanks in advance.