The matrix operations do seem like good candidates for accelerator-backed Tensors, although if the matrices are very small and not batched, CPU SIMD operations may still come out ahead, since the overhead of dispatching work to an accelerator can dominate such tiny workloads.
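As a rough illustration of the small-matrix case, here's a sketch using Swift's built-in `simd` module (the matrix values are arbitrary placeholders): a 4x4 multiply runs entirely on the CPU's SIMD unit, with no dispatch overhead to amortize.

```swift
import simd

// Multiply two small 4x4 matrices using the CPU's SIMD types.
// At this size there is no accelerator dispatch to amortize,
// so this path can beat sending the work to a device.
let a = simd_float4x4(diagonal: SIMD4<Float>(1, 2, 3, 4))
let b = simd_float4x4(diagonal: SIMD4<Float>(2, 2, 2, 2))

// The product of the two diagonal matrices is diag(2, 4, 6, 8).
let c = a * b

print(c.columns.0.x, c.columns.1.y)  // 2.0 4.0
```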
If your serial computations can't be parallelized, they might be best handled on the CPU. In that case, we've generally found it best to pull the contents of a Tensor down into a local array (or even an unsafe buffer) once and then iterate over that. Repeatedly pulling slices of a Tensor for local calculations can incur a lot of overhead, so it's best to transfer the data once and work on it from there.
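A minimal sketch of that "transfer once, then iterate locally" pattern: in Swift for TensorFlow, `tensor.scalars` copies a Tensor's contents into a local `[Float]` in a single transfer. To keep the sketch self-contained, a plain placeholder array stands in for that result here.

```swift
// Stand-in for the single transfer, e.g. `let scalars = tensor.scalars`.
let scalars: [Float] = [1.5, -2.0, 3.25, 0.75]

// Iterate over the local copy, optionally through an unsafe buffer
// to avoid per-element bounds checks in a hot loop.
var sum: Float = 0
scalars.withUnsafeBufferPointer { buffer in
    for value in buffer {
        sum += value
    }
}
// All of the per-element work happened on the local copy;
// no repeated Tensor slicing was involved.
```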
Ultimately, you may want to try a few different approaches and profile them to determine where your real bottlenecks are. A lot will come down to the specific calculations you're performing.