Hi Matthew,
Sorry for the late reply, I got a bit swamped recently.
I think the problem is in line 4, in the definition of (*^). You use `the` inside an array computation to get the value out of a scalar array, but that implies nested data parallelism, because the scalar array depends on the index passed in by generate. Accelerate does not handle nested parallelism at all (although I'm not sure why the "possibly nested" warning does not appear… hmm…).
If I have my maths right (always dubious), this might do what you require:
    (*^) :: (Elt e, IsNum e) => Acc (Array DIM1 e) -> Acc (Array DIM2 e) -> Acc (Array DIM1 e)
    (*^) vec mat =
      let (Z :. _ :. cols) = unlift (shape mat) :: Z :. Exp Int :. Exp Int
          vec'             = A.replicate (lift (Z :. All :. cols)) vec  -- copy vec across every column
      in
      fold1 (+) $ transpose $ A.zipWith (*) vec' mat  -- per column, sum over the rows
The trick is that operations like zipWith and fold1 in Accelerate all work on multi-dimensional arrays. We can use replicate to copy the vector across the columns so that it has the same shape as the matrix, multiply the two matrices element-wise, and then use fold1 to reduce along the innermost dimension. That is why we need the transpose: the sum has to run over the rows. (Alternatively, transpose the matrix and replicate in the opposite direction.)
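In case it helps to see the data flow without Accelerate in the picture, here is a minimal sketch of the same pipeline using plain Haskell lists. It is only a model, not Accelerate code: the names replicateCols, vecMat and vecMatT are made up for illustration, the element type is fixed to Int, and sum stands in for fold1 (+).

```haskell
import Data.List (transpose)

type Vector = [Int]
type Matrix = [[Int]]  -- a list of rows

-- Hypothetical helper: models replicating the vector along a new
-- innermost dimension, so element i becomes a whole row of copies.
replicateCols :: Int -> Vector -> Matrix
replicateCols cols vec = [ replicate cols x | x <- vec ]

-- Number of columns, taken from the first row (0 for an empty matrix).
numCols :: Matrix -> Int
numCols []      = 0
numCols (r : _) = length r

-- Variant 1: replicate, multiply element-wise, transpose, then reduce
-- each inner list (sum plays the role of fold1 (+) here).
vecMat :: Vector -> Matrix -> Vector
vecMat vec mat = map sum (transpose (zipWith (zipWith (*)) vec' mat))
  where
    vec' = replicateCols (numCols mat) vec

-- Variant 2: transpose the matrix first and replicate the vector in the
-- opposite direction, so no transpose is needed after the multiply.
vecMatT :: Vector -> Matrix -> Vector
vecMatT vec mat = map sum (zipWith (zipWith (*)) vecRows (transpose mat))
  where
    vecRows = replicate (numCols mat) vec
```

Loading this in GHCi and evaluating vecMat [1,2] [[1,2,3],[4,5,6]] gives [9,12,15], and vecMatT returns the same result. The intermediate value vec' here is [[1,1,1],[2,2,2]], which is the duplicated data mentioned below.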
Currently the replication does actually duplicate the data, which could be a problem if your arrays are large, but I'm working on ways to avoid that.
Hope that helps!
Cheers,
-Trev