I apologize if this is a very basic question; I am very new to TPU/GPU programming. I would like to achieve the following:
- Host to device: copy matrix SD, vector SR, and vector SW.
- On device: calculate vector PD = SD x SW, then calculate scalar X = maximum element of vector PD.
- Device to host: copy scalar X from the previous step.
- On device: calculate scalar Y = SR x SW.
- Device to host: copy scalar Y from the previous step.
In the above:
- matrix SD is of size r x w
- vector SR is of size r
- vector SW is of size w
- r can range from 2 to 10 million
- w can range from 100 to 10,000
- input vector SW consists of floating-point numbers between 0.0 and 1.0
- matrix SD, vectors SR and PD, and output scalars X and Y all consist of floating-point numbers, positive or negative
I would prefer to see this on both TPU and GPU and compare their performance. It would be good if this could be tried out in Colab.
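For reference, here is a minimal sketch of the steps above in JAX, which runs unchanged on CPU, GPU, or TPU (and is preinstalled in Colab). The names SD, SR, SW, PD, X, and Y follow the question; the sizes r=1000, w=100 are small stand-ins, not the real workload. One caveat: as written, SR has length r and SW has length w, so SR x SW is only a scalar dot product if the two lengths match; the sketch therefore gives SR length w, which is an assumption on my part.

```python
import time
import jax
import jax.numpy as jnp

r, w = 1000, 100  # small stand-ins for the real sizes (r up to 10 million)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
SD = jax.random.normal(k1, (r, w))   # matrix of positive/negative floats
SR = jax.random.normal(k2, (w,))     # ASSUMPTION: length w so SR . SW is a scalar
SW = jax.random.uniform(k3, (w,))    # floats in [0.0, 1.0)

@jax.jit
def step1(SD, SW):
    PD = SD @ SW        # vector of length r
    return jnp.max(PD)  # scalar X: maximum element of PD

@jax.jit
def step2(SR, SW):
    return jnp.dot(SR, SW)  # scalar Y

# Warm-up call so compilation time is not mixed into the measurement.
step1(SD, SW).block_until_ready()

t0 = time.perf_counter()
X = float(step1(SD, SW))  # float() forces the device-to-host copy of the scalar
Y = float(step2(SR, SW))
print(f"X = {X}, Y = {Y}, elapsed {time.perf_counter() - t0:.6f} s")
```

To compare backends in Colab, switch the runtime type between GPU and TPU and rerun; because JAX dispatches asynchronously, always call `block_until_ready()` (or convert to a Python float) before reading the clock, otherwise you only time the dispatch, not the computation.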