CuPy is slower than NumPy at emulating dot product with a one-hot vector

Mahesh Abnave (Mahesha999)

Apr 5, 2023, 6:32:10 PM
to CuPy User Group

I was trying to implement a neural network from scratch. However, it is not possible to run NumPy on a GPU, so I tried using CuPy and benchmarked the performance improvement over NumPy.

I have the following code in NumPy:

import time
import numpy as np

emb = 300  # embedding size
m = 2048   # minibatch size
V = 50000  # vocabulary size

# generate m random one-hot vectors
J = np.random.choice(emb, m)
X = np.zeros((m, emb))
for i, j in enumerate(J):
    X[i, j] = 1

W0 = np.random.uniform(-0.8, 0.8, (V, emb)).astype("float32")

# actual computation
start_time = time.time()
for epoch in range(5):
    for mb in range(314):  # number of minibatches per epoch
        h = np.zeros((1), dtype='float32')
        for xi in X.T:
            w0i = np.argmax(xi)
            if not h.any():
                h = W0[w0i]
            else:
                h = np.vstack((h, W0[w0i]))
print("%s seconds" % (time.time() - start_time))

The actual computation part takes 16.55 seconds.
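(For reference: I believe the inner loop above is equivalent to a single fancy-indexing gather. This is just a sketch of the idea, not the code I benchmarked:)

# sketch: one gather instead of the per-column Python loop
# assumes X and W0 are defined as above
idx = X.T.argmax(axis=1)  # position of the (first) 1 in each column of X
h = W0[idx]               # stacks the matching rows of W0 in one call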
I converted the above code into its CuPy equivalent by simply replacing numpy with cupy:

import time
import cupy as cp

# emb, m, V as in the NumPy version above

# generate m random one-hot vectors
J = cp.random.choice(emb, m)
X = cp.zeros((m, emb))
for i, j in enumerate(J):
    X[i, j] = 1

W0 = cp.random.uniform(-0.8, 0.8, (V, emb)).astype("float32")  # V x emb

# actual computation
start_time = time.time()
for epoch in range(5):
    for mb in range(314):  # minibatches
        h = cp.zeros((1), dtype='float32')
        for xi in X.T:  # iterates over the emb columns of X
            w0i = cp.argmax(xi)
            if not h.any():
                h = W0[w0i]  # (1 x emb)
            else:
                h = cp.vstack((h, W0[w0i]))
print("%s seconds" % (time.time() - start_time))


The actual computation part took 2 min 32 s (152.15 seconds in total) to execute. I was expecting it to take far less time than NumPy, but it did not. What am I missing here?
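One thing I was not sure about while timing: as far as I understand, CuPy launches kernels asynchronously, so time.time() can be read before the GPU has actually finished. A sketch of how I think the timing should be wrapped (same loop body as above):

import time
import cupy as cp

cp.cuda.Device().synchronize()  # make sure pending GPU work is done
start_time = time.time()
# ... the CuPy loop from above goes here ...
cp.cuda.Device().synchronize()  # wait for all launched kernels to finish
print("%s seconds" % (time.time() - start_time))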

PS:

  1. You can access the corresponding Colab notebook here.

  2. cupy comes pre-installed in the GPU runtime of Google Colab.

  3. What am I trying to do with this computation? Actually, I am trying to reduce matrix dot product computation time. Above, every column of X is a one-hot vector, so multiplying it by any matrix (W0 above) has the effect of selecting the row of that matrix corresponding to the position where the one-hot vector has its 1. So I am trying to iterate through each vector of X, find the position of the 1 in the vector, fetch the corresponding row from W0, and stack these rows for all m vectors in X. Here is the example for one vector [ref] (see also the small sketch after this list):

[image: worked example of a one-hot vector times a matrix selecting the corresponding row]

  4. The linked Colab notebook shows np.dot takes 1518.75 seconds while the NumPy emulation of the dot product takes 16.55 seconds. So the NumPy emulation is indeed faster than np.dot, but I was expecting CuPy on the GPU to be even faster, and this is not the case: it takes 152.15 seconds.

  5. The linked Colab notebook also shows that the emulation produces the same result as np.dot with smaller matrices (the sketch after this list checks the same thing).
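As a concrete illustration of PS 3 and PS 5, here is a small self-contained sketch with toy sizes (note: the shapes here follow the PS 3 description, with one-hot vectors of length V, and are not the shapes from my benchmark code):

import numpy as np

V, emb, m = 6, 4, 3                # toy sizes, just for illustration
W0 = np.arange(V * emb, dtype="float32").reshape(V, emb)

idx = np.array([2, 0, 5])          # where each one-hot vector has its 1
X = np.zeros((m, V), dtype="float32")
X[np.arange(m), idx] = 1           # m one-hot rows of length V

h_dot = X.dot(W0)                  # ordinary dot product, (m x emb)
h_gather = W0[X.argmax(axis=1)]    # emulation: fetch the selected rows

assert np.allclose(h_dot, h_gather)  # both give the same (m x emb) result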
