I am using Numba to speed up the loop below. Without Numba it takes 135 sec to execute, and with Numba it takes 0.30 sec :) which is very fast.
In the loop I compare each entry of the array against a threshold of 0.85. If the condition is True, I insert the data into a list, which the function returns.
The data inserted into the list looks like this:
``` ['Source ID', 'Source TEXT', 'Similar ID', 'Similar TEXT', 'Score'] ```
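To make the target row format concrete, here is a minimal plain-Python sketch (no Numba) of the comparison on toy data. The function name `find_similar` and the sample IDs, texts, and scores are hypothetical, invented purely for illustration.

```python
def find_similar(ids, texts, cos_sim, threshold):
    """Return [source_id, source_text, similar_id, similar_text, score] rows."""
    rows = []
    n = len(cos_sim)
    for i in range(n):
        # Start at i + 1 so each pair is reported once and self-matches are skipped.
        for j in range(i + 1, n):
            score = cos_sim[i][j]
            if score > threshold:
                rows.append([ids[i], texts[i], ids[j], texts[j], score])
    return rows


# Hypothetical toy data standing in for the real DataFrame columns.
ids = [101, 102, 103]
texts = ["red apple", "red apples", "blue car"]
cos_sim = [
    [1.00, 0.91, 0.12],
    [0.91, 1.00, 0.08],
    [0.12, 0.08, 1.00],
]

rows = find_similar(ids, texts, cos_sim, 0.85)
print(rows)
```

Only the pair (101, 102) exceeds the threshold here, so the result holds a single mixed-type row.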
```
import numba
import numpy as np
from numba.typed import List
from sklearn.metrics.pairwise import cosine_similarity  # assumption: scikit-learn's implementation

idd = df['ID'].to_numpy()
txt = df['TEXT'].to_numpy()
Column = 'TEXT'
df = preprocessing(dataresult, Column)  # remove special characters from the 'TEXT' column
# Pass the text to the Universal Sentence Encoder model to create sentence embeddings.
message_embeddings = model_url(np.array(df['DescriptionNew']))
cos_sim = cosine_similarity(message_embeddings)  # len(cos_sim) > 8000

# The function below finds duplicates among rows.
@numba.jit(nopython=True)
def similarity(nid, txxt, cos_sim, threshold):
    numba_list = List()
    for i in range(cos_sim.shape[0]):
        for index in range(i, cos_sim.shape[1]):
            if (cos_sim[i][index] > threshold) & (i != index):
                numba_list.append([nid[i], nid[index], cos_sim[i][index]])  # this works on its own
                # numba_list.append([txxt[i], txxt[index]])  # this also works on its own
                # numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])  # I want this to work.
    return numba_list

print(similarity(idd, txt, cos_sim, 0.85))
```
In the code above, the append works when the list holds only the numeric columns or only the text columns, but not both. I want all the columns, numbers and text together, to be inserted into ```numba_list```.
I am getting the error below:
```
1 frames
/usr/local/lib/python3.7/dist-packages/numba/core/dispatcher.py in error_rewrite(e, issue_type)
359 raise e
360 else:
--> 361 raise e.with_traceback(None)
362
363 argtypes = []
TypingError: Failed in nopython mode pipeline (step: nopython frontend)
Poison type used in arguments; got Poison<LiteralList((int64, [unichr x 12], int64, [unichr x 12], float32))>
During: resolving callee type: BoundFunction((<class 'numba.core.types.containers.ListType'>, 'append') for ListType[undefined])
During: typing of call at <ipython-input-179-6ee851edb6b1> (14)
File "<ipython-input-179-6ee851edb6b1>", line 14:
def zero(nid, txxt, cos_sim, threshold):
<source elided>
# print(i+1)
numba_list.append([nid[i], txxt[i], nid[index], txxt[index], cos_sim[i][index]])
^
```
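The `Poison<LiteralList(...)>` message suggests that Numba could not assign a single element type to a list mixing `int64`, strings, and `float32`. Numba typed lists are homogeneous. One possible way around this, offered as a sketch rather than a verified fix for the 8000-row data, is to let the jitted function return only index and score arrays and to build the mixed-type rows in ordinary Python afterwards. The function name `similar_pairs` and the toy data are hypothetical. The `njit` fallback exists only so the snippet still runs where Numba is not installed.

```python
import numpy as np

try:
    from numba import njit
except ImportError:  # fallback so the sketch still runs without Numba
    def njit(func):
        return func


@njit
def similar_pairs(cos_sim, threshold):
    n = cos_sim.shape[0]
    # First pass: count matches so the output arrays can be allocated once.
    count = 0
    for i in range(n):
        for j in range(i + 1, n):
            if cos_sim[i, j] > threshold:
                count += 1
    rows = np.empty(count, dtype=np.int64)
    cols = np.empty(count, dtype=np.int64)
    scores = np.empty(count, dtype=np.float64)
    # Second pass: fill the homogeneous arrays.
    k = 0
    for i in range(n):
        for j in range(i + 1, n):
            if cos_sim[i, j] > threshold:
                rows[k] = i
                cols[k] = j
                scores[k] = cos_sim[i, j]
                k += 1
    return rows, cols, scores


# Hypothetical toy data standing in for idd, txt and cos_sim.
ids = np.array([101, 102, 103])
texts = np.array(["red apple", "red apples", "blue car"])
cos_sim = np.array([
    [1.0, 0.9, 0.1],
    [0.9, 1.0, 0.2],
    [0.1, 0.2, 1.0],
])

rows, cols, scores = similar_pairs(cos_sim, 0.85)
# Mixed-type rows are assembled outside the jitted function.
result = [[ids[i], texts[i], ids[j], texts[j], s]
          for i, j, s in zip(rows, cols, scores)]
print(result)
```

The heavy double loop stays inside compiled code, while the final list comprehension only touches the pairs that passed the threshold.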