Hello! I'm trying to replace an existing matlab code with julia and I'm having trouble matching the performance of the original code. The matlab code is here:
https://github.com/jotok/InventorDisambiguator/blob/julia/Disambig.mThe program clusters inventors from a database of patent applications. The input data is a sparse boolean matrix (named XX in the script), where each row defines an inventor and each column defines a feature. For example, the jth column might correspond to a feature "first name is John". If there is a 1 in the XX[i, j], this means that inventor i's first name is John. Given an inventor i, we find similar inventors by identifying rows in the matrix that agree with XX[i, :] on a given column and then applying element-wise boolean operations to the rows. In the code, for a given value of `index`, C_lastname holds the unique column in XX corresponding to a "last name" feature such that XX[index, :] equals 1. C_firstname holds the unique column in XX corresponding to a "first name" feature such that XX[index, :] equals 1. And so on. The following code snippet finds all rows in the matrix that agree with XX[index, :] on full name and one of patent assignee name, inventory city, or patent class:
lump_index_2 = step & ((C_assignee | C_city | C_class))
The `step` variable is an indicator that's used to prevent the same inventors from being considered multiple times. My attempt at a literal translation of this code to julia is here:
https://github.com/jotok/InventorDisambiguator/blob/julia/disambig.jlThe matrix X is of type SparseMatrixCSC{Int64, Int64}. Boolean operations aren't supported for sparse matrices in julia, so I fake it with integer arithmetic. The line that corresponds to the matlab code above is
lump_index_2 = find(step .* (C_name .* (C_assignee + C_city + C_class)))
The reason I grouped it this way is that initially `step` will be a "sparse" vector of all 1's, and I thought it might help to do the truly sparse arithmetic first.
I've been testing this code on a Windows 2008 Server. The test data contains 45,763 inventors and 274,578 possible features (in other words, XX is an 45,763 x 274,58 sparse matrix). The matlab program consistently takes about 70 seconds to run on this data. The julia version shows a lot of variation: it's taken as little as 60 seconds and as much as 10 minutes. However, most runs take around 3.5 to 4 minutes. I pasted one output from the sampling profiler here [1]. If I'm reading this correctly, it looks like the program is spending most of its time performing element-wise multiplication of the indicator vectors I described above.
I would be grateful for any suggestions that would bring the performance of the julia program in line with the matlab version. I've heard that the last time the matlab code was run on the full data set it took a couple days, so a slow-down of 3-4x is a signficant burden. I did attempt to write a more idiomatic julia version using Dicts and Sets, but it's slower than the version that uses sparse matrix operations:
https://github.com/jotok/InventorDisambiguator/blob/julia/disambig2.jlThank you!
Josh
[1]
https://gist.github.com/jotok/6b469a1dc0ff9529caf5