NumPy N-Array Intersection on GPU


Ben Scully

Mar 30, 2018, 11:11:23 AM
To: Numba Public Discussion - Public
A bottleneck in my code is finding the index intersection of N arrays, millions of times. Each intersection is a simple NumPy calculation with np.intersect1d, but running it millions of times takes a toll.

An example:

arr1 = [0,1,2,3,4]
arr2 = [0,3,4]
arr3 = [3,4]
intersection is [3,4]

I would like to leverage GPU threads but am struggling with the implementation.

Here is the Python code:


import functools, datetime
import numpy as np


def run():
    """
    Create fake-data variable `grouped_data` which is a list of 100k entries.
    Each element has 3 numpy arrays that are UNIQUE AND SORTED.

    Goal: iterate through `grouped_data` to find intersecting values per element.
    Ie, length of output equals length of input, `grouped_data`.
    In each element, these common values will be used to slice another numpy
    array which is not included here.
    """
    grouped_data = create_data()                               # 9% of runtime
    overlap = loop_through_intersections(grouped_data)         # 91% of runtime


def create_data():
    """ Return `grouped_data`, a list of 100k entries. Each element has 3 numpy arrays.
    kern profiler shows this function takes ~9% of runtime """
    array = np.array(range(2000))
    grouped_data = []
    for i in range(100000):
        ar1 = array[::np.random.randint(1, 9)]
        ar2 = array[::np.random.randint(1, 9)]
        ar3 = array[::np.random.randint(1, 9)]
        grouped_data.append([ar1, ar2, ar3])
    return grouped_data


def loop_through_intersections(grouped_data):
    """ For each element in grouped_data (3 numpy arrays), find the intersecting values.
    kern profiler shows this function takes ~91% of runtime """
    overlap = []
    for f in grouped_data:
        overlap.append(functools.reduce(intersect1d, f))
    return overlap


def intersect1d(ar1, ar2):
    """
    Find the intersection of two arrays.
    Return the sorted, unique values that are in both of the input arrays.

    Taken from NumPy:
    https://github.com/numpy/numpy/blob/v1.14.0/numpy/lib/arraysetops.py#L297-L338
    """
    aux = np.concatenate((ar1, ar2))
    aux.sort()
    return aux[:-1][aux[1:] == aux[:-1]]


####################################################
#            Runtime takes ~6s
####################################################

st = datetime.datetime.now()
run(); print(datetime.datetime.now() - st)





How can I use Numba?

Chris Uchytil

Mar 30, 2018, 12:20:07 PM
To: numba...@continuum.io
Are you always comparing 3 lists together or are you comparing M sets of N lists?


Ben Scully

Mar 30, 2018, 2:13:24 PM
To: Numba Public Discussion - Public
M sets of N lists.  This example just shows N=3.

I converted my code to handle only NumPy arrays and ints, i.e. I removed the Python objects.
Now @jit works, but sending it to the GPU has no benefit; in fact it slows everything down considerably.

Here is the latest code:

import datetime
import numpy as np
from numba import jit
from numba import cuda


def run():
    """
    var `matrix` now stores all index sets in groups of three (but N in general, outside of this example).
    var `bools` stores overlapping indexes.
    """
    rows = 2000
    iterations = 10000

    matrix = create_data(rows, iterations)
    bools = np.zeros((rows, iterations), dtype=bool, order='F')

    iterate_over_matrix(matrix, iterations, rows, bools)


def create_data(rows, iterations):
    """ Return `matrix`, a boolean array built from `iterations` groups of 3 index arrays. """
    array = np.array(range(rows))
    grouped_data = []
    for i in range(iterations):
        for j in range(3):
            index = np.zeros(rows, dtype=bool)
            index[array[::np.random.randint(1, 9)]] = True
            grouped_data.append(index)

    matrix = np.array(np.array(grouped_data).T, order='F')
    return matrix


@jit
#@jit(nopython=True, nogil=True, cache=True)
def iterate_over_matrix(matrix, iterations, rows, bools):
    """
    Since N sets = 3, take three columns at a time from matrix.

    Params:
        :matrix, (bool np.array) matrix of CCS indexes.
                 rows = timeseries rows.  cols = # exhaustive * 3.
        :iterations, (int) number of iterations to run through.
        :rows, (int) length of data.
        :bools, (bool np.array) empty matrix used as a placeholder for the output.
    """
    for i in range(iterations):
        arr = matrix[:, i*3:(i*3 + 3)]
        check_intersection(bools[:, i], arr[:, 0], arr[:, 1], arr[:, 2], rows)


@jit
#@jit(target='gpu')
#@cuda.jit
#@jit(nopython=True, nogil=True, cache=True)
def check_intersection(index, ar1, ar2, ar3, rows):
    """
    Iterate through array indexes.  If all three are True, set the output to True.
    """
    for i in range(rows):
        if ar1[i] == ar2[i] == ar3[i] == True:
            index[i] = True


####################################################
#            Runtime
####################################################

st = datetime.datetime.now()
run(); print(datetime.datetime.now() - st)




What can I do to leverage my GPU via Numba?




Chris

Mar 30, 2018, 5:12:26 PM
To: Numba Public Discussion - Public
If the number of input arrays varies, you can't really do the intersection all in one go unless you create an intersection function for every possible N. If you want to use the GPU I would go with CUDA (which is available within Numba), although if you are unfamiliar with writing CUDA code this should still work with a GPU target under JIT. As for the speedup/slowdown: the memory has to be transferred to the GPU over the PCI bus, so if the arrays are small the transfer time will greatly exceed the computation time and you won't see any benefit from using the GPU.

As for implementations, what I would do personally is create an intersection kernel that takes in 2 arrays and generates their intersection. Pair up the N arrays so you run the kernel N//2 times. These kernel calls can run asynchronously. Each thread in the kernel will be assigned to a value in array 1 and will search for a matching value in array 2. The size of the output array that holds the intersections isn't known ahead of time, so you will need to allocate an array the size of the smaller of the two arrays and keep a counter alongside it that indicates how many values in the output are meaningful. The counter is needed because, unless all of array 2 is in array 1, the output array will contain some amount of garbage values. This generates N//2 output arrays. Synchronize the GPU after this so you know that each of the N//2 arrays is filled with its respective intersection. Iterate this process until you have only 2 arrays left, which produce the final intersection. This treats the intersection as somewhat of a reduction problem; a rough sketch of such a pairwise kernel follows the steps below.

1) You start with N arrays
2) run kernel first time on N arrays to generate N//2 intersections
3) cuda.synchronize()
4) run kernel second time on N//2 arrays to generate N//4 intersections
5) cuda.synchronize()
...
continue until you are left with a single output array.
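
To make that concrete, here is a minimal sketch of such a pairwise kernel in Numba CUDA. It is not from the thread: the names intersect_kernel and pairwise_intersect are illustrative, and it assumes sorted, unique inputs as in the original example.

import numpy as np
from numba import cuda


@cuda.jit
def intersect_kernel(ar1, ar2, out, count):
    # one thread per element of ar1; binary-search for it in (sorted) ar2
    i = cuda.grid(1)
    if i < ar1.shape[0]:
        v = ar1[i]
        lo = 0
        hi = ar2.shape[0] - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            if ar2[mid] == v:
                # reserve a slot in the output via an atomic counter
                slot = cuda.atomic.add(count, 0, 1)
                out[slot] = v
                break
            elif ar2[mid] < v:
                lo = mid + 1
            else:
                hi = mid - 1


def pairwise_intersect(ar1, ar2, threads_per_block=128):
    # output is at most as long as the smaller input; `count` says how much of it is valid
    d_ar1, d_ar2 = cuda.to_device(ar1), cuda.to_device(ar2)
    d_out = cuda.device_array(min(ar1.size, ar2.size), dtype=ar1.dtype)
    d_count = cuda.to_device(np.zeros(1, dtype=np.int64))
    blocks = (ar1.size + threads_per_block - 1) // threads_per_block
    intersect_kernel[blocks, threads_per_block](d_ar1, d_ar2, d_out, d_count)
    n = int(d_count.copy_to_host()[0])
    return np.sort(d_out.copy_to_host()[:n])   # atomics make the write order nondeterministic

The reduction scheme above would launch this kernel on pairs of arrays, call cuda.synchronize(), and repeat on the N//2 results until one array remains.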


Rob Ennis

Apr 1, 2018, 4:00:59 PM
To: Numba Public Discussion - Public
Hi

Whilst not targeting the GPU, and therefore potentially off topic, the following may offer some kind of speedup (about 3x on my machine), though mileage may vary. It's nothing revolutionary, but I thought it might be of interest.

I'd be interested to hear if you get a solution up and running using CUDA; it would be great if you could update this thread if so.

Cheers

Rob

import functools
import logging
import time

import numpy as np
from numba import njit


def timeit(comment):
    def timed(fn):
        def wrapper(*args, **kw):
            start = time.perf_counter()
            result = fn(*args, **kw)
            end = time.perf_counter()
            logging.info('{0}: {1:.2f}s'.format(comment, end - start))
            return result
        return wrapper
    return timed


@timeit('run')
def run(grouped_data):
    overlap = loop_through_intersections(grouped_data)
    return overlap


@timeit('run_version2')
def run_version2(grouped_data):
    overlap = []

    for array_collection in grouped_data:
        overlap.append(fast_intersect(array_collection))

    return overlap


@timeit('create_data')
def create_data(rng):
    array = np.array(range(2000))
    grouped_data = []

    for i in range(100000):
        ar1 = np.ascontiguousarray(array[::rng.randint(1, 9)])
        ar2 = np.ascontiguousarray(array[::rng.randint(1, 9)])
        ar3 = np.ascontiguousarray(array[::rng.randint(1, 9)])
        grouped_data.append(tuple((ar1, ar2, ar3)))

    return grouped_data


def loop_through_intersections(grouped_data):
    overlap = []
    for f in grouped_data:
        overlap.append(functools.reduce(intersect1d, f))
    return overlap


def intersect1d(ar1, ar2):
    aux = np.concatenate((ar1, ar2))
    aux.sort()
    return aux[:-1][aux[1:] == aux[:-1]]


@njit
def fast_intersect(array_collection):

    for idx in range(len(array_collection)):
        arr_set = set(array_collection[idx])
        if idx == 0:
            res = arr_set
        else:
            res = res.intersection(arr_set)

    return np.array(list(res))


if __name__ == '__main__':

    # set logging level to INFO
    logging.basicConfig(level=logging.INFO)

    # generate one lot of grouped_data
    rng = np.random.RandomState(0)
    grouped_data = create_data(rng)

    # generate output via original approach
    output = run(grouped_data)

    # jit compile
    _ = fast_intersect(grouped_data[0])

    # generate output via second approach
    output_v2 = run_version2(grouped_data)

    # some check and balances
    assert len(output) == len(output_v2)

    for idx in range(len(output)):
        s0 = set(output[idx])
        s1 = set(output_v2[idx])
        assert s0 == s1

Ben Scully

Apr 3, 2018, 12:01:32 PM
To: Numba Public Discussion - Public
Chris & Rob, thank you guys for analyzing my question.   
I really appreciate your time.


Currently I'm in the process of learning CUDA and creating a custom kernel to solve this problem.
There's a lot to learn, so my solution (i.e. my response here) will take a bit of time, maybe a week.

A few things I'm considering:
- With Numba, move my matrix to the GPU once, then have many threads access it.
- Instead of intersecting the arrays, I will multiply the columns. Since the matrix values in the updated code are 0/1, the result will be the same (a small illustration follows below).
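
A small illustration of that second point, assuming the 0/1 index-column layout of the updated code (the variable names here are made up for the example):

import numpy as np

rows = 2000
idx = np.arange(rows)

# three 0/1 index columns, analogous to the boolean columns in the updated code
col1 = np.zeros(rows, dtype=np.float32)
col2 = np.zeros(rows, dtype=np.float32)
col3 = np.zeros(rows, dtype=np.float32)
col1[idx[::2]] = 1
col2[idx[::3]] = 1
col3[idx[::4]] = 1

product = col1 * col2 * col3            # 1 only where all three columns are 1
common = np.flatnonzero(product)        # recover the common indexes

# matches the set-style intersection of the underlying index arrays
expected = np.intersect1d(np.intersect1d(idx[::2], idx[::3]), idx[::4])
assert np.array_equal(common, expected)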

Ben Scully

Apr 4, 2018, 5:40:03 PM
To: Numba Public Discussion - Public
Below is a GPU kernel solution. A good learning experience.
The GPU code is up to 5x faster than Numba in nopython mode. Not as great as I wanted... I could still optimize block and grid sizes, but I will leave that for now.


import numpy as np
import datetime
from numba import njit

from pycuda import driver, compiler, gpuarray, tools
import pycuda.autoinit


def compare(rows, iterations):
    """
    Run CPU & GPU version.  Compare output.
    Creates a binary matrix called a_cpu which represents a dataset.
    The goal is to take 3 columns at a time and, if all are 1, pass 1
    to the output matrix.
    """
    np.random.seed(42)
    a_cpu = np.random.randint(0, 2, (rows, iterations*3)).astype(np.float32)

    st = datetime.datetime.now()
    cpu = np.zeros((rows, iterations), dtype=int)
    iterate_over_matrix(a_cpu, iterations, rows, cpu)
    print('\n\t CPU runtime: ', datetime.datetime.now() - st)

    st = datetime.datetime.now()
    gpu = cuda_attempt(rows, iterations, a_cpu)
    print('\n\t GPU runtime: ', datetime.datetime.now() - st)

    print("cpu.sum(): {:,}".format(cpu.sum()))
    print("gpu.sum(): {:,}".format(int(gpu.sum())))


def get_kernel_code(iterations):
    kernel_code = """
    __global__ void MatrixMulKernel(int ROWS, float *A, float *C)
    {
        const int wC = %(C_SIZE)s;

        const int blockId = blockIdx.y * gridDim.x + blockIdx.x;
        const int thread = blockId * blockDim.x + threadIdx.x;

        if ( thread < (ROWS * wC) ) {
            float Aele = A[3*thread] * A[3*thread + 1] * A[3*thread + 2];
            C[thread] = Aele;
        }
    }
    """

    kernel_code = kernel_code % {
        'A_SIZE': 3*iterations,
        'C_SIZE': iterations,
    }
    return kernel_code


def cuda_attempt(rows, iterations, a_cpu):
    """
    Create data, use gpuarray, get pycuda result.
    """
    a_gpu = gpuarray.to_gpu(a_cpu)
    c_gpu = gpuarray.empty((rows, iterations), np.float32)

    kernel_code = get_kernel_code(iterations)
    mod = compiler.SourceModule(kernel_code)
    matrixmul = mod.get_function("MatrixMulKernel")

    # 2D grid of 1D blocks
    needed_threads = rows * iterations
    threads = 1024
    number_blocks = needed_threads // threads + 1
    number_blocks = int(np.sqrt(number_blocks)) + 1

    assert (number_blocks <= 65535), "number of blocks exceeds allowed limit in 1 dimension"

    grid = (number_blocks, number_blocks)
    block = (threads, 1, 1)

    matrixmul(
        np.int32(rows), a_gpu, c_gpu,
        grid=grid,
        block=block,
    )
    return c_gpu.get()


#===============================================================================
#                    CPU CALCULATIONS
#===============================================================================

@njit
def iterate_over_matrix(matrix, iterations, rows, bools):
    for i in range(iterations):
        arr = matrix[:, i*3:(i*3 + 3)]
        check_intersection(bools[:, i], arr[:, 0], arr[:, 1], arr[:, 2], rows)


@njit
def check_intersection(index, ar1, ar2, ar3, rows):
    for i in range(rows):
        if ar1[i] == ar2[i] == ar3[i] == True:
            index[i] = True


#===============================================================================
#                    RUN
#===============================================================================

rows = 5
iterations = 2

rows = 2000
iterations = 100000

compare(rows, iterations)
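
For reference (not part of the post above), the same elementwise kernel could also be written directly with Numba's cuda.jit rather than PyCUDA. A minimal sketch with illustrative names, assuming the same C-contiguous float32 layout for a_cpu:

import numpy as np
from numba import cuda


@cuda.jit
def triple_and_kernel(a, c):
    # one thread per output element: c[t] = a[3t] * a[3t+1] * a[3t+2]
    t = cuda.grid(1)
    if t < c.size:
        c[t] = a[3*t] * a[3*t + 1] * a[3*t + 2]


def numba_cuda_attempt(rows, iterations, a_cpu):
    # flatten the row-major (rows, iterations*3) matrix so that three consecutive
    # elements correspond to the three columns of one group, as in the PyCUDA kernel
    a_gpu = cuda.to_device(np.ascontiguousarray(a_cpu).ravel())
    c_gpu = cuda.device_array(rows * iterations, dtype=np.float32)
    threads = 1024
    blocks = (rows * iterations + threads - 1) // threads
    triple_and_kernel[blocks, threads](a_gpu, c_gpu)
    return c_gpu.copy_to_host().reshape(rows, iterations)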


Ehsan Totoni

Apr 6, 2018, 7:04:15 PM
To: numba...@continuum.io
Have you tried Numba's automatic parallelism on CPU?

-Ehsan
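
(For what it's worth, on the boolean-matrix version of the code that could look roughly like the sketch below. @njit(parallel=True) and prange are standard Numba features, but this particular snippet is only an illustration, not something from the thread.)

import numpy as np
from numba import njit, prange


@njit(parallel=True)
def intersect_columns_parallel(matrix, iterations, rows, bools):
    # each group of three columns is independent, so prange lets Numba
    # distribute the outer loop across the available CPU cores
    for i in prange(iterations):
        for r in range(rows):
            if matrix[r, i*3] and matrix[r, i*3 + 1] and matrix[r, i*3 + 2]:
                bools[r, i] = True

After a first call to trigger compilation, subsequent calls split the outer loop over CPU cores with no GPU transfer cost at all.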
