multidimensional ranking algorithm

30 views

Skip to first unread message

Arne Bøckmann

unread,

Mar 4, 2024, 5:03:22 PMMar 4

to pystatsmodels

As a hobby project, I've implemented a python version of Bentley's multidimensional divide and conquer ranking algorithm; it can be used as an alternative to _ecdf_mv for combinations of length and dimensionality where it outperforms the 'seq' variant; e.g. for 500,000 2D points, it will be around an order of magnitude faster. Feel free to use it any way you please, if you find it useful:

import numpy as np

def _rank2(points, mask=None):
    N = points.shape[0]
    N2 = N//2
    if N == 1:
        return 0
    else:
        idx = np.argpartition(points[:,0], N2)
        idxA_ = idx[:N2]
        idxA = np.zeros(N, dtype=bool)
        idxA[idxA_] = True
        if mask is not None:
            NAm = np.sum(idxA & mask)
            points_reduced = np.vstack((points[idxA & mask], points[~idxA & ~mask]))
        else:
            NAm = np.sum(idxA)
            points_reduced = np.vstack((points[idxA], points[~idxA]))
        count_points = np.zeros(points_reduced.shape[0], dtype=bool)
        count_points[:NAm] = True
        idxY = np.argsort(points_reduced[:,1])
        idxYr = np.zeros_like(idxY)
        idxYr[idxY] = np.arange(idxY.shape[0]) # inverse of idxY
        count_points = count_points[idxY]
        numA = np.cumsum(count_points)[idxYr]
        rank = np.zeros(N, dtype=int)
        if mask is not None:
            rank[idxA] = _rank2(points[idxA], mask[idxA])
            rank[~idxA] = _rank2(points[~idxA], mask[~idxA])
            rank[~idxA & ~mask] += numA[NAm:]
        else:
            rank[idxA] = _rank2(points[idxA])
            rank[~idxA] = _rank2(points[~idxA])
            rank[~idxA] += numA[NAm:]
        return rank

def rankn(points, mask=None):
    N = points.shape[0]
    N2 = N//2
    if mask is None:
        mask = np.ones(N, dtype=bool)
        first_call = True
    else:
        first_call = False
    if N == 1:
        return 0
    if points.shape[1] == 2:
        if first_call:
            return _rank2(points)
        else:
            return _rank2(points, mask)
    idx = np.argpartition(points[:,0], N2)
    idxA_ = idx[:N2]
    idxA = np.zeros(N, dtype=bool)
    idxA[idxA_] = True
    rank = np.zeros(N, dtype=int)
    rank[idxA] = rankn(points[idxA], mask[idxA])
    rank[~idxA] = rankn(points[~idxA], mask[~idxA]) + rankn(points[:,1:], idxA*mask)[~idxA]
    return rank

points = np.random.random((500000, 2))
ranks = rankn(points)

josef...@gmail.com

unread,

Mar 4, 2024, 5:08:23 PMMar 4

to pystat...@googlegroups.com

Hi Arne,

Thanks for this, I usually don't spend a large amount of time implementing higher performance algorithms.

It looks familiar, but can you provide some context, e.g. how it is used.

AFAIR, I used something like this for the multivariate distributions, but don't remember any details.

Josef

--
You received this message because you are subscribed to the Google Groups "pystatsmodels" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pystatsmodel...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/pystatsmodels/3a7381c6-4725-4735-a329-fd9ca63321c9n%40googlegroups.com.

Reply all

Reply to author

Forward

0 new messages