[SciPy-User] Finding closest point in array - inverse of KDTree

classic Classic list List threaded Threaded
7 messages Options
Reply | Threaded
Open this post in threaded view
|

[SciPy-User] Finding closest point in array - inverse of KDTree

Antonio Polino
Hello all, 


I have a very large ndarray A, and a sorted list of points k (a small list, about 30 points).

For every element of A, I want to determine the closest element in the list of points k, together with the index. So something like:

    >>> A = np.asarray([3, 4, 5, 6])
    >>> k = np.asarray([4.1, 3])
    >>> values, indices
    [3, 4.1, 4.1, 4.1], [1, 0, 0, 0]

Now, the problem is that A is very very large. So I can't do something inefficient like adding one dimension to A, take the abs difference to k, and then take the minimum of each column. 

For now I have been using np.searchsorted, as shown in the second answer here: https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array but even this is too slow. 

I thought of using scipy.spatial.KDTree:

    >>> d = scipy.spatial.KDTree(k)
    >>> d.query(A)

This turns out to be much slower than the searchsorted solution.

On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like a "inverse KDTree") on A, and then query the results on the small array k. 

Is there something like that?

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Finding closest point in array - inverse of KDTree

Daπid
On 11 October 2017 at 19:01, Antonio Polino <[hidden email]> wrote:
On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like a "inverse KDTree") on A, and then query the results on the small array k. 

You can sort A and use searchsorted to find the positions of the midpoints of k. Then, you assign the values to the group.

In your example (adding one more and sorting k):

    >>> A = np.asarray([3, 4, 5, 6])
    >>> k = np.asarray([3, 4.1, 5.2])
    >>> midpoints = np.array([ 3.55,  4.65])   #  k[:-1] + np.diff(k) / 2
    >>> np.searchsorted(A, midpoints)
    array([1, 2])

So, this means that A[:1] belong to k[0], A[1:2] belong to k[2] and A[2:] belong to k[3]


/David.

PS: in your later email I see your name as Ant.

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Finding closest point in array - inverse of KDTree

Xavier Barthelemy-2
Hi Antonio,

I think yo are looking for the scipy.spatial.distance_matrix function

I used it inside a wave crest tracking algorithm  (on probably a much smaller matrix than your), but unfortunately, there is no magical method to reduce significantly the computation time.


Have a go at it

Xavier

2017-10-12 7:13 GMT+11:00 Daπid <[hidden email]>:
On 11 October 2017 at 19:01, Antonio Polino <[hidden email]> wrote:
On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like a "inverse KDTree") on A, and then query the results on the small array k. 

You can sort A and use searchsorted to find the positions of the midpoints of k. Then, you assign the values to the group.

In your example (adding one more and sorting k):

    >>> A = np.asarray([3, 4, 5, 6])
    >>> k = np.asarray([3, 4.1, 5.2])
    >>> midpoints = np.array([ 3.55,  4.65])   #  k[:-1] + np.diff(k) / 2
    >>> np.searchsorted(A, midpoints)
    array([1, 2])

So, this means that A[:1] belong to k[0], A[1:2] belong to k[2] and A[2:] belong to k[3]


/David.

PS: in your later email I see your name as Ant.

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user




--
 « Quand le gouvernement viole les droits du peuple, l'insurrection est, pour le peuple et pour chaque portion du peuple, le plus sacré des droits et le plus indispensable des devoirs »

Déclaration des droits de l'homme et du citoyen, article 35, 1793

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Finding closest point in array - inverse of KDTree

Robert Kern-2
In reply to this post by Antonio Polino
On Wed, Oct 11, 2017 at 10:01 AM, Ant <[hidden email]> wrote:

>
> Hello all,
>
> I have the same question I posted on stack overflow: https://stackoverflow.com/questions/46693557/finding-closest-point-in-array-inverse-of-kdtree
>
> I have a very large ndarray A, and a sorted list of points k (a small list, about 30 points).
>
> For every element of A, I want to determine the closest element in the list of points k, together with the index. So something like:
>
>     >>> A = np.asarray([3, 4, 5, 6])
>     >>> k = np.asarray([4.1, 3])
>     >>> values, indices
>     [3, 4.1, 4.1, 4.1], [1, 0, 0, 0]
>
> Now, the problem is that A is very very large. So I can't do something inefficient like adding one dimension to A, take the abs difference to k, and then take the minimum of each column.
>
> For now I have been using np.searchsorted, as shown in the second answer here: https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array but even this is too slow.
>
> I thought of using scipy.spatial.KDTree:
>
>     >>> d = scipy.spatial.KDTree(k)
>     >>> d.query(A)
>
> This turns out to be much slower than the searchsorted solution.
>
> On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like a "inverse KDTree") on A, and then query the results on the small array k.
>
> Is there something like that?

The KDTree and BallTree implementations in scikit-learn have implementations for querying with other trees. Unfortunately, these implementations are hidden behind an interface that builds the query tree on demand and then throws it away. You'd have to subclass in Cython and expose the `dualtree` implementations as a Python-exposed method.


_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Finding closest point in array - inverse of KDTree

Antonio Polino
@Daπid
Thank you! This indeed is twice as fast as doing it normally (not counting the fixed time of sorting A, of course). 
I would still like to speed it up, getting another 2x  speedup. This is my code, please tell me if you have any suggestions!

*Preprocessing part* #it can be slow, it is not repeated
indices_sort = np.argsort(A)
sortedA = A[indices_sort]
inv_indices_sort = np.argsort(indices_sort)

*Repeated part

midpoints = k[:-1] + np.diff(k)/2
idx_aux = np.searchsorted(sortedA, midpoints)
idx = []
count = 0
final_indices = np.zeros(sortedA.shape, dtype=int)
old_obj = None
for obj in idx_aux:
if obj != old_obj:
idx.append((obj, count))
old_obj = obj
count += 1
old_idx = 0
for idx_A, idx_k in idx:
final_indices[old_idx:idx_A] = idx_k
old_idx = idx_A



final_indices[old_idx:] = len(k)-1

indicesClosest = final_indices[inv_idx_sort]



Thank you!

@Xavier

Thank you for your answer, but won’t spatial distance be too slow? If it computes the distances from every point of A to every point of k it is computing a lot of unnecessary things.

@Robert

Thank you for the link! Do you think that it will offer signifiant advantages over the search sorted solution? I ask because I have never written anything in Cython (I wouldn’t know where to start, to be fair) so I am a little reluctant to start messing with scipy internal code :)







On 12 Oct 2017, 01:26 +0200, Robert Kern <[hidden email]>, wrote:
On Wed, Oct 11, 2017 at 10:01 AM, Ant <[hidden email]> wrote:
>
> Hello all,
>
> I have the same question I posted on stack overflow: https://stackoverflow.com/questions/46693557/finding-closest-point-in-array-inverse-of-kdtree
>
> I have a very large ndarray A, and a sorted list of points k (a small list, about 30 points).
>
> For every element of A, I want to determine the closest element in the list of points k, together with the index. So something like:
>
>     >>> A = np.asarray([3, 4, 5, 6])
>     >>> k = np.asarray([4.1, 3])
>     >>> values, indices
>     [3, 4.1, 4.1, 4.1], [1, 0, 0, 0]
>
> Now, the problem is that A is very very large. So I can't do something inefficient like adding one dimension to A, take the abs difference to k, and then take the minimum of each column.
>
> For now I have been using np.searchsorted, as shown in the second answer here: https://stackoverflow.com/questions/2566412/find-nearest-value-in-numpy-array but even this is too slow.
>
> I thought of using scipy.spatial.KDTree:
>
>     >>> d = scipy.spatial.KDTree(k)
>     >>> d.query(A)
>
> This turns out to be much slower than the searchsorted solution.
>
> On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like a "inverse KDTree") on A, and then query the results on the small array k.
>
> Is there something like that?

The KDTree and BallTree implementations in scikit-learn have implementations for querying with other trees. Unfortunately, these implementations are hidden behind an interface that builds the query tree on demand and then throws it away. You'd have to subclass in Cython and expose the `dualtree` implementations as a Python-exposed method.

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Finding closest point in array - inverse of KDTree

Antonio Polino
In reply to this post by Daπid
Sorry, I included your reply into Robert. I am a little confused by this system :D

On 11 Oct 2017, 22:15 +0200, Daπid <[hidden email]>, wrote:
On 11 October 2017 at 19:01, Antonio Polino <[hidden email]> wrote:
On the other hand, the array A is always the same, only k changes. So it would be beneficial to use some auxiliary structure (like a "inverse KDTree") on A, and then query the results on the small array k. 

You can sort A and use searchsorted to find the positions of the midpoints of k. Then, you assign the values to the group.

In your example (adding one more and sorting k):

    >>> A = np.asarray([3, 4, 5, 6])
    >>> k = np.asarray([3, 4.1, 5.2])
    >>> midpoints = np.array([ 3.55,  4.65])   #  k[:-1] + np.diff(k) / 2
    >>> np.searchsorted(A, midpoints)
    array([1, 2])

So, this means that A[:1] belong to k[0], A[1:2] belong to k[2] and A[2:] belong to k[3]


/David.

PS: in your later email I see your name as Ant.
_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user

_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user
Reply | Threaded
Open this post in threaded view
|

Re: Finding closest point in array - inverse of KDTree

Daπid
In reply to this post by Antonio Polino


On 12 October 2017 at 15:18, Ant <[hidden email]> wrote:
@Daπid
Thank you! This indeed is twice as fast as doing it normally (not counting the fixed time of sorting A, of course). 
I would still like to speed it up, getting another 2x  speedup. This is my code, please tell me if you have any suggestions!


As always, when optimising you must profile. For A of size 3000000 and k of size 30, this is what I get:


89% of the time is being spent on this line (and it gets worse as you increase the size of A):


indicesClosest = final_indices[inv_idx_sort]


I don't know from the top of my head of a faster way of doing this, so, can you somehow adapt your problem to use your sorted indexes of A instead? This you can very easily rewrite unrolled in Cython, I think you can scrape a bit of time there.


_______________________________________________
SciPy-User mailing list
[hidden email]
https://mail.python.org/mailman/listinfo/scipy-user