2-pass k-means: first do a random sample of sqrt(N) ?


2-pass k-means: first do a random sample of sqrt(N) ?

denis-bz-gg
Folks,
  yet another k-means initializer:
- first run k-means on a random sample of, say, sqrt(N) of the points
- then run full k-means starting from those centres.

How well this works depends, of course, on how well a sqrt(N) sample
approximates the whole, as well as on N, dim, k, maxiter, delta,
phase of the moon ...
The sqrt(N) is a heuristic.
(In principle one could recurse for very large N.)
One could also tighten delta etc. for the first pass.
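
A minimal sketch of the idea, using today's sklearn.cluster.KMeans API
rather than the 2011 scikits.learn one (the function name and parameters
here are illustrative assumptions, not my actual harness):

    import numpy as np
    from sklearn.cluster import KMeans

    def two_pass_kmeans(X, k, seed=None):
        """Pass 1: k-means on a random sqrt(N) sample of the points.
        Pass 2: full k-means seeded with the pass-1 centres."""
        rng = np.random.default_rng(seed)
        n = len(X)
        m = max(k, int(np.sqrt(n)))                      # heuristic sample size
        sample = X[rng.choice(n, size=m, replace=False)]
        pass1 = KMeans(n_clusters=k, n_init=10).fit(sample)
        # n_init=1: with explicit starting centres there is nothing to restart
        return KMeans(n_clusters=k, init=pass1.cluster_centers_, n_init=1).fit(X)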

Anyway, on Gaussian clusters I get

   N 10000  dim 2  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
    init="sample"
        time: 5.4 sec   vq distances: av 0.0821  quartiles 0.058 0.082 0.11
    init="k-means++"
        time: 53.4 sec  vq distances: av 0.0803  quartiles 0.057 0.081 0.1

   N 10000  dim 10  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
    init="sample"
        time: 8.3 sec    vq distances: av 0.718  quartiles 0.63 0.72 0.8
    init="k-means++"
        time: 220.8 sec  vq distances: av 0.715  quartiles 0.63 0.72 0.8
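
An illustrative harness along these lines (not my exact one: Gaussian blobs
from sklearn.datasets.make_blobs stand in for my test clusters, "random"
init stands in for "sample", and nwarp is not reproduced):

    import time
    import numpy as np
    from scipy.cluster.vq import vq
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs

    X, _ = make_blobs(n_samples=10_000, n_features=2, centers=20, random_state=0)
    for init in ("random", "k-means++"):
        t0 = time.time()
        km = KMeans(n_clusters=20, init=init, n_init=10, max_iter=5, tol=0.01).fit(X)
        _, dist = vq(X, km.cluster_centers_)  # distance of each point to its centre
        print(f"init={init}: {time.time() - t0:.1f} sec  "
              f"av {dist.mean():.3g}  quartiles {np.percentile(dist, [25, 50, 75])}")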

This is with the scikits.learn k-means;
the one in scipy 0.8 uses an absolute rather than a relative error tolerance,
and may be buggy too (there was a long thread about it last July).

Comments please, and links to *real* test data?

cheers
  -- denis

Re: 2-pass k-means: first do a random sample of sqrt(N) ?

Gael Varoquaux
On Mon, Jan 17, 2011 at 07:00:57AM -0800, denis wrote:
> Folks,
>   yet another k-means initializer:
> - first run k-means on a random sample of, say, sqrt(N) of the points
> - then run full k-means starting from those centres.

> How well this works depends, of course, on how well a sqrt(N) sample
> approximates the whole, as well as on N, dim, k, maxiter, delta,
> phase of the moon ...
> The sqrt(N) is a heuristic.
> (In principle one could recurse for very large N.)
> One could also tighten delta etc. for the first pass.

Sounds reasonable to me. This goes a bit in the direction of implementing
k-means as an online algorithm, which we are discussing.
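
(For context, the heart of an online k-means is a per-sample centre update
with a 1/count step size; a minimal sketch of that classic update, not the
scikit-learn implementation under discussion:

    import numpy as np

    def online_kmeans_step(x, centres, counts):
        # Find the nearest centre, then move it toward x with a
        # 1/count learning rate -- the standard online k-means update.
        j = int(np.argmin(((centres - x) ** 2).sum(axis=1)))
        counts[j] += 1
        centres[j] += (x - centres[j]) / counts[j]
        return j
)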

> Anyway, on Gaussian clusters I get

>    N 10000  dim 2  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
>     init="sample"
>         time: 5.4 sec   vq distances: av 0.0821  quartiles 0.058 0.082 0.11
>     init="k-means++"
>         time: 53.4 sec  vq distances: av 0.0803  quartiles 0.057 0.081 0.1

Yes, I have found that 'k-means++' wasn't terribly useful on big data.

>    N 10000  dim 10  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
>     init="sample"
>         time: 8.3 sec    vq distances: av 0.718  quartiles 0.63 0.72 0.8
>     init="k-means++"
>         time: 220.8 sec  vq distances: av 0.715  quartiles 0.63 0.72 0.8

I am not sure I understand these numbers.

> This is with the scikits.learn k-means;

Why don't you discuss this on the scikits-learn mailing list:
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
Also, we would need timings on different use cases to see how much we
gain (I couldn't understand your timings above, sorry).

There are quite a few very knowledgeable people there, and they might have
an opinion on whether it is best to wait for the online version of k-means
to be available (I'd say: give it 6 months) or whether it is worth hacking
a specific init.

Thanks for your input,

Gaël