Folks,
yet another k-means initializer:
  - first do a random sample of say sqrt(N) of the points
  - run full k-means from those centres.

How well this works depends of course on how well a sqrt(N) sample
approximates the whole, as well as on N, dim, k, maxiter, delta,
phase of the moon ...
The sqrt(N) is heuristic.
(Theoretically one could recurse for very large N.)
One could also tighten delta etc. for the first pass.

Anyway, on Gaussian clusters I get

N 10000  dim 2  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
  init="sample"
    time: 5.4 sec   vq distances: av 0.0821  quartiles 0.058 0.082 0.11
  init="k-means++"
    time: 53.4 sec  vq distances: av 0.0803  quartiles 0.057 0.081 0.1

N 10000  dim 10  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
    time: 8.3 sec   vq distances: av 0.718  quartiles 0.63 0.72 0.8
    time: 220.8 sec vq distances: av 0.715  quartiles 0.63 0.72 0.8

This is with the scikits.learn k-means; the one in scipy 0.8 uses
absolute, not relative, error and may be buggy too; see the long
thread last July.

Comments please, links to *real* test data please?

cheers
  -- denis

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
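[Editor's note: the two-step initializer described above can be sketched in plain NumPy as below. The function names and the simple Lloyd helper are illustrative only, not the scikits.learn API; the sqrt(N) sample size follows the heuristic in the post.]

```python
import numpy as np

def _lloyd(X, centres, n_iter):
    """A few plain Lloyd iterations: assign points to nearest centre,
    then move each centre to the mean of its points."""
    for _ in range(n_iter):
        # squared distances, shape (n_points, n_centres)
        d = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(axis=1)
        for j in range(len(centres)):
            pts = X[labels == j]
            if len(pts):                     # leave empty clusters alone
                centres[j] = pts.mean(axis=0)
    return centres

def sample_init_kmeans(X, k, rng=None, n_iter=10):
    """Sketch of the initializer in the post:
    1. draw a random sample of ~sqrt(N) points,
    2. run k-means on the sample to get k centres,
    3. run full k-means from those centres."""
    rng = np.random.default_rng(rng)
    N = len(X)
    m = max(k, int(np.sqrt(N)))              # heuristic sample size
    sample = X[rng.choice(N, m, replace=False)]
    centres = sample[rng.choice(m, k, replace=False)].copy()
    centres = _lloyd(sample, centres, n_iter)  # cheap first pass
    return _lloyd(X, centres, n_iter)          # full pass
```

The first pass costs only O(sqrt(N) * k) distance evaluations per iteration, which is where the speedup over running k-means++ on all N points comes from.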
On Mon, Jan 17, 2011 at 07:00:57AM -0800, denis wrote:
> Folks,
> yet another k-means initializer:
>   - first do a random sample of say sqrt(N) of the points
>   - run full k-means from those centres.
> How well this works depends of course on how well a sqrt(N) sample
> approximates the whole, as well as on N, dim, k, maxiter, delta,
> phase of the moon ...
> The sqrt(N) is heuristic.
> (Theoretically one could recurse for very large N.)
> One could also tighten delta etc. for the first pass.

Sounds reasonable to me. This is a bit in the direction of implementing
k-means as an online algorithm, which we are discussing.

> Anyway, on Gaussian clusters I get
> N 10000  dim 2  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
>   init="sample"
>     time: 5.4 sec   vq distances: av 0.0821  quartiles 0.058 0.082 0.11
>   init="k-means++"
>     time: 53.4 sec  vq distances: av 0.0803  quartiles 0.057 0.081 0.1

Yes, I have found that 'k-means++' wasn't terribly useful on big data.

> N 10000  dim 10  k 20  ninit 10  iter 5  delta 0.01  nwarp 0
>     time: 8.3 sec   vq distances: av 0.718  quartiles 0.63 0.72 0.8
>     time: 220.8 sec vq distances: av 0.715  quartiles 0.63 0.72 0.8

I am not sure I understand these numbers.

> This is with the scikits.learn k-means;

Why don't you discuss this on the scikits-learn mailing list:
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Also, we would need to time different use cases to see how much we gain
(I couldn't understand your timings above, sorry). There are quite a few
very knowledgeable people there, and they might have an opinion on
whether it is best to wait for the online version of k-means to be
available (I'd say: give it 6 months) or whether it is worth hacking a
specific init.

Thanks for your input,

Gaël
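[Editor's note: the "online algorithm" Gaël refers to can be illustrated with a minimal streaming update rule. This is a sketch of the standard online k-means step, not the implementation being discussed for scikits.learn; each incoming point nudges its nearest centre by a 1/count learning rate.]

```python
import numpy as np

def online_kmeans_step(centres, counts, x):
    """One online k-means update for a single point x:
    find the nearest centre, increment its count, and move it
    toward x by a step of 1/count (the running-mean update)."""
    j = ((centres - x) ** 2).sum(axis=1).argmin()
    counts[j] += 1
    centres[j] += (x - centres[j]) / counts[j]
    return j
```

Because each point is seen once, the whole dataset never needs to be in memory at once, which is the appeal for big data that the thread alludes to.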