Hi,
I'm trying to use scipy.cluster.vq.kmeans2() but I'm getting inconsistent output. With a simple test input that should have 3 clusters, I'm getting good results most of the time, but every so often the output creates the wrong clustering. If anyone could point to what I'm doing wrong I'd appreciate it! Code and sample output below.

Thanks!

James

Code:

import sys
import scipy
from scipy.cluster.vq import *

print sys.version
vals = scipy.array((0.0,0.1,0.5,0.6,1.0,1.1))
print vals
white_vals = whiten(vals)
print white_vals.shape, white_vals

# try it several times to see if we get similar answers
count = 0
while count < 5:
    res, idx = kmeans2(white_vals, 3) # changing iter doesn't seem to matter
    print res, idx
    count += 1

Output:

2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
[ 0. 0.1 0.5 0.6 1. 1.1]
(6,) [ 0. 0.24313227 1.21566135 1.45879362 2.4313227 2.67445496]
[ 0.12156613 2.55288883 1.33722748] [0 0 2 2 1 1]
[ 0.12156613 2.55288883 1.33722748] [0 0 2 2 1 1]
[ 1.33722748 2.55288883 0.12156613] [2 2 0 0 1 1]
[ 2.18819043 0.48626454 -0.97292963] [1 1 1 0 0 0] <-- unexpected result
[ 0.12156613 2.55288883 1.33722748] [0 0 2 2 1 1]
C:\PYTHON27\lib\site-packages\scipy\cluster\vq.py:588: UserWarning: One of the clusters is empty. Re-run kmean with a different initialization.
  warnings.warn("One of the clusters is empty. "

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user
On Sun, Aug 05, 2012 at 04:09:06PM -0700, James Abel wrote:
> I'm trying to use scipy.cluster.vq.kmeans2() but I'm getting inconsistent
> output. With a simple test input that should have 3 clusters, I'm getting
> good results most of the time but every so often the output creates the
> wrong clustering.

K-means is a non-convex problem: its result depends on the (random) initialization. In addition, it is not guaranteed to find the 'true' clusters, because quite often that is not possible from the data.

You are not doing anything wrong, you are just asking for something that is not possible.

HTH,

Gael
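To make the dependence on initialization concrete, here is a minimal pure-Python sketch of Lloyd's iteration on the six unwhitened values from the original post. It is not what kmeans2 does internally; the helper names `lloyd` and `inertia` and the two hand-picked starting points are assumptions chosen purely for illustration.

```python
def lloyd(data, centroids, max_iter=100):
    """Toy 1-D Lloyd's iteration; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
                  for x in data]
        # Update step: each centroid moves to the mean of its points.
        # A centroid that attracted no points is left where it was.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [x for x, lab in zip(data, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:
            break  # nothing moved, so the iteration has converged
        centroids = new_centroids
    return centroids, labels


def inertia(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum((x - centroids[lab]) ** 2 for x, lab in zip(data, labels))


data = [0.0, 0.1, 0.5, 0.6, 1.0, 1.1]

good = lloyd(data, [0.0, 0.5, 1.0])  # one start in each visible group
bad = lloyd(data, [0.0, 0.1, 0.8])   # two starts crowded into the first group

print(good[1], inertia(data, *good))
print(bad[1], inertia(data, *bad))
```

The first start settles on the three pairs with an inertia of about 0.015. The second stops at a partition that leaves 0.0 and 0.1 in separate clusters, with an inertia of about 0.26, even though every step of the iteration was carried out correctly.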
Thanks Gael.

BTW, I modified my code to loop until it gets the same clustering twice in a row. This yields more consistent results. I don’t know if this is a general solution but it worked for my simple test case. Code below.

James

import sys
import scipy
import warnings
from scipy.cluster.vq import *

print sys.version
vals = scipy.array((0.0,0.1,0.5,0.6,1.0,1.1))
print vals
white_vals = whiten(vals)
print white_vals.shape, white_vals

# Check for same clustering
def clustering_test(a,b):
    # have to create copies, then sort so we don't modify the original
    ea = a.copy()
    eb = b.copy()
    ea.sort()
    eb.sort()
    r = (ea == eb).all()
    print a,b,ea,eb,r
    return r

# try it until we get the same clustering twice in a row
found = False
prior_idx = None
while not found:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore") # suppress the warning message (happens if it doesn't find the right number of clusters)
        res, idx = kmeans2(white_vals, 3) # changing iter doesn't seem to matter
    #print res, idx
    if prior_idx is not None:
        eq = clustering_test(idx, prior_idx)
        #print eq.all()
        if eq:
            found = True
    prior_idx = idx
print "result", res, idx
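One caveat about clustering_test above: sorting the two label arrays compares only how many points each label received, so two different partitions with the same cluster sizes would also pass. Because k-means may number the same clusters differently from run to run, a comparison that ignores the label names but respects the grouping is closer to the intent. A minimal sketch follows; the helper names `canonical` and `same_clustering` are hypothetical and not part of SciPy.

```python
def canonical(labels):
    """Relabel clusters by order of first appearance: [2, 2, 0] -> [0, 0, 1]."""
    mapping = {}
    return [mapping.setdefault(lab, len(mapping)) for lab in labels]


def same_clustering(a, b):
    """True if a and b describe the same partition, whatever the label names."""
    return canonical(a) == canonical(b)


# Same grouping, different label names.
print(same_clustering([0, 0, 2, 2, 1, 1], [2, 2, 0, 0, 1, 1]))
# The "unexpected result" from the first post against a good run.
print(same_clustering([0, 0, 2, 2, 1, 1], [1, 1, 1, 0, 0, 0]))
# Same cluster sizes but a different grouping; a sort-based check accepts this pair.
print(same_clustering([0, 0, 1, 1, 2, 2], [0, 1, 1, 2, 2, 0]))
```

Even with a stricter comparison, two identical runs in a row only show that the same solution was reached twice, not that it is the best one.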
Hi James,

Usually, we run the optimisation several times and take the solution with the smallest inertia. The technique you use doesn't ensure that you keep the best solution.

There's a full implementation in scikit-learn with several runs. You can have a look at the code to see how it works.

Cheers,

On 8 Aug 2012 20:53, "James Abel" <[hidden email]> wrote:
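The "several runs, keep the smallest inertia" strategy described in the reply above can be sketched in plain Python as follows. This is a toy 1-D illustration, not scikit-learn's code; `lloyd`, `inertia`, and `best_of` are hypothetical helper names, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def lloyd(data, centroids, max_iter=100):
    """Toy 1-D Lloyd's iteration; returns (centroids, labels)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
                  for x in data]
        # Move each centroid to the mean of its points (unchanged if it has none).
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [x for x, lab in zip(data, labels) if lab == j]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels


def inertia(data, centroids, labels):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum((x - centroids[lab]) ** 2 for x, lab in zip(data, labels))


def best_of(data, starts):
    """Run lloyd() from every start; return (inertia, centroids, labels) of the best run."""
    runs = []
    for start in starts:
        centroids, labels = lloyd(data, start)
        runs.append((inertia(data, centroids, labels), centroids, labels))
    return min(runs, key=lambda run: run[0])


data = [0.0, 0.1, 0.5, 0.6, 1.0, 1.1]

rng = random.Random(0)  # fixed seed only so the sketch is repeatable
starts = [rng.sample(data, 3) for _ in range(10)]  # 3 distinct data points per start
score, centroids, labels = best_of(data, starts)
print(score, centroids, labels)
```

Unlike waiting for two identical runs in a row, this keeps the run with the lowest inertia no matter in which order the runs occur. scikit-learn's KMeans exposes the number of restarts as its n_init parameter.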
Thanks – sklearn works great! I got exactly what I expected each time I ran it!

James

import sys
import numpy
from sklearn.cluster import *

print sys.version
vals = numpy.array([[0.0],[0.1],[0.5],[0.6],[1.0],[1.1]])
print vals
k_means_ex = KMeans(k=3)
x = k_means_ex.fit_predict(vals)
print x
print k_means_ex.cluster_centers_
print k_means_ex.score(vals)

2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
[[ 0. ]
 [ 0.1]
 [ 0.5]
 [ 0.6]
 [ 1. ]
 [ 1.1]]
[1 1 0 0 2 2]
[[ 0.55]
 [ 0.05]
 [ 1.05]]
-0.015