kmeans2 question/issue

kmeans2 question/issue

James Abel-2
Hi,
I'm trying to use scipy.cluster.vq.kmeans2() but I'm getting inconsistent
output.  With a simple test input that should have 3 clusters, I'm getting
good results most of the time but every so often the output creates the
wrong clustering.  If anyone could point to what I'm doing wrong I'd
appreciate it!
Code and sample output below.
Thanks!
James

Code:

import sys
import scipy
from scipy.cluster.vq import *

print sys.version
vals = scipy.array((0.0,0.1,0.5,0.6,1.0,1.1))
print vals
white_vals = whiten(vals)
print white_vals.shape, white_vals

# try it several times to see if we get similar answers
count = 0
while count < 5:
    res, idx = kmeans2(white_vals, 3) # changing iter doesn't seem to matter
    print res, idx
    count += 1

Output:

2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
[ 0.   0.1  0.5  0.6  1.   1.1]
(6,) [ 0.          0.24313227  1.21566135  1.45879362  2.4313227
2.67445496]
[ 0.12156613  2.55288883  1.33722748] [0 0 2 2 1 1]
[ 0.12156613  2.55288883  1.33722748] [0 0 2 2 1 1]
[ 1.33722748  2.55288883  0.12156613] [2 2 0 0 1 1]
[ 2.18819043  0.48626454 -0.97292963] [1 1 1 0 0 0] <-- unexpected result
[ 0.12156613  2.55288883  1.33722748] [0 0 2 2 1 1]
C:\PYTHON27\lib\site-packages\scipy\cluster\vq.py:588: UserWarning: One of
the clusters is empty. Re-run kmean with a different initialization.
  warnings.warn("One of the clusters is empty. "

_______________________________________________
SciPy-User mailing list
[hidden email]
http://mail.scipy.org/mailman/listinfo/scipy-user

Re: kmeans2 question/issue

Gael Varoquaux
On Sun, Aug 05, 2012 at 04:09:06PM -0700, James Abel wrote:
> I'm trying to use scipy.cluster.vq.kmeans2() but I'm getting inconsistent
> output.  With a simple test input that should have 3 clusters, I'm getting
> good results most of the time but every so often the output creates the
> wrong clustering.

K-means is a non-convex optimization problem: the result depends on the
(random) initialization. In addition, it is not guaranteed to find the
'true' clusters, because quite often that is not possible from the data.

You are not doing anything wrong; you are just asking for something that
is not possible.

HTH,

Gael
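
To make the dependence on initialization concrete, here is a minimal sketch. It is not SciPy's implementation: it is a plain-Python version of the standard Lloyd iteration, run on the unwhitened values from the original post. The name `lloyd_1d` and both sets of starting centroids are assumptions chosen for illustration, and the code is written for Python 3.

```python
def lloyd_1d(points, centroids, max_iter=100):
    """Run Lloyd's k-means iteration on 1-D data from given starting centroids."""
    centroids = list(centroids)
    labels = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so a fixed point was reached
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:  # an empty cluster keeps its old centroid
                centroids[j] = sum(members) / len(members)
    # Inertia: sum of squared distances from each point to its centroid.
    inertia = sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia


points = [0.0, 0.1, 0.5, 0.6, 1.0, 1.1]

# Hypothetical starting centroids: one spread out, one bunched at the low end.
good = lloyd_1d(points, [0.0, 0.5, 1.0])
bad = lloyd_1d(points, [0.0, 0.1, 0.5])

print(good[1], round(good[2], 6))
print(bad[1], round(bad[2], 6))
```

The spread-out start converges to the three expected pairs. The bunched start converges too, but to a clustering that puts the four largest values into one cluster, with a much larger inertia. Both are fixed points of the iteration, so running more iterations does not help.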

Re: kmeans2 question/issue

James Abel-2
Thanks Gael.

BTW, I modified my code to loop until it gets the same clustering twice in a row.  This yields more consistent results.  I don’t know if this is a general solution, but it worked for my simple test case.  Code below.

James

import sys
import scipy
import warnings
from scipy.cluster.vq import *

print sys.version
vals = scipy.array((0.0,0.1,0.5,0.6,1.0,1.1))
print vals
white_vals = whiten(vals)
print white_vals.shape, white_vals

# Check for same clustering
def clustering_test(a,b):
    # have to create copies, then sort so we don't modify the original
    ea = a.copy()
    eb = b.copy()
    ea.sort()
    eb.sort()
    r = (ea == eb).all()
    print a,b,ea,eb,r
    return r

# try it until we get the same clustering twice in a row
found = False
prior_idx = None
while not found:
    with warnings.catch_warnings():
        warnings.simplefilter("ignore") # suppress the warning message (happens if it doesn't find the right number of clusters)
        res, idx = kmeans2(white_vals, 3) # changing iter doesn't seem to matter
    #print res, idx
    if prior_idx is not None:
        eq = clustering_test(idx, prior_idx)
        #print eq.all()
        if eq:
            found = True
    prior_idx = idx
print "result", res, idx
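
One caveat about the comparison used in this loop: sorting the two label arrays and comparing them only checks that each label value is used the same number of times. It does not check that the same points are grouped together. A minimal sketch of a comparison that ignores how the clusters are numbered is below. The helper names `canonical_labels` and `same_clustering` are hypothetical, and the code is plain Python 3 working on lists rather than arrays.

```python
def canonical_labels(labels):
    """Renumber clusters by order of first appearance.

    For example, [2, 2, 0, 1] becomes [0, 0, 1, 2].
    """
    mapping = {}
    # setdefault assigns the next unused number the first time a label is seen.
    return [mapping.setdefault(label, len(mapping)) for label in labels]


def same_clustering(a, b):
    """True if a and b group the points identically, up to renaming labels."""
    return canonical_labels(a) == canonical_labels(b)


# Two label vectors from the first post: same grouping, different numbering.
print(same_clustering([0, 0, 2, 2, 1, 1], [2, 2, 0, 0, 1, 1]))

# Same label counts but a different grouping: a sort-based test would
# call these equal, while this comparison does not.
print(same_clustering([0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 2, 2]))
```

Even with a correct comparison, two identical runs in a row only show that the same fixed point was reached twice, not that it is the best one.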



Re: kmeans2 question/issue

Nelle Varoquaux

Hi James,

Usually, we run the optimisation several times and keep the solution with the smallest inertia (the within-cluster sum of squared distances to the centroids). The technique you use doesn't ensure that you keep the best solution.

There's a full implementation in scikit-learn that performs several runs. You can have a look at the code to see how it works.

Cheers,
N
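
The restart strategy described above can be sketched as follows. This is a plain-Python stand-in written for Python 3, not the scikit-learn or SciPy code. The names `kmeans_1d` and `best_of`, the default of 10 restarts, and the fixed seed are all assumptions made for illustration.

```python
import random


def kmeans_1d(points, k, rng, max_iter=100):
    """One k-means run on 1-D data, started from k randomly chosen points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: (p - centroids[j]) ** 2) for p in points
        ]
        if new_labels == labels:
            break  # converged: assignments no longer change
        labels = new_labels
        # Recompute each centroid as the mean of its members.
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centroids[j] = sum(members) / len(members)
    inertia = sum((p - centroids[l]) ** 2 for p, l in zip(points, labels))
    return centroids, labels, inertia


def best_of(points, k, n_init=10, seed=0):
    """Run k-means n_init times and keep the run with the smallest inertia."""
    rng = random.Random(seed)  # seeded only so the sketch is repeatable
    runs = [kmeans_1d(points, k, rng) for _ in range(n_init)]
    best = min(runs, key=lambda run: run[2])
    return best, runs


points = [0.0, 0.1, 0.5, 0.6, 1.0, 1.1]
(best_centroids, best_labels, best_inertia), runs = best_of(points, 3)
print(best_centroids, best_labels, best_inertia)
```

The same outer loop could wrap `kmeans2`, computing the inertia from the returned centroids and labels after each call. Keeping the minimum does not guarantee the global optimum either, but each extra restart can only improve the result or leave it unchanged.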

On 8 Aug 2012 20:53, "James Abel" <[hidden email]> wrote:
> BTW, I modified my code to loop until it gets the same clustering twice in a row.  This yields more consistent results.
> [...]



Re: kmeans2 question/issue

James Abel-2
Thanks – sklearn works great!  I got exactly what I expected each time I ran it!

James

 

import sys
import numpy
from sklearn.cluster import *

print sys.version
vals = numpy.array([[0.0],[0.1],[0.5],[0.6],[1.0],[1.1]])
print vals
k_means_ex = KMeans(k=3) # newer scikit-learn releases name this parameter n_clusters
x = k_means_ex.fit_predict(vals)
print x
print k_means_ex.cluster_centers_
print k_means_ex.score(vals)

Output:

2.7.3 (default, Apr 10 2012, 23:31:26) [MSC v.1500 32 bit (Intel)]
[[ 0. ]
 [ 0.1]
 [ 0.5]
 [ 0.6]
 [ 1. ]
 [ 1.1]]
[1 1 0 0 2 2]
[[ 0.55]
 [ 0.05]
 [ 1.05]]
-0.015
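
The final number ties back to the inertia mentioned earlier in the thread: as far as I know, `KMeans.score` returns the negative of the k-means objective, so -0.015 corresponds to an inertia of 0.015. A short check in plain Python 3, using only the labels and centers printed above (the variable names are chosen for illustration):

```python
points = [0.0, 0.1, 0.5, 0.6, 1.0, 1.1]
labels = [1, 1, 0, 0, 2, 2]    # fit_predict output from the post
centers = [0.55, 0.05, 1.05]   # cluster_centers_ from the post

# Inertia: sum of squared distances from each point to its assigned center.
inertia = sum((p - centers[l]) ** 2 for p, l in zip(points, labels))

# Each of the six points sits 0.05 from its center, so this is 6 * 0.05**2.
print(round(-inertia, 6))
```

Every point is 0.05 away from its center, which gives 6 × 0.0025 = 0.015, the smallest inertia any split of these six values into three clusters can reach.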

