Simple implement K means by using NumPy

2 min readJun 15, 2020

K means is a classical algorithm. I’d like to have a quick implementation of the K means algorithm in case that you are supposed to implement it within 10 minutes such as in an interview.

The hard part is related to NumPy broadcast.

A quick summary of the broadcast rule will be from [1]:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
they are equal, or
one of them is 1

Some examples can be shown here:[1]

A      (2d array):  5 x 4
B      (1d array):      1
Result (2d array):  5 x 4

A      (2d array):  5 x 4
B      (1d array):      4
Result (2d array):  5 x 4

A      (3d array):  15 x 3 x 5
B      (3d array):  15 x 1 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 5
Result (3d array):  15 x 3 x 5

A      (3d array):  15 x 3 x 5
B      (2d array):       3 x 1
Result (3d array):  15 x 3 x 5

Get started

generate 5 points with 2 dimension features.

>>> points =np.random.randn(5, 2)
>>> idx = np.arange(5)
>>> np.random.shuffle(idx)
>>> idx
array([4, 2, 0, 1, 3])
>>> centroids = points[idx[:3]]
>>> centroids
array([[-0.92043547,  1.94704196],
       [ 1.32088275, -0.52664638],
       [ 2.18007867, -0.35835555]])
>>> centroids.shape
(3, 2)
>>> points.shape
(5, 2)
>>> centroids.shape
(3, 2)

The full code can be found in my github.

Clustering results

Number of iteraions and the total distance to the centroids

Reference

[1]https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html

[2]https://flothesof.github.io/k-means-numpy.html

Simple implement K means by using NumPy

Get started

Clustering results

Written by Jimmy (xiaoke) Shen

No responses yet