Simple implement K means by using NumPy

Jimmy (xiaoke) Shen
2 min readJun 15, 2020

--

K means is a classical algorithm. I’d like to have a quick implementation of the K means algorithm in case that you are supposed to implement it within 10 minutes such as in an interview.

The hard part is related to NumPy broadcast.

A quick summary of the broadcast rule will be from [1]:

When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when

they are equal, or

one of them is 1

Some examples can be shown here:[1]

A      (2d array):  5 x 4
B (1d array): 1
Result (2d array): 5 x 4

A (2d array): 5 x 4
B (1d array): 4
Result (2d array): 5 x 4

A (3d array): 15 x 3 x 5
B (3d array): 15 x 1 x 5
Result (3d array): 15 x 3 x 5

A (3d array): 15 x 3 x 5
B (2d array): 3 x 5
Result (3d array): 15 x 3 x 5

A (3d array): 15 x 3 x 5
B (2d array): 3 x 1
Result (3d array): 15 x 3 x 5

Get started

generate 5 points with 2 dimension features.

>>> points =np.random.randn(5, 2)
>>> idx = np.arange(5)
>>> np.random.shuffle(idx)
>>> idx
array([4, 2, 0, 1, 3])
>>> centroids = points[idx[:3]]
>>> centroids
array([[-0.92043547, 1.94704196],
[ 1.32088275, -0.52664638],
[ 2.18007867, -0.35835555]])
>>> centroids.shape
(3, 2)
>>> points.shape
(5, 2)
>>> centroids.shape
(3, 2)

The full code can be found in my github.

Clustering results

Number of iteraions and the total distance to the centroids
Clustering results for the random generated points

Reference

[1]https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html

[2]https://flothesof.github.io/k-means-numpy.html

--

--

No responses yet