Simple implement K means by using NumPy
K means is a classical algorithm. I’d like to have a quick implementation of the K means algorithm in case that you are supposed to implement it within 10 minutes such as in an interview.
The hard part is related to NumPy broadcast.
A quick summary of the broadcast rule will be from [1]:
When operating on two arrays, NumPy compares their shapes element-wise. It starts with the trailing dimensions, and works its way forward. Two dimensions are compatible when
they are equal, or
one of them is 1
Some examples can be shown here:[1]
A (2d array): 5 x 4
B (1d array): 1
Result (2d array): 5 x 4
A (2d array): 5 x 4
B (1d array): 4
Result (2d array): 5 x 4
A (3d array): 15 x 3 x 5
B (3d array): 15 x 1 x 5
Result (3d array): 15 x 3 x 5
A (3d array): 15 x 3 x 5
B (2d array): 3 x 5
Result (3d array): 15 x 3 x 5
A (3d array): 15 x 3 x 5
B (2d array): 3 x 1
Result (3d array): 15 x 3 x 5
Get started
generate 5 points with 2 dimension features.
>>> points =np.random.randn(5, 2)
>>> idx = np.arange(5)
>>> np.random.shuffle(idx)
>>> idx
array([4, 2, 0, 1, 3])
>>> centroids = points[idx[:3]]
>>> centroids
array([[-0.92043547, 1.94704196],
[ 1.32088275, -0.52664638],
[ 2.18007867, -0.35835555]])
>>> centroids.shape
(3, 2)
>>> points.shape
(5, 2)
>>> centroids.shape
(3, 2)
The full code can be found in my github.
Clustering results
Reference
[1]https://docs.scipy.org/doc/numpy-1.13.0/user/basics.broadcasting.html