
Module 2: Clustering and k-Nearest Neighbor

Data analysis tries to derive insights from data. It involves different phases, which we will visit in a later module. In this module, we look at simple but still important methods used in Machine Learning. If we have data that is characterized by a few numeric features, we can use the spatial proximity of data points for analysis purposes. The features are interpreted as coordinates in an n-dimensional space. While old, the two algorithms considered here are still used successfully. They assume that data points have only a few features. If we have too many, then the space of possible feature values has many dimensions, and high-dimensional spaces have counter-intuitive properties. Among them is that the volume of a sphere is almost exclusively concentrated near its border. This means that two average points in a sphere almost surely have at least one dimension in which their respective coordinates are far apart.
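A quick back-of-the-envelope calculation illustrates this concentration effect. The volume of an n-dimensional ball of radius $r$ is proportional to $r^n$, so the fraction of the volume of a unit ball that lies within the inner radius $1-\varepsilon$ is

$$\frac{V_n(1-\varepsilon)}{V_n(1)} = (1-\varepsilon)^n \longrightarrow 0 \quad (n \to \infty).$$

For example, with $\varepsilon = 0.1$ and $n = 100$, only $0.9^{100} \approx 0.003\%$ of the volume lies in the inner 90% of the radius; almost everything sits in the thin outer shell.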

2.1 Clustering

Figure 1: A set of 150 data points

In Figure 1, we present an artificial data set. Imagine that the two coordinates of each point describe features that characterize the objects we want to classify. As a consequence, we want to classify objects that have similar features as belonging to the same grouping. In the language of Machine Learning, we are clustering: dividing the points in the data set into various clusters. Clustering is an important example of an unsupervised algorithm. In an unsupervised algorithm, there are no labels attached to the data points. In a certain sense, our clustering algorithm creates the labels.

2.1.1 Distances

Our first task is to define similarity, as we want to group similar data points together. The most natural measure for most human beings is the distance between two points in a plane, the Euclidean distance. If we have two data points $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, then the Euclidean distance is defined as

$$d_2(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2},$$

the square root of the sum of the squared differences between the data points. Often, the square of the Euclidean distance is used:

$$d_2^2(x, y) = \sum_{i=1}^{n} (x_i - y_i)^2.$$

There are many alternatives, such as those based on the $L_p$-norm, where

$$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}.$$

If we take the limit $p \to \infty$, then we obtain the maximum distance

$$d_\infty(x, y) = \max_{1 \le i \le n} |x_i - y_i|.$$

In practice, the Euclidean or $L_2$-distance is the most frequently used.
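As a small illustration, the following NumPy sketch computes these distances for two feature vectors; the vectors themselves are made-up examples.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 0.0, 3.5])

# Euclidean (L2) distance: square root of the sum of squared differences
d_euclid = np.sqrt(np.sum((x - y) ** 2))      # same as np.linalg.norm(x - y)

# Squared Euclidean distance, often used to avoid the square root
d_sq = np.sum((x - y) ** 2)

# General L_p distance, here with p = 3
p = 3
d_p = np.sum(np.abs(x - y) ** p) ** (1 / p)   # same as np.linalg.norm(x - y, ord=p)

# Maximum (L-infinity) distance, the limit for p -> infinity
d_max = np.max(np.abs(x - y))                 # same as np.linalg.norm(x - y, ord=np.inf)

print(d_euclid, d_sq, d_p, d_max)
```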

2.1.2 Normalizing the Data Set

The distances between data points depend heavily on the units involved. For example, if you want to classify apartments for short-term rent, then the size of the apartment will matter. If you are using US data, you will find the size quoted in square feet; in most of the rest of the world, however, it is quoted in square meters. The difference is about a factor of eleven. Switching from square meters to square feet therefore increases the influence of the size feature on the distance by this factor.

Figure 2: Min-max scaling on two dimensions of the Penguin data-set. The figures look identical because the graphing tool already uses min-max scaling automatically; only the units on the x- and y-axes have changed.

However, the distances between two points have now changed. The answer to this is to pre-process the data. First, there are outliers, data points that are very unusual, sometimes because some individuals are truly unusual and sometimes because data gathering was faulty. With real-world data, the latter is quite frequent. One way to solve the outlier problem is simply to prune the outliers. This is a questionable procedure if the data might be accurate, since we would be removing important information from the data set for the sake of neater results. In our context, normalizing data is more important. There are different ways to do so. The first possibility is to use a linear transformation so that all transformed values lie between 0 and 1. Our data points are arranged in a big array

$$X = \begin{pmatrix} x_{1,1} & x_{1,2} & \cdots & x_{1,n} \\ x_{2,1} & x_{2,2} & \cdots & x_{2,n} \\ \vdots & \vdots & & \vdots \\ x_{m,1} & x_{m,2} & \cdots & x_{m,n} \end{pmatrix},$$

where the first row contains the features of the first data point, the second row the features of the second data point, and so on. We normalize the first coordinate separately from the second coordinate. We first calculate the maximum and the minimum for each column. We then subtract the column minimum from each coordinate and divide by the difference between the column maximum and minimum. This procedure is called min-max scaling. Its biggest drawback is its dependence on outliers.
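In formulas, the scaling step replaces each entry $x_{i,j}$ of column $j$ by

$$x'_{i,j} = \frac{x_{i,j} - \min_j}{\max_j - \min_j},$$

where $\min_j$ and $\max_j$ are the minimum and maximum of column $j$; every transformed value then lies between 0 and 1.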

Min-max scaling is done fairly simply in NumPy. We can use the min and max functions, but because we want to ignore NaN values, we use the versions np.nanmin and np.nanmax, which skip them. We specify the axis so that we select column minima and maxima. As an example, we use the Penguin data-set we looked at in the previous module. We use these functions to create a row of column minima and a row of column maxima; the scaling step then subtracts the minima and divides by the difference between maxima and minima. A sketch of this procedure is given below. We give the results in Fig. 2. If the two figures look the same, it is because they are: the graphing tool also uses min-max scaling.
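The following is a minimal sketch of this procedure, assuming the numeric Penguin features have already been collected in a two-dimensional NumPy array X (rows are penguins, columns are features); the values below are made up for illustration and include a missing entry.

```python
import numpy as np

# Stand-in for two numeric Penguin features, with one missing value.
X = np.array([[39.1, 181.0],
              [46.5, 217.0],
              [np.nan, 195.0],
              [50.0, 230.0]])

# Column minima and maxima, ignoring NaN values (axis=0 selects columns).
col_min = np.nanmin(X, axis=0)
col_max = np.nanmax(X, axis=0)

# The scaling step: subtract the column minimum and divide by the column range.
X_scaled = (X - col_min) / (col_max - col_min)

print(X_scaled)  # every non-NaN value now lies between 0 and 1
```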