An experience of unsupervised learning

In my previous post I’ve explained why I think you should learn machine learning and promised to share my experiences with its unsupervised part.

The unsupervised machine learning has a mystical attraction. You don’t even bother to label the examples, just send them to the algorithm, and it will learn from them, and boom – it will automatically separate them to classes (clustering).

When I was studying electrical engineering, we’ve learned about the so called optimal filters, which are electrical circuits that can extract the useful signal from the noise, even though the noise is 100 times stronger than the signal, so that a human eye cannot even see the signal. This was like a magic, and a similar magic I have expected in this case: I would pass the examples to the clustering algorithm, and it will explore the hidden relationships between the data and give me some new, unexpected and useful insights…

Today, having tried it, I still believe that some other algorithms (maybe deep learning?) are able to produce such a magic (because well, you should never stop believing in magic), but my first impression was somewhat disappointing.

The very first thing the clustering algorithm wanted to know from me, is how many clusters it should look for. Pardon me? I’ve expected that you will find the clusters and tell me, how many clusters are there in my data? If I have to pass the number of clusters beforehand, it means I have to analyse the data to find out its inherent clustering structure, and it means I have to perform all that work what I’ve expected to be performed magically by the algorithm.

Well, it seems that the state of the art of current clustering algorithms indeed cannot find clusters autonomously and fully unsupervised, without any hint or input from the user. Some of them don’t require the number of clusters, but need some equivalent parameter, like the minimum number of examples needed in the neighborhood to establish a new cluster. But well, on the other case, this still allows for some useful applications.

One possible use case could be automatic clustering per se: if your common sense and domain knowledge tell you that the data has exactly N clusters, you can just run the data through the clustering algorithm, and there is a good chance that it will find exactly the clusters you’ve expected: no need to define all the rules and conditions separating the clusters manually. Besides, it will define centroids or medoids of each cluster, so that if new, unclastered objects are added daily, you can easily assign them to existing clusters by calculating distances to all centroids and taking the cluster with the shortest distance.

Another use case would be, if you don’t really care about the contents of the clusters and the clusters aren’t going to be examined by humans, but rather use clustering as a kind of lossy compression of the data space. A typical example would be some classical recommendation engine architectures, where you replace the millions of records with some smaller number of clusters, with some loss of recommendation quality, just to make the computation task at hand to be feasible for available hardware. In this case, you’d just consider how many clusters, at most, your hardware can handle.

Yet another approach, and I went this way, is to ask yourself, how many clusters is too little and how many clusters is too many? I was clustering people, and wanted to provide my clusters to my colleagues and to myself, to be able to make decisions on them. Therefore, due to well-known human constraints, I was looking for at most 7 to 8 clusters. I also didn’t want to have less than 5 clusters, because intuitively, anything less in my case would be underfitting. So I’ve played with parameters until I’ve got a reasonable number of clusters, and clusters of reasonable (and understandable for humans) content.

Speaking of which, it took a considerable amount of time for me to evaluate the clustering results. Just like with any machine learning, it is hard to understand the logic of the algorithm. In this case, you will just get clusters numbered from 0 to 7, and each person will be assigned to exactly one cluster. Now it is up to you to make sense of the clusters and to undestand, what kind of people were grouped together. To facilitate this process, I’ve wrote a couple of small functions returning to me the medoids of each clusters (i.e. the single cluster member who is nearest to the geometrical center of the cluster, or in other words, the most average member of the cluster), as well as average values of all features in the cluster. For some reason, most existing clustering algorithms (I’m using scikit-learn) don’t bother of computing and giving this information to me as a free service, which, again, speaks about the academic rather than industrial quality of modern machine learning frameworks.

By the way, another thing that was not provided for free was pre-scaling. In my first attempts, I’ve just collected my features, converted them to real numbers, put them in a matrix and fed this matrix to the clustering algorithm. I didn’t receive any warnings or such, just fully unusable results (like, several hundreds of clusters). Luckily for me, my previous experience with supervised learning had taught me that fully unusable results normally mean some problem with the input data, and I’ve got to the idea to scale all the features to be in the range of 0 to 1, just like with the supervised learning. This had fixed this particular problem, but I’m still wondering, if the clustering algorithms usually cannot meaningfully work on unscaled data, why don’t they scale data for me as a free service? In the industrial grade software, I would rather needed to opt-out of the pre-scaling by setting some configuration parameter, in case I wanted to turn it off in some very unique and special case, than having to implement scaling myself, which is the most common case anyway. If it is some kind of performance optimization, I’d say it is a very, very premature one.

But I digress. Another extremely useful tool helping to evaluate clustering quality was the silhouette metric (and a framework class implementing it in scikit-learn). This metric is a number from 0 to 1 showing how homogeneous the cluster is. If a cluster has silhouette of 0.9, it means that all members of this cluster are very similar to each other, and unsimilar to the members of another clusters.

Least but not last, clustering algorithms tend to create clusters for many, but not for all examples. Some of the examples remain unclustered and are considered to be outliers. Usually, you want the clustering algorithm to cluster the examples in a such way that there will me not too many outliers.

So I’ve assumed the following simple criteria:

  • 5 to 8 clusters
  • Minimal silhouette of 0.3
  • Average silhouette of 0.6
  • Less than 10% of all examples are outliers

and just implemented a trivial grid search across the parameters of the clustering algorithm (eps and min_samples of the DBSCAN, as well as different scaling weights for the features), until I’ve found the clustering result that suited all of my requirements.

To my astonishment, the results corresponded very well to my prior intuitive expectations based on some domain knowledge, but also have created a very useful quantitative measure of my previous intuitive understanding.

All in all, unsupervised learning can be used to gain some benefits from the data, if you don’t expect too much from it. I think, to gain more business value, we have to make the next step and to start a project including deep learning. In USA and China it seems to be that virtually everyone is doing deep learning (I wonder if Bitcoin farms can be easily repurposed for that), but it Germany it is rather hard to find anyone publicly admitting doing this. Although the self-driving cars of German manufacturers, already existing as prototypes, would be impossible without some deep learning…

Leave a Reply

Categories

Archive