Why You Should Learn Machine Learning

In the late 1980s and early 1990s, fourth-generation programming languages and genetic algorithms were very popular topics in the mass media. We read in magazines that software developers would become obsolete, because users would create their programs themselves using 4GL, or else AI systems would soon appear that could extend themselves. By that time I had learned my first programming languages and was about to choose my subject at university, so I had doubts about the job prospects in software development.

Fortunately (or not), Steve Jobs and Bill Gates popularized graphical user interfaces around that time, so this first AI wave calmed down (or returned to its academic roots), because software development became less about finding an answer to a question and more about displaying windows, buttons, menus and textboxes. The focus of computer games shifted from “what exactly you are doing” to “how cool it looks”. The Internet changed from a source of scientific or personal information into an ingenious marketing tool and became a thing of pictures, graphic design and neuromarketing.

But if you are a software developer and have not yet realized that you need to teach yourself machine learning, you should be concerned about your job. Because machine learning is coming, and it is the next logical step in losing full control over your software.

First, we lost control over the exact machine instructions put into our programs and gave it up to compilers. Next, we lost control over memory management and gave it up to garbage collectors. Then we partially lost control over the order of execution and gave it up to event loops, multithreading, lambda expressions and other tools. With machine learning, we will lose control over the business logic.

In classic computer programming, we were trained for the situation where the desired business logic is exactly known and specified beforehand. Our task was to implement the specification as precisely as possible. And in the first decades of software development practice, there were enough useful problems that could be specified with more or less acceptable effort. Remember, the first computers were used for ballistic calculations. With all the formulae already worked out by scientists, the programming task at hand had a perfect specification.

Now we want to go into areas where creating a specification is impossible, too expensive, or simply not the optimal course of action.

Let’s take fraud detection as an example. Say we have data about the payment transactions of some payment system and want to detect criminal activity.

A possible non-machine-learning approach would be to establish a set of rules for fraud detection based on common sense. For example, some limit on the transfer sum, above which a transaction becomes suspicious. Transactions from different geographical locations within a short period of time are also suspicious, and so on.
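To make this concrete, a hand-rolled rule set might look roughly like the sketch below; all thresholds and field names are invented for illustration only.

```python
# A hand-written, common-sense rule set for flagging suspicious transactions.
# All thresholds and field names here are invented for illustration only.

SUM_LIMIT = 10_000          # transfers above this amount are suspicious
MAX_KM_PER_HOUR = 900       # faster than a plane ride between two transactions

def looks_suspicious(tx, previous_tx=None):
    """Return True if the transaction trips any of our hand-made rules."""
    if tx["amount"] > SUM_LIMIT:
        return True
    if previous_tx is not None:
        hours = (tx["timestamp"] - previous_tx["timestamp"]).total_seconds() / 3600
        if hours > 0 and tx["distance_km_to_previous"] / hours > MAX_KM_PER_HOUR:
            return True
    return False
```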

One obvious limitation of this approach is that the alarm thresholds are based on common sense, so the objective quality of the fraud detection depends heavily on how well the subjective common sense of its developers reflects reality.

Another obvious limitation of the common-sense approach is that such a rule system cannot be infinitely complex. Humans can comprehend only a limited number of rules at once, so they usually stop after defining five or seven rules; they see a system with 20 rules as “very complex” and a system with 100 rules as “we need a whole new department to make sense of what is really going on here”. For comparison, Square, Inc. uses a machine learning algorithm for fraud detection based on (my conservative guess) over 3000 rules – not to mention that they can re-tune these rules automatically every day or even more often.

It is even harder for a human to comprehend the possible interplay between the rules. A typical geo-based rule should fire for distance D and time period T, but not during the public holiday season (as many people travel then), but even in that season it must still fire if the amount is above M and the recipient is a registered merchant, or above P if the recipient is a private person, but it must not fire if the money holder had already made similar transfers a year before and those were not marked as fraud, but it must still fire if any automatic currency conversion is taking place… At some point, a classic software developer will throw up her hands and declare herself out of the game. Usually she will then create a generic business rule engine and insist that the business people configure the system with all their chaotic business rules themselves. Which doesn’t solve the problem, it just shifts it from one department to another.

Now, remember the Shannon–Hartley theorem? Me neither, but the main point of it was that there is a difference between information – the useful signal that is valued by the receiver – and mere data, the stream of zeros and ones in some particular format. The fraud detection issue can be seen as an information extraction problem. Somewhere in the transaction data, hidden from our eyes, there is information signaling criminal activity. We as humans have practical limits on extracting this information. Machine learning, done correctly, is a way to extract and evaluate more information from the data.

Classifiers in machine learning are algorithms that, based on a set of features (or attributes) of some event or object, try to predict its class, for example “benign payment” or “fraud”.

No matter what algorithm is used, the procedure is roughly the same. First, we prepare a training set of several (often at least 1000 – the more the better) labeled events or objects, called examples. “Labeled” means that for each of those examples we already know the right answer. Then we feed the examples to the classifier algorithm, and it trains itself. The particulars depend on the exact algorithm, but what all algorithms basically try to do is recognize how exactly the features are related to the class and construct a mathematical model that can map any combination of input features to a class. Often the algorithms are not terribly complicated to understand: for example, they might count how often one of the features appears in one class and then in the other; or they might start with a more or less random threshold for a rule and then move it, each time counting the number of right predictions (the accuracy) and changing direction when the accuracy gets worse. Unfortunately, not a single algorithm author cares about the learning curve of his users, so most algorithm descriptions include some hardcore-looking math, even when it is not strictly necessary.
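As a minimal sketch of that procedure (using scikit-learn here; the CSV file and the column names are invented placeholders, not a real schema):

```python
# Minimal training sketch with scikit-learn; the CSV layout and column
# names are invented placeholders, not a real schema.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv("labeled_transactions.csv")   # one row per labeled example
X = data[["amount", "hours_since_last_tx", "distance_km_to_previous"]]
y = data["is_fraud"]                             # the known, human-assigned label (0 or 1)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, y)                                    # this is the "it trains itself" step
```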

Finally, a trained classifier is ready. You can now pass unlabeled examples to it, and it will predict their classes. Some classifiers are nice and can even tell you how sure they are (like, 30% chance it is a benign payment and 70% chance it is fraud), so you can implement different warning levels depending on their confidence.
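In scikit-learn, for instance, that confidence comes from predict_proba; mapping it to warning levels is then up to you. A sketch continuing the example above (the thresholds are invented):

```python
# Continuing the sketch above: classify new, unlabeled transactions and
# turn the classifier's confidence into warning levels.
new_tx = pd.read_csv("todays_transactions.csv")
X_new = new_tx[["amount", "hours_since_last_tx", "distance_km_to_previous"]]

# With 0/1 labels, column 1 of predict_proba is the probability of "fraud".
fraud_probability = clf.predict_proba(X_new)[:, 1]

for prob in fraud_probability:
    if prob > 0.9:
        level = "block and investigate"
    elif prob > 0.7:
        level = "manual review"
    else:
        level = "let it through"
    print(f"{prob:.2f} -> {level}")
```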

A huge disadvantage of machine learning (and welcome to the rant part of this post): only some classifiers can be logically understood by a human being. Often they are just probability distributions or hundreds of decision trees. While it is theoretically possible, for a given input example, to work through the formulas with pen and paper and get the same result as the classifier did, it would take a lot of time and wouldn’t necessarily bring you a deep understanding of the logic; in practice, classifiers cannot be explained. This means that sometimes you pass the classifier an example where you as a human can clearly see it is fraud, get the class “benign” back from it, and go “what the hell? Is this not obviously a fraud case? And now what? How can I fix it?”

I suppose one could try to train a second classifier, giving the wrongly predicted examples more weight in its training set, and then combine the results of both classifiers with some ensemble method, but I haven’t tried it yet. I haven’t found any solution to this problem in books or training courses. Currently, most of the time you have to accept that the world is imperfect and move on.
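If I ever do try it, my starting point would probably be something like the sketch below, which continues the earlier scikit-learn example and uses sample_weight to make the misclassified examples count more. This is an untested idea, not a recommendation.

```python
# Rough sketch of the idea: give the examples the first classifier got wrong
# a higher weight, train a second classifier, and average the two.
# Untested idea, not a recommendation. Reuses X, y, X_new from the earlier sketches.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

clf1 = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

weights = np.where(clf1.predict(X) == y, 1.0, 5.0)   # 5x weight for mistakes
clf2 = RandomForestClassifier(n_estimators=100, random_state=1)
clf2.fit(X, y, sample_weight=weights)

# Naive "ensemble": average the predicted fraud probabilities of both models.
combined = (clf1.predict_proba(X_new)[:, 1] + clf2.predict_proba(X_new)[:, 1]) / 2
```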

And generally, machine learning is still in a very half-baked state, at least in Python and R.

Another typical problem of contemporary machine learning: when training classifiers and giving them too many features, or features in the wrong format, the classifying algorithms can easily become fragile. Typically they don’t even try to tell you that they are overwhelmed, because they can’t even detect it. Most of them still have academic software quality, so they don’t have much precondition checking, strong typing, proper error handling and reporting, proper logging and the other things we are accustomed to in production-grade software. That’s why most machine learning experts agree that currently most of the time is spent on a process they call feature engineering, and I call “random tinkering with the features until the black box of the classifying algorithm suddenly starts producing usable results.”
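To give one concrete (and entirely invented) example of what that tinkering tends to look like: raw columns are often replaced by derived features the algorithm can actually digest.

```python
# One invented example of "feature engineering": derive features the
# algorithm can digest instead of feeding it raw columns.
import pandas as pd

data = pd.read_csv("labeled_transactions.csv", parse_dates=["timestamp"])

# A raw timestamp is nearly useless to most classifiers; the hour of day
# and a weekend flag are much more digestible.
data["hour_of_day"] = data["timestamp"].dt.hour
data["is_weekend"] = (data["timestamp"].dt.dayofweek >= 5).astype(int)

# A free-text categorical column becomes one 0/1 column per category.
data = pd.get_dummies(data, columns=["merchant_country"])
```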

But well, if you are lucky or, more likely, after having invested a lot of time in feature engineering, you will get a well-trained algorithm capable of accurately classifying most of the examples from its training set. You calculate its accuracy and are very impressed by some high number, like 98% correct predictions.

Then you deploy it to production and are bummed by something like 60% accuracy under real conditions.

This is called overfitting, and it is a birthmark of many contemporary algorithms – they tend to believe that the (obviously limited) training set contains all possible combinations of values and underestimate combinations not present in the set. Statisticians have developed a procedure to deal with this, called cross-validation, which increases the training time of your algorithm by a factor of 5 to 20, but in return gives you a more accurate estimate of the accuracy. In the example above, your algorithm would earn something like 64% accuracy after cross-validation, so at least you are not unpleasantly surprised when running it in production.
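In scikit-learn, for example, cross-validation is nearly a one-liner; the sketch below continues the earlier example.

```python
# Cross-validation: train and evaluate on several different splits of the
# training set instead of trusting the accuracy on data the model has
# already seen. Continues the earlier sketch (clf, X, y).
from sklearn.model_selection import cross_val_score

scores = cross_val_score(clf, X, y, cv=5)   # 5 independent train/test splits
print(scores.mean())                        # a far more honest accuracy estimate
```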

Modern, improved algorithms such as random forests have built-in protection against overfitting, so I think this whole problem is a temporary issue of a quickly developing technology and we will forget about it in a year or so.

I also have the feeling that the authors of machine learning frameworks consider themselves done as soon as a trained classifier is created and evaluated. Preparing it for production and using it there is not considered a worthy task. As a result, my first rollout of a classifier produced predictions that were worse than random guessing. After weeks of lost time, the problem was found. To train the classifier, I had written an SQL query and stored my training set in a CSV file. This is obviously not acceptable for production, so I reimplemented the code in Python. Unfortunately, it was reimplemented in a different way, meaning that one of the features was encoded in a different format than the one used during the training phase. The classifier did not produce any warnings and simply “predicted” garbage.
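The lesson I took from it: keep a single piece of feature-encoding code and call it from both the training script and the production service, so the formats cannot silently diverge. A sketch with invented field names:

```python
# One single encoding function, imported by both the training script and
# the production service, so the feature format cannot diverge.
# Field names are invented for illustration.

def encode_features(tx):
    """Turn one raw transaction into the exact feature vector the model expects."""
    return [
        float(tx["amount"]),
        float(tx["hours_since_last_tx"]),
        1.0 if tx["recipient_is_merchant"] else 0.0,   # same 0/1 encoding everywhere
    ]

# training script:     X = [encode_features(tx) for tx in labeled_transactions]
# production service:  clf.predict([encode_features(incoming_tx)])
```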

Another problem is that most algorithms cannot be trained incrementally. If you have 300 features, have spent weeks training your algorithm, and now want to add the 301st feature, you will have to re-train the classifier on all 301 features, even though the relationships between the first 300 features haven’t changed.

I think there are more rants about machine learning frameworks to come. But at the same time, things in this area change astonishingly rapidly. I don’t even have time to try out the new shiny interesting thing announced every week. It’s like riding a bicycle on the autobahn. Some very big players have been secretly working in this area for 8 years or more, and now that they are coming out you realize a) how much more advanced they are compared to you, b) that all internet business will soon be divided into those who could implement and monetize big data and those who were left behind, and c) that machine learning will be implemented as built-in statements in mainstream languages within the next five years.

Summarizing: even in its current state, state-of-the-art machine learning has advantages that are too significant to ignore:

– the possibility to extract more information from data than human-specified business logic can;
– as a pleasant consequence, any pre-existing data (initially collected for other purposes) can be repurposed and reused, meaning more business value extracted per bit;
– another pleasant consequence is the possibility to handle data with a low signal-to-noise ratio (like user behavior data);
– and finally, if the legacy business logic didn’t have quality metrics, they will be introduced, because any kind of supervised machine learning involves measuring and knowing the quality metrics of the predictions (accuracy, precision, recall, F-scores; see the snippet below).
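For what it’s worth, computing these metrics is trivial once you have a held-out test set; in scikit-learn it looks roughly like this (X_test and y_test are assumed to be your held-out examples and their true labels):

```python
# Computing the standard quality metrics with scikit-learn; X_test and
# y_test are assumed to come from a held-out test set with 0/1 labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

predictions = clf.predict(X_test)
print("accuracy: ", accuracy_score(y_test, predictions))
print("precision:", precision_score(y_test, predictions))
print("recall:   ", recall_score(y_test, predictions))
print("f1 score: ", f1_score(y_test, predictions))
```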

In this post, I have only described supervised machine learning. There is also a big area called unsupervised machine learning. In December last year, on the last day before my vacation, I finished my first experiment with it, and it will be the topic of my next post.

And Big Data is so much more than just machine learning. It also includes architecting and deploying a heterogeneous database landscape, implementing high-performance processing of online and offline data, implementing recommendation engines, computational linguistics and text processing of all kinds, as well as analytics over huge amounts of poorly structured and ever-growing data.

If you are interested in working in our big data team, contact me and I will see what I can do (no promises!).
