Four generations

Degenerate world!
Anne of Austria, “Twenty years after”

The Titans

As in any mythology, the first generation has to be the Titans: people who achieved the impossible, many times, consistently, and seemingly without effort. Gauss, with his linear regression and normal probability distribution, laid the foundations of numerous scientific disciplines.

Student’s t-distribution was created as a by-product of the Guinness brewery trying to assess the quality of its raw materials.

Everything we now consider the absolute basics, seemingly in existence forever, was created only recently, in the 18th, 19th and early 20th centuries. Thomas Bayes, Carl Friedrich Gauss, Pearson, Fisher, Kolmogorov, Markov…

These people called themselves statisticians, mathematicians or just scientists. They ran their models in their heads and defined the model architecture with ink and paper. They calculated the weights and biases with a slide rule and stored them with chalk on a blackboard.

Their contribution to our science, technology and economics is infinite, because as long as humanity exists, it will use logarithms, the normal distribution and Bayes’ theorem – for fun and profit.

The Data Scientists

The second generation was instrumental in applying statistics to real-world problems, for profit, using computer technology. Computers liberated the field from the purist mathematical approach: decision trees and random forests can no longer be written down on paper and understood by a human brain, but they improved accuracy, recall, precision and protection against overfitting.

The second generation is the world of feature engineering, controlling overfitting, structured tabular data, classification, regression, clustering, forecasting and outlier detection. It is MATLAB, R and Python. It is naive Bayes, linear and logistic regression, SVM, decision trees and random forests, k-NN, DBSCAN, ARIMA…

The Data Hackers

With the advent of deep learning, data scientists quickly realized that deep neural networks (DNNs) can simulate the earlier ML algorithms, and can also be shaped into new architectures specialized for particular tasks. There is no science of how to create these architectures, so people just had to tinker, using trial and error to find, more or less randomly and intuitively, an architecture that performed better on public datasets.
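To make the "simulate" claim a bit more concrete, here is a minimal sketch, not anything from a real project: a "network" with one input, no hidden layer and a sigmoid output is the same model as logistic regression. The toy data, the function names and the learning rate are all made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic activation, the only non-linearity this 'network' has."""
    return 1.0 / (1.0 + math.exp(-z))


def train_single_neuron(xs, ys, lr=0.5, epochs=2000):
    """Fit y ~ sigmoid(w * x + b) by gradient descent on cross-entropy loss.

    One input, no hidden layer, sigmoid output: this is exactly
    the logistic regression model, just written as a neuron.
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            error = sigmoid(w * x + b) - y  # d(loss)/d(logit) for cross-entropy
            grad_w += error * x / n
            grad_b += error / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Toy data (assumption, made up): the label is 1 when x is positive.
xs = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
ys = [0, 0, 0, 0, 1, 1, 1, 1]

w, b = train_single_neuron(xs, ys)
predictions = [1 if sigmoid(w * x + b) >= 0.5 else 0 for x in xs]
```

Stack more such neurons into layers and you leave the territory where the result can still be read off a sheet of paper, which is where the tinkering starts.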

This was not only truly liberating. It was a massive scale-out, a tipping point. While the Titans could be counted on one or two hands and were typically born once in a century, and the data scientists numbered only in the thousands, the ability to quickly hack an ML architecture allowed millions of people to contribute to data science. It feels as if almost everybody who happens to have free time, a decent GPU and ambition can do it.

We are talking here about TensorFlow, PyTorch, JAX and CUDA. Our usual suspects are CNNs, LSTMs and GANs. Our data is non-relational now (audio, images and video). Our problems are too little hardware, training that takes too long, and gradient descent that is not guaranteed to converge. Hackers quickly developed superstitions about how to choose initial weights and which non-linear function to use when. In flashes of ingenuity we created auto-encoders, text embeddings and data augmentation. Our tasks are classification, segmentation, translation, sentiment mining, generation.
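The convergence complaint is easy to reproduce even without a GPU. The sketch below is a toy assumption of mine, not a real training loop: plain gradient descent on f(x) = x**2, where each step multiplies x by (1 - 2 * lr), so the step size alone decides whether the run shrinks toward the minimum or blows up.

```python
def gradient_descent(lr: float, steps: int = 50, x0: float = 1.0) -> float:
    """Minimise f(x) = x**2 with plain gradient descent and return the final x."""
    x = x0
    for _ in range(steps):
        grad = 2.0 * x  # derivative of x**2
        x -= lr * grad
    return x


small_step = gradient_descent(lr=0.1)  # each step multiplies x by 0.8
large_step = gradient_descent(lr=1.1)  # each step multiplies x by -1.2
```

With lr=0.1 the final x is close to zero; with lr=1.1 it has grown by several orders of magnitude. In a real network nobody can write down the safe range this neatly, hence the superstitions.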

While data scientists are still developing non-DNN machine learning (with a notable mention of pyro.ai and novel outlier detection algorithms), they are now far from the spotlight.

The Prompt Kids

No offense! After all, “script kiddies” is an established term for people who take someone else’s intellectual property and try to apply it to everything around them, sometimes managing to produce something useful.

With the invention of LLMs, our role has split in two. A minority of hackers continue hacking DNN architectures with huge numbers of parameters and prohibitive data set sizes and training costs. And the rest of the world, now probably in the hundreds of millions, is entitled to pick up the model checkpoints that the Hackers leave for us on Hugging Face and to fuck around with the models, trying to force them to do something useful.

We call it “prompt engineering” because we don’t want to honestly admit, even to ourselves, that most of us have no way to play in the premier league of DNN training and that our activity no longer has anything to do with either data or science. We are prompt engineers! It sounds like a reputable and fun job, and it can definitely be used to bring business value, but I wonder whether we have adequate training for it and whether social workers, child psychologists, and teachers would be better suited for the task.

We don’t need data sets any more. We don’t do training. We don’t need science. All we need is empathy, the ability to write eloquently in our mother tongue, creativity, and a lot of patience.

The future, maybe?

At some point, someone will create a way to train models with 1e9 to 1e11 parameters on a cheap, Raspberry-Pi-priced chip. That would be our cue to become Hackers again.

Or we just ignore the hype and continue doing our data science things with Gaussian processes, variational inference, structured data, and the good old linear regression.
