How to survive M&A

So your company has been acquired or merged with another one and you want to know what to expect and how to survive it.

Nobody survives M&A

Your job as you know it, with its responsibilities, tasks, problems, goals, purpose, promotion and salary plans, is over.

The smartest of us leave the company mere days after the official announcement. These are the people who were mindful enough to keep active job options open, so that they could switch immediately.

The dumbest of us will stay loyal to the old company values, structure, and purpose, and fight against the new management. They will be let go first.

The unluckiest of us will be let go because of the company reorganization. And expect to have several massive waves of structural changes over a couple of years.

And the rest of us will be moved, reorganized, and get a new job, with new responsibilities, problems, goals and promotion prospects.

So, Maxim, why do you write this article with a clickbaity title if nobody can survive M&A?

But you can save yourself

You are not your job. You don’t have control over M&A, but you have control over your reaction to it. This new situation challenges you as a person, and it is entirely up to you whether you emerge a better or a worse person after it’s over.

Don’t blame

M&A is based on lies. Management promises a lot and doesn’t keep the promise. Management would lie to you looking straight into your eyes. Management would keep secrets.

Don’t blame them because you would probably do the same in their role.

Because everybody lies.

We lie to our parents, to our spouse, to our children, to our friends, and yes, to our employers. Lying is part of being an adult.

Yet, it makes us very angry if somebody lies to us. Funny fact: we never feel as guilty when we ourselves lie to other people.

It is because we always have good reasons to lie.

So be a man and give management the benefit of the doubt. They probably also have good reasons to lie.

Don’t revolt

It is easy to blame capitalism for the anger you feel. It is easy to hope that your job will somehow stay the same if you only use EU labor laws and policies, create a Betriebsrat (works council), organize a strike, consult your lawyer, etc.

Yes, establishing a Betriebsrat or using labor laws to your advantage could sometimes be a good idea for you personally.

But unless you were already anti-capitalist before, don’t change your political opinions just because of M&A. Everything has good and bad sides. The upside of capitalism was your freedom to create cool new products and services for people, and get a reasonable salary and other perks for it. The downside of capitalism was always the risk of losing your job.

In Germany, they make it especially hard for us, because the worker protection laws trick us into believing our job is somehow safe. The fact is that we can always be released from duty for no plausible reason, effective immediately. Yes, in Germany. They just need to pay us a severance.

So a Betriebsrat and labor law lawyers will not help you get your old job back. At best, you will get better compensation.

It is up to you how you use your free time: spend it on labor law activities, or on looking for a new job (or on your family and friends if you can accept the new reality and new company culture).

Don’t quit

Do not leave your current employer just because you are dissatisfied with the changes caused by M&A. Only switch to another job if it is more attractive to you.

The truth is that almost every job offers a way to improve your skills, upgrade your CV with relevant keywords or just enjoy a reasonable salary. Even if your job has been changed by M&A, you can always find ways to benefit personally from it. In most cases, there is no immediate reason to leave without having secured a new, more attractive job first. Even if you were demoted.

Also, try to look at it from the perspective of the company. If your job has been changed, it means there is a new belief that you will bring more value to the company in your new position. Isn’t bringing more value to the company good for everyone?

Don’t take it personally

When we have been laid off, we tend to linger for too long in the past, analyzing what went wrong, wondering whether the decision to let us go was reasonable and good for the company, hating the new management, listening to rumors from ex-coworkers, feeling Schadenfreude about each failure of the new management.

Know this: these destructive feelings and thoughts might stay with you for months and years, and really take a toll on your health (gaining 20 kg of weight included). Or they can be gone in a week, if your new job is interesting, promising and fulfilling.

If you ever wanted to try something new and to change the trajectory of your career: this is God giving you a chance!

Don’t kiss the ass (too much)

Some people think they have a good chance for promotion during an M&A, and so they kiss the ass of the new owners in an especially passionate manner. They march through the office floors and proclaim the new policies and goals with seemingly the same authentic and honest look they had when proclaiming the opposite policies and goals merely a year ago.

When asked, they might even tell you that they have honestly changed their opinion.

Be honest with yourself. Your integrity is more important than a promotion.

Embrace changes

In your old job, your old ways of doing things had their reasons and merits. This doesn’t mean the new way of doing things after the M&A is unreasonable and meaningless.

Be open to trying new ways. Gain experience in doing things differently. You might be surprised how some old beliefs previously considered to be cornerstones of the company culture turn out to be replaceable.

Enjoy the pinnacle

Everything is at its pinnacle shortly before it disappears. Mechanical clocks, Hi-Fi racks, sabers and swords. Enjoy the pinnacle of your product, of your company culture, while you can.

For illustration, just compare this amazing piece of marketing art (before / during M&A) and this cringe ad in Check24 style (after M&A).

Summary

The first M&A in your life is always very hard. You will survive it though, and probably even become a better person in the process.

How to design hard software

This is Part 2 of the series about hard software. Read Part 1 here.

Stop saying “architecture”

The first step to healing is to accept the problem.

I have a problem with the term “software architecture”. Architecture implicitly has the following properties:

  • It is “the” design step that is ultimately responsible for the quality, performance, price, and fitness of the end product. If we build a house, the architect interviews the clients and takes everything into consideration: the height of all family members, whether new family members are planned, whether the clients work from home, how many cars they have, even whether they are left- or right-handed. As a result, the clients can expect a property that is perfectly suited for them, comfortable to use, without unnecessary details, reasonably priced and future-proof.
  • So the architecture is able to predict the future and to take all relevant aspects into consideration.
  • It is notoriously hard to change the architecture once we’ve started to build.
  • It happens before we start anything else. It is the root, the supreme source of all other decisions and activities.

And all of these properties are exactly what we don’t want in hard software, because it doesn’t work like this. By definition, when developing hard software, we cannot predict the future, cannot take all relevant aspects into consideration and don’t even know the full list of all relevant aspects. On the other hand, in software it is often not so expensive to make changes.

Start with going live

It is impossible to tell what part of our hard software will cause problems in real production conditions. We can’t even be sure if our software is hard at all, or just yet another undemanding app.

Note: “Undemanding app” might sound harsh and unfair, given the fact that most of the software out there is undemanding. But as a matter of fact, it is a very lucky case, because it means achieving business goals and following our company vision with minimal effort. The best case we as software engineers should wish for is our software being dull and uneventful.

So our first priority will be developing the software and going live, to get the real load (if it is a web site), to get the real data quantity and quality (if it is a data pipeline), to get real usage patterns and usage peaks (if it is real-time software), etc.

After we go live, we observe the software and determine where the performance bottleneck is (or other kinds of problems, for example high battery consumption in IoT scenarios).

Then we “just” solve the bottleneck, and release a new version of the software.

And we repeat it, until all stakeholders are satisfied.

You may ask, how should we develop the first version of our software without a diagram with boxes and arrows?

Agree on System Metaphor

If we look at many different software architectures (the boxes and arrows diagrams), we will find out that there are just a couple of common patterns, ideas or themes that are repeated and used again and again. I think this is what the agile founding father Kent Beck called System Metaphor in Extreme Programming. It is a short story, better just one sentence, giving the idea of the architecture.

Again, I’ll give some examples.

The system metaphor of REST is: we will re-use HTTP for CRUD API, and model all web service interactions in terms of CRUD. We do that because we want our web services to be easily:

  • load balanced,
  • proxied,
  • developed with standard web frameworks,
  • debugged,
  • monitored,
  • logged, and
  • deployed (think firewall configuration).

We pay for this with HTTP protocol overheads.

Some system metaphors of ClickHouse are:

  • We store tables by columns and not by rows. Because columns contain similar data and rows contain dissimilar data, columns can be compressed better, so that we need to wait less time when reading data from the HDD. Also, we don’t need to read columns that the query doesn’t need.
  • We implement only the part of SQL that can be done with extremely good performance on big data sets, and we extend SQL with non-standard features that give the user more ways to ensure very good query performance.
  • Our only way to success on the database market is ultimate, uncompromising, best-in-class run-time performance. Therefore we write everything in C++. It is harder to write, but it gives the best run-time performance.
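
The compression argument from the first bullet can be tried out with a small Python sketch. The table here is synthetic and deliberately constructed to make the effect visible (three made-up low-cardinality columns with periodic values), and zlib merely stands in for whatever codec a real column store would use, so the numbers say nothing about ClickHouse itself.

```python
import zlib

N = 1000  # number of rows in the synthetic table

# Three hypothetical low-cardinality columns. Each column repeats its own
# small set of values, so values within one column are similar to each other.
col_a = ["abcdefg"[i % 7] for i in range(N)]
col_b = ["ABCDEFGHIJK"[i % 11] for i in range(N)]
col_c = ["nopqrstuvwxyz"[i % 13] for i in range(N)]

# Row-wise layout: a1 b1 c1 a2 b2 c2 ...
row_wise = "".join(a + b + c for a, b, c in zip(col_a, col_b, col_c)).encode()

# Column-wise layout: a1 a2 ... aN b1 b2 ... bN c1 c2 ... cN
column_wise = ("".join(col_a) + "".join(col_b) + "".join(col_c)).encode()

row_size = len(zlib.compress(row_wise))
column_size = len(zlib.compress(column_wise))

print(f"raw size in both layouts: {len(row_wise)} bytes")
print(f"row-wise, compressed:     {row_size} bytes")
print(f"column-wise, compressed:  {column_size} bytes")
```

Both layouts contain exactly the same characters. Only the order differs, and that alone decides how many repetitions the compressor can find.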

One of the system metaphors of Unix is: everything is a file.

The system metaphor for the initial version of our foobarbaz app (see Part 1) is: just a standard Flask website with frontend using React and backend using Python, hosted on our on-premises Linux server.

In my opinion, if we have a team capable of creating hard software, the sentence above is all we need to start developing the first version of the foobarbaz software. If our team is one fullstack developer, they probably wouldn’t even need anything else. If our team has separate front-end and back-end developers, they would probably need one Confluence page, which they edit collaboratively to decide on names and signatures of the backend web services. And another Confluence page, with PM and Designer together, to define the features, design and user scenarios of the MVP.

Solve problems by shifting Metaphor

We shouldn’t expect that on our very first try we will already find the System Metaphor satisfying all stakeholders. Fortunately, it is often very easy to evolve our Metaphor as we learn about the real-life problems of our software in production.

Something that very often works is shifting our Metaphor along the axis “Run-Time – Dev-Time”:

Usually we want to start quite far right on this axis, because the less time and money we spend on development of the first version, the more time and money we will have to find the right system metaphor iteratively.

Let me say it again. The overall time and money for every project is limited, either explicitly or implicitly. The less time and money we spend on development of the first version, the more time and money we will have to find the right system metaphor iteratively.

As soon as we identify a problem in production, we can move left on this axis, a notch or two.

If we have been using Python or Ruby, we can switch to Go or Java (one notch) or to Rust or C++ (two notches).

If we have been using a naive implementation of our algorithm, we can optimize it, using better data structures, better algorithms, highly optimized libraries etc.
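
A minimal sketch of such a notch, assuming a made-up task (filtering incoming ids against a list of known ids): the two functions below return the same result, but the second one swaps the data structure. All names and data are hypothetical.

```python
def find_known_naive(ids, known_ids):
    """O(len(ids) * len(known_ids)): scans the whole list for every lookup."""
    return [i for i in ids if i in known_ids]


def find_known_fast(ids, known_ids):
    """O(len(ids) + len(known_ids)): builds a hash set once,
    then every lookup takes constant time on average."""
    known = set(known_ids)
    return [i for i in ids if i in known]


known_ids = list(range(0, 10_000, 2))  # hypothetical data: the even numbers
incoming = list(range(200))

assert find_known_naive(incoming, known_ids) == find_known_fast(incoming, known_ids)
```

The naive version is perfectly fine for the first release. We only pay the extra development effort once production shows that this lookup is the bottleneck.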

We can move activities out of runtime: compiling the code, pre-aggregating, pre-rendering, minifying JavaScript, etc.

We can use a proper database, requiring more effort in development for defining schemas and indexes, instead of throwing everything into a NoSQL document store.

The other two axes we can often move our System Metaphor on are Scale-Up and Parallelize (Scale Out):

Usually we don’t need to move the whole software to the left. Caches are an affordable solution to put frequently used data into RAM while still benefiting from cheap and persistent HDDs. In a microservice architecture, different services can get different hardware.
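
As an illustration of the caching idea, here is a sketch using functools.lru_cache from the Python standard library. The function read_from_disk is a hypothetical stand-in for any slow storage access, and the cache size of 1024 entries is an arbitrary assumption.

```python
from functools import lru_cache

calls_to_slow_storage = 0


def read_from_disk(key: str) -> str:
    """Hypothetical stand-in for a slow read from an HDD or a remote service."""
    global calls_to_slow_storage
    calls_to_slow_storage += 1
    return key.upper()


@lru_cache(maxsize=1024)  # keep up to 1024 recently used results in RAM
def read_cached(key: str) -> str:
    return read_from_disk(key)


for key in ["foo", "bar", "foo", "foo", "bar"]:
    read_cached(key)

# Only the first "foo" and the first "bar" reach the slow storage.
print(calls_to_slow_storage)  # 2
```

Only the hot data ends up in RAM. Everything else stays on the cheap medium, which is why this is usually much cheaper than moving the whole dataset to the left on the axis.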

The scale-out / parallelization goes within one server as well as across several servers:

Experience is the king

Conway observed that the architecture of big software systems replicates the organizational structure (Conway’s law). Not only that: successful software also replicates the past experiences of the team members.

Nothing is as expensive as lessons learned in operating some software in production. When we choose our System Metaphor, often we must decide about competing technologies A and B. If we have team members having years of experience operating B, the technology A has to have some exceptional, remarkable advantages over B to be evaluated. Also, when people have 5 years of B and 2 years of C, we should tend to B. See also the idea of innovation tokens.

The most essential part of a software architect’s job is to have broad knowledge of all possible technologies and “notches” we can use when shifting our metaphor. Ideally, they are experienced in operating some of these technologies in hard software, in production, and have cursory knowledge of most of the other “notches”.

Summary

  • Hard software is not ready and cannot be ready before it goes to production.
  • We cannot develop hard software just like any other software.
  • We don’t know what technology we’ll end up using before we finish developing hard software.
  • Hard software needs a team that can freely navigate from the deepest depths of memory allocation, assembler and kernel space code over business logic, database indexes up to the top of JavaScript and CSS, and in the width from optimizing software on one core by choosing appropriate data structures to running software on 100 cores by using Kubernetes.
  • Hard software is expensive to develop, and therefore it creates a UVP that is hard to copy.

Hard Software Architecture

What is hard software?

Whenever we need to process huge amounts of data, or perform millions of transactions per second, or our hardware resources are very limited, or our networks are slow and unreliable (IoT), or we need to perform a huge amount of calculations in real time (think about computer games), our approach to software design can be very different from the usual, undemanding software.

I’d call these use-cases “hard software”, for lack of a better term, and even though they are very different, they do share a couple of common principles.

Part 1. Antipatterns

No plan survives first contact with the enemy.

The usual approach of designing software architecture by drawing boxes and arrows and defining interfaces between the blocks is not very helpful for hard software, because the black box design and abstractions don’t work there.

I’ll give some examples.

Virtual memory is a cool concept, because we can have an unlimited amount of it. We just swap unused data to disk. And when it is needed again, we just load it back. The idea was very popular 30 years ago. It was especially appreciated by sales, because they could specify much smaller hardware requirements for their software. As a result, we installed their software on the minimal supported configuration and waited minutes after every mouse click, listening to the various sounds of the hard drive and waiting for the fucking swapping to finish. Nowadays, swapping is either turned off by default, or configured so that it only kicks in in an emergency situation.

Everyone who has ever developed a website knows these two pitfalls. We develop it on localhost and everything is fine, and we deploy it to hosting, and it is still fine, and later we see how our website behaves on a 16 MBit/s DSL line (spoiler alert: not good at all). Or, we release our new web app, and it works well and is responsive, and then we publish a link to it on Hacker News or pay for adverts, and boom, our webserver is down. The HTTP protocol abstracts away how far away the server hosting the website is and how much compute this website requires per user, and this abstraction often comes back to bite us.

malloc and free are two simple abstractions to allocate and free memory on the heap. They just need to keep track of which memory addresses are currently used and which are free. What can go wrong? Let’s say we have in total 1024 bytes of memory and want to use it to store 8 objects, 128 bytes each. So we allocate the memory and everything is fine. Now we don’t need every second object, so we free the memory of 4 of those objects. We now have 512 bytes free in total, but we cannot allocate memory for another object with a size of 256 bytes, because all of our free memory is fragmented. We also cannot defragment the memory and move the used chunks together, because somebody is already storing pointers to our objects and we don’t know where. So people start inventing their own memory allocators with different arenas, using fixed-size buffers on the stack, etc. Or people switch to garbage collection and prohibit pointers. But this abstraction doesn’t work in hard software either, causing random pauses while the GC collects garbage and defragments the memory.
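
The fragmentation scenario above can be replayed with a toy model. This is not how a real allocator is implemented; it is a minimal sketch that only tracks which of the eight 128-byte blocks are in use, with all names made up for the illustration.

```python
HEAP_SIZE = 1024
OBJECT_SIZE = 128

# The heap as a list of [start, size, is_used] blocks, in address order.
heap = [[start, OBJECT_SIZE, True] for start in range(0, HEAP_SIZE, OBJECT_SIZE)]

# Free every second object.
for block in heap[1::2]:
    block[2] = False


def total_free(blocks):
    """Sum of all free bytes, no matter where they are."""
    return sum(size for _, size, used in blocks if not used)


def largest_free_run(blocks):
    """Size of the largest contiguous free region (adjacent free blocks merge)."""
    best = run = 0
    for _, size, used in blocks:
        run = 0 if used else run + size
        best = max(best, run)
    return best


def can_allocate(blocks, size):
    """An allocation needs one contiguous region of at least `size` bytes."""
    return largest_free_run(blocks) >= size


print(total_free(heap))         # 512
print(largest_free_run(heap))   # 128
print(can_allocate(heap, 256))  # False
```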

Let’s say we have a URL with parameters in its query string and we need to parse it and extract the values of the parameters. Undemanding software could do it like this (without security and error handling):

def parse_params(query_string: str) -> dict:
  params = dict()                 # memory allocation
  pairs = query_string.split('&') # memory allocations, copy all
                                  # characters of query_string,
                                  # except &
  for pair in pairs:
    tmp = pair.split('=')         # memory allocations, copy all
                                  # characters except =
    params[tmp[0]] = tmp[1]       # stores references to the two new
                                  # strings; the dict may reallocate
                                  # as it grows
  return params

Python provides very simple, comfortable and powerful abstractions for data handling (and it is popular precisely because of that), but looking closely, at the end of this function we hold roughly 3 copies of the query_string in RAM (the original, the pairs, and the keys and values), and we have performed a lot of memory allocations and deallocations. We can’t just ignore the abstraction costs if this function gets called many times, or our hardware is limited, etc.

Here is how hard software would do it (again, without error and security handling and probably crashing in special cases):

#include <stddef.h>  // size_t

typedef struct params {
  char* user_id;
  char* page_id;
  char* tab;
} query_params;

// Parses query_string in place: each '&' is overwritten with a 0 byte
// and the output fields point into the caller's buffer.
// The caller should zero-initialize *output, because fields of missing
// parameters are left untouched.
void parse_params(char* query_string, query_params* output) {
  char* tmp;
  int at_end;

  // we assume that the order of parameters is always the same
  // and it corresponds to the order of the fields in query_params

  for(size_t i=0; i < sizeof(query_params)/sizeof(char*); ++i) {
    // search for next '='
    while(*query_string && *query_string != '=')
      query_string++;
    if (!*query_string)
      return;

    // store the beginning of the parameter value
    tmp = ++query_string;

    // now we just need to terminate it with 0
    // so we search for next '&'
    while(*query_string && *query_string != '&')
      query_string++;

    // either found the '&' or reached the end of the input
    at_end = (*query_string == 0);
    *query_string = 0;

    // now tmp is a pointer to a valid C string, we can output it
    *((char**)output+i) = tmp;

    // never step past the terminating 0 of the input
    if (at_end)
      return;
    query_string++;
  }
}

Look, ma: no memory allocations! No string copies!
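
For comparison, a similar idea can be approximated in Python with memoryview, whose slices point into the original buffer instead of copying it. This is only a sketch: it still allocates the dict, the small view objects and a copy of each (short) key, so it reduces copying rather than eliminating allocations. The function name and the sample query string are made up.

```python
def parse_params_views(query_string: bytes) -> dict:
    """Return {name: value}, where each value is a memoryview slice.

    Slicing a memoryview does not copy the underlying bytes; each slice
    is a small object that points into the original buffer.
    """
    buf = memoryview(query_string)
    params = {}
    pos = 0
    end = len(query_string)
    while pos < end:
        amp = query_string.find(b"&", pos)
        if amp == -1:
            amp = end
        eq = query_string.find(b"=", pos, amp)
        if eq != -1:
            # only the short key is copied, the value stays a view
            params[bytes(buf[pos:eq])] = buf[eq + 1:amp]
        pos = amp + 1
    return params


raw = b"user_id=42&page_id=7&tab=settings"
params = parse_params_views(raw)
print(bytes(params[b"tab"]))  # b'settings'
```

Unlike the C version, this one does not modify the input buffer and does not depend on the order of the parameters.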

By drawing boxes and arrows in our architecture design, we assume interfaces, and interfaces assume abstractions, and abstractions don’t always work.

Another reason for boxes and arrows not working is that we cannot know beforehand where exactly our hard software will have problems, so we don’t know what boxes to draw.

I’ll also give some examples.

We are developing a foobarbaz app, and so we have a frontend that is collecting foo from users. It then sends it to the backend web service, which calculates barbaz and returns it synchronously back to the frontend.

We deploy it to production and pay Google to bring some users to the web site. We realize then that calculating barbaz takes 5 minutes on the hardware we can afford, and so most users won’t wait that long and their browsers just time out.

So now we need to reshuffle our boxes: the frontend will create a job and write it to a database or message bus, and then return to the user immediately and say “we will send you an email when your barbaz is ready”. We also probably need to add some more boxes or services, because we are now handling emails and we have to be GDPR compliant. Also, our barbaz is not a web service any more, but just some worker fetching jobs and storing results.
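
A minimal in-process sketch of this reshuffle, assuming an in-memory queue.Queue as a stand-in for the database table or message bus, a dict as a stand-in for the result storage, and a placeholder calculate_barbaz function (all hypothetical, and sending the email is left out):

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for the database table or message bus
results = {}          # stand-in for the place where results are stored


def calculate_barbaz(foo: str) -> str:
    """Hypothetical placeholder for the slow barbaz calculation."""
    return foo[::-1]


def handle_request(job_id: int, foo: str) -> str:
    """Frontend handler: enqueue the job and return immediately."""
    jobs.put((job_id, foo))
    return "We will send you an email when your barbaz is ready"


def worker() -> None:
    """Background worker: fetch jobs, compute, store results."""
    while True:
        job = jobs.get()
        if job is None:  # sentinel: shut down
            jobs.task_done()
            break
        job_id, foo = job
        results[job_id] = calculate_barbaz(foo)
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

reply = handle_request(1, "foo")
handle_request(2, "hello")
jobs.join()     # wait until the worker has processed everything so far
jobs.put(None)  # stop the worker
thread.join()
```

The request handler no longer knows how long the calculation takes, and the worker no longer knows anything about HTTP.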

We go live and find out that there are still too many users for our hardware. And even though all users are now receiving the barbaz via email, very few of them are satisfied enough to pay for our premium service. We need to find a way to compute barbaz online.

After some consideration, we suddenly realize that there are only 1024 possible different baz combinations, so if we pre-compute all bazes, then to compute barbaz we just need to calculate bar and then quickly combine it with baz.
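
Sketched in Python, with compute_baz and compute_bar as hypothetical placeholders for the real calculations and a plain addition as a made-up way of combining them, the pre-computation could look like this:

```python
def compute_baz(combination: int) -> int:
    """Hypothetical placeholder for the expensive baz calculation."""
    return sum(i * i for i in range(combination + 1)) % 9973


def compute_bar(foo: int) -> int:
    """Hypothetical placeholder for the cheap, user-specific part."""
    return foo * 31 + 7


# Batch step: pre-compute baz for all 1024 possible combinations.
BAZ_TABLE = [compute_baz(c) for c in range(1024)]


def barbaz_online(foo: int, combination: int) -> int:
    """Request path: compute bar, then combine it with a table lookup."""
    return compute_bar(foo) + BAZ_TABLE[combination]


def barbaz_slow(foo: int, combination: int) -> int:
    """The original approach: compute everything at request time."""
    return compute_bar(foo) + compute_baz(combination)
```

Both functions return the same value, but barbaz_online moves the expensive part out of the request path at the price of 1024 table entries in RAM.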

So we are reshuffling our boxes and arrows again: we keep our job/worker idea just in case we’ll have load peaks, and we resurrect a synchronous web service that calculates barbaz using pre-computed baz, and we run pre-computation of baz as a nightly batch script.

Now we have a modest success and we land our first enterprise customer. They want 10000 foobarbaz calculations per second and they give us money so we can buy more hardware. Now we need to re-arrange our boxes again, and scale out our web service. We’ll probably use Kubernetes for this, and so we’ll throw a whole new bunch of boxes and arrows into our diagram. And by the way, the database we’re using to store jobs can’t handle 10000 inserts per second. Are we going to optimize our table? Switch the table mode to in-memory only? Give it an SSD instead of HDD? Replace it with another database? Replace it with a message bus? Or just shard all incoming requests into 4 streams and handle each stream by a separate, independent installation of our foobarbaz app, on different hardware?
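
The last option, sharding, needs surprisingly little code. The sketch below assumes that every request carries some key (a made-up customer_id) and uses CRC32 as one possible stable hash. How evenly the streams are filled depends on the keys.

```python
import zlib

NUM_SHARDS = 4


def shard_for(customer_id: str) -> int:
    """Map a request key to one of NUM_SHARDS independent installations.

    crc32 is used instead of the built-in hash(), because hash() of a str
    is randomized per process, and all frontends must agree on the shard.
    """
    return zlib.crc32(customer_id.encode("utf-8")) % NUM_SHARDS


requests = [f"customer-{i}" for i in range(1000)]  # hypothetical request keys
streams = {shard: [] for shard in range(NUM_SHARDS)}
for request in requests:
    streams[shard_for(request)].append(request)

for shard, stream in streams.items():
    print(shard, len(stream))
```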

Quickly, rearrange boxes and arrows. Draw more of them.

What should we use instead of boxes and arrows

Boxes and arrows are not bad. I mean, at least they document our current approach for the software architecture.

But they are not very helpful either.

Just having some boxes and arrows diagram doesn’t reduce our risks and doesn’t prevent a potential rewrite in the future. It is just documentation, so it rapidly becomes obsolete. It is “a” way of knowledge sharing, but often it is not “the” best way. It is cheaper and better to share knowledge by other means, for example pair programming, in-person Q&A sessions, or code review.

There should be a better way to design hard software architecture.

(to be continued)

New Features in ClickHouse

I have recently discovered a lot of interesting new features in ClickHouse, which I had missed previously, and so I want to share them with you.

Variant data type

In a Variant column we can store data of different types. We have to specify which types these are when creating the column. In practice, I often have this requirement when importing data from untyped data formats (XML, JSON, YAML). For example,

my_field Variant(Int64, String, Array(Int64), Array(String))

would cover a lot of my use-cases. When selecting the data, we can for example use my_field.Int64 to get either a number, or NULL if the row contains some other type.

Dynamic data type

Dynamic is similar to Variant, but you don’t need to define the possible data types up-front. As soon as you insert some value in this column, the data will be stored along with the data type, and you can use select queries with this data type from then on.

Geo data types

The family of Geo data types consists of Point, Ring, Polygon, and MultiPolygon. I think the work on geo features started about 9 years ago, but now it looks fully developed and production-ready.

Materialized database engines

For many years, ClickHouse has had database engines for Postgres and MySQL. You would connect to ClickHouse and send a query, and ClickHouse would forward it behind the scenes to a running Postgres or MySQL instance and get the answer. This is nice if we want to join ClickHouse tables with some data stored in the other databases.

Now, we have an option to materialize these databases. When you create them, ClickHouse downloads all data from Postgres / MySQL, stores it in its native compressed format, and starts receiving updates from the corresponding master servers, acting basically as a replica.

Very nice.

New table engines

Again, for many years ClickHouse has been able to define a table that gets its data from outside sources like MongoDB, S3, HDFS, Hive, ODBC, Kafka, and RabbitMQ.

New are the following integrations:

  • NATS
  • DeltaLake
  • Redis
  • SQLite
  • And some Apache stuff like Iceberg and Hudi

One of the most interesting table engines is EmbeddedRocksDB, which basically turns ClickHouse into a high-performance key-value store, with full, fast update and delete support. For example, if you have a history table like this

create table MyTable (
  historydatetime DateTime,
  objid Int64,
  some_property String,
  some_other_property Float64
) 
engine = MergeTree
order by some_property
partition by toYYYYMM(historydatetime);

then to retrieve the latest values you had to write something like

select objid, argMax(some_property, historydatetime) last_property,
argMax(some_other_property, historydatetime) last_other_property
from MyTable
group by objid

Now, you can create a key-value store for this:

create table MyTableLast (
  objid Int64,
  some_property String,
  some_other_property Float64
) 
engine = EmbeddedRocksDB()
primary key objid

and organize a view to copy there all incoming data changes:

create materialized view MyTableLastCopy to MyTableLast 
as select objid, some_property, some_other_property from MyTable

The EmbeddedRocksDB table engine implements inserts as updates if the primary key already exists, so to get the latest value, you just need

select objid, some_property, some_other_property
from MyTableLast

You can also use these tables in joins.
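
To see why the two approaches give the same answer, here is the logic replayed in plain Python. It is a model of the semantics described above and not of ClickHouse internals, and the rows are made up. The model also shows one assumption worth keeping in mind: overwriting by key yields the latest value only if rows arrive in chronological order, whereas argMax does not care about arrival order.

```python
from datetime import datetime

# Hypothetical history rows: (historydatetime, objid, some_property)
history = [
    (datetime(2024, 1, 1), 1, "red"),
    (datetime(2024, 1, 2), 2, "green"),
    (datetime(2024, 1, 3), 1, "blue"),
    (datetime(2024, 1, 4), 2, "yellow"),
]


def latest_by_argmax(rows):
    """What the GROUP BY / argMax query computes: scan the whole history."""
    best = {}
    for ts, objid, prop in rows:
        if objid not in best or ts > best[objid][0]:
            best[objid] = (ts, prop)
    return {objid: prop for objid, (_, prop) in best.items()}


def latest_by_upsert(rows):
    """What a key-value table ends up with: every insert overwrites the key."""
    table = {}
    for _, objid, prop in rows:
        table[objid] = prop
    return table


print(latest_by_argmax(history))  # {1: 'blue', 2: 'yellow'}
print(latest_by_upsert(history))  # {1: 'blue', 2: 'yellow'}
```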

UDFs

You can implement a user-defined function (UDF) using any programming language and deploy it to the ClickHouse server, and call it from SQL.

Tukey fences

Yes, I had never heard this term before either, even though I am familiar with this simple method of outlier detection in time series. There is also a function for STL decomposition (into trend, seasonality and noise), as well as for the fast Fourier transform.
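
For those who also haven’t heard the term: Tukey fences mark as outliers all values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR, where Q1 and Q3 are the first and third quartiles and IQR = Q3 - Q1. The following plain Python sketch shows the general method on made-up data. It does not call ClickHouse, and quartile conventions differ slightly between implementations, so results on borderline values may vary.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def outliers(values, k=1.5):
    """All values that lie outside of the fences."""
    lower, upper = tukey_fences(values, k)
    return [v for v in values if v < lower or v > upper]


series = [10, 12, 11, 13, 12, 11, 10, 12, 95]  # made-up measurements
print(tukey_fences(series))  # (7.5, 15.5)
print(outliers(series))      # [95]
```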

More fancy functions

Here, I am just writing down functions whose names contain terms that were unknown to me.

  • A function to work with ULIDs.
  • UniqTheta functions
  • MortonEncode
  • HilbertEncode
  • DetectLanguage
  • And probably the biggest collection of hashes in the world:
    • halfMD5
    • MD4
    • MD5
    • sipHash64
    • sipHash64Keyed
    • sipHash128
    • sipHash128Keyed
    • sipHash128Reference
    • sipHash128ReferenceKeyed
    • cityHash64
    • intHash32
    • intHash64
    • SHA1, SHA224, SHA256, SHA512, SHA512_256
    • BLAKE3
    • URLHash(url[, N])
    • farmFingerprint64
    • farmHash64
    • javaHash
    • javaHashUTF16LE
    • hiveHash
    • metroHash64
    • jumpConsistentHash
    • kostikConsistentHash
    • murmurHash2_32, murmurHash2_64
    • gccMurmurHash
    • kafkaMurmurHash
    • murmurHash3_32, murmurHash3_64
    • murmurHash3_128
    • xxh3
    • xxHash32, xxHash64
    • ngramSimHash
    • ngramSimHashCaseInsensitive
    • ngramSimHashUTF8
    • ngramSimHashCaseInsensitiveUTF8
    • wordShingleSimHash
    • wordShingleSimHashCaseInsensitive
    • wordShingleSimHashUTF8
    • wordShingleSimHashCaseInsensitiveUTF8
    • wyHash64
    • ngramMinHash
    • ngramMinHashCaseInsensitive
    • ngramMinHashUTF8
    • ngramMinHashCaseInsensitiveUTF8
    • ngramMinHashArg
    • ngramMinHashArgCaseInsensitive
    • ngramMinHashArgUTF8
    • ngramMinHashArgCaseInsensitiveUTF8
    • wordShingleMinHash
    • wordShingleMinHashCaseInsensitive
    • wordShingleMinHashUTF8
    • wordShingleMinHashCaseInsensitiveUTF8
    • wordShingleMinHashArg
    • wordShingleMinHashArgCaseInsensitive
    • wordShingleMinHashArgUTF8
    • wordShingleMinHashArgCaseInsensitiveUTF8
    • sqidEncode
    • sqidDecode

chDB

And last but not least, the embedded version of ClickHouse, a direct competitor to DuckDB! If you need all the features of ClickHouse but don’t want to deploy and operate a separate piece of software, you can just embed it into your own software.

Currently, your software can be written in Python, Go, Rust or Node.js.

How cool is that! I can’t believe I missed the announcement of this project! Now I just need a suitable project to try it out.

Summary

Every now and then I go to clickhouse.com or read their release notes, and every time, every single time, I find some cool new useful functionality. ClickHouse is what, about 10 years old now? You would think they’d reach a plateau, become a corporation, release minor updates and generate cashflow per each running core like Microsoft or Oracle. But no, the development speed is not going down, it is increasing every year.

How do they do that!?

Five Reasons for Message Bus

Why might we want to use a message bus? Not something like ZeroMQ, but a dedicated one running on separate hardware, with all the bells and whistles?

Here are my five reasons.

Future-proof data architecture

The first postal service was invented 4500 years ago. The beauty of the postal service is that it has unlimited use-cases. Initially you would use it to send business orders, bank notes and bills. Later it became cheap enough, and people literate enough, to write personal and romantic letters, or even to play chess by post. Nowadays the post is primarily used as an additional authentication method for MFA, and to distribute plastic cards and voting ballots.

Message buses offer the same beauty with regard to how the data is processed. The publisher sends the data, and it doesn’t know if it will be processed at all (or dropped immediately after arrival) and if yes, then how.

This is a huge enabler for decoupled, independent, and therefore rapid software development. For this to work, the only requirement for the producer is not to implement some specific use-case. For example, if the producer has 10 data fields, and developers know about one consumer that would only need 3 fields, a common anti-pattern would be to send only these 3 fields to the message bus.

It is anti-pattern, because the loss of possible future data usages costs more than the gain of processing less data (and therefore using cheaper hardware).

Fan-out

To be able to add new consumers for the same data, we need to organize a fan-out: the ability of different consumers to consume the same data. This can be implemented either by copying the data (e.g. RabbitMQ exchanges and queues) or by creating a persistent message log (Kafka, RabbitMQ streams).

Without the fan-out, the producer would need to send the data to each consumer itself, so it would need to know about the existence of all consumers, which contradicts the “future-proof” principle above. Even though the producer could then send different fields to different consumers (thereby reducing hardware requirements), any saving in hardware cost would be more than outweighed by the loss caused by waiting for the producer team to implement, test and roll out support for every new consumer.
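To make the copying flavor of fan-out concrete, here is a minimal in-memory sketch. It is not how RabbitMQ or Kafka are implemented; the class FanOutExchange, the consumer names and the message fields are all made up for illustration.

```python
from collections import deque


class FanOutExchange:
    """Toy in-memory fan-out: every bound queue receives its own copy of each message."""

    def __init__(self):
        self.queues = {}

    def bind(self, consumer_name):
        # A new consumer registers itself; the producer is not involved.
        queue = deque()
        self.queues[consumer_name] = queue
        return queue

    def publish(self, message):
        for queue in self.queues.values():
            queue.append(dict(message))  # copy, so consumers cannot affect each other


exchange = FanOutExchange()
billing_queue = exchange.bind("billing")
analytics_queue = exchange.bind("analytics")

# The producer publishes all of its fields, not only those one consumer needs.
exchange.publish({"order_id": 17, "customer": "alice", "amount": 99.5, "currency": "EUR"})

# Each consumer picks the fields it cares about from its own copy.
billing_view = {key: billing_queue[0][key] for key in ("order_id", "amount")}
analytics_view = {key: analytics_queue[0][key] for key in ("customer", "currency")}
```

Adding a third consumer later is one more bind() call; the publish() call of the producer stays untouched.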

Real-time “push” data processing

The two previous chapters have implicitly assumed that the alternative to a message bus is a direct web service call (producer -> consumer). We now consider another alternative: storing the data in a database.

The advantage of this approach is the ability to process the data in richer ways (eg. query across different data messages, aggregate, join messages with some dictionary data or correlate messages).

The disadvantage is that many databases don’t support pushing events to the clients, so messages cannot be processed in real time. Some databases have triggers, so that data changes can at least be processed with SQL or whatever functions are available inside the trigger. Message buses don’t impose any limitations on the consumers, so data can be processed in any programming language, using any framework and any hardware (we could use LLMs to process the data and do that on a huge, expensive GPU server cluster).

Some databases do not support in-memory-only tables, so the insert rate might be limited by the database insisting on writing all the data to a persistent medium. Kafka has the same problem, but RabbitMQ and other message buses let the admins choose whether the messages are persisted. Therefore, a much higher byte throughput can be achieved if needed.

Fault Tolerance

Decoupling faults and crashes of consumers from the producer is the primary use-case for message buses. Also, rollouts of new versions of consumers do not influence the stability of the producer.

For this to work properly, we need to bust some myths and understand how to choose the hardware for the message bus.

Myth: performance decoupling

Sometimes, a message bus is seen as a way to decouple the performance of the producer from the performance of the consumer. “We don’t want the user to wait until this long-running process P is finished, so instead of directly calling the web service implementing P, we will just publish a message, return to the user immediately, and let P be processed in the background”.

This works only if the duration of P is less than or equal to the average time span between two consecutive messages published on the bus.

For example, if we publish a message every second (1000ms), then the consumer for P has enough time to process the data (200ms). If we publish every 200ms, the queue will still be empty most of the time, because the consumer needs exactly these 200ms to process each message. Remember the school problem with a bathtub and two pipes? This is exactly it. Finally, after all these years, you can apply this school math at work.

In practice we can allow spikes, for example, 100 messages published in the same millisecond. The consumer would need 20 seconds to process all of them, and if no new messages arrive in the next 20 seconds, the queue will become empty again. This spike handling is also one of the reasons why people employ dedicated message buses.

But in the general case, we cannot allow a situation where the consumer is consistently slower than the incoming message rate. In this case, the queue will just grow until the hardware resources are depleted, and then the message bus will either crash or throttle incoming messages.
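The bathtub math can be played through with a few lines of Python. This is a simplified sketch under my own assumptions: messages arrive at perfectly regular intervals, there is exactly one consumer, and every message takes the same time to process. The function name backlog_after is made up.

```python
def backlog_after(arrival_interval_ms, processing_ms, duration_ms):
    """Messages still waiting or in progress after duration_ms, with a single consumer."""
    # Messages arrive at t = 0, interval, 2 * interval, ... strictly before duration_ms.
    arrived = (duration_ms + arrival_interval_ms - 1) // arrival_interval_ms
    finished = 0
    consumer_free_at = 0
    for i in range(arrived):
        # The consumer starts when the message is there AND the previous one is done.
        start = max(i * arrival_interval_ms, consumer_free_at)
        consumer_free_at = start + processing_ms
        if consumer_free_at <= duration_ms:
            finished += 1
    return arrived - finished


one_minute = 60_000
relaxed = backlog_after(1000, 200, one_minute)    # a message every second
at_the_limit = backlog_after(200, 200, one_minute)  # a message every 200ms
overloaded = backlog_after(50, 200, one_minute)   # a message every 50ms
```

With these numbers the first two backlogs are empty after one minute, while the third one has already piled up 900 messages and keeps growing by 15 messages every second.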

Myth: just add consumers

So one consumer takes 200ms for the process P, and we have new messages coming every 50ms. Surely we can just start 4 consumers for the same queue to work in parallel and so solve the problem?

Yes and no.

Consumers having only one input (from the message bus) and no outputs are quite pointless. As a rule, a consumer might have more inputs (e.g. reading some other data from a database) and at least one output (sending an e-mail, calling some web service, storing data in a database or sending it back to the message bus). These inputs and outputs don’t scale indefinitely, so increasing the number of consumers from 1 to 4 might saturate one of them and therefore might not yield the desired scaling effect.
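A back-of-the-envelope model of this effect, assuming (my simplification) that every processed message needs exactly one operation on a shared downstream resource with a hard limit. The function name and the limit of 12 operations per second are invented for the example.

```python
def effective_throughput(consumers, processing_ms, downstream_limit_per_s):
    """Messages per second the whole consumer group can really process."""
    consumer_capacity = consumers * 1000 / processing_ms
    # The slowest part of the chain decides.
    return min(consumer_capacity, downstream_limit_per_s)


incoming_per_s = 1000 / 50  # a new message every 50ms

one_consumer = effective_throughput(1, 200, downstream_limit_per_s=12)
four_consumers = effective_throughput(4, 200, downstream_limit_per_s=12)
four_consumers_fast_db = effective_throughput(4, 200, downstream_limit_per_s=100)
```

With a database that can only take 12 writes per second, four consumers still lose against 20 incoming messages per second. Only when the downstream resource scales too do the four consumers reach the needed rate.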

Choosing hardware

In normal operation, a dedicated message bus is a pure waste of hardware: the queues are always empty and the CPU is barely loaded.

We have to choose hardware for the worst-case scenario. This can be a complete outage of the consumers for at least three days (a weekend until the problem is detected, and a day for the devops to actually fix the problem and restart the consumers). So just estimate the average amount of data published in three days, in bytes, and provide that much RAM (plus overhead for the server and OS), or that much disk space if you’re using persistent logs.
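The estimate is one multiplication. A small sketch, where the message rate and the average message size are invented numbers that you would replace with your own measurements:

```python
def buffer_bytes_needed(messages_per_second, avg_message_bytes, outage_days=3):
    """Bytes the bus must be able to hold if no consumer runs for outage_days."""
    outage_seconds = outage_days * 24 * 60 * 60
    return messages_per_second * avg_message_bytes * outage_seconds


# Assumed load: 20 messages per second, 2000 bytes each.
needed = buffer_bytes_needed(messages_per_second=20, avg_message_bytes=2000)
needed_gb = needed / 1_000_000_000
```

For this assumed load the result is a bit more than 10 GB, before adding the overhead for the server and the OS.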

Again, paying for hardware here is more than compensated by the advantages discussed above (short time-to-market, less project management and team coordination, fault tolerance etc).

Monitoring and UI

In real life, consumers will not take exactly the same amount of time (say, 200ms) to process each message. Instead, some messages will take longer than others. Also, consumers might be limited by some other resource like a database or a GPU server running ML models, so if these resources are under more load, the consumers become slower too.

Therefore, one of the core features a dedicated message bus must provide for successful operations is monitoring and UI.

We want to be notified if a queue has grown over some threshold and still tends to grow further. We also want to be notified if no incoming messages are being published. And we want to be able to quickly see in the UI what is happening (does the queue have consumers? How many? How big are the messages? Is the message bus already throttling the producers?).
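The two notifications could be sketched like this. Real message buses expose such numbers through their own monitoring interfaces; here I simply assume that somebody has already collected a few samples at regular intervals (oldest first, at least one sample), and the function name queue_alerts is made up.

```python
def queue_alerts(queue_lengths, published_counts, threshold):
    """Return alert texts for one queue, based on recent samples (oldest first)."""
    alerts = []
    above_threshold = queue_lengths[-1] > threshold
    still_growing = queue_lengths[-1] > queue_lengths[0]
    if above_threshold and still_growing:
        alerts.append("queue above threshold and still growing")
    if sum(published_counts) == 0:
        alerts.append("no incoming messages")
    return alerts


healthy = queue_alerts([3, 0, 5, 1], [120, 118, 125, 119], threshold=1000)
piling_up = queue_alerts([800, 1500, 2600, 4100], [120, 118, 125, 119], threshold=1000)
draining = queue_alerts([9000, 6000, 3000, 1200], [120, 118, 125, 119], threshold=1000)
silent = queue_alerts([0, 0, 0, 0], [0, 0, 0, 0], threshold=1000)
```

Note that the draining queue is above the threshold but raises no alert, because it is shrinking: the consumers are catching up after a spike.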

Trivial Scaling

If we choose the hardware properly and operate the message bus professionally, and our consumers are quicker than the incoming message rate most of the time, scaling of the message bus itself is trivial: whenever the performance of one message bus server is not enough, we just provision a second identical machine, install a second message bus there and put a TCP load balancer (nginx, Traefik, or their cloud alternatives) in front of them.

These two nodes don’t even need to know about each other’s existence and don’t need to communicate with each other. Here, message buses have an advantage over databases. If we shard the same table over different ClickHouse nodes and then run some simple query like “SELECT count() FROM Table”, each node executes the query on its own, and then the nodes need to talk to each other to combine (sum) the partial results into the final answer. This is not the case for message buses, because each consumer concentrates on only one queue, one message at a time, so there are no queries spanning messages or queues.

Summary

Message buses should be used when rapid software development and short time-to-market are essential, for the systems that require high scalability, fault tolerance, and decoupled, independent data processing. They are particularly beneficial in environments where data needs to be consumed by multiple services or in real-time scenarios where immediate processing is critical. Despite the upfront costs of dedicated hardware and implementation, the long-term benefits of rapid software development, reduced system dependencies, and robust fault management significantly outweigh these expenses. Message buses streamline complex data flows, enhance reliability, and simplify scaling, making them a valuable investment for modern, dynamic architectures.

Data literacy etude

I can’t play chess. My brain is just not compatible with it. But I love solving chess etudes.

Data literacy is more important than chess. Especially now, when everybody is trying to give their opinion more weight by referencing some (random or not so random) study with some solid-looking graphs.

It is easy to use data to fool other people and it is easy to fool ourselves, by reading the data wrong. That’s why everybody should learn to read data.

Today I want to show you a beautiful data visualization that is simple enough to be used as an etude:

Take your time to answer the questions:

  • Where are the outliers (values different from their surroundings), and what other insights can you get from it?
  • How could they possibly be explained?

[SCROLL]

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

.

This is what I could see in the data:

  • Summer and early autumn are more popular for weddings in England (probably because of the best weather).
  • February 14th is also popular, being Valentine’s Day.
  • Christmas and New Year holidays are naturally unpopular.
  • The 13th of each month is much less popular than other days.
  • The most popular days between July 28th and September 1st are about 7 days apart from each other. I am too lazy to check, but these might be the dates that most often fell on a weekend during 1999-2019.

See also the original post on Reddit with further discussion: https://www.reddit.com/r/dataisbeautiful/comments/1dcpz5l/when_do_people_get_married_oc/

Six principles of debugging

Debugging is an underrated skill. Even though we spend 50% of our time at work on debugging, it is barely taught at universities. The ability to debug might be the most important factor differentiating 10x developers from the rest, and can therefore be a predictor of our job satisfaction and compensation.

Debugging is

  • finding the bug,
  • fixing it,
  • and proving it is fixed (a test).

Unintuitively, debugging always starts with the last step.

Reproduce or measure

If we cannot reproduce a bug, no attempt to fix it can be considered proven. We then can’t tell the customers or stakeholders whether the bug has been fixed or not.

Therefore, take measures to reproduce the bug. Ideally, it should be reproducible with little effort, time, and money.

Some bugs cannot be easily reproduced. For example, when the bug depends on the exact order of execution, or something is randomized, or we depend on some external system, or we use AI with billions of parameters.

Note that operating systems that allow more than one process or thread to run concurrently (i.e. all modern operating systems) are inherent sources of randomization, because we can’t control what other processes and threads are running and what memory we get allocated.

In this case, the bug should at least be measurable. For example, we could write a log line every time our software crashes with an invalid pointer exception, and create a dashboard showing the number of crashes per week. After we implement a potential bugfix, we can at least see how this indicator changes after we roll out the change into production. Better yet, roll it out on some part of the production servers and compare the number of crashes in the old and in the new versions.
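Counting such log lines per software version takes only a few lines. The log format below (version tag first, then date, then message) is purely an assumption for this sketch, as are the function name and the sample lines.

```python
from collections import Counter


def crashes_per_version(log_lines, marker="invalid pointer"):
    """Count crash log lines per version; assumes each line starts with a version tag."""
    counts = Counter()
    for line in log_lines:
        if marker in line:
            version = line.split()[0]
            counts[version] += 1
    return counts


sample_log = [
    "v1 2024-05-01 crash: invalid pointer in render()",
    "v2 2024-05-01 request served",
    "v1 2024-05-02 crash: invalid pointer in render()",
    "v2 2024-05-03 crash: invalid pointer in render()",
    "v1 2024-05-03 request served",
]
crash_counts = crashes_per_version(sample_log)
```

To compare the versions fairly, divide the counts by the number of servers (or requests) running each version, because a partial rollout rarely splits the traffic in half.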

Proving a fix of an ML model is also something that can only be measured and not reproduced. Even if the bug description is very specific and fully reproducible (e.g. “every time I ask it to calculate 1+1 it delivers a wrong answer”), we usually don’t want to fix this exact query, but all possible queries involving addition of numbers. So to measure our bugfixing success, we can generate a synthetic dataset containing a couple of thousand additions, run it through the model, measure the rate of wrong answers, then fix the model and measure the improved error rate again.
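A sketch of such a measurement. The two "models" here are plain Python functions standing in for a real ML model, which would be called through whatever API it offers; buggy_model and its failure pattern are invented for the example.

```python
import random


def addition_error_rate(model, n_cases=2000, seed=0):
    """Share of wrong answers on a synthetic dataset of random additions."""
    rng = random.Random(seed)  # fixed seed: old and new model see the same dataset
    wrong = 0
    for _ in range(n_cases):
        a, b = rng.randint(0, 999), rng.randint(0, 999)
        if model(a, b) != a + b:
            wrong += 1
    return wrong / n_cases


def buggy_model(a, b):
    # Stand-in for the model before the fix: wrong whenever the sum ends with 0.
    return a + b + 1 if (a + b) % 10 == 0 else a + b


def fixed_model(a, b):
    # Stand-in for the model after the fix.
    return a + b


rate_before = addition_error_rate(buggy_model)
rate_after = addition_error_rate(fixed_model)
```

The fixed seed matters: without it, a difference between the two rates could come from the dataset and not from the fix.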

Read the error message

The first step of searching for the bug is to read the error message. Surprisingly many people don’t do it.

I can understand it when my mom doesn’t read error messages: first, she doesn’t speak the language of the error message; second, she has a legitimate expectation that the three basic apps she is using must be bug-free.

But I absolutely can’t accept it when I see software developers not reading error messages. In many cases, most of the clues are already contained in the error message.

Sometimes, error messages are hard to read, often because the printout takes several screens. Do spend some time reducing the noise step by step by reconfiguring what is being logged / printed out.

I once spent a day trying to fix a problem, only to find that the exact description of the problem, along with a suggested solution, had been printed out right in front of my eyes all along, and then immediately scrolled out of the visible area because of some other very chatty Python framework (I am looking at you, pika).
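In Python, this kind of noise reduction is often a one-liner with the standard logging module. The sketch assumes the chatty library logs through loggers whose names start with "pika", which is the usual convention for libraries that call logging.getLogger(__name__); check the logger names in your own output before relying on it.

```python
import logging

# Our own code keeps logging at INFO level.
logging.basicConfig(level=logging.INFO)

# Quiet the chatty library: only warnings and errors from "pika" and its
# child loggers (for example "pika.connection") will be printed from now on.
logging.getLogger("pika").setLevel(logging.WARNING)
```

Raise the level step by step, one library at a time, until the error message you are hunting stays on the screen.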

Generate and Prioritize hypotheses

If the error message doesn’t contain hints for where the bug is, we need to create hypotheses of what it can be. Note the plural: hypotheses. So many people come up with just one possible reason for the bug and immediately try to check it.

Generate a lot of hypotheses. Then use your intuition to start with the most probable one.

Let’s say we can’t connect to the database. Some people would immediately call the DBA and ask them to fix the database. But let’s generate all possible hypotheses first:

  • Hardware failure of the DB server
  • Somebody cut the network cable to the server
  • Somebody has changed the firewall settings so it rejects our connection attempts
  • Somebody is rebooting the server
  • Database process has crashed and is restarting
  • Database process is overloaded and doesn’t accept more connections
  • Database process is hanging
  • Somebody has deleted our DB user or changed our DB password
  • Somebody cut the network cable to our client PC.
  • Our client PC has disconnected from the VPN required to access the DB server
  • Somebody has changed the firewall settings on our client PC and it prevents outbound connections to the server.
  • Somebody has messed with the routing table on our client PC so the traffic goes to the wrong gateway
  • Somebody has messed with the routing table of the router.
  • DB Server SSL certificate is expired.
  • The client SSL software stack is not supported any more
  • Somebody has upgraded the server version and the communication protocol of the new version is not compatible any more with our client
  • Somebody has changed the IP address of the server and this DNS change hasn’t propagated to our client PC yet.
  • Somebody has changed the DNS name of the server and forgot to update it in the secrets that we’re using to connect to the server.
  • We forgot to retrieve secrets to connect to the server

In my experience, the last hypothesis is the most probable one.
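Some of the hypotheses above can be ruled in or out within seconds. Here is a small sketch using only the Python standard library; it separates the DNS hypotheses from the network and firewall ones. The function name is made up, and the host and port you pass in are of course your own.

```python
import socket


def check_connectivity(host, port, timeout=3.0):
    """Tell a DNS problem apart from a TCP-level problem."""
    try:
        address = socket.gethostbyname(host)
    except socket.gaierror as error:
        return f"DNS lookup failed: {error}"
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return f"TCP connection to {address}:{port} works"
    except OSError as error:
        # Refused, timed out, unreachable: server down, firewall, routing, VPN...
        return f"TCP connection to {address}:{port} failed: {error}"
```

If the TCP connection works, the hardware, cable, firewall and routing hypotheses are off the table, and the remaining suspects are TLS, protocol versions and credentials.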

Note how these hypotheses span several IT domains: hardware, system software, networking, security, and application logic. Therefore, to become a better bug hunter, we need to dive deeper.

Dive deeper

No matter what IT domain you are working in, there is always something below. System software developers work on top of hardware. Network engineers work on top of systems and hardware. Framework developers use operating systems and networking software. You’ve got the gist.

Bugs don’t care about the artificial domain borders that we people draw and put into our job descriptions. Even if you are a data scientist and actually only care about ML and the third moment of the posterior probability distribution, it helps to have an understanding of networking, security, databases, operating systems, and hardware.

Such an understanding helps in general. But it helps especially when debugging.

Don’t hesitate to learn adjacent IT domains, at least at the overview level. We don’t need to know the IP header fields by heart or how BGP works, but we must understand how TCP differs from UDP, how TLS builds on TCP, and how HTTPS is HTTP carried over TLS.

With this knowledge, we can generate a few hypotheses to test. To be able to quickly check them one by one in some reasonable time, we need a fast feedback loop.

Fast feedback loop

Organize the development process in a way that we can change code and see if it has affected the bug as soon as possible (ideally immediately).

Often, we need to overcome various obstacles to ensure a truly fast feedback loop. We might need to get some production data in our development database, solving the permission and privacy issues. Or we might need the permission to attach our debugger to a production process.

Sometimes it is not possible, and then debugging becomes very tedious. Once I needed to compile a gigabyte of C++ code, flash it to a TV set, reboot it, then start watching a particular video and do it for several minutes until the bug could be reproduced. This debugging took me four weeks.

Fix them all

So you have fixed the bug, and proven it is fixed.

Congratulations!

Now take a moment and think about all the other places where this or a similar bug can appear, and fix them too. Right now we still have the full case in our head and are perfectly placed to implement high-quality fixes in all the other places too.

Let’s say we couldn’t connect to the database because somebody changed the IP address and assumed that everybody was using the DNS name, while we had been using the IP address in our secrets. We should walk through all secrets and replace all other IP addresses with DNS names (where possible and feasible).
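Walking through the secrets can be automated. A sketch, assuming (hypothetically) that the secrets are available as a simple mapping from a secret name to a host string; your secret store will look different, and the names and hosts below are invented.

```python
import ipaddress


def hosts_that_are_ip_addresses(secrets):
    """Return the names of all secrets whose host is an IP literal, not a DNS name."""
    flagged = []
    for name, host in secrets.items():
        try:
            ipaddress.ip_address(host)
        except ValueError:
            continue  # not an IP literal, so presumably a DNS name
        flagged.append(name)
    return flagged


example_secrets = {
    "orders_db": "10.0.3.17",
    "billing_db": "billing-db.internal.example",
    "cache": "::1",
}
to_fix = hosts_that_are_ip_addresses(example_secrets)
```

The standard ipaddress module recognizes both IPv4 and IPv6 literals, so the check also catches the less obvious cases.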

Optionally: contribute back

If you were fixing open source software, consider contributing back to the community by creating a merge request.

For this:

  • Ask your boss to check the business, corporate and legal aspects of contributing to open source
  • Read the contribution guidelines and understand the license of the project you are contributing to
  • Comply with the guidelines (this might require reformatting, documenting, writing (more) unit tests etc)
  • Before you make a merge request, check you are not leaking any company secrets
  • Make a merge request
  • When it is merged, don’t forget to update your software version
  • Collect virtual karma points and real “contributor” reference for your CV.

Four generations

Degenerate world!
Anne of Austria, “Twenty years after”

The Titans

Like in any mythology, the first generation has to be titans: people who have done impossible things, many times, consistently, and seemingly with no effort. Gauss, with his linear regression and normal probability distribution, laid the foundations of numerous scientific disciplines.

Student’s t distribution was created as a by-product of the Guinness brewery trying to assess the quality of its raw material.

Everything we now consider absolute basics that have seemingly existed forever was created only recently, in the 18th, 19th and early 20th centuries. Thomas Bayes, Carl Friedrich Gauss, Pearson, Fisher, Kolmogorov, Markov…

These people called themselves statisticians, mathematicians or just scientists. They ran their models in their brains and defined the model architecture with ink and paper. They calculated the weights and biases using a slide rule and stored them using chalk and a blackboard.

Their contribution to our science, technology and economics is infinite, because as long as humanity exists, it will use logarithms, the normal distribution and Bayes’ theorem, for fun and profit.

The Data Scientists

The second generation was instrumental in applying statistics to real-world problems, for profit, using computer technology. Computers liberated the field from the purist mathematical approach: decision trees and random forests can no longer be written down on paper and understood by a human brain, but they have improved accuracy, recall, precision and protection against overfitting.

The second generation is the world of feature engineering, controlling of overfitting, structured tabular data, classification, regression, clustering, forecasting and outlier detection. It is Matlab, R and Python. It is naive Bayes, linear and logistic regressions, SVM, decision tree and forest, k-nn, DBSCAN, ARIMA…

The Data Hackers

With the advent of deep learning, data scientists quickly realized that DNNs can simulate all the previous ML algorithms, as well as enable new architectures specialized for particular tasks. There is no science of how to create these architectures, so people just had to tinker and use trial and error to find, more or less randomly and intuitively, an architecture that performed better on public datasets.

This was not only truly liberating. This was a massive scale-out, a tipping point. While the Titans could be counted on one or two hands and would typically be born once in a century, and the data scientists numbered only in the thousands, the ability to quickly hack an ML architecture allowed millions of people to contribute to data science. It feels like almost everybody who happens to have free time, a decent GPU and ambition can do it.

We are speaking here of TensorFlow, PyTorch, JAX and CUDA. Our usual suspects are CNN, LSTM, and GAN. Our data is non-relational now (audio, images and video). Our problems are too little hardware, too slow training and gradient descent that is not guaranteed to converge. Hackers have quickly developed superstitions about how to choose initial weights and which non-linear function should be used when. In flashes of ingenuity we have created auto-encoders, text embeddings and data augmentation. Our tasks are classification, segmentation, translation, sentiment mining, generation.

While data scientists are still developing non-DNN machine learning (with notable mention of pyro.ai and novel outlier detection algorithms), they are now far from the spotlight.

The Prompt Kids

No offense! After all, “script kiddies” is an established term for people who take someone else’s intellectual property and try to apply it to everything around them, sometimes managing to produce something useful.

With the invention of LLMs, our role has split in two. A minority of hackers continue hacking DNN architectures with a huge number of parameters and prohibitive dataset sizes and training costs. And the rest of the world, now probably in the hundreds of millions, is entitled to pick up the model checkpoints that the Hackers leave for us on Hugging Face and to fuck around with the models, trying to force them to do something useful.

We call it “prompt engineering” because we don’t want to honestly admit, even to ourselves, that most of us have no way to play in the premier league of DNN training and that our activity no longer has anything to do with either data or science. We are prompt engineers! It sounds like a reputable and fun job, and it definitely can be used to bring business value, but I wonder whether we have adequate training for it and whether social workers, child psychologists, and teachers would be better suited for the task.

We don’t need data sets any more. We don’t do training. We don’t need science. All we need is empathy, ability to write in our mother tongue eloquently, creativity, and a lot of patience.

The future, maybe?

At some point, someone will create a way to train models with 1e9 to 1e11 parameters, on a cheap, Raspberry-Pi-priced chip. This would be our cue to return to be Hackers again.

Or we just ignore the hype and continue doing our data science things with Gaussian processes, variational inference, structured data, and the good old linear regression.

Who am I?

Ten years ago I was working as a product manager and I had problems explaining to my parents what exactly I was doing.

Now, as a Data Scientist, the challenge is still there.

No, I am not creating reports based on data — data analysts do.

No, I am not inventing next artificial intelligence — scientists do.

No, I am not just training machine learning models — a) most of the available data cannot be used for that so I need to be involved in data engineering first to get usable data, b) I can’t just take a requirement from customer or product management and train a model implementing it, because machine learning is not magic and can’t do everything, so I need to work with the customer or stakeholder first, to come up with a requirement that would both be technically feasible and provide some value for their business, and c) training of LLMs is very expensive and there are other ways to use them

No, I am not a software developer — software developers implement user stories and tickets assigned to them. I decide myself what to implement and why. Besides, I don’t care about proper software process — my goal is to discover a product feature that uses data in a way that brings money. I usually don’t see any reason to create software documentation, write unit tests, implement CI/CD pipelines, use branches and merge requests and code reviews — not at least before we have first three paying customers.

No, I am not an infrastructure guy — the devops are.

What would I like to be doing?

  • I want to make sure there is data available, in the quality and quantity I need to be able to start working.
  • I want to create product ideas promising some business value by using this data.
  • I want to test the product ideas by implementing a prototype and trying to sell it to the customers.

Five mistakes interpreting data

We’re living in the age of data and everybody is making data-driven decisions by interpreting data.

As a data scientist, I often see people making the same mistakes again and again. Here are the top five most common mistakes.

Failing to relate the numbers

Data analysis shows that one of the KPIs is at 65%. The newspaper reports a reduction of carbon emissions of five tonnes per year. Your dieting app says you consumed 200,000 joules of energy at your last meal.

Surprisingly many people would believe that the KPI is quite good and solid, that the carbon emissions were hugely reduced, and that they overate at their last meal. This happens because they relate these numbers to some average, implicit baseline, which may or may not be correct.

We often see percentages when we speak about revenue, interest rates or growth, and we are used to revenue and interest rates running around 0% to 20% and to growth of 20% often being considered very good.

We usually weigh below 100 kg, our car weighs 1.5 tonnes, so when we hear about 5 tonnes, it has to be a lot, right?

Whenever we see numbers like 200,000, we think about something really expensive, like a real luxury car or some small real estate. So 200,000 joules has to be a lot.

But if your performance indicator was between 90% and 120% over the last ten years, 65% is quite unsatisfactory news.

The worldwide yearly carbon emissions are 38,000,000,000 tonnes, so 5 tonnes is merely 0.000000013% of the total emissions.

And 200,000 joules is just about 50 kcal (the “calories” of a diet app), something like one apple.

Establish a meaningful baseline and compare any analytics results with this baseline.

The most common baselines are the values from other time periods (KPI example) or some global total (carbon emissions). Converting from a rarely used unit like joules to a more intuitive one like calories serves the same purpose.
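The three comparisons above are plain arithmetic and worth redoing yourself whenever a number looks impressive. A sketch using only the figures quoted in the text, plus the conversion factor of 4184 joules per kilocalorie:

```python
WORLD_EMISSIONS_TONNES = 38_000_000_000  # yearly total quoted in the text
JOULES_PER_KCAL = 4184                   # 1 kcal (one food "calorie") = 4184 J

# Baseline 1: other time periods (the KPI was between 90% and 120% before).
kpi_below_historic_minimum = 65 < 90

# Baseline 2: a global total.
saved_share_percent = 5 / WORLD_EMISSIONS_TONNES * 100

# Baseline 3: a more intuitive unit.
meal_kcal = 200_000 / JOULES_PER_KCAL

print(f"KPI below its ten-year range: {kpi_below_historic_minimum}")
print(f"Share of world emissions: {saved_share_percent:.9f}%")
print(f"Energy of the meal: {meal_kcal:.0f} kcal")
```

None of this needs a data scientist. It only needs the habit of asking “compared to what?”.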

Failing to account for costs or consequences

Everything has advantages and disadvantages. Everything good comes at a cost. Looking only at one chart in the analytics report can lead to wrong decisions.

Yes, the KPI has fallen from 100% to 65%. But what if that was achieved with only 10% of the costs?

Yes, five tonnes of CO2 per year is very little. But if it comes at negative cost (selling the car and using public transportation), then 100 million people can be motivated to do it, and together they will reduce carbon emissions by 1.3%.

Yes, consuming just 50 kcal for a meal looks good if we want to lose weight, but eating just one apple for lunch might induce food cravings and so increase the risk of abandoning the diet.

Put the results of the analytics report into relation with their costs and consequences before labeling the numbers “good” or “bad”.
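The carbon example can be checked the same way: the individual saving is tiny, but the number of people who could afford it is not. The figures are the ones from the text.

```python
WORLD_EMISSIONS_TONNES = 38_000_000_000  # yearly total quoted in the text
saved_per_person_tonnes = 5
people = 100_000_000

combined_saving_tonnes = people * saved_per_person_tonnes
combined_share_percent = combined_saving_tonnes / WORLD_EMISSIONS_TONNES * 100

print(f"{combined_share_percent:.1f}% of world emissions")
```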

Failing to take data quality into account

Just because an analytical report looks authoritative with its tables of numbers and charts, there is no guarantee that the numbers have meaning at all.

Was our KPI really 65%? Or were some new processes and recent reorganizations in our company just overlooked and not included in the KPI calculation? Is this year’s KPI comparable with last year’s at all?

How did they measure the 5 saved tonnes of carbon emissions? Did they attach a measurement device to the exhaust of the car? Or did they just take the carbon contained in the gasoline? Did they take into account the emissions from producing and scrapping the car? Can public transport be considered emission-free?

How did we measure the 200,000 joules of energy? Did we burn the food and measure the released heat? Or did we just point our smartphone camera at it and hope that the app can tell an apple from a brownie with the shape and color of an apple?

Understand how the data has been produced or measured and take its limitations into account. Improve data quality if you are to make an important decision, or else make your decision tentatively and be ready to change it if some contradictory data emerges.

Failing to consider causality or confounders

Consider these two maps of the USA:

On the left is the frequency of UFO sightings [1], on the right is the population density [2].

The red dots on the left correlate with the red dots on the right, and we should have at least three possible explanations for that:

1) More people means there is a higher chance that at any given second at least one of them is looking at the sky so the UFOs have higher chances to be spotted.

2) Aliens are secretly living on Earth and they use their secret political influence to support policies that lead to higher population density (agglomeration of attractive employers, zoning rules etc), because they feed on the stress that humans emit when living in overpopulated areas.

If we want to depict these ideas, we can use causation graphs like this:

This is also widely known as “Correlation is not Causation”. From the data alone we cannot determine the direction of the arrow.

Confounders are less widely known and they go like this:

3) There is an unknown field, distributed everywhere in the world with varying strength. Physicists don’t know about this field yet. Let’s call it the Force. Some parts of Earth emanate the Force more strongly than others. The Force makes people more optimistic and happy, so people tend to unconsciously dwell around sources of the Force. Aliens fly by to mine the Force, because they use it as a currency.

The Force is a confounder for the correlation between UFO sightings and population density:

So, which explanation is the correct one? From this data alone, it is impossible to tell.
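A small simulation shows how a confounder alone can produce such a correlation. Everything here is synthetic: the Force, the population and the sightings are random numbers, and in this toy world the population has no influence on the sightings at all, and vice versa.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
force = [rng.gauss(0, 1) for _ in range(5000)]       # strength of the Force per region
population = [f + rng.gauss(0, 1) for f in force]    # driven by the Force plus noise
ufo_sightings = [f + rng.gauss(0, 1) for f in force]  # driven by the Force plus noise

correlation = pearson(population, ufo_sightings)
```

The correlation comes out at roughly 0.5, although neither variable appears in the formula of the other. Somebody who sees only the two maps cannot tell this world apart from one where people really cause sightings.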

Perform additional analysis to establish causality and control for confounders, or else avoid making statements about causation in the data report.

Failing to be really data-driven

I have a strong opinion that Explanation 1 (more people lead to more UFO sightings) is the correct one. This opinion is based on my personal experience and common sense. If I were especially well trained in sociology and UFO data, you could call it “my expert opinion”.

Expert opinions are not data-driven. They are based on experience, common sense and intuition.

Another reason for not being, or not fully being, data-driven is the pressure of responsibility, typically found at the C-level of management or in politics. There are so many political factors as well as soft factors, marketing, influence, group dynamics, psychological factors, company culture, and big money that discourage the search for truth. Data is only accepted when it confirms the VIP opinion.

Don’t hesitate to state that you are making one decision or another in a non-data-driven manner. Data is one way, but not the only way, to make great decisions.

Do not pretend to be data-driven while making some or all of these five mistakes, just because being data-driven is “in” and gives your decision a scientific touch. Data scientists are watching you and they will know the truth.

Data sources

[1] https://www.reddit.com/r/dataisbeautiful/comments/1cti1rz/do_ufo_sightings_happen_near_airports_best

[2] https://ecpmlangues.unistra.fr/civilization/geography/map-us-population-density-2021