How to design hard software – Maxim Fridental

This is Part 2 of the series about hard software. Read Part 1 here.

Stop saying “architecture”

The first step to healing is to accept the problem.

I have a problem with the term “software architecture”. Architecture implicitly has the following properties:

It is “the” design step that is ultimately responsible for the quality, performance, price, and fitness of the end product. If we build a house, the architect interviews the clients and takes everything into consideration: the height of all family members, whether new family members are planned, whether the clients work from home, how many cars they have, even whether they are left- or right-handed. As a result, the clients can expect a home that is perfectly suited to them, comfortable to use, free of unnecessary details, reasonably priced and future-proof.
So the architecture is able to predict the future and to take all relevant aspects into consideration.
It is notoriously hard to change the architecture once we’ve started to build.
It happens before we start anything else. It is the root, the supreme source of all other decisions and activities.

And all of these properties are exactly what we don’t want in hard software, because it doesn’t work like this. By definition, when developing hard software, we cannot predict the future, cannot take all relevant aspects into consideration, and don’t even know the full list of relevant aspects. On the other hand, changes in software are often not that expensive.

Start with going live

It is impossible to tell what part of our hard software will cause problems in real production conditions. We can’t even be sure whether our software is hard at all, or just yet another undemanding app.

Note: “Undemanding app” might sound harsh and unfair, given that most of the software out there is undemanding. But as a matter of fact, it is a very lucky case, because it means achieving business goals and following our company vision with minimal effort. The best case we as software engineers should wish for is our software being dull and uneventful.

So our first priority will be developing the software and going live, to get the real load (if it is a web site), the real data quantity and quality (if it is a data pipeline), the real usage patterns and usage peaks (if it is real-time software), etc.

After we go live, we observe the software and determine where the performance bottleneck is (or where other kinds of problems are, for example high battery consumption in IoT scenarios).

Then we “just” solve the bottleneck, and release a new version of the software.
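Finding the bottleneck usually starts with measuring rather than guessing. As a minimal sketch, assuming a Python backend, the standard cProfile module can show where the time goes. The functions load_rows, rank_rows and handle_request are hypothetical stand-ins for real production code.

```python
import cProfile
import io
import pstats


def load_rows():
    # Hypothetical cheap step: fetch the data to work on
    return list(range(2_000))


def rank_rows(rows):
    # Hypothetical hot spot: re-sorts the whole list once per row
    return [sorted(rows, reverse=True).index(row) for row in rows]


def handle_request():
    # Hypothetical request handler combining both steps
    return sum(rank_rows(load_rows()))


def profile_report(func, top=10):
    """Run func under the profiler and return (result, text report)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first
    stats.sort_stats("cumulative").print_stats(top)
    return result, buffer.getvalue()
```

Calling profile_report(handle_request) returns the handler's result together with a text report in which rank_rows shows up near the top. In production we would rather rely on monitoring and sampling profilers, but the idea is the same: measure first, then fix.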

And we repeat it, until all stakeholders are satisfied.

You may ask, how should we develop the first version of our software without a diagram with boxes and arrows?

Agree on System Metaphor

If we look at many different software architectures (the boxes-and-arrows diagrams), we will find that there are just a couple of common patterns, ideas or themes that are used again and again. I think this is what Kent Beck, one of the founding fathers of agile, called the System Metaphor in Extreme Programming. It is a short story, ideally just one sentence, that conveys the idea of the architecture.

Again, I’ll give some examples.

The system metaphor of REST is: we will re-use HTTP as a CRUD API, and model all web service interactions in terms of CRUD. We do that because we want our web services to be easily:

load balanced,
proxied,
developed with standard web frameworks,
debugged,
monitored,
logged, and
deployed (think firewall configuration).

We pay for this with HTTP protocol overheads.
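To make the REST metaphor concrete, here is a minimal sketch of how HTTP methods can map to CRUD operations. ResourceStore and its handle method are hypothetical names for illustration, and the in-memory dictionary stands in for a real backend behind a web framework.

```python
# Mapping of HTTP methods to CRUD operations
CRUD_BY_METHOD = {
    "POST": "create",
    "GET": "read",
    "PUT": "update",
    "DELETE": "delete",
}


class ResourceStore:
    """Hypothetical in-memory resource collection addressed by id."""

    def __init__(self):
        self._items = {}
        self._next_id = 1

    def handle(self, method, item_id=None, body=None):
        """Return a (status_code, payload) pair for one request."""
        operation = CRUD_BY_METHOD.get(method)
        if operation is None:
            return 405, None  # Method Not Allowed
        if operation == "create":
            item_id = self._next_id
            self._next_id += 1
            self._items[item_id] = body
            return 201, {"id": item_id}  # Created
        if item_id not in self._items:
            return 404, None  # Not Found
        if operation == "read":
            return 200, self._items[item_id]
        if operation == "update":
            self._items[item_id] = body
            return 200, body
        # Only "delete" is left at this point
        del self._items[item_id]
        return 204, None  # No Content
```

Because every interaction is an ordinary HTTP request with a standard method and status code, generic tools such as load balancers, proxies and log analyzers can work with it without knowing anything about our application.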

Some system metaphors of ClickHouse are:

We store tables by columns and not by rows. Because a column contains similar data and a row contains dissimilar data, columns compress better, so we spend less time waiting when reading data from the HDD. Also, we don’t need to read columns that the query doesn’t need.
We implement only the part of SQL that can be done with extremely good performance on big data sets, and we extend SQL with non-standard features that give the user more ways to ensure very good query performance.
Our only path to success on the database market is ultimate, uncompromising, best-in-class run-time performance. Therefore we write everything in C++. It is harder to write, but it gives the best run-time performance.
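The first of these metaphors can be illustrated with a toy example. The sketch below uses simple run-length encoding as a stand-in for compression; it is not a claim about the codecs ClickHouse actually uses, and the table contents are made up.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, count) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical table with the columns id, country, status
COLUMN_NAMES = ("id", "country", "status")
ROWS = [
    (1, "DE", "paid"),
    (2, "DE", "paid"),
    (3, "DE", "open"),
    (4, "FR", "open"),
    (5, "FR", "open"),
    (6, "FR", "paid"),
]

# Columnar layout: one list per column instead of one tuple per row
COLUMNS = {
    name: [row[index] for row in ROWS]
    for index, name in enumerate(COLUMN_NAMES)
}


def column_sum(columns, name):
    # A query over one column touches only that column's data
    return sum(columns[name])
```

Here run_length_encode(ROWS) cannot collapse anything, because no two rows are equal, while run_length_encode(COLUMNS["country"]) shrinks six values to two pairs. And column_sum(COLUMNS, "id") never looks at the country or status data.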

One of the system metaphors of Unix is: everything is a file.

The system metaphor for the initial version of our foobarbaz app (see Part 1) is: just a standard Flask website with frontend using React and backend using Python, hosted on our on-premises Linux server.

In my opinion, if we have a team capable of creating hard software, the sentence above is all we need to start developing the first version of the foobarbaz software. If our team is a single full-stack developer, they probably won’t need anything else. If our team has separate front-end and back-end developers, they will probably need one Confluence page, which they edit collaboratively to decide on the names and signatures of the backend web services, and another Confluence page, written together with the PM and the designer, to define the features, design and user scenarios of the MVP.

Solve problems by shifting Metaphor

We shouldn’t expect that on our very first try we will already find the System Metaphor satisfying all stakeholders. Fortunately, it is often very easy to evolve our Metaphor as we learn about the real-life problems of our software in production.

Something that very often works is shifting our Metaphor along the “Run-Time – Dev-Time” axis, where the left end optimizes for run-time performance and the right end for development time.

Usually we want to start quite far to the right on this axis, because the less time and money we spend on developing the first version, the more time and money we will have to find the right system metaphor iteratively.

Let me say it again. The overall time and money for every project is limited, either explicitly or implicitly. The less time and money we spend on developing the first version, the more time and money we will have to find the right system metaphor iteratively.

As soon as we identify a problem in production, we can move left on this axis, a notch or two.

If we have been using Python or Ruby, we can switch to Go or Java (one notch) or to Rust or C++ (two notches).

If we have been using a naive implementation of our algorithm, we can optimize it by using better data structures, better algorithms, highly optimized libraries, etc.
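As a small sketch of such a notch, here is a duplicate check written twice: first naively, then with a better data structure. Both function names are made up for this example.

```python
def has_duplicates_naive(items):
    # Compares every pair: O(n^2) comparisons in the worst case
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicates_fast(items):
    # Remembers seen items in a set: O(n) on average for hashable items
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False
```

Both functions return the same answers, and the first one is perfectly fine as long as the lists stay small. Only once production shows that this check is the bottleneck is it worth paying for the second version.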

We can move activities out of run time: compiling the code, pre-aggregating, pre-rendering, minifying JavaScript, etc.
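Pre-aggregation is probably the easiest of these to sketch. In the example below, the events and function names are hypothetical; the point is that the expensive scan moves from every request into a job that runs once, ahead of time.

```python
from collections import defaultdict

# Hypothetical raw events: (day, amount)
EVENTS = [
    ("2024-01-01", 10),
    ("2024-01-01", 5),
    ("2024-01-02", 7),
]


def total_for_day_at_runtime(events, day):
    # Scans all events on every single request
    return sum(amount for event_day, amount in events if event_day == day)


def build_daily_totals(events):
    # Runs once, ahead of time, for example in a nightly batch job
    totals = defaultdict(int)
    for day, amount in events:
        totals[day] += amount
    return dict(totals)


DAILY_TOTALS = build_daily_totals(EVENTS)


def total_for_day_precomputed(day):
    # At request time only a dictionary lookup is left
    return DAILY_TOTALS.get(day, 0)
```

The price is the usual one for this direction on the axis: more development effort, because somebody has to schedule the batch job and decide how stale the totals are allowed to be.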

We can use a proper database, which requires more development effort for defining schemas and indexes, instead of throwing everything into a NoSQL document store.
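A minimal sketch of what this extra effort looks like, using SQLite from the Python standard library only because it needs no setup. The orders table, its columns and the index name are assumptions made for this example.

```python
import sqlite3


def open_database():
    # In-memory database, just for illustration
    conn = sqlite3.connect(":memory:")
    # The schema is decided up front, at development time
    conn.execute(
        "CREATE TABLE orders ("
        " id INTEGER PRIMARY KEY,"
        " customer_id INTEGER NOT NULL,"
        " amount_cents INTEGER NOT NULL)"
    )
    # The index is chosen for the query we expect to be hot
    conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
    return conn


def total_for_customer(conn, customer_id):
    row = conn.execute(
        "SELECT COALESCE(SUM(amount_cents), 0) FROM orders WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0]


def query_plan(conn, sql, params=()):
    # Ask SQLite how it would execute the query; the last column of each
    # result row holds the human-readable description
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " ".join(str(row[-1]) for row in rows)
```

Calling query_plan on a lookup by customer_id shows that SQLite searches via idx_orders_customer instead of scanning the whole table. That is the run-time benefit we buy with the schema and index work.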

The other two axes we can often move our System Metaphor along are Scale-Up and Parallelize (Scale-Out).

Usually we don’t need to move the whole software to the left. Caches are an affordable way to put frequently used data into RAM while still benefiting from cheap and persistent HDDs. In a microservice architecture, different services can get different hardware.
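The cache idea can be sketched in a few lines, assuming a Python service and the standard functools.lru_cache. The function read_from_disk is a hypothetical stand-in for a slow HDD or database read.

```python
from functools import lru_cache

# Records every slow read so that the effect of the cache is visible
disk_reads = []


def read_from_disk(key):
    # Hypothetical stand-in for a slow read from persistent storage
    disk_reads.append(key)
    return f"value-for-{key}"


@lru_cache(maxsize=1024)
def cached_read(key):
    # Keeps up to 1024 recently used values in RAM
    return read_from_disk(key)
```

Repeated calls to cached_read with the same key hit the disk only once. The maxsize parameter is the knob that trades RAM for fewer slow reads, so only the frequently used part of the data has to live on the expensive hardware.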

Scale-out / parallelization works within one server as well as across several servers.
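Within one server, the basic pattern is to split the work into independent chunks and hand them to a pool of workers. The sketch below uses ThreadPoolExecutor from the Python standard library; the function names are made up for this example. For CPU-bound work in CPython we would rather swap in ProcessPoolExecutor, which offers the same map interface.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_count):
    # Ceiling division, with a minimum size of 1 to handle empty input
    size = max(1, -(-len(items) // chunk_count))
    return [items[i:i + size] for i in range(0, len(items), size)]


def process_chunk(chunk):
    # Hypothetical unit of work that does not depend on other chunks
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(items, workers=4):
    chunks = split_into_chunks(items, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each chunk is processed by one worker; the partial results are combined
        return sum(pool.map(process_chunk, chunks))
```

The same split, process and combine structure carries over to several servers. Only the pool of workers is then replaced by something like a job queue or a cluster scheduler.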

Experience is the king

Conway observed that the architecture of big software systems replicates the organizational structure, an observation Brooks popularized in The Mythical Man-Month. Not only that: successful software also replicates the past experiences of the team members.

Nothing is as expensive as the lessons learned from operating software in production. When we choose our System Metaphor, we often must decide between competing technologies A and B. If we have team members with years of experience operating B, technology A has to have some exceptional, remarkable advantages over B to even be evaluated. Also, when people have 5 years of experience with B and 2 years with C, we should lean towards B. See also the idea of innovation tokens.

The most essential part of the job of a software architect is to have broad knowledge of all the technologies and “notches” we can use when shifting our metaphor. Ideally, they have experience operating some of these technologies in hard software, in production, and have at least cursory knowledge of most of the other “notches”.

Summary

Hard software is not ready and cannot be ready before it goes to production.
We cannot develop hard software just like any other software.
We don’t know what technology we’ll end up using before we finish developing hard software.
Hard software needs a team that can freely navigate from the deepest depths of memory allocation, assembler and kernel-space code, over business logic and database indexes, up to the top of JavaScript and CSS, and in breadth from optimizing software on one core by choosing appropriate data structures to running software on 100 cores by using Kubernetes.
Hard software is expensive to develop, and therefore it creates a UVP (unique value proposition) that is hard to copy.