Hard Software Architecture

What is hard software?

Whenever we need to process huge amounts of data, or perform millions of transactions per second, or our hardware resources are very limited, or our networks are slow and unreliable (IoT), or we need to perform huge amounts of calculations in real time (think of computer games), our approach to software design can be very different from the approach for the usual, undemanding software.

I’d call these use-cases “hard software”, for lack of a better term, and even though they are very different from each other, they do share a couple of common principles.

Part 1. Antipatterns

No plan survives first contact with the enemy.

The usual approach of designing software architecture by drawing boxes and arrows and defining interfaces between the blocks is not very helpful for hard software, because black-box design and abstractions don’t work there.

I’ll give some examples.

Virtual memory is a cool concept, because it gives us a virtually unlimited amount of memory. We just swap unused data to disk, and when it is needed again, we load it back. The idea was very popular 30 years ago. It was especially appreciated by sales, because they could specify much smaller hardware requirements for their software. As a result, we installed their software on the minimal supported configuration and waited minutes after every mouse click, listening to the various sounds of the hard drive and waiting for the fucking swapping to finish. Nowadays, swapping is either turned off by default, or configured so that it only kicks in in an emergency.

Everyone who has ever developed a website knows these two pitfalls. We develop it on localhost and everything is fine, we deploy it to hosting and it is still fine, and only later do we see how our website behaves on a 16 MBit/s DSL line (spoiler alert: not well at all). Or, we release our new web app, and it works well and feels responsive, and then we post a link to it on Hacker News or pay for adverts, and boom, our webserver is down. The HTTP protocol abstracts away how far away the server hosting the website is and how much compute the website requires per user, and this abstraction often comes back to bite us.

malloc and free are two simple abstractions to allocate and free memory on the heap. They just need to keep track of which memory addresses are currently used and which are free. What can go wrong? Let’s say we have 1024 bytes of memory in total and want to use it to store 8 objects, 128 bytes each. So we allocate the memory and everything is fine. Now we don’t need every second object, so we free the memory of 4 of them. We now have 512 bytes free in total, but we cannot allocate memory for another object of 256 bytes, because all of our free memory is fragmented. We also cannot defragment the memory and move the used chunks together, because somebody is already storing pointers to our objects and we don’t know where. So people start inventing their own memory allocators with different arenas, using fixed-size buffers on the stack, etc. Or people switch to garbage collection and prohibit pointers. But this abstraction doesn’t work in hard software either, causing random pauses while the GC collects garbage and defragments the memory.
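A pool allocator (an arena of fixed-size blocks, threaded by a free list) is one of those custom allocators: because every block has the same size, any freed block can satisfy any future allocation, so the fragmentation above simply cannot happen. A minimal sketch, reusing the block size and count from the example (names are illustrative, not a production allocator):

```c
#include <stddef.h>

#define BLOCK_SIZE 128
#define BLOCK_COUNT 8

// Each block either holds user data or, while free,
// the pointer to the next free block.
typedef union block {
  union block* next;
  unsigned char data[BLOCK_SIZE];
} block;

static block arena[BLOCK_COUNT];   // one statically allocated arena
static block* free_list = NULL;

static void pool_init(void) {
  free_list = NULL;
  for (int i = BLOCK_COUNT - 1; i >= 0; --i) {
    arena[i].next = free_list;     // thread the free list through the arena
    free_list = &arena[i];
  }
}

static void* pool_alloc(void) {
  if (!free_list) return NULL;     // pool exhausted
  block* b = free_list;
  free_list = b->next;             // pop the head of the free list
  return b->data;
}

static void pool_free(void* p) {
  block* b = (block*)p;            // p always points to the start of a block
  b->next = free_list;             // push the block back onto the free list
  free_list = b;
}
```

Allocating 8 blocks, freeing every second one, and then allocating 4 more works fine here, exactly the scenario where malloc would fail.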

Let’s say we have a URL with parameters in its query string, and we need to parse it and extract the values of the parameters. Undemanding software could do it like this (without security and error handling):

def parse_params(query_string: str):
  params = dict()                 # memory allocation
  pairs = query_string.split('&') # memory allocations, copy all
                                  # characters of query_string,
                                  # except the '&'
  for pair in pairs:
    tmp = pair.split('=')         # memory allocations, copy all
                                  # characters except the '='
    params[tmp[0]] = tmp[1]       # memory allocation for the dict
                                  # entry (the strings are referenced,
                                  # not copied)
  return params

Python provides very simple, comfortable and powerful abstractions for data handling (and it is popular precisely because of that), but looking closely, by the end of this function we have copied all the characters of query_string several times, and performed a lot of memory allocations and deallocations. We can’t just ignore the abstraction costs if this function gets called many times, if our hardware is limited, etc.

Here is how hard software would do it (again, without error and security handling, and probably crashing on special cases):

typedef struct params {
  char* user_id;
  char* page_id;
  char* tab;
} query_params;

void parse_params(char* query_string, query_params* output) {
  char* tmp;

  // we assume that the order of the parameters is always the same
  // and corresponds to the order of the fields in the output struct

  for(size_t i=0; i < sizeof(query_params)/sizeof(char*); ++i) {
    // search for next '='
    while(*query_string && *query_string != '=') 
      query_string++;
    if (!*query_string) 
      return;
    
    // store the beginning of the parameter
    tmp = ++query_string;

    // now we just need to terminate it with 0,
    // so we search for the next '&' (or the end of the string)
    while(*query_string && *query_string != '&')
      query_string++;

    // remember whether we stopped at '&' or at the end
    char last = *query_string;

    // terminate the value in place
    *query_string = 0;

    // now tmp is a pointer to a valid C string, we can output it
    *((char**)output+i) = tmp;

    // if we reached the end of the string, stop before
    // reading past the terminating 0
    if (!last)
      return;

    query_string++;
  }
}

Look, ma: no memory allocations! No string copies!

By drawing boxes and arrows in our architecture design, we assume interfaces, and interfaces assume abstractions, and abstractions don’t always work.

Another reason why boxes and arrows don’t work is that we cannot know beforehand where exactly our hard software will have problems, so we don’t know which boxes to draw.

I’ll also give some examples.

We are developing a foobarbaz app, and so we have a frontend that is collecting foo from users. It then sends it to the backend web service, which calculates barbaz and returns it synchronously back to the frontend.

We deploy it to production and pay Google to bring some users to the website. Then we realize that calculating barbaz takes 5 minutes on the hardware we can afford, so most users won’t wait that long and their browsers just time out.

So now we need to reshuffle our boxes: the frontend will create a job and write it to a database or a message bus, and then return to the user immediately and say “we will send you an email when your barbaz is ready”. We also probably need to add some more boxes or services, because we are now handling emails and have to be GDPR compliant. Also, our barbaz calculator is not a web service any more, but just a worker fetching jobs and storing results.
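The job store in the middle can be sketched with the simplest possible data structure, a fixed-capacity ring buffer. In production this would be a database table or a message bus; the capacity and the `int` job type are placeholders:

```c
#include <stdbool.h>
#include <stddef.h>

#define QUEUE_CAPACITY 256

// A fixed-capacity ring buffer standing in for the job store.
typedef struct {
  int jobs[QUEUE_CAPACITY];
  size_t head;   // next job to fetch
  size_t tail;   // next free slot
  size_t count;
} job_queue;

// Called by the frontend: enqueue a job and return immediately.
static bool enqueue_job(job_queue* q, int job) {
  if (q->count == QUEUE_CAPACITY)
    return false;              // queue full: ask the user to retry later
  q->jobs[q->tail] = job;
  q->tail = (q->tail + 1) % QUEUE_CAPACITY;
  q->count++;
  return true;
}

// Called by the worker: fetch the oldest job, if any.
static bool fetch_job(job_queue* q, int* job) {
  if (q->count == 0)
    return false;              // nothing to do, the worker sleeps
  *job = q->jobs[q->head];
  q->head = (q->head + 1) % QUEUE_CAPACITY;
  q->count--;
  return true;
}
```

The point of the reshuffle is visible in the two functions: the frontend’s call is cheap and constant-time, while the expensive work happens whenever the worker gets around to it.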

We go live and find out that there are still too many users for our hardware: even though all users now receive their barbaz via email, very few of them are satisfied enough to pay for our premium service. We need to find a way to compute barbaz online.

After some consideration, we suddenly realize that there are only 1024 possible different baz combinations, so if we pre-compute all of them, then to compute barbaz we just need to calculate bar and quickly combine it with the pre-computed baz.
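The trade is memory for latency: a nightly job fills a lookup table with all 1024 bazes, and the online path shrinks to one cheap calculation plus one array index. A sketch under invented assumptions (that baz is determined by a 10-bit key, and that “combining” is a trivial operation; both stand in for whatever the real domain logic is):

```c
#include <stdint.h>

#define BAZ_COMBINATIONS 1024

static uint32_t baz_table[BAZ_COMBINATIONS];

// Stand-in for the expensive baz computation
// (the real one takes minutes, this placeholder does not).
static uint32_t compute_baz_slowly(uint32_t key) {
  return key * 2654435761u;
}

// Nightly batch script: pre-compute every possible baz.
static void precompute_all_baz(void) {
  for (uint32_t key = 0; key < BAZ_COMBINATIONS; ++key)
    baz_table[key] = compute_baz_slowly(key);
}

// Online path: calculate bar, then combine it with the
// cached baz in O(1).
static uint32_t compute_barbaz(uint32_t bar, uint32_t baz_key) {
  return bar ^ baz_table[baz_key & (BAZ_COMBINATIONS - 1)];
}
```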

So we reshuffle our boxes and arrows again: we keep our job/worker idea just in case we have load peaks, we resurrect a synchronous web service that calculates barbaz using the pre-computed baz, and we run the pre-computation of baz as a nightly batch script.

Now we have a modest success and land our first enterprise customer. They want 10000 foobarbaz calculations per second, and they give us money so we can buy more hardware. Now we need to rearrange our boxes again and scale out our web service. We’ll probably use Kubernetes for this, and so we’ll throw a whole new bunch of boxes and arrows into our diagram. And by the way, the database we’re using to store jobs can’t handle 10000 inserts per second. Are we going to optimize our table? Switch the table to in-memory mode? Give it an SSD instead of an HDD? Replace it with another database? Replace it with a message bus? Or just shard all incoming requests into 4 streams and handle each stream with a separate, independent installation of our foobarbaz app, on different hardware?
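The last option can be as simple as hashing a stable request attribute and routing on the result, so that every request from the same user always lands on the same independent installation. A sketch using FNV-1a (the hash choice and shard count are illustrative):

```c
#include <stdint.h>

#define SHARD_COUNT 4

// FNV-1a: a simple, fast string hash, good enough
// for spreading load roughly evenly across shards.
static uint32_t fnv1a(const char* s) {
  uint32_t h = 2166136261u;        // FNV offset basis
  while (*s) {
    h ^= (unsigned char)*s++;
    h *= 16777619u;                // FNV prime
  }
  return h;
}

// Deterministic routing: the same user_id always
// maps to the same shard.
static unsigned shard_for(const char* user_id) {
  return fnv1a(user_id) % SHARD_COUNT;
}
```

Determinism is the important property here: since each shard has its own database, a user whose job went to shard 2 must also ask shard 2 for the result.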

Quickly, rearrange boxes and arrows. Draw more of them.

What should we use instead of boxes and arrows?

Boxes and arrows are not bad. I mean, at least they document our current approach to the software architecture.

But they are not very helpful either.

Just having a boxes-and-arrows diagram doesn’t reduce our risks and doesn’t prevent a potential rewrite in the future. It is just documentation, so it rapidly becomes obsolete. It is “a” way of knowledge sharing, but often not “the” best way. It is cheaper and better to share knowledge by other means, for example pair programming, in-person Q&A sessions, or code review.

There should be a better way to design hard software architecture.

(to be continued)