Four Weeks of Bugfixing

The hardest bug I’ve ever fixed in my life took me 4 weeks to find. The bug report itself was pretty simple, but I have to give more context first.

I was one of the developers of a Smart TV software, and the bug related to the part of the software responsible for playing video files stored on your USB memory or drive. The CPU that was available for this task was a 750MHz ARM chip, and clearly it had not enough power to decode video (let alone HD video) in software. Luckily, every digital TV set has a hardware H.264 decoder, and our SOC was so flexible that we could use it programmatically. In this way, we were able to support H.264 video playback (too bad for you DivX and VC-1 owners).

Technically, the SOC has provided a number of building blocks, including a TS demux, an audio decoder, a video decoder, a scaler and multi-layer display device, and a DMA controller to transfer all the data between the blocks. Some of the blocks were present more than once (for example, for the PIP feature you naturally need two video decoders) and therefore could be dynamically and freely interconnected programmatically, building a hardware-based video processing pipeline. Theoretically, one could configure the pipeline by writing some proper bits and bytes in specified configuration registers of the corresponding devices. Practically, the chip manufacturer has provided an SDK for this chip, so that you only had to call a pretty well-designed set of C functions. The SDK was intended to run in the kernel mode of a Linux kernel, and it came from the manufacturer together with all building scripts needed to build the kernel.

Furthermore, this SDK was wrapped and extended by some more kernel-side code, first to avoid dependency on a particular SOC, and second to provide some devices to the user-mode, where the rest of the Smart TV software was running. So to play video programmatically, one needed to open a particular device from user mode as a file, and write into it a TS stream containing video and audio data.

Sadly, there are many people out there who have invented a lot of different container formats besides of TS. Therefore, our software had to detect the container format of the file to be played, demux the elementary streams out of it, then mux them again into a TS stream, and then hand it over to the kernel mode code. The kernel code would pass the TS bytes to the DMA device, that would feed the hardware TS demuxer, that would send the video elementary stream to the hardware video decoder, where it finally would be decoded and displayed.

For the user mode, we could implement all possible container formats ourselves (and this would mean some job security for the next 10 years of so). Fortunately the Smart TV software was architected very well so that the GStreamer framework was used (for you Windows developers it is an open-source alternative to DirectShow). The framework is written in C (to be quick) and GLib (to be object-oriented) and provides a pipeline container, where you can put some filters and interconnect them. Some filters read the data (sources), some process the data (eg. mux or demux), some use the data (sinks). When the pipeline starts playing, the filters agree on which one will drive the pipeline, and the driver would pull the data from all filters before it in the pipeline, and push the data into all the filters after it in the pipeline. Our typical pipeline looked like this (in a simplified form): “filesrc ! qtdemux ! mpegtsmux ! our_sink”. As you can expect from such a framework, there are also a lot of stuff related to events and state machines, as well as memory management.

So now, back to the bug report. It looked like this: when playing a TS file from USB memory, you can seek forward and backward with no limitation. When playing any other container format, you can seek forward, but you cannot seek backward. When seeking backward, the video freezes for several seconds, and then the playback continues from the latest position.

This is the sort of bugs when I think this might be fixed in a day or two. I mean, it works with TS, it doesn’t work with MP4, it is fully reproducible, so just find out what is different in those two cases and you’ve caught it.

The GStreamer pipeline in TS case looked like this: “filesrc ! our_sink”. So it must be either qtdemux or mpegtsmux. I’ve built another MP4 demuxer and replaced qtdemux with it. Negative, the bug is still there. No wonder, it also appeared in other container formats. I couldn’t replace mpegtsmux, because I haven’t found any alternatives. So the only thing I could do is to use the pipeline “filesrc ! qtdemux ! mpegtsmux ! filesink”, write the output into a file, and then try to dump the TS format structure and to look for irregularities.

If you know TS format, then for sure, you are already very sympathetic with me. TS is a very wicked and complicated format, and they repeat some meta-information every 188 bytes, so that the dump of several seconds of video took megabytes. After reading it, I didn’t find anything suspicious. Then I’ve converted my test MP4 video into a TS using some tool, dumped that TS, and compared. Well, there were some differences, in particular, how often the PCR was transmitted. Theoretically, PCR is just a system clock and should not influence the playback at all, but practically we already knew about some hardware bugs in the decoder making it allergic to unclear PCR signaling. I’ve spent some time trying to improve PCR, but this didn’t help either.

I have then played the dumped TS file, and I could see the seek backwards that I did during the recording. This has convinced me that mpegtsmux was also bug-free. The last filter I could suspect was our own sink. Implementing a GStreamer filter is not easy to do right in the first time. So that I went through all the functions, all the states, all the events, informed myself how the proper implementation should looked like, and found a lot of issues. Besides of a lot of memory leaks, we’ve generated a garbage during the seek. Specifically, GStreamer needs it to work in the following way:

1. The seek command arrives at the pipeline and a flush event is sent to all filters.

2. All filters are required to drop all buffered information to prepare themselves for the new data streamed from the new location.

3. When all filters have signaled to be flushed, the pipeline informs the pipeline driver to change playback location.

4. After the seek, the new bytes start flowing in the pipeline. Our code has conformed to this procedure somewhat, but did the cleanup prematurely, so that after the cleanup some more stale data polluted our buffers, before the data from the new location arrived.

I couldn’t explain why did it work with TS but not with MP4, but I’ve figured out that fixing it will make our product better anyways, so I’ve fixed it. As you can imagine, this didn’t solve the original problem.

At this point I’ve realized that I had to go into the kernel. This was a sad prospect, because every time I’ve changed anything in kernel, I had to rebuild it, then put the update on a USB stick, insert it into TV set, upgrade it to the new kernel by flashing the internal SOC memory, and then reboot the chip. And sometimes I’ve broken the build process, and the new kernel wouldn’t even boot, and I had to rescue the chip. But I had no other choice: I was out of ideas what else I could do in the user space, and I suspected that in the kernel space, we also had a similar issue with a garbage during the seek.

So that I’ve bravely read the implementation of the sink device and changed it in a way that it would explicitly receive a flush signal from the user space, then flush the internal buffer of the Linux device, then signal back to user space it is ready, and only then I would unlock the GStreamer pipeline and allow it to perform the seek and start streaming from the new location.

It didn’t help.

I went further and flushed the DMA device too. It didn’t help. Also flushing the video decoder device didn’t help.

At this point I’ve started to experiment with the flush order. If I first flush the DMA, the video decoder might starve in absence of data and therefore get stuck. But if I flush the decoder first, the DMA would immediately feed it with some more stale data. So perhaps I have to disconnect the DMA from video decoder first, then flush the decoder, then the DMA, and then reconnect them back? Implemented that. Nope, it didn’t work.

Well, perhaps the video decoder is allergic to asynchronous flushes? I’ve implemented some code that has waited until the video decoder reported that it has just finished the video frame, and then flushed it. Nope, this wasn’t it.

In the next step, I have subscribed to all hardware events of all devices and dumped them. Well, that were another megabytes of logs to read. And it didn’t help, that the video playback was a very fragile process per se. Even when playing some video, that looked perfectly well on the screen, the decoder and the TS demux would routinely complain of being out of sync, or losing it, or being unable to decode a frame.

After some time of trying to see a pattern, the only thing I could tell is that after the seek forward, the video decoder would complain for some frames, but eventually recover and start producing valid video frames. After a seek backward, the video decoder has never recovered. Hmm, can it be something with the H.264 stream itself that prevented the decoder to work?

Usually, one doesn’t think about elementary streams in terms of a format. They are just BLOBs containing the picture, somehow. But of course, they have some internal format, and this structure is normally only dealt with by authors of encoders and decoders. I went back to GStreamer and looked up, file by file, all the filters from the pipeline producing the bug. Finally, I’ve found out that mpegtsmux has a file having “h264” in its name, and this has immediately ringed alarm in my head. Because well, TS is one abstraction level higher than H.264, why the hell mpegtsmux has to know about the existence of H.264?

It turned out, H.264 bitstream has in its internal structure so-called SPS/PPS, the sequence parameter set that is basically a configuration for the video decoder. Without the proper configuration, it cannot decode video. In most container formats, this configuration is stored once somewhere in the header. The decoder normally reads the parameters once before the playback start, and uses them to configure itself. Not so in TS. The nature of TS format is so that it is not a file format, it is a streaming format. It has been designed in the way that you can start playing from any position in the stream. This means that all important information has to be repeated every now and then. This means, when H.264 stream gets packed into the TS format, the SPS/PPS data also has to be regularly repeated.

This is piece of code responsible for this repeating: As you can see, during the normal playback, it would insert the contents of h264_data->cached_es every SPS_PPS_PERIOD seconds. This works perfectly well until you seek. But look how the diff is calculated in the line 234, and how the last_resync_ts is stored in line 241. The GST_BUFFER_TIMESTAMP is as you can imagine the timestamp of the current video sample passing through the tsmux. When we seek backwards, the next time we come into this function, the GST_BUFFER_TIMESTAMP will be much less than last_resync_ts, so the diff will be negative, and thus the SPS/PPS data won’t be repeatedly sent, until we reach the original playback time before the seek.

To fix the bug, one can either use the system time instead of playback time, or reset last_resync_ts during the flush event. Both would be just a one line change in the code.

Now, the careful reader might ask, why could the TS file I’ve recorded using mpegtsmux in the beginning of this adventure be played? The answer is simple. In the beginning of this file (i.e. before I’ve seek), there are H.264 data with repeated SPS/PPS. At some point (when I’ve seek during the recoding), the SPS/PPS stop being sent, and then some seconds later appear again. Because these SPS/PPS data are the same for the whole file, already the first instance of them configures the video decoder properly. On the other hard, during the actual seek of MP4 playback, the video decoder is being flushed, and therefore the SPS/PPS data is being also flushed, and this is the point when the video decoder relies on repeated SPS/PPS in the TS stream to recover, and this is exactly the point when they stop coming from the mpegtsmux.

Four weeks of search. 8 hours a day, 5 days a week. Tons of information read and understood. Dozens of other smaller bugs fixed on the way. Just to find out a single buggy line of code out of 50 millions lines of code in the source folder. A large haystack would contain to my estimate 40 to 80 millions of single hays, making this bug fixing adventure literally equivalent of finding a needle in a haystack.

Four Seasons of Enterprise

In the beginning, there is no enterprise, just a couple of founders fascinated by a single idea and working hard to realize it. The startup does not earn much money, and there are barely any employees, so that I suppose it might feel just like a (very hardcore) hobby. Or a side gig. There are no formally defined roles. Everybody is doing everything, and everybody is responsible for everything, and everybody can see the real contribution of each other. There is an Enterprise Spring feeling, full of the can-do mentality.

The Enterprise Summer begins, when the enterprise starts earning substantial amount of money and hires their 20th employee. The founders, now CEOs, suddenly realize that “they” (their company) are earning much more money than they would have been ever able to earn by their own. And they are responsible that this revenue increases, not decreases. Also, they realize that dozens of their employees trusted them and build on stability of the company to plan their life, pay off mortgages and so on. This is a huge responsibility and huge pressure. And for sure, a lot of sleepless nights, with a single thought running through your head: “how are we going to survive?”

I had a chance to observe several founders in several companies, hitting this level. They were all good-hearted, creative, smart, modest and ethical people. But I could see, day for day, how this pressure had melted, squeezed or at least severely bent their personality. At some point, you have to ignore interests of your friends, in the sake of the enterprise. At some point, you have to make unpopular, hard decisions and stop some projects, because your enterprise can’t handle too much projects at once and has to focus more sharply to survive. You have to cut parts of the body to save the rest. And on some day, you have to lay off somebody for the first time. If you didn’t had grey hairs before, this is the time for the first one.

At this stage, enterprises usually have a very loyal staff, and everyone has a very entrepreneurial approach: everybody knows exactly, how are we earning money, what does he or she has to do to help earning money, and what will happen if someone stops earning money. Summer Enterprises that don’t have enough staff of this kind, die very quickly.

First formally defined roles appear, out of a very practical and extremely transparent reason (that everyone can follow): that the division of work will reduce overhead, and thus help earning more money, and thus help the enterprise to survive. With roles comes responsibility, and some formal processes. The individual contribution of every single person starts getting fuzzy, because of division of labor, so that first non-monetary KPIs appear. Non-monetary KPIs lead to the first “locality problems”, where some people tend to over-optimize their own KPI, at the expense of some other departments, and the overall revenue. But because the company is still on the profitability edge and is fighting for its survival, these problems are usually timely detected by CEOs and fixed.

At some point, the enterprise gets a momentum. Some kind of a flywheel appears, generating ever more revenue and income, seemingly by itself. In the Enterprise Autumn, the company starts hiring more and more staff. Survival of the company is getting less and less dependent on individual contribution or individual decisions of any single employee. There will be more and more process. At this point, CEOs realize that they have finally achieved the nirvana they had envisioned so eagerly in their sleepless nights before, and start focusing on the conservation of the status quo. Minimizing or at least managing risks of destroying the flywheel is prioritized above trying some new ways earning money. Every single department is culturally trimmed to minimize risks and avoid mistakes. As a result, any major innovation ceases.

Usually, at this point, more and more people playing corporate politics are hired.

Remember the feeling of people before the 20th century? The mankind was so small compared with the nature that no one made any second thought cutting the last tree in the forest or spilling waste into a river. The “Well, when this forest is cut down, we’ll just move on to the next forest” attitude. Only in the 20th century, people have finally realized that the Earth is a closed and pretty limited ecosystem. The Enterprise Summer is just like ecological thinking – everybody is aware that any single major fuck-up can end up with a global meltdown. Everybody is an Entrepreneur. On the contrary, in the Enterprise Autumn companies, there are a lot of people with the middle age attitude. They know that the momentum is huge and flywheel is big, so that they can allow putting their own career interests above the interests of the enterprise.

This is why Autumn Enterprises are so full of corporate politics. And from some particular point of view, one can at least understand it. After all, the well-being of a living, breathing person should be valued more than some abstract 0,01% uplift in revenues of some soulless corporate monster, earning money for some minority to allow them to buy a second yacht. So no wonder some people feel it ethical to do corporate politics and enjoy playing politic games. Others have to participate to protect themselves. Yet another just go under radar and opt out.

Another consequence of the corporate politics is the rise of huge locality problems, where the narrow focus on the KPIs of my own department prevails, often at the expense of the overall revenue, and there is nobody who can untangle these problems.

But no momentum can be forever. Either the too much of locality problems, or some external sudden market shift damages the flywheel, so that it cannot rotate so effortlessly than before. This is the time of the Enterprise Winter. At this point, the company usually has a long history of corporate politics, so that

a) all of its most important posts are occupied by corporate politicians with a non-ecological thinking, and

b) most of ecologically thinking Entrepreneurs have either left the company, or remained on an outsider role without any real influence.

To fix the flywheel, or to find out a new one, the enterprise needs (more) Entrepreneurs. But the corporate politicians (correctly) see them as a danger for themselves and fight them.

Different things can now happen depending on balance of power between the two groups. Entrepreneurs might win the battle, or at least manage to fix the flywheel while being constantly under attack. Or personal interests of corporate politicians might accidentally be best represented by a project that also fixes the flywheel. Or the flywheel has so much energy that it allows the company to survive for years and years, even in the damaged state, and then, another lucky external influence might fix it. Microsoft’s flywheel has been severely damaged around 10 years ago, and they have demonstrated both spectacular flywheel repairs and awful additional flywheel damages since than. Apple had experienced a similarly long period, the 12 years without Jobs.

But in the worst case, if the flywheel is weak and the corporate politics prevails, the agony might start, with all possible short-term potentials being sucked out of the flywheel, then staff getting laid off, and then all remaining assets being sold.

How M should an MVP be?

Minimum Viable Product is now mainstream. But what exactly does it mean?

In my opinion, MVP is just an example of a more generic principle: Fail Fast. In other words, if you have to fail, it is better to fail in the very beginning, reducing the amount of burned investment.

If my idea is good, using MVP is counterproductive: some early adopters will get bad first impression due to lack of some advanced features or overall unpolishness, and we will need to spend much more money later just to make them to give us another chance.

If my idea is bad, MVP will save us a lot of money.

Because there is no sure way to know if my idea good or bad beforehand, it is safer to assume it is bad and go with the MVP.

But how exactly minimal the product should be? Do we want to reduce the feature set? Or don’t care about usability? Or save on proper UX and design? Does it mean it may be slow, unresponsive, unstable? Can its source code be undocumented and unmaintainable?

Well, the reason of MVP is reducing the overall investment. The principle behind it, is investing just that much to achieve a sound and valid market test, and not more. This means, when deciding about MVP, you tend to cut the area what costs you most.

For example, let’s assume we have a product development team that needs only 1 day to design a screen, 3 days to develop the backend for that screen, and 10 days to develop the frontend. It is naturally, that MVPs produced by this team would tend to have great visuals combined with an awful and buggy UI and a very good backend.

Let’s assume now that a team needs a week to design one screen, 1 day to develop the frontend, and 5 days to develop the backend. MVPs of that team would tend to have ugly, but responsive and user-friendly UI that would often need to show the loading animation because of a sluggish backend.

What does it mean?

This means that a double advantage is given to teams capable of designing and fully developing one screen per day: not only their MVP will be released sooner (or alternatively can have more features, better look and performance and more user-friendly UI), but also it can be a well-balanced and therefore mature-looking product (that’s an advantage to be mature-looking).

And this also means, if you want to identify where your business has capacity issues, just look at your typical MVPs: if some areas of them are substantially worse than other, you know what areas of the product team can be improved.

Client Driven Development

When I first tried out the test-driven development (it was around 1998, I think), I was fascinated how it helped me to design better APIs. My unit tests were the first clients of my code, so that my classes obtained a logical and easy-to-use interface, quite automatically.

Some time later I’ve realized that if you have a lot of unit tests, they can detect regressions and therefore support you during refactoring. I’ve implemented two projects, each took a couple of years, and have written around 200 unit tests for each.

And then I’ve stopped writing unit tests in such big counts. My unit tests have really detected some regressions from time to time. That is, around 5 times a year. But the efforts writing and maintaining them were much higher than any advantages of having detected a regression before manual testing.

But still, I was missing the first advantage of TDD, the logical and easy-to-use interfaces. So I’ve started to do Client Driven Development.

The problem with the unit tests is that they don’t have any direct business value per se. They might be helpful for business goals, but in a very indirect way. I’ve replaced them with a client code that does have some direct business value.

For example, I’m developing a RESTful web service. I roughly know what kind of queries and responses it must support. I start with developing a HTML page. In there, I write an <a> tag with all the proper parameters to the web service. I then might write some text around it, documenting the parameters of the service. Then I open this page in the browser and click on the link, which predictably gives me a 404 error, because the web service is not yet implemented. I then proceed with implementing it, reloading my page and generally using it in place of a unit test.

Of course, this approach has the drawback that, unlike in a unit test, I don’t check the returning values and thus this page cannot be run automatically. If you want, you still can replace this link with an AJAX call and check the returning values – I personally don’t believe that these efforts would pay off at the end of the day. More important is that this page has an immediate business value. You can use it as a rough and unpolished documentation for your web service. You can send it to your customer, or another team writing some client, etc.

If the web service is designed in a way that it is hard to get away with <a> and <form> tags, I would write some JavaScript or Silverlight code to call it properly. In this case, the page might have more business-relevant functions. For example, when it loads, it might request and display some data from the web service, in a sortable and scrollable grid, and allow you to edit them, providing you with a very low-level “admin” interface to the service.

This approach is not constrained by web development. I’ve used it for example for inter-process communication, and if my code has not yet been refactored out, it is flying now in passenger airplanes, and running inside of TV sets in many living rooms. In this variant, I start developing the inter-process communication by creating a bash script or a trivial console app that would send the messages to another process. I implement corresponding command-line options for them. When I’m ready, I start developing the receiving part, inside of some running process. This has the similar effect on the API design as unit testing, but has the advantage that you can use it during debugging, or even in production, for example in some startup scripts.

I’m not an inventor of this approach, indeed I often see this approach in many open source projects, but I’m not aware of any official name for it.


Smart TV application software architecture

Someone come to my blog searching the phrase in the title of this post. To avoid disappointments of future visitors, here is a gist of what the architecture looks like.

First of all, let’s interpret the architecture very broadly as “most important things you need to do to get cool software”. According to this, here is what you need to do:

1) Put a TV set on the developer’s desk. And no, not “we have one TV set in the nearby room, he can go test the app when needed”. And no, not “there is a TV set on a table only 3 meters away”. Each developer must have an own device.

2) Get a development firmware for the device so that you’ll get access to all log files (and ideally, access to the command line). A TV set is a Linux running WebKit or Opera browser.

3) Most Smart TVs support CE-HTML and playing H.264 / AAC videos in MP4 format. Just read the CE-HTML standard and create a new version of your frontend. Alternatively, you might try to use HTML5, because many Smart TVs would translate remove control presses as keyboard arrow key presses; and some Smart TVs support the <video> tag.

4) If you’re interested in a more tight integration with the TV, eg. be able to display live TV in your interface, or switch channels, or store something locally, you need to choose a target ecosystem, because unfortunately there is no standard app API spec today.


For some reason, I meet people every day who don’t agree with my MUSE framework or at least implicitly have a different priority framework in their minds. Usually it looks like this:

“Let us conceive, specify, develop, bugfix and release the product, and then ask the marketing guys to market it”. Well, what if we first ask marketing, what topics can bring us the cheapest leads and then conceive the product around them, or at least not against them?

“Solution A is better from the usability standpoint than solution B, therefore we should do A”. Well, B is better for motivation, because it looks more beautiful, and beauty sells. I don’t care if something is easy to use, if it looks so ugly that nobody would want to use it.

“So does this bug prevents users from using the feature, or it is just some optics, and the functionality is all in place and working?” Well, users first needs to have a reason to use our product, second must be able to understand the product. Unless these two requirements are satisfied, it doesn’t matter if the functionality is working. It is different from, say, enterprise software, where users are in the work setting and have to use the software. In the entertainment market, nobody has to read the book, listen the song, or watch the movie to the end. Or use our web site.

“MVC is a great idea, because it allows us to decouple logic from view. Let’s quickly find and use some MVC framework for HTML5!” Yes, MVC is a great idea for enterprise software, because it makes UI easier to test, allows designers and developers to work in parallel, and provides reusable components in a very natural and effortless way. But one of the MVC drawbacks is that you can easily hit a performance issue in the UI, because this architecture prevents you from squeezing 100% of the performance from the UI technology. Besides, HTML5 MVC frameworks often wrap or abstract away DOM objects and events, and therefore make it complicated to define exactly, what is shown to the user. Another MVC drawback is that it helps you to believe that reusing exactly the same UI components overall in the app is always a good idea, because hey it is good for implementation and good for usability. But a seductively looking beautiful design is more important. Even if it looks a little different on different screens.

And sometimes it is quite hard for me to understand these opinions, because MUSE sounds for me so natural and logical that I can’t imagine any other consistent priority framework.

MUSE – an attempt of product priority framework

Steve Jobs said, product management is deciding what not to do.

But how do you decide?

This is the priority framework I’m trying to live today. It works like a Maslow’s pyramid: until the first level is solved, it is not possible or not necessary to solve the second level.

Motivation. People must be motivated to start using the product. If they don’t even see a reason to start using it, nothing else matters.

  • Good marketing or packaging.
  • Product / Market fit.
  • Seducing optic (design for conversion).

Usability. People must be able to use the product, and have fun using it. If they fail to use the product, nothing else matters.

  • It must be not too hard to learn how to use the product.
  • People must understand the product.
  • Any unnecessary task people have to do must be eliminated.
  • If possible, limit the product with the functions that are easy to make easy to use. Or at least start with these features.

Service. This is where the functionality and features come.

  • Not just a function, but a service, solving a user’s problem.
  • Not just a service, but a trusted first-class high-quality service, with a sincere smile, and the good feeling of having made a right choice.

lovE. This is the ultimate dream.

  • People that are attached to your product.
  • People that identify part of themselves with your product.
  • People that not only recommend your product, but start flame wars defending its drawbacks.


Software Architecture Quality

What is a good software architecture?

Too many publications answer to this question by introducing some set of practices or principles considered to be best or at least good. The quality of the architecture is then measured by the number of best practices it adheres.

Such a definition might work well for student assignments or in some academia or other non-business settings.

In software business, everything you ever spend time or efforts on, must be reasonably well related with earning money, or at least with an expectation to earn or save money.

Therefore, in business, software architecture is a roadmap how to achieve functional requirements, and non-functional requirements, such as run-time performance, availability, flexibility, time-to-market and so on. Practices and principles are just tools that may or may not help achieving the goals. By deciding, which principles and practices to use, and which tools to ban, a software architect creates the architecture, and the better it is suited for implementing the requirements, the better the architecture quality is.

Like everything in the nature, good things don’t appear without an equivalent amount of bad things. If some particular software architecture has a set of advantages helping it to meet requirements, it has to have also a set of drawbacks. In an ideal world, software architect is communicating with business so that the drawbacks are clearly understood and strategically accepted. Note that any architecture has its disadvantages. The higher art and magic of successful software development is to choose the disadvantages nobody cares about, or at least can live with.

For already implemented software, the architecture can be “reverse engineered” and the advantages and disadvantages of that architecture can be determined.

And here is the thing. Implemented software is something very hard to change, and so its architecture. Thus, the set of advantages and disadvantages remains relatively stable. But the external business situation or even company’s strategy might change, and change abruptly. Features of the architecture that were advantages before, might become obstacles. Drawbacks that were irrelevant before, might become show stoppers.

The software architecture quality might change from being good to being bad, in just one moment of time, and without any bit of actual software being changed!

Here are two examples of how different software architectures can be. This is for an abstract web site that has some public pages, some protected area for logged-in users, some data saved for users, a CMS for editorial content and some analytics and reporting module.

Layered cake model

On the server side, there is a database, a middleware generating HTML, and a web server.

When a HTTP request comes, it gets routed to a separate piece of middleware responsible for it. This piece uses an authentication layer (if needed) to determine the user, the session layer to retrieve the current session from a cookie, persistence layer to get some data from the database (if needed), and then renders the HTML response back to the user.

Because of these tight dependencies, the whole middleware runs as a single process so that all layers can be immediately and directly used.

If AJAX requests are made, it is handled in the same way on the server side, except that JSON is rendered instead of HTML. If the user is has to input some data, a form tag is used in HTML, which leads to a post to the server, which is handled by the server-side framework “action” layer, extracting the required action and the form variables. The middleware logic writes then the data into the database.

CMS writes editorial data in the database, where it can be found by middleware and used for HTML rendering.

A SQL database is used, and tables are defined with all proper foreign constraints so that data consistency is always ensured. Because everything is stored in the same database, analytics and reporting can be done directly on production database, by performing corresponding SQL statements.


  • This architecture is directly supported by many web frameworks and CMSes.
  • Very easy to code due to high code reuse and IDE support, especially for simple sites.
  • Extremely easy to build and deploy. The middleware can be build as a single huge executable. Besides of it, just a couple of static web and configuration files have to be deployed, and a SQL upgrade script has to be executed on the database.
  • AJAX support in the browsers is not required; the web page can be easily implemented in the old Web 1.0 style. Also, no JavaScript development is required (but still possible).
  • Data consistency.
  • No need for a separate analytics DB.


  • Because of the monolithic middleware, parts of the web site cannot be easily modified and deployed separately from each other. Every time the software gets deployed, all its features have to be re-tested.
  • On highly loaded web sites, when using the Web 1.0 model, substantial hardware resources have to be provisioned to render the HTML on the server side. If AJAX is introduced gradually and as an after-thought, the middleware often continues to be responsible for HTML rendering, because it is natural to reuse existing HTML components. So that the server load doesn’t decrease.
  • Each and every request generates tons of read and write SQL requests. Scaling up a SQL database requires a very expensive hardware. Scaling out a SQL database requires very complicated and non-trivial sharding and partitioning strategies.
  • The previous two points lead to a relatively high latency of user interaction, because each HTTP request, on a loaded web site, tends to need seconds to execute.

Gun magazine model

Web server is configured to deliver static HTML files. When such a static page is loaded, it uses AJAX to retrieve dynamic parts of the user interface. CMS generates them in form of static html files or eg. mustache templates.

The data for the user interface is also retrieved with AJAX: frontend speaks with web services implemented on the server side, which always respond with pure data (JSON).

There are different web services, each constituting a separate area of API:

  • Authentication service
  • User profile service
  • One service for each part of the web site. For example, if we’re making a movie selling site, we need a movie catalog service, a movie information service, a shopping cart service, a checkout service, a video player service, and maybe a recommendation engine service.

Each area of API is implemented as a separate web application. This means, it has its own middleware executable, and it has its own database.

Yes, I’ve just said that. Every web service is a fully separate app with its own database.

In fact, they may even be developed using different programming languages, and use SQL or NoSQL databases as needed. Or they can be just 3rd-party code, bought or used freely for open-source components.

If a web service needs data it doesn’t own, which should be per design a rare situation, it communicates with other web services using either a public or a private web API. Besides, a memcached network can be employed, storing the session of currently logged in users that can be used by all web services. If needed, a separate queue such as RabbitMQ can be used for communication between the web services.


  • UI changes can be implemented and deployed separately from data and logic changes.
  • Different web services can be implemented and deployed independently from each other. Less re-testing is required before deployment, because scope of changes is naturally limited.
  • Different parts of the web site can be scaled independently from each other.
  • Other frontends can be plugged in (mobile web frontend, native mobile and desktop apps, as well as Smart TV optimized web frontend)
  • Static files and data-only communication style ensure lowest possible UI latency. This is especially true for returning users, who will have almost anything cached in their browsers.


  • Build and Deployment is more complicated: at very least, some kind of package manager with automatic dependency checking has to be used.
  • Middleware code reuse is more complicated (but still possible).
  • Data might get inconsistent, and software has to be developed in a way it can still behave in a reasonable and predictable way in case of inconsistencies.
  • AJAX support on the client side is a requirement.
  • A separate data-warehouse or at least a SQL mirroring database is required to run cross-part and whole-site analytics.

Tabu search

There is no simple solution how to achieve the best or at least a good conversion rate. Usually, one just makes an assumption, develops a page, tracks user behavior, analyses the data — and then makes another assumption. This repeats until one can find a satisfying conversion rate, or one runs out of time.

Interestingly enough, there is a quite large amount of mathematical work regarding a similar problem. In the mathematical slang, they call it the global optimization problem.

Mathematically, conversion rate optimization can be described as finding local (or at best, global) maximum of function f(X), where X is a vector of different factors influencing conversion rate. Unlike a typical mathematical optimization problem, the function f is not known beforehand, and both the function f and the factors included in the vector X might change after each optimization step.

There are a lot of metaheuristics formulating different strategies that can be used for the optimization. I’ve just read about a very small subset of them, and I find that the strategies are very interesting. Especially, their names. Who has said mathematicians are not creative?

This is my interpretation of them, as applied to the conversion rate optimization problem.

Gradient descent
You make a small change of all factors in X at the same time, to maximize the conversion rate as you see it. Then you measure it. Then you try to understand, what factor change worked positively and what negatively, and revert changes in wrong factors while increasing the change in the right factors. Iterate, until a local maximum is found.

Hill climbing
Start with a baseline of factors X. Now change the first factor x1 in X. Measure conversion rate. Revert x1 and change second factor x2 in X. Measure. Repeat testing, until all factors in X are tested. This procedure can also be done parallelly using a multi-variate A/B testing. However the testing is done, at the end of the day we can find out, which winning factor xi has made the largest improvement in conversion rate. Now establish a new baseline X’, which is just like X, but with the improved winning factor xi. Continue iterating from this new point, until a local maximum is found. It is quite probably that in each iteration a different factor will be improved.

These two strategies have the following drawbacks:

  • Walking on ridges is slow. Imagine the N-dimensional space of factors X, and the function f as a very sharp ridge leading to a peak. You might be constantly landing on a side of ridge, so that you are forced to move to the peak in zigzag. This takes valuable time.
  • Plateaus are traps. Imagine that no change of the current baseline X is measurably better than X. You know you’re on a plateau. Perhaps it is also the highest possible point. Or perhaps, if you make some steps, you’ll find an even higher point. The problem is, in which direction should you go (i.e. which factors do you want to change?)
  • Only local maximum might be found. If your starting point is nearer to a local maximum than to a global maximum, you’ll reach the smaller peak and stop, because any small change of the current baseline X will be measurably worse than X.

To fight these problems, a number of strategies has been developed.

In the shotgun hill climbing, when you’re stuck, you just restart the search from a random position that is sufficiently different from your latest one.

In the simulated annealing, you are allowed to go into directions what actually show worse conversion rates, but for some short time. This can help to jump over abysses in function f, if they are not too wide.

And in tabu search, if one is stuck, one explicitly ignores all the positive factor changes and explores all the other factors (that previously didn’t play a big role).

How to handle errors

It seems I’m going through my old articles and re-writing them. The first version of How to estimate has been written in 2008, and I wrote the first version of this article around 2004. Originally it was focused around exceptions, but here I want to talk about

  • Checking errors
  • Handling errors
  • Designing errors

Checking errors

In OCaml programming language, you can define so called variant types. A variant type is a composite over several other types; an expression or function can then have the value belonging to either one of those types:

type int_or_float = Int of int | Float of float

(* This type can for example be used like this: *)
let pi_value war_time =
    if war_time then Int(4) else Float(3.1415)

# pi_value true
Int 4

# pi_value false
Float 3.1415

(you can try this code online

OCaml is a very statically typed, very safe language. This means, if you use this function, it will force you to handle both the Int and the Float cases, separately:

# pi_value true + 10  (* do you expect answer 14? no, you'll get an error: *)
Error: This expression has type int_or_float
       but an expression was expected of type int

(* what you have to do is for example this: *)
# match pi_value true with
      Int(x) -> x + 10
    | Float(y) -> int_of_float(y) + 10
int 14

The last line checks if the returning value of pi_value true is actually an Int or a Float, and executes different expressions in each case. If you forget to handle one of the possible return types in your match clause, OCaml will remind you.

One of the idiomatic usages of variant types in OCaml is error handling. You define the following type:

type my_integer = Int of int | Error

Now, you can write a function returning a value of type “integer or error”:

let foobar some_int =
    if some_int < 5 then Int(5 - some_int) else Error

# foobar 3
Int 2

# foobar 7

Now, if you want to call the foobar, you have to use the match clause, and you should handle all possible cases. For example:

let blah a b =
    match (foobar a, foobar b) with
          (Int(x), Int(y)) -> x + y
        | (Error , Int(y)) -> y
        | (Int(x), Error)  -> x
        | (Error , Error)  -> 42

Not only this language design feels very clean, but also it helps to understand that errors are just return values of functions. They are part of the function codomain (function range), together with the "useful" return values. From this point of view, not checking and not being able to process error values returned by a function, should feel equally strange as if we wouldn't be able to process some particular integer return value.

Still, often I don't check for errors. I think, it is related to the design of many mainstream languages, making it harder to emulate variant types or to return several values. Let's take C for example:

int error_code;
float actual_result = 0;

error_code = foo(input_param, &actual_result);

if(error_code < 0) {
  // handle error
} else {
  // use actual_result

and compare it with OCaml, which is both safer and more concise:

type safe_int = Int of int | ErrorCode of int

match foo input_param with
      Float(f) -> (* use result *)
   |  ErrorCode(e) -> (* handle the error *)

Unfortunately, most of us have to use mainstream languages. Error checking makes source code less readable, therefore I try to counteract it by using a uniform specific code style for error handling (eg. same variable names for error codes and same code formatting).

Recap: checking for errors is the same as being able to handle the whole spectrum of possible return values. Make it part of your code style guide.

Handling errors

In Smalltalk, exceptions (any many other things) are implemented not as a magical part of the language itself. They are just normal classes in the standard library. (if you're ok with Windows, try Smalltalk with my favorite free implementation, otherwise go for the cross-platform market leader. In Smalltalk, REPL is traditionally called a Workspace)

Processor "This global constant gives 
           you the scheduler of Smalltalk 
           green threads"

Processor activeProcess "This returns the green thread 
                         being currently executed"

Processor activeProcess exceptionEnvironment "This gives the 
                                              current ExceptionHandler"

my_faulty_code := [2 + 2 / 0] "This produces a BlockClosure, 
                               which is also known as closure, 
                               lambda or anonymous method 
                               in other languages"

my_faulty_code on: ZeroDivide 
               do: [ :ex | Transcript 
                              display: ex; 
                              cr] "This will print the 
                                   ZeroDivide exception 
                                   to the console"

The latter line of code does roughly the following:

  1. The method #on:do: of the class BlockClosure creates a new object ExceptionHandler, passing ZeroDivide as the class of exceptions this handler cares about, and the second BlockClosure, which will be evaluated, when the exception happens.
  2. It temporarily saves the current value of Processor activeProcess exceptionEnvironment
  3. Sets the newly created ExceptionHandler as the new Processor activeProcess exceptionEnvironment
  4. Stores the previously saved value of exception handler in the outer property of the new ExceptionHandler.

This effectively creates a stack of ExceptionHandlers, based on a trivial linked list, and having its head (the top) in Processor activeProcess exceptionEnvironment.

Now, when you throw an exception:

ZeroDivide signal

the signal method of the Exception class, which ZeroDivide inherits from, starts with the ExceptionHandler currently defined in Processor activeProcess exceptionEnvironment and loops over the linked list, until it finds an ExceptionHandler suitable for this. Then, the exception object passes itself to the handler.

Not only it looks very clean and is a brilliant example of proper OOD, but also it helps to understand that exceptions is just a construct allowing you not to check for the error in the immediate function caller, but propagate it backwards in the call stack.

Now why is it important?

Because one thing is to check for error, and another thing is to handle it, meaningfully. The latter is not always possible in the immediate caller.

Deciding how to handle errors, meaningfully, is one of the advanced aspects of software development. It requires understanding of the software system I'm working on, as a whole, and the motivation to make code as user-friendly as possible -- in the most generic sense: my code can be used by linking and calling it from another code; or an end-user would execute it and interact with it; or somebody will try to read, to understand, to debug and to modify my code.

What makes things worse is the realization that most of time, sporadic run-time errors happen in a very small percentage of use-cases, and therefore they are usually associated with a quite small estimated business value loss. Therefore, from the business perspective, only a small time budget can be provided for error handling. We all know that and when under time pressure, the first thing most developers compromise is error handling.

Therefore, every time I decide how to handle an error, I try to answer all of the following questions:

  • How far should we go trying to recover from the error, given the limited time budget?
  • If the user is blocked waiting for results of our execution, how to unblock him, but (if possible) not to make him angry?
  • If the user is not blocked, should we inform him at all?
  • If we assume a software bug being the reason of an error, how to help testers to find it, and developers to fix it?
  • If we assume an issue with installation or environment, how to help admins to fix it?

Usually, this all boils down to one of the following error handling strategies (or a combination of them):

  • Silently swallow the error.
  • Just log the exception.
  • Immediately fully crash the app.
  • Just try again (max. N times, or indefinitely).
  • Try to recover, or at least to degrade gracefully.
  • Inform the user.

I'll try to describe a typical situation for each of the handling strategies.

I'm using a third-party library that throws an exception in 20% of cases when I use it. When this exception is thrown, the required function will still be somehow performed by the library. I will catch this specific class of exceptions and swallow them, writing a comment about it in the exception handler.

I'm writing a tracking function, which will be used 10 times a second to send user's mouse position from the web browser back to the web server. When posting to the server fails for first time, I will log the error (including all information available), and either swallow all other errors, or log every 10th error.

The technology I'm using for my app allows me to define what to do, if an exception is unhandled. In this handler, I will implement a detection if my app is running on development or staging; or in production. When running on production, I will log the error and then swallow it. If it not possible to swallow the error, I'll inform the user about unexpected error (if possible using some calming down graphic). Not on production, I will crash the app immediately and write a core dump, or at least log the most accurate and complete information about the exception and the current app state. This will help both me and testers to detect even smaller problems, because it will help creating a no-bug-tolerance mindset.

I'm writing a logger class for an app working in the browser. The log is stored on the web server. If sending the log message fails, I will repeat the post 3 times with some delay. If it still fails, I'll try to write it into the local offline storage. If writing in this storage fails, I will output it to the console.

I'm writing some boring enterprise app with a lot of forms. User clicks on a button to pre-populate the form with data from the previous week. In this event handler, I will place a try/catch block, and my exception handler will log the exception. I will then go chat with the UX designer to decide if and how exactly to inform the user about the issue.

Recap: handling errors is not trivial -- there are no hard and fast rules, and you need to know about the whole system and think about usability do to it properly.

Desiging errors

Designing errors is deciding when and how to signal error conditions from a reusable component (framework). If handling errors is complicated, because you have to know the overall context and think about usability, designing errors is in order of magnitude more complicated, because you have to imagine all possible systems, contexts and situations where your code will or can be used, and design your errors so that they can be handled easily (or at the very least, can be handled reasonably).

Frankly speaking, I haven't designed an error system (yet) I were particularly proud about, and I think this complicated topic is pretty subjective and a question of your style.

My personal style is to believe that my framework or library is just a guest, and the calling code is a host. As a guest, one must respect decisions of the host and do not try to force any specific usage pattern. This is why most (but not all) of my properties and methods are public. I don't know how the host is going to communicate with me, and I'm not going to force one specific style over him, or declare some topics taboo. I still specifically mark preferred class members though, so that I can indicate my own communication preferences (or suggested API) to the host. I also warn the host in the comments that all members outside of the suggested API are subject to change without notice. But ultimately, it is up to host what to use and how.

This approach has the drawback that the developer writing the calling code has to understand must more about my component, and that he has less support from IDE and compiler detecting usages of class members that I don't recommend to use. But it has the advantage of giving the greatest possible freedom to the host. And I believe that I have to trust in host that he won't use his freedom to shoot himself in the leg.

When designing errors, I think I should follow the same idea. This means:

  • If possible, do not force the caller to check for errors. It is his choice. Maybe he is just prototyping some quick and dirty idea and needs my component for a throw-away code.
  • If possible, do not handle errors in the component, but let the calling code to handle them. Or at least, make it configurable. Specifically, do not retry or try to recover from an error, because retrying or recovering takes time and resources, and the host might have a different opinion about what error handling is appropriate. Provide an easy API for retrying/recovering though, so that if the caller decides to recover, it would be easy for him. Another example: do not log errors, or at least make the logging configurable. In one of my recent components I've violated this recommendation. When the server didn't respond timely, 3 lines were added to the log instead of one: the first one has been added by a transport layer, the second one from the business logic that actually needed data from the server, and the third one from the UI event handler. This is unnecessary. Logs must be readable, and usability of the logging is one of the most important factors separating well designed error handling from the bad designed one.
  • Do not provide text messages describing the error. The caller might be tempted to use them "as is" to show them in a message box. And then he'll get problems, when his app will need to be translated into another language.
  • Provide different kinds of errors, when you anticipate that their handling might be different. For example, if the component is a persistency layer, provide one kind of exceptions for problems related with network communication with the database, and other kind of exceptions for logical problems like non-unique primary key when inserting a new record, or non-existent primary key when updating a record.
  • Add as much additional information into the error as possible. In one of my projects I went so far: when downloading and parsing of some feed from the web server failed, I've added the full http request (including headers and body) and full http response, along with the corresponding timestamps, into the error.
  • If possible, always try to signal errors using one and the same mechanism. In my recent project, some of my functions have signaled the error by returning 0, other functions by returning -1, and yet another one has accepted a pointer to the result code as argument.

Recap: Designing errors is even more complicated than handling them.