Six principles of debugging

Debugging is an underrated skill. Even though we spend about 50% of our time at work debugging, it is barely taught at universities. The ability to debug might be the most important factor differentiating 10x developers from the rest, and may therefore be a predictor of our job satisfaction and compensation.

Debugging is

  • finding the bug,
  • fixing it,
  • and proving it is fixed (a test).

Unintuitively, debugging always starts with the last step.

Reproduce or measure

If we cannot reproduce a bug, no attempted fix can be proven to work. We then cannot tell customers or stakeholders whether the bug has been fixed or not.

Therefore, find a way to reproduce the bug. Ideally, it should be reproducible with little effort, time, and money.
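For example, the proof can be captured as a regression test that fails while the bug exists and passes once it is fixed. A minimal sketch in pytest style; the module pricing, the function parse_price, and the failing input are all hypothetical:

    # test_parse_price.py -- hypothetical regression test (pytest style).
    # It encodes the reproduction: it fails while the bug exists
    # and passes after the fix, which proves the fix.
    from decimal import Decimal

    from pricing import parse_price  # hypothetical module under test


    def test_parse_price_handles_thousands_separator():
        # The (made-up) bug report: "1,299.99" was parsed as 1.0 instead of 1299.99.
        assert parse_price("1,299.99") == Decimal("1299.99")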

Some bugs cannot be easily reproduced: for example, when the bug depends on the exact order of execution, when something is randomized, when we depend on an external system, or when we use an AI model with billions of parameters.

Note that operating systems that run more than one process or thread concurrently (i.e. all modern operating systems) are inherent sources of randomization, because we cannot control which other processes and threads are running and which memory we get allocated.

In such cases, the bug should at least be measurable. For example, we could write a log line every time our software crashes with an invalid pointer exception, and create a dashboard showing the number of crashes per week. After we implement a potential bugfix, we can at least see how this indicator changes once we roll out the change into production. Better yet, roll it out on a part of the production servers and compare the number of crashes between the old and the new versions.
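A minimal sketch of such a crash counter, assuming the application can catch its own fatal errors; the logger name, the function, and the log format are made up:

    import logging

    logger = logging.getLogger("myapp.crashes")  # hypothetical logger name

    def report_crash(exc: BaseException, build_version: str) -> None:
        # One log line per crash; the log aggregator can count these per week
        # and split them by build_version to compare the old and new releases.
        logger.error("crash type=%s version=%s", type(exc).__name__, build_version)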

Proving a fix to an ML model is also something that can only be measured, not reproduced. Even if the bug description is very specific and fully reproducible (e.g. “every time I ask it to calculate 1+1 it gives a wrong answer”), we usually don’t want to fix this exact query, but all possible queries involving addition of numbers. So to measure our bugfixing success, we can generate a synthetic dataset containing a couple of thousand additions, run it through the model, measure the rate of wrong answers, then fix the model and measure the improved error rate again.
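A rough sketch of such a measurement, assuming the model is wrapped behind a hypothetical ask_model(prompt) -> str function that answers with a bare number:

    import random

    def addition_error_rate(ask_model, n=2000, seed=42):
        # Generate a synthetic dataset of n random additions and measure
        # how often the model answers them incorrectly.
        rng = random.Random(seed)
        wrong = 0
        for _ in range(n):
            a, b = rng.randint(0, 999), rng.randint(0, 999)
            if ask_model(f"Calculate {a}+{b}").strip() != str(a + b):
                wrong += 1
        return wrong / n

    # Measure before the fix, apply the fix, then measure again
    # and compare the two error rates.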

Read the error message

The first step in searching for the bug is to read the error message. Surprisingly many people don’t.

I can understand my mom not reading error messages: first, she doesn’t speak the language the error message is written in; second, she has a legitimate expectation that the three basic apps she uses should be bug-free.

But I absolutely can’t accept software developers not reading error messages. In many cases, most of the clues are already contained in the error message.

Sometimes error messages are hard to read, often because the printout spans several screens. Do spend some time reducing the noise, step by step, by reconfiguring what is being logged or printed out.

I once spent a day trying to fix a problem, only to find that the exact description of the problem, along with its suggested solution, was printed right in front of my eyes every time, and then immediately scrolled out of the visible area by some other very chatty Python framework (I am looking at you, pika).
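In Python, for example, the chatty framework’s logger can be turned down without hiding our own messages; a sketch, with logger names and levels to be adjusted to the actual setup:

    import logging

    # Keep our own messages visible...
    logging.basicConfig(level=logging.INFO)

    # ...but silence the overly chatty third-party framework,
    # so the actually relevant error stays on the screen.
    logging.getLogger("pika").setLevel(logging.WARNING)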

Generate and prioritize hypotheses

If the error message doesn’t contain hints about where the bug is, we need to come up with hypotheses about what it could be. Note the plural: hypotheses. Too many people come up with just one possible reason for the bug and immediately try to check it.

Generate a lot of hypotheses. Then use your intuition to start with the most probable one.

Let’s say we can’t connect to the database. Some people would immediately call the DBA and ask them to fix the database. But let’s generate all possible hypotheses first:

  • Hardware failure of the DB server
  • Somebody cut the network cable to the server
  • Somebody has changed the firewall settings so it rejects our connection attempts
  • Somebody is rebooting the server
  • Database process has crashed and is restarting
  • Database process is overloaded and doesn’t accept more connections
  • Database process is hanging
  • Somebody has deleted our DB user or changed our DB password
  • Somebody cut the network cable to our client PC
  • Our client PC has disconnected from the VPN required to access the DB server
  • Somebody has changed the firewall settings on our client PC so it prevents outbound connections to the server
  • Somebody has messed with the routing table on our client PC so the traffic goes to the wrong gateway
  • Somebody has messed with the routing table of the router
  • The DB server SSL certificate has expired
  • The client SSL software stack is not supported any more
  • Somebody has upgraded the server version and the communication protocol of the new version is no longer compatible with our client
  • Somebody has changed the IP address of the server and this DNS change hasn’t propagated to our client PC yet
  • Somebody has changed the DNS name of the server and forgot to update it in the secrets that we’re using to connect to the server
  • We forgot to retrieve the secrets to connect to the server

In my experience, the last hypothesis is the most probable one.
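Many of these hypotheses can be checked cheaply before calling anyone. A rough sketch (host name and port are placeholders) that rules out whole groups of them, DNS resolution and basic TCP reachability, within seconds:

    import socket

    HOST, PORT = "db.example.internal", 5432  # placeholders

    try:
        ip = socket.gethostbyname(HOST)        # checks the DNS / stale-record hypotheses
        print(f"{HOST} resolves to {ip}")
        with socket.create_connection((ip, PORT), timeout=3):
            print("TCP connection succeeded")  # rules out cables, firewalls, routing, crashed server
    except socket.gaierror as e:
        print(f"DNS resolution failed: {e}")
    except OSError as e:
        print(f"TCP connection failed: {e}")   # narrows it down to network / server-side hypotheses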

Note how these hypotheses span several IT domains: hardware, system software, networking, security, and application logic. Therefore, to become a better bug hunter, we need to dive deeper.

Dive deeper

No matter which IT domain you work in, there is always something below it. System software developers work on top of hardware. Network engineers work on top of systems and hardware. Framework developers use operating systems and networking software. You get the gist.

Bugs don’t care about the artificial domain borders we draw in our job descriptions. Even if you are a data scientist and only really care about ML and the third moment of the a posteriori probability distribution, it helps to have an understanding of networking, security, databases, operating systems, and hardware.

Such an understanding helps in general. But it helps especially when debugging.

Don’t hesitate to learn adjacent IT domains, at least at an overview level. We don’t need to know the IP header fields by heart or how BGP works, but we should understand how TCP differs from UDP, how HTTP builds on TCP, and how HTTPS layers TLS between the two.
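The layering can be seen directly in code. A small sketch using Python’s standard library (example.com stands in for a real host): a TCP connection is opened first, TLS is wrapped around it, and only then is the HTTP request sent over the encrypted channel:

    import socket
    import ssl

    HOST = "example.com"  # placeholder host

    tcp = socket.create_connection((HOST, 443))        # layer 1: TCP
    ctx = ssl.create_default_context()
    tls = ctx.wrap_socket(tcp, server_hostname=HOST)   # layer 2: TLS on top of TCP
    tls.sendall(b"GET / HTTP/1.1\r\nHost: " + HOST.encode()
                + b"\r\nConnection: close\r\n\r\n")    # layer 3: HTTP on top of TLS
    print(tls.recv(200).decode(errors="replace"))
    tls.close()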

With this knowledge, we can generate a few hypotheses to test. To check them one by one within a reasonable time, we need a fast feedback loop.

Fast feedback loop

Organize the development process so that we can change the code and see whether the change affects the bug as soon as possible (ideally immediately).

Often, we need to overcome various obstacles to get a truly fast feedback loop. We might need to get some production data into our development database, solving the permission and privacy issues along the way. Or we might need permission to attach a debugger to a production process.

Sometimes this is not possible, and then debugging becomes very tedious. Once I had to compile a gigabyte of C++ code, flash it onto a TV set, reboot it, then start watching a particular video for several minutes until the bug could be reproduced. That debugging took me four weeks.

Fix them all

So you have fixed the bug, and proven it is fixed.

Congratulations!

Now take a moment and think about all the other places where this or a similar bug can appear, and fix them too. Right now we still have the full case in our head and are perfectly positioned to implement high-quality fixes in all the other places as well.

Let’s say we couldn’t connect to the database because somebody changed its IP address, assuming that everybody was using the DNS name, while our secrets contained the IP address. We should then walk over all our secrets and replace all other IP addresses with DNS names (where possible and feasible).
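A rough sketch of such a sweep; the secrets layout is made up (a real secret store needs its own API), but the idea is to flag every value that still looks like an IPv4 address:

    import re

    # Hypothetical example: secrets exported as simple key/value pairs.
    secrets = {
        "ANALYTICS_DB_URL": "postgres://10.0.3.17:5432/analytics",
        "BILLING_DB_URL": "postgres://billing-db.internal:5432/billing",
    }

    ipv4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

    for key, value in secrets.items():
        if ipv4.search(value):
            print(f"{key} still contains an IP address -> consider switching to a DNS name")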

Optionally: contribute back

If you were fixing open source software, consider contributing back to the community by creating a merge request.

For this:

  • Ask your boss to check the business, corporate and legal aspects of contributing to open source
  • Read the contribution guidelines and understand the license of the project you are contributing to
  • Comply with the guidelines (this might require reformatting, documenting, writing (more) unit tests, etc.)
  • Before you make a merge request, check that you are not leaking any company secrets
  • Make a merge request
  • When it is merged, don’t forget to update your software version
  • Collect virtual karma points and a real “contributor” reference for your CV.
