Four Weeks of Bugfixing

The hardest bug I’ve ever fixed in my life took me 4 weeks to find. The bug report itself was pretty simple, but I have to give more context first.

I was one of the developers of a Smart TV software, and the bug related to the part of the software responsible for playing video files stored on your USB memory or drive. The CPU that was available for this task was a 750MHz ARM chip, and clearly it had not enough power to decode video (let alone HD video) in software. Luckily, every digital TV set has a hardware H.264 decoder, and our SOC was so flexible that we could use it programmatically. In this way, we were able to support H.264 video playback (too bad for you DivX and VC-1 owners).

Technically, the SOC has provided a number of building blocks, including a TS demux, an audio decoder, a video decoder, a scaler and multi-layer display device, and a DMA controller to transfer all the data between the blocks. Some of the blocks were present more than once (for example, for the PIP feature you naturally need two video decoders) and therefore could be dynamically and freely interconnected programmatically, building a hardware-based video processing pipeline. Theoretically, one could configure the pipeline by writing some proper bits and bytes in specified configuration registers of the corresponding devices. Practically, the chip manufacturer has provided an SDK for this chip, so that you only had to call a pretty well-designed set of C functions. The SDK was intended to run in the kernel mode of a Linux kernel, and it came from the manufacturer together with all building scripts needed to build the kernel.

Furthermore, this SDK was wrapped and extended by some more kernel-side code, first to avoid dependency on a particular SOC, and second to provide some devices to the user-mode, where the rest of the Smart TV software was running. So to play video programmatically, one needed to open a particular device from user mode as a file, and write into it a TS stream containing video and audio data.

Sadly, there are many people out there who have invented a lot of different container formats besides of TS. Therefore, our software had to detect the container format of the file to be played, demux the elementary streams out of it, then mux them again into a TS stream, and then hand it over to the kernel mode code. The kernel code would pass the TS bytes to the DMA device, that would feed the hardware TS demuxer, that would send the video elementary stream to the hardware video decoder, where it finally would be decoded and displayed.

For the user mode, we could implement all possible container formats ourselves (and this would mean some job security for the next 10 years of so). Fortunately the Smart TV software was architected very well so that the GStreamer framework was used (for you Windows developers it is an open-source alternative to DirectShow). The framework is written in C (to be quick) and GLib (to be object-oriented) and provides a pipeline container, where you can put some filters and interconnect them. Some filters read the data (sources), some process the data (eg. mux or demux), some use the data (sinks). When the pipeline starts playing, the filters agree on which one will drive the pipeline, and the driver would pull the data from all filters before it in the pipeline, and push the data into all the filters after it in the pipeline. Our typical pipeline looked like this (in a simplified form): “filesrc ! qtdemux ! mpegtsmux ! our_sink”. As you can expect from such a framework, there are also a lot of stuff related to events and state machines, as well as memory management.

So now, back to the bug report. It looked like this: when playing a TS file from USB memory, you can seek forward and backward with no limitation. When playing any other container format, you can seek forward, but you cannot seek backward. When seeking backward, the video freezes for several seconds, and then the playback continues from the latest position.

This is the sort of bugs when I think this might be fixed in a day or two. I mean, it works with TS, it doesn’t work with MP4, it is fully reproducible, so just find out what is different in those two cases and you’ve caught it.

The GStreamer pipeline in TS case looked like this: “filesrc ! our_sink”. So it must be either qtdemux or mpegtsmux. I’ve built another MP4 demuxer and replaced qtdemux with it. Negative, the bug is still there. No wonder, it also appeared in other container formats. I couldn’t replace mpegtsmux, because I haven’t found any alternatives. So the only thing I could do is to use the pipeline “filesrc ! qtdemux ! mpegtsmux ! filesink”, write the output into a file, and then try to dump the TS format structure and to look for irregularities.

If you know TS format, then for sure, you are already very sympathetic with me. TS is a very wicked and complicated format, and they repeat some meta-information every 188 bytes, so that the dump of several seconds of video took megabytes. After reading it, I didn’t find anything suspicious. Then I’ve converted my test MP4 video into a TS using some tool, dumped that TS, and compared. Well, there were some differences, in particular, how often the PCR was transmitted. Theoretically, PCR is just a system clock and should not influence the playback at all, but practically we already knew about some hardware bugs in the decoder making it allergic to unclear PCR signaling. I’ve spent some time trying to improve PCR, but this didn’t help either.

I have then played the dumped TS file, and I could see the seek backwards that I did during the recording. This has convinced me that mpegtsmux was also bug-free. The last filter I could suspect was our own sink. Implementing a GStreamer filter is not easy to do right in the first time. So that I went through all the functions, all the states, all the events, informed myself how the proper implementation should looked like, and found a lot of issues. Besides of a lot of memory leaks, we’ve generated a garbage during the seek. Specifically, GStreamer needs it to work in the following way:

1. The seek command arrives at the pipeline and a flush event is sent to all filters.

2. All filters are required to drop all buffered information to prepare themselves for the new data streamed from the new location.

3. When all filters have signaled to be flushed, the pipeline informs the pipeline driver to change playback location.

4. After the seek, the new bytes start flowing in the pipeline. Our code has conformed to this procedure somewhat, but did the cleanup prematurely, so that after the cleanup some more stale data polluted our buffers, before the data from the new location arrived.

I couldn’t explain why did it work with TS but not with MP4, but I’ve figured out that fixing it will make our product better anyways, so I’ve fixed it. As you can imagine, this didn’t solve the original problem.

At this point I’ve realized that I had to go into the kernel. This was a sad prospect, because every time I’ve changed anything in kernel, I had to rebuild it, then put the update on a USB stick, insert it into TV set, upgrade it to the new kernel by flashing the internal SOC memory, and then reboot the chip. And sometimes I’ve broken the build process, and the new kernel wouldn’t even boot, and I had to rescue the chip. But I had no other choice: I was out of ideas what else I could do in the user space, and I suspected that in the kernel space, we also had a similar issue with a garbage during the seek.

So that I’ve bravely read the implementation of the sink device and changed it in a way that it would explicitly receive a flush signal from the user space, then flush the internal buffer of the Linux device, then signal back to user space it is ready, and only then I would unlock the GStreamer pipeline and allow it to perform the seek and start streaming from the new location.

It didn’t help.

I went further and flushed the DMA device too. It didn’t help. Also flushing the video decoder device didn’t help.

At this point I’ve started to experiment with the flush order. If I first flush the DMA, the video decoder might starve in absence of data and therefore get stuck. But if I flush the decoder first, the DMA would immediately feed it with some more stale data. So perhaps I have to disconnect the DMA from video decoder first, then flush the decoder, then the DMA, and then reconnect them back? Implemented that. Nope, it didn’t work.

Well, perhaps the video decoder is allergic to asynchronous flushes? I’ve implemented some code that has waited until the video decoder reported that it has just finished the video frame, and then flushed it. Nope, this wasn’t it.

In the next step, I have subscribed to all hardware events of all devices and dumped them. Well, that were another megabytes of logs to read. And it didn’t help, that the video playback was a very fragile process per se. Even when playing some video, that looked perfectly well on the screen, the decoder and the TS demux would routinely complain of being out of sync, or losing it, or being unable to decode a frame.

After some time of trying to see a pattern, the only thing I could tell is that after the seek forward, the video decoder would complain for some frames, but eventually recover and start producing valid video frames. After a seek backward, the video decoder has never recovered. Hmm, can it be something with the H.264 stream itself that prevented the decoder to work?

Usually, one doesn’t think about elementary streams in terms of a format. They are just BLOBs containing the picture, somehow. But of course, they have some internal format, and this structure is normally only dealt with by authors of encoders and decoders. I went back to GStreamer and looked up, file by file, all the filters from the pipeline producing the bug. Finally, I’ve found out that mpegtsmux has a file having “h264” in its name, and this has immediately ringed alarm in my head. Because well, TS is one abstraction level higher than H.264, why the hell mpegtsmux has to know about the existence of H.264?

It turned out, H.264 bitstream has in its internal structure so-called SPS/PPS, the sequence parameter set that is basically a configuration for the video decoder. Without the proper configuration, it cannot decode video. In most container formats, this configuration is stored once somewhere in the header. The decoder normally reads the parameters once before the playback start, and uses them to configure itself. Not so in TS. The nature of TS format is so that it is not a file format, it is a streaming format. It has been designed in the way that you can start playing from any position in the stream. This means that all important information has to be repeated every now and then. This means, when H.264 stream gets packed into the TS format, the SPS/PPS data also has to be regularly repeated.

This is piece of code responsible for this repeating: http://cgit.freedesktop.org/gstreamer/gst-plugins-bad/tree/gst/mpegtsmux/mpegtsmux_h264.c?h=0.11#n232 As you can see, during the normal playback, it would insert the contents of h264_data->cached_es every SPS_PPS_PERIOD seconds. This works perfectly well until you seek. But look how the diff is calculated in the line 234, and how the last_resync_ts is stored in line 241. The GST_BUFFER_TIMESTAMP is as you can imagine the timestamp of the current video sample passing through the tsmux. When we seek backwards, the next time we come into this function, the GST_BUFFER_TIMESTAMP will be much less than last_resync_ts, so the diff will be negative, and thus the SPS/PPS data won’t be repeatedly sent, until we reach the original playback time before the seek.

To fix the bug, one can either use the system time instead of playback time, or reset last_resync_ts during the flush event. Both would be just a one line change in the code.

Now, the careful reader might ask, why could the TS file I’ve recorded using mpegtsmux in the beginning of this adventure be played? The answer is simple. In the beginning of this file (i.e. before I’ve seek), there are H.264 data with repeated SPS/PPS. At some point (when I’ve seek during the recoding), the SPS/PPS stop being sent, and then some seconds later appear again. Because these SPS/PPS data are the same for the whole file, already the first instance of them configures the video decoder properly. On the other hard, during the actual seek of MP4 playback, the video decoder is being flushed, and therefore the SPS/PPS data is being also flushed, and this is the point when the video decoder relies on repeated SPS/PPS in the TS stream to recover, and this is exactly the point when they stop coming from the mpegtsmux.

Four weeks of search. 8 hours a day, 5 days a week. Tons of information read and understood. Dozens of other smaller bugs fixed on the way. Just to find out a single buggy line of code out of 50 millions lines of code in the source folder. A large haystack would contain to my estimate 40 to 80 millions of single hays, making this bug fixing adventure literally equivalent of finding a needle in a haystack.

Leave a comment