Media Player Framework Guidelines

The following recommendations are based on my experience of creating two media player implementations. One was based on the MediaElement in Silverlight and provided out-of-the-box integration with SilverHD DRM, as well as some smart transport channels and performance tweaks. The other was based on GStreamer in embedded software and featured tight coupling with the HW decoder and support for 19 different pipelines. Surprisingly, there are design commonalities between these media players.

1. If your player is going to support playlists, do not define the playlist format. Typically, playlists come from external sources, and you cannot change their format. It is better to write an ad-hoc playlist parser for each playlist type needed and provide the parsed playlist to the media player in the form of an object tree. In other words, you want to exclude playlist parsing from the media player framework. Playlist formats are often so weird and non-standard that there is no hope the playlist parsing code will ever be reusable.
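
As an illustration only, here is a minimal sketch of such an ad-hoc parser for one concrete playlist type (a simple M3U file, chosen purely as an example; the function name and the plain-dict entries are assumptions, not part of any framework). The point is that this code lives next to the playlist source, not inside the player.

    # Ad-hoc parser for one concrete playlist type (simple M3U), kept outside
    # the player framework. The framework only ever sees the resulting entries,
    # never the raw playlist text.
    def parse_simple_m3u(text):
        entries = []
        title = None
        for line in text.splitlines():
            line = line.strip()
            if not line:
                continue
            if line.startswith("#EXTINF:"):
                # "#EXTINF:<duration>,<title>" -- keep only the title part
                title = line.split(",", 1)[-1]
            elif not line.startswith("#"):
                entries.append({"uri": line, "title": title})
                title = None
        return entries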

2. Playlists consist of playlist items (if your player does not support playlists, you can still think of it as playing a single playlist item). Do not expect a playlist item to be just a URL or file path. In general, it is not possible to reliably detect the format of the media, the transport protocol and many other parameters just by parsing a URL. Create a full-featured object describing the playlist item (sketched below, after the list), for example with the following properties:

  • Transport (file, progressive download, RTSP, Smooth Streaming, HLS, HTTP streaming, etc.)
  • Container format (MP4, ASF, fMP4, AVI, etc.)
  • Video codec
  • Audio codec(s) per audio stream
  • DRM settings
  • Additional parameters needed for your player to work, e.g. an MPEG-2 TS program number or a live streaming flag.
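
A sketch of such a descriptor, assuming Python dataclasses; the field names and defaults are illustrative, not a fixed API:

    from dataclasses import dataclass, field
    from typing import Optional

    @dataclass
    class PlaylistItem:
        """Full description of one item to be played, built by the playlist parser."""
        uri: str
        transport: str = "file"            # "file", "progressive", "rtsp", "smooth", "hls", ...
        container: Optional[str] = None    # "mp4", "asf", "fmp4", "avi", ...
        video_codec: Optional[str] = None
        audio_codecs: list = field(default_factory=list)   # one entry per audio stream
        drm: Optional[dict] = None         # DRM settings, left to the DRM subsystem
        extra: dict = field(default_factory=dict)          # e.g. {"ts_program": 3, "is_live": True}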

3. Separating the player state from the playlist item seems like the cleaner design, but surprisingly I was getting better source code by combining them. Therefore, I add the following fields to the object described above (see the extended sketch after the list):

  • Play state
  • Play position
  • Media duration as detected by the player
  • Media size in bytes as detected by the player
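
Continuing the hypothetical sketch from point 2, these runtime fields simply go into the same PlaylistItem dataclass (only the new fields are shown; PlayState is the enum sketched under the next point):

    @dataclass
    class PlaylistItem:
        # ... the static description fields from the previous sketch, plus:
        play_state: "PlayState" = None     # see the PlayState sketch under point 4
        position_ns: int = 0               # current play position, nanoseconds
        position_pct: float = 0.0          # the same position as a percentage (point 5)
        duration_ns: int = 0               # media duration, once detected by the player
        size_bytes: int = 0                # media size in bytes, once detected by the player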

4. Speaking of the play state, two questions are the most important. What is the difference between STOPPED and PAUSED? The difference is that in STOPPED you must destroy the playing pipeline, release all the memory and reset the DRM state; basically, revert to the state before playback. Do we need a separate FINISHED state? Yes, we do: many apps rely on the player’s ability to detect that the media item has been watched to the end (and not stopped by user interaction).
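
A minimal state enum reflecting this might look as follows; whether you need further states (e.g. BUFFERING or ERROR) is an assumption that depends on your pipeline:

    from enum import Enum, auto

    class PlayState(Enum):
        STOPPED = auto()    # pipeline destroyed, memory released, DRM state reset
        PAUSED = auto()     # pipeline kept alive, ready to resume instantly
        PLAYING = auto()
        FINISHED = auto()   # the media was watched to the end, not stopped by the user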

5. Play position: I inevitably end up expressing it both in time units and as a percentage. There are situations when you can only seek in time units but not by percentage, and other situations when you can seek only by percentage but not in time units. Most of the time, when displaying the play position as time, you also need to know it as a percentage.
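
One way to keep the two representations consistent is to store the position in time units and derive the percentage from the duration whenever it is known; a small sketch, assuming the nanosecond fields from the sketch above:

    def position_percent(position_ns, duration_ns):
        """Derive the percentage form of the play position, guarding against the
        period before the duration has been detected (or live streams)."""
        if not duration_ns or duration_ns <= 0:
            return None            # percentage simply not known yet
        return min(100.0, 100.0 * position_ns / duration_ns)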

6. Another thing I inevitably end up creating is a repeating timer that fetches the current play position, say, 10 times per second and fires all the corresponding events to allow the UI to update itself (see the sketch after the next point). You might be wondering why this separate polling is needed at all, bearing in mind there is a separate decoding thread anyway, and it knows exactly the PTS of the video frame it is going to display, so it could check whether the seconds portion of the new PTS differs from that of the previous PTS and fire all the necessary events (preferably by posting them onto the main loop of the UI thread). But no, the reality is different.

7. Media duration could also be detected by the player synchronously. After all, most formats carry it near the beginning of the file, and the demuxer has to parse that header anyway, so nothing would prevent it from posting a corresponding event to the UI thread. Again, the reality is different, so I always end up implementing it in the same repeating timer routine I’ve mentioned in the previous point: check a flag indicating whether the duration has been determined, and if not, ask the pipeline.
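
A sketch of that polling routine for the GStreamer case, driven from a GLib main loop; the emit_* callbacks and the item object are placeholders for whatever event mechanism and state object your framework uses:

    import gi
    gi.require_version("Gst", "1.0")
    from gi.repository import Gst, GLib

    POLL_INTERVAL_MS = 100   # roughly 10 times per second

    def poll_position(pipeline, item, emit_position, emit_duration):
        """Periodic tick: ask the pipeline for the position (and for the duration
        until it is known) and fire UI events. Returns True to keep the timer alive."""
        ok, pos = pipeline.query_position(Gst.Format.TIME)
        if ok:
            item.position_ns = pos
            emit_position(item)

        # Point 7: keep asking for the duration until the pipeline finally knows it.
        if not item.duration_ns:
            ok, dur = pipeline.query_duration(Gst.Format.TIME)
            if ok and dur > 0:
                item.duration_ns = dur
                emit_duration(item)
        return True

    # timer_id = GLib.timeout_add(POLL_INTERVAL_MS, poll_position,
    #                             pipeline, item, on_position, on_duration)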

8. Another thing that might be controversial from a clean design point of view, but worked for me, is the handling of pipeline asynchrony. You can’t change the play state synchronously. Therefore, you can’t issue a command to the pipeline to change its state and then immediately write the new state to your playlist item: for a couple of milliseconds (or even a couple of seconds when seeking), this state would be wrong, which leads to unpleasant race conditions. Following a theoretically clean design, for each call of the pause(), play() or seek() method of your player you would create a job object describing the change, add it to a queue, and have another thread execute the queue, waiting for every job to finish before starting the next one. This is complex.

What worked for me is cloning the playlist item object. Basically, the player always keeps two copies of the playlist item: the actual state and the desired state. When the player is instantiated and a playlist item is passed to it to be played, it sets it as the desired state, sets a copy of it as the actual state (resetting the play state to STOPPED), and then initiates playback. All subsequent events coming from the pipeline (current play position, play state, duration, etc.) are applied to the actual state. When the user calls pause(), the play state of the desired playlist item object is set to PAUSED immediately, and then pausing is initiated. When the pipeline actually pauses, the play state of the actual playlist item is updated.
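
A sketch of that desired/actual split, reusing the hypothetical PlaylistItem and PlayState from the earlier sketches; the pipeline control methods are placeholders:

    import copy

    class Player:
        def __init__(self, item):
            self.desired = item                       # what the caller has asked for
            self.actual = copy.deepcopy(item)         # what the pipeline is really doing
            self.actual.play_state = PlayState.STOPPED
            self._start_pipeline(self.desired)

        def pause(self):
            # Record the intent immediately; the actual state lags behind the pipeline.
            self.desired.play_state = PlayState.PAUSED
            self._pipeline_set_state_async(PlayState.PAUSED)

        def on_pipeline_state_changed(self, new_state):
            # Called from pipeline events: only now does the actual state move.
            self.actual.play_state = new_state

        def _start_pipeline(self, item):
            ...   # placeholder: build and start the pipeline for this item

        def _pipeline_set_state_async(self, state):
            ...   # placeholder: issue the asynchronous state change to the pipeline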

9. Generally, you should expect to be forced to write a lot of workarounds. I don’t quite understand why it is always needed, but it is just a fact of life. In Silverlight, you would get some wrong play states reported, you wouldn’t get some events you’d expect to get, you can easily overload the pipeline, and you cannot do a seek and wait on an HTTP response at the same time (a deadlock). With GStreamer, you wouldn’t get NEW SEGMENT events after a seek, it would report an error when seeking in FLV but still perform the seek, and it wouldn’t post the state change from READY to NULL onto the bus.

10. Never ever think you can correlate a byte offset in the file with its time position. You cannot divide the byte offset by the bitrate to get the time. There is no codec (short of turning compression off) capable of holding an exact bitrate. There are VBR encodings. There are encodings with unused streams or ID tags embedded in them (ever seen an MP3 file with 1 MB of cover image embedded in it?). This is especially true if your player is used to play long content, not just clips of a couple of seconds.

11. You will need a lot of test content in various formats, quality levels, bitrates and DRM protection levels. And you will need this test content stored locally, on a web server, and on a DLNA server or any other server that streams over the protocols you’re going to support. Preparing this test suite and configuring all the needed software is a huge amount of work, so you probably want to find somebody (an intern, a tester, a product manager, an admin, etc.) who would be willing to manage it for you.

12. I have always dreamed of creating the YouTube-like “immediate drag-and-drop seek” user experience, but it never came through. Maybe the media platforms I was using were really that limited, or maybe I just haven’t tried hard enough. In any case, do not expect your pipeline to handle it for you automatically.
