Adventures in embedded C land

Preface

I don’t think of myself as a software developer.

My motivation to learn and work in software development was based on the idea of building things. But if you think about it, software alone is not a “thing”; it is just a component of a software product. Even worse, it is an invisible component nobody really cares about, as long as it is not overly buggy.

Software products interact with users through their UI. The UI, together with the use cases supported by the product and its marketing platform, creates a user experience, which IS the software product from the user’s point of view. Note how the software itself is not part of this formula :)

If software is poorly written, it becomes hard to evolve in future versions. Unfortunately, more often than not the responsible managers either don’t care or can’t tell clean, good software from dirty, poor software. So even this internal aspect of a “thing” – its evolvability – is often neglected.

Having understood that software per se is neither a “thing” for the users nor a “thing” for the product owners, I’m moving towards positions that allow me to define what really matters – the UI, usability and the product as a whole. But because I have much more experience as a programmer, I still have – and want – some down-to-earth, in-the-trenches software development work. At least part-time.

Currently I’m discovering embedded programming in C and want to share my impressions.

Horror

I have used various high-level programming languages over the last 19 years, and the last 12 of them exclusively in the high-level area. Of course I used assembler and C for some of my first programs some 20 years ago; I also had some coursework at university to write in C, and in 1999 I created commercial firmware for a telecom device in ASM.

But returning to C after having gained all that experience and knowledge of other languages is something different. C was the third or fourth programming language I learned in my life (after BASIC, ASM and Pascal). I was 15 at the time. At that age, when you learn a new language, you just accept it as it is and only care about how to bend your mind around it to produce a compilable program. This time, having learned the designs of quite different programming languages (such as Smalltalk, C#, JavaScript, etc.), I had the possibility to actually evaluate the C language design and the paradigms behind it.

And my first impression was horror.

C doesn’t have reasonable integer types
This one was perhaps the biggest negative surprise for me. I mean, C is nowadays used mostly to write low-level platform and/or performance-critical stuff. Often, the exact bit size and alignment of your variables is critical. And still, C’s built-in integer types are absolutely unusable. When you write int in C#, you have your 32 bits. Guaranteed, fixed, always the same for any platform from a tiny mobile phone to a powerful Azure server cluster, and this will never ever change. When you write int in C, you’ll get something, depending on your platform. The only thing you know for sure is that its size is greater than or equal to that of char, and less than or equal to that of long int. Well, thanks for nothing!

Because the integer types are so unusable, there are efforts to provide pre-processor directives and headers that try to figure out the current platform’s native bit sizes and #define or typedef useful types. Unfortunately, there are several such efforts, and in real-life, down-to-earth C source code you will see variables declared as int, int32_t and gint32, all meaning the same thing, and used in the very same function. This happens especially often when your software uses several other components, typically open-source ones.
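
For comparison, here is a minimal sketch (assuming a C99-ish toolchain that ships <stdint.h>) printing what the “natural” types actually are on the current platform, next to a fixed-width one:

#include <stdio.h>
#include <stdint.h>   /* C99 fixed-width types; not available on every ancient toolchain */

int main (void)
{
  /* int and long are whatever the platform gives you; int32_t is exactly 32 bits */
  printf("int     : %u bytes\n", (unsigned) sizeof(int));
  printf("long    : %u bytes\n", (unsigned) sizeof(long));
  printf("int32_t : %u bytes\n", (unsigned) sizeof(int32_t));
  return 0;
}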

C doesn’t have byte and bool
This is another WTF moment. C is often used in memory-constrained conditions, so you would expect a very powerful bit and byte manipulation engine. But there is nothing.

Instead of byte, I saw char being used (as well as uint8_t, guint8 and BYTE). Such a byte of course does not support bit manipulation out of the box (which is even worse than some assemblers!), so you have to spend hours trying to figure out which values you have to & and | with some int to get its bits from the 3rd to the 18th.
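
To illustrate, here is a rough sketch of the kind of helper you end up writing yourself (extract_bits is a made-up name, not anything standard):

#include <stdio.h>
#include <stdint.h>

/* Extract bits lo..hi (inclusive, 0-based) from a 32-bit value. */
static uint32_t extract_bits (uint32_t value, unsigned lo, unsigned hi)
{
  unsigned width = hi - lo + 1;
  uint32_t mask = (width < 32) ? ((1u << width) - 1u) : 0xFFFFFFFFu;
  return (value >> lo) & mask;
}

int main (void)
{
  uint32_t reg = 0xDEADBEEFu;

  /* bits 3 to 18 of the register value */
  printf("0x%X\n", (unsigned) extract_bits(reg, 3, 18));
  return 0;
}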

As for bool, it is often #defined to be char or int. Sometimes, boolean types together with the false and true constants are #defined several times per module (one #define sitting in some indirectly included header while another is directly in the file). But this definition of boolean is a quite lousy one, because if and while are happy to accept any integer, so the compiler has no static checking support. You can forget to dereference a pointer to your “bool” variable, and the if will happily accept it as a true value; you won’t even get a warning from the compiler!
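
A minimal sketch of the kind of bug I mean (bool_t and report are made-up names): the missing * compiles cleanly even with -Wall, because a pointer is a perfectly good condition for if.

#include <stdio.h>

typedef int bool_t;            /* a typical project-local "bool" */

static void report (bool_t *flag)
{
  /* Bug: we meant "if (*flag)", but "if (flag)" compiles without a peep, */
  /* because a non-null pointer is just as acceptable to if as an int.    */
  if (flag)
    printf("flag is set\n");
  else
    printf("flag is clear\n");
}

int main (void)
{
  bool_t off = 0;

  report(&off);                /* prints "flag is set" although off == 0 */
  return 0;
}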

Generally, C compile-time checks are lousy
To demonstrate the point, let’s look at this code:

int main (void)
{
  printf("This is a test!\n");
  return 0;
}

Just put it into the file test.c and then execute

gcc -o test test.c
./test

Now, as a naïve ex-C# developer, I would expect the first command to fail with an error telling me that the function printf is undeclared. Sure, gcc links with libc by default, but I forgot to #include <stdio.h>.

Instead! Instead, gcc would tell you the following:

test.c:3:3: warning: incompatible implicit declaration of built-in
function ‘printf’ [enabled by default]

How on earth can this be just a warning, you think. Then you check your working directory and see the executable test there. And then you execute it with the second command, and what does it do? Crash? No, it prints out the given string! Are you amazed?

It turns out that when the compiler detects something that looks like a call to an undeclared function, it just assumes that this function takes an int as its argument and returns an int back.

No, I’m not kidding!

And no, I don’t know why it is an int->int and not float->void, for example. And why the hell this automagic is required at all…

But wait, okay, okay, whoa! We’re passing a string to printf, so at least after assuming its signature is int printf (int), C should print an error and stop, for Lord’s sake? Well, you see, a string is just a char*, and that is a pointer, and a pointer… well, from where C is sitting, a pointer is pretty much just an int.

Sooo, let’s try it out:

int main (void)
{
  int ret = foobar(5);
  return 0;
}

int foobar (int a)
{
  return a + 1;
}

then

gcc -Wall -o test test.c
test.c: In function ‘main’:
test.c:3:3: warning: implicit declaration of function ‘foobar’ 
[-Wimplicit-function-declaration]
test.c:3:7: warning: unused variable ‘ret’ [-Wunused-variable]

So, in C, the arbitrary assumption that undeclared functions are int->int has the same severity level as the detection of an unused variable. If not for the -Wall option (almost the highest warning level of gcc), it wouldn’t even print any warnings at all!
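
For what it’s worth, reasonably recent gcc versions let you promote this particular warning into a hard error, which I would now consider mandatory (I haven’t checked exactly which gcc version introduced the -Werror=<warning> syntax):

gcc -Wall -Werror=implicit-function-declaration -o test test.c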

Let’s now go wild and explore the situation a little further. The following source code could easily occur after / during a slight refactoring of function signatures:

#include <stdio.h>

int main (void)
{
  int i;
  int num_records;
  char* input;

  init();
  input = read_input_data(&num_records);
  for (i = 0; i < num_records; i++)
  {
    process_data(input, i);
  }
  return 0;
}

void init(int security_token)
{
  printf("Initializing with token %d\n", security_token);
}

int read_input_data(char* out_buf, int security_token)
{
  printf("Reading with token %d to buffer %d\n", security_token, out_buf);
  out_buf[0] = 'a';
  out_buf[1] = 'b';
  out_buf[2] = 'c';
  out_buf[3] = '\0';
  return 3;
}

void process_data(const char* buf, char* out_ptr, int num, int security_token)
{
  int i;

  printf("Processing from buffer %d to buffer %d %d items with token %d\n",
  buf, out_ptr, num, security_token);

  for(i = 0; i < num; i++)
  {
    *out_ptr++ = *buf++;
  }
}

When you compile it, gcc will print a lot of warnings but never an error, and produce the executable. When executed, it might print something like this before it crashes:

./test
Initializing with token 134513456
Reading with token 0 to buffer -1076015340
Processing from buffer 3 to buffer 0 11529179 items with token 12862244
Segmentation fault

So, apparently this implicit int->int function declaration is an overly simplistic description of how C handles unknown functions, which of course only increases the number of wonderful cases where you have to fix a sudden segfault. I have no idea where the values of the missing parameters come from, or what awesome security implications might arise from just commenting out some existing function and defining another one with an additional void* parameter, which would allow you to write… where? Onto the stack?

Generally, it is hard to write future-proof code in C
I like the following example:

#include <stdio.h>

typedef struct
{
  int  id;
  char* name;
  int age;
} Employee;

int main (void)
{
  Employee ceo = {0, "Bill Jobs", 55};

  printf("%s is %d\n", ceo.name, ceo.age);
  return 0;
}

It works as expected. Now, let's say, we want to add department to the Employee struct. Piece of cake, right?

#include <stdio.h>

typedef struct
{
  int  id;
  char* name;
  char* department;
  int age;
} Employee;

int main (void)
{
  Employee ceo = {0, "Bill Jobs", 55};

  printf("%s is %d\n", ceo.name, ceo.age);
  printf("%s works in %s\n", ceo.name, ceo.department);
  return 0;
}

The same example implemented in C# would print "Bill Jobs works in ", because department is not initialized and is null. In C, the second printf will segfault, because the department string gets initialized with the CEO's age. There is just one step between perfectly working software and a sudden segfault. Either that, or you have to extend existing structs only by adding new fields at the end.
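
For completeness: if your toolchain supports C99 (which, judging by the code I've seen, is far from a given), designated initializers sidestep this particular trap, because the fields are matched by name instead of by position:

#include <stdio.h>

typedef struct
{
  int  id;
  char* name;
  char* department;
  int age;
} Employee;

int main (void)
{
  /* C99 designated initializers: fields are matched by name, so inserting */
  /* department into the struct does not silently shift the values around. */
  Employee ceo = { .id = 0, .name = "Bill Jobs", .age = 55 };

  printf("%s is %d\n", ceo.name, ceo.age);
  printf("%s works in %s\n", ceo.name, ceo.department ? ceo.department : "(none)");
  return 0;
}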

I made these discoveries within a mere few first weeks of working with C. I'm looking forward to posting even more similar war stories here. But horror was not the only feeling I had. Curiosity and a sudden recognition of how C is designed were great fun for me.

Fun

Modularity concept

In the OOP world, we think in classes. It's a habit. I was never content with the silly tradition of C++ / C# / Java of storing source code in files, because files add avoidable complexity. The best OOP languages, like Smalltalk, don't need files and store the source code in a database. Therefore I was fascinated when I first heard that Microsoft's TFS was going to store source code in a database... Well, TFS turned out to be one of the most disappointing Microsoft products for me, but that's another story.

In C, they think in files, and they really use and need files. A file is a first-class concept in this language. Files are means of modularization. There are two kinds of them - the .c and the .h files.

In a .c file you normally put one or several functions. This is your module. Functions that belong together are stored in the same .c file. Some of them are exposed for use from other modules (.c files), while others are private (in C, you mark such functions with the keyword static).

Now, to call public functions of module A from another module B, you have three options:

1) You can manually declare exported functions in the beginning of the module B.
2) You can manually declare them in a .h file and then #include it in the module B.
3) You can use implicit declaration for your int->int functions as described above.

This makes the .h files roughly an analog of interfaces or public class members in OOP. The difference is that you are not constrained by any formal rules. For example, you can combine the exposed functions of several modules in one .h file, or have different .h files for the same module, or even do all that for code you don't own (which might already be compiled). This is more flexible.

So, generally, when I write a new module, I first write a .h file to define its public interface, #including into it only those .h files that are needed for my function declarations (mostly typedefs of missing built-in types). Then I write the .c file: I #include the corresponding .h file to forward-declare the public functions, forward-declare the private functions, and then #include the .h files of all the other modules I need to implement mine.
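
As a concrete (and entirely made-up) illustration, a tiny module could look like this: counter.h exposes the public interface, counter.c implements it and keeps its private helper and state static:

/* counter.h -- public interface of the (hypothetical) counter module */
#ifndef COUNTER_H
#define COUNTER_H

void counter_reset (void);
void counter_increment (void);
int  counter_get (void);

#endif

/* counter.c -- implementation of the module */
#include "counter.h"

static int count;                /* private module state, invisible outside */

static int clamp (int value)     /* private helper, also invisible outside */
{
  return (value > 1000) ? 1000 : value;
}

void counter_reset (void)     { count = 0; }
void counter_increment (void) { count = clamp(count + 1); }
int  counter_get (void)       { return count; }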

This is radically different from how I did it in C++ ten years ago, where I used to #include every possible header file into every other file, because I was pissed off by this manual file management and wanted to think in terms of classes and interfaces only, considering files to be one (and the worst) of many possible source code storage backends.

Program for the compiler
When programming in modern languages, you have two very distinct modes: compile-time and run-time. My comeback to C has made me think of any program as a double program: one for the compiler (executed at compile time), and another one for run time.

In the modern languages, the compile-time program is purely declarative without side effects. In C#, if you write

public class Point
{
  public int X;
  public int Y;
}

this is in essence a declarative instruction to the C# compiler to create a new class with the given members. It is declarative because it doesn't matter whether this class appears in the source code before or after some other class, and the order of its members doesn't matter either. The declarative compile-time style of modern languages makes it easy to reason about, so you can devote more of your focus to the run-time behavior.

Not in C. There, the best way to think about a program is as a DNA double helix where procedural compile-time instructions are intertwined with procedural run-time instructions:

int open (void);
int write_data (int fd);

int main (void)
{
  int fd;

  fd = open();
  write_data(fd);
}

Reading this source code as a compile-time program: first there is a command to put "open" and "write_data" with the corresponding signatures into the name table, then a command to put "main" into the name table, then a command to start emitting the compiled code of "main", then a command to put "fd" into the local, scoped name table, then a command to compile a function call to "open" and add it to the object code of "main", and so on.

Thinking about it this way makes it easier to grasp the behavior of the language, especially when you start using macros (and in C, you have to). It also quite naturally explains the need for forward function declarations and the importance of struct member order.
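
A small, contrived illustration of that procedural compile-time program: the preprocessor part executes strictly top to bottom, so redefining a macro halfway through the file changes only the expansions that come after it:

#include <stdio.h>

#define GREETING "hello"     /* compile-time command: bind GREETING to "hello" */

static void a (void) { printf(GREETING "\n"); }   /* expands to "hello" */

#undef GREETING
#define GREETING "goodbye"   /* compile-time command: rebind GREETING */

static void b (void) { printf(GREETING "\n"); }   /* expands to "goodbye" */

int main (void)
{
  a();   /* prints hello   */
  b();   /* prints goodbye */
  return 0;
}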

In a sense it reminds me of how Smalltalk (also an ancient language) works; there, new classes or methods are also defined by calling a procedural method. The difference is that in Smalltalk the compile-time syntax is almost the same as the run-time syntax, so you don't have to learn two languages instead of one.

C with classes: GLib
C++ is often called "C with classes", but it is not the only option: pure C has its own OOP implementation. In GLib they've managed to do it without modifying the programming language itself. And boy, is it a funny hack, I must say.

Consider the following code:

typedef struct
{
  int  id;
  char* name;
} BaseObject;

typedef struct
{
  BaseObject parent;
  int age;
  int department_id;
} Employee;

Employee*  ceo;

They use the fact that, according to the C standard, the fields of the parent member are stored in declaration order at the beginning of the Employee structure, so its memory representation looks like this:

struct Employee_in_memory
{
  int  id;
  char* name;
  int age;
  int department_id;
}

This allows you to cast a pointer to Employee to a pointer to BaseObject, because, hey, its fields are right there at the beginning of the memory. This enables polymorphism like this:

typedef struct
{
  BaseObject parent;
  int num_employees;
} Department;

#define MAX_REPOSITORY_OBJECT_COUNT 1000
BaseObject* repository[MAX_REPOSITORY_OBJECT_COUNT];

int read_repository(void)
{
  int stream;
  int i = 0;

  stream = open();
  while(!is_eof(stream))
  {
    switch(get_next_type(stream))
    {
      case EMPLOYEE:
      {
        Employee* emp = deserialize_employee(stream);
        repository[i] = (BaseObject*) emp;
        break;
      }
      case DEPARTMENT:
      {
        Department* dept = deserialize_department(stream);
        repository[i] = (BaseObject*) dept;
        break;
      }
    }
    i++;
  }
  return i;
}

void dump_repository(int top)
{
  int i;
   
  for(i = 0; i < top; i++)
  {
    printf("%d,%s\n", repository[i]->id, repository[i]->name);
  }
}

As for methods, you could just add function pointers to the structs, but that would mean copying them with each object instance, which is a big waste of memory (at least according to the C ideology); besides, it would allow you to have different methods per instance of the same class, which is normal for languages such as JavaScript, but just too weird for conservative C. Therefore, in GLib's very base object, GObject, they put a pointer to another structure, the class structure, which holds the function pointers of the class methods. This has the added benefit of run-time reflection, because the class structure has a couple of fields allowing you to read the class name, query for properties and so on.
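
Here is a rough sketch of the mechanism (this is not actual GObject code, just the bare idea with made-up names): every instance carries a single pointer to a shared per-class structure holding the function pointers, so the "methods" exist once per class rather than once per object:

#include <stdio.h>

typedef struct
{
  const char* class_name;                 /* poor man's reflection        */
  void (*describe) (void* self);          /* "virtual method" slot        */
} BaseObjectClass;

typedef struct
{
  BaseObjectClass* klass;                 /* one extra pointer per instance */
  int id;
} BaseObject;

typedef struct
{
  BaseObject parent;
  int age;
} Employee;

static void employee_describe (void* self)
{
  Employee* emp = (Employee*) self;
  printf("Employee %d, age %d\n", emp->parent.id, emp->age);
}

static BaseObjectClass employee_class = { "Employee", employee_describe };

int main (void)
{
  Employee ceo;

  ceo.parent.klass = &employee_class;
  ceo.parent.id = 1;
  ceo.age = 55;

  /* dispatch through the class structure, treating the object as a BaseObject* */
  ((BaseObject*) &ceo)->klass->describe(&ceo);
  return 0;
}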

But, because of this, as soon as you have any non-trivial hierarchy with virtual methods and such, it becomes too complex to use directly in plain C (because of the constant casting and use of obj->parent and obj->class), so GLib has added a lot of #defines hiding this complexity (but also preventing an easy understanding of what exactly is happening). Here is a good example of what simple OOP code with GLib looks like. All in all it feels like programming in C++ with all its black-box covers removed. But that is yet another story.

2 responses to “Adventures in embedded C land”

  1. Sander Saares

    Good reading! I’m orienting towards cross- and/or multiplatform software and also getting back to the C and C++ world these days. Lots of bad childhood memories coming back to me ;)

What you write about here seems to be the same as what I remember about C from 5–10 years back. Is there such a thing as “modern C”, or is all the progress happening in C++, leaving the C world stuck in time as a legacy platform?

  2. Anonymous

    Thanks!

Heh, this formula “C and C++”, or its more often used variant “C/C++”… I consider it a smart marketing trick of C++ advocates. The user experience of C++ coding is as different from that of C coding as Objective-C’s is… And speaking of progress, I don’t believe any amount of progress can help C++. In my opinion, its syntax is DOA. Whenever I have the choice, I choose C over C++ (as well as Ruby over Python and C# over Java).

I don’t really consider myself an expert in C. Theoretically, there are the newer standards C99 and C11, but both look extremely cosmetic to my eyes. The former introduces a boolean type (but ifs and whiles still happily accept any other type) and guaranteed-bit-size numbers; the latter doesn’t even seem that useful at all. In any case, working with a lot of open-source code, I can’t see that either C99 or C11 enjoys distinct popularity among developers, so their existence can be safely ignored for legacy and open-source projects.

I do pretty much like glib and gobject, as well as their containers and data structure types. This is the only structured approach to designing a modern software development foundation and naming conventions for C that I know of. But I think one can really appreciate glib only after having tried (and failed) to develop in pure C or with some random libraries, because from a C# developer’s point of view, the glib usability and feature set is still inferior to .NET.
