Better C Strings, Simply

If you program in C, strings are just in your imagination. What you really have is a character pointer, and we all agree that a string is every character from that point up until one of the characters is zero. While that’s simple and useful, it is also the source of many errors. For example, you might write a 32-byte string to a 16-byte array or fail to terminate a string with a zero byte. [Thasso] has been experimenting with a different way to represent strings that is still fairly simple but helps keep things straight.

Like many other languages, this setup uses counted strings and string buffers. You can read and write to a string buffer, but strings are read-only. In either case, there is a length for the contents and, in the case of the buffer, a length for the entire buffer.
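As a rough sketch of that layout, a read-only counted string and a writable buffer might look like this in C. The type, field, and function names here are assumptions made for illustration and are not necessarily the ones [Thasso] uses.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical names; the original post may use different ones. */
typedef struct {
    const char *dat; /* read-only bytes, not necessarily zero-terminated */
    size_t len;      /* number of valid bytes in dat */
} str;

typedef struct {
    char *dat;  /* writable storage */
    size_t len; /* bytes currently in use */
    size_t cap; /* total bytes available in dat */
} str_buf;

/* Wrap a zero-terminated C string as a counted string (no copy). */
static str str_from_cstr(const char *s)
{
    assert(s != NULL);
    str r = { s, strlen(s) };
    return r;
}

/* Append s to the buffer. Returns 1 on success, or 0 if s does not fit,
   in which case the buffer is left unchanged. */
static int str_buf_append(str_buf *b, str s)
{
    assert(b != NULL && b->len <= b->cap);
    if (s.len > b->cap - b->len)
        return 0;
    if (s.len > 0)
        memcpy(b->dat + b->len, s.dat, s.len);
    b->len += s.len;
    return 1;
}

/* View the buffer's current contents as a read-only string. */
static str str_buf_view(const str_buf *b)
{
    assert(b != NULL);
    str r = { b->dat, b->len };
    return r;
}
```

With a 16-byte buffer, appending a 32-byte string through str_buf_append returns 0 instead of running past the end of the array, because the capacity travels with the buffer.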

We’ve seen schemes like this before and [Thasso] borrowed the idea from [Chris Wellons]. The real issue, of course, is that you now have to rewrite or wrap any “normal” C functions you have that take or return strings. We’ve also seen this done where the length is stored ahead of the string so you don’t have a field for the character pointer:


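#include <stddef.h>

typedef ptrdiff_t sz; /* assumption: the excerpt does not show how sz is defined */

struct str
{
    sz len;     /* number of bytes stored in dat */
    char dat[]; /* flexible array member: the bytes follow len directly */
};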

Even though the structure as declared reserves no room for the characters, the actual allocation can be larger: you allocate the struct plus as many bytes as the string needs.
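To make that concrete, here is a minimal sketch of allocating such a length-prefixed string in one block. The helper name str_new is hypothetical, and size_t stands in for the sz type, whose definition is an assumption either way.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

struct str {
    size_t len; /* stand-in for the sz type used in the text (assumption) */
    char dat[]; /* C99 flexible array member; dat[0] is the older GNU spelling */
};

/* Hypothetical helper: allocate the header and n data bytes in one block
   and copy n bytes from src. Returns NULL on failure; release with free(). */
static struct str *str_new(const char *src, size_t n)
{
    assert(src != NULL || n == 0);
    if (n > SIZE_MAX - sizeof(struct str))
        return NULL; /* the size computation would overflow */
    struct str *s = malloc(sizeof *s + n);
    if (s == NULL)
        return NULL;
    s->len = n;
    if (n > 0)
        memcpy(s->dat, src, n);
    return s;
}
```

Because the length and the characters share one allocation, a single free() releases both, and there is no separate pointer field to follow.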

If you are worried about efficiency, [Thasso] and [Wellons] both point out that modern compilers are good at handling small structures, which may be an advantage of keeping the data out of the struct: a pointer and a length stay small no matter how long the string is. If you need characters larger than one byte, the [Wellons] post has some thoughts on that, too.
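A small pointer-plus-length struct can be passed and returned by value, which is what makes helpers like the following cheap to write. This is only a sketch; the STR macro, str_eq, and str_slice are hypothetical names chosen for illustration, not functions from either post.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *dat; /* read-only bytes, not necessarily zero-terminated */
    size_t len;      /* number of bytes */
} str;

/* Build a str from a string literal; sizeof gives the length at compile
   time, minus one for the terminating zero. */
#define STR(lit) ((str){ (lit), sizeof(lit) - 1 })

/* Compare two counted strings for equality. */
static int str_eq(str a, str b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.dat, b.dat, a.len) == 0);
}

/* Return the substring [start, end), clamped to the bounds of s.
   Nothing is copied or allocated; the result points into s. */
static str str_slice(str s, size_t start, size_t end)
{
    assert(s.dat != NULL);
    if (end > s.len)
        end = s.len;
    if (start > end)
        start = end;
    str r = { s.dat + start, end - start };
    return r;
}
```

Taking a substring this way needs no copy and no terminator, which a zero-terminated string cannot offer unless the substring happens to run to the end.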

This is all old hat in C++, of course. No matter how you encode your strings, you should probably avoid the naughty ones. Passwords, too.

13 thoughts on “Better C Strings, Simply”

  1. There are better solutions already, and C++ has one when using new(): the size of the array is stored “4 bytes left of the pointer”. This works because, when you free memory, your pointer may point anywhere into the block you want to release; it need not point to the beginning.

    The proposed parent structure, that carries the meta data, is a performance killer for cpu caches. String and meta data might be far away from one another, leading to cache misses. Best to keep them close to each other.

    (There is at least one typo in the 5 line code sample)

  2. Historically, Pascal has stored a string’s length in the first byte of the array. Doing something similar in C (or C++ for that matter) would also be reasonable. Perhaps using four bytes and accepting a max limit of 4 billion characters is a reasonable tradeoff.

      1. I’ve never understood these.

        Why can’t you be specific? If I need a 16-bit unsigned integer: uint16_t

        Why does it have to be decided by the architecture? That just makes it non-standard (if you want to transfer raw data between systems) and a guessing game for the developer.

        Plain horrible way to be non-standard, obfuscate and increase risk of bugs.

        1. It’s decided by architecture as the processor architecture (16-bit, 32-bit, 64-bit etc.) decides the default integer size and, more importantly, the address size.
          If my memory serves me right, a size_t needs to be able to fit the address size in the architecture.

      1. More code wants to access the string data, which is now potentially one extra cache miss away. And remember, getting data from RAM instead of cache takes more than 100 cycles.
        Modern CPUs also have specialized vector instructions to get string lengths, so this is way less taxing than you would think.

  3. Side note: str and the first two elements of str_buf are more or less the same as struct iovec/iovec_t (UN*X scatter/gather arrays).
    So, I’d add the constraint that the size and cap parameters should be of type size_t or the equivalent on the platform used.
