Date: 2024-12-19
If you program in C, strings are just in your imagination. What you really have is a character pointer, and we all agree that a string is every character from that point up until one of the characters is zero. While that’s simple and useful, it is also the source of many errors, such as writing a 32-byte string to a 16-byte array or failing to terminate a string with a zero byte. [Thasso] has been experimenting with a different way to represent strings that is still fairly simple but helps keep things straight.
Like many other languages, this setup uses counted strings and string buffers. You can read and write to a string buffer, but strings are read-only. In either case, there is a length for the contents and, in the case of the buffer, a length for the entire buffer.
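To make the idea concrete, here is a minimal sketch of what such a pair of types might look like. The names (str, str_buf, STR, buf_append, buf_view, str_equal) and the choice of ptrdiff_t for sz are our own assumptions for illustration, not necessarily what [Thasso] or [Wellons] use.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t sz;   /* assumption: a signed size type */

/* Read-only counted string: pointer plus length, no terminator needed. */
typedef struct {
    const char *dat;
    sz len;
} str;

/* Writable buffer: length of the contents plus total capacity. */
typedef struct {
    char *dat;
    sz len;   /* bytes currently in use */
    sz cap;   /* total bytes available at dat */
} str_buf;

/* Wrap a string literal; sizeof counts the trailing zero, so subtract it. */
#define STR(lit) ((str){ (lit), (sz)sizeof(lit) - 1 })

/* Append s to b. Returns 1 on success, 0 if s does not fit (b is untouched). */
static int buf_append(str_buf *b, str s)
{
    assert(b->len >= 0 && b->len <= b->cap);
    if (s.len > b->cap - b->len) {
        return 0;
    }
    memcpy(b->dat + b->len, s.dat, (size_t)s.len);
    b->len += s.len;
    return 1;
}

/* A read-only view of the buffer's current contents. */
static str buf_view(const str_buf *b)
{
    return (str){ b->dat, b->len };
}

/* Compare two counted strings byte for byte. */
static int str_equal(str a, str b)
{
    return a.len == b.len &&
           (a.len == 0 || memcmp(a.dat, b.dat, (size_t)a.len) == 0);
}
```

Because every append checks the remaining capacity first, the 32-bytes-into-16 mistake turns into a failed call instead of a buffer overrun.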
We’ve seen schemes like this before and [Thasso] borrowed the idea from [Chris Wellons]. The real issue, of course, is that you now have to rewrite or wrap any “normal” C functions you have that take or return strings. We’ve also seen this done where the length is stored ahead of the string so you don’t have a field for the character pointer:
struct str { sz len; char dat[0]; };
Even though the array in the declaration has zero length, the memory actually allocated for the structure can be larger, with the characters stored directly after the length.
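As a rough sketch of how that allocation could work, the helper below reserves the header and the characters in one block. The str_new function and the ptrdiff_t definition of sz are assumptions of ours, and we use the C99 flexible array member spelling (dat[]) rather than the older dat[0] form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef ptrdiff_t sz;   /* assumption: sz is not defined in the snippet above */

struct str {
    sz len;
    char dat[];   /* C99 flexible array member; dat[0] is the older GNU spelling */
};

/* Allocate the header and len characters in a single block.
   Returns NULL on allocation failure; release the result with free(). */
static struct str *str_new(const char *src, sz len)
{
    assert(len >= 0);
    struct str *s = malloc(sizeof *s + (size_t)len);
    if (s == NULL) {
        return NULL;
    }
    s->len = len;
    if (len > 0) {
        memcpy(s->dat, src, (size_t)len);
    }
    return s;
}
```

The length and the characters end up next to each other in memory, at the cost that such a string cannot point into the middle of an existing buffer without copying.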
If you are worried about efficiency, [Thasso] and [Wellons] both point out that modern compilers are good at handling small structures, so keeping the struct small by not putting the data directly into it may actually be an advantage. If you need characters larger than one byte, the [Wellons] post has some thoughts on that, too.
This is all old hat in C++, of course. No matter how you encode your strings, you should probably avoid the naughty ones. Passwords, too.
4 thoughts on “Better C Strings, Simply”
There are better solutions already, and C++ has one when using new(): the size of the array is stored “4 bytes left of the pointer”. This works because, when you free memory, your pointer may point anywhere into the block you want to release; it need not point to the beginning.
The proposed parent structure, which carries the metadata, is a performance killer for CPU caches. String and metadata might be far away from one another, leading to cache misses. Best to keep them close to each other.
(There is at least one typo in the 5 line code sample)
Historically, Pascal has stored a string’s length in the first byte of the array. Doing something similar in C (or C++ for that matter) would also be reasonable. Perhaps using four bytes and accepting a max limit of 4 billion characters is a reasonable tradeoff.
Or storing it as a size_t or equivalent; it follows the integer size of the architecture.
Side note: str and the first two elements of str_buf are more or less the same as for struct iovec / iovec_t (UN*X scatter/gather arrays).
So, I’d add the constraint that the size and cap parameters should be of type size_t or the equivalent on the platform used.
