If you program in C, strings are just in your imagination. What you really have is a character pointer, and we all agree that a string is every character from that point up until one of the characters is zero. While that’s simple and useful, it is also the source of many errors, such as writing a 32-byte string to a 16-byte array or failing to terminate a string with a zero byte. [Thasso] has been experimenting with a different way to represent strings that is still fairly simple but helps keep things straight.
Like many other languages, this setup uses counted strings and string buffers. You can read and write to a string buffer, but strings are read-only. In either case, there is a length for the contents and, in the case of the buffer, a length for the entire buffer.
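To make the distinction concrete, here is a minimal sketch of what such a pair of types might look like. The field names dat, len, and cap, the ptrdiff_t definition of sz, and the helpers str_from_cstr, buf_append, and buf_str are assumptions made for illustration, not necessarily what [Thasso] wrote.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef ptrdiff_t sz;   /* assumption: a signed size type */

/* Read-only counted string: a pointer plus a length, no terminator needed. */
typedef struct {
    const char *dat;
    sz len;
} str;

/* Writable buffer: length of the contents plus total capacity. */
typedef struct {
    char *dat;
    sz len;
    sz cap;
} str_buf;

/* Wrap an ordinary NUL-terminated C string (hypothetical helper). */
static str str_from_cstr(const char *s)
{
    assert(s != NULL);
    str r = { s, (sz)strlen(s) };
    return r;
}

/* Append s to b. Returns 1 on success, or 0 if s would not fit,
   in which case the buffer is left unchanged. */
static int buf_append(str_buf *b, str s)
{
    if (s.len > b->cap - b->len)
        return 0;
    if (s.len > 0)
        memcpy(b->dat + b->len, s.dat, (size_t)s.len);
    b->len += s.len;
    return 1;
}

/* View the buffer's current contents as a read-only string. */
static str buf_str(const str_buf *b)
{
    str r = { b->dat, b->len };
    return r;
}
```

Because the capacity travels with the buffer, an oversized append is refused instead of overrunning the array. A counted string can still be handed to standard functions without a terminator, for example with printf("%.*s", (int)v.len, v.dat).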
We’ve seen schemes like this before and [Thasso] borrowed the idea from [Chris Wellons]. The real issue, of course, is that you now have to rewrite or wrap any “normal” C functions you have that take or return strings. We’ve also seen this done where the length is stored ahead of the string so you don’t have a field for the character pointer:
struct str { sz len; char dat[0]; };
Even though the array in the prototypical structure is declared with zero length, the actual allocation can be larger, with the characters stored immediately after the length.
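As a rough sketch of how that works in practice, the header and the characters can be allocated as a single block. The helper str_new and the ptrdiff_t definition of sz are assumptions for illustration. This version uses the standard C99 flexible array member spelling, dat[], where dat[0] is the older GNU extension.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

typedef ptrdiff_t sz;   /* assumption: a signed size type */

/* The length is stored directly ahead of the characters. */
struct str {
    sz len;
    char dat[];         /* flexible array member: no storage of its own */
};

/* Allocate the header and len characters in one block (hypothetical helper).
   Returns NULL if allocation fails. The caller releases it with free(). */
static struct str *str_new(const char *src, sz len)
{
    assert(len >= 0);
    struct str *s = malloc(sizeof *s + (size_t)len);
    if (s == NULL)
        return NULL;
    s->len = len;
    if (len > 0)
        memcpy(s->dat, src, (size_t)len);
    return s;
}
```

The length and the data end up adjacent in memory, at the cost that the string can no longer point into someone else's buffer.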
If you are worried about efficiency, [Thasso] and [Wellons] both point out that modern compilers are good at handling small structures, so that may be an advantage of not putting the data directly into the struct. If you need characters larger than one byte, the [Wellons] post has some thoughts on that, too.
This is all old hat in C++, of course. No matter how you encode your strings, you should probably avoid the naughty ones. Passwords, too.
There are better solutions already, and C++ has them when using new(): the size of the array is stored “4 bytes left of the pointer”. This works because when you free memory your pointer may point anywhere into the block you want to release; it need not point to the beginning.
The proposed parent structure that carries the metadata is a performance killer for CPU caches. String and metadata might be far away from one another, leading to cache misses. Best to keep them close to each other.
(There is at least one typo in the 5 line code sample)
Yeah, this is re-inventing std::string and std::string_view but worse in every way.
Historically, Pascal has stored a string’s length in the first byte of the array. Doing something similar in C (or C++ for that matter) would also be reasonable. Perhaps using four bytes and accepting a max limit of 4 billion characters is a reasonable tradeoff.
Or storing it as a size_t or equivalent, it follows the integer size of the architecture.
I’ve never understood these.
Why can’t you be specific: I need a 16-bit-long signed integer: uint16_t
Why does it have to be decided by the architecture? That just makes it nonstandard (if you want to transfer raw data between systems) and is just a guessing game for the developer.
A plain horrible way to be nonstandard, obfuscate, and increase the risk of bugs.
Correction: UNsigned integer.
Because integer promotion rules make working with smaller sized ints more confusing and higher bug risk.
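A small sketch of the kind of surprise being described, assuming a platform where int is wider than 16 bits (true of most desktop and 32-bit embedded targets). The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Both operands are promoted to int before the addition, so with a
   32-bit int the sum does not wrap at 65536. */
static long sum_promoted(uint16_t a, uint16_t b)
{
    return a + b;
}

/* Converting the result back to 16 bits is what makes it wrap. */
static uint16_t sum_truncated(uint16_t a, uint16_t b)
{
    return (uint16_t)(a + b);
}
```

With a = 65535 and b = 1, the first function yields 65536 and the second yields 0. Multiplication is worse: with a 32-bit int, 65535 * 65535 on two uint16_t values overflows a signed int, which is undefined behavior even though both operands are unsigned.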
It’s decided by architecture as the processor architecture (16-bit, 32-bit, 64-bit etc.) decides the default integer size and, more importantly, the address size.
If my memory serves me right, a size_t needs to be able to fit the address size in the architecture.
Depending on the application, it can work out faster, too; lots of code wants to know how long a string is and this makes that O(1) instead of O(n).
More code wants to access the string data, which is now potentially one extra cache miss away. And remember, getting data from RAM instead of cache takes more than 100 cycles.
Modern CPUs also have specialized vector instructions to get string lengths, so this is way less taxing than you would think.
Side note: str and the first two elements of str_buf are more or less the same as for struct iovec / iovec_t (UN*X scatter/gather arrays).
So, I’d add the constraint that the size and cap parameters should be of type size_t or the equivalent on the platform used.
An even better solution: don’t use C.
For the love of god stop writing new code in C