If you program in C, strings are just in your imagination. What you really have is a character pointer, and we all agree that a string is every character from that point up until one of the characters is zero. While that’s simple and useful, it is also the source of many errors. For example, writing a 32-byte string to a 16-byte array or failing to terminate a string with a zero byte. [Thasso] has been experimenting with a different way to represent strings that is still fairly simple but helps keep things straight.
Like many other languages, this setup uses counted strings and string buffers. You can read and write to a string buffer, but strings are read-only. In either case, there is a length for the contents and, in the case of the buffer, a length for the entire buffer.
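Sketched out, the pair looks something like this. This is a paraphrase of the idea rather than [Thasso]’s exact code, using the dat/len/cap field names from the post and a signed size type:

#include <stddef.h>   /* ptrdiff_t */

typedef ptrdiff_t sz;

struct str {       /* read-only string: pointer plus count */
    char *dat;     /* characters, not necessarily zero-terminated */
    sz len;        /* length of the contents */
};

struct str_buf {   /* writable string buffer */
    char *dat;     /* backing storage */
    sz len;        /* length of the contents */
    sz cap;        /* length of the entire buffer */
};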
We’ve seen schemes like this before and [Thasso] borrowed the idea from [Chris Wellons]. The real issue, of course, is that you now have to rewrite or wrap any “normal” C functions you have that take or return strings. We’ve also seen this done where the length is stored ahead of the string so you don’t have a field for the character pointer:
struct str { sz len; char dat[0]; };
Even though the prototypical structure ends in a zero-length array, the actual allocated structure can be larger.
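Allocation is then a single malloc covering the header and the characters together. A sketch (str_new is our name for it, not [Thasso]’s, and error handling is minimal):

#include <stdlib.h>   /* malloc */
#include <string.h>   /* memcpy */

struct str *str_new(const char *src, sz n)
{
    struct str *s = malloc(sizeof(struct str) + n);   /* header followed by n characters */
    if (s) {
        s->len = n;
        memcpy(s->dat, src, n);   /* dat occupies the same allocation, right after len */
    }
    return s;
}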
If you are worried about efficiency, [Thasso] and [Wellons] both point out that modern compilers are good at handling small structures, so maybe that’s an advantage to not putting the data directly into the struct. If you need characters larger than one byte, the [Wellons] post has some thoughts on that, too.
This is all old hat in C++, of course. No matter how you encode your strings, you should probably avoid the naughty ones. Passwords, too.
So… a pascal string then? (a 256 character array with the first character indicating the length of the rest, 0-255)
There are better solutions already, and C++ has them already when using new(): the size of the array is stored “4 bytes left of the pointer”. This works because, when you free memory, your pointer may point anywhere into the block you want to release; it need not point to the beginning.
The proposed parent structure that carries the metadata is a performance killer for CPU caches. String and metadata might be far away from one another, leading to cache misses. Best to keep them close to each other.
(There is at least one typo in the 5 line code sample)
Yeah, this is re-inventing std::string and std::string_view but worse in every way.
The first example is like string_view: it doesn’t own the memory. It’s an amazing type to use in APIs to handle both C strings and std::string.
C++20 extends the approach to arbitrary arrays, not just strings. std::span can describe C arrays, std::vector, and std::array.
Uh, no? At least not for std::string? It’s implementation dependent but std::string needs three fields (akin to the ‘str_buf’ here) and the compilers are all over the place as to how it’s implemented. It’s definitely not “size then pointer” for everyone.
https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=109742
Leads to interesting behaviors because there are neat size/speed tradeoffs that occur with even something as simple as this due to small-string optimizations. GCC needs a pointer comparison when changing, clang needs a bit test for all operations, and msvc has a comparison operation even when accessing. Depending on the architecture those choices might be differently fast or slow.
This is a bit dodgy, as you would need to write the compiler around this to be performant. An obvious problem with this in some cases is also the “large or small” check, that’s a branch, ugh you don’t want those if you want to go fast!
I’ve seen some implementations using pointers for “begin” and “end”. So you can iterate for (char *i = string.start; i != string.end; ++i) and the size is simply string.end - string.start. This can be fast if the compiler understands it, but it cannot be vectorized easily in the general case. In fact, it cannot be vectorized easily if the size is unsigned. The size has to be signed – then the compiler knows there cannot be any overflow (signed integer overflow is UB). Like it or not, that’s part of what makes C and C++ fast. I’ve been bitten by this… Vectorization can give a huge performance boost, and yes, there are parallel algorithms on strings – a common theme being testing all characters for some pattern, forming a binary mask of 0/1, and then summing the whole thing.
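A minimal sketch of that begin/end layout (the names are invented for illustration):

#include <stddef.h>   /* ptrdiff_t */

struct str_range {
    char *start;   /* first character */
    char *end;     /* one past the last character */
};

/* the length falls out of (signed) pointer subtraction */
static ptrdiff_t range_len(struct str_range r) { return r.end - r.start; }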
“you would need to write the compiler around this to be performant.”
I mean those are literally the implementation details of a language in a given compiler, sooo…. yeah, they are? That being said if your architecture’s super-hurt by branches and your compiler can’t branch-free do “return X if (Y is W) else return Z” your compiler sucks. Or your architecture is really dumb.
Even in that case GCC’s implementation only has the branch in the path where a branch is likely already to exist, and it’s easy to predict anyway (if you’re checking capacity, you’re probably adding something, and it’s probably large, since you’re only small once).
First of all, who needs casual strings to be that performant? If you need the kind of performance where you’re worried about the compiler causing cache misses, then you need to be using malloc, if not outright crafting your own assembly. Second, this is not Sophie’s choice. You don’t have to pick only one way.
Considering literally all 3 major compilers do it, everyone, I guess?
If you allocate this, you need to make sure the buffer is aligned, probably on at least 64 bytes (or whichever cacheline size your architecture has). I think some parts of your libc may depend on malloc returning page aligned allocations (4k), that’s how memory really is allocated with mmap() inside the libc’s malloc() anyway, so there may be some severe waste unless you know what you’re doing (you probably have to roll your own malloc for this to be worth it).
I personally believe that “better solution” and “C++” cannot be in the same sentence
Ha! This comment made me smile! Making a point by omitting the period. Or wait: by making it overflow into the HaD stack space, or doing it properly by replacing it by the invisible zero? Nice!
Historically, Pascal has stored a string’s length in the first byte of the array. Doing something similar in C (or C++ for that matter) would also be reasonable. Perhaps using four bytes and accepting a max limit of 4 billion characters is a reasonable tradeoff.
Or storing it as a size_t or equivalent; it follows the integer size of the architecture.
I’ve never understood these.
Why can’t you be specific? I need a 16-bit-long signed integer: uint16_t
Why it has to be decided by architecture? That just makes it unstandard (if you want to transfer raw data between systems) and is just guessing game for the developer.
Plain horrible way to be non-standard, obfuscate and increase risk of bugs.
Correction: UNsigned integer.
Ha glad I saw this, not because I was gonna shout “gotcha,” but because I did a double take with “signed uint16_t” then wondered if my time away had eroded what I thought I knew
It should be signed. You should not use unsigned as a type just because a value may not be negative. It can get really tricky when mixing signed and unsigned math in a single expression, the promotion rules don’t always do what you want.
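The classic trap, as a tiny example:

#include <stdio.h>

int main(void)
{
    unsigned int size = 16;
    int offset = -1;
    /* the usual arithmetic conversions turn offset into unsigned:
       (unsigned)-1 == UINT_MAX, so the comparison is false */
    if (offset < size)
        puts("taken");
    else
        puts("not taken");   /* this is what prints */
    return 0;
}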
Even Stroustrup now believes that using unsigned types for std container sizes and such was a mistake.
Signed/unsigned type choice wasn’t the mistake, it was pretending that signed/unsigned operations are the same math and should be represented by the same operator.
sigh, reply got clicked before I finished the comment.
Imagine if someone had decided that bitwise objects and arithmetic objects should have different types, and bitwise objects should use + and * for | and &, just because in Boolean algebra | and & are often written as addition and multiplication and share some concepts. That’s obviously dumb, right? Except it’s okay that we pretend that SBC/ADC and SBB/ADD are the same operations?
Because integer promotion rules make working with smaller-sized ints more confusing and raise the risk of bugs.
It’s decided by architecture as the processor architecture (16-bit, 32-bit, 64-bit etc.) decides the default integer size and, more importantly, the address size.
If my memory serves me right, a size_t needs to be able to fit the address size of the architecture.
No, pointer size can be bigger than size_t and even bigger than the largest int size. IBM ILE/C has exactly that. Integers do need to fit into pointers, however.
size_t needs to be able to store the largest contiguous memory block index: e.g. if I do char p[all of the bytes], sizeof(p) fits in size_t.
The max contiguous memory block isn’t set by the C standard, though, so your implementation could limit “all of the bytes” to 65535 and size_t could be a short. (Tons of programs would fail, though, discovering they should really check returns from malloc).
And the pointer size is separate from that, too, which is why they added intptr_t. Although that’s an optional type, demonstrating again how weird C is.
You don’t understand size_t or stdint types (e.g. uint16_t)?
stdint types give a nice guarantee: the data is stored in 2s complement and the width is exactly as advertised, no padding bits allowed. A plain “char or int of any type” (int, short int, long int, unsigned int, char etc.) gives neither of these guarantees.
size_t: makes for a flexible “max width” in the sense that no larger object can exist on the machine. There could be security implications. Storing everything as uint64_t is unnecessary and can take up a lot of space depending on the application. 64-bit apps taking up much more memory due to pointers (and size_t-sized values) was very noticeable around the era when computers started having around 4GB of RAM.
“no larger object can exist on the machine.”
That’s not exactly what size_t is. size_t is just the type that “sizeof()” returns, which means it has to be as large as (or larger than) the maximum possible array size on the machine. You might think that means “oh, that’s the pointer width,” but it doesn’t have to be, since you can have segmented architectures, too.
Hence the difference between size_t and intptr_t.
“Why it has to be decided by architecture?”
Because memory layout and how you can address that memory is architecture-specific.
“Plain horrible way to be non-standard, obfuscate and increase risk of bugs.”
You’re waaaay more likely to have a bug if you assume a common memory layout/accessibility between two different architectures.
I was taught by a friend to use size_t for strings way back on Borland Turbo C 2.0 in the early 90s. Sometimes I miss the good ol’ no-GUI days. 🥲
Storing the size vs. a sentinel is the worst idea ever, memory-wise and probably performance-wise. You don’t know beforehand the size of the string (it can be 1 byte or 4TB), so you must reserve space for a size variable that can hold the largest possible value (likely 64 bits). For small strings (which are the majority of cases), this leads to a huge overhead, something like 1/8 for most strings, since they are mostly under 64 bytes. The size can only come first (since you don’t know how long the string is, you can’t store it after the string), so it implies alignment issues for manipulating the string itself.
The only advantage of storing the size first is to avoid computing it by searching for the sentinel value (likely ‘\0’). That’s what the string class in C++ got right: you only need to do it once (as you would do anyway if you had to store the size first). Then you can store the size in another variable that won’t impact the alignment and performance of the string itself. Notice that extracting the string size is done at compile time for most strings (static strings), and since you’re usually only making runtime strings from static strings, where you can add the compile-time sizes to produce a runtime size, the size computation is not O(N) as dumbly expected but O(1).
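In C, the compile-time case is just sizeof on the literal. A sketch using the str struct from the article, valid for string literals only (sizeof on a pointer would give the pointer size):

#define STR(s) (struct str){ .dat = (s), .len = sizeof(s) - 1 }   /* sizeof counts the NUL, hence -1 */

struct str hello = STR("hello");   /* len == 5, no strlen() at runtime */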
In the end, a good compile-time string class like string_view or even the fmt library will outperform any other string representation in many, if not all, use cases.
Depending on the application, it can work out faster, too; lots of code wants to know how long a string is and this makes that O(1) instead of O(n).
More code wants to access the string data, which is now potentially one extra cache miss away. And remember, getting data from RAM instead of cache takes more than 100 cycles.
Modern CPUs also have specialized vector instructions to get string lengths, so this is way less taxing than you would think.
Can you elaborate on those vector instructions? Are there intrinsics, or do we trust the compiler?
x86? Which instructions, out of curiosity?
A C string on x86 is called an “implicit length” string, at least by Intel. SSE4.2 introduced instructions specifically for those: you can use pcmpistri to zip through a string 16 bytes at a time.
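There are intrinsics for it. A sketch (compile with -msse4.2; note that real implementations align the pointer first so the 16-byte loads can’t wander across a page boundary):

#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stddef.h>

size_t strlen_sse42(const char *s)
{
    const __m128i zeros = _mm_setzero_si128();
    for (size_t i = 0;; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* pcmpistri returns the index of the first NUL in chunk, or 16 if none */
        int r = _mm_cmpistri(zeros, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (r != 16)
            return i + r;
    }
}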
There are fast string implementations for SSE4.2 here:
https://github.com/aklomp/sse-strings
x86 CPUs have had string processing instructions since the beginning, though, with the SCASx family of instructions and their repeat prefix.
Side note: str and the first two elements of str_buf are more or less the same as for struct iovec/iovec_t (UN*X scatter/gather arrays).
So, I’d add the constraint that the size and cap parameters should be of type size_t or the equivalent on the platform used.
An even better solution: don’t use C.
For the love of god stop writing new code in C
No.
Skill issue.
There’s absolutely nothing wrong with C. With good tools and good programming practices it is perfectly good to use. I use C/C++ for embedded code.
For the love of god stop telling me what do to.
I’m really questioning the wisdom of a post about using strings in C, when the only responsible use of C left is in highly constrained embedded systems, where you shouldn’t be messing with strings to begin with.
Null-terminated data is a perfectly valid storage method. All data has to be stored some way, and how you encode it and handle it depends on what level of error checking and such you need. Using a termination character to end a data field is used all over the place.
Null termination specifically has a lot of useful features in data transmission, too: for instance in a UART link, the null character is trivially detected because it’s the only character with 9 consecutive bit times of zero.
The issue with string processing in C is poor data validation in favor of speed, and this exists regardless of how data is stored. There’s nothing special about a C-string. It’s just a data encoding method.
I’ll continue to write in C and Pascal. Nice concise languages I’ve been using since the 80s. What you see is what you get. Love it. Works great for little projects like the RPi Pico/Pico 2. Nothing wrong with Python or Perl or Assembly either, when applicable.
I agree with you.
C is just fine if you include some extra rules to avoid the worst pitfalls. Follow something like MISRA C and you’ll be just fine.
An even better solution: don’t use strings.
It’s all the legacy code that gives me a headache.
My system is similar but a little different.
struct c_str {
    char *s;        /* points at the string data (into _data, or anywhere for a substring) */
    unsigned n;     /* length of the string */
    char _data[];   /* optional inline storage (C99 flexible array member) */
};
when creating a new string, you allocate it and place your data in _data and set your pointer s to _data.
when creating a substring, you can point s anywhere and you don’t use _data[].
there are some evil tricks with structs both with and without _data[] that I use so you can create substring handles without malloc().
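Presumably something along these lines (a sketch of the two cases described above, not necessarily the exact trick; error handling omitted):

#include <stdlib.h>
#include <string.h>

/* owning string: the data lives in the flexible array */
struct c_str *make_str(const char *src, unsigned n)
{
    struct c_str *s = malloc(sizeof *s + n + 1);   /* header + data + NUL */
    memcpy(s->_data, src, n);
    s->_data[n] = '\0';
    s->s = s->_data;
    s->n = n;
    return s;
}

/* non-owning substring handle: no malloc, _data goes unused */
struct c_str substr(const struct c_str *base, unsigned off, unsigned len)
{
    return (struct c_str){ .s = base->s + off, .n = len };
}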
“Better C Strings, Simply”
Didn’t see anything in the article about stringed musical instruments…
I wish I could do some C# pun here.
He wanted to pun on C#, but his attempt fell flat!
A very sharp person you are.
Author has a data type with members named dat, len, and cap.
Author also says “…this is C, so we accept being a little verbose and move on.”
So data, length, and capacity would be excessively verbose?
Seriously though, C’s native string type is an abomination and I applaud any effort to avoid it.
I don’t think there IS a native string type in C. Just a byte array. Or am I wrong?
A 0-sized array isn’t legal in C or C++. If you remove the 0 you get a C99 flexible array member, which doesn’t exist in C++.
https://en.m.wikipedia.org/wiki/Flexible_array_member
struct str
{
    sz len;
    char dat[];
};
There are better ways to do things in C++. But I also don’t use C++ unless paid to do so. It’s just not a language that I want to spend any more time on than I absolutely have to. C is fine; it’s primitive at times, but it’s also pretty simple for getting the low-level stuff done the way I want it done.
I did something similar years ago, extracting from an open source Microsoft embedded codebase, then making it cross-platform (with UTF-8 and all that): https://github.com/SamuelMarks/c-str-span
Waste of time. Check the lengths of things before storing, or use functions like strncpy, snprintf and so on, so the functions do the checking for you. More common is misassigning pointers, or using the NULL that malloc returns when it fails.
malloc is another area where the stdlib design screwed us all over.
TFA isn’t even wrong. It’s just that it’s many, many years too late.
in the 30 years i’ve been using C, my idioms keep changing. i’d like to think i’m fairly mature now but probably i’m going to keep changing?
but at the moment, i’m sensitive to the difference between interchange and computation. and for actual strings, i am generally in favor of NUL-termination for interchange. for interchange, i like “char *” NUL-terminated strings. but for computation, i often use a counted string like str_buf. there are a few idioms for growable strings that, once you’ve seen them implemented a dozen times, it just calls out for generalization. the biggest recent development is a safe sprintf that targets a growable str_buf…really amazing how generally useful that is (or how unsatisfactory naked sprintf is in many cases).
but my point is, that doesn’t replace NUL-terminated strings…it’s just good factoring for the cases where you do (briefly or locally) want a growable string.
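(for the curious, a safe sprintf onto a growable buffer looks roughly like this. a sketch assuming the str_buf fields from the top of the article, with minimal error handling:)

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

void buf_printf(struct str_buf *b, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int need = vsnprintf(NULL, 0, fmt, ap);   /* measure the formatted length first */
    va_end(ap);
    if (need < 0)
        return;
    if (b->len + need + 1 > b->cap) {          /* grow geometrically */
        sz cap = b->cap ? b->cap : 64;
        while (cap < b->len + need + 1)
            cap *= 2;
        b->dat = realloc(b->dat, (size_t)cap);
        b->cap = cap;
    }
    va_start(ap, fmt);
    vsnprintf(b->dat + b->len, (size_t)need + 1, fmt, ap);
    va_end(ap);
    b->len += need;
}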
another thing i’m coming around to is sometimes i pass structs by value :)
This is an area where Ritchie really screwed us all over.