If you program in C, strings are just in your imagination. What you really have is a character pointer, and we all agree that a string is every character from that point up until one of the characters is zero. While that’s simple and useful, it is also the source of many errors. For example, writing a 32-byte string to a 16-byte array or failing to terminate a string with a zero byte. [Thasso] has been experimenting with a different way to represent strings that is still fairly simple but helps keep things straight.
Like many other languages, this setup uses counted strings and string buffers. You can read and write to a string buffer, but strings are read-only. In either case, there is a length for the contents and, in the case of the buffer, a length for the entire buffer.
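Sketched out, the pair looks something like this. This is a paraphrase of the idea rather than [Thasso]’s exact code, using the dat/len/cap field names from the post and a signed size type:

#include <stddef.h>   /* ptrdiff_t */

typedef ptrdiff_t sz;

struct str {       /* read-only string: pointer plus count */
    char *dat;     /* characters, not necessarily zero-terminated */
    sz len;        /* length of the contents */
};

struct str_buf {   /* writable string buffer */
    char *dat;     /* backing storage */
    sz len;        /* length of the contents */
    sz cap;        /* length of the entire buffer */
};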
We’ve seen schemes like this before and [Thasso] borrowed the idea from [Chris Wellons]. The real issue, of course, is that you now have to rewrite or wrap any “normal” C functions you have that take or return strings. We’ve also seen this done where the length is stored ahead of the string so you don’t have a field for the character pointer:
struct str { sz len; char dat[0]; };
Even though the prototypical structure ends in a zero-length array, the actual allocated structure can be larger.
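Allocation is then a single malloc covering the header and the characters together. A sketch (str_new is our name for it, not [Thasso]’s, and error handling is minimal):

#include <stdlib.h>   /* malloc */
#include <string.h>   /* memcpy */

struct str *str_new(const char *src, sz n)
{
    struct str *s = malloc(sizeof(struct str) + n);   /* header followed by n characters */
    if (s) {
        s->len = n;
        memcpy(s->dat, src, n);   /* dat occupies the same allocation, right after len */
    }
    return s;
}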
If you are worried about efficiency, [Thasso] and [Wellons] both point out that modern compilers are good at handling small structures, so maybe that’s an advantage to not putting the data directly into the struct. If you need characters larger than one byte, the [Wellons] post has some thoughts on that, too.
This is all old hat in C++, of course. No matter how you encode your strings, you should probably avoid the naughty ones. Passwords, too.
So… a pascal string then? (a 256 character array with the first character indicating the length of the rest, 0-255)
There are better solutions already, and C++ has them already when using new(): the size of the array is stored “4 bytes left of the pointer”. This works because, when you free memory, your pointer may point anywhere into the block you want to release; it need not point to the beginning.
The proposed parent structure that carries the metadata is a performance killer for CPU caches. String and metadata might be far away from one another, leading to cache misses. Best to keep them close to each other.
(There is at least one typo in the 5 line code sample)
Yeah, this is re-inventing std::string and std::string_view but worse in every way.
The first example is like string_view: it doesn’t own the memory. It’s an amazing type to use in APIs to handle both C strings and std::string.
C++20 extends the approach to arbitrary arrays, not just strings. std::span can describe C arrays, std::vector, and std::array.
Uh, no? At least not for std::string? It’s implementation dependent but std::string needs three fields (akin to the ‘str_buf’ here) and the compilers are all over the place as to how it’s implemented. It’s definitely not “size then pointer” for everyone.
https://devblogs.microsoft.com/oldnewthing/20240510-00/?p=109742
Leads to interesting behaviors because there are neat size/speed tradeoffs that occur with even something as simple as this due to small-string optimizations. GCC needs a pointer comparison when changing, clang needs a bit test for all operations, and msvc has a comparison operation even when accessing. Depending on the architecture those choices might be differently fast or slow.
This is a bit dodgy, as you would need to write the compiler around this to be performant. An obvious problem with this in some cases is also the “large or small” check, that’s a branch, ugh you don’t want those if you want to go fast!
I’ve seen some implementations using pointers for “begin” and “end”. So you can iterate for (char *i = string.start; i != string.end; ++i) and the size is simply string.end - string.start. This can be fast if the compiler understands it, but it cannot be vectorized easily in the general case. In fact, it cannot be vectorized easily if the size is unsigned. The size has to be signed – then the compiler knows there cannot be any overflow (signed integer overflow is UB). Like it or not, that’s part of what makes C and C++ fast. I’ve been bitten by this… Vectorization can give a huge performance boost, and yes, there are parallel algorithms on strings – a common theme being testing all characters for some pattern, forming a binary mask of 0/1, and then summing the whole thing.
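A minimal sketch of that begin/end layout (the names are invented for illustration):

#include <stddef.h>   /* ptrdiff_t */

struct str_range {
    char *start;   /* first character */
    char *end;     /* one past the last character */
};

/* the length falls out of (signed) pointer subtraction */
static ptrdiff_t range_len(struct str_range r) { return r.end - r.start; }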
“you would need to write the compiler around this to be performant.”
I mean those are literally the implementation details of a language in a given compiler, sooo…. yeah, they are? That being said if your architecture’s super-hurt by branches and your compiler can’t branch-free do “return X if (Y is W) else return Z” your compiler sucks. Or your architecture is really dumb.
Even in that case GCC’s implementation only has the branch in the path where a branch is likely already to exist, and it’s easy to predict anyway (if you’re checking capacity, you’re probably adding something, and it’s probably large, since you’re only small once).
First of all, who needs casual strings to be that performant? If you need the kind of performance where you’re worried about the compiler causing cache misses, then you need to be using malloc, if not outright crafting your own assembly. Second, this is not Sophie’s choice. You don’t have to pick only one way.
Considering literally all 3 major compilers do it, everyone, I guess?
If you allocate this, you need to make sure the buffer is aligned, probably on at least 64 bytes (or whichever cacheline size your architecture has). I think some parts of your libc may depend on malloc returning page aligned allocations (4k), that’s how memory really is allocated with mmap() inside the libc’s malloc() anyway, so there may be some severe waste unless you know what you’re doing (you probably have to roll your own malloc for this to be worth it).
I personally believe that “better solution” and “C++” cannot be in the same sentence
Ha! This comment made me smile! Making a point by omitting the period. Or wait: by making it overflow into the HaD stack space, or doing it properly by replacing it by the invisible zero? Nice!
Historically, Pascal has stored a string’s length in the first byte of the array. Doing something similar in C (or C++ for that matter) would also be reasonable. Perhaps using four bytes and accepting a max limit of 4 billion characters is a reasonable tradeoff.
Or storing it as a size_t or equivalent; it follows the integer size of the architecture.
I’ve never understood these.
Why can’t you be specific? I need a 16-bit-long signed integer: uint16_t
Why it has to be decided by architecture? That just makes it unstandard (if you want to transfer raw data between systems) and is just guessing game for the developer.
Plain horrible way to be non-standard, obfuscate and increase risk of bugs.
Correction: UNsigned integer.
Ha glad I saw this, not because I was gonna shout “gotcha,” but because I did a double take with “signed uint16_t” then wondered if my time away had eroded what I thought I knew
It should be signed. You should not use unsigned as a type just because a value may not be negative. It can get really tricky when mixing signed and unsigned math in a single expression, the promotion rules don’t always do what you want.
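The classic trap, as a tiny example:

#include <stdio.h>

int main(void)
{
    unsigned int size = 16;
    int offset = -1;
    /* the usual arithmetic conversions turn offset into unsigned:
       (unsigned)-1 == UINT_MAX, so the comparison is false */
    if (offset < size)
        puts("taken");
    else
        puts("not taken");   /* this is what prints */
    return 0;
}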
Even Stroustrup now believes that using unsigned types for std container sizes and such was a mistake.
Signed/unsigned type choice wasn’t the mistake, it was pretending that signed/unsigned operations are the same math and should be represented by the same operator.
sigh, reply got clicked before I finished the comment.
Imagine if someone had decided that bitwise objects and arithmetic objects should have different types, and bitwise objects should use + and * for | and &, just because in Boolean algebra | and & are often written as addition and multiplication and share some concepts. That’s obviously dumb, right? Except it’s okay that we pretend that SBC/ADC and SBB/ADD are the same operations?
Because integer promotion rules make working with smaller-sized ints more confusing and raise the risk of bugs.
It’s decided by architecture as the processor architecture (16-bit, 32-bit, 64-bit etc.) decides the default integer size and, more importantly, the address size.
If my memory serves me right, a size_t needs to be able to fit the address size of the architecture.
No, pointer size can be bigger than size_t and even bigger than the largest int size. IBM ILE/C has exactly that. Integers do need to fit into pointers, however.
size_t needs to be able to store the largest contiguous memory block index: e.g. if I do char p[all of the bytes], sizeof(p) fits in size_t.
The max contiguous memory block isn’t set by the C standard, though, so your implementation could limit “all of the bytes” to 65535 and size_t could be a short. (Tons of programs would fail, though, discovering they should really check returns from malloc).
And the pointer size is separate from that, too, which is why they added intptr_t. Although that’s an optional type, demonstrating again how weird C is.
You don’t understand size_t or stdint types (e.g. uint16_t)?
stdint types give a nice guarantee: the data is stored in 2s complement and the width is exactly as advertised, no padding bits allowed. A plain “char or int of any type” (int, short int, long int, unsigned int, char etc.) gives neither of these guarantees.
size_t: makes for a flexible “max width” in the sense that no larger object can exist on the machine. There could be security implications. Storing everything as uint64_t is unnecessary and can take up a lot of space depending on the application. 64-bit apps taking up much more memory due to pointers (and size_t-sized values) was very noticeable around the era when computers started having around 4GB of RAM.
“no larger object can exist on the machine.”
That’s not exactly what size_t is. size_t is just the type that “sizeof()” returns, which means it has to be as large as (or larger than) the maximum possible array size on the machine. You might think that means “oh, that’s the pointer width,” but it doesn’t have to be, since you can have segmented architectures, too.
Hence the difference between size_t and intptr_t.
“Why it has to be decided by architecture?”
Because memory layout and how you can address that memory is architecture-specific.
“Plain horrible way to be non-standard, obfuscate and increase risk of bugs.”
You’re waaaay more likely to have a bug if you assume a common memory layout/accessibility between two different architectures.
I was taught by a friend to use size_t for strings way back on Borland Turbo C 2.0 in the early 90s. Sometimes I miss the good ol’ no-GUI days. 🥲
Storing the size vs. a sentinel is the worst idea ever, memory-wise and probably performance-wise. You don’t know beforehand the size of the string (it can be 1 byte or 4TB), so you must reserve space for a size variable that can hold the largest possible value (likely 64 bits). For small strings (which are the majority of cases), this leads to a huge overhead, something like 1/8 for most strings, since they are mostly under 64 bytes. The size can only come first (since you don’t know how long the string is, you can’t store it after the string), so it implies alignment issues for manipulating the string itself.
The only advantage of storing the size first is to avoid computing it by searching for the sentinel value (likely ‘\0’). That’s what the string class in C++ got right: you only need to do it once (as you would do anyway if you had to store the size first). Then you can store the size in another variable that won’t impact the alignment and performance of the string itself. Notice that extracting the string size is done at compile time for most strings (static strings), and since you’re usually only making runtime strings from static strings, where you can add the compile-time sizes to produce a runtime size, the size computation is not O(N) as dumbly expected but O(1).
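In C, the compile-time case is just sizeof on the literal. A sketch using the str struct from the article, valid for string literals only (sizeof on a pointer would give the pointer size):

#define STR(s) (struct str){ .dat = (s), .len = sizeof(s) - 1 }   /* sizeof counts the NUL, hence -1 */

struct str hello = STR("hello");   /* len == 5, no strlen() at runtime */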
In the end, a good compile-time string class like string_view or even the fmt library will outperform any other string representation in many, if not all, use cases.
Depending on the application, it can work out faster, too; lots of code wants to know how long a string is and this makes that O(1) instead of O(n).
More code wants to access the string data, which is now potentially one extra cache miss away. And remember, getting data from RAM instead of cache takes more than 100 cycles.
Modern CPUs also have specialized vector instructions to get string lengths, so this is way less taxing than you would think.
Can you elaborate on those vector instructions? Are there intrinsics, or do we trust the compiler?
x86? Which instructions, out of curiosity?
A C string on x86 is called an “implicit length” string, at least by Intel. SSE4.2 introduced instructions specifically for those: you can use pcmpistri to zip through a string 16 bytes at a time.
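There are intrinsics for it. A sketch (compile with -msse4.2; note that real implementations align the pointer first so the 16-byte loads can’t wander across a page boundary):

#include <nmmintrin.h>   /* SSE4.2 intrinsics */
#include <stddef.h>

size_t strlen_sse42(const char *s)
{
    const __m128i zeros = _mm_setzero_si128();
    for (size_t i = 0;; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* pcmpistri returns the index of the first NUL in chunk, or 16 if none */
        int r = _mm_cmpistri(zeros, chunk, _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH);
        if (r != 16)
            return i + r;
    }
}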
There are fast string implementations for SSE4.2 here:
https://github.com/aklomp/sse-strings
x86 CPUs have had string processing instructions since the beginning, though, with the SCASx family of instructions and their repeat prefix.
Side note: str and the first two elements of str_buf are more or less the same as for struct iovec/iovec_t (UN*X scatter/gather arrays).
So, I’d add the constraint that the size and cap parameters should be of type size_t or the equivalent on the platform used.
An even better solution: don’t use C.
For the love of god stop writing new code in C
No.
Skill issue.
There’s absolutely nothing wrong with C. With good tools and good programming practices it is perfectly good to use. I use C/C++ for embedded code.
For the love of god stop telling me what do to.
I’m really questioning the wisdom of a post about using strings in C, when the only responsible use of C left is in highly constrained embedded systems, where you shouldn’t be messing with strings to begin with.
Null-terminated data is a perfectly valid storage method. All data has to be stored some way, and how you encode it and handle it depends on what level of error checking and such you need. Using a termination character to end a data field is used all over the place.
Null termination specifically has a lot of useful features in data transmission, too: for instance in a UART link, the null character is trivially detected because it’s the only character with 9 consecutive bit times of zero.
The issue with string processing in C is poor data validation in favor of speed, and this exists regardless of how data is stored. There’s nothing special about a C-string. It’s just a data encoding method.
I’ll continue to write in C and Pascal. Nice concise languages I’ve been using since the 80s. What you see is what you get. Love it. Works great for little projects like the RPi Pico/Pico 2. Nothing wrong with Python or Perl or Assembly either, when applicable.
I agree with you.
C is just fine if you include some extra rules to avoid the worst pitfalls. Follow something like MISRA C and you’ll be just fine.
An even better solution: don’t use strings.
It’s all the legacy code that gives me a headache.
My system is similar but a little different.
struct c_str {
    char *s;        /* points at the string data (into _data, or anywhere for a substring) */
    unsigned n;     /* length of the string */
    char _data[];   /* optional inline storage (C99 flexible array member) */
};
when creating a new string, you allocate it and place your data in _data and set your pointer s to _data.
when creating a substring, you can point s anywhere and you don’t use _data[].
there are some evil tricks with structs both with and without _data[] that I use so you can create substring handles without malloc().
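Presumably something along these lines (a sketch of the two cases described above, not necessarily the exact trick; error handling omitted):

#include <stdlib.h>
#include <string.h>

/* owning string: the data lives in the flexible array */
struct c_str *make_str(const char *src, unsigned n)
{
    struct c_str *s = malloc(sizeof *s + n + 1);   /* header + data + NUL */
    memcpy(s->_data, src, n);
    s->_data[n] = '\0';
    s->s = s->_data;
    s->n = n;
    return s;
}

/* non-owning substring handle: no malloc, _data goes unused */
struct c_str substr(const struct c_str *base, unsigned off, unsigned len)
{
    return (struct c_str){ .s = base->s + off, .n = len };
}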
“Better C Strings, Simply”
Didn’t see anything in the article about stringed musical instruments…
I wish I could do some C# pun here.
He wanted to pun on C#, but his attempt fell flat!
A very sharp person you are.
Author has a data type with members named dat, len, and cap.
Author also says “…this is C, so we accept being a little verbose and move on.”
So data, length, and capacity would be excessively verbose?
Seriously though, C’s native string type is an abomination and I applaud any effort to avoid it.
I don’t think there IS a native string type in C. Just a byte array. Or am I wrong?
A 0-sized array isn’t legal in C or C++. If you remove the 0 you get a C99 flexible array member, which doesn’t exist in C++.
https://en.m.wikipedia.org/wiki/Flexible_array_member
struct str
{
    sz len;
    char dat[];
};
There are better ways to do things in C++. But I also don’t use C++ unless paid to do so. It’s just not a language that I want to spend any more time on than I absolutely have to. C is fine; it’s primitive at times, but it’s also pretty simple for getting the low-level stuff done the way I want it done.
I did something similar years ago, extracting from an open source Microsoft embedded codebase, then making it cross-platform (with UTF-8 and all that): https://github.com/SamuelMarks/c-str-span
Waste of time. Check the lengths of things before storing, or use functions like strncpy, snprintf and so on, so the functions do the checking for you. More common is misassigning pointers, or using the NULL that malloc returns when it fails.
malloc is another area where the stdlib design screwed us all over.
TFA isn’t even wrong. It’s just that it’s many, many years too late.
in the 30 years i’ve been using C, my idioms keep changing. i’d like to think i’m fairly mature now but probably i’m going to keep changing?
but at the moment, i’m sensitive to the difference between interchange and computation. and for actual strings, i am generally in favor of NUL-termination for interchange. for interchange, i like “char *” NUL-terminated strings. but for computation, i often use a counted string like str_buf. there are a few idioms for growable strings that, once you’ve seen them implemented a dozen times, it just calls out for generalization. the biggest recent development is a safe sprintf that targets a growable str_buf…really amazing how generally useful that is (or how unsatisfactory naked sprintf is in many cases).
but my point is, that doesn’t replace NUL-terminated strings…it’s just good factoring for the cases where you do (briefly or locally) want a growable string.
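(for the curious, a safe sprintf onto a growable buffer looks roughly like this. a sketch assuming the str_buf fields from the top of the article, with minimal error handling:)

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

void buf_printf(struct str_buf *b, const char *fmt, ...)
{
    va_list ap;
    va_start(ap, fmt);
    int need = vsnprintf(NULL, 0, fmt, ap);   /* measure the formatted length first */
    va_end(ap);
    if (need < 0)
        return;
    if (b->len + need + 1 > b->cap) {          /* grow geometrically */
        sz cap = b->cap ? b->cap : 64;
        while (cap < b->len + need + 1)
            cap *= 2;
        b->dat = realloc(b->dat, (size_t)cap);
        b->cap = cap;
    }
    va_start(ap, fmt);
    vsnprintf(b->dat + b->len, (size_t)need + 1, fmt, ap);
    va_end(ap);
    b->len += need;
}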
another thing i’m coming around to is sometimes i pass structs by value :)
This is an area where Ritchie really screwed us all over.