The Ongoing BcacheFS Filesystem Stability Controversy

In a saga that brings to mind the hype and incidents surrounding ReiserFS, [SavvyNik] takes us through the latest data corruption bug report and developer updates regarding the BcacheFS filesystem in the Linux kernel. Building on the bcache (block cache) mechanism in the Linux kernel, its author [Kent Overstreet] developed it into what is now BcacheFS, announcing it in 2015; it was subsequently merged into the Linux kernel (6.7) in early 2024. As a modern copy-on-write (COW) filesystem along the lines of ZFS and btrfs, it was intended to compete directly with these filesystems.

Despite this, it has become clear that BcacheFS is rather unstable, with frequent and extensive patches being submitted, to the point where [Linus Torvalds] pushed back against it in August of last year and expressed regret for merging BcacheFS into mainline Linux. As covered in the video, [Kent] has pushed users reporting issues to upgrade to the latest Linux kernel to get critical fixes, which reinforces the notion that BcacheFS is at best an experimental, alpha-level filesystem implementation and should probably not be used with important data or systems.

Although one can speculate on the reasons for BcacheFS spiraling out of control like this, ultimately, if you want a reliable COW filesystem in Linux, you are best off using btrfs or ZFS. Of course, regardless of which filesystem you use, always make multiple backups, test them regularly, and stay away from shiny new things on production systems.

55 thoughts on “The Ongoing BcacheFS Filesystem Stability Controversy”

  1. ext4 FTW. My PC has suffered several hard power-offs over the years (until I wised up and bought a UPS). ext4 has been very robust, never corrupting even a single file, and has repaired everything after booting back up.

    1. All decent file systems have such properties, be it NTFS, XFS, UFS, and so on. This is more a matter of your disks honestly storing the data they claim to have stored than a property of any production-ready filesystem.

      One thing that ext4 lacks is consistent snapshots: you can’t take a snapshot to back up all the files in a consistent state while applications are modifying files as you run the backup (except with a special block-level snapshot layer, which doesn’t count since that’s outside of the filesystem; a toy illustration of the problem follows below). Another issue is scalability: checking inode-based file systems is expensive, as the time and space necessary to check the metadata scale linearly with file system size.

      One inefficiency is the assumption of operating on a spinning HDD. Filesystems designed for SSDs have a very different architecture.

      Nobody was ever fired for choosing ext4, I guess.
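      A deterministic toy of the consistency point above, in plain Python rather than any real backup tool: an application keeps two files that must agree, a naive file-by-file backup copies them one at a time, and an update lands in between. The filenames and values are made up purely for illustration.

        # The backup ends up with a mix of old and new state that never
        # existed on disk at any single instant.
        live = {"data.db": "balance=100", "index.db": "points-at balance=100"}
        backup = {}

        backup["data.db"] = live["data.db"]        # backup copies the first file...

        live["data.db"] = "balance=250"            # ...the app updates both files...
        live["index.db"] = "points-at balance=250"

        backup["index.db"] = live["index.db"]      # ...then backup copies the second

        assert backup == {"data.db": "balance=100",
                          "index.db": "points-at balance=250"}   # inconsistent pair

        # A filesystem-level snapshot is taken atomically, so it would capture
        # either the old pair or the new pair, never this mix.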

      1. When drives lie, that’s where I reach for ZFS (or, formerly, mdadm). Having redundant copies of the data means the correct version can be reconstructed (and ZFS’s combination of a journal and RAID gets rid of the write hole issue).

      2. The “honestly report the data that they said they stored” part is important. Had a very small case of data loss recently when a customer lost power* just as a file was being written. The file system reported that it had been written, but apparently the SSD had written it to a cache, reported success, and then lost power before it could permanently write it. If they’d waited a fraction of a second longer it would have been written.
        EXT4 in this case, but any filesystem is only as good as the drives you’re using.

        *(they were trying to reboot a router but didn’t know how, so instead just unplugged everything)
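        For completeness, here is the application-side half of that story as a minimal Python sketch (the path and payload are made up): flushing and fsync()ing asks the kernel, and through it the drive, to commit the data to stable media before the write is treated as durable. A drive with a volatile write cache can still acknowledge early, which is exactly the failure described above.

          import os

          def durable_write(path, data: bytes):
              with open(path, "wb") as f:
                  f.write(data)
                  f.flush()             # drain Python's userspace buffer
                  os.fsync(f.fileno())  # ask the kernel (and drive) to hit stable storage

          durable_write("/tmp/router-config.bak", b"example payload\n")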

      1. Or know too much and use the simpler ext4 due to very predictable write behaviour and storage consumption. =)

        Unless I pay a team to take care of my stuff, I will always run a dumb stable system.

      2. i can’t speak for shinsukke but i definitely do not know anything about why / when zfs/btrfs are desirable.

        a quick skim shows that the two main features seem to be volume management and snapshotting (built on copy-on-write).

        both of those features seem kind of neat to me, but they also each make me nervous.

        i’ve met overeager volume management schemes (such as hardware raid) that unrecoverably trash an entire volume whenever anything goes wrong, and that experience has made me prefer the mirroring mode built into md. volume management built into a complicated filesystem seems like a worst case for data recovery. but those are just two of my prejudices talking — i don’t trust complication and i do want everything to do one simple task well.

        snapshotting / cp --reflink makes me even more uncomfortable because it violates the model in my head of how files work. but it is isolated to just one corner — cp operates differently and then afterwards you can’t tell the difference, right? again, i’d rather have separate tools for that… it really seems more like a git feature than a filesystem feature. but that’s again just my prejudice. but mostly, it would be very hard to get me to trust it. if i had a workload that really required snapshotting, i would want to have a custom implementation.

        i’m curious why people do choose these filesystems, and what is different about their mentalities and requirements.

        1. For ZFS at least, it’s because it’s the brainchild of some of Sun Microsystems’ brightest filesystem engineers, who thought of almost everything when designing and coding it (apart from the non-GPL license, but nothing is ever perfect).

          Which subsequently makes it immensely flexible and “tunable” if you’re willing to dig into the nitty-gritty details and settings.

          That and RAIDZ 1, 2 and 3, which are like RAID 5, 6 and a hypothetical 7, but better in every way (especially since RAIDZ doesn’t suffer from the nasty write-hole issue that plagues conventional RAID 5 and 6).

          It’s especially in scenarios calling for a RAID volume like that where you tend to find ZFS evangelism, since you can “attach” various types of cache devices (i.e. fast SSDs) to the main volume consisting of spinning discs or slow SSDs to boost read and write performance in almost all situations.

          Essentially, if tuned right, you get the performance of SSDs with the GiB-per-‘monetary currency du jour’ of hard discs.

          It’s also got some fancy filesystem integrity preservation features that I can’t explain well, which put ext4 to shame.

          Basically, if you asked me to point out any negatives about ZFS, I can personally only refer to its software license, and the fact that the many options and features it has can definitely seem overwhelming to someone with no interest in the nitty-gritty of filesystems.

          1. If I understand correctly, a thing ZFS engineers didn’t really think of initially was how to cleanly expand RAIDZ without having enough resources to create an entirely new, larger pool of disks and run both the smaller and larger pools at the same time to copy from one to the other. But a bit of Googling shows this has improved somewhat recently (RAIDZ expansion landed in ZFS 2.3.0, released in 2025).

            But I’d imagine that they absolutely thought about the non-GPL licence, specifically so that ZFS would not be a first-class filesystem on Linux and to give Solaris a competitive edge. Possibly while still being able to crowd-source improvements from the BSDs.

          2. ZFS is great, but it really needs some better guidance for sane defaults that fit different scenarios. It was never designed for end users or systems with small numbers of drives, and the features you enable heavily impact its requirements. What @Bry refers to was a non-issue for enterprise business users.

            The licence issue is more complicated. Despite Sun’s engineers being very good, and quite into open source, Sun’s legal department was notorious, and while the official line is that the CDDL was not supposed to be GPL-incompatible, that’s transparently BS, and some employees were more truthful about it back then. It was crafted specifically to be incompatible in order to prevent adoption and drive sales of their hardware.

          3. Btrfs solves the RAID expansion problem: as you add drives or swap smaller drives for larger ones, the extra space gets used (as long as you’ve got n larger drives for whatever level of redundancy you’ve picked).

          4. @Bry @SO

            Yes, there is RAIDZ expansion now, and support for mixed-size disks is a work in progress.

            No, ZFS’ licence wasn’t intended to be incompatible with the GPL. It was never even a concern. When Solaris was released as OpenSolaris, Sun took the MPL 1.0 and added patent licencing because they’d been subject to patent lawsuits. The MPL at that time was GPL-incompatible, and so the resulting CDDL was too. Sun never lived long enough to release a new version of the CDDL to introduce GPL compatibility, as Mozilla eventually did.

            The rest is just a conspiracy theory spun by GPL zealots who hate anything non-GPL gaining traction, even if it’s (weak) copyleft.

        2. ext4 can be “repaired,” but that just means the OS won’t crash when it reads the drive. There is zero guarantee that any of your data will still be there. There is only one copy of everything and it’s not checksummed, meaning that if that one copy of the data changes, it won’t be fixed and you’ll be none the wiser.

          ZFS on the other hand will repair the data as a matter of routine. If anything changes on the drives without its say-so it will be detected, reported, and corrected. If a drive encounters too many errors it will be marked as failed and can be automatically replaced with a hot spare.

          You might wonder “ok why not just do traditional raid (mdadm) for a corruption-free virtual drive and layer ext4 on top for files and folders?” Well the reason is the raid5 “write hole.” ZFS has both a filesystem journal AND visibility into which disks in the array have received the new data, meaning that corner case in layered setups disappears.
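          A minimal sketch of the checksum-and-repair idea described above, as toy Python rather than anything resembling ZFS internals: a two-way mirror keeps a checksum per block, so a silently corrupted copy can be detected on read, the good copy returned, and the bad copy rewritten.

            import hashlib

            class Mirror:
                def __init__(self):
                    self.disks = [{}, {}]   # block_id -> bytes, one dict per "disk"
                    self.sums = {}          # block_id -> checksum of the good data

                def write(self, block_id, data: bytes):
                    self.sums[block_id] = hashlib.sha256(data).hexdigest()
                    for disk in self.disks:
                        disk[block_id] = data

                def read(self, block_id) -> bytes:
                    good = None
                    for disk in self.disks:
                        if hashlib.sha256(disk[block_id]).hexdigest() == self.sums[block_id]:
                            good = disk[block_id]
                            break
                    assert good is not None, "both copies corrupt: unrecoverable"
                    for disk in self.disks:      # rewrite from known-good data (repairs the bad copy)
                        disk[block_id] = good
                    return good

            m = Mirror()
            m.write(1, b"important data")
            m.disks[0][1] = b"bit-rotted junk"        # silent corruption on one disk
            assert m.read(1) == b"important data"     # detected via checksum, repaired
            assert m.disks[0][1] == b"important data"

          Plain RAID has the redundancy but not an independent checksum, which is why it can detect a mismatch yet not know which side to trust.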

        3. Snapshotting is safe and free (minus the storage of changes, obviously) and falls naturally out of how a copy-on-write filesystem works.

          Basically, any change to a file gets written to a new chunk, which gets referenced instead of the old one. If the old chunk isn’t part of a snapshot, it’s marked as free to be overwritten; if you’ve snapshotted the system, it isn’t marked for overwriting and simply remains referenced as part of the snapshot.

          Also, ZFS has tons of checksumming and error checking, which means that basically any data on the drives stays accurate, though you’d want at least some RAID going on.
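          The chunk-and-reference mechanism described above, as a toy Python sketch (no relation to any real on-disk format): every write allocates a fresh chunk and repoints the live block map, a snapshot is just a frozen copy of that map, and old chunks are only reclaimed once no map references them.

            class CowFile:
                def __init__(self):
                    self.chunks = {}      # chunk_id -> data
                    self.live = {}        # block number -> chunk_id
                    self.snapshots = {}   # name -> frozen block map
                    self.next_id = 0

                def write(self, block, data):
                    self.chunks[self.next_id] = data        # never overwrite in place
                    self.live[block] = self.next_id
                    self.next_id += 1
                    self._gc()

                def snapshot(self, name):
                    self.snapshots[name] = dict(self.live)  # instant: just copy the map

                def _gc(self):
                    referenced = set(self.live.values())
                    for snap in self.snapshots.values():
                        referenced |= set(snap.values())
                    for cid in list(self.chunks):
                        if cid not in referenced:
                            del self.chunks[cid]            # old chunk free to reuse

            f = CowFile()
            f.write(0, "v1")
            f.snapshot("before-upgrade")
            f.write(0, "v2")                # the "v1" chunk survives, held by the snapshot
            assert f.chunks[f.snapshots["before-upgrade"][0]] == "v1"
            assert f.chunks[f.live[0]] == "v2"

          That is also why taking a snapshot is effectively instant: nothing is copied except the map.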

      3. InterSystems, and a few other enterprise vendors, do not certify their software for ZFS, arguing it is too slow and too resource hungry. OpenZFS has also dropped clangers recently, with zeroed blocks and other bugs that spanned several point releases.

        ZFS is amazing, but isn’t perfect and isn’t always usable.

        1. ZFS is not slow. I suspect they are talking about use cases which favour distributed block storage systems (which are even less like a “traditional” file system than ZFS), or they are not optimizing the implementation at all, possibly because they are using VMs.

          ZFS is an enterprise filesystem, and it’s not for everyone, but it is immensely powerful and has had very few notable bugs in the decades since it was released. Its overhead is entirely a function of the extended features you use: snapshots cost nothing, compression is lightweight, but online checksums and deduplication are not.

        2. ZFS can be tuned to not be so heavy on RAM, which is the main thing it’s been notorious for hogging when resilvering/scrubbing or during other I/O-heavy operations on the volume.

          Though that’s where fast SSD caches come in since they can be set to offset RAM consumption.

          Though I’ve run a setup with 1.5 TB discs on a low-power Intel Atom system with merely 4 GB of RAM.

          People said it’d choke, but it worked fine with some parameter finagling.

          1. Tbh, so long as you don’t turn on dedupe (and you shouldn’t), ZFS doesn’t use very much RAM at all. It can by default use up to 50% of RAM for the ARC cache, but modern ZFS releases this immediately if the system comes under memory pressure.

        1. The comment above is not meant to be hostile, it’s meant to be truthful.

          There are legitimate and important reasons to use ZFS or others over ext4, and if you don’t know enough about ZFS then you don’t know them, and so just think ext4 is enough.

    2. I am still half ext4, half xfs on the servers i manage, as i haven’t found any comprehensive data on why one is clearly better than the other – i was all ext2, then ext3, then ext4, but centos 7 was xfs by default so i left it at that and started to use it elsewhere as well.

      1. I can think of one reason to use xfs over ext4 in the enterprise. When you extend a volume, it has to format the new space. This is near instant on xfs, but takes some time (depending on the size, a LOT of time) on ext4.

        1. Huh? That is not correct, or at best misleading. resize2fs is pretty fast even when dealing with larger drives (1-8 TB). If the ext4 filesystem is OK and clean, it should take seconds.

    3. i also don’t understand why people choose newer filesystems.


      1. ZFS’s filesystem layer having visibility into which disks have received new writes allows it to overcome the RAID5 write hole. Its RAID implementation is (alongside mdadm) also significantly better than pretty much any hardware RAID. As usual, open source software is better implemented and less buggy than closed-source vendor firmware. There’s RAID controller firmware out there that writes its config info over the data itself and then relies on the RAID’s error correction to reconstruct what the data should have been. It’s terrible.
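        For anyone unfamiliar with the write hole mentioned above, here is a deterministic toy in Python (a 3-disk RAID5 stripe reduced to two data blocks plus XOR parity, with made-up values): if power dies after the data block is updated but before the parity is, a later disk failure reconstructs garbage, and plain RAID has no checksum to notice.

          def xor(a: bytes, b: bytes) -> bytes:
              return bytes(x ^ y for x, y in zip(a, b))

          d0, d1 = b"AAAA", b"BBBB"
          parity = xor(d0, d1)            # consistent stripe across three "disks"

          d0 = b"CCCC"                    # new data reaches disk 0...
          # ...power loss here: parity is never updated to xor(b"CCCC", d1)

          # later, disk 1 dies and gets rebuilt from the surviving members
          d1_rebuilt = xor(parity, d0)    # uses stale parity
          assert d1_rebuilt != b"BBBB"    # silent corruption, nothing flags it

        RAIDZ sidesteps this by never overwriting a live stripe in place, and the checksums let it tell good reconstructions from bad ones.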

        1. As for why? ZFS (and raids like mdadm) allow correction of corrupted drives. You can “repair” ext4 but that only stops the computer from crashing when it reads the filesystem. None of your data is coming back, and corrupted files will go unnoticed.

    4. Also EXT4, but under MooseFS. To which you might ask: good lord, why? The main answer is low cognitive burden. I can see the health of the system at a glance from the web app monitor. Adding disks is so easy, and I can shuffle disks between the chunk servers with very little hassle. I do have a disk that has had a few sectors go bad that ext4 didn’t detect but MooseFS fixed. The other reason is that I was able to set up my server across two physically separate buildings, so that a fire or lightning strike in one can be easily recovered from. This is reference and archive data; I’d hate to lose it, but could get the critical stuff back from bup-based backups. I wanted to try bcachefs as I think it might solve my biggest gripe about btrfs, the cognitive burden: it requires careful inspection to see what is what in a btrfs system. I’m hopeful that bcachefs will stabilize or that GUI tools for btrfs improve, but for now MooseFS over ext4 has been a godsend.

  2. Questionable filesystem stability seems like the kind of thing that should be resolved/proven using a formal proof. To some, that may seem an extreme measure at first glance, but flawed data storage algorithms are unacceptable.

    1. ideally the algorithms are proven before they are implemented, but then proving a lack of implementation flaws is very difficult. there are tools for that, but they run into harder limits.

    2. Now that the seL4 team have demonstrated how to do software formal proofs cheaper and quicker than previously, this is potentially viable. But, yes, this is important. ReiserFS, Reiser4, and BtrFS have all had serious stability issues. Not sure if the clangers OpenZFS dropped could have been spotted sooner, but maybe.

      1. By and large ZFS has been pretty solid. So far there’s only been one bug that resulted in data loss, and it was in a situation where, if a file was written and then very rapidly read, the reader could see zeros. The data itself was always written to disk perfectly and is always read back perfectly later.

        This hasn’t been due to formal proofs though, but really aggressive testing. ZFS began with a really comprehensive test suite and went hard on CI long before that became a widespread phenomenon. ZFS’ age certainly helps too: a lot of really terrible performance issues from the early 2000s have been fixed, and I’d trust it a lot more now than then.

        As for the feasibility of formal proofs on something as large and complex as a filesystem? Not really possible. It would be nice (as would formal proofs of correctness for CPUs and compilers) but it’s too large and intractable at this point. Even seL4’s claims about formal correctness are a lot weaker than they first appear and require digesting a lot of context to understand what they’re actually assuring.

      1. All things are developed “on the kernel.” In-tree development has been the preferred method since the beginning. I think the reliability of bcachefs at this stage has maybe been oversold, but at the end of the day filesystems take a very long time (years or decades) to mature, and anyone using one younger than that is implicitly accepting the risk to their data.

    3. Ideally, yes. But like so many other things that really ought to be formally proven (CPU correctness, compiler correctness, etc) nobody has any idea how to do that. The problem is so large and complex that it’s intractable.

  3. Sad to see, since the project plan has a number of nice features. Overstreet has shown himself to play fast and loose with his codebase, with a marked disrespect towards the few users that are willing to test things out with actual workloads.

    Given how people are still wary of filesystems like BTRFS due to bugs (e.g. the RAID5 write hole) that have been solved for years, this is probably not going to help the project.

  4. I’ve been burned by btrfs and snapshots enough (admittedly a while ago, in SUSE 12 -> 15 upgrades) that I’ve given up on it. For the most part, XFS and ext4 just work. I’ve also been trying bcachefs, and Kent is working hard on stuff and does have a good testing system in place, but… filesystems are hard, and people keep asking for new features for their pet projects in terms of filesystem access.

    Writing blocks to disk is easy. Writing them reliably can be easy. Writing them as fast as possible can be easy. Snapshotting the filesystem is easy. Having one process write or read is easy. Having multiple processes write correctly at the same time is easy. Combining this all into one glorious whole? Damn tough, because doing anything fast but at the same time reliably takes mucho work.

    And I’m proud to be a member of the “flamed by kent” club. LOL!

  5. This article is utter bullshit. Kent takes data loss very seriously. If you go to his IRC or contact him in some way to talk about your broken FS, there is close to zero chance that the FS will lose your data. He wants it to be the “FS that doesn’t lose your data,” and he means it.

    The current data loss was unfortunate. The people did things they shouldn’t have done, and I do partially blame Kent for that, as the operations they did went outside the “won’t lose your data” guarantee and the tool didn’t properly warn about that.

    But calling Bcachefs alpha-level quality is absolute bullshit. BTRFS was a lot more unstable in the kernel than (in-kernel) bcachefs ever was. Sure, Kent pushes a lot of new code and I’m surprised by how much still changes, and the advertisement and hype don’t help either. But it is still marked experimental for a reason.

    On the other hand, Kent seems to be a difficult person to work with. Linus’s reasons for repeatedly speaking up against Kent were usually personal ones: Kent submitted patches late in the dev cycle, didn’t talk to the maintainers, and once (IIRC) even pushed code that had been rejected earlier. And Kent sometimes insults people, and got banned from committing code for one kernel cycle because of that.

    Maybe Bcachefs should have stayed out of the kernel for another year or two. I firmly believe that if it were merged today in its current state, or maybe even at the end of the year, it would be praised as the best FS for most tasks. (Of course, some tasks will always be worse on COW filesystems, and there are still some benchmarks where BTRFS is faster.)

  6. filesystems are one place I want to be far, far away from the ‘bleeding edge’ of. This is not Windows 95, where a monthly re-install is acceptable. ext is extremely mature and very, very stable. It might not have all the latest shinys, but reliability and safety are monumentally more important.

    But more concerning than all that is the way code is just “YOLO’d” out there. That just underscores the apprehension about safety and stability. If the development process doesn’t have a very structured and rigorous approach, it undermines the confidence it needs to engender in order to gain adoption. If every fix is an emergency fix that needs heroic urgency and an exception to the usual release rules, and this happens with regularity… yeah, naw, not gonna trust my data to that.

    1. Of course, which is why people tend to like ZFS, among other reasons: it’s mature and stable. Bcachefs isn’t even complete, and many of the issues Linus has with it have to do with the developer’s bad practices and violations of kernel policy, more than with whether an alpha filesystem is stable.

    2. “ext is extremely mature and very very stable. It might not have all the latest shinys, but reliability and safety is monumentally more important.”

      Which has to be assumed, considering how long ext has been worked on.
      It’s a minimum requirement, considering how much time ext4 had to come to be.
      Other filesystems reached maturity within a couple of years, whereas ext needed decades. Almost 15 years from ext to ext4, I think.

  7. I just use ext4 for my file systems for home use. Corporations may have different needs, but I feel the extra features aren’t needed for my use-case. KISS is my philosophy, and it has never let me down. I feel backups are way more important than what disk format I use. Also, I don’t need to extend storage over multiple drives. My drives are minimum 2TB, so I have plenty of space. If there comes a time I need more ‘space’, I’ll just buy a bigger one. Got to 4TB… So what if I am down for 30 minutes, 1 hour, a day, swapping out a disk and restoring the data to it.

    1. Same here. I just don’t have the patience for “new and sparkly” when it comes to computer technology, anymore. Sure, I remember playing with plenty of filesystems, e.g. btrfs, xfs, reiserfs, etc. years ago. It was fun, and I sure learned a lot, but oh boy did I pay for it with my time.

      Somewhere along the way, I needed to simplify my workflow and get things done. That meant using software, tools, and filesystems that were reliable, proven, and worked very well within established norms. I’ll pick one out of the air: Clonezilla. Fantastic tool, and has worked perfectly every time for me, on ext3/4 and NTFS partitions. It probably works fine on many others too, but I KNOW it works for my needs, through experience. Timeshift has also been pretty good, although it’s better for when you realize you messed up your OS horribly than for when the system disk suddenly dies.

      Point is, I’m using what I know to work.

      That said, I do run a couple of small ZFS pools locally, one RAIDZ2 on six 2-TB HDDs, the other a RAIDZ1 on a handful of 512-GB SSDs. It’s definitely a budget-build machine, using a transplanted Dell OptiPlex 7010 motherboard and a used LSI HBA, stuffed into a cheap Corsair PC chassis. But it’s been humming along, running my Website, holding local data and backups, etc. on Arch, with no complaints. It’s probably the most boring machine I’ve ever built, because it just works, and I’ve not had to twiddle the configuration much after initial setup.

  8. I have been using bcachefs on my NAS for like 5 years now? I’ve never once lost data, verified with regular hashdeep checks. I HAVE had errors that prevented me from mounting until Kent could push a fix, but he has always been very quick to respond as long as I provide the requested information and debugging data. I have like 3 or 4 branches with my username on them from me breaking the early scalability lol.

    I do not fear data loss with bcachefs, but I also know I might not get perfect uptime, though I haven’t had any of those issues either in quite a while. Overall I’ve had a better experience than with btrfs, where I would have systems regularly eat themselves into total data loss and any attempt to ask for help was met with silence. The btrfs documentation says that you should not use the fsck unless directed… That should tell you all you need to know about its integrity. Meanwhile, Kent says to try fsck first; if it doesn’t work, that’s a bug, and as long as you provide debug data he will prioritize fixing it. That’s what I personally want in a fs dev.
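    For anyone wanting to run the same kind of integrity audit, here is a rough Python stand-in for the hashdeep workflow mentioned above (the example paths and the manifest filename are made up): record a manifest of SHA-256 sums on one run, then diff against it on the next to catch files that silently changed or vanished.

      import hashlib, json, os, sys

      def sha256(path):
          h = hashlib.sha256()
          with open(path, "rb") as f:
              for chunk in iter(lambda: f.read(1 << 20), b""):
                  h.update(chunk)
          return h.hexdigest()

      def manifest(root):
          return {os.path.relpath(os.path.join(d, n), root): sha256(os.path.join(d, n))
                  for d, _, names in os.walk(root) for n in names}

      if __name__ == "__main__":
          root, manifest_file = sys.argv[1], sys.argv[2]   # e.g. /mnt/nas hashes.json
          current = manifest(root)
          if os.path.exists(manifest_file):
              with open(manifest_file) as f:
                  previous = json.load(f)
              for path, digest in previous.items():
                  if current.get(path) != digest:
                      print("CHANGED OR MISSING:", path)
          with open(manifest_file, "w") as f:
              json.dump(current, f, indent=1)

    Real hashdeep also records file sizes and supports several hash algorithms; this is just the core idea.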

    1. 5 years! Wow, a true pioneer. I have it as the root filesystem on a few systems; I thought that was bold. I have not had any problems. I am looking forward to a stable structure so I can go back and forth between current and LTS kernels, though.

      I love how engaged Kent is. That, and the features, are why I have been using bcachefs.

  9. to summarize what i learned about ZFS over the last day: it is assertively for people with larger farms of disks, where the configuration overhead is a delightful trove of valuable features instead of a nuisance voyage of discovery. it’s for sun e10k customers. one of the goodies from the brief era of opensolaris.

    btrfs seems to have the same purpose but a different pedigree and perhaps less maturity.

    neither one of them is for a guy like me who buys disks three at a time and puts a pair in md raid1 ext4 and the third in a USB enclosure for offline backups (monthly rsync).

    just pleased that the question ‘who wants this?’ had such a clear answer

    1. Oh man, ZFS is light years better than ext4 for exactly that use case. RAIDZ, on-the-fly built-in compression/encryption, snapshots, and send-receive are all huge improvements over mdadm+ext4.

      RAIDZ allows the array to self-heal from corrupted data rather than just detect it. When RAID5 detects that data on two drives is different, it has no way to know which is correct. ZFS checksums everything, so it can correct the data and go about its day. (This saved one of my servers with a flaky SATA cable once, which silently corrupted a few hundred KB per month until I tracked down the issue and got the cable replaced.)

      Built-in, fast compression can save significant space without impacting performance. In fact, it can often improve performance, especially on slower, spinning media, where reading the compressed data and decompressing it is significantly faster than reading the uncompressed data.

      Snapshots allow Time-Machine-like backups on the array, allowing you to browse your entire file system as it existed at a specific point in the past, or roll back to it. Taking a snapshot is atomic and instantaneous.

      Send/Receive allows easy incremental backups to another array (your USB enclosure). Incremental backups from one snapshot to another are incredibly efficient, without requiring resource-consuming scanning of directory trees or file content like rsync does. Since snapshots are done at the block level, ZFS already knows which blocks to send to add the newer snapshot to the backup that already has the older one (roughly as sketched at the end of this comment). The amount of data sent is minimal, and it can be sent without re-compressing or re-encrypting blocks during the transfer. If the backup pool is in an untrusted location, like cloud storage, you don’t even have to unlock it to back up an encrypted ZFS dataset, since the encrypted blocks can just be sent as-is.

      Ubuntu even allows the boot drive to be formatted with ZFS, and pfSense takes it one step further, doing it by default. And all Apple’s devices now use a similar copy-on-write filesystem called APFS.

      ZFS is incredibly useful, and I highly recommend you look into it. If you’re already setting up RAID and rsync backups, you’ve got the skills, and ZFS will make life SO much better.
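      To make the incremental send point concrete, here is a toy Python sketch (not ZFS’s actual mechanism, which tracks block birth times; a plain map comparison stands in for it): if each snapshot is a block map, the delta to ship is just the blocks that changed or disappeared, with no directory walking or file hashing.

        def incremental_send(old_snap: dict, new_snap: dict):
            delta, deleted = {}, []
            for block, data in new_snap.items():
                if old_snap.get(block) != data:
                    delta[block] = data          # new or modified block
            for block in old_snap:
                if block not in new_snap:
                    deleted.append(block)        # freed on the source
            return delta, deleted

        def receive(backup: dict, delta: dict, deleted: list):
            backup.update(delta)
            for block in deleted:
                backup.pop(block, None)

        snap1 = {0: "boot", 1: "home-v1", 2: "scratch"}
        snap2 = {0: "boot", 1: "home-v2"}        # one block rewritten, one freed
        backup = dict(snap1)                     # remote side already holds snap1
        receive(backup, *incremental_send(snap1, snap2))
        assert backup == snap2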
