Linux Fu: The SSD Super Cache

NVMe solid-state drives have become inexpensive unless you want the very largest sizes. But how do you get the most out of one? There are two basic strategies: you can use the drive as fast storage for things you use a lot, or you can use it to cache a slower drive.

Each method has advantages and disadvantages. On an existing system, moving high-traffic directories over to the SSD requires a bind mount or, at least, a symbolic link. And if your main filesystem uses RAID, for example, the files you move over lose that protection.
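
For example, shifting one hot directory onto the SSD with a bind mount looks roughly like this. This is a minimal sketch; /ssd and /var/cache are just stand-ins for wherever you mount the drive and whatever you decide to move:

# one-time copy of the hot directory onto the SSD (-X keeps extended attributes)
sudo mkdir -p /ssd/var/cache
sudo rsync -aX /var/cache/ /ssd/var/cache/

# then overlay the original location with the SSD copy
sudo mount --bind /ssd/var/cache /var/cache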

Caching sounds good in theory, but there are at least two issues. You generally have to choose between a “write-through” cache, where writes are slow because every write hits both the cache and the underlying disk, and a “write-back” cache, where the cache flushes to disk occasionally. The problem with write-back is that if the system crashes or the cache fails between flushes, you will lose data.

Compromise

For some time, I’ve adopted a hybrid approach. Most of my SSD is an LVM cache that hides the terrible performance of my root filesystem’s RAID array. However, I keep some selected high-traffic, low-importance files in specific SSD directories that I either bind-mount or symlink into the main directory tree. In addition, I put as much as I can in tmpfs, a RAM-backed filesystem, so things like /tmp don’t hit the disks at all.

There are plenty of ways to get SSD caching on Linux, and I won’t explain any particular one here. I’ve used several, but I’ve wound up with LVM caching because it requires the least extra machinery and works well enough.
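
If you do want to try the LVM route, the rough shape looks like this. It is only a sketch, not my exact setup: the volume group name, device, and size are made up, and --cachemode is where you make the write-through versus write-back choice discussed above:

# make the NVMe partition a PV and add it to the VG that holds the slow root LV
sudo pvcreate /dev/nvme0n1p2
sudo vgextend vg0 /dev/nvme0n1p2

# carve a cache pool out of the NVMe space
sudo lvcreate --type cache-pool -L 200G -n fastcache vg0 /dev/nvme0n1p2

# attach the cache pool to the existing root LV
sudo lvconvert --type cache --cachepool vg0/fastcache --cachemode writethrough vg0/root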

This arrangement works just fine and gives you the best of both worlds. Things like /var/log and /var/spool are super fast and don’t bog down the main disk. Yet the main disk is protected and much faster thanks to the cache. That’s how things ran for a number of years, until recently.

The Upgrade Issue

I recently decided to give up using KDE Neon on my main desktop computer and switch to OpenSUSE Tumbleweed, which is a story in itself. The hybrid caching scheme seemed to work, but in reality, it was subtly broken. The reason? SELinux.

Tumbleweed uses SELinux as a second layer of access protection. On vanilla Linux, you have a user and a group. Files have permissions for a specific user, a specific group, and everyone else. Permission, in general, means whether a given user or group member can read, write, or execute the file.

SELinux adds much more granularity to protection. You can create rules that, for example, allow certain processes to write to a directory but not read from it. This post, though, isn’t about SELinux fundamentals. If you want a detailed deep dive from Red Hat, check out the video below.
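
Even without a deep dive, it helps to know that every file carries a label, called a context, which you can inspect with ls. The exact output varies by policy and distribution, but it looks something like this:

ls -dZ /var/log
# system_u:object_r:var_log_t:s0 /var/log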

The Problem

The problem is that when you put files on the SSD and then overlay them, they live in two different places. If you tell SELinux to “relabel” files, that is, restore their system-defined security contexts, there is a chance it will see something like /SSD/var/log/syslog and not realize that this is really the same file as /var/log/syslog. Once you get the wrong label on a system file like that, bad, unpredictable things happen.
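
You can see the trouble coming by asking SELinux what label it thinks each path should get. Before any equivalence rule exists, the copy on the SSD gets a generic answer. The paths and output below are illustrative:

matchpathcon /var/log/syslog
# /var/log/syslog      system_u:object_r:var_log_t:s0
matchpathcon /SSD/var/log/syslog
# /SSD/var/log/syslog  system_u:object_r:default_t:s0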

There is a way to set up an “equivalence rule” in SELinux, but there’s a catch. At first, I had the SSD mounted at /usr/local/FAST. So, for example, I would have /usr/local/FAST/var/log. When you try to equate /usr/local/FAST/var to /var, you run into a problem: there is already a rule that treats /usr/local the same as /usr, and getting SELinux to layer another equivalence on top of that throws a wrench in the works.

There are probably several ways to solve this, but I took the easy way out: I remounted the SSD at /FAST. Then it was easy enough to create rules mapping /FAST/var/log to /var/log, and so on. To create an equivalence, you enter:


semanage fcontext -a -e /var/log /FAST/var/log
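
Once the rule is in, you can check that it took and then relabel the SSD side so anything already there picks up the right contexts. This is a sketch; adjust the paths to match your own layout:

# list only local customizations; the new equivalence should show up here
sudo semanage fcontext -l -C

# apply the now-equivalent labels to files already on the SSD
sudo restorecon -Rv /FAST/var/log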

The Final Answer

So what did I wind up with? Here’s my current /etc/fstab:


UUID=6baad408-2979-2222-1010-9e65151e07be /              ext4    defaults,lazytime,commit=300 0 1
tmpfs                                     /tmp           tmpfs   mode=1777,nosuid,nodev 0 0
UUID=cec30235-3a3a-4705-885e-a699e9ed3064 /boot          ext4    defaults,lazytime,commit=300,inode_readahead_blks=64 0 2
UUID=ABE5-BDA4                            /boot/efi      vfat    defaults,lazytime 0 2
tmpfs                                       /var/tmp    tmpfs  rw,nosuid,nodev,noexec,mode=1777 0 0

# NVMe fast tiers

UUID=c71ad166-c251-47dd-804a-05feb57e37f1 /FAST  ext4  defaults,noatime,lazytime  0  2
/FAST/var/log /var/log  none  bind,x-systemd.requires-mounts-for=/FAST 0 0
/FAST/usr/lib/sysimage/rpm /usr/lib/sysimage/rpm none bind,x-systemd.requires-mounts-for=/FAST 0 0
/FAST/var/spool /var/spool  none  bind,x-systemd.requires-mounts-for=/FAST 0 0
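
After editing fstab, a systemctl daemon-reload followed by mount -a (or just a reboot) puts the bind mounts in place, and findmnt confirms they landed where you expect. The device name below is only an example:

findmnt /var/log
# TARGET    SOURCE                    FSTYPE  OPTIONS
# /var/log  /dev/nvme0n1p3[/var/log]  ext4    rw,noatime,lazytime,...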

As for the SELinux rules:


/FAST/var/log = /var/log
/FAST/var/spool = /var/spool
/FAST/alw/.cache = /home/alw/.cache
/FAST/usr/lib/sysimage/rpm = /usr/lib/sysimage/rpm
/FAST/alw/.config = /home/alw/.config
/FAST/alw/.zen = /home/alw/.zen

Note that some of these don’t appear in /etc/fstab because they are symlinks.
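
For the symlinked entries, the setup is just a move plus a link, done while the files are not in use, followed by a relabel. Something along these lines, using the paths from my rules above:

mv ~/.cache /FAST/alw/.cache
ln -s /FAST/alw/.cache ~/.cache
sudo restorecon -Rv /FAST/alw/.cache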

A good rule of thumb: once everything is set up, asking SELinux to relabel the tree on either side, the SSD copy or the “real” location, shouldn’t change anything. If you see many changes, you probably have a problem:


restorecon -Rv /FAST/var/log

Worth It?

Was it worth it? I can certainly feel the difference in the system when I don’t have this setup, especially without the cache. The noisy drives quiet down nicely when most of the normal working set is wholly enclosed in the cache.

This setup has worked well for many years, and the only really big issue was the introduction of SELinux. Of course, for my purposes, I could probably just disable SELinux. But it does make sense to keep it on if you can manage it.

If you have recently switched on SELinux, it is useful to keep an eye on:


ausearch -m AVC -ts recent

That shows you if SELinux denied any access recently. Another useful command:


systemctl status setroubleshootd.service

Another good systemd “stupid trick.” Often, any mysterious issues will show up in one of those two places. If you are on a single-user desktop, it isn’t a bad idea to retry any strange anomalies with SELinux temporarily switched to permissive mode as a test: setenforce 0. If the problem goes away, it is a sure bet that something is wrong with your SELinux configuration.
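
The test cycle is quick; just remember to turn enforcement back on when you are done:

sudo setenforce 0                  # permissive: denials are logged but not enforced
# ...retry whatever was misbehaving...
sudo setenforce 1                  # back to enforcing
sudo ausearch -m AVC -ts recent    # review what would have been denied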

Of course, every situation is different. If you don’t need RAID or a huge amount of storage, maybe just use an SSD as your root system and be done with it. That would certainly be easier. But, in typical Linux fashion, you can make of it whatever you want. We like that.

14 thoughts on “Linux Fu: The SSD Super Cache”

  1. I was wanting more performant disks just this morning. I’m kind of shooting in the dark because i haven’t done the work to properly diagnose it but what i think happens is that my attempts to find an option that lets close() complete without flushing the RAM cache to disk have failed. (I have tried noatime, lazytime, async) So at random a lot of things have a user-facing 1 second delay in them. And sometimes (like this morning) a much longer delay. The longer delay is probably due to using SMR disks in a raid1 for most of my storage. Which intermittently fills up its own layered caches and must block for an extended period as it flushes those to SMR.

    So my point is, i considered that it would be worth doing if i could make another layer of cache decently transparent. But then i realized, the problem i face today is caused by a decently transparent cache that fills up with queued writes. So i’m back to being pessimistic about what little good even a write-back cache can do for me.

    Anyways i keep my big builds on an SSD. Not remotely transparent, nor particularly safe (i use git for safety). But very fast for the specific bottleneck.

    1. Disclaimer, this setup is meant for a laptop with a battery, and should not be used for a database etc. YMMV =)

      SSD: a proper SSD from Samsung or HP/IBM with an SLC write-cache area, TRIM support, and onboard buffer RAM. Cheap SSDs are just not worth a $2k data recovery that probably won’t work.
      kernel: Set your RAM block cache to an absurd size (for installed memory > 8GiB), a long dirty-page flush interval (defers writes), and swappiness of 1 (don’t flush blocks to the SSD unless necessary). Example:

      sudo nano /etc/sysctl.conf

      vm.swappiness=1
      vm.vfs_cache_pressure=50
      vm.dirty_background_ratio=60
      vm.dirty_ratio=80
      vm.dirty_background_bytes=2684354560
      vm.dirty_bytes=5368709120
      vm.dirty_writeback_centisecs=10000
      vm.dirty_expire_centisecs=6000
      vm.min_free_kbytes = 16384
      fs.file-max=120000
      kernel.pid_max = 32767
      net.ipv4.ip_local_port_range = 3000 64000
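
      To load these without a reboot (assuming they went into /etc/sysctl.conf as above):

      sudo sysctl -p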

      OS: for root use the read-only mode overlayfs on a ramdrive with a standard ext4fs. Ubuntu has a meta-package that will set this up for you, and adding a grub menu to enable write mode for updates is easy. This will take care of most SSD wear sources:

      sudo apt install overlayroot f2fs-tools
      sudo nano /etc/overlayroot.conf
      overlayroot="tmpfs:swap=1,recurse=0"

      Quality of life tweak: set different role contexts for grub
      sudo nano /etc/default/grub
      GRUB_SAVEDEFAULT=true
      GRUB_DEFAULT=saved
      GRUB_TIMEOUT=3
      GRUB_RECORDFAIL_TIMEOUT=$GRUB_TIMEOUT
      GRUB_TIMEOUT_STYLE=menu
      GRUB_TERMINAL=console
      GRUB_CMDLINE_LINUX_READONLY="quiet splash i915.tuxedo_disable_psr2=1 i915.enable_psr=0"
      GRUB_CMDLINE_LINUX_DEFAULT="quiet splash i915.tuxedo_disable_psr2=1 i915.enable_psr=0 overlayroot=disabled fsck.mode=force fsck.repair=yes"
      insert an auto menu item for the read only OS boot up

      sudo nano /etc/grub.d/10_linux

      if [ "x$is_top_level" = xtrue ] && [ "x${GRUB_DISABLE_SUBMENU}" != xtrue ];
      then
      linux_entry "${OS}" "${version}" simple \
      "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_DEFAULT}"

      linux_entry "${OS} READ ONLY OS" "${version}" simple \
      "${GRUB_CMDLINE_LINUX} ${GRUB_CMDLINE_LINUX_READONLY}"

      finally enable the role selection menu with your current kernel:

      sudo update-grub
      sudo update-initramfs -u

      OS: for /home use the F2FS logging filesystem (it will almost double the drive’s life). This can also be a specific user’s LUKS-encrypted partition (other paths public for a web server, etc.), and disable folder date updates on anything that is going to get accessed thousands of times a day.

      sudo nano /etc/fstab

      /dev/mapper/somelukspart /home/someuser f2fs defaults,noatime,nodiratime,noquota,discard,nobarrier,inline_xattr,inline_data 0 2

      The trim operation to “clear” flash-deleted space in most distros will ONLY auto-run once a week; this speeds up future SSD writes, and wear is still distributed over a wider area even on cheaper hardware. Also note, “preload” will fill your shared-object cache with the most recently popular application resources you used (run your email and browser before rebooting into read-only mode).

      sudo apt-get install preload hdparm util-linux
      sudo hdparm -I /dev/sda

      If SSD hardware states Trim is supported:

      sudo systemctl enable fstrim.timer
      sudo systemctl start fstrim.timer
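
      To confirm the timer is active and that trim actually works on a mounted filesystem:

      systemctl list-timers fstrim.timer
      sudo fstrim -v /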

      Finally, after setting up all your programs, use the SSD manufacturer’s tool (usually in Windows) to flush the SSD SLC buffer area into the long-term flash memory before booting into read-only mode.
      Reboot into “read-only” mode after testing everything works

      Seen some systems with over 5 years of uptime that used this setup with a cheap SanDisk TLC flash SSD.

      Good luck
      J

      1. Very detailed, thanks…but I use a simpler way: put root, boot, and efi on SSD (NVMe), and then create swap, var, tmp, and home on spinning rust (make sure var and tmp are big enough). NVMe will only take writes on software updates. I only run a manual trim once a year on the NVMe drive, and I patch about twice a year…no auto patching on my machines. I only run XFS on internal drives, so this gives me the speed and reliability of XFS, without wearing out my SSD drives. I stay in an area in South Africa where the power can drop out on you daily…multiple times, and without warning, sometimes for many days (cable and equipment theft). A very good, name brand power supply lasts about a year, but I have not lost a single file to corruption yet in over 7 years.

        There is a stupid pulseaudio library that, when loaded, creates a .cache and a .config directory in root…even if you are not using pulseaudio (just like systemd, pulseaudio is tied like a disease into so many packages). It looks ugly, but leave the two directories there after they are created…they are not updated if you don’t run pulseaudio.

        For external SSD, I create my file systems as ext4…without log. External spinning rust?…XFS… :-)

        1. Very lucky, most consumer HDD hardware can’t be trusted beyond 2 to 3 years of 24/7 use now, and starts with a bathtub curve >1.3% for failure rates. XFS also has a rather simplified Trim routine that may simply ignore already previously Trim-erased blocks. Spinning rust >2TB has become less reliable, and consumes more sector spares on drive firmware remapping than many know. HDDs only make sense now for sleeping JBODs, where the $/capacity is preferable to performance and reliability.

          ext4 with an external journal drive is quite performant, and usually handles Trim more effectively than XFS:

          https://libre-consulting.com/posts/fs_ext4_external_journal/

          F2FS logging is a bit slower, but on /home the advantages make it pretty reliable as it simply rolls back drive state to the last non-corrupt 10MiB entry… given it defers old state deletion. It is a small file-type aware system, and on an SSD is quite usable. ZFS is good on flash too, but unless you are running a 4 split channel 4x4x4x4 card for NVMe SSD it costs a lot of RAM for simple workstations.

          It is a personal choice everyone makes, as bugs tend to be less common in more popular formats. btrfs looks great on paper until something goes wrong… some got hit hard.

          Those who regularly back up… laugh last, of course. Cheers =)

          1. If you can keep writes away from SSD, AND use it often, they last quite long. My 1st gen Korg Kronos is still running its original 32GB SSD (about 16-17 years?). However, I forgot to mention…once a year I ‘dd’ the NVMe drives into ‘/dev/null’ to make sure that all cells are refreshed, or re-allocated if necessary.

            Have used XFS for over 20 years, so I trust it. If it works flawlessly underneath a 12TB HANA database engine, it is rock-solid.

            BtrFS?…nope…been bitten by it a few times in the early openSuSE days, and I still see corruption reports with recent versions. BtrFS needs to back down on features, and stabilize first.

            Yeah…spinning rust on a workstation only makes sense <= 2TB, but even bigger drives with SMR I find “acceptable” in performance for my use cases. This is more than enough for workstation use. The spinning rust performance “just” about cuts it for my gaming use (but waiting for the Steam client to load is a pain though). My “server” runs a large ZFS array on spinning disk, with NFS exports that automount on my workstations, so I don’t need that much storage inside my workstations. I have tried ZFS on a workstation (FreeBSD), but yes…you’re correct. The RAM use for ZFS doesn’t really work on a workstation…it was meant for dedicated server use. I only use spinning disk for backups…would never trust that to SSD.

            Backups?…’rsync’ is your friend… :-)

  2. Just to nitpick: the idea that you lose data if you use ‘writeback’ caching is only true if the cache you’re using is ephemeral, such as RAM caches.

    If you use non-ephemeral media like SSD caching, writeback is perfectly fine… with one caveat: The hardware is not immune to power outages.

    If an SSD is in the middle of writing a flash block during a power outage, it can lose data or leave a hole in the data. But that’s no different than with HDDs, which also have volatile DRAM caches. It’s even more relevant for ‘SMR’ (Shingled Magnetic Recording) disks, which share the same issue as SSDs: they have to read and then re-write a whole area. If timed poorly, a power cut there can leave a significant hole in the data.

    Datacenter SSDs have a gigantic capacitor bank (PLP capacitors) to deal with these powerloss events, which makes them more resilient than harddisks (because you can’t spin a mechanical platter with a capacitor bank, that’s too power heavy).

    If you have consumer hardware, and regular power outages, it pays to have a UPS.

    1. I’m not sure how effective it is but HDDs have a budget of how long they will spin after the power goes out, and ideally it’s enough to flush their own internal caches. That’s possible because the cache and the power-out-capacitor / platter inertia are all engineered under the same roof. But for an OS, it’s harder to come up with a guarantee like that. So if you’re using SSD as a write-back cache, that can be pretty reliable but you’ll need to ensure all writes at least make it to the SSD before the power goes out…it limits the possible architectures.

  3. I find most of my performance issues to be related to Windows hammering a drive doing something it thinks is important that I don’t (indexing, prefetching, scanning something or other).

    I have no such issue in Linux.

    It can clog my drive so much that the internet doesn’t work until it’s done!

    It’s really quite remarkable how poorly it can be designed.

  4. LOL. This article was either written months ago or the author hasn’t looked at prices, as SSDs are skyrocketing in price and will continue to as long as data centers buy up all the NAND!

  5. I used to have a complicated SSD/HDD mix in my computer (using some mostly-undocumented PowerShell to set up a tiered Storage Space), but eventually I decided to move all my spinning hard drives into a NAS, and my desktop computer is now SSD only. Gigabit networking is generally faster than a hard drive, so I’m not losing anything there.
    I find this to be an easier to manage system. My main computer is entirely SSD, so I don’t have to worry about where a particular file is going to be saved. The NAS is used for storing things like media files, that take up a lot of space but don’t need super fast access, and for backups.
