Leveraging the GPU to accelerate the Linux kernel


Powerful graphics cards are pretty affordable these days. Even though we rarely do high-end gaming on our daily machine, we still have a GeForce 9800 GT. That goes to waste on a machine used mainly to publish posts and write code for microcontrollers. But perhaps we can put the GPU to good use when it comes to compile time. The KGPU package enlists your graphics card to help the kernel do some heavy lifting.

This won’t work for just any GPU. The technique uses CUDA, NVIDIA’s parallel computing platform, so it’s limited to NVIDIA hardware. But don’t let a lack of hardware keep you from checking it out. [Weibin Sun] is one of the researchers behind the technique. He posted a whitepaper (PDF) on the topic over at his website.

Add this to the growing list of non-graphics applications for today’s graphics hardware.

UPDATE: Looks like we won’t be trying this out after all. Your GPU must support CUDA compute capability 2.0 or higher. We found ours on this list and it’s only capable of 1.0.

[Thanks John]


  1. wowme@wtf.com says:

    I don’t have nvidia

    • Gdogg says:

      Good to know.

    • Hirudinea says:

      Don’t worry, I’m sure support for the Hercules will be coming out soon.

    • Whatnot says:

      Indeed, fuck nvidia’s proprietary crap when there is a damn open standard. CUDA on linux? That’s just insulting in my view.

      • rasz says:

        Nvidia because he got PAID by Nvidia to do his “research”.

        • xorpunk says:

          Irony is it’s buggy as hell and it’s considered a success…

          I actually use nvidia over ATI because of driver stability and performance. ATI actually has more ALUs and still can’t compete, and that’s saying something…

          • pcf11 says:

            I have been using nVidia’s binary drivers myself ever since I bought myself an MX200. In the very beginning there were a couple of rough patches but overall I have had very few issues with nVidia’s binary driver’s stability. ATI is a mess.

          • Whatnot says:

            I’ve been through nvidia driver hell more than once, for years, but the same goes for ATI. Both are very experimental and beta, in software as well as hardware, but that’s the name of the game. Still, nvidia said they would embrace OpenCL (and DirectCompute, but that’s a windows thing), and seeing as OpenCL is the open standard that suits open source linux, it’s annoying if they try to push the CUDA nonsense (which they at some point even said they’d get rid of) on linux. It’s like encouraging people to make DirectX stuff for linux… you’d also go WTF??!?

      • jc says:

        Right. At least their linux drivers aren’t discontinued. And of course we all know how usable the open source alternatives are. In my view, nvidia is currently the only way to go on linux if you want decent 3d and/or gpu acceleration.

      • draeath says:

        Great, so we should toss all of the work out?

        You’re sure none of the work or ideas could be ported over to OpenCL or something?

  2. Ralph says:

    So, what is the cheapest 4xx, 5xx, or 6xx GPU card out there? And is the cheapest card good enough to do useful work?

    • fluffles says:

      Which series? There are quite a number of them. Going for cheap n good, you can get high-end GT series cards with CUDA 2.1 for 60 dollars. What is ‘useful work’? What do you want to do? What can you do now? Have you worked with CUDA systems before? All the cards will improve your power; by how much varies based on, well, everything.

      Depending on the code and the system you can get 40x+ improvement. See here for some well documented examples:

      Here to benchmark your own:

      • Ralph says:

        Any series. I thought it might be useful to practice programming a GPU. I have no project in mind other than learning. I left the definition of useful work up to people who might answer. I have not worked with CUDA before. I have stayed away from nvidia cards for years, since nvidia would not produce open drivers. I don’t play games (unless learning is a game); Intel and ATI video has done everything I needed. I’ll probably start with the author’s encrypted filesystem drivers using GPUs and some benchmarking of code to factor integers.
        Based on fluffles’s comment, I ordered a GT 610 based card today from Amazon.

      • sa_penguin says:

        I have a 9800 GT myself – its best feature is that it’s passive [no fan noise]. To reach CUDA 2.0 or higher but stay passive, I’m looking at the ZOTAC GT 640. Lousy for fancy games, fine for a little CUDA-powered processing oomph.

    • fluffles says:

      I’ve gotten significant improvement (orders of magnitude) porting personal projects to CUDA on the GeForce 8400. It’s a different computing architecture, and with the proper work load, 8 CUDA cores can beat the crap out of a single x86 core.

    • six677 says:

      You can get a brand new GT 640 for about £75 which will support CUDA 2.1. I run a GTX 460, which is a hair over £100 normally; it also runs 2.1. I have no idea which is the more powerful card. The x60 models are intended for games whereas the x40 models are intended for media, but then the 4xx is older than the 5xx or 6xx, so it isn’t unreasonable to suggest that it is slower (mine is factory overclocked though, and I am thinking about cranking it further).

      It seems the 610 also supports CUDA 2.1 so that might also be a cheap option.

      As for useful work, define useful… CUDA will be faster than the CPU any day, but the gaming performance of the cheap cards will barely beat using the old motherboard/CPU graphics. The 640 isn’t a bad card; not intended for gaming, but I imagine most things will at least run on it. The 460, although old, copes very nicely with newer games. I run skyrim with most settings maxed (a few knocked a notch or 2 down), and with Fraps being used to test framerate over a 1 hour period I got a minimum framerate of 59 and a max of 61, so nicely V-Synced as it should be.

      Just bear in mind: laptops cannot have their GPUs upgraded, and that’s not a limitation the usual hackaday tricks can overcome (exception to the rule: thunderbolt ports now have external PCIe x8 enclosures available, many compatible with running GPUs). Also keep note of what your computer’s power supply can kick out, as it may need upgrading before a GPU upgrade.

  3. MrX says:

    Me either, I only have a crappy mobile 8400GS (CUDA 1.1). I hope this gathers some kernel devs’ attention and makes them rewrite the idea using OpenCL instead.

  4. Mildly Impressed says:

    EVGA GTX 680 Classified /w Backplate

    * using windows : ( *

  5. sdfsdfsdfdsf says:

    Two problems:

    1) GPUs are not energy-efficient. Not even remotely. Intel is cranking out server CPUs with boatloads of cores that use less than 50W. A GTX480 uses upwards of 200W-250W.

    2) It has long been the stance of many Linux folks that things like network and crypto accelerators are a waste of money – that the money is better spent on a slightly faster CPU or better overall architecture – because when you’re not using the extra capacity for network or crypto (for example), it’s available for everything else to go a bit faster.

    A pretty nice graphics card is about $200. That’s more than the difference between the bottom end and top end Intel desktop CPUs on Newegg…

    • Ravyne says:

      Except that, with the right problem, a GPU performs *orders of magnitude* faster than a CPU — I don’t know what definition of efficiency you use, but 5x power draw for 10x, 100x, even 1000x performance seems to fit the definition I know.

      On your second point, modern GPUs are not the same as crypto or network accelerators because they stopped being single-purpose circuits many years ago. In point of fact, GPUs can be used to accelerate both crypto and network functions, audio processing, image processing — any application where you do the same basic operation over a large amount of data — think the map portion of map-reduce.

      Spending $200 on a GPU gives you something on the order of 1 or 2 *thousand* simple CPU cores, plus a gig or two of high-bandwidth RAM, and the memory architecture to make it all work nicely. Spending $200 on a CPU upgrade buys you a few hundred MHz and *maybe* another couple cores, depending where your baseline CPU is.

      It’s very simple really: a CPU is optimized for low-latency processing — to give you a single result as quickly as it can. GPUs are optimized for high-throughput processing — to give you many results reasonably fast, essentially by utilizing economies of scale at the silicon and software level (which, to achieve, you sacrifice some ISA capability, though that will change in the next round or two of GPUs).

      Finally, your point falls flat because many people, indeed most, already have a GPU that’s capable of performing compute functions, and that number is only going to grow as even integrated GPUs provide some pretty non-trivial grunt. Integrated GPUs in some cases even perform better than either CPUs *or* high-end GPUs because the problem isn’t bottle-necked by the PCIe bus.

      Honestly, if you went out and bought/built a computer in the last 3 years, you literally have to go out of your way to come out with a system that’s not capable of any GPU computing.

  6. rasz says:

    TLDR version:
    -accelerated crypto
    -tried accelerated RAID 6 recovery, was on par with CPUs

    • rasz says:

      oh, forgot to mention latency:


      CUDA doesn’t run all of your kernels in parallel; instead it glues them into bundles called “warps” and runs them in series :/ You need a whole lot of them to get decent performance. Oh, did I mention how branches make whole “warps” invalid? If you have a branch inside a warp, the GPU will run the WHOLE “warp” twice per branch.

      Cuda is a nightmare.

      • Brooks Moses says:

        Where did you get the idea that CUDA runs all the warps in series? CUDA bundles the kernels into warps, yes, and the threads within a warp run in lockstep on a set of cores — but that’s only using 32 cores. The cores are massively hardware-multithreaded (like hyperthreading but much more so!), so several warps are running in parallel on that set of 32 cores, and there are dozens and dozens of sets of cores — so you can easily have a couple of hundred warps running in parallel.

        This, of course, is why you need to create so many parallel kernels at the same time to get decent performance.

        Also, if you have a branch inside a warp, CUDA will NOT run the whole warp twice; that wasn’t true even on the early versions. All it does is run both sides of the branch, in series (but it doesn’t run code outside the branch twice), and it only even does that if some threads in the warp take one side and some the other. If all the threads take the same side of the branch, there’s no slowdown, and if the branched code is small, it’s only a minor issue. The only way it will effectively “run the whole warp twice” is if you’ve got the entire kernel wrapped in a big if statement and some threads in the warp take the “true” branch and some take the “false” branch.

      • cptfalcon says:

        This isn’t a very accurate post. Firstly, the GPU has ~10x greater memory throughput than a single-socket CPU. Secondly, submitting literally thousands of threads to the CUDA cores is a mechanism to mask latency while threads are busy waiting for memory to move around. It’s not a very serial operation. If you want fast serial code, use x86. Which brings me to your complaint about warp branching…

        On x86 one of the heuristics to improve serial code execution performance is something called “speculative execution”. When a processor sees a branch, it speculates on the path taken, and continues executing code along that path before fully deciding which path to take! The purpose of this is to keep the pipeline full while waiting for a potentially lengthy conditional statement. It’s part of the reason an x86 processor uses more power than a simple embedded processor. On some RISC processors, this latency hiding was accomplished with a “delay slot”. So thirdly, the GPU uses the immense hardware parallelism to execute both sides of a branch, and pick the one that is correct after the conditional. There is a limit to the depth which can be done, just as with speculative execution.

        However, latency to send data back and forth from the GPU and CPU is one of the big bottlenecks that makes me scratch my head about this. Ideally you would have some sort of task that you want to run over a large chunk of data before returning, such as crypto.

        CUDA is actually a very nice framework and toolchain, compared to other tools for high performance massively parallel architectures. Take a look at programming for the Cell architecture. Without even optimizing for branch prediction, it is fairly easy to port data parallel C code to CUDA and achieve 10x or more speedup. I would say that your complaints towards a GPU architecture applies even more so for x86, which is by far uglier, with an instruction set to legitimately have nightmares over…

        • Brooks says:

          cptfalcon: “So thirdly, the GPU uses the immense hardware parallelism to execute both sides of a branch, and pick the one that is correct after the conditional.”

          No; this is not speculative execution, nor “immense hardware parallelism”, nor anything relating to hiding latency. There is no speculation involved. It’s much more like lane-masking on SIMD vector operations.

          What’s happening is that, for the 32-“core” warp, there is only one instruction decoder, and so all 32 cores must be executing the same instruction at the same time. Thus, if there is a branch where some cores go one way and some go the other, the whole set needs to execute the instructions for both branches. However, the instructions from each branch are “masked” so that, when they are executed on a core that is supposed to be taking the other branch, they don’t write their results into the registers. This gives the effect of one set of cores taking one branch and the others taking the other branch — but a key point is that the mask is known before the instruction executes.

          There are no direct performance advantages for this, unlike with speculative execution on CPUs. The performance advantage is indirect: If you only have one instruction scheduler for every 32 compute units, you can put many more compute units in the same size-and-power budget.

  7. Oli Stabilize says:

    The likelihood of most people running higher-end GPUs and using Linux enough to warrant kernel compilation is rather slim, as gaming on Linux, while fully achievable, is a headache at best. That being said, I’m tinkering with facial recognition and CUDA using the Linux platform.

    • jc says:

      I don’t think so. I have a GTX 570 and I even write kernel code. And I know more people that do too. I think the picture of kernel developers only using intel cards is some sort of a stereotype.
      And then there are tons of people who use Gentoo and derivatives. I think most of us compile our kernels. I think most people multiboot for gaming. Using windows for anything besides gaming is a headache.

      • Oli says:

        I was coming at it from the average computer tinkerer deciding to see the performance differences in CPU vs GPU (aka me). Now, I’m no fanboy of either OS, but I do use windows as my main platform as I do a lot of audio/visual work and I really do like coding in C#.

        Unfortunately my main laptop has switchable graphics and there is just no hope of ever getting it to work effectively on Linux :(. I have done some kernel mods and looked into kernel coding but I have never needed to reinvent the wheel for most of my needs.

        A lot of my friends and the majority of my work peers use lower-end GPUs in their Linux-based systems because they feel that it is sufficient for their needs. Noise is also a big factor with GPUs.

        I think it’s just interesting getting different views on things :).

  8. pcf11 says:

    I only want my nVidia card to play Quake.

  9. Alan says:

    I gather the NEON graphics built into my Beaglebone isn’t good enough, then?

  10. neatbasis says:

    I’ve been waiting for this to happen for a while now. I can’t wait for someone to implement this with radeon cards as well. I believe this is the future of software RAID, crypto and many other applications not yet imagined.

    I’m hyped. I rarely get hyped.

    • rasz says:

      The future of software RAID is using a $200 GPU to achieve the same performance as a CPU?

      • edumacation says:

        No. Technically, if you are doing a number of small parallel threads the GPU will be much, much faster. RAID stands for Redundant Array of Independent Disks. Meaning all RAID does is increase storage reliability or storage speed. RAID (disk I/O) and processing speed are not related. If your program is not multiprocessing aware it will only run on one of the many computational units inside a GPU, just like if it was only running on your CPU. If you want faster software RAID then buy many of the latest SATA solid state drives and set them up for striping. CUDA will do nothing for RAID performance, just as putting a more powerful engine in your car won’t make the suspension better.

        • pcf11 says:

          I detect a little newspeak in your comment, “RAID stands for Redundant Array of Independent Disks.” The I in RAID is supposed to stand for inexpensive. I can see how independent would suit some better though.

        • sa_penguin says:

          The bottleneck in RAID 5 is parity generation & checking. If any of that can be offloaded (to GPU or another CPU on a RAID card) the throughput should be higher.

  11. six677 says:

    I wonder if it would be possible to modify the linux kernel itself to be OpenCL/CUDA accelerated. Would certainly be an interesting experiment

    • Brooks Moses says:

      My guess is that there’s not much in the kernel that would be worth accelerating — because GPUs are designed for high data throughput at the expense of latency, and nearly everything in an operating system kernel needs to have very low latency and doesn’t involve a lot of data.

      I did recently see an interesting research paper that was using GPU acceleration to implement automated cache prefetch, but it involved using something closer to an AMD Fusion core with the GPU on the same chip as the CPU, and I think it also involved some specialized hardware to make it really work properly. And even then it wasn’t clear to me that it was really doing anything that wasn’t just “let’s power up these GPUs that aren’t running at the moment and use lots of energy for a marginal speedup of our CPU code.”

  12. lwatcdr says:

    I would have liked to see openCL used but Cuda is more mature and hey I didn’t write the code.

  13. Brandon says:

    For those who might be interested, Coursera is currently running a course on Heterogeneous Parallel Programming which touches on CUDA.


  14. The Timmy says:

    As a Linux ‘noob’, this sounds interesting. The general consensus I get from reading this post and the comments is “more processing power”.

    But what, if anything, does it mean to a user that doesn’t do much more in linux than internet, XP in VMware, e-mail and chat?

    I *do* like over-powered without overclocking…

    • six677 says:

      For you, very little; changes would need to be made at a much, much lower level to offer any benefit to those tasks.

      However, several internet browsers can make use of your GPU for drawing now; chrome is one which will use GPU acceleration (without installing this package).

  15. jc says:

    Pretty awesome. I wonder how resource management works though? For example what happens when many applications (and the kernel) ask for more resources from the GPU than it can provide? Will their compute time get prioritized fairly? Or will new requests just fail outright?

    • Brooks Moses says:

      The GPU kernel driver is responsible for marshalling all of those requests into a single execution queue that gets fed to the GPU, so the answer to most of those questions depends on how the driver is written. However, there’s no task pre-emption once things are actually running on the GPU (with current technology), so once a task is running, it will run to completion unless completely cancelled; there’s nothing the driver can do about it.

      Since it’s a queue rather than a “do this now” programming model, there’s no real sense of “more resources from the GPU than it can provide”; the queue just keeps getting longer and tasks wait longer to execute. I assume at some point you get an error when the driver runs out of queue space, but that may only happen when the whole machine is out of memory.

  16. old greg says:

    So what I am wondering is: does this mean it will be easier for a slightly more than average user to brute force a password?

  17. hardcorefs says:

    Yep that is what bitcoin is about

  18. Necromant says:

    Hm, so it’s just a hack to run CUDA code from kernelspace; you still have to write your own implementations to get them CUDA-accelerated.

  19. What I could see is this code converted to work on something like the Parallella. That way you leverage the extra processors without having to write a bunch of extra/custom code.

