CUDA, But Make It AMD

Compute Unified Device Architecture, or CUDA, is NVIDIA’s software platform for running massively parallel compute workloads on its GPUs. It has been a big part of the push to use GPUs for general-purpose computing, and competitor AMD has largely been left out in the cold as a result. However, with more demand for GPU compute than ever, there’s been a breakthrough: SCALE from [Spectral Compute] will let you compile CUDA applications for AMD GPUs.

SCALE allows CUDA programs to run as-is on AMD GPUs, with no changes to the source. The SCALE compiler is also intended as a drop-in replacement for nvcc, right down to the command-line options. For maximum ease of use, it presents itself as if you had installed the NVIDIA CUDA Toolkit, so you can build with CMake just as you would for a normal NVIDIA setup. Currently, Navi 21 and Navi 31 (RDNA 2.0 and RDNA 3.0) targets are supported, while a number of other GPUs are undergoing testing and development.
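
To get a feel for what “as-is” means, here’s the sort of perfectly ordinary CUDA source SCALE is designed to take unchanged. Nothing in it is SCALE-specific, and the build line in the comment is just a standard nvcc invocation of the kind an nvcc-compatible compiler would be expected to accept; treat it as a sketch rather than official SCALE documentation.

```
// vector_add.cu -- plain CUDA with no vendor-specific extras.
// Normally built with something like:  nvcc vector_add.cu -o vector_add
// (an nvcc-compatible compiler is expected to take the same command line).
#include <cstdio>
#include <vector>

__global__ void vadd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    vadd<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc.data(), dc, n * sizeof(float), cudaMemcpyDeviceToHost);

    std::printf("c[0] = %f (expected 3.0)\n", hc[0]);
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```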

The basic aim is to allow developers to use AMD hardware without having to maintain an entirely separate codebase. It’s still a work in progress, but it’s a promising tool that could help break NVIDIA’s stranglehold on parts of the GPGPU market.

 

40 thoughts on “CUDA, But Make It AMD”

    1. OpenCL exists and works. It has historically been slower than CUDA, but recent benchmarks put the two very close performance-wise. Library availability is another matter, where CUDA still has some advantage.

      1. Maybe. Though I was using it recently to experiment with image segmentation to apply background blur to a webcam feed. In that case, I would rather have something that supported a larger user population and didn’t use up all available CPU resources. Faster than CPU would be fast enough in this application.

      2. “probably trying to make money in one way or another”

        Or save money, if CUDA means less cloud GPU time to pay for. Especially if you’re doing research in an academic context. Gotta stretch those grant dollars.

      3. Even if you weren’t trying to make any money, plenty of the free, low-entry-barrier things you might want to make use of funnel you towards CUDA. For example, PyTorch forced me to borrow an NVIDIA GPU in the past, in order to mess with some voice-related machine learning stuff. I later messed with the image synthesis stuff once that became popular, but it can be fairly slow on CPU.

  1. Any numbers on performance comparisons?

    This triggers a memory of a floating-point emulator I had running on my 80386 to do AutoCAD (No, not an autocrat, you silly spelling thing) at home back in the ’90s. It worked, but it was terribly slow. After a while you learn exactly how far you can zoom in and out without triggering a “regenerate”, because that was a two-minute wait.

    1. This is what it will ultimately come down to.

      But I don’t think this is intended for big computations, where even small performance differences are likely to translate into significant sums (or hours).

      I bet this will be more useful for the “I just want to run this tool locally for some reason” crowd.

      Still, it will put pressure on NVIDIA, which is helpful.

    2. Some numbers would indeed be useful.
      I suspect that the metaphor of an emulator is perhaps not the best, and I’d suggest instead that of a cross-compiler.
      So some interesting numbers might be found from running:
      * test algorithm implemented in CUDA and ROCm running on same silicon (presumably AMD)
      * test algorithm implemented in CUDA running through this ‘cross compiler’
      This might show what the abstraction penalty is, if any.
      I’m sure there is some room for variability just in the quality of code that the ‘cross compiler’ generates.
      It is my (admittedly limited) understanding of the subject that the majority of the performance gains come from the underlying compute architecture, the volume of memory, and the memory bandwidth, with FLOPS making a smaller contribution. The user code defines the compute pipeline, which is then expected to scorch through a truly large amount of data. So, will SCALE know how to do this effectively for the target silicon, or will it do it naively? A rough sketch of how such numbers could be collected follows below.
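
      The idea of the sketch: the same self-timing source, built once with nvcc for NVIDIA silicon and once with the CUDA-for-AMD toolchain, compared against a hand-ported HIP/ROCm version of the same kernel on the same AMD card. The kernel and sizes here are arbitrary placeholders, not anything SCALE-specific:

      ```
      // bench.cu -- a rough sketch for timing one kernel with CUDA events.
      // Idea: build this identical file with nvcc for an NVIDIA card and with
      // the CUDA-for-AMD toolchain, then compare against a hand-ported
      // HIP/ROCm version of the same kernel on the same AMD silicon.
      #include <cstdio>

      __global__ void saxpy(int n, float a, const float *x, float *y) {
          int i = blockIdx.x * blockDim.x + threadIdx.x;
          if (i < n) y[i] = a * x[i] + y[i];
      }

      int main() {
          const int n = 1 << 24;  // arbitrary size; contents don't matter for timing
          float *x, *y;
          cudaMalloc(&x, n * sizeof(float));
          cudaMalloc(&y, n * sizeof(float));

          cudaEvent_t start, stop;
          cudaEventCreate(&start);
          cudaEventCreate(&stop);

          saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);  // warm-up launch
          cudaEventRecord(start);
          for (int it = 0; it < 100; ++it)
              saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, x, y);
          cudaEventRecord(stop);
          cudaEventSynchronize(stop);

          float ms = 0.0f;
          cudaEventElapsedTime(&ms, start, stop);
          std::printf("average kernel time: %.3f ms\n", ms / 100.0f);

          cudaFree(x);
          cudaFree(y);
          return 0;
      }
      ```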

    1. Eh.. I think they want to keep a pet competitor around for antitrust purposes. Kind of funny that NVIDIA and Intel both run the same duopoly prop-up operation.
      If AMD were failing too hard, they would pay to keep it alive so they didn’t have to risk being broken up.

      1. Have you noticed that AMD has approx double the market cap of Intel right now? And AMD has been growing whereas Intel has been sinking. NVIDIA on the other hand…. wow!

        1. Yeah, AMD vs. Intel is a fairer competition, and AMD seems to be crushing Intel. AMD vs. NVIDIA is something different altogether, where AMD will seriously struggle to compete because everything works better with NVIDIA’s proprietary technology and most things are made to use NVIDIA GPUs. AMD can mostly compete with NVIDIA in low- and mid-range gaming, but for professionals, especially for AI or high-end gaming, you still need NVIDIA.

          1. High-end gaming has gotten so high end that the terms have lost their meaning; maybe you could call the range AMD doesn’t offer “ultra high end”. Very few people should care that AMD’s fastest option is not as fast as NVIDIA’s, since even AMD’s is $800 and NVIDIA’s is double that. For most people those cards are just for the halo effect. People should care about the differences in ray tracing, upscaling, and the like, just like support for the compute that’s the subject of this article. To use an analogy: just because the racing version of your Camaro is faster than the racing version of some guy’s Mustang doesn’t mean your base-model four-cylinder is necessarily any better than his.

        2. I for one like having an underdog. Built a 5 GHz (P-core) 20-core system for $500 (yes, that’s a used Z690 and a 14700 non-K), and that includes an Antec case, RGB, and a good OEM Gold 600 W PSU. But I’ve also built a 5700G in an HP 25L case for ~$400, so YMMV. Rumours are there may be a 12-P-core-only chip coming to LGA1700, which I’d like to see.

          Got an RX 6800 for $190. So it would be nice if more software like Photoshop would work on AMD.

          1. Nice! Have you experienced the instability / possible degradation that’s been going on and is under investigation with 13th/14th-gen Intel chips? I’ve avoided them where I can because I liked the AMD options in my price range better than the way Intel did their P and E cores, and from the sounds of it I’m glad I did.

    2. Or NVIDIA is dissecting it and looking for new CUDA commands to add that will have lower performance on AMD hardware. Basically, AMD now has to play catch-up with a standard fully controlled by NVIDIA. NVIDIA will know the standard internally from day one, then issue a newer revision of its software, and there will be a delay before AMD can add any new commands. NVIDIA will always have the latest revision, and AMD will, at best, catch up to it only after a delay.

    1. Speaking for the machine learning engineers: we don’t care. Stochastic Gradient Descent is pretty resilient. You can truncate our floats brutally and we won’t even notice.
      Speaking for the crypto folks: oh noes my hashes might get rejected!!!

    1. “The header file contains mostly inlined functions and thus has very low overhead”

      bah humbug. it may be very good. the readme didn’t really answer my questions, and i’m not going to dig further. but inlining is not worth thinking about. function calls are cheap. everyone these days uses a good ABI with cheap function calls. “inline makes function calls cheap” is C++’s original sin. by the time Stroustrup was working on C++, Guy Steele had already debunked this fallacy. so much of C++ practice — both at the language committee and in the wild — is meditating on inlining.

      inlining isn’t fast. it just destructures your code. i’m pretty pessimistic about the output of anyone who has spent their day meditating on inlining.

  2. Development in this direction makes me realise that the work I am currently doing to get SYCL up and running might be in vain. I honestly can’t fathom how many different projects and ways this is implemented.

  3. an anecdotal report…i was curious about all this OpenCL stuff so i wanted to see what it’s like from a programmer perspective, just to dip a toe in and see if the water is warm.

    it’s a neat API. at runtime you pass it a “kernel” as a string, and it compiles it to whatever its internal representation is. the kernel looks like a bit of C code but obviously has a bunch of restrictions and a few special features.

    started out running on my Celeron N4000 (UHD 600) and its performance was abysmal. just like OpenGL, i struggled to know: is it accelerated at all or has it silently fallen back on host CPU software emulation? does the idiom in my kernel match the features of the accelerator? is the compilation overhead swamping my test case? is copying overhead swamping my test case?
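
    to make that concrete, here’s the flow i was poking at, boiled down: compile a kernel from a string at runtime, and ask the device what it actually is so you can tell a real GPU from a CPU fallback. sketch only -- first platform/device taken blindly, almost no error handling:

    ```
    /* compile an OpenCL kernel from a string at runtime and report
       what kind of device you actually got. */
    #include <stdio.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void add(__global const float *a, __global const float *b,"
        "                  __global float *c) {"
        "  size_t i = get_global_id(0);"
        "  c[i] = a[i] + b[i];"
        "}";

    int main(void) {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        cl_device_type type;
        char name[256];
        clGetDeviceInfo(dev, CL_DEVICE_TYPE, sizeof(type), &type, NULL);
        clGetDeviceInfo(dev, CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("device: %s (%s)\n", name,
               (type & CL_DEVICE_TYPE_GPU) ? "GPU" : "not a GPU");

        cl_int err;
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        err = clBuildProgram(prog, 1, &dev, "", NULL, NULL);
        if (err != CL_SUCCESS) {
            char log[4096];
            clGetProgramBuildInfo(prog, dev, CL_PROGRAM_BUILD_LOG,
                                  sizeof(log), log, NULL);
            printf("build failed:\n%s\n", log);
            return 1;
        }
        cl_kernel k = clCreateKernel(prog, "add", &err);
        printf("kernel compiled: %s\n", err == CL_SUCCESS ? "yes" : "no");
        /* (buffer setup, clSetKernelArg, clEnqueueNDRangeKernel omitted) */
        return 0;
    }
    ```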

    i don’t like how opaque OpenGL and OpenCL are. i don’t understand the interfaces between kernel and userland, and the libraries are so deeply layered and are designed to mask a lot of it. like, i’m surprised as heck to say that acceleration seems to work under lxc / docker — how does it allocate this resource?? just like the late 90s, i know if hardware GL works by whether or not my videogames are fast. unlike the 90s, i can’t be sure my CPU isn’t fast enough to do glxgears in software.

    so i was puzzling this out and decided to try it on my AMD Ryzen 3 2200G (Radeon Vega 8). same example that ‘worked’ on my laptop. the program crashed immediately. it was unkillable. after running the program, clinfo and radeontop would also crash. i got fed up quickly. my load average was pegged at 4.00 for months until i rebooted, from all the zombie processes.

    so i really don’t know but overall i’m not impressed. it seems like people are hacking without a thought for anything but performance. if i had a real goal, i’d bother to go through the debugging process but i just wanted to know if the water is warm. it isn’t.

  4. The sad bottom line is that if you need a heavy-duty GPU, it is going to be expensive. I have been playing with a bunch of AI stuff and have been looking into things, and it is not a poor man’s hobby. You need a machine that it will physically fit in and interface with, then you need a much larger power supply, and then the card itself.

    If you already have the compatible AMD hardware and have some time on your hands, it might be something to play with. If you are a serious user in a production environment, I would be looking for the most “plug and play” solution. At this point in time I would not make promises based on a new project.

    As a hobbyist, and it really pains me to say this, you are probably better off just renting GPU/TPU time from Google via Colab. You get high-end hardware to play with, you do not have to worry about it depreciating when you are not using it, you do not have to deal with power or cooling, and I suspect that at my level of usage I would never, in the rest of my life, use anywhere near the amount of time it would cost to buy a mid-scale GPU and house, power, and cool it.

    But I do wish the project success. I suspect there are a lot of people with AMD hardware who would like to be able to run CUDA transparently.
