If you’re curious about how fancy graphics cards actually work, and why they are so well-suited to AI-type applications, then take a few minutes to read [Tim Dettmers]’ explanation of why this is so. It’s not a terribly long read, and while it does get technical, there are also car analogies, so there’s something for everyone!
He starts off by noting that most people know GPUs are scarily efficient at matrix multiplication and convolution, but that what really sets them apart is their ability to move large amounts of memory around very efficiently.
Essentially, a CPU is a latency-optimized device while GPUs are bandwidth-optimized devices. If a CPU is a race car, a GPU is a cargo truck. The main job in deep learning is to fetch and move cargo (memory, actually) around. Both devices can do this job, but in different ways. A race car moves quickly, but can’t carry much. A truck is slower, but far better at moving a lot at once.
To extend the analogy, a GPU isn’t actually just a truck; it is more like a fleet of trucks working in parallel. When applied correctly, this can effectively hide latency in much the same way as an assembly line. It takes a while for the first truck to arrive, but once it does, there’s an unbroken line of loaded trucks waiting to be unloaded. No matter how quickly and efficiently one unloads each truck, the next one is right there, waiting. Of course, GPUs don’t just shuttle memory around, they can do work on it as well.
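To make the fleet-of-trucks picture concrete, here is a minimal CUDA sketch of our own (not from [Dettmers]’ write-up, and all names and sizes are illustrative): it launches far more threads than there are physical cores, and the hardware swaps between them whenever one stalls on a memory fetch, which is the latency hiding described above.

```cuda
// Minimal sketch: each thread is one "truck" in the fleet. While some
// threads wait on DRAM, the scheduler runs others, so the memory
// pipeline stays full even though each individual access is slow.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, float factor, int n)
{
    // Grid-stride loop: many more threads in flight than physical cores,
    // which is exactly what hides the long round trip to memory.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        data[i] *= factor;
}

int main()
{
    const int n = 1 << 24;                 // ~16M floats, ~64 MB of data
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    scale<<<256, 256>>>(d, 2.0f, n);       // tens of thousands of threads at once
    cudaDeviceSynchronize();

    cudaFree(d);
    printf("done\n");
    return 0;
}
```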
The usual configuration for deep learning applications is a desktop computer with one or more high-end graphics cards wedged into it, but there are other (and smaller) ways to enjoy some of the same computational advantages without eating a ton of power and gaining a bunch of unused extra HDMI and DisplayPort jacks as a side effect. NVIDIA’s Jetson line of development boards incorporates the same technology in an integrated way, and while the boards might lack the raw horsepower (and power bill) of a desktop machine laden with GPUs, they’re no slouch for their size.
So it’s more like container shipping? :p
When we don’t have the time to really understand something, we use analogies. We explain something new by using an already understood concept. To some degree everything is a representation, but when we use representations of representations ad infinitum, at some point we willingly trade understanding for simplicity, and that can result in a perceived understanding that bypasses any factual basis. I see this a lot in AI, and in the attempts to easily make it explainable. So to amend your above statement, I would say it is not so much container shipping as the collection of individual containers on a particular ship before they have entered the harbor. Get it? No?
He still doesn’t explain the deep learning connection, does he? He explains the 3 significant benefits of GPU architecture over CPUs, but doesn’t explain why the problems associated with deep learning are better solved by having those 3 benefits.
I still don’t know what the fundamental problems are with deep learning. But at least I do now know that high memory bandwidth, thread parallelism and fast register access are the basis of the solutions to them.
It’s because deep learning, artificial neural networks, etc. are essentially emulating another “computer” in the computer’s memory. It’s like a physics simulation where you compute how the parts of the machine operate instead of directly computing the logic that it implements – largely because you don’t know what the logic is. You simply run the machine over and over, changing and varying its internal rules by trial and error, until it does the job that you want – and you don’t care how it does that just as long as it does.
So what you have is a massive matrix of data that represents the internal state of the virtual computer, and you’re trying to “run” it as fast as possible to test its response, and that sort of job is best suited for vector processors like GPUs. The computation itself is very simple, just some addition or multiplication etc. without any branching, but the matrix is so big that you also need extremely fast memory access to churn through it – otherwise your thousands of parallel cores would be sitting idle waiting on memory most of the time.
“Computer emulating a computer” tells people nothing.
Neural networks are composed of layers of neurons, with connections in between. The outputs of the neurons of one layer get passed to the next, but the neurons in each layer are pretty much independent of each other. Thus, you can compute the result for each neuron in a layer in parallel with the layer’s other neurons, in theory.
A CPU has a single thread (or perhaps a handful, but let’s assume one) and has to chug through the neurons one-by-one regardless of whether they could be handled in parallel. A GPU with 1000 cores can compute the results of a 1000-neuron layer in parallel in one step, then move on to the next layer. That yields a 1000x speedup compared to a single CPU core.
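Sketched as CUDA (my own toy example, nobody’s production code, and dense_layer, in_n and out_n are just made-up names), the “one thread per neuron” idea looks roughly like this:

```cuda
// One thread computes one output neuron of a fully connected layer, so a
// 1000-neuron layer is a single kernel launch rather than a 1000-step loop.
#include <cuda_runtime.h>

__global__ void dense_layer(const float *W,   // [out_n x in_n] weights, row-major
                            const float *x,   // [in_n] previous layer's outputs
                            const float *b,   // [out_n] biases
                            float *y,         // [out_n] this layer's outputs
                            int in_n, int out_n)
{
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // which neuron this thread owns
    if (j >= out_n) return;

    float acc = b[j];
    for (int i = 0; i < in_n; ++i)      // same simple multiply-add for every neuron
        acc += W[j * in_n + i] * x[i];

    y[j] = acc > 0.0f ? acc : 0.0f;     // ReLU activation
}

// Launch sketch for a 1000-neuron layer (sizes are illustrative):
//   dense_layer<<<(1000 + 255) / 256, 256>>>(W, x, b, y, in_n, 1000);
```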
We’re not talking of just neural networks, but all sorts of systems that can be represented in similar ways. What the “AI” is doing is running a simulation of some hypothetical mechanism or a computing machine, that is supposed to resemble how brains process information. To understand the point, we don’t need to go to details of how an SNN works – it could be any such “cellular automata”. We could just as well be iterating through different versions of the Game of Life to find one that can drive a car, or buy you napkins off of Amazon.
Seeing that such computing machines and their representations all necessarily reduce to the Turing Machine, what you are doing IS just running a virtual computer in the memory of your actual computer, by simulating the hardware of said machine “piece by piece”. The state of each piece is represented by data in memory, and you shuffle the data around to make the virtual computer run.
>That yields a 1000x speedup compared to a single CPU core.
Not if you don’t have the memory bandwidth and speed to keep those 1000 cores fed with data. Ordinary CPUs are designed to handle complex computational problems of the size that fit in the CPU cache. A GPU is optimized to perform simple computational routines on massive amounts of data, which fits AI simulations well – or rather, our AI simulations have been formulated to run well on GPUs.
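To put rough numbers on that (my own back-of-envelope figures, not from the article): a dense layer at batch size one reads every weight exactly once and does one multiply-add with it, so memory traffic rather than core count sets the ceiling.

```cuda
// Back-of-envelope host-side sketch: with ~0.5 FLOP per byte of weights,
// an assumed ~500 GB/s of memory bandwidth caps you at ~250 GFLOP/s,
// no matter how many idle cores are waiting to be fed.
#include <cstdio>

int main()
{
    const double bytes_per_weight = 4.0;    // fp32 weight
    const double flops_per_weight = 2.0;    // one multiply + one add
    const double intensity = flops_per_weight / bytes_per_weight;  // FLOP per byte

    const double bandwidth = 500e9;         // assumed GDDR bandwidth, bytes/s
    printf("Achievable: ~%.0f GFLOP/s, regardless of core count\n",
           intensity * bandwidth / 1e9);
    return 0;
}
```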
Though most AI simulations are not Turing complete. It’s far more likely to find a simpler solution that performs only the task you require of it, and nothing else.
I totally agree with your summarisation of what is being achieved in the GPU; all intelligent systems are just simulations of other systems. But further to this, the more accurate and granular the simulation is, the more ‘understanding’ it has of the external system it is trying to simulate. Although there is also a problem with the overarching managing system, or what I like to call the operator. The operator chooses the system to simulate and the rules of how the simulator operates. In doing so, the top-level operator or manager could ask the system to simulate something that is not true, like a deep fake or a belief in the flat world theory. In doing so, the underlying system ‘believes’ in a reality that does not exist. Further to this, we should ask who or what is the overarching operator of the operator; this can be another intelligent system, or just a component of the environment.
Too many words, too much conjecture, too many incorrect details.
And the hack would be? I do not come here to read common news…
Common AI clickbait news, that is, in which I see no value.
Yeah, he didn’t even mention compute cards like Nvidia’s Tesla line, and the ending paragraph seems to imply he doesn’t even know about them, because he mentions ‘display outputs’, which Tesla cards don’t even have.
Today in BizarroLand we get our facts from Quora, our investment advice from Yahoo!Finance and our medical diagnosis from WebMD.
That’s, like, so yesterday! Now we get our facts from TikTok, our investment advice from Reddit’s r/wallstreetbets and our cyberchondriac medical diagnosis from herbalist FB pages.
Just think how people are making their election choices…well, you don’t have to actually…like…think….
“move cargo (memory, actually) around” and “Of course, GPUs don’t just shuttle memory around, they can do work on it as well”
The more apt term would be data; the memory itself stays in place, only the memory contents are moved and operated upon.
Maybe this comes from C’s memcpy, but this refers to main memory as opposed to hard disk or other physical storage media, not that memory is actually copied (which it isn’t).
Copying a disk makes sense, when you actually have physical media you move around, instead of mere data transfer.
Buying NVIDIA hardware today is like buying a Volkswagen in 1939.
If the ‘Maxwell’ architecture is enough for you, there is an Nvidia card with 24GB of GDDR5 memory for around $110: the Tesla M40. I confess to not knowing what the current hobbyist large language models have in the way of hardware requirements.
If you need a Pascal card, the Tesla P40 is available for around $200 and has 24GB of GDDR5.
I almost forgot: there may be some BIOS hacks for AMD CPUs with integrated graphics (APUs) to give them a larger ‘dedicated’ portion of system memory for use with a large language model. This way you can use less expensive system memory (64GB of DDR4 at ~$80 or 64GB of DDR5 at ~$100).
Looks like even a Kepler Tesla K40 might run a CUDA 11 LLamaSharp backend: https://github.com/SciSharp/LLamaSharp
According to the documentation here: https://en.wikipedia.org/wiki/CUDA
The K40 12GB card is around $30, although the K80 (two K40s on a single card) is only $40. So it could be cheap to get into.
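If anyone wants to check what an old card actually reports before committing, a minimal CUDA sketch like this (my own, untested on those specific cards) prints the VRAM and compute capability that framework support keys off of:

```cuda
// Query every visible GPU and print name, VRAM, and compute capability;
// older architectures like Kepler hinge on those major.minor numbers for
// which CUDA toolkit versions will still build kernels for them.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s, %.1f GB VRAM, compute capability %d.%d\n",
               i, prop.name,
               prop.totalGlobalMem / (1024.0 * 1024.0 * 1024.0),
               prop.major, prop.minor);
    }
    return 0;
}
```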