What are these "GPUs" really?

A blonde woman sings an RTX 4090 into being with magic.
Image by Annie Ruygt

Fly.io runs containerized apps with virtual machine isolation on our own hardware around the world, so you can safely run your code close to where your users are. We’re in the process of rolling out GPU support, and that’s what this post is about, but you don’t have to wait for that to try us out: your app can be up and running on us in minutes.

GPU hardware will let our users run all sorts of fun Artificial Intelligence and Machine Learning (AI/ML) workloads near their users. But, what are these “GPUs” really? What can they do? What can’t they do?

Listen here for my tale of woe as I spell out exactly what these cards are, what they aren’t, and what you can do with them. By the end of this magical journey, you should understand the true irony of them being called “Graphics Processing Units” and why every marketing term is always bad forever.

How does computer formed?

In the early days of computing, your computer generally had a few basic components:

  • The CPU
  • Input device and assorted peripherals (keyboard, etc)
  • Output device (monitor, printer, etc)
  • Memory
  • Glue logic chips
  • Video rendering hardware

Taking the Commodore 64 as an example, it had a CPU, a chip to handle video output, a chip to handle audio output, and a chip to glue everything together. The CPU would read instructions from the RAM and then execute them to do things like draw to the screen, solve sudoku puzzles, play sounds, and so on.

However, even though the CPU by itself was fast by the standards of the time, it could only do about a million clock cycles per second. Imagine a very small shouting crystal vibrating about a million times per second, each pulse triggering the CPU to do one part of a task, and you’ll get the idea. This is fast, but not fast enough when executing an instruction can take more than a single clock cycle and your video output device needs to be updated 60 times per second.

The main way they optimized this was by shunting a lot of the video output tasks to a bespoke device called the VIC-II (Video Interface Chip, version 2). This allowed the Commodore 64 to send a bunch of instructions to the VIC-II and then let it do its thing while the CPU was off doing other things. This is called “offloading”.

As technology advanced, the desire to do bigger and better things with both contemporary and future hardware increased. This came to a head when this little studio nobody had ever heard of called id Software released one of the most popular games of all time: DOOM.

Now, even though DOOM was a huge advancement in gaming technology, it was still incredibly limited by the hardware of the time. It was actually a 2D game that used a lot of tricks to make it look (and feel) like it was 3D. It was also limited to a resolution of 320x200 and a hard cap of 35 frames per second. This was fine for the time (most movies were only at 24 frames per second), but it was clear that there was a lot of room for improvement.

One of the main things that DOOM did was to use a pair of techniques to draw the world at near real-time. It used a combination of “raycasting” and binary-space partitioning to draw the world. This basically means that it drew a bunch of imaginary lines from the player out into the map to figure out what color everything should be, and skipped the parts of the map that were hidden behind walls and other objects. This is a very simplified explanation, and if you want to know more, Fabien Sanglard explains the rendering of DOOM in more detail.

The dream of 3D

However, a lot of this was logic that ran very slowly on the CPU, and while the CPU was doing the display logic, it couldn’t do anything else, such as enemy AI or playing sounds. Hence the “3D accelerator card”. The idea: offload the 3D rendering logic to a separate device that could do it much faster than the CPU could, and free the CPU to do other things like AI, sound, and so on.

This was the dream, but it was a long way off. Then Quake happened.

Really, Half-Life is based on Quake so much that the pattern for blinking lights has carried forward 25 years later to Half-Life: Alyx in VR. If it ain’t broke, don’t fix it.

Unlike DOOM, Quake was fully 3D on unmodified consumer hardware. Players could look up and down (something previously thought impossible without accelerator hardware!) and designers could make levels with that in mind. Quake also allowed much more complex geometry and textures. It was a huge leap forward in 3D gaming, and it was only possible because of the massive leap in CPU power at the time. The Pentium family of processors was such a jump that it let id pull all of this off in “real time”. Quake has since set the standard for multiplayer deathmatch games, and its source code is the ancestor of the engines behind Call of Duty, Half-Life, Half-Life 2, Dota 2, Titanfall, and Apex Legends.

However, the thing that really made 3D accelerator cards leap into the public spotlight was another little-known studio called Core Design and their 1996 release of Tomb Raider. Its 3dfx-accelerated version showed people what the new hardware could do, and the cards flew off the shelves.

“3D accelerator cards” would later become known as “Graphics Processing Units” or GPUs because of how synonymous they became with 3D gaming, engineering tasks such as Computer-Aided Drafting (CAD), and even the entire OS environment with compositors like DWM on Windows Vista, Compiz on GNU+Linux, and Quartz on macOS. Things became so much easier for everyone when 2D and 3D graphics were integrated into the same device so you didn’t need to chain your output through your 3D accelerator card!

The GPU as we know it

When GPUs first came out, they were very simple devices. They had a few basic components:

  • A framebuffer to store the current state of the screen
  • A command processor to take instructions from the game and translate them into something the hardware can understand
  • Memory to store temporary data
  • Shader processing hardware to allow designers to change how light and textures were rendered
  • A display output that was chained through an existing VGA card so that the user could see what was going on in real time (yes, this is something we actually did)

This basic architecture has remained the same for the past 20 years or so. The main difference is that as technology advanced, the capabilities of those cards increased: they got faster, more parallel, more capable, had more memory, were made cheaper, and so on. This gradually allowed for more and more complex games like Half-Life 2, Crysis, The Legend of Zelda: Breath of the Wild, Baldur’s Gate 3, and so on.

Over time, as more and more hardware was added, GPUs became computers in their own right (sometimes even physically bigger than the rest of the computer, thanks to the need to cool things more aggressively). This new hardware includes:

  • Video encoding hardware via NVENC and AMD VCE so that content creators can stream and record their gameplay in higher quality without having to impact the performance of the game
  • Raytracing accelerator cores via RTX so that light can be rendered more realistically
  • AI/ML cores to allow for dynamic upscaling to eke out more performance from the card
  • Display output hardware to allow for multiple monitors to be connected to the card
  • Faster and faster memory buses and interfaces to the rest of the system to allow for more data to be processed faster
  • Direct streaming from the drive to GPU memory to allow for faster loading times

But, at the same time, that AI/ML hardware started to get noticed by more and more people. It was discovered that the shader cores and then the CUDA cores could be used to do AI/ML workloads at ludicrous speeds. This enabled research and development of models like GPT-2, Stable Diffusion, DLSS, and so on. This has led to a Cambrian Explosion of AI/ML research and development that is continuing to this day.
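
To make “ludicrous speeds” a bit more concrete, here’s a minimal sketch that times the same big matrix multiplication on the CPU and then on the GPU. It assumes PyTorch is installed and a CUDA-capable card is present; the matrix size is arbitrary.

```python
import time

import torch

# A deliberately chunky matrix multiplication: exactly the kind of
# embarrassingly parallel arithmetic that shader/CUDA cores chew through.
a = torch.randn(4096, 4096)
b = torch.randn(4096, 4096)

start = time.perf_counter()
_ = a @ b
print(f"CPU: {time.perf_counter() - start:.3f}s")

if torch.cuda.is_available():
    a_gpu, b_gpu = a.cuda(), b.cuda()
    _ = a_gpu @ b_gpu          # warm-up: the first launch pays one-time setup costs
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = a_gpu @ b_gpu
    torch.cuda.synchronize()   # kernels run asynchronously; wait before stopping the clock
    print(f"GPU: {time.perf_counter() - start:.3f}s")
```

The exact numbers depend on your hardware, but that gap is what pushed researchers to move this kind of math onto GPUs in the first place.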

The “GPUs” that Fly.io is using

I’ve mostly been describing consumer GPUs and their capabilities up to this point because that’s what we all understand best. There is a huge difference between the “GPUs” built for servers and the consumer cards you can pick up from a place like Newegg or Best Buy. The main difference is that enterprise-grade Graphics Processing Units do not have any of the hardware needed to process graphics.

Author’s note: This will not be the case in the future. Fly.io is going to add Lovelace L40S GPUs that do have 3D rendering, video encoding, shader cores, and so on. But, that’s not what we’re talking about today.

Yes. Really. They don’t have rasterization hardware, shader cores, display outputs, or anything useful for trying to run games on them. They are AI/ML accelerator cards more than anything. It’s kinda beautifully ironic that they’re called Graphics Processing Units when they have no ability to process graphics.

What can you do with them?

These GPUs are really good at massively parallel tasks. This naturally translates to being very good at AI/ML tasks such as:

  • Summarization (what is this article about in a few sentences?)
  • Translation (what does this article say in Spanish?)
  • Speech recognition (what is a voice clip saying?)
  • Speech synthesis (what does this text sound like?)
  • Text generation (what would a cat say if it could talk?)
  • Basic rote question answering (what is the safe cooking temperature for chicken breasts in Celsius?)
  • Text classification (is this article about cats or dogs?)
  • Sentiment analysis (is this article positive or negative, what could that mean about the companies involved?)
  • Image classification (is this a cat or a dog?)
  • Object detection (where are the cats and dogs in this image?)

Or any combination/chain of these tasks. These are fairly abstract building blocks that can be combined in all sorts of ways, which is why AI/ML stuff is so exciting right now. We’re in the early days of understanding what these things are, what they can do, and how to use them properly.
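
Many of these building blocks are a few lines of code away with off-the-shelf tooling. Here’s a minimal sketch using the Hugging Face transformers pipeline API, assuming the transformers package and a backend like PyTorch are installed; the example text is invented.

```python
from transformers import pipeline

# Each task maps to a one-liner; default models are downloaded from the
# Hugging Face Hub on first use.
summarizer = pipeline("summarization")
sentiment = pipeline("sentiment-analysis")

article = (
    "Fly.io is rolling out GPU support so that apps can run AI/ML workloads "
    "close to their users instead of shipping data halfway around the world."
)

print(summarizer(article, max_length=40, min_length=10)[0]["summary_text"])
print(sentiment(article)[0])  # e.g. {'label': 'POSITIVE', 'score': 0.99}
```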

Imagine being able to load articles about the topic you are researching into your queries to find where someone said something roughly similar to what you’re looking for. Queries like “that one recipe with eggs that you fold over with ham in it”. That’s the kind of thing that’s possible with AI/ML (and tools like vector databases) but difficult or outright impossible with traditional search engines.
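
Here’s a rough sketch of that fuzzy-search idea using the sentence-transformers library, assuming it’s installed; the recipe snippets are invented for illustration. The same embed-then-compare trick is what vector databases do at scale.

```python
from sentence_transformers import SentenceTransformer, util

# Embed the documents once, then compare the query against them with cosine
# similarity. A vector database is essentially this, plus indexing and storage.
model = SentenceTransformer("all-MiniLM-L6-v2")

recipes = [
    "Classic ham and cheese omelette, folded over the filling",
    "Slow-cooked beef chili with beans",
    "Fluffy buttermilk pancakes with maple syrup",
]
query = "that one recipe with eggs that you fold over with ham in it"

doc_vecs = model.encode(recipes, convert_to_tensor=True)
query_vec = model.encode(query, convert_to_tensor=True)

scores = util.cos_sim(query_vec, doc_vecs)[0]
best = int(scores.argmax())
print(recipes[best], float(scores[best]))  # the omelette wins
```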

How to use AI for reals

Fortunately and unfortunately, we’re in the Cambrian Explosion days of this industry. Key advances happen constantly. The exact models and tooling change almost as often. This is both a very good thing and a very bad thing.

If you want to get started today, here’s a few models that you can play with right now:

  • Llama 2 - A generic foundation model with instruction-tuned and chat-tuned variants. It’s a good starting point for a lot of research, and nearly everything else uses the same formats that Llama 2 does.
  • Whisper - A speech-to-text model that transcribes audio files into text better than most professional dictation software. I, the author, wrote most of this article using Whisper.
  • OpenHermes-2.5 Mistral 7B 16k - An instruction-tuned model that can operate on up to 16 thousand tokens (about 40 printed pages of text, 12,000 words) at once. It’s a good starting point for summarization and other tasks that require a lot of context. I personally use it for my personal AI chatbot named Mimi.
  • Stable Diffusion XL - A text-to-image model that lets you create high quality images from simple text descriptions. It’s a good starting point for tasks that require image generation, such as when you want to add images to your blog posts but don’t have an artist like Annie to draw you what you want.
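
As a small taste of that last one, here’s a minimal Stable Diffusion XL sketch using the diffusers library. It assumes diffusers, transformers, and torch are installed and that you have a CUDA-capable card with a healthy amount of VRAM; the prompt and output filename are just examples.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Downloads several gigabytes of weights from the Hugging Face Hub on first run.
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
)
pipe.to("cuda")

image = pipe(prompt="a blonde woman sings an RTX 4090 into being with magic").images[0]
image.save("hero-image.png")
```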

For a practical example, imagine that you have a set of conference talks that you’ve given over the years. You want to take those talk videos, extract the audio, and transform them into written text because some people learn better from text than video. The overall workflow would look something like this (there’s a rough code sketch after the list):

  • Use ffmpeg to extract the audio track from the video files
  • Use Whisper to convert the audio files into subtitle files
  • Break the subtitle file into segments based on significant pauses between topics (humans do this subconsciously, take advantage of it and you can make things seem heckin’ magic)
  • Use a large language model to summarize the segments and create a title for each segment
  • Paste the rest of the text into a markdown document between the segment titles
  • Manually review the documents and fix any technical terms the model didn’t know about or things the model got wrong because English is a minefield of homophones that even trained experts have trouble with (ask me how I know)
  • Publish the documents on your blog
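
Sketched out in Python, that chain might look something like the following. It assumes ffmpeg is on your PATH and the openai-whisper and transformers packages are installed; the filenames, the two-second pause threshold, and the stock summarization model are placeholders (a chat-tuned model like OpenHermes would write nicer section titles).

```python
import subprocess

import whisper
from transformers import pipeline

VIDEO = "conference-talk.mp4"   # hypothetical input; loop over all your talks in practice
AUDIO = "conference-talk.wav"

# 1. Pull the audio track out of the video with ffmpeg.
subprocess.run(
    ["ffmpeg", "-y", "-i", VIDEO, "-vn", "-ac", "1", "-ar", "16000", AUDIO],
    check=True,
)

# 2. Transcribe it. Whisper returns timestamped segments, which is what lets
#    us split on long pauses later.
stt = whisper.load_model("base")
segments = stt.transcribe(AUDIO)["segments"]

# 3. Group segments into sections wherever there's a pause longer than ~2 seconds.
sections, current, last_end = [], [], 0.0
for seg in segments:
    if current and seg["start"] - last_end > 2.0:
        sections.append(current)
        current = []
    current.append(seg["text"].strip())
    last_end = seg["end"]
if current:
    sections.append(current)

# 4. Give each section a title with a summarization model, then emit markdown.
summarize = pipeline("summarization")
with open("talk.md", "w") as out:
    for section in sections:
        text = " ".join(section)
        title = summarize(text, max_length=15, min_length=4)[0]["summary_text"]
        out.write(f"## {title}\n\n{text}\n\n")
```

The manual review pass in the list above is still the important part: treat the model’s output as a draft, not a finished transcript.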

Then bam, you don’t just have a portfolio piece, you have a recipe for earning some downtime when visitors from a certain orange website all click your link at once. You can also use this to create transcripts for your videos so that people who can’t hear them can still enjoy your content.

The true advantage of these models comes not from using them as individual parts on their own, but from chaining them together into a cohesive whole. That’s where the real power of AI/ML comes from, and where the true opportunities for innovation lie.

Conclusion

So that’s what these “GPUs” really are: AI/ML accelerator cards. The A100 cards are incapable of processing graphics or encoding video, but they’re really, really good at AI/ML workloads. They allow you to do way more work per watt than any CPU ever could.

I hope you enjoyed this tale of woe as I spilled out the horrible truths about marketing being awful forever and gave you ideas for how to actually use these graphics-free Graphics Processing Units to do useful things. Sadly, though, not for processing graphics, unless you wait for the Lovelace L40S cards coming in early 2024.

Sign up for Fly.io today and try our GPUs! I can’t wait to see what you build with them.