Rokcso's blog

Translation: If GPUs Are So Good, Why Do We Still Use CPUs At All?

This article was translated by AI and corrected by rokcso.

Original article: If GPUs Are So Good, Why Do We Still Use CPUs At All?

There’s this old video from 2009 that’s been going viral on Twitter recently. It’s supposed to give viewers an intuition for the difference between CPUs and GPUs.

The idea is that a CPU and a GPU go head-to-head in a painting duel. The processors are hooked up to a machine that shoots paintballs.

The CPU takes a full 30 seconds to paint a very basic smiley face:

And then the GPU paints the Mona Lisa in an instant:

One takeaway from this video: CPUs are slow and GPUs are fast. While this is true, there’s a lot more nuance that the video doesn’t capture.

Tera Floating Point Operations Per Second (TFLOPS)

When we say GPUs are much more performant than CPUs, we’re talking about a measurement called TFLOPS, which measures how many trillions of floating-point operations a processor can do in a second. For example, the Nvidia A100 GPU can do 9.7 TFLOPS (9.7 trillion operations per second) while a recent 24-core Intel processor can do 0.33 TFLOPS. That means a middle-of-the-road GPU is roughly 30x faster than even the most capable CPU.
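As a quick sanity check on that comparison, here is the arithmetic using only the two figures quoted above. The variable names are just for illustration.

```python
# Throughput figures quoted in the text, in TFLOPS.
gpu_tflops = 9.7   # Nvidia A100
cpu_tflops = 0.33  # 24-core Intel processor

# How many times more floating-point operations per second the GPU can do.
ratio = gpu_tflops / cpu_tflops
print(f"GPU/CPU throughput ratio: {ratio:.1f}x")
```

The ratio comes out just under 30, which is where the "about 30x" figure comes from.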

But the chip in my MacBook (Apple M3 chip) contains a CPU and a GPU. Why? Can’t we just do away with these terribly slow CPUs?

Different Types of Programs

Let’s define two types of programs: sequential programs and parallel programs.

Sequential Programs

Sequential programs are programs where all the instructions have to run one after another. Here’s an example:

def sequential_calculation():
    a = 0
    b = 1
   
    for _ in range(100):
        a, b = b, a + b
    
    return b

Here, 100 times in a row, we calculate the next number using the previous two numbers. The important quality of this program is that each step depends on the two steps before it. If you were doing this calculation by hand, you couldn’t tell a friend, “You calculate steps 51 through 100 while I start from step 1,” because they would need the results of steps 49 and 50 to even begin calculating step 51. Each step requires knowing the previous two numbers in the sequence.

Parallel Programs

Parallel programs are programs where multiple instructions can be executed simultaneously because they don’t depend on each other’s results. Here’s an example:

def parallel_multiply():
    numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
    results = []

    for n in numbers:
        results.append(n * 2)

    return results

In this case, we do ten multiplications that are totally independent of each other. The important thing is that the order doesn’t matter. If you wanted to split the work with a friend, you could say, “You multiply the odd numbers while I multiply the even numbers.” You could work separately and simultaneously and still get the correct result.
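To make the "split it with a friend" idea concrete, here is a minimal sketch that hands the odd numbers to one worker and the even numbers to another. The helper names `double_chunk` and `split_between_two_workers` are made up for this illustration, and Python threads are only standing in for the two friends here: the point is that the two halves never need each other's results, not that this particular version runs faster.

```python
from concurrent.futures import ThreadPoolExecutor

def double_chunk(chunk):
    """Double every number in one worker's share, keeping its original position."""
    return [(index, n * 2) for index, n in chunk]

def split_between_two_workers(numbers):
    indexed = list(enumerate(numbers))
    odds = [(i, n) for i, n in indexed if n % 2 == 1]   # one friend's share
    evens = [(i, n) for i, n in indexed if n % 2 == 0]  # the other friend's share

    results = [None] * len(numbers)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # Each chunk is processed without looking at the other chunk.
        for finished_chunk in pool.map(double_chunk, [odds, evens]):
            for index, doubled in finished_chunk:
                results[index] = doubled
    return results

numbers = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
results = split_between_two_workers(numbers)
print(results)
```

Because each result is written back to its original position, the combined list matches what the single loop would have produced, no matter which worker finishes first.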

A False Dichotomy

In reality, this divide is a false dichotomy. Most large real-world applications contain a mix of sequential and parallel code. In fact, every program has some percentage of its instructions that is parallelizable.

For example, let’s say we have a program that runs 20 calculations. The first 10 are Fibonacci numbers that must be calculated in sequence, but the later 10 calculations can be run in parallel. We would say this program is “50% parallelizable” because half the instructions can be done independently. To illustrate this:

def half_parallelizeable():
    # Part 1: Sequential Fibonacci calculation
    a, b = 0, 1
    fibonacci_list = [a, b]
    for _ in range(8):  # Calculate 8 more numbers
        a, b = b, a + b
        fibonacci_list.append(b)

    # Part 2: Each step is independent
    parallel_results = []
    for n in fibonacci_list:
        parallel_results.append(n * 2)

    return fibonacci_list, parallel_results

The first half must be sequential - each Fibonacci number depends on the two numbers before it. But the second half can take that completed list and double each number independently.

You couldn’t calculate the 8th Fibonacci number without first calculating the 6th and 7th numbers, but once you have the full sequence, you could distribute the doubling operations across as many workers as you have available.
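One way to see what "50% parallelizable" buys you is a rough back-of-the-envelope estimate. The sketch below assumes every step costs one unit of time and that sharing work between workers is free; both are simplifications, and the function names are hypothetical. This is the intuition behind what is usually called Amdahl's law.

```python
import math

def estimated_time(sequential_steps, parallel_steps, workers):
    """Time units needed if each step costs one unit and the
    parallel steps are shared evenly among the workers."""
    return sequential_steps + math.ceil(parallel_steps / workers)

def estimated_speedup(sequential_steps, parallel_steps, workers):
    """How much faster than one worker doing every step alone."""
    baseline = sequential_steps + parallel_steps
    return baseline / estimated_time(sequential_steps, parallel_steps, workers)

# The program from the text: 10 sequential steps, 10 parallel steps.
for workers in (1, 2, 10, 1000):
    time_units = estimated_time(10, 10, workers)
    speedup = estimated_speedup(10, 10, workers)
    print(f"{workers:>4} workers: {time_units} time units, {speedup:.2f}x speedup")
```

Under these assumptions, going from 10 workers to 1000 changes nothing: the 10 sequential steps still have to run one after another, so the speedup stays below 2x.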

Different Processors For Different Program Types

Broadly speaking, CPUs are better for sequential programs and GPUs are better for parallel programs. This is because of a fundamental design difference between CPUs and GPUs.

CPUs have a small number of large cores (Apple’s M3 has an 8-core CPU), and GPUs have many small cores (Nvidia’s H100 GPU has thousands of cores).

This is why GPUs are great at running highly parallel programs - they have thousands of simple cores that can perform the same operation on different pieces of data simultaneously.

Rendering video game graphics is an application where many simple repetitive calculations are required. Imagine your video game screen as a giant matrix of pixels. When you suddenly turn your character to the right, all of those pixels need to be recalculated to new color values. Luckily, the calculations for the pixels at the top of the screen are independent of the calculations for the pixels at the bottom of the screen. So the calculations can be split across the many thousands of cores of a GPU. This is why GPUs are so crucial for gaming.
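Here is a toy sketch of that idea: a tiny made-up "screen" where each pixel's value depends only on its own coordinates, so the top and bottom halves can be computed by separate workers. The screen size, the `pixel_color` rule, and the helper names are all invented for illustration, and Python threads stand in for GPU cores; a real renderer is far more involved.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4  # a tiny hypothetical screen

def pixel_color(x, y):
    """Made-up shading rule: brightness (0-255) from this pixel's
    own coordinates only, never from any neighbouring pixel."""
    return (x * 255 // (WIDTH - 1) + y * 255 // (HEIGHT - 1)) // 2

def render_rows(rows):
    """Compute every pixel in the given rows."""
    return [[pixel_color(x, y) for x in range(WIDTH)] for y in rows]

def render_frame():
    top = range(0, HEIGHT // 2)
    bottom = range(HEIGHT // 2, HEIGHT)
    with ThreadPoolExecutor(max_workers=2) as pool:
        # The two halves are rendered independently, then stitched together.
        top_rows, bottom_rows = pool.map(render_rows, [top, bottom])
    return top_rows + bottom_rows

frame = render_frame()
for row in frame:
    print(row)
```

Splitting into two halves is arbitrary here. Since no pixel reads another pixel's result, the same frame could be cut into as many pieces as there are cores.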

CPUs Are Good At Handling Random Events

CPUs are much slower than GPUs at highly parallel tasks like multiplying a matrix of 10,000 independent numbers. However, they excel at sophisticated sequential processing and complex decision-making.

Think of a CPU core as a head chef in a busy restaurant kitchen: someone who can make complex decisions and adapt as the orders keep changing.

In contrast, GPU cores are like a hundred line cooks who excel at repetitive tasks - they can chop an onion in two seconds, but they can’t effectively run the whole kitchen. If you asked a GPU to handle the constantly changing demands of a dinner service, it would struggle.

This is why CPUs are crucial for running your computer’s operating system. Modern computers face a constant stream of unpredictable events: apps starting and stopping, network connections dropping, files being accessed, and users clicking randomly across the screen. The CPU excels at juggling all these tasks while maintaining system responsiveness. It can instantly switch from helping Chrome render a webpage to processing a Zoom video call to handling a new USB device connection - all while keeping track of system resources and ensuring every application gets its fair share of attention.
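As a very rough illustration of why this kind of work is hard to parallelize, here is a toy event handler. It is not how a real operating system is written: the event kinds, the `state` dictionary, and the function names are all invented for this sketch. What it does show is that each event takes a different branch, and the result depends on the order in which events arrive.

```python
from collections import deque

def handle_event(event, state):
    """Update the (made-up) system state for one event.
    Every kind of event takes a different code path."""
    kind = event["kind"]
    if kind == "app_started":
        state["running_apps"].add(event["app"])
    elif kind == "app_stopped":
        state["running_apps"].discard(event["app"])
    elif kind == "network_dropped":
        state["online"] = False
    elif kind == "usb_connected":
        state["devices"].append(event["device"])
    else:
        state["ignored"] += 1  # unknown events are just counted
    return state

def run(events):
    state = {"running_apps": set(), "online": True, "devices": [], "ignored": 0}
    queue = deque(events)
    while queue:
        # Events are handled one at a time, in arrival order.
        handle_event(queue.popleft(), state)
    return state

final_state = run([
    {"kind": "app_started", "app": "Chrome"},
    {"kind": "usb_connected", "device": "keyboard"},
    {"kind": "network_dropped"},
    {"kind": "app_started", "app": "Zoom"},
    {"kind": "app_stopped", "app": "Chrome"},
])
print(final_state)
```

Unlike the pixel or multiplication examples, you can't simply hand half of these events to a friend: whether Chrome ends up running depends on the start event being handled before the stop event.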

So while GPUs excel at parallel processing, CPUs are still essential for their unique ability to handle complex logic and adapt to changing conditions. Modern chips like Apple’s M3 have both: combining CPU flexibility with GPU computing power.

In fact, a more accurate version of the painting video would show the CPU managing the image download and memory allocation, before dispatching the GPU to rapidly render the pixels.
