In this section, we survey GPU system architectures in common use today. We discuss system configurations, GPU functions and services, standard programming interfaces, and a basic GPU internal architecture.

Heterogeneous CPU–GPU System Architecture

A heterogeneous computer system architecture using a GPU and a CPU can be described at a high level by two primary characteristics: first, how many functional subsystems and/or chips are used and what are their interconnection technologies and topology; and second, what memory subsystems are available to these functional subsystems. See Chapter 6 for background on the PC I/O systems and chip sets.

The Historical PC (circa 1990)

Figure C.2.1 shows a high-level block diagram of a legacy PC, circa 1990. The north bridge (see Chapter 6) contains high-bandwidth interfaces, connecting the CPU, memory, and PCI bus. The south bridge contains legacy interfaces and devices: ISA bus (audio, LAN), interrupt controller, DMA controller, and timer/counter. In this system, the display was driven by a simple framebuffer subsystem known as a VGA (video graphics array), which was attached to the PCI bus. Graphics subsystems with built-in processing elements (GPUs) did not exist in the PC landscape of 1990.


Figure C.2.2 illustrates two configurations in common use today. These are characterized by a separate GPU (discrete GPU) and CPU with respective memory subsystems. In Figure C.2.2a, with an Intel CPU, we see the GPU attached via a 16-lane PCI-Express 2.0 link to provide a peak 16 GB/s transfer rate (a peak of 8 GB/s in each direction). Similarly, in Figure C.2.2b, with an AMD CPU, the GPU is attached to the chipset, also via PCI-Express with the same available bandwidth. In both cases, the GPUs and CPUs may access each other’s memory, albeit with less available bandwidth than their access to the more directly attached memories. In the case of the AMD system, the north bridge or memory controller is integrated into the same die as the CPU.
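The quoted link rates can be checked with a little arithmetic. The sketch below derives the 8 GB/s-per-direction figure from the PCI-Express 2.0 per-lane signaling rate and its 8b/10b encoding (the constants are standard PCIe 2.0 parameters):

```python
# Peak bandwidth of a 16-lane PCI-Express 2.0 link (illustrative arithmetic).
LANES = 16
GT_PER_SEC = 5.0             # PCIe 2.0 signaling rate per lane: 5 GT/s
ENCODING_EFFICIENCY = 8 / 10  # 8b/10b encoding: 8 payload bits per 10 transferred

# Payload bits per second per direction, converted to gigabytes per second.
gb_per_sec_per_dir = LANES * GT_PER_SEC * ENCODING_EFFICIENCY / 8

print(gb_per_sec_per_dir)      # 8.0 GB/s in each direction
print(2 * gb_per_sec_per_dir)  # 16.0 GB/s aggregate, matching the text
```

The factor of 8/10 is why a "5 GT/s" lane delivers only 500 MB/s of payload per direction.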

A low-cost variation on these systems, a unified memory architecture (UMA) system, uses only CPU system memory, omitting GPU memory from the system. These systems have relatively low-performance GPUs, since their achieved performance is limited by the available system memory bandwidth and the increased latency of memory access, whereas dedicated GPU memory provides high bandwidth and low latency.
A high-performance system variation uses multiple attached GPUs, typically two to four working in parallel, with their displays daisy-chained. An example is the NVIDIA SLI (scalable link interconnect) multi-GPU system, designed for high-performance gaming and workstations.
The next system category integrates the GPU with the north bridge (Intel) or chipset (AMD) with and without dedicated graphics memory.
Chapter 5 explains how caches maintain coherence in a shared address space. With CPUs and GPUs, there are multiple address spaces. GPUs can access their own physical local memory and the CPU system’s physical memory using virtual addresses that are translated by an MMU on the GPU. The operating system kernel manages the GPU’s page tables. A system physical page can be accessed using either coherent or noncoherent PCI-Express transactions, determined by an attribute in the GPU’s page table. The CPU can access GPU’s local memory through an address range (also called aperture) in the PCI-Express address space.
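As a rough illustration of the translation and attribute lookup just described, the sketch below models a GPU page-table entry whose attribute selects coherent or noncoherent PCI-Express transactions. All field and function names here are hypothetical, not taken from any real GPU's MMU format:

```python
# Hypothetical sketch of a GPU page-table entry and the transaction choice
# it implies; field names are illustrative, not a real hardware format.
from dataclasses import dataclass

SYSTEM_MEMORY = "system"  # page resides in CPU system memory (reached via PCIe)
LOCAL_MEMORY = "local"    # page resides in the GPU's local DRAM

@dataclass
class GpuPageTableEntry:
    physical_page: int  # physical page frame number after translation
    location: str       # SYSTEM_MEMORY or LOCAL_MEMORY
    coherent: bool      # attribute bit: use coherent PCIe transactions?

def classify_access(pte: GpuPageTableEntry) -> str:
    """Decide what kind of memory transaction this page requires."""
    if pte.location == LOCAL_MEMORY:
        return "local DRAM access"
    if pte.coherent:
        return "coherent PCIe transaction"
    return "noncoherent PCIe transaction"
```

For example, a system-memory page whose attribute bit is set would be classified as a coherent PCIe transaction, while the same page with the bit clear would use noncoherent transfers.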

Game Consoles

Console systems such as the Sony PlayStation 3 and the Microsoft Xbox 360 resemble the PC system architectures previously described. Console systems are designed to ship with identical performance and functionality over a lifespan that can last five years or more. During this time, a system may be reimplemented many times to exploit more advanced silicon manufacturing processes and thereby provide constant capability at ever lower cost. Console systems do not need to have their subsystems expanded and upgraded the way PC systems do, so the major internal system buses tend to be customized rather than standardized.

GPU Interfaces and Drivers

In a PC today, GPUs are attached to a CPU via PCI-Express. Earlier generations used AGP. Graphics applications call OpenGL [Segal and Akeley, 2006] or Direct3D [Microsoft DirectX Specification] API functions that use the GPU as a coprocessor. The APIs send commands, programs, and data to the GPU via a graphics device driver optimized for the particular GPU.
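The API-to-driver-to-GPU flow can be pictured as the driver translating each API call into device-specific commands accumulated in a buffer that the GPU later consumes. The sketch below is purely illustrative: the class and command names are invented, not part of OpenGL, Direct3D, or any real driver:

```python
# Illustrative model of a graphics driver translating API calls into a
# command buffer for the GPU. All names here are hypothetical.
class GraphicsDriver:
    def __init__(self):
        self.command_buffer = []  # commands queued for the GPU

    def draw(self, primitive, vertex_count):
        # An API draw call ultimately becomes device-specific commands.
        self.command_buffer.append(("SET_PRIMITIVE", primitive))
        self.command_buffer.append(("DRAW", vertex_count))

    def flush(self):
        # Submit the accumulated commands to the GPU and reset the buffer.
        submitted, self.command_buffer = self.command_buffer, []
        return submitted
```

Batching many commands per submission, as modeled by `flush`, is one reason a driver tuned to a particular GPU matters for performance.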

AGP: An extended version of the original PCI I/O bus that provided up to eight times the bandwidth of the original PCI bus to a single card slot. Its primary purpose was to connect graphics subsystems into PC systems.

Graphics Logical Pipeline

The graphics logical pipeline is described in Section C.3. Figure C.2.3 illustrates the major processing stages, and highlights the important programmable stages (vertex, geometry, and pixel shader stages).
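As a schematic view of Figure C.2.3, the sketch below strings the programmable stages (vertex, geometry, and pixel shaders) and the fixed-function rasterization step together as composed functions. Each stage body is a trivial stand-in, not a real shader:

```python
# Schematic sketch of the logical graphics pipeline; each stage is a
# placeholder standing in for a programmable shader or fixed-function unit.
def vertex_shader(vertex):
    # Programmable: transform one vertex (identity stand-in here).
    return vertex

def geometry_shader(primitive):
    # Programmable: may emit zero or more primitives per input primitive.
    return [primitive]

def rasterize(primitive):
    # Fixed function: convert a primitive into pixel fragments
    # (stand-in: one fragment per vertex).
    return list(primitive)

def pixel_shader(fragment):
    # Programmable: compute a color for one fragment (stand-in tag).
    return (fragment, "shaded")

def logical_pipeline(vertices):
    transformed = [vertex_shader(v) for v in vertices]
    fragments = []
    for prim in geometry_shader(tuple(transformed)):
        fragments.extend(rasterize(prim))
    return [pixel_shader(f) for f in fragments]
```

Feeding three vertices through `logical_pipeline` yields three shaded fragments, mirroring how data flows stage to stage in the figure.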

Mapping Graphics Pipeline to Unified GPU Processors

Figure C.2.4 shows how the logical pipeline comprising separate independent programmable stages is mapped onto a physical distributed array of processors.

Basic Unified GPU Architecture

Unified GPU architectures are based on a parallel array of many programmable processors. They unify vertex, geometry, and pixel shader processing and parallel computing on the same processors, unlike earlier GPUs, which had separate processors dedicated to each processing type. The programmable processor array is tightly integrated with fixed-function processors for texture filtering, rasterization, raster operations, anti-aliasing, compression, decompression, display, video decoding, and high-definition video processing. Although the fixed-function processors significantly outperform more general programmable processors in terms of absolute performance constrained by an area, cost, or power budget, we will focus on the programmable processors here.

Compared with multicore CPUs, manycore GPUs have a different architectural design point, one focused on executing many parallel threads efficiently on many processor cores. By using many simpler cores and optimizing for data-parallel behavior among groups of threads, more of the per-chip transistor budget is devoted to computation, and less to on-chip caches and overhead.
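The data-parallel style this design point targets can be sketched as one simple "thread" per array element, all executing the same function. The launch below only emulates that idea with a loop; a real GPU runs such threads concurrently on its hardware cores:

```python
# Sketch of the data-parallel execution style GPUs optimize for, using a
# SAXPY-style computation (a*x + y) as the per-thread work.
def saxpy_element(i, a, x, y):
    # One "thread": compute a single output element.
    return a * x[i] + y[i]

def launch_data_parallel(n, a, x, y):
    # Conceptually, n threads run saxpy_element in parallel, one per index;
    # here the parallelism is only emulated sequentially.
    return [saxpy_element(i, a, x, y) for i in range(n)]
```

Because every thread runs the same code on its own element, little per-thread control hardware is needed, which is exactly how the transistor budget shifts toward computation.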

Processor Array

A unified GPU processor array contains many processor cores, typically organized into multithreaded multiprocessors. Figure C.2.5 shows a GPU with an array of 112 streaming processor (SP) cores, organized as 14 multithreaded streaming multiprocessors (SMs). Each SP core is highly multithreaded, managing 96 concurrent threads and their state in hardware. The processors connect with four 64-bit-wide DRAM partitions via an interconnection network. Each SM has eight SP cores, two special function units (SFUs), instruction and constant caches, a multithreaded instruction unit, and a shared memory. This is the basic Tesla architecture implemented by the NVIDIA GeForce 8800. It has a unified architecture in which the traditional graphics programs for vertex, geometry, and pixel shading run on the unified SMs and their SP cores, and computing programs run on the same processors.
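The totals implied by this configuration follow from simple arithmetic:

```python
# Arithmetic behind the Figure C.2.5 configuration described above.
SMS = 14                   # streaming multiprocessors
SP_PER_SM = 8              # streaming processor cores per SM
THREADS_PER_SP = 96        # hardware-managed threads per SP core
DRAM_PARTITIONS = 4
DRAM_PARTITION_WIDTH = 64  # bits per DRAM partition

sp_cores = SMS * SP_PER_SM
concurrent_threads = sp_cores * THREADS_PER_SP
dram_bus_width = DRAM_PARTITIONS * DRAM_PARTITION_WIDTH

print(sp_cores)            # 112 SP cores, as in the text
print(concurrent_threads)  # 10752 threads in flight across the GPU
print(dram_bus_width)      # 256-bit aggregate DRAM interface
```

Over ten thousand hardware-resident threads is what lets the GPU hide DRAM latency with computation rather than large caches.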

The processor array architecture is scalable to smaller and larger GPU configurations by scaling the number of multiprocessors and the number of memory partitions. Figure C.2.5 shows seven clusters of two SMs sharing a texture unit and a texture L1 cache. The texture unit delivers filtered results to the SM given a set of coordinates into a texture map. Because filter regions of support often overlap for successive texture requests, a small streaming L1 texture cache is effective in reducing the number of requests to the memory system. The processor array connects with raster operation processors (ROPs), L2 texture caches, external DRAM memories, and system memory via a GPU-wide interconnection network. The number of processors and number of memories can scale to design balanced GPU systems for different performance and market segments.