CPU


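The central processing unit (CPU), or simply the processor, is one of the main components of an electronic computer. Its chief function is to interpret computer instructions and to process the data in computer software. The CPU gives a computer design its basic digital computing capability. The CPU, storage devices, and input/output devices are the three core components of a modern microcomputer. A CPU manufactured as an integrated circuit is usually called a microprocessor. Since the mid-1970s, single-chip microprocessors have almost entirely replaced all other types of CPU, and today the term "CPU" is applied almost exclusively to microprocessors.

The name "central processing unit" conventionally describes a class of logic machines that can execute complex computer programs. This broad definition easily covers many early computers that existed well before the term "CPU" came into widespread use. In any case, the term and its abbreviation have been in wide use in the computer industry since at least the early 1960s (Weik 1961). Although the physical form, design, manufacture, and specific tasks of CPUs have changed dramatically since the earliest examples, their fundamental operating principles have remained the same.

Early CPUs were usually custom-built for large, application-specific computers. This costly practice of designing a custom CPU for a particular application has largely given way to the development of inexpensive, standardized classes of processors suited to one or many purposes. The standardization trend began in the era of discrete-transistor mainframes and minicomputers and accelerated with the arrival of the integrated circuit (IC). The IC allows increasingly complex CPUs to be designed and manufactured in very small spaces (on the order of micrometres). Both the standardization and the miniaturization of CPUs have made these digital devices far more common in modern life than the limited applications of dedicated computing machines would suggest. Modern microprocessors appear in everything from automobiles and mobile phones to children's toys.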

History


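Main article: History of computing hardware

Before microelectronics had advanced far enough to build anything resembling today's CPUs, computers such as ENIAC (Electronic Numerical Integrator and Computer, the first general-purpose electronic computer, completed in the United States in 1946) had to be physically rewired to perform different tasks. These machines are often called "fixed-program computers" because they had to be reconfigured to run a different program. Because the CPU is generally defined as the device that executes software (computer programs), the earliest devices that could rightly be called CPUs came with the advent of the stored-program computer.

The idea of a stored-program computer was already present during the design of ENIAC, but it was initially set aside so that the machine could be finished sooner. On June 30, 1945, before ENIAC was completed, the mathematician John von Neumann distributed the paper entitled "First Draft of a Report on the EDVAC." It outlined the design of a stored-program computer that would eventually be completed in August 1949 (von Neumann 1945). EDVAC was designed to perform a certain number of instructions (or operations) of various types. These instructions could be combined to create useful programs for EDVAC to run. Significantly, the programs written for EDVAC were stored in high-speed computer memory rather than specified by the physical wiring of the computer. This overcame a severe limitation of ENIAC, namely the large amount of time and effort needed to reconfigure the computer for a new task. With von Neumann's design, the program, or software, that EDVAC ran could be changed simply by changing the contents of the computer's memory.

It is worth noting that although von Neumann is most often credited with the stored-program computer because of his design of EDVAC, others before him, such as Konrad Zuse, had suggested similar ideas. In addition, the so-called Harvard architecture of the Harvard Mark I, which was completed before EDVAC, also used a stored-program design, with punched paper tape rather than electronic memory. The key difference between the von Neumann and Harvard architectures is that the latter separates the storage and treatment of CPU instructions and data, while the former uses the same memory space for both. Most modern CPUs are primarily von Neumann in design, but elements of the Harvard architecture are commonly seen as well.

As digital devices, all CPUs deal with discrete states and therefore require some kind of switching element to differentiate between and change these states. Before the commercial acceptance of the transistor, electrical relays and vacuum tubes (thermionic valves) were the common switching elements. Although these had distinct speed advantages over earlier, purely mechanical designs, they were unreliable for various reasons. For example, building direct current sequential logic circuits out of relays requires additional hardware to cope with contact bounce. Vacuum tubes do not suffer from contact bounce, but they must heat up before becoming operational and have short service lives. Often, when a single tube failed, the whole CPU had to be checked to locate the failed component so that it could be replaced. Early electronic (vacuum tube based) computers were therefore generally faster but less reliable than electromechanical (relay based) computers. Tube computers like EDVAC tended to average eight hours between failures, whereas relay computers like the (slower, but earlier) Harvard Mark I failed very rarely (Weik 1961:238). In the end, tube based CPUs became dominant because their significant speed advantage outweighed the reliability problems. Most of these early synchronous CPUs ran at low clock rates compared with modern microelectronic designs (see below for a discussion of clock rate). Clock signal frequencies ranging from 100 kHz to 4 MHz were common at the time, limited largely by the speed of the switching devices the CPUs were built with.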

Discrete transistor and IC CPUs


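The design complexity of CPUs increased as various technologies made it possible to build smaller and more reliable electronic devices. The first such improvement came with the advent of the transistor. Transistorized CPUs of the 1950s and 1960s no longer had to be built from bulky, unreliable, and fragile switching elements like vacuum tubes and electrical relays. With this improvement, more complex and more reliable CPUs were built from discrete (individual) components on one or several printed circuit boards.

During the same period, a method of manufacturing many transistors in a compact space gained popularity. The integrated circuit (IC) allowed a large number of transistors to be manufactured on a single semiconductor "chip." At first, only very basic, general-purpose digital circuits such as NOR gates were miniaturized into ICs. CPUs based on these "building block" ICs are generally referred to as "small-scale integration" (SSI) devices. An individual SSI IC, such as those used in the Apollo guidance computer during the American Apollo Moon program of the 1960s and 1970s, usually contained only tens of transistors. Building an entire CPU out of SSI ICs required thousands of individual chips, but it still took far less space and power than earlier discrete transistor designs. As microelectronic technology advanced, more and more transistors were placed on each IC, which reduced the number of ICs needed for a complete CPU. MSI and LSI (medium- and large-scale integration) ICs increased transistor counts to hundreds and then thousands.

In 1964 IBM introduced its System/360 computer architecture. Computers built on this architecture differed in speed and performance but could run the same programs. This was a significant breakthrough at a time when even computers made by the same manufacturer were often incompatible in software. To make this possible, IBM applied the concept of a microprogram (often called "microcode"), which is still widely used in modern CPUs (Amdahl et al. 1964). The System/360 was so successful that it dominated the mainframe computer market for decades, and its legacy is continued by later architectures such as the IBM zSeries. In the same year (1964), Digital Equipment Corporation (DEC) introduced another influential computer aimed at the scientific and research markets, the PDP-8. DEC later introduced the extremely popular PDP-11 line. Early PDP-11 models were built with SSI ICs, and later ones moved to LSI components once these became practical. In contrast to its SSI and MSI predecessors, the first LSI implementation of the PDP-11 had a CPU composed of only four LSI integrated circuits (Digital Equipment Corporation 1975).

Transistor-based computers had several distinct advantages over their predecessors. Aside from higher reliability and lower power consumption, transistors allowed CPUs to run at much higher speeds, because a transistor switches far faster than a tube or relay. Thanks to the increased reliability and the much faster switching elements (which by this time were almost exclusively transistors), CPU clock rates in the tens of megahertz were reached during this period. In addition, while discrete transistor and IC CPUs were in heavy use, new high-performance designs such as SIMD (single instruction, multiple data) vector processors began to appear. These early experimental designs laid the groundwork for the later era of specialized supercomputers such as those made by Cray Inc.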

Microprocessors

Die of an Intel 80486DX2 microprocessor (actual size: 12×6.75 mm)

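The introduction of the microprocessor in the 1970s significantly affected the design and implementation of CPUs. Since the introduction of the first microprocessor (the Intel 4004) in 1971 and the first widely used microprocessor (the Intel 8080) in 1974, this class of CPUs has almost completely overtaken all other CPU implementation methods. Combined with the rise of the personal computer, this has led to the term "CPU" being applied almost exclusively to microprocessors over the past few decades.

Previous generations of CPUs were implemented as discrete components and numerous small integrated circuits (ICs) on one or more circuit boards. A microprocessor, by contrast, is a CPU manufactured on a very small number of ICs, usually just one. The smaller overall CPU size that results from implementation on a single die means faster switching time, because of physical factors such as reduced gate parasitic capacitance. This has allowed synchronous microprocessors to reach clock rates ranging from tens of megahertz to several gigahertz. In addition, as the ability to build ever smaller transistors on an IC has improved, the complexity and number of transistors in a single CPU have increased dramatically. This widely observed trend is described by Moore's law, which has so far proven a fairly accurate predictor of the growth of CPU (and other IC) complexity.

Although the complexity, size, construction, and general form of CPUs have changed drastically over the past sixty years, it is notable that the basic design and function have not changed much at all. Almost all common CPUs today can be accurately described as von Neumann stored-program machines.

As Moore's law continues to hold, concerns have arisen about the limits of integrated circuit transistor technology. Extreme miniaturization of electronic gates makes the effects of phenomena such as electromigration and subthreshold leakage much more significant. These newer concerns are among the many factors leading researchers to investigate new methods of computing such as the quantum computer, as well as to expand the use of parallelism and other methods that extend the usefulness of the classical von Neumann model.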

CPU operation

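The fundamental operation of most CPUs, regardless of the physical form they take, is to execute a sequence of stored instructions called a program. Discussed here are devices that conform to the common von Neumann architecture. The program is represented by a series of numbers kept in some kind of computer memory. Nearly all von Neumann CPUs operate in four steps: fetch, decode, execute, and writeback.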

The first step, fetch, involves retrieving an instruction (which is represented by a number or sequence of numbers) from program memory. The location in program memory is determined by a program counter (PC), which stores a number that identifies the current position in the program. In other words, the program counter keeps track of the CPU's place in the current program. After an instruction is fetched, the PC is incremented by the length of the instruction word in terms of memory units. Often the instruction to be fetched must be retrieved from relatively slow memory, causing the CPU to stall while waiting for the instruction to be returned. This issue is largely addressed in modern processors by caches and pipeline architectures (see below). The instruction that the CPU fetches from memory is used to determine what the CPU is to do. In the decode step, the instruction is broken up into parts that have significance to other portions of the CPU. The way in which the numerical instruction value is interpreted is defined by the CPU's instruction set architecture (ISA). Often, one group of numbers in the instruction, called the opcode, indicates which operation to perform. The remaining parts of the number usually provide information required for that instruction, such as operands for an addition operation. Such operands may be given as a constant value (called an immediate value), or as a place to locate a value: a register or a memory address, as determined by some addressing mode. In older designs the portions of the CPU responsible for instruction decoding were unchangeable hardware devices. However, in more abstract and complicated CPUs and ISAs, a microprogram is often used to assist in translating instructions into various configuration signals for the CPU. This microprogram is sometimes rewritable so that it can be modified to change the way the CPU decodes instructions even after it has been manufactured.
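The decode step described above can be illustrated with a small sketch. The 16-bit instruction layout (4-bit opcode, 4-bit register number, 8-bit immediate value) and the opcode names are hypothetical, chosen only for illustration; each real ISA defines its own encoding.

```python
# Hypothetical 16-bit instruction format (an assumption, not a real ISA):
#   bits 15-12: opcode, bits 11-8: register number, bits 7-0: immediate value
OPCODES = {0x1: "LOAD_IMM", 0x2: "ADD_IMM", 0x3: "JUMP"}  # made-up opcode table


def decode(word):
    """Split a 16-bit instruction word into (mnemonic, register, immediate)."""
    opcode = (word >> 12) & 0xF   # top four bits select the operation
    reg = (word >> 8) & 0xF       # next four bits name a register operand
    imm = word & 0xFF             # low eight bits carry an immediate operand
    return OPCODES.get(opcode, "UNKNOWN"), reg, imm


if __name__ == "__main__":
    print(decode(0x2305))
```

Here the word 0x2305 splits into opcode 2, register 3, and immediate 5. In hardware the same field extraction is done by wiring or by a microprogram rather than by shifts and masks.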

After the fetch and decode steps, the execute step is performed. During this step, various portions of the CPU are connected so they can perform the desired operation. If, for instance, an addition operation was requested, an arithmetic logic unit (ALU) will be connected to a set of inputs and a set of outputs. The inputs provide the numbers to be added, and the outputs will contain the final sum. The ALU contains the circuitry to perform simple arithmetic and logical operations on the inputs (like addition and bitwise operations). If the addition operation produces a result too large for the CPU to handle, an arithmetic overflow flag in a flags register may also be set (see the discussion of integer precision below).

The final step, writeback, simply "writes back" the results of the execute step to some form of memory. Very often the results are written to some internal CPU register for quick access by subsequent instructions. In other cases results may be written to slower, but cheaper and larger, main memory. Some types of instructions manipulate the program counter rather than directly produce result data. These are generally called "jumps" and facilitate behavior like loops, conditional program execution (through the use of a conditional jump), and functions in programs. Many instructions will also change the state of digits in a "flags" register. These flags can be used to influence how a program behaves, since they often indicate the outcome of various operations. For example, one type of "compare" instruction considers two values and sets a number in the flags register according to which one is greater. This flag could then be used by a later jump instruction to determine program flow.

After the execution of the instruction and writeback of the resulting data, the entire process repeats, with the next instruction cycle normally fetching the next-in-sequence instruction because of the incremented value in the program counter.
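The four steps can be tied together in a toy simulator. This is a minimal sketch under stated assumptions: the instruction names (LOADI, ADD, CMP, JLT, HALT), the four 8-bit registers, and the two flags are all invented for illustration and do not correspond to any real CPU.

```python
def run(program, max_steps=1000):
    """Run a toy fetch/decode/execute/writeback loop; return (registers, flags)."""
    regs = [0, 0, 0, 0]                         # four hypothetical 8-bit registers
    flags = {"overflow": False, "less": False}  # hypothetical flags register
    pc = 0                                      # program counter
    for _ in range(max_steps):
        instr = program[pc]                     # fetch the instruction at the PC
        pc += 1                                 # PC now names the next-in-sequence instruction
        op, *args = instr                       # decode into operation and operands
        if op == "HALT":
            break
        elif op == "LOADI":                     # load an immediate value
            rd, value = args
            regs[rd] = value & 0xFF             # writeback to a register
        elif op == "ADD":
            rd, rs = args
            total = regs[rd] + regs[rs]         # execute: the ALU adds its two inputs
            flags["overflow"] = total > 0xFF    # result too large for 8 bits
            regs[rd] = total & 0xFF             # writeback the truncated sum
        elif op == "CMP":                       # compare: only the flags change
            ra, rb = args
            flags["less"] = regs[ra] < regs[rb]
        elif op == "JLT":                       # conditional jump: rewrites the PC
            (target,) = args
            if flags["less"]:
                pc = target
        else:
            raise ValueError("unknown operation: %r" % (op,))
    return regs, flags


# Count register 0 up from 0 to 5 using a compare and a conditional jump.
COUNT_TO_FIVE = [
    ("LOADI", 0, 0),   # 0: r0 = 0 (counter)
    ("LOADI", 1, 1),   # 1: r1 = 1 (step)
    ("LOADI", 2, 5),   # 2: r2 = 5 (limit)
    ("ADD", 0, 1),     # 3: r0 = r0 + r1
    ("CMP", 0, 2),     # 4: set "less" if r0 < r2
    ("JLT", 3),        # 5: loop back to instruction 3 while r0 < r2
    ("HALT",),         # 6: stop
]

if __name__ == "__main__":
    print(run(COUNT_TO_FIVE))
```

The loop in COUNT_TO_FIVE exists only because the JLT instruction overwrites the program counter; without it, execution would simply fall through to the next instruction in sequence.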
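If the completed instruction was a jump, the program counter will be modified to contain the address of the instruction that was jumped to, and program execution continues normally. In more complex CPUs than the one described here, multiple instructions can be fetched, decoded, and executed simultaneously. This section describes what is generally referred to as the "Classic RISC pipeline," which in fact is quite common among the simple CPUs used in many electronic devices (often called microcontrollers).

Central processing unit (CPU)

The CPU is the heart of a computer system. The rapid development of computers, and of microcomputers in particular, is in essence the story of the CPU evolving from low-end to high-end and from simple to complex.

1. The concept of the CPU

The CPU (Central Processing Unit) mainly performs arithmetic and logic operations. Its internal structure can be roughly divided into a control unit, an arithmetic logic unit, and storage units. By the word length of the data they process, CPUs can be classified as 8-bit, 16-bit, 32-bit, and 64-bit microprocessors, and so on.

2. Main CPU performance figures

Core clock: the clock frequency at which the CPU core operates, usually measured in megahertz (MHz). This is the parameter people care about most when using or buying a computer; figures such as 133, 166, and 450 refer to it. For CPUs of the same type, a higher core clock means a faster CPU and higher overall system performance.

External clock and multiplier: the external clock is the CPU's external clock frequency, which is supplied by the motherboard. The relationship is: core clock = external clock × multiplier.

Internal cache: built from very fast SRAM, it temporarily holds the instructions and data the CPU has used most recently and is accessed at the CPU's core clock speed. Its capacity is usually measured in KB. When it runs at full speed, a larger cache makes it more likely that the most frequently used data and results reach the CPU quickly, and it reduces the number of exchanges with the slower external cache and main memory, which raises the computer's effective speed.

Address bus width: this determines the physical address space the CPU can access, in other words how much memory the CPU can use.

Multimedia extensions (MMX): MMX is a technology Intel introduced to improve the Pentium CPU in audio, video, graphics, and communications applications. It added 57 new MMX instructions to the CPU. When running programs that use MMX instructions, a CPU with MMX handles multimedia about 60% faster than an ordinary CPU, and even programs that do not use MMX instructions gain about 15% in performance.

Microprocessors have changed our lives in many ways; things now taken for granted were once hard to imagine. In the 1960s computers were large enough to fill a room, and only a few people could use them. The adoption of the integrated circuit in the 1960s made it possible to miniaturize circuits onto a single piece of silicon and laid the foundation for the microprocessor. For the foreseeable future, CPU processing power is expected to keep growing rapidly, and miniaturization and integration will remain the trend, while products will separate into different tiers, including special-purpose processors.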

Design and implementation


The way a CPU represents numbers is a design choice that affects the most basic ways in which the device functions. Some early digital computers used an electrical model of the common decimal (base ten) numeral system to represent numbers internally. A few other computers have used more exotic numeral systems like ternary (base three). Nearly all modern CPUs represent numbers in binary form, with each digit being represented by some two-valued physical quantity such as a "high" or "low" voltage.
MOS 6502 microprocessor in a dual in-line package, an extremely popular 8-bit design.

Related to number representation is the size and precision of numbers that a CPU can represent. In the case of a binary CPU, a bit refers to one significant place in the numbers a CPU deals with. The number of bits (or numeral places) a CPU uses to represent numbers is often called "word size", "bit width", "data path width", or "integer precision" when dealing with strictly integer numbers (as opposed to floating point). This number differs between architectures, and often within different parts of the very same CPU. For example, an 8-bit CPU deals with a range of numbers that can be represented by eight binary digits (each digit having two possible values), that is, 2^8 or 256 discrete numbers. In effect, integer precision sets a hardware limit on the range of integers the software run by the CPU can utilize. Integer precision can also affect the number of locations in memory the CPU can address (locate). For example, if a binary CPU uses 32 bits to represent a memory address, and each memory address represents one octet (8 bits), the maximum quantity of memory that CPU can address is 2^32 octets, or 4 GiB. This is a very simple view of CPU address space, and many modern designs use much more complex addressing methods like paging in order to locate more memory with the same integer precision. Higher levels of integer precision require more structures to deal with the additional digits, and therefore more complexity, size, power usage, and generally expense. It is not at all uncommon, therefore, to see 4- or 8-bit microcontrollers used in modern applications, even though CPUs with much higher precision (such as 16, 32, 64, even 128-bit) are available. The simpler microcontrollers are usually cheaper, use less power, and therefore dissipate less heat, all of which can be major design considerations for electronic devices.
However, in higher-end applications, the benefits afforded by the extra precision (most often the additional address space) are more significant and often affect design choices. To gain some of the advantages afforded by both lower and higher bit precisions, many CPUs are designed with different bit widths for different portions of the device. For example, the IBM System/370 used a CPU that was primarily 32 bit, but it used 128-bit precision inside its floating point units to facilitate greater accuracy and range in floating point numbers (Amdahl et al. 1964). Many later CPU designs use similar mixed bit width, especially when the processor is meant for general-purpose usage where a reasonable balance of integer and floating point capability is required.
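The two figures quoted in this section, 256 values for an 8-bit word and 4 GiB for a 32-bit address, follow directly from powers of two. The helper names below are hypothetical and serve only to work the arithmetic through.

```python
def distinct_values(bits):
    """Number of distinct values an unsigned integer of the given width can hold."""
    return 2 ** bits


def addressable_gib(address_bits, octets_per_address=1):
    """Flat address space in GiB, assuming each address names a fixed number of octets."""
    return (2 ** address_bits) * octets_per_address / 2 ** 30


if __name__ == "__main__":
    print(distinct_values(8))     # values representable by an 8-bit word
    print(addressable_gib(32))    # GiB reachable with a 32-bit octet address
```

This is the same simple flat view of the address space used in the text; it deliberately ignores paging and other schemes that reach more memory with the same address width.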

Clock rate


Main article: Clock rate

Most CPUs, and indeed most sequential logic devices, are synchronous in nature. That is, they are designed and operate on assumptions about a synchronization signal. This signal, known as a clock signal, usually takes the form of a periodic square wave. By calculating the maximum time that electrical signals can move in various branches of a CPU's many circuits, the designers can select an appropriate period for the clock signal. This period must be longer than the amount of time it takes for a signal to move, or propagate, in the worst-case scenario. In setting the clock period to a value well above the worst-case propagation delay, it is possible to design the entire CPU and the way it moves data around the "edges" of the rising and falling clock signal. This has the advantage of simplifying the CPU significantly, both from a design perspective and a component-count perspective. However, it also carries the disadvantage that the entire CPU must wait on its slowest elements, even though some portions of it are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism (see below).

Architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs, however. For example, a clock signal is subject to the delays of any other electrical signal. Higher clock rates in increasingly complex CPUs make it more difficult to keep the clock signal in phase (synchronized) throughout the entire unit. This has led many modern CPUs to require multiple identical clock signals to be provided in order to avoid delaying a single signal significantly enough to cause the CPU to malfunction. Another major issue as clock rates increase dramatically is the amount of heat that is dissipated by the CPU. The constantly changing clock causes many components to switch regardless of whether they are being used at that time.
In general, a component that is switching uses more energy than an element in a static state. Therefore, as clock rate increases, so does heat dissipation, causing the CPU to require more effective cooling solutions. One method of dealing with the switching of unneeded components is called clock gating, which involves turning off the clock signal to unneeded components (effectively disabling them). However, this is often regarded as difficult to implement and therefore does not see common usage outside of very low-power designs.

Another method of addressing some of the problems with a global clock signal is the removal of the clock signal altogether. While removing the global clock signal makes the design process considerably more complex in many ways, asynchronous (or clockless) designs carry marked advantages in power consumption and heat dissipation in comparison with similar synchronous designs. While somewhat uncommon, entire CPUs have been built without utilizing a global clock signal. Two notable examples of this are the ARM-compliant AMULET and the MIPS R3000-compatible MiniMIPS. Rather than totally removing the clock signal, some CPU designs allow certain portions of the device to be asynchronous, such as using asynchronous ALUs in conjunction with superscalar pipelining to achieve some arithmetic performance gains. While it is not altogether clear whether totally asynchronous designs can perform at a comparable or better level than their synchronous counterparts, it is evident that they do at least excel in simpler math operations. This, combined with their excellent power consumption and heat dissipation properties, makes them very suitable for embedded computers (Garside et al. 1999).
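The rule from the start of this section, that the clock period must exceed the worst-case propagation delay, can be turned into a small calculation. The path delays and the safety margin below are made-up numbers used only to illustrate the relationship; real designs derive them from timing analysis.

```python
def max_clock_mhz(path_delays_ns, margin=1.0):
    """Upper bound on clock rate (MHz) for the given signal path delays.

    The period is set to margin times the slowest path, because the whole
    CPU must wait on its slowest element. With margin=1.0 the result is the
    theoretical limit; the period must in practice be longer than that.
    """
    worst_case_ns = max(path_delays_ns)   # slowest path dictates the period
    period_ns = worst_case_ns * margin
    return 1000.0 / period_ns             # a 1 ns period corresponds to 1000 MHz


if __name__ == "__main__":
    delays = [2.0, 3.5, 5.0]              # hypothetical path delays in nanoseconds
    print(max_clock_mhz(delays))
    print(max_clock_mhz(delays, margin=1.25))
```

Note that the two faster paths (2.0 ns and 3.5 ns) have no effect on the result, which is the disadvantage of a single global clock that the text describes.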

Parallelism

Model of a subscalar CPU. Notice that it takes fifteen cycles to complete three instructions.

Main article: Parallel computing

The description of the basic operation of a CPU offered in the previous section describes the simplest form that a CPU can take. This type of CPU, usually referred to as subscalar, operates on and executes one instruction on one or two pieces of data at a time. This process gives rise to an inherent inefficiency in subscalar CPUs. Since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result the subscalar CPU gets "hung up" on instructions which take more than one clock cycle to complete execution. Even adding a second execution unit (see below) does not improve performance much; rather than one pathway being hung up, now two pathways are hung up and the number of unused transistors is increased. This design, wherein the CPU's execution resources can operate on only one instruction at a time, can only possibly reach scalar performance (one instruction per clock). However, the performance is nearly always subscalar (less than one instruction per cycle).

Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can execute simultaneously. The two methodologies differ both in how they are implemented and in how effectively they increase the CPU's performance for a given application.

Instruction level parallelism (ILP): instruction pipelining and superscalar architecture
Basic five-stage pipeline.  In the best case scenario, this pipeline can sustain a completion rate of one instruction per cycle.

Main articles: Instruction pipelining, Superscalar

One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing. This is the simplest form of a technique known as instruction pipelining, and is utilized in almost all modern general-purpose CPUs. Pipelining allows more than one instruction to be executed at any given time by breaking down the execution pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired. Pipelining does, however, introduce the possibility for a situation where the result of the previous operation is needed to complete the next operation; a condition often termed data dependency conflict. To cope with this, additional care must be taken to check for these sorts of conditions and delay a portion of the instruction pipeline if this occurs. Naturally, accomplishing this requires additional circuitry, so pipelined processors are more complex than subscalar ones (though not very significantly so). A pipelined processor can become very nearly scalar, inhibited only by pipeline stalls (an instruction spending more than one clock cycle in a stage).
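The benefit of pipelining can be checked with simple cycle counting. The sketch below assumes an idealized five-stage design in which every stage takes exactly one cycle; the function names and the stall_cycles parameter are hypothetical.

```python
def subscalar_cycles(n_instructions, n_stages=5):
    """Cycles when each instruction must finish all stages before the next starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages=5, stall_cycles=0):
    """Cycles for an ideal pipeline: fill it once, then retire one instruction per cycle.

    stall_cycles adds any cycles lost to pipeline stalls, for example those
    caused by data dependency conflicts.
    """
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1) + stall_cycles


if __name__ == "__main__":
    print(subscalar_cycles(3))    # three instructions, no overlap
    print(pipelined_cycles(3))    # same three instructions, overlapped
```

For three instructions the non-pipelined model needs fifteen cycles, matching the subscalar figure in the previous section, while the ideal pipeline needs seven. As the instruction count grows, the pipelined rate approaches one instruction per cycle, and each stall moves it further below that limit.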
Simple superscalar pipeline.  By fetching and dispatching two instructions at a time, a maximum of two instructions per cycle can be completed.

Further improvement upon the idea of instruction pipelining led to the development of a method that decreases the idle time of CPU components even further. Designs that are said to be superscalar include a long instruction pipeline and multiple identical execution units. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously). If so they are dispatched to available execution units, resulting in the ability for several instructions to be executed simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given cycle. Most of the difficulty in the design of a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible. This requires that the instruction pipeline is filled as often as possible and gives rise to the need in superscalar architectures for significant amounts of CPU cache. It also makes hazard-avoiding techniques like branch prediction, speculative execution, and out-of-order execution crucial to maintaining high levels of performance. By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times that the entire pipeline must wait until a conditional instruction is completed. Speculative execution often provides modest performance increases by executing portions of code that may or may not be needed after a conditional operation completes. Out-of-order execution somewhat rearranges the order in which instructions are executed to reduce delays due to data dependencies. 
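The dispatcher's core decision, whether two instructions can execute in parallel, can be sketched as a dependency check. This is a deliberately simplified rule and not how any particular CPU implements dispatch: instructions are modeled as a hypothetical (destination, sources) pair of register names, and only two kinds of conflict are checked.

```python
def can_dual_issue(first, second):
    """Return True if `second` can be dispatched in the same cycle as `first`.

    Each instruction is a (destination_register, source_registers) pair.
    Simplified rule (an assumption of this sketch): the pair conflicts when
    the second instruction reads the first one's result, or when both write
    the same register.
    """
    dest1, _sources1 = first
    dest2, sources2 = second
    reads_result = dest1 in sources2   # second needs a value not yet produced
    same_target = dest1 == dest2       # both would write the same register
    return not (reads_result or same_target)


if __name__ == "__main__":
    add = ("r1", ("r2", "r3"))         # r1 = r2 + r3
    independent = ("r4", ("r5", "r6")) # r4 = r5 + r6
    dependent = ("r4", ("r1", "r5"))   # r4 = r1 + r5, needs the result of `add`
    print(can_dual_issue(add, independent))
    print(can_dual_issue(add, dependent))
```

A real dispatcher must make this decision for several instructions in every cycle and must also account for which execution units are free, which is a large part of why the text calls the dispatcher the hard part of a superscalar design.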
In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The original Intel Pentium (P5) had two superscalar ALUs which could accept one instruction per clock each, but its FPU could not accept one instruction per clock. Thus the P5 was integer superscalar but not floating point superscalar. Intel's successor to the Pentium architecture, P6, added superscalar capabilities to its floating point features, and therefore afforded a significant increase in floating point instruction performance. Both simple pipelining and superscalar design increase a CPU's ILP by allowing a single processor to complete execution of instructions at rates surpassing one instruction per cycle (IPC). Most modern CPU designs are at least somewhat superscalar, and nearly all general purpose CPUs designed in the last decade are superscalar. In later years some of the emphasis in designing high-ILP computers has been moved out of the CPU's hardware and into its software interface, or ISA. The strategy of the very long instruction word (VLIW) causes some ILP to become implied directly by the software, reducing the amount of work the CPU must perform to boost ILP and thereby reducing the design's complexity.

Thread level parallelism (TLP)

Another strategy commonly used to increase the parallelism of CPUs is to include the ability to run multiple threads (programs) at the same time. In general, high-TLP CPUs have been in use much longer than high-ILP ones. Many of the designs pioneered by Cray during the late 1970s and 1980s concentrated on TLP as their primary method of enabling enormous (for the time) computing capability. In fact, TLP in the form of multiple thread execution improvements has been in use since as early as the 1950s (Smotherman 2005). In the context of single processor design, the two main methodologies used to accomplish TLP are chip-level multiprocessing (CMP) and simultaneous multithreading (SMT).
On a higher level, it is very common to build computers with multiple totally independent CPUs in arrangements like symmetric multiprocessing (SMP) and non-uniform memory access (NUMA). While using very different means, all of these techniques accomplish the same goal: increasing the number of threads that the CPU(s) can run in parallel. The CMP and SMP methods of parallelism are similar to one another and the most straightforward. These involve little more conceptually than the utilization of two or more complete and independent CPUs. In the case of CMP, multiple processor "cores" are included in the same package, sometimes on the very same integrated circuit. SMP, on the other hand, includes multiple independent packages. NUMA is somewhat similar to SMP but uses a nonuniform memory access model. This is important for computers with many CPUs because each processor's access time to memory is quickly exhausted with SMP's shared memory model, resulting in significant delays due to CPUs waiting for memory. Therefore, NUMA is considered a much more scalable model, successfully allowing many more CPUs to be used in one computer than SMP can feasibly support. SMT differs somewhat from other TLP improvements in that it attempts to duplicate as few portions of the CPU as possible. While considered a TLP strategy, its implementation actually more resembles superscalar design, and indeed is often used in superscalar microprocessors (such as IBM's POWER5). Rather than duplicating the entire CPU, SMT designs only duplicate parts needed for instruction fetching, decoding, and dispatch, as well as things like general-purpose registers. This allows an SMT CPU to keep its execution units busy more often by providing them instructions from two different software threads. Again, this is very similar to the ILP superscalar method, but simultaneously executes instructions from multiple threads rather than executing multiple instructions from the same thread concurrently.

Vector processors and SIMD

Main articles: Vector processor, SIMD

A less common but increasingly important paradigm of CPUs (and indeed, computing in general) deals with vectors. The processors discussed earlier are all referred to as some type of scalar device. As the name implies, vector processors deal with multiple pieces of data in the context of one instruction. This contrasts with scalar processors, which deal with one piece of data for every instruction. These two schemes of dealing with data are generally referred to as SIMD (single instruction, multiple data) for vector processors and SISD (single instruction, single data) for scalar processors. The great utility in creating CPUs that deal with vectors of data lies in optimizing tasks that tend to require the same operation (for example, a sum or a dot product) to be performed on a large set of data. Some classic examples of these types of tasks are multimedia applications (images, video, and sound), as well as many types of scientific and engineering tasks. Whereas a scalar CPU must complete the entire process of fetching, decoding, and executing each instruction and value in a set of data, a vector CPU can perform a single operation on a comparatively large set of data with one instruction. Of course, this is only possible when the application tends to require many steps which apply one operation to a large set of data.

Most early vector CPUs, such as the Cray-1, were associated almost exclusively with scientific research and cryptography applications. However, as multimedia has largely shifted to digital media, the need for some form of SIMD in general-purpose CPUs has become significant. Shortly after floating point execution units started to become commonplace to include in general-purpose processors, specifications for and implementations of SIMD execution units also began to appear for general-purpose CPUs. Some of these early SIMD specifications like Intel's MMX were integer-only.
This proved to be a significant impediment for some software developers, since many of the applications that benefit from SIMD primarily deal with floating point numbers. Progressively, these early designs were refined and remade into some of the common, modern SIMD specifications, which are usually associated with one ISA. Some notable modern examples are Intel's SSE and the PowerPC-related AltiVec (also known as VMX).
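The difference between the scalar and vector approaches can be made concrete by counting how many add instructions each would issue for the same data. This is a conceptual model in plain Python, not real SIMD code: the lane count of 4 is an arbitrary assumption, and each pass of the outer loop in vector_add stands in for one vector instruction.

```python
def scalar_add(a, b):
    """SISD model: one add instruction per pair of elements."""
    result = []
    instructions = 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def vector_add(a, b, lanes=4):
    """SIMD model: one add instruction processes up to `lanes` element pairs."""
    result = []
    instructions = 0
    for start in range(0, len(a), lanes):
        chunk_a = a[start:start + lanes]
        chunk_b = b[start:start + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1              # the whole chunk counts as one instruction
    return result, instructions


if __name__ == "__main__":
    a = [1, 2, 3, 4, 5, 6, 7, 8]
    b = [10] * 8
    print(scalar_add(a, b))
    print(vector_add(a, b))
```

Both functions produce the same sums; only the instruction count differs (eight in the scalar model against two in the four-lane vector model). The saving appears only because every element undergoes the same operation, which is the condition the text places on vector processing.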

See also

Intel 80486DX-33 CPU. Photo: Weihao.chiu



Latest page update: made by vulcano, Jul 1 2006, 8:29 AM EDT
