Processor design

To a large extent, the design of a Central Processing Unit is the design of its control unit. The modern (i.e., 1965 to 1985) way to design control logic is to write a microprogram.

CPU design was originally an ad-hoc process. Just getting a CPU to work was a substantial governmental and technical event.

Key design innovations include cache, virtual memory, instruction pipelining, superscalar, CISC, RISC, virtual machine, emulators, microprogram, and stack.

General Purpose CPU Design

1960s: The Computer Revolution and CISC

The major problem with early computers was that a program written for one would not work on others. In 1962, IBM bet the company on a new way to design computers. The plan was to make a family of computers that could all run the same software, but with different performance and at different prices. The plan required radically different underlying computers, each using a microprogram to emulate the same reference computer.

Each computer would be targeted at a specific price point. As users' requirements grew, they could move up to larger computers. This computer family was called the 360/370, and updated but compatible computers are still being sold as of 2001.

IBM chose to make the reference instruction set quite complex, and very capable. This was a conscious choice. The "control store" containing the microprogram was relatively small, and could be made with very fast memory. Another important effect was that a single instruction could describe quite a complex sequence of operations. Thus the computers would generally have to fetch fewer instructions from the main memory, which could be made slower, smaller and less expensive for a given combination of speed and price.

An often-overlooked feature of the IBM 360 instruction set was that it was the first instruction set designed for data processing, rather than mathematical calculation. The crucial innovation was that memory was designed to be addressed in units of a single printable character, a "byte." Also, the instruction set was designed to manipulate not just simple integer numbers, but text, scientific floating-point numbers (similar to the numbers used in a calculator), and the decimal arithmetic needed by accounting systems.

Another important feature was that the IBM register set was binary, a feature pioneered on the Whirlwind computer built at MIT. Binary arithmetic is substantially cheaper to implement than decimal, because it requires fewer electronic devices to store the same number.

Almost all following computers included these innovations in some form. This basic set of features is called a "complex instruction set computer," or CISC (pronounced "sisk").

In many CISCs, an instruction could access either registers or memory, usually in several different ways. This made the CISCs easier to program, because a programmer could remember just thirty to a hundred instructions, and a set of three to ten "addressing modes," rather than thousands of distinct instructions. This was called an "orthogonal instruction set." The PDP-11 and the Motorola 68000 architecture are examples of nearly orthogonal instruction sets.
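
The combinatorial effect of orthogonality can be sketched in a few lines of Python (the operation and addressing-mode names below are illustrative, not actual PDP-11 or 68000 mnemonics): a small set of operations and a small set of modes combine into a much larger set of usable instruction forms.

    # Illustrative sketch of orthogonality: any operation can be paired with
    # any addressing mode, so a programmer learns a small set of each rather
    # than memorizing thousands of distinct instruction forms.
    # Names are made up, not real PDP-11 or 68000 mnemonics.

    operations = ["MOVE", "ADD", "SUB", "CMP"]
    addressing_modes = ["register", "immediate", "absolute",
                        "register indirect", "indexed"]

    forms = [(op, mode) for op in operations for mode in addressing_modes]
    print(len(operations), "operations x", len(addressing_modes), "modes =",
          len(forms), "usable forms")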

Early 1980s: The Lessons of RISC

In the early 1980s, researchers at UC Berkeley and IBM both discovered that compilers for most computer languages produced only a small subset of the instructions of a CISC. Much of the power of the CPU was simply being ignored in real-world use. They realized that by making the computer simpler and less orthogonal, they could make it faster and less expensive at the same time.

Another ongoing issue was that CPUs were continuing to grow faster and faster in relation to the memory they addressed. Since many operations required several memory accesses, the overall speed of the CPU could be greatly increased by dedicating chip space to internal memory in the form of registers, a change that also meant reducing the number of addressing modes and the orthogonality of the instruction set.

The computer designs based on this theory were called Reduced Instruction Set Computers, or RISC. RISCs generally had larger numbers of registers, accessed by simpler instructions, with a few instructions specifically to load and store data to memory. The result was a very simple core CPU running at very high speed, supporting the exact sorts of operations the compilers were using anyway.

One downside to the RISC design is that the programs that run on them tend to be larger. This is because, instead of using a single very powerful instruction as in the CISC philosophy, a RISC requires the compiler to write out a number of simpler instructions to complete the same operation. Since these instructions need to be loaded from memory anyway, the larger code size offsets some of the benefit of the RISC design's memory handling.
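
The trade-off can be sketched with a toy example in Python (the mnemonics and machine model are invented for illustration, not taken from any real instruction set): the same memory-to-memory addition is one instruction on a CISC-style machine but four on a load/store RISC.

    # Toy comparison of "a = a + b", where a and b live in memory.
    # The mnemonics are made up for illustration.

    cisc_program = [
        ("ADD_MEM", "a", "b"),        # one instruction, but a long multi-step one
    ]

    risc_program = [
        ("LOAD",  "r1", "a"),         # bring the operands into registers first
        ("LOAD",  "r2", "b"),
        ("ADD",   "r1", "r1", "r2"),  # simple register-to-register operation
        ("STORE", "r1", "a"),         # write the result back to memory
    ]

    print("CISC instructions fetched:", len(cisc_program))
    print("RISC instructions fetched:", len(risc_program))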

Recently, engineers have found ways to compress the reduced instruction sets so they fit in even smaller memory systems than RISCs. Examples of such compression schemes include ARM architecture's "Thumb" instruction set. In applications that do not need to run older binary software, compressed RISCs are coming to dominate sales.

Another approach to RISCs was the "niladic" or "zero-address" instruction set. This approach grew from the observation that the majority of the space in an instruction was used to identify its operands. These machines placed the operands on a push-down (last-in, first-out) stack. The instruction set was supplemented with a few instructions to fetch and store memory. Most used simple caching to provide extremely fast RISC machines, with very compact code. Another benefit was that the interrupt latencies were extremely small, smaller than in most CISC machines (a rare trait in RISC machines). An early zero-address computer developed by Charles Moore placed six 5-bit instructions in a 32-bit word and was a precursor to VLIW design (see below: 1990 to Today).
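
A minimal sketch of a zero-address machine, written in Python for illustration (the instruction names and memory model are assumptions, not those of any real FORTH chip), shows why the code is so compact: the arithmetic instructions carry no operand fields at all.

    # Minimal sketch of a zero-address ("stack") machine.
    # Instruction names and memory model are illustrative only.

    def run(program, memory):
        stack = []
        for op, *args in program:
            if op == "PUSH":                  # push a literal onto the stack
                stack.append(args[0])
            elif op == "LOAD":                # address comes from the stack
                stack.append(memory[stack.pop()])
            elif op == "STORE":               # address and value come from the stack
                addr, value = stack.pop(), stack.pop()
                memory[addr] = value
            elif op == "ADD":                 # no operands: both inputs are implicit
                b, a = stack.pop(), stack.pop()
                stack.append(a + b)
        return stack

    # memory[2] = memory[0] + memory[1], with no register or address fields in ADD
    mem = {0: 3, 1: 4, 2: 0}
    run([("PUSH", 0), ("LOAD",), ("PUSH", 1), ("LOAD",), ("ADD",),
         ("PUSH", 2), ("STORE",)], mem)
    print(mem[2])  # 7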

Commercial variants were mostly characterized as "FORTH" machines, and probably failed because that language became unpopular. Also, the machines were developed by defense contractors at exactly the time the Cold War ended. Loss of funding may have broken up the development teams before the companies could perform adequate commercial marketing.

RISC chips now form the vast majority of all CPUs in use. This is because the main market for RISC-based CPUs has been in the places where their tiny core makes the most sense: systems where low power or small size is important. Thus RISC CPUs are now found in almost every electronic device in the world, where they have replaced earlier generations of 8-bit CISC CPUs like the Zilog Z80 and the 6502 with full 32-bit implementations like the ARM family, which are in fact even simpler.

This may come as a surprise to many, because the desktop computer is often taken as the hallmark of the "market." With Intel designs dominating the vast majority of all desktop sales, RISC is found only in the Apple computer lines.

Mid 1980s to Today: Synthesis

In the mid-to-late 1980s, designers began using a technique known as instruction pipelining, in which the processor works on multiple instructions in different stages of completion. For example, the processor may be retrieving the operands for the next instruction while calculating the result of the current one. Modern CPUs may use over a dozen such stages.
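
The overlap can be illustrated with a short Python sketch (the five stage names are the classic textbook ones, and the code assumes every instruction can advance one stage per cycle, ignoring hazards):

    # Illustrative sketch of instruction pipelining: each cycle, every
    # instruction in flight advances one stage, so several instructions overlap.
    # The five stages below are the classic textbook ones, not any specific CPU.

    STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

    def pipeline_schedule(instructions):
        # Print which instruction occupies which stage on each clock cycle.
        for cycle in range(len(instructions) + len(STAGES) - 1):
            row = []
            for stage_index, stage in enumerate(STAGES):
                instr_index = cycle - stage_index
                if 0 <= instr_index < len(instructions):
                    row.append(f"{stage}:{instructions[instr_index]}")
            print(f"cycle {cycle}: " + ", ".join(row))

    pipeline_schedule(["i0", "i1", "i2", "i3"])
    # With the pipeline, 4 instructions finish in 8 cycles instead of 4 * 5 = 20.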

A similar idea, introduced only a few years later, was to execute multiple instructions in parallel on separate arithmetic-logic units (ALUs). Instead of operating on only one instruction at a time, the CPU looks for several similar instructions that are not dependent on each other, and runs them all at the same time. The results are then interleaved as they exit, making it look like a single CPU was running (say) twice as fast while still using only one bus.

This approach, referred to as scalar processor design, is limited by the degree of instruction-level parallelism (ILP), the number of non-dependent instructions in the program code. Some programs, notably data processing, run very well on scalar processors. However, more general problems require complex logic, and this almost always results in instructions whose results depend on other results, which therefore cannot be run in parallel.

Branching is one major culprit. For instance, you might add two numbers and then do one of two things depending on whether the result is bigger or smaller than some third number. In this case, even if the branch operation is sent to the second ALU for processing, it still has to wait for the result of the addition, and thus runs no faster than if there were only one ALU.

To get around this limit, so-called superscalar designs were developed. Additional logic in the CPU looks at the code as it is being sent into the CPU and "forces" it to be parallel. In the branching case, a number of solutions are applied, including watching past executions of the branch to see which way it usually goes (called branch prediction), and simply running that case as if there were no branch at all. A similar concept is speculative execution, where both sides of a branch are run at the same time, and the results of one or the other are thrown out once the answer is known.
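
One common form of branch prediction can be sketched in Python (a single two-bit saturating counter; this is an illustrative assumption, as real predictors keep a table of such counters indexed by branch address). The counter only changes its guess after two consecutive mispredictions.

    # Sketch of a two-bit saturating-counter branch predictor, one common way
    # to "watch past examples of the branch": the counter drifts toward
    # taken/not-taken and flips its prediction only after two wrong guesses.

    class TwoBitPredictor:
        def __init__(self):
            self.state = 2           # 0,1 predict not-taken; 2,3 predict taken

        def predict(self):
            return self.state >= 2   # True means "predict taken"

        def update(self, taken):
            if taken:
                self.state = min(3, self.state + 1)
            else:
                self.state = max(0, self.state - 1)

    p = TwoBitPredictor()
    history = [True, True, False, True, True, True]   # actual branch outcomes
    correct = 0
    for outcome in history:
        correct += (p.predict() == outcome)
        p.update(outcome)
    print(f"{correct}/{len(history)} predictions correct")   # 5/6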

These advances, which were originally developed from research for RISC-style designs, allow modern CISC processors to execute twelve or more instructions per clock cycle, when traditional CISC designs could take twelve or more cycles to execute just one instruction.

The resulting microcode is complex and error-prone, mostly due to the dependencies between different instructions. Furthermore, the electronics needed to coordinate these ALUs require more transistors, increasing power consumption and heat. In this respect RISC is superior, because its instructions have fewer interdependencies and tend to be easier to apply superscalar concepts to. However, as Intel has demonstrated, the concepts are just as applicable to a CISC design if you are willing to spend the time and money.

1990 to Today: Looking Forward

It is important to consider that the microcode used to make a superscalar processor is just that: computer code. In the early 1990s, a significant innovation was the realization that the coordination of a multiple-ALU computer could be moved into the compiler, the software that translates a programmer's instructions into machine-level instructions. In fact, at a theoretical level there is really no difference.

But at a practical level, doing this work in the compiler has a huge number of advantages. Chief among them is that the CPU has to do all of its work in real time, so making the prediction logic more complex can actually slow the CPU down. Compiling, by contrast, happens only on the developer's machine, so the compiler can spend as much time on this analysis as it wants.

So all of the complexity placed in the CPU to do things like branch prediction, which is really just software, can be removed and placed in the compiler instead. The resulting CPU is simpler, and even in the worst case it runs at the same speed it would if the prediction were in the CPU. Once the compiler knows which instructions can be run at the same time, it bundles them together into one very large instruction. The result is a very long instruction word (VLIW) computer.
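
The bundling step can be sketched in Python (the instruction format, bundle width, and greedy packing rule below are illustrative assumptions, not any real VLIW compiler's algorithm): independent instructions go into the same fixed-width bundle, and an instruction that reads a result produced earlier in the bundle starts a new one.

    # Sketch of the VLIW idea: the compiler (not the CPU) groups instructions
    # that do not depend on each other into fixed-width "bundles" issued together.
    # Instructions are (destination, source_registers); the packing is greedy.

    BUNDLE_WIDTH = 3

    def bundle(instructions):
        bundles, current, written = [], [], set()
        for dest, sources in instructions:
            depends = any(src in written for src in sources)
            if depends or len(current) == BUNDLE_WIDTH:
                bundles.append(current)          # close the current bundle
                current, written = [], set()
            current.append((dest, sources))
            written.add(dest)
        if current:
            bundles.append(current)
        return bundles

    code = [("r1", ["a"]), ("r2", ["b"]), ("r3", ["r1", "r2"]), ("r4", ["c"])]
    for i, b in enumerate(bundle(code)):
        print(f"bundle {i}: {[dest for dest, _ in b]}")
    # r1 and r2 can issue together; r3 must wait because it reads both.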

There were several unsuccessful attempts to commercialize VLIW. The basic problem was that a VLIW computer does not scale to different price and performance points the way a microprogrammed computer can. Also, VLIW computers maximize throughput rather than minimizing latency, so they were not attractive to the engineers designing controllers and other computers embedded in machinery. The embedded systems markets had often pioneered other computer improvements by providing a large market that did not care about running older binary software.

In January 2000, a company called Transmeta took the interesting step of placing a complete compiler in the central processing unit, and making the compiler translate from a reference byte code (in their case, x86 instructions) to an internal VLIW instruction set. This approach combines the hardware simplicity, low power, and speed of a VLIW RISC with the compact code and backward software compatibility provided by the popular CISC instruction set.

Later this year (2002), Intel intends to release a chip based on what they call an Explicitly Parallel Instruction Computing (EPIC) design. This design supposedly provides the VLIW advantage of increased instruction throughput. However, it avoids some of the issues of scaling and complexity by explicitly providing, in each "bundle" of instructions, information concerning their dependencies. This information is calculated by the compiler, as it would be in a VLIW design. The early versions will also be backward compatible with current x86 software by means of an on-chip emulation mode.

Also, we may soon see multi-threaded CPUs. Current designs work best when the computer is running only a single program; however, nearly all modern operating systems allow the user to run multiple programs at the same time. For the CPU to change over and do work on another program requires an expensive context switch. In contrast, a multi-threaded CPU could handle instructions from multiple programs at once.

To do this, such CPUs include a huge number of registers. When a context switch occurs, the contents of the "working registers" are simply copied into one of a set of registers included specifically for this purpose. Such designs often include thousands of registers instead of hundreds as in a typical design. On the downside, registers tend to be somewhat expensive in the chip space needed to implement them, chip space that could otherwise be used for some other purpose.
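
The register-bank idea can be sketched in Python (the bank count, register count, and names are illustrative assumptions): a context switch becomes a change of index rather than a copy of every register out to memory.

    # Sketch of banked registers: instead of saving the working registers to
    # memory on every context switch, the CPU switches which bank the working
    # register names point at. Sizes are illustrative.

    NUM_BANKS = 4        # one bank per hardware thread/context
    REGS_PER_BANK = 32

    class BankedRegisterFile:
        def __init__(self):
            self.banks = [[0] * REGS_PER_BANK for _ in range(NUM_BANKS)]
            self.active = 0

        def read(self, reg):
            return self.banks[self.active][reg]

        def write(self, reg, value):
            self.banks[self.active][reg] = value

        def context_switch(self, thread_id):
            # No save/restore loop: switching contexts is just changing an index.
            self.active = thread_id

    rf = BankedRegisterFile()
    rf.write(0, 42)           # thread 0 writes r0
    rf.context_switch(1)      # switch to thread 1: its r0 is separate
    rf.write(0, 7)
    rf.context_switch(0)
    print(rf.read(0))         # 42, thread 0's state survived without a memory copy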

Another track of development is to combine reconfigurable logic with a general-purpose CPU. In this scheme, a special computer language compiles fast-running subroutines into a bit-mask to configure the logic. Slower, or less-critical parts of the program can be run by sharing their time on the CPU. This process has the capability to create devices such as software radios, by using digital signal processing to perform functions usually performed by analog electronics.

Embedded Design

The majority of computer systems in use today are embedded in other machinery, such as telephones, clocks, appliances, vehicles, and infrastructure. These "embedded systems" usually have small requirements for memory, modest program sizes, and often simple but unusual input/output systems. For example, most embedded systems lack keyboards, screens, disks, printers, or other recognizable I/O devices of a personal computer. They may control electric motors, relays or voltages, and read switches, variable resistors or other electronic devices. Often, the only I/O device readable by a human is a single light-emitting diode, and severe cost or power constraints will eliminate even that.

In contrast to general-purpose computers, embedded systems often seek to minimize interrupt latency rather than maximize instruction throughput.

For example, low-latency CPUs generally have relatively few registers in their central processing units. When an electronic device causes an interrupt, the intermediate results held in the registers have to be saved before the software responsible for handling the interrupt can run, and must be restored after it is finished. If there are more registers, this saving and restoring process takes more time, increasing the latency.
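
A rough model in Python (the cycle counts are invented for illustration) shows how the save-and-restore cost, and therefore the delay before the handler can start, grows directly with the number of registers:

    # Illustrative model of why a large register file hurts interrupt latency:
    # the handler cannot run until every working register has been saved, and
    # the interrupted program cannot resume until they are all restored.
    # Cycle counts are made up for illustration.

    SAVE_CYCLES_PER_REGISTER = 2   # assumed cost of one store or one load

    def interrupt_overhead(num_registers, handler_cycles):
        save = num_registers * SAVE_CYCLES_PER_REGISTER
        restore = num_registers * SAVE_CYCLES_PER_REGISTER
        latency_to_handler = save              # time before the handler starts
        total = save + handler_cycles + restore
        return latency_to_handler, total

    for regs in (8, 32, 128):
        latency, total = interrupt_overhead(regs, handler_cycles=100)
        print(f"{regs:3d} registers: handler starts after {latency} cycles, "
              f"total time stolen = {total}")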

Other design issues

Another common problem involves virtual memory. Historically, random-access memory has been thousands of times more expensive than rotating mechanical storage. For businesses and many general computing tasks, it is a good compromise never to let the computer run out of memory, an event which would halt a program and greatly inconvenience the user. Instead of halting the program, many computer systems save less-frequently-used blocks of memory to the rotating mechanical storage. In essence, the mechanical storage becomes an extension of main memory. However, mechanical storage is thousands of times slower than electronic memory. Thus, almost all general-purpose computing systems use "virtual memory" and also have unpredictable interrupt latencies unless their operating system contains a real-time scheduler. Such a scheduler keeps critical pieces of code and data in solid-state RAM and guarantees a minimum amount of CPU time and a maximum interrupt latency.
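
The mechanism can be sketched in Python (the frame count, the least-recently-used replacement policy, and the data structures are illustrative assumptions): when RAM is full, a less-recently-used page is written out to the slower storage instead of the program being halted.

    # Sketch of the virtual-memory idea: when RAM is full, a less-recently-used
    # page is moved to (much slower) mechanical storage. Sizes are illustrative.

    from collections import OrderedDict

    RAM_FRAMES = 3

    ram = OrderedDict()    # page -> data, ordered from least to most recently used
    disk = {}              # swapped-out pages

    def access(page):
        if page in ram:
            ram.move_to_end(page)                # mark as most recently used
            return ram[page]
        if len(ram) >= RAM_FRAMES:               # RAM full: evict the LRU page
            victim, data = ram.popitem(last=False)
            disk[victim] = data                  # thousands of times slower in reality
        ram[page] = disk.pop(page, f"data-{page}")   # bring the page (back) in
        ram.move_to_end(page)
        return ram[page]

    for p in [0, 1, 2, 0, 3, 1]:                 # touching page 3 evicts page 1
        access(p)
    print("in RAM:", list(ram), "on disk:", list(disk))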

One interesting near-term possibility would be to eliminate the bus. Modern vertical laser diodes enable this change. In theory, an optical computer's components could directly connect through a holographic or phased open-air switching system. This would provide a large increase in effective speed and design flexibility, and a large reduction in cost. Since a computer's connectors are also its most likely failure point, a busless system might be more reliable, as well.

Another farther-term possibility is to use light instead of electricity for the digital logic itself. In theory, this could run about 30% faster and use less power, as well as permit a direct interface with quantum computational devices. The chief problem with this approach is that for the foreseeable future, electronic devices are faster, smaller (i.e. cheaper) and more reliable. An important theoretical problem is that electronic computational elements are already smaller than some wavelengths of light, and therefore even wave-guide based optical logic may be uneconomic compared to electronic logic. We can therefore expect the majority of development to focus on electronics, no matter how unfair it might seem. See also optical computing.

Yet another possibility is the "clockless CPU." This unit coordinates the stages of the CPU using logic devices called "pipeline controls" or "FIFO sequencers." Basically, the pipeline controller clocks the next stage of logic when the existing stage is complete, so a central clock is unnecessary. The advantage is that a clockless CPU is not held to a fixed worst-case clock; each part runs as fast as it can, rather than the whole CPU being paced by its slowest component.
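
The difference can be sketched in Python (the stage names and delays are invented for illustration): a clocked design must give every stage a full worst-case clock period, while an asynchronous design lets each stage hand off its result as soon as it finishes.

    # Sketch of the clockless (asynchronous) idea: each stage starts as soon as
    # the previous stage signals completion, rather than waiting for a global
    # clock edge. Stage delays are made-up numbers.

    stage_delays = {"fetch": 3, "decode": 2, "execute": 5, "writeback": 1}

    def run_async(stages):
        time = 0
        for stage, delay in stages.items():
            time += delay            # next stage begins the moment this one is done
            print(f"{stage} done at t={time}")
        return time

    def run_clocked(stages, clock_period):
        # A clocked design must give every stage a full worst-case period.
        return len(stages) * clock_period

    t_async = run_async(stage_delays)
    t_clock = run_clocked(stage_delays, clock_period=max(stage_delays.values()))
    print(f"asynchronous: {t_async} time units, clocked: {t_clock} time units")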

