Optimizing compiler

Optimization refers to techniques used to improve the efficiency (in terms of running time or resource usage) of the code output by a compiler. These techniques allow programmers to write code in a straightforward manner, expressing their intentions clearly, while allowing the computer to make choices about implementation details that lead to efficient code. Contrary to what the term might imply, this rarely results in code that is perfectly "optimal" by any measure, only code that is much improved compared to direct translation of the programmer's original code.

Techniques in optimization can be broken up along various dimensions:

local vs. global: Local techniques tend to be easier to implement, but result in lesser gains, while global techniques make the opposite tradeoff. Often, optimizations that act on the complete control flow graph are termed global, and those that work on a basic block alone are termed local.

programming language-independent vs. language-dependent: Most high-level languages share common programming constructs and abstractions--decision (if, switch, case), looping (for, while, repeat .. until, do .. while), encapsulation (structures, objects). Thus similar optimization techniques can be used across languages. However certain language features make some kinds of optimizations possible and/or difficult.

machine independent vs. machine dependent: A lot of optimizations that operate on abstract programming concepts (loops, objects, structures) are independent of the machine targetted by the compiler. But many of the most effective optimizations are those that best exploit special features of the target platform.

These dimensions aren't completely orthogonal. It is often the case that machine dependent optimizations are local, for example.

An instance of a local machine dependent optimization: to set a register to 0, the obvious way is to use the constant 0 with the instruction that sets a register value to a constant. A less obvious way is to XOR a register by itself. It is up to the compiler to know which instruction variant to use. On many RISC machines, both instructions would be equally appropriate, since they would both be the same length and take the same time. On many other microprocessors such as the Intel x86 family, it turns out that the XOR variant is shorter and faster, because it eliminates the need to store and fetch the constant from memory.

Factors affecting optimization

The machine itself

It is sometimes possible to parametrize some of these machine dependent factors. A single piece of compiler code can be used to optimize different machines just by altering the machine description parameters. See GCC for a compiler that exemplifies this approach.

The architecture of the target CPU

Number of CPU registers: To a certain extent, the more registers, easier it is to optimize for performance, because local variables can be allocated in the registers and not on the stack.
RISC vs. CISC: CISC instruction sets often have variable instruction lengths, often have a larger number of possible instructions that can be used, and each instruction could take differing amounts of time. RISC instruction sets attempt to limit the variability in each of these: instruction sets are usually constant length, with few exceptions, there are usually fewer combinations of registers and memory operations, and the instruction issue rate is usually constant (instruction issue rate is effectively the number of instructions completed per time period, usually the clock cycle). Thus, CISC CPUs usually have several ways of carrying out a certain task, from which the programmer can choose one that best suits his needs. Compilers have to know the relative costs among the various instructions and choose the best instruction sequence.
Pipelines: A pipeline is essentially an ALU broken up into an assembly line. It allows use of parts of the ALU for different instructions by breaking up the execution of instructions into various stages: instruction decode, address decode, memory fetch, register fetch, compute, register store, etc. One instruction could be in the register store stage, while another could be in the register fetch stage. Pipeline conflicts occur when an instruction in one stage of the pipeline depends on the result of another instruction ahead of it in the pipeline but not yet completed. Pipeline conflicts can lead to pipeline stalls: where the CPU wastes cycles waiting for a conflict to resolve.

Compilers have to schedule instructions such that the pipelines don't stall, or stalls are reduced to a minimum.

number of functional units: Some CPUs have several ALUs and FPUs. This allows them to execute multiple instructions simultaneously. There may be restrictions on which instructions can pair with which other instructions ("pairing" is the simultaneous execution of two or more instructions), and which functional unit can execute which instruction. They also have issues similar to pipeline conflicts.

Here again, instructions have to be scheduled so that the the various functional units are fully fed with instructions to execute.

The architecture of the machine

Cache size (256KiB, 1MiB) and type (fully associative, 4-way associative): Techniques like inlining and loop unrolling may increase the size of the generated code and reduce code locality. The program may slow down drastically if an opt-run piece of code (like inner loops in various algorithms) suddenly cannot fit in the cache. Also non-fully associative caches have higher chances of cache collisions even in an unfilled cache.
Cache/Memory transfer rates: These give the compiler an indication of the penalty for cache misses. This is used only in specialized applications.

Intended use of the generated code

Debugging: A main factor is on speed of compilation. Also, debugging code is usually stepped through in a symbolic debugger -- so it is useful not to apply transformations that make it difficult to identify the source code line numbers from which the code being stepped through was generated from.

General purpose use: Prepackaged software is very often expected to be executed on a variety of machines and CPUs that may share the same instruction set, but have different timing, cache or memory characteristics. So, the code may not be tuned to any particular CPU, or may be tuned to work well on the most popular CPU and work reasonably on other CPUs.

Special purpose use: If the software is compiled to be used on one or a few very similar machines, with known characteristics, then the compiler can heavily tune the generated code to those machines alone.

Optimization techniques

Common Themes

To a large extent, optimization techniques have the following themes, which sometime conflict

Less code: There is less work for the CPU, cache, and memory. So, likely to be faster.

Straight line code, fewer jumps: Less complicated code. Jumps interfere with the prefetching of instructions, thus slowing down code

Code locality: Pieces of code executed close together in time should be placed close together in memory

Extract more information from code: The more information the compiler has, the better it can optimize.

Techniques

Some optimization techniques are:

loop unrolling: Attempts to reduce the overhead inherent in testing a loop's condition by replicating the code in the loop body. Completely unrolling a loop, of course, requires that the number of iterations be known at compile time.
loop combining: Another technique which attempts to reduce loop overhead. When two adjacent loops would iterate the same number of times (whether or not that number is known at compile time), their bodies can be combined as long as they make no reference to each other's data.
loop interchange (swapping inner and outer loops)
common subexpression elimination: In the expression "(a+b)-(a+b)/4", "common subexpression" refers to the duplicated "(a+b)". Compilers implementing this technique realize that "(a+b)" won't change, and as such, only calculate its value once.
test reordering: if we have two tests that are the condition for something, we can first deal with the simpler tests (e.g. comparing a variable to something) and only then with the complex tests (e.g. those that require a function call). This technique complements lazy evaluation, but can be used only when the tests are not dependant on one another.
inlining of procedures: when some codes invokes a procedure, it is possible to put the body of the procedure inside this code rather than invoking it from another location. This saves the overhead related to procedure calls, but comes at the cost of duplicating the function body each time it's called inline. Generally, inlining is useful in performance-critical code that uses makes a lot of small procedure calls.
constant folding and propagation: replacing expressions consisting of constants (e.g. "3 + 5") which their final value ("8") rather than doing the calculation in run-time. Used in most modern languages.
instruction scheduling
dead code elimination: "Dead code" refers to code segments which can never be executed. Dead code elimination prevents the compiler from emitting code for such segments, saving on CPU instruction cache.
using CPU instructions with complex offsets to do math: on many processors in the 68000 family, for example, "lea a0,25(a1,d5*4)" assigns to the a0 register 25 + the contents of a1 + 4 * the contents of d5 in a single instruction and without an explicit move or overwriting a1 or d5
code-block reordering to reduce conditional branches and improve locality of reference
factoring out of invariants: if an expression is carried out both when a condition is met and otherwise, it can be written just once outside of the conditional statement. Similarly, if certain types of expressions (e.g. the assignment of a constant into a variable) appear inside a loop, they can be moved out of it because their effect will be the same no matter if they're executed many times or just once.
sentinels
removing recursion: recursion is often expensive as it involves the function call overhead, and inconvenient as it can rapidly deplete the stack. Tail recursive algorithms can be converted to iteration, which does not have call overhead and uses a constant amount of the stack.
strength reduction
reduction of cache collisions (e.g. by disrupting alignment within a page)
register optimization (I remember doing some homework at the university making graphs of register usage to optimize register accesses, but can't remember more - can anyone fill in the blanks?)

please expand this list, and add references to papers/books discussing each technique.

Optimizing compiler

Contents

Factors affecting optimization

The machine itself

Intended use of the generated code

Optimization techniques

Common Themes

Techniques

Navigation menu

Optimizing compiler

Factors affecting optimization

The machine itself

Intended use of the generated code

Optimization techniques

Common Themes

Techniques

Navigation menu

Search