motivated by common user problems and misconceptions
not specific to any processor or operating system but YMMV
aimed at understanding the basic optimisation techniques
assuming predominantly floating point scientific computing
Exercise material
> tar xf /short/c23/SerialOpt.tar
> cd SerialOpt
> ls
IO - the good, the bad and the ugly ...
Most of it is UGLY!
Easily the most serious performance problem on the AC
Filesystems on large parallel systems geared to BIG IO - cannot
support 500 "general purpose" users.
What works OK on your PC may be disastrous on the AC
Do IO in big chunks as infrequently as possible
Filesystems for all occasions. Different file systems can give vastly
different performance.
Exercise: IO
Compile and run each of goodio1.f90, goodio2.f90,
goodio3.f90 and bestio.f90 and compare the
different IO methods. Beware, the array sizes are different and the
timings include the computation (which should be negligible).
The demonstrator will run poorio.f90 to show you how bad it can be.
Disk IO Time Scales and Locality
Timers
measuring cputime (shared system) and real time (dedicated system)
granularity of timers - might need to time multiple runs
cputime is only accurate to 1/100th or 1/1024th sec on Linux
overhead of timer call may affect timings
watch out for wraparound of timer counters
for accurate walltime, use:
gettimeofday() on Linux
MPI_Wtime() on some systems
simple time command, eg. > time a.out 48.231u 0.303s 0:48.67 99.7% 0+0k 0+0io 0pf+0w
Exercise: Timers
Compile and run timers.(c|f90) and compare the difference
between the timers. For Fortran, we have provided a C wrapper to
gettimeofday() - compile using:
Note how many iterations of a loop can fit within ticks of the clock.
Compilers
Usually large wins from using compiler optimization flags
Default often uses some optimization
Beware of slowdowns or wrong answers(!) at highest optimization (eg. -O5)
Look for good generic optimization flags like -fast or -Ofast
For Intel: -fast = "-ipo -O3 -static" where ipo means
interprocedural optimization
Recognizing basic assembler (produced by -S option) might be useful
Compiler option gotchas
Beware of Intel -ipo: object files (.o files) are not really object
files, they are translated source files. Optimized compilations happens at "link"
time when all .o's are visible. Cant manage .o's with link tools like
ld or ar.
Beware of result changes due to -fast_maths (-IPF-fp-relaxed
for Intel compilers) options.
Watch out for -fpe or -speculate all type flags
(eg. -IPF-fp-speculation fast for Intel). May cause real FPE
problems to be hidden.
Optimization levels
-O0 disable optimization
-O1 optimizations without code bloat (no SWP or loop unrolling)
-O2 (default) intra-file IPO, SWP, speculation
-O3 aggressive loop transforms, data prefetch, ...
Basic documentation online in /opt/histx/histx-1.3b/doc/histx.txt
No profiling output using -fast
Hardware monitoring
Provide cycle counting, cache misses, stalls, memory access and a
whole heap of very detailed information. A course in its own right.
Most performance gains can be made without hardware monitors.
Exercise: Profiling
Build and profile (using both gprof and
histx) your own code or laplace.(c|f) or
oceancode.f. Try using the
different profiling features eg. determine which line in the code is
the most often executed.
Architecture
Common features:
multiple floating point units
pipelined
out-of-order execution
cycle counts for operations
caches and cache lines
memory access times
Arithmetic Units
Loads one or two operands (from registers) and writes the result
back to a register
Processor may or may not support the following FP operations
in hardware:
Operation
Stages
Pipelined
Add
5
YES
Multiply
5
YES
FMA
5
YES
Divide
20
NO
Sqrt
20
NO
Pipelining
Hardware execution of arithmetic operation:
5 to 10 "single-cycle" sub-operations (stages)
scalar execution:
operands pass through all stages and result produced
before next pair of operands start
1 results every 5 clock cycles
pipelined execution:
new operation starts every cycle
operands of different iteration at every stage simultaneously
results every clock cycle
Pipelining 2
FP Add operation stages:
compare exponents
shift mantissa bits in one operand
integer add of the mantissas (XOR of bits)
check for normalization
increment or decrement the results mantissa if necessary
Example Pipeline
Pipelining 3
compiler generally manages pipelining
only way to get anywhere near peak flops
often limited by memory bottleneck (see next)
requires lots of independent operations
recursion kills pipelining:
do i = 1,n
a(i) = b(i) + a(i-1)
enddo
see later for other problems
Memory access
Processors are getting faster much faster than memory!
Systems use hierarchical memory to try to overcome this
Small amounts of "fast memory" and large amounts of
"slow memory"
All sorts of sophisticated mechanisms to integrate these:
caching
cacheline loading
prefetching
...
Lots of pitfalls to kill performance! (strided access, cache misses, TLB misses, ...)
Memory levels
Memory level
Size
Access time
Registers
64-640B
1ns
L1 cache
16-64KB
2-5ns
L2 cache
512KB-8MB
5-10ns
Main memory
128MB-...
70-400ns
1ns = 1-3 clock cycles.
Cache
Cache is fast memory close to the CPU.
A load of a word from a memory address actually loads a cacheline
(typically 32-256 contiguous bytes or 4-32 words).
So a long memory access time (high latency) returns a lot of data
(reasonable bandwidth)
If you only use 1 word from the cacheline, you waste the bandwidth!
Using all the contiguous words amelorates the load cost
Typical memory system
Exercise: Strided access
Take a look at memory.(f90|c) and notice how it is doing a
strided access through memory.
Compile the code with (ifort|icc) memory.(f90|c) -o memory
Run it with ./memory
Run it with different values for the stride (make sure you do 1) and
notice how as you increase the stride, the time to execute increases.
What happens if you use -fast?
Memory access patterns
Worst-to-Best List:
strided or random access over a large range (~1GB) with gap > pagesize (4-16KB)
random or strided access with a gap >= cacheline size (64B-256B)
small stride with gaps between 1 and cacheline size (2-8)
contiguous access
repeat use of data:
from a small set of pages
from a small set of cachelines (512KB-8MB) 2nd or 3rd level cache
from an even smaller set of cachelines (16KB-64KB) 1st level cache
from a small set of words (8-80) in registers
Multidimensional arrays
Contiguous indices may not mean contiguous memory
Multidimensional arrays are "linearized" in memory
Fortran has first index contiguous, C has last
parameter :: N = 128
real*8 a(N,N,N)
do i = 1,N
do j = 1,N
do k = 1,N
.... a(i,j,k) ....
enddo
enddo
enddo
BAD!
Streaming contiguously through lots of memory is still slow
Getting data directly from cache on every load is up to an
order of magnitude faster
Reuse data already in cache as many times as possible before
it gets expelled
Blocked algorithms for dense linear algebra
Reordering of loop operations in other algorithms
Exercise: Cache reuse
The simplest way to make better use of cache it to continually use
the same memory addresses. By interleaving iterations, modify the
code fd.(f90|c) to make better use of cache.
Code Optimization
Strength reduction
Conditionals
Procedure calls and inlining
Unrolling
Aliasing
...
Strength reduction
Should know which operations are expensive
Example
divides
a**2.0d0 vs a*a
a**0.5d0 vs sqrt(a)
a**2.5d0 vs a*a*sqrt(a)
pow()
Exercise ####
Conditionals
try to avoid conditional in inner loops
breaks pipelining
branch mispredictions cause major disruption to op flow
unusable memory loads, registers and caches polluted ...
Use of:
c=merge(a,b,logical) in Fortran
c=logical?a:b; in C
can generate single instruction (cmov) and no branch
Exercise: Conditionals
Compile and time the code conditional.(c|f90)
> ifort -fast conditional.f90 -o conditional
> time ./conditional
and in C
> icc -fast condition.c -o conditional
> time ./conditional
Now removing as many conditionals as possible and see what performance increase is observed.
Procedure calls and inlining
Procedure calls are a GOOD THINGTM (for code structure
and management)
But they do cost cycles to push registers on the stack etc
Much worse - procedure calls break pipelining in "tight" loops
Procedure calls and inlining 2
Inlining involves inserting the procedure code in place
of the call:
by hand
by preprocessors and macros
by the compiler
often requires more ops etc
compiler has to "see" the procedure source when compiling the call