EA + Frostbite + Star Wars = fun times ahead

I recently moved to Motive Studios and joined Jade Raymond and her talented (still growing) team for new exciting projects!

I’m very excited by this new opportunity, as I’m now directly collaborating with DICE and contributing to the Frostbite engine, meeting old friends (Colin Barré-Brisebois ;)), and will help build a solid core team for our amazing future projects.

It was a pleasure to work at WB Games, where I leave behind a few good friends. I can’t wait to see what we were working on there get announced!

It kind of explains why I haven’t had much time to finish and post Part 3 of my code perf series, but yeah it’s almost there, so stay tuned!


Maximizing code performance by thinking data first – Part 2

Knowing the hardware

Memory cache

The CPU is not physically connected directly to main memory: on modern CPUs, all memory accesses (loads and stores) go through the memory cache.

When the CPU executes a fetch instruction (load), the memory controller first searches the caches for an entry whose tag matches the memory address to read. If there is a match, this is a cache hit and the data can be loaded directly from cache memory. Otherwise it’s a cache miss, in which case the memory controller attempts to fetch the data from the subsequent cache levels (i.e. L1D, L2, then L3 if any), and ultimately from main RAM. The data is then stored in L1, L2 and L3 (inclusive cache).

[Figure: Memory latency on consoles – Jason Gregory]

In this simplified illustration, the CPU (an AMD Jaguar, used in the PS4 and XB1) has 2 cache levels, L1 and L2. As we can see, not only data is cached: L1 is split into an instruction cache (L1I) and a data cache (L1D), since the memory regions needed for code and data are independent of each other. Generally speaking, L1I is much less problematic than L1D, but we’ll see a few guidelines on how to improve instruction cache usage in another post.

In terms of latency, L1 is orders of magnitude faster than L2, which is itself around 10 times faster than main RAM. The numbers look high, but the entire cost does not have to be paid for every cache miss: part of it can be hidden through latency hiding, scheduling, and so on, but such techniques are beyond the scope of this post.

[Figure: Memory access latency – Andreas Fredriksson]

Each cache entry, or cache line, contains several contiguous words (64 bytes on an AMD Jaguar or a Core i7). When the CPU executes an instruction that fetches or stores a value, the entire cache line is transferred into L1D. In the case of a store, the cache line that has been written to is marked as dirty until it is written back to RAM.

[Figure: Write from register to memory]

To be able to load new data in the cache, it is almost always first necessary to make room by evicting a cache line.

  • Exclusive cache: eviction from L1D pushes the cache line into L2. This means room has to be made in L2, which might in turn push data out to main memory. Moving the evicted line from L1D to L2 adds to the latency of a cache miss.
  • Inclusive cache: each cache line in L1D is also present in L2. Evicting from L1D is much faster and requires no further action.

 

Recent Intel and AMD processors implement inclusive cache. It might look like a waste of cache at first, but there are two advantages to using such a technique:

  • it lowers the latency of a cache miss, since there is no need to push the evicted cache line from one cache level to the next,
  • if one core needs data another core is working on, it can fetch the most current version from the upper cache levels without having to interrogate that core, which is why inclusive caches have become more and more popular with the explosion of multi-core CPU architectures.

 

Cache line collisions: while multiple cores can read cache lines efficiently, writes can cause performance issues. False sharing occurs when different cores modify independent data mapped to the same cache line. Due to cache coherence protocols, when a core writes to a cache line, the line referring to the same memory is invalidated on the other cores (cache thrashing), forcing memory stalls on every write. False sharing can be avoided by making sure different cores work on different cache lines (by adding extra padding, aligning structures on 64 bytes, and so on; see the sketch after the figure below).

[Figure: Avoid false sharing by keeping per-thread writable data on different cache lines]
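To make the idea concrete, here is a minimal sketch (types and names are mine, not from the post): each per-thread counter is padded out to its own 64-byte cache line with alignas(64), so the writing cores never invalidate each other’s lines.

#include <atomic>
#include <cstdint>
#include <thread>

// Hypothetical per-thread counters. alignas(64) pads each counter to a full
// 64-byte cache line, so concurrent writes from different cores never touch
// the same line (no false sharing).
struct alignas(64) PerThreadCounter
{
    std::atomic<uint64_t> value{0};
};

static_assert(sizeof(PerThreadCounter) == 64, "one counter per cache line");

int main()
{
    constexpr int kThreads = 4;
    PerThreadCounter counters[kThreads];

    std::thread workers[kThreads];
    for (int t = 0; t < kThreads; ++t)
    {
        workers[t] = std::thread([&counters, t]
        {
            // Each thread hammers its own counter, on its own cache line.
            for (int i = 0; i < 1000000; ++i)
                counters[t].value.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& w : workers)
        w.join();
    return 0;
}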

As one can see, understanding the hardware architecture being targeted is key for the programmer to spot and fix problems that would otherwise go unnoticed.

Coreinfo is a command-line utility that provides a detailed summary of the instruction sets supported by a CPU, the caches assigned to each logical processor, and more. Here is the output when running it on a Core i5-3570K:

*--- Data Cache        0, Level 1,  32 KB, Assoc  8, LineSize 64
*--- Instruction Cache 0, Level 1,  32 KB, Assoc  8, LineSize 64
*--- Unified Cache     0, Level 2, 256 KB, Assoc  8, LineSize 64
**** Unified Cache     1, Level 3,   6 MB, Assoc 12, LineSize 64
-*-- Data Cache        1, Level 1,  32 KB, Assoc  8, LineSize 64
-*-- Instruction Cache 1, Level 1,  32 KB, Assoc  8, LineSize 64
-*-- Unified Cache     2, Level 2, 256 KB, Assoc  8, LineSize 64
--*- Data Cache        2, Level 1,  32 KB, Assoc  8, LineSize 64
--*- Instruction Cache 2, Level 1,  32 KB, Assoc  8, LineSize 64
--*- Unified Cache     3, Level 2, 256 KB, Assoc  8, LineSize 64
---* Data Cache        3, Level 1,  32 KB, Assoc  8, LineSize 64
---* Instruction Cache 3, Level 1,  32 KB, Assoc  8, LineSize 64
---* Unified Cache     4, Level 2, 256 KB, Assoc  8, LineSize 64

It reports a 32 KB L1 data cache, a 32 KB L1 instruction cache, a 256 KB L2 cache, and a 6 MB L3 cache. On this architecture, each core has its own L1 and L2, while L3 is shared by all cores.

This is just one specific architecture. In the case of an AMD Jaguar CPU, each core has a dedicated L1 cache, while L2 is shared by groups of 4 cores, called clusters (Jaguar has no L3).

[Figure: A 4-core cluster (AMD Jaguar CPU)]

Special care should be taken when dealing with clusters. We’ve seen how writing to a cache line invalidates the same cache line in other cores, and how it degrades performance. Well, it gets worse on this type of architecture: a core fetching data from its nearest L2 cache, located on the same cluster, takes around 26 cycles [2], while fetching from another cluster’s L2 can take up to 190 cycles [6]. That is almost the same cost as fetching data from RAM!

[Figure: L2 latency on AMD Jaguar clusters – Jason Gregory]

 

For a deeper look at cache coherency, I encourage readers to have a look at Ryg’s Cache Coherency Primer [5].

 

 

Assembly basics

x86-64, x64, IA-64, AMD64… or the birth of the x64 architecture

Intel and AMD each developed their own 64-bit architecture: IA-64 (Intel) and AMD64 (AMD). IA-64 was a drastic departure from the 32-bit x86 processors, in that it carried no legacy from the x86 architecture: x86 applications had to run through an emulation layer, and therefore performed poorly on it. Lacking x86 compatibility, IA-64 never took off outside the enterprise server space. AMD, on the other hand, designed a more conservative 64-bit architecture (AMD64) by extending the existing x86 architecture with a new 64-bit instruction set. Intel, having lost the 64-bit war [18], had no choice but to adopt the same extensions in its own x86 processors [1]. This section focuses on x86-64, also known as the x64 or AMD64 architecture.

For years, PC programmers used x86 assembly to write performance-critical code: Mode X [4], CPU skinning, collisions, software rasterizers… However, 32-bit PCs have slowly been replaced by 64-bit ones, and the underlying assembly code has changed.

Understanding assembly is necessary if you want to know why some things are slow and others are fast. It also helps in understanding how to use intrinsics to optimize critical code paths, and in debugging optimized (e.g. -O3) builds when source-level debugging is no longer meaningful.


Registers

Registers are a small amount of very fast memory with almost no latency (usually 1 CPU cycle). They can be seen as the internal memory of the CPU, holding the data that is directly processed by CPU instructions.

An x64 processor has 16 general-purpose registers (GPRs). A GPR is not dedicated to any specific type of data; instead, it holds operands and addresses as needed at execution time.

x64 extends x86’s 8 GPRs to be 64-bit, and adds 8 new 64-bit registers. The 64-bit registers have names beginning with “r”, so for example the 64-bit extension of eax (32-bit) is called rax. The new registers are named r8 through r15.

[Figure: General architecture (software.intel.com)]

x64 registers include:

  • 16 64-bit general purpose registers (GPR), the first eight of which are labeled rax, rbx, rcx, rdx, rbp, rsi, rdi, and rsp. The second eight are named r8-r15.
  • 8 64-bit mmx registers (MMX instruction set), overlaid on the x87 FPU floating-point registers (fpr)
  • 16 128-bit xmm vector registers (SSE instruction set)
  • (on more recent CPUs) 256-bit ymm registers (AVX instruction set), extending xmm registers
  • (on more recent CPUs) 512-bit zmm registers (AVX-512 instruction set), extending the ymm registers and increasing the number of vector registers to 32

[Figure: Relationship of ZMM, YMM and XMM registers]

Some GPRs are labeled differently for historical reasons: ax used to be the Accumulator register, cx the Counter register, dx the Data register, and so on. Most of these registers have lost their special purpose in the modern instruction set, except rsp (Stack Pointer) and rbp (Base Pointer), which are reserved for hardware stack management (although rbp can often be ‘optimized out’ and used as a general-purpose register – ‘omit frame pointer’ in Clang).

The lower bits of x64 registers can be accessed through sub-registers. In the case of the first 8 registers, this is done using their legacy names. The newer registers (r8–r15), on the other hand, follow a simpler, uniform pattern (see the figure below).

[Figure: Labeled scalar registers]


Addressing

When an assembly instruction takes 2 operands, the first one is generally the destination and the second one the source. Each of them contains either the data to be processed or the address of the data. The 3 basic addressing modes are:

  • Immediate
    • mov eax, 4                      ;move 4 into eax
  • Register to register
    • mov eax, ecx                  ;move the content of ecx to eax
  • Indirect:
    • mov eax, [ebx]               ;move the 4 bytes (size of eax) at address ebx into eax
    • mov byte ptr [rcx], 5    ;move 5 into the byte at address rcx
    • mov edx, dword ptr [rcx+4*rax] ;move the dword at address rcx+4*rax into edx

“dword ptr” is called a size directive. A size directive tells the assembler which size to use when there is an ambiguity about the size of the referenced memory operand (e.g. mov [rcx], 5: should it write a byte? a word? a dword?).

It can indicate: byte (8-bit), word (16-bit), dword (32-bit), qword (64-bit), xmmword (128-bit), ymmword (256-bit), zmmword (512-bit).

 


SIMD Instruction sets

A scalar implementation performs operations on one pair of operands at a time. Vectorization is the process of converting an algorithm from operating on a single piece of data at a time to operating on multiple pieces of data at once (we’ll see later on how to do that).

Modern processors can take advantage of Single Instruction, Multiple Data (SIMD) instruction sets (vector instructions) to process data in parallel.

[Figure: SIMD processing]

SIMD instruction sets available on x86 processors are:

  • Multimedia eXtension (MMX)
    • Legacy. Supports arithmetic operations on integers packed into 64-bit vector registers.
  • Streaming SIMD Extensions (SSE)
    • arithmetic operations on floats packed into 128-bit vector registers. Integer and double support was added in SSE2.
  • Advanced Vector Extensions (AVX) – x64 only
    • added support for 256-bit vector registers
  • AVX-512 – x64 only
    • added support for 512-bit vector registers

[Figure: Vector registers on x64 processors]

Game engines usually spend 90% of their execution time running a small portion of the code base, mainly iterating over data and processing it. In such scenarios, SIMD can make a big difference. SSE instructions are commonly used to process sets of 4 floats packed into 128-bit vector registers, in parallel. One can see how fast and convenient this can be for 3D maths, provided the data is well organized in memory.

SSE is mostly geared toward vertical data representation (SoA) and processing but, generally speaking, the choice of Structure of Arrays (SoA) versus Array of Structures (AoS) for best performance depends on access patterns.

  • AoS is probably the most natural layout and the easiest to write; it fits OOP paradigms.
  • AoS has better data locality if all the members are accessed together.
  • SoA exposes more vectorization opportunities (vertical processing).
  • SoA often uses less memory because padding is only between arrays.

 

// Array Of Structures
struct Sphere
{
  float x;
  float y;
  float z;
  double r;
};
Sphere* AoS;

Memory layout (the double r must be 8-byte aligned, so 4 bytes of padding are inserted after z):
------------------------------------------------------------------
| x | y | z | pad | r | x | y | z | pad | r | x | y | z | pad | r
------------------------------------------------------------------

// Structure Of Arrays
struct SoA
{
  float* x;
  float* y;
  float* z;
  double* r;
  size_t size;
};

Memory layout:
------------------------------------------------------------------
| x | x | x ..| pad | y | y | y ..| pad | z | z | z ..| pad | r..
------------------------------------------------------------------
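As a quick illustration of the vectorization point (a sketch using an assumed helper function, not code from the post), the SoA layout above lets SSE process 4 spheres per iteration without any shuffling:

#include <cstddef>
#include <xmmintrin.h>  // SSE

// Computes the squared distance to the origin for 4 spheres at a time,
// reading straight from the contiguous SoA arrays defined above.
void SquaredDistances(const SoA& spheres, float* out)
{
    for (size_t i = 0; i + 4 <= spheres.size; i += 4)
    {
        __m128 x = _mm_loadu_ps(spheres.x + i);
        __m128 y = _mm_loadu_ps(spheres.y + i);
        __m128 z = _mm_loadu_ps(spheres.z + i);

        __m128 d2 = _mm_add_ps(_mm_add_ps(_mm_mul_ps(x, x),
                                          _mm_mul_ps(y, y)),
                               _mm_mul_ps(z, z));
        _mm_storeu_ps(out + i, d2);
    }
    // Remainder (size not a multiple of 4) would be handled scalar; omitted for brevity.
}

With the AoS layout, the same loop would first have to gather x, y and z out of interleaved Sphere structs before any SSE math could happen.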

One final note on AVX, which is a natural extension of SSE and brings the vector register size up to 256 bits, meaning up to 8 floats can be packed and processed in parallel. While Intel processors have supported 256-bit registers natively since AVX was introduced, AMD processors can be problematic: early AVX-capable AMD CPUs, such as Bulldozer and Jaguar, decompose 256-bit operations into two 128-bit operations, driving the latency up compared to SSE.

In conclusion, it’s tricky to target AVX exclusively (maybe for internal tools if your workstations are Intel-based), since most AMD processors don’t support it natively. On the other hand, SSE2 can be taken for granted on any x64 CPU (it’s part of the spec), so there is no real excuse not to use it.

 


Out-of-Order execution

In an Out-of-Order (OoO) CPU pipeline, when the execution of an instruction is delayed because the necessary input data is not ready, the CPU tries to find later instructions to execute first, if their input data is ready.

An instruction cycle (or fetch-decode-execute cycle) is the process by which a CPU retrieves an instruction from memory, determines what to do with it, and executes it. An Out-of-Order CPU pipeline instruction cycle can be summed up as follows:

  • Fetching/Decoding: the instruction is fetched from the L1I (instruction cache). It is then decoded into smaller operations called micro-operations, or µops.
  • Renaming: the CPU register set can cause execution stalls due to register and data dependencies. To help solve this problem and eliminate false dependencies, the CPU provides a set of unnamed internal registers, which are used for actual computations. Register renaming is the process of converting references to the architectural registers (logical) into unnamed register references (physical).
  • Reorder Buffer: it contains µops that are waiting to be executed, stored in order, as well as those that have already been executed, but not yet retired.
  • Scheduling: µops stored in the reorder buffer can be dispatched to parallel execution units in any order, taking into account data dependencies and availability. The result of a µop is written back to the reorder buffer along with the µop, until it is retired.
  • Retirement: the retirement unit continuously checks the status of µops in the reorder buffer, writes the results of executed µops back into the architectural registers (user visible), and removes the µops from the reorder buffer.

 

[Figure: AMD Jaguar processor architecture]

Going back to the AMD Jaguar architecture [11][12], we can find all the blocks mentioned earlier. For the integer pipeline:

  • “Decode and Microcode ROMs”
    • = Fetch/Decode Unit
  • “Int Rename” and “Int PRF” (Physical Register File)
    • = Renaming Unit
    • Retire Control Unit (RCU), not visible here, handles register renaming and µops retirement.
  • Schedulers
    • Int Scheduler (ALUs)
      • can issue 1 µop per pipe (2 ALU execution units I0 and I1), out-of-order
    • AGU Scheduler (Load/Store)
      • can issue 1 µop per pipe (2 AGU execution units LAGU and SAGU), out-of-order

 

µops examples

Instruction                   µops
add reg, reg                  1: add
add reg, [mem]                2: load, add
addpd xmm, xmm                1: addpd
addpd xmm, [mem]              2: load, addpd

Looking at the AMD Jaguar section of Agner’s excellent Instruction tables [10], we can deduce what the execution pipeline looks like for the following code:

Code example 
mov eax, [mem1]  ; 1 - load
imul eax, 5      ; 2 - mul
add eax, [mem2]  ; 3 - load, add
mov [mem3], eax  ; 4 - store

Execution pipe (Jaguar)
 I0    |  I1   |  LAGU  |  SAGU   |  FP0  |  FP1
       |       | 1-load |         |       |
 2-mul |       | 3-load |         |       |
       | 3-add |        |         |       |
       |       |        | 4-store |       |

In this example, we can see that breaking instructions into µops gives the CPU opportunities to take advantage of the parallel execution units, partially or totally ‘hiding’ the latency of an instruction (3-load and 2-mul are executed in parallel, on 2 different execution units).

It is not always possible though. The dependency chain between 2-mul, 3-add and 4-store prevents the CPU from reordering those µops (4-store needs the result of 3-add, and 3-add needs the result of 2-mul). Avoiding long dependency chains is key to using parallel execution units efficiently.
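As an illustration (my own sketch, not taken from the post), here is the classic way to break a long dependency chain when summing an array: several independent accumulators give the out-of-order core independent µops to dispatch in parallel.

#include <cstddef>

// Single accumulator: each add depends on the previous one, forming one long
// dependency chain that serializes the execution units.
float SumSingleChain(const float* v, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}

// Four accumulators: four independent dependency chains the scheduler can
// interleave across the parallel execution units (note: floating-point
// reassociation may change the result slightly).
float SumFourChains(const float* v, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        s0 += v[i + 0];
        s1 += v[i + 1];
        s2 += v[i + 2];
        s3 += v[i + 3];
    }
    for (; i < n; ++i)   // remainder
        s0 += v[i];
    return (s0 + s1) + (s2 + s3);
}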

 

 


Visual Studio Options

I will use msvc++ 14.0 (VS2015) and Clang to illustrate compiler-generated assembly. I strongly encourage programmers to do the same and to get used to comparing different compilers. By doing so, they will get a better idea of how all the pieces fit together, and can form their own opinion on each compiler’s code quality.

A few things worth knowing:

  • The Show Symbol Names option displays local variable and function names in disassembly views, instead of instruction addresses or stack-relative addresses.

[Figure: Show Symbol Names]

  • Make the assembly easier to read by removing some checks:  
    • Project settings > C/C++ > Code Generation > Basic Runtime Checks to Default. 
  • Output a .asm file:
    • Project settings > C/C++ > Output Files > Assembler Output to Assembly With Source Code.
  • Frame-Pointer omission: tells the compiler not to use ebp for stack management 
    • /Oy (x86 only, in Clang: -fomit-frame-pointer, works in x64)

 


Basic disassembly examples 

This section goes through very simple C++ code snippets and their corresponding disassembly. The assembly code has been reorganized and fully commented to help beginners find their way, but I still encourage programmers to check [17] if they have any doubt about what an instruction does.

Function prologues and epilogues have been removed for simplicity and won’t be discussed here. Part 3 of this series will come back to this and explain the details of stack management and calling conventions.

Note: local variables are declared on the stack. For example, mov dword ptr [rbp + 4], 0Ah ; int b = 10 means the local variable ‘b’ is placed on the stack (pointed to by rbp) at offset 4, and initialized to 0Ah, i.e. 10 in decimal.


Single-precision floating-point arithmetic

Floating-point arithmetic can be performed either using the x87 FPU (80 bits of precision, scalar) or SSE (32 or 64 bits of precision, vectorized). On x64, the SSE2 instruction set is always supported and is the default choice for floating-point arithmetic [7].

[Figure: Single-precision floating-point arithmetic, using SSE. msvc++]
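The figure above is an image; the C++ being compiled (without optimizations) is presumably along these lines, with assumed initial values:

// Hedged reconstruction of the disassembled source (variable names taken from
// the walkthrough below, initial values assumed).
int main()
{
    float x = 1.0f;
    float y = 2.0f;
    float z = 3.0f;
    float result = x * x + y * y + z * z;   // squared length of (x, y, z)
    return 0;
}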

Initializations

  • movss xmm0, dword ptr [adr]             ; loads a floating-point located at adr into xmm0
  • movss dword ptr [rbp], xmm0            ; and stores it on the stack (float x)
  • …                                                                      ; same thing for y and z

Computes x*x

  • movss xmm0, dword ptr [rbp]            ; loads scalar x into xmm0
  • mulss xmm0, dword ptr [rbp]             ; multiplies xmm0 (=x) by x

Computes y*y and adds it to x*x

  • movss xmm1, dword ptr [rbp+4]        ; loads scalar y into xmm1
  • mulss xmm1, dword ptr [rbp+4]         ; multiplies xmm1 (=y) by y
  • addss xmm0, xmm1                                ; adds xmm1 (y*y) to xmm0 (x*x)

Computes z*z and adds it to x*x + y*y

  • movss xmm1, dword ptr [rbp+8]        ; loads scalar z into xmm1
  • mulss xmm1, dword ptr [rbp+8]         ; multiplies xmm1 (=z) by z
  • addss xmm0, xmm1                               ; adds xmm1 (z*z) to xmm0 (x*x + y*y)

Stores the final result

  • movss dword ptr [rbp+0Ch], xmm0    ; stores xmm0 into result
  • xor eax, eax                                                  ; eax = 0. eax contains the main()’s return value

In this example, xmm registers are used to hold a single floating-point value. With SSE, it is possible to work on single or multiple values, of different data types. For example, let’s consider the SSE add instructions:

  • addss xmm0, xmm1   ; each reg holds 1 scalar single-precision floating-point value
  • addps xmm0, xmm1  ; each reg holds 4 packed single-precision floating-point values
  • addsd xmm0, xmm1  ; each reg holds 1 scalar double-precision floating-point value
  • addpd xmm0, xmm1 ; each reg holds 2 packed double-precision floating-point values
  • paddd xmm0, xmm1 ; each reg holds 4 packed doubleword (32-bit integer) values

 


Branching

[Figure: Example of a branch. msvc++]
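The disassembled source is only shown as an image; based on the walkthrough below, it presumably boils down to:

// Hedged reconstruction (unoptimized build, values taken from the walkthrough).
int main()
{
    int a = 5;
    int b = 10;
    int result = 0;
    if (a < b)        // the compiler emits the inverted test (jge) to skip the 'then' block
        result = a;
    else
        result = b;
    return 0;
}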

Initializations

  • mov dword ptr [rbp], 5                                  ; stores 5 on the stack (int a)
  • mov dword ptr [rbp+4], 0Ah                       ; stores 10 on the stack (int b)
  • mov dword ptr [rbp+8], 0                             ; stores 0 on the stack (int result)

Condition

  • mov eax, dword ptr [rbp+4]                        ; loads b into eax
  • cmp dword ptr [rbp], eax                             ; compares a to eax (b)
  • jge @ECF81536                                                 ; jumps if a greater or equal to b

‘then’ result = a

  • mov eax, dword ptr [rbp]                             ; loads a into eax
  • mov dword ptr [rbp+8], eax                        ; stores eax on the stack (result)
  • jmp @ECF8153C                                               ; jumps to ECF8153C

‘else’ result = b

  • (ECF81536) mov eax, dword ptr [rbp+4]   ; loads b into eax
  • mov dword ptr [rbp+8], eax                                ; stores eax on the stack (result)
  • (ECF8153C) xor eax, eax                                  ; eax = 0. eax contains the main()’s return value

The cmp instruction compares the first source operand to the second, and sets the status flags in the RFLAGS register according to the result [9]. The (R)FLAGS register is the status register of x86 CPUs and contains the current state of the processor. cmp is typically used in conjunction with a conditional jump (e.g. jge), and the condition codes used by conditional jumps are based on the result of the cmp (RFLAGS condition codes).

 


Integer arithmetic and ‘for’ loop

In assembly, loops are basically represented as a series of conditional jumps (i.e. if … goto).

[Figure: Integer arithmetic and a ‘for’ loop. msvc++]
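The source behind the figure is presumably equivalent to the following (a hedged reconstruction, values taken from the walkthrough below):

int main()
{
    int sum = 0;
    int k = 10;
    for (int i = 0; i < k; ++i)   // increment, exit test and body each map to a block below
        sum += i;
    return 0;
}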

Initializations

  • mov dword ptr [rbp], 0                                 ; stores 0 on the stack (int sum)
  • mov dword ptr [k], 0Ah                                ; stores 10 on the stack (int k)
  • mov dword ptr [rbp+8], 0                            ; stores 0 on the stack (int i) for iterating on the loop
  • jmp main+30h                                                 ; jumps to main+30h

Portion of code responsible for incrementing i

  • (main+28h) mov eax, dword ptr [rbp+8]  ; loads i into eax
  • inc eax                                                                   ; increments it
  • mov dword ptr [rbp+8], eax                           ; and stores it back into the stack

Portion of code responsible for testing the exit condition (i >= k)

  • (main+30h) mov eax, dword ptr [k]           ; loads k from the stack into eax
  • cmp dword ptr [rbp+8], eax                           ; compares i to eax (= k) 
  • jge main+47h                                                      ; jumps (ends the loop) if i is greater than or equal to k

The ‘actual work’: sum+=i

  • mov eax, dword ptr [rbp+8]                       ; loads i into eax
  • mov ecx, dword ptr [rbp]                            ; loads sum into ecx
  • add ecx, eax                                                     ; adds eax to ecx (ecx = sum + i)
  • mov eax, ecx                                                    ; moves ecx into eax
  • mov dword ptr [rbp], eax                            ; stores eax (sum) back into the stack
  • jmp main+28h                                                 ; jumps and processes the next loop iteration
  • (main+47h) xor eax, eax                             ; eax = 0. eax contains the main()’s return value

 


SSE Intrinsics

Here is a typical example of vertical processing, where SSE allows the programmer to perform the same operation 4 times in parallel (in this case, a dot product). We can see how the intrinsics easily map to their assembly equivalents:

  • _mm_mul_ps maps to mulps
  • _mm_load_ps maps to movaps
  • _mm_add_ps maps to addps
  • _mm_store_ps maps to movaps

[Figure: SSE intrinsics, msvc++ in Release]
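The source behind the figure is only shown as an image; it is presumably close to the following sketch (array names, 16-byte alignment and the way A is broadcast are assumptions on my part):

#include <xmmintrin.h>  // SSE

// Dot products of vertices (xs, ys, zs, ws) against a constant vector A,
// 4 vertices per iteration. Arrays are assumed 16-byte aligned (movaps).
void DotProducts(const float* xs, const float* ys, const float* zs, const float* ws,
                 float Ax, float Ay, float Az, float Aw,
                 float* results, int vertexCount)
{
    __m128 ax = _mm_set1_ps(Ax);   // broadcast A components
    __m128 ay = _mm_set1_ps(Ay);
    __m128 az = _mm_set1_ps(Az);
    __m128 aw = _mm_set1_ps(Aw);

    for (int i = 0; i < vertexCount; i += 4)          // 4 vertices per iteration
    {
        __m128 x = _mm_load_ps(xs + i);               // movaps
        __m128 y = _mm_load_ps(ys + i);
        __m128 z = _mm_load_ps(zs + i);
        __m128 w = _mm_load_ps(ws + i);

        __m128 dx = _mm_mul_ps(x, ax);                // mulps
        __m128 dy = _mm_mul_ps(y, ay);
        __m128 dz = _mm_mul_ps(z, az);
        __m128 dw = _mm_mul_ps(w, aw);

        __m128 dots = _mm_add_ps(_mm_add_ps(dx, dy),  // addps
                                 _mm_add_ps(dz, dw));

        _mm_store_ps(results + i, dots);              // movaps
    }
}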

Initializations (an xmmword is 128 bits wide, i.e. 4 dwords)

  • (main+340h) movaps xmm1, xmmword ptr [rdx+rax]   ; loads a 128-bit xmmword (4 floats) at xs+i into xmm1
  • movaps xmm3, xmmword ptr [rax]                    ; loads 4 floats at ys+i into xmm3
  • movaps xmm0, xmmword ptr [r8+rax]            ; loads 4 floats at zs+i into xmm0
  • movaps xmm2, xmmword ptr [r9+rax]            ; loads 4 floats at ws+i into xmm2

Computes dot(v[i], A) = xi * Ax + yi * Ay + zi * Az + wi * Aw, 4 vertices at a time

  • mulps xmm1, xmm4                     ; xmm1 *= xmm4    xn.Ax  , n [0..3]
  • mulps xmm3, xmm5                      ; xmm3 *= xmm5    yn.Ay  , n [0..3]
  • mulps xmm0, xmm6                     ; xmm0 *= xmm6    zn.Az  , n [0..3]
  • mulps xmm2, xmm7                     ; xmm2 *= xmm7    wn.Aw, n [0..3]
  • addps xmm3, xmm1                       ; xmm3 += xmm1   xn.Ax + yn.Ay
  • addps xmm2, xmm0                      ; xmm2 += xmm0   zn.Az + wn.Aw
  • addps xmm2, xmm3                      ; xmm2 += xmm3   xn.Ax + yn.Ay + zn.Az + wn.Aw

Stores the results at the memory address (results + offset), and loop

  • movaps xmmword ptr [r10 + rax], xmm2 ; stores a 128-bit xmmword (4 floats) to the address pointed by r10+rax
  • add rax, 10h                                      ; adds 16 to rax (current offset = size of 4 floats)
  • sub r11,1                                            ; r11-- (remaining loop iterations)
  • jne main+34h                                    ; jumps and processes the next loop iteration

Porting this code to AVX (256-bit, or 8 single-precision floats) is very straightforward. Here is what it looks like:

__m256 Ax = _mm256_broadcast_ss(A);
...
for (int i = 0; i < vertexCount; i+=8) // 8 floats (256-bit)
{
   __m256 x4 = _mm256_load_ps(xs + i);
   ..
   __m256 dx = _mm256_mul_ps(Ax, x4);
   ..
   __m256 a0   = _mm256_add_ps(dx, dy);
   ..
   _mm256_store_ps(results + i, dots);
}

 

 


Switch case

[Figure: Switch case. msvc++]
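The disassembled source is again only shown as an image; it presumably looks like this (only the values for case 0 and case 1 appear in the walkthrough below, the others are placeholders):

int main(int argc, char** argv)
{
    int val = 0;
    switch (argc)
    {
        case 0:  val = 1; break;
        case 1:  val = 3; break;
        case 2:  val = 0; break;   // value not shown in the walkthrough
        case 3:  val = 0; break;   // value not shown in the walkthrough
        default: val = 0; break;   // value not shown in the walkthrough
    }
    return 0;
}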

Initializations

  • mov dword ptr [rbp], 0                       ; stores 0 on the stack (int val)
  • mov eax, dword ptr [argc]                  ; loads argc into eax
  • mov dword ptr [rbp+44h], eax         ; stores it on the stack

Conditions

  • cmp dword ptr [rbp+44h], 0              ; compares argc to 0
  • je main+38h                                            ; if argc == 0, jumps to main+38h (case 0)
  • cmp dword ptr [rbp+44h], 1              ; compares argc to 1
  • je main+41h                                            ; if argc == 1, jumps to main+41h (case 1)
  • cmp dword ptr [rbp+44h], 2              ; compares argc to 2
  • je main+4Ah                                           ; if argc == 2, jumps to main+4Ah (case 2)
  • cmp dword ptr [rbp+44h], 3              ; compares argc to 3
  • je main+53h                                            ; if argc == 3, jumps to main+53h (case 3)
  • jmp main+5Ch                                       ; jumps to main+5Ch (default)

Case 0

  • (main+38h) mov dword ptr [rbp], 1 ; stores 1 on the stack (val)
  • jmp main+63h     ; jumps to main+63h, and exits the switch case

Case 1

  • (main+41h) mov dword ptr [rbp], 3  ; stores 3 on the stack (val)
  • jmp main+63h     ; jumps to main+63h, and exits the switch case

  • (main+63h) xor eax, eax           ; eax = 0. eax contains the main()’s return value

The assembly generated in this case corresponds to a series of branches. The result would be very similar if the switch were replaced by a series of if-else statements in the C++ code. In some scenarios, and depending on the compiler, the branches might be optimized into a lookup table of jump addresses.

 

 

Special Thanks

Mickael Gilabert (@mickaelgilabert) and Gabriel Lassonde for their input 🙂

 

References


Maximizing code performance by thinking data first – Part 1

Introduction

As rendering programmers, we tend to live in a world where low-level considerations are mandatory in order to produce a 30 ms GPU frame. Techniques and new rendering passes are designed from the ground up with bandwidth (geometry attributes, texture cache, export, …), GPR pressure, latency hiding and ROPs in mind, to name a few.

Back in the day, it used to be quite the thing in the CPU world as well, and it’s actually telling that we are now porting old CPU tricks to recent GPUs in order to accelerate ALU operations (low-level optimizations for AMD GCN, Quake’s fast inverse square root, …).

[Figure: Quake’s fast inverse square root]

Recently, and especially since the move to 64-bit, I have seen an increasing quantity of unoptimized code being produced, as if all the knowledge gathered until then had suddenly been buried.

Old tricks such as the fast inverse square root might prove counter-productive on today’s processors, yes, but programmers shouldn’t forget about low-level considerations and hope for the compiler to solve all the problems. It won’t.

This post is not an exhaustive dive into the hardcore technical details of the hardware. It only serves as an introduction, a reminder, of the basic principles of writing efficient code for the CPU, and shows that low-level thinking is still relevant today, even for CPUs.

This post is the first part of a series of 2 or 3 posts that will introduce programmers to memory caching, vector programming, reading and understanding assembly code, and writing compiler-friendly code.

 

Why bother ?

Mind the gap

In the 80s, memory bus frequencies were similar to CPU frequencies, with almost zero latency. Then CPU speeds followed Moore’s law and increased exponentially, while the performance of RAM chips didn’t increase proportionally, and memory quickly became a bottleneck. This is not because faster memory cannot be built; it can, but it is not economical.

[Figure: Processor–memory speed evolution]

In order to reduce this bottleneck, CPU designers added a very small amount of this very expensive memory between the CPU and main memory: the cache memory.

[Figure: Cache memory]

The idea is that for a short amount of time, there is a good chance the same code or data gets reused.

  • temporal locality: the same code (e.g. a loop body) or data is very likely to be reused within a short period of time
  • spatial locality: memory accessed within a short period of time tends to be close together (adjacent addresses), which is why whole cache lines are fetched at once

CPU cache is a sophisticated acceleration technique, but it cannot work optimally without some help from the programmer. Unfortunately, the cost of memory accesses and the structure of CPU caches are not well understood by most programmers.

Data oriented design

In our case, we are mostly interested in game engines. Game engines have to handle an increasingly large amount of data, transform it, and ultimately render it on screen, in real time. In this context, and in order to solve problems efficiently, the programmer has to understand the data he is processing and the hardware he is targeting. Hence the need to adopt a data-oriented design (DoD).

Can’t the compiler do it for me ?

[Figure: Simple addition. Left: C++ code. Right: generated assembly]

Let’s consider the trivial example above, on an AMD Jaguar CPU [3][4] (close to what’s in consoles). It breaks down into (a sketch of the C++ side follows the list):

  • a load operation (around 200 cycles if the data is not cached)
  • the actual work: inc eax (1 cycle)
  • a store operation (~3 cycles, same cache line)
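Since the figure is only an image, here is roughly what the C++ side might look like (a minimal sketch, names assumed):

// The increment compiles to roughly: load value into eax, inc eax, store eax back.
int value;           // lives somewhere in memory

void Increment()
{
    ++value;
}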

We can see that in such a simple example, most of the CPU time is spent waiting for data, and this doesn’t get better with more complex routines, unless the programmer is aware of the underlying architecture being targeted.

In short, compilers:

  • don’t have the big picture: it is very hard for them to predict how data will be organized and accessed
  • can be quite good at systematically optimizing arithmetic operations, but that is often just the tip of the iceberg

The compiler’s room for manoeuvre is actually quite small when it comes to optimizing memory accesses. Only the programmer knows the context and the piece of software he’s trying to write. Thus, it’s critical for him to understand the data flow and adopt a data-oriented approach in order to get the most out of modern CPUs.

The ugly truth: OOP vs DoD

[Figure: Memory access pattern impact on performance (Mike Acton, GDC15)]

Object Oriented Programming (OOP) is the dominant paradigm being taught in schools these days. It encourages the programmer to think in terms of real-world objects and their relationship in order to solve problems.

A class usually encapsulates code and data, so that an object contains all of its data. By encouraging Array of Structures (AoS) layouts, and arrays of *pointers to* structures/objects, OOP violates the spatial locality principle on which cache memory relies to speed up RAM access (see the sketch below). Remember the performance gap between CPUs and memory?
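As a minimal illustration (types and names are mine, not from the post), compare the two layouts:

#include <vector>

// OOP style: an array of pointers to heap-allocated objects. Iterating chases
// pointers scattered all over the heap, defeating spatial locality.
struct GameObject { float x, y, z; /* ... many other members ... */ };
std::vector<GameObject*> objects;

// DoD style: the hot data is packed contiguously, so iterating streams through
// consecutive cache lines.
std::vector<float> xs, ys, zs;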

[Figure: With modern hardware, excessive encapsulation is bad]

The main goal of this post, by focusing on Data Oriented Design (DoD), is to shift the focus of software development from worrying about code to understanding data transformations, and to push back against a programming culture and status quo engendered by OOP advocates.

I will end this section by quoting Mike Acton and the 3 big lies [1]

  • Software is a platform
    • You need to understand the hardware you work on
  • Code designed around model of the world
    • Code needs to be designed around the model of the data
  • Code is more important than data
    • Memory being the bottleneck, data is clearly the most important thing

Part 2 will cover basics of x86 hardware, stay tuned!

Special Thanks

@mickaelgilabert and @Doomed_Daniel for their comments 🙂

References

 


A look at Direct3D 12 – Part 1

Need for a new API

Direct3D, as any 3D API on PC, has two main goals:

  1. Provide a low overhead graphics API
  2. Provide a single API that works on different hardware

With APIs and GPUs becoming more and more complex, it has become difficult to achieve these goals. D3D’s abstraction layer (HAL) requires extra work from the driver, and leads to higher CPU overhead compared to fixed hardware platforms like consoles.

Performance on PC is not optimal. In fact, it’s very likely you’ll be CPU bound if you naïvely port your console game to PC.

We need a console-level efficiency API: more CPU efficiency and better CPU parallelism. The latter is not achieved very well by D3D11, and in practice most of the work ends up being done by one core anyway.

D3D12 intends to be a low CPU overhead API, and to make it more efficient to:

  • generate rendering commands
  • reuse rendering commands
  • generate commands from multiple threads in parallel.

 

Pipeline State Object : PSO

API calls can be costly: each call introduces some CPU and driver overhead. D3D10 reduced CPU overhead over D3D9 by introducing render state objects, allowing the application to set a group of related state values in a single API call.

DX9 Style, 1 call sets 1 state value

Device->SetRenderState(D3DRS_SRCBLEND, D3DBLEND_SRCALPHA);

DX10/11 Style, 1 call sets all blend state values

float blendFactors[] = {0.0f, 0.0f, 0.0f, 0.0f};
Device->OMSetBlendState(BlendStateObject, blendFactors, 0xffffffff);

In D3D11, states are usually recorded into a set of commands, and resolved by the driver at Draw/Dispatch time into a set of GPU commands. We call this hardware mismatch overhead.

[Figure: State overhead]

If we take a look at the diagram above [1], we can see that the GPU states (on the right) depend on multiple pipeline states (on the left). At draw time, the driver has to check all these pipeline states in order to generate a set of GPU commands that reflects the states set by the application.

Engineers at Microsoft observed that a typical modern game uses around 200 to 400 complete pipeline states per frame. What if we let the app create them and switch from one to another when needed? Again, this is the same idea as the move from D3D9 single state values to D3D10 render state objects, taken one step further.

This gives us the new design: a single Pipeline State Object [1].

[Figure: Pipeline state optimized]

D3D12 replaces Render State Objects by grouping them together into a Pipeline State Object (PSO).
To keep the number of unique PSOs low, some states that tend to change very frequently (viewport, scissor) are kept out of the PSO and are called non-PSO states.

A Pipeline State Object includes all the shaders and a significant amount of the state objects. The only way to change one of the states grouped in a PSO is to set a new PSO.
With such a design, the driver knows exactly how to program the hardware for a given PSO, and can preprocess the GPU commands needed to set up the hardware state.
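To make this concrete, here is a minimal sketch of what describing and creating such a PSO looks like (shader blobs, root signature and input layout are assumed to have been created elsewhere; the CD3DX12_* defaults come from Microsoft’s d3dx12.h helper header, and error handling is omitted):

#include <d3d12.h>
#include "d3dx12.h"   // helper structures from Microsoft's D3D12 samples

ID3D12PipelineState* CreateSimplePSO(ID3D12Device* device,
                                     ID3D12RootSignature* rootSignature,
                                     ID3DBlob* vs, ID3DBlob* ps,
                                     const D3D12_INPUT_ELEMENT_DESC* inputLayout,
                                     UINT inputLayoutCount)
{
    // All the pipeline states are grouped into a single description...
    D3D12_GRAPHICS_PIPELINE_STATE_DESC desc = {};
    desc.pRootSignature        = rootSignature;
    desc.VS                    = { vs->GetBufferPointer(), vs->GetBufferSize() };
    desc.PS                    = { ps->GetBufferPointer(), ps->GetBufferSize() };
    desc.InputLayout           = { inputLayout, inputLayoutCount };
    desc.BlendState            = CD3DX12_BLEND_DESC(D3D12_DEFAULT);
    desc.RasterizerState       = CD3DX12_RASTERIZER_DESC(D3D12_DEFAULT);
    desc.DepthStencilState     = CD3DX12_DEPTH_STENCIL_DESC(D3D12_DEFAULT);
    desc.SampleMask            = 0xffffffff;
    desc.PrimitiveTopologyType = D3D12_PRIMITIVE_TOPOLOGY_TYPE_TRIANGLE;
    desc.NumRenderTargets      = 1;
    desc.RTVFormats[0]         = DXGI_FORMAT_R8G8B8A8_UNORM;
    desc.DSVFormat             = DXGI_FORMAT_D32_FLOAT;
    desc.SampleDesc.Count      = 1;

    // ...and baked once into a pipeline state object.
    ID3D12PipelineState* pso = nullptr;
    device->CreateGraphicsPipelineState(&desc, IID_PPV_ARGS(&pso));
    return pso;
}

At draw time, switching the whole pipeline configuration is then a single ID3D12GraphicsCommandList::SetPipelineState call.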

Stay tuned for part 2..

References

[1] Max McMullen, “Direct3D 12 API Preview”, BUILD 2014.
