Maximizing code performance by thinking data first – Part 2

Knowing the hardware

Memory cache

The CPU is not physically connected directly to main memory. On a modern CPU, all memory accesses (loads and stores) go through the memory cache.

When the CPU executes a fetch instruction (load), the memory controller first searches the caches for an entry with a tag matching the memory address to read. If there is a match, it’s a cache hit and the data can be loaded directly from cache memory. Otherwise it’s a cache miss, in which case the memory controller attempts to fetch the data from the subsequent cache levels (L2, then L3 if any), and ultimately from main RAM. The data is then stored in L1, L2 and L3 (inclusive cache).

Memory latency on consoles – Jason Gregory

In this simplified illustration, the CPU (an AMD Jaguar, used in the PS4 and XB1) has 2 cache levels, L1 and L2. Note that not only data is cached: L1 is split into an instruction cache (L1I) and a data cache (L1D), since the memory regions needed for code and for data are independent of each other. Generally speaking, L1I is much less problematic than L1D, but we’ll see a few guidelines on how to improve instruction cache usage in another post.

In terms of latency, L1 is roughly an order of magnitude faster than L2, which is itself around 10 times faster than main RAM. The numbers look high, but the entire cost does not have to be paid for each cache miss: part of it can be hidden through latency hiding and instruction scheduling, but such techniques are beyond the scope of this post.

Memory access latency – Andreas Fredriksson

Each cache entry, or cache line, contains several contiguous words (64 bytes for an AMD Jaguar or a Core i7). When the CPU executes an instruction that fetches or stores a value, the entire cache line is transferred into L1D. In the case of a store, the cache line which has been written to is marked as dirty, until it’s written back into RAM.

Write from register to memory
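To make the 64-byte line granularity concrete, here is a minimal sketch (hypothetical example, not taken from the post): traversing a large row-major array of floats along its rows touches each cache line once, while traversing it along its columns pulls a new line into L1D for almost every access.

#include <cstddef> // size_t

// Hypothetical example: 'grid' is a large row-major matrix of floats.
// Row-major traversal reads each 64-byte cache line once (16 floats per line).
float SumRowMajor(const float* grid, size_t rows, size_t cols)
{
  float sum = 0.0f;
  for (size_t r = 0; r < rows; ++r)
    for (size_t c = 0; c < cols; ++c)
      sum += grid[r * cols + c];   // contiguous accesses, cache friendly
  return sum;
}

// Column-major traversal of the same data jumps cols * sizeof(float) bytes
// between accesses, pulling in a new cache line almost every iteration.
float SumColumnMajor(const float* grid, size_t rows, size_t cols)
{
  float sum = 0.0f;
  for (size_t c = 0; c < cols; ++c)
    for (size_t r = 0; r < rows; ++r)
      sum += grid[r * cols + c];   // strided accesses, cache hostile
  return sum;
}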

To be able to load new data in the cache, it is almost always first necessary to make room by evicting a cache line.

  • Exclusive cache: eviction from L1D pushes the cache line into L2. This means room has to be made in L2, which might in turn push data into main memory. Moving the evicted line from L1D to L2 adds to the latency of a cache miss.
  • Inclusive cache: each cache line in L1D is also present in L2. Evicting from L1D is much faster and requires no further action.

 

Recent Intel and AMD processors implement inclusive cache. It might look like a waste of cache at first, but there are two advantages to using such a technique:

  • it lowers the latency of a cache miss by not having to push the cache line from one cache level to the next when evicting,
  • if one core needs data another core is working on, it can fetch the most recent version from the shared, higher cache levels without having to interrogate that core. This is why inclusive caches have become more and more popular with the explosion of multi-core CPU architectures.

 

Cache line collisions: while multiple cores can read the same cache lines efficiently, writes can cause performance issues. False sharing occurs when different cores modify independent data that happens to be mapped onto the same cache line. Due to cache coherence protocols, when a core writes to a cache line, the copy of that line held by other cores is invalidated (cache thrashing), forcing memory stalls on every write. False sharing can be avoided by making sure different cores work on different cache lines (by adding extra padding, aligning structures on 64 bytes, …), as sketched below.

Avoid false sharing by having per-thread writable data on different cache lines
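As a minimal sketch (hypothetical code; the 64-byte value matches the cache line size reported above), per-thread data can be aligned and padded so that each element lives on its own cache line:

#include <cstdint>

// Hypothetical example: one counter per worker thread.
// Without alignment, adjacent counters share a cache line, and every write
// by one core invalidates that line in the other cores (false sharing).
struct PaddedCounter
{
  alignas(64) std::uint64_t value = 0; // alignas(64) also pads sizeof(PaddedCounter) to 64 bytes
};

PaddedCounter counters[8]; // one per thread; no two counters share a cache line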

As one can see, understanding the target hardware architecture is key for programmers to spot and fix problems that would otherwise go unnoticed.

Coreinfo is a command-line utility that provides a detailed summary of the instruction sets supported by a CPU, as well as the caches assigned to each logical processor. This is its output when running on a Core i5-3570K:

*--- Data Cache        0, Level 1,  32 KB, Assoc  8, LineSize 64
*--- Instruction Cache 0, Level 1,  32 KB, Assoc  8, LineSize 64
*--- Unified Cache     0, Level 2, 256 KB, Assoc  8, LineSize 64
**** Unified Cache     1, Level 3,   6 MB, Assoc 12, LineSize 64
-*-- Data Cache        1, Level 1,  32 KB, Assoc  8, LineSize 64
-*-- Instruction Cache 1, Level 1,  32 KB, Assoc  8, LineSize 64
-*-- Unified Cache     2, Level 2, 256 KB, Assoc  8, LineSize 64
--*- Data Cache        2, Level 1,  32 KB, Assoc  8, LineSize 64
--*- Instruction Cache 2, Level 1,  32 KB, Assoc  8, LineSize 64
--*- Unified Cache     3, Level 2, 256 KB, Assoc  8, LineSize 64
---* Data Cache        3, Level 1,  32 KB, Assoc  8, LineSize 64
---* Instruction Cache 3, Level 1,  32 KB, Assoc  8, LineSize 64
---* Unified Cache     4, Level 2, 256 KB, Assoc  8, LineSize 64

It reports a 32 KB L1 data cache, a 32 KB L1 instruction cache and a 256 KB L2 cache per core, plus a 6 MB L3 cache. On this architecture, L1 and L2 are dedicated to each core, while L3 is shared by all cores.

This is just one specific architecture. In the case of an AMD Jaguar CPU, each core has a dedicated L1 cache, while L2 is shared by groups of 4 cores, called clusters (Jaguar has no L3).

A 4-core cluster (AMD Jaguar CPU)

Special care should be taken when dealing with clusters. We’ve seen how writing to a cache line invalidates the same cache line in other cores, and how it degrades performance. Well, it gets worse on this type of architecture: a core fetching data from its nearest L2 cache, located on the same cluster, takes around 26 cycles [2], while fetching from another cluster’s L2 can take up to 190 cycles [6]. That’s almost the same cost as fetching data from RAM!

L2 latency on AMD Jaguar clusters – Jason Gregory

 

For a deeper look at cache coherency, I encourage readers to have a look at Ryg’s Cache Coherency Primer [5].

 

 

Assembly basics

x86-64bit, x64, IA-64, AMD64… or the birth of x64 architecture

Intel and AMD each developed their own 64-bit architecture: IA-64 and AMD64. IA-64 was a drastic departure from 32-bit x86 processors, in that it carried no legacy from the x86 architecture. x86 applications had to run through an emulation layer and therefore performed poorly on it. Suffering from this lack of x86 compatibility, IA-64 never took off outside the enterprise server market. AMD, on the other hand, designed a more conservative 64-bit architecture (AMD64) by extending the existing x86 architecture with a new 64-bit instruction set. Intel, having lost this 64-bit war [18], had no choice but to adopt the same extensions in its own x86 processors [1]. This section will focus on x86-64, also known as the x64 or AMD64 architecture.

For years, PC programmers used x86 assembly to write performance-critical code: Mode X [4], CPU skinning, collisions, software rasterizers… However, 32-bit PCs have slowly been replaced by 64-bit ones, and the underlying assembly code has changed.

Understanding assembly is necessary if you want to know why some things are slow and others are fast. It also helps when using intrinsics to optimize critical code paths, and when debugging optimized builds (e.g. -O3) where source-level debugging is no longer reliable.


Registers

Registers are small amounts of very fast memory with almost no latency (usually 1 CPU cycle). They can be seen as the internal memory of the CPU, holding the data that is directly processed by CPU instructions.

An x64 processor has 16 general-purpose registers (GPRs). A GPR is not dedicated to any specific type of data; instead, it holds operands as well as addresses at execution time.

x64 extends x86’s 8 GPRs to be 64-bit, and adds 8 new 64-bit registers. The 64-bit registers have names beginning with “r”, so for example the 64-bit extension of eax (32-bit) is called rax. The new registers are named r8 through r15.

General architecture (software.intel.com)

x64 registers include:

  • 16 64-bit general purpose registers (GPR), the first eight of which are labeled rax, rbx, rcx, rdx, rbp, rsi, rdi, and rsp. The second eight are named r8-r15.
  • 8 64-bit mmx  registers (MMX instruction set), overlaid on floating point registers fpr (x87 FPU)
  • 16 128-bit xmm vector registers (SSE instruction set)
  • (on more recent CPUs) 256-bit ymm registers (AVX instruction set), extending xmm registers
  • (on more recent CPUs) 512-bit zmm registers (AVX-512 instruction set), extending the ymm registers and increasing the register count to 32

Relationship of ZMM, YMM and XMM registers

Some GPRs are labeled differently for historical reasons: ax used to be the Accumulator register, cx the Counter register, dx the Data register, … Most of these registers have lost their special purpose in the modern instruction set, except rsp (Stack Pointer) and rbp (Base Pointer), which are reserved for stack management (although rbp can often be ‘optimized out’ and used as a general-purpose register – ‘omit frame pointer’ in Clang).

The lower bits of x64 registers can be accessed through sub-registers. For the first 8 registers, this is done using their legacy names. The newer registers (r8–r15) follow a single, simpler pattern (see the figure below).

Labeled scalar registers


Addressing

When an assembly instruction requires 2 operands, the first one is generally the destination and the second one the source. Each of them contains either the data to be processed or the address of the data. The 3 basic addressing modes are:

  • Immediate
    • mov eax, 4                      ;move 4 into eax
  • Register to register
    • mov eax, ecx                  ;move the content of ecx to eax
  • Indirect:
    • mov eax, [ebx]               ;move the 4 bytes (size of eax) at address ebx into eax
    • mov byte ptr [rcx], 5    ;move 5 into the byte at address rcx
    • mov edx, dword ptr [rcx+4*rax] ;move the dword at address rcx+4*rax into edx

“dword ptr” is called a size directive. A size directive tells the assembler which operand size to use when the size of the referenced memory region is ambiguous (e.g. mov [rcx], 5: should it write a byte? a word? a dword? …).

It can indicate: byte (8-bit), word (16-bit), dword (32-bit), qword (64-bit), xmmword (128-bit), ymmword (256-bit), zmmword (512-bit).

 


SIMD Instruction sets

A scalar implementation processes one pair of operands at a time. Vectorization is the process of converting an algorithm from operating on a single piece of data at a time to operating on multiple pieces of data at a time (we’ll see later on how to do that).

Modern processors have the ability to take advantage of Single Instruction on Multiple Data (SIMD) instruction sets (vector instructions) to process data in parallel.

SIMD processing

SIMD instruction sets available on x86 processors are:

  • Multimedia eXtension (MMX)
    • Legacy. Supports arithmetic operations on integers packed into 64-bit vector registers.
  • Streaming SIMD Extensions (SSE)
    • arithmetic operations on single-precision floats packed into 128-bit vector registers. Integer and double-precision support was added in SSE2.
  • Advanced Vector Extensions (AVX) – x64 only
    • added support for 256-bit vector registers
  • AVX-512 – x64 only
    • added support for 512-bit vector registers

Vector registers on x64 processors

Game engines usually spend 90% of their execution time running a small portion of the code base, mainly iterating over data and processing it. In such scenarios, SIMD can make a big difference. SSE instructions are commonly used to process sets of 4 floats packed into 128-bit vector registers, in parallel. One can see how fast and convenient this can be for 3D maths, provided the data is well organized in memory.

SSE is mostly geared toward vertical data representation (SoA) and processing, but generally speaking, the choice between Structure of Arrays (SoA) and Array of Structures (AoS) for best performance depends on access patterns.

  • AoS is probably the most natural layout and the easiest to write. It fits OOP paradigms.
  • AoS has better data locality if all the members are accessed together.
  • SoA exposes more vectorization opportunities (vertical processing).
  • SoA often uses less memory because padding is only between arrays.

 

// Array Of Structures
struct Sphere
{
  float x;
  float y;
  float z;
  double r;
};
Sphere* AoS;

Memory layout (struct aligned on 8 bytes, 4 bytes of padding between z and r):
------------------------------------------------------------------
| x | y | z | pad | r | x | y | z | pad | r | x | y | z | pad | r
------------------------------------------------------------------

// Structure Of Arrays
struct SoA
{
  float* x;
  float* y;
  float* z;
  double* r;
  size_t size;
};

Memory layout:
------------------------------------------------------------------
| x | x | x ..| pad | y | y | y ..| pad | z | z | z ..| pad | r..
------------------------------------------------------------------
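For instance, with the SoA layout above, 4 consecutive x (or y, z) values are contiguous in memory and can be loaded straight into an xmm register. A minimal sketch (hypothetical code, assuming the arrays are 16-byte aligned and size is a multiple of 4):

#include <cstddef>     // size_t
#include <xmmintrin.h> // SSE intrinsics

// Hypothetical example: translate all spheres along the x axis, 4 at a time.
void TranslateX(SoA& spheres, float dx)
{
  const __m128 offset = _mm_set1_ps(dx);       // (dx, dx, dx, dx)
  for (size_t i = 0; i < spheres.size; i += 4)
  {
    __m128 x = _mm_load_ps(spheres.x + i);     // load 4 contiguous x values
    x = _mm_add_ps(x, offset);                 // add dx to all 4 lanes
    _mm_store_ps(spheres.x + i, x);            // store them back
  }
}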

One final note on AVX, which is a natural extension of SSE and brings the vector register size up to 256 bits, meaning up to 8 floats can be packed and processed in parallel. While Intel processors have executed 256-bit operations natively since AVX was introduced, AMD processors can be problematic: early AVX-capable AMD CPUs, such as Bulldozer and Jaguar, decompose 256-bit operations into two 128-bit operations, driving the latency up compared to SSE.

In conclusion, it’s tricky to target AVX exclusively (except maybe for internal tools if your workstations are Intel based), since many AMD processors don’t execute it natively. On the other hand, SSE2 can be taken for granted on any x64 CPU (it’s part of the spec), so there is no real excuse not to use it.

 


Out-of-Order execution

In an Out-of-Order (OoO) CPU pipeline, when the execution of an instruction is delayed because the necessary input data is not ready, the CPU tries to find later instructions to execute first, if their input data is ready.

An instruction cycle (or fetch-decode-execute cycle) is the process by which a CPU retrieves an instruction from memory, determines what to do with it, and executes it. An Out-of-Order CPU pipeline instruction cycle can be summed up as follows:

  • Fetching/Decoding: the instruction is fetched from the L1I (instruction cache). It is then decoded into smaller operations called micro-operations, or µops.
  • Renaming: the CPU register set can cause execution stalls due to register and data dependencies. To help solve this problem and eliminate false dependencies, the CPU provides a set of unnamed internal registers, which are used for actual computations. Register renaming is the process of converting references to the architectural registers (logical) into unnamed register references (physical).
  • Reorder Buffer: it contains µops that are waiting to be executed, stored in order, as well as those that have already been executed, but not yet retired.
  • Scheduling: µops stored in the reorder buffer can be dispatched to parallel execution units in any order, taking into account data dependencies and availability. The result of a µop is written back to the reorder buffer along with the µop, until it is retired.
  • Retirement: the retirement unit continuously checks the status of µops in the reorder buffer, writes the results of executed µops back into architectural registers  (user visible), and removes the µops from the reorder buffer.

 

AMD Jaguar Processor architecture

Going back to the AMD Jaguar architecture [11][12], we can find all the blocks mentioned earlier. For the integer pipeline:

  • “Decode and Microcode ROMs”
    • = Fetch/Decode Unit
  • “Int Rename” and “Int PRF” (Physical Register File)
    • = Renaming Unit
    • Retire Control Unit (RCU), not visible here, handles register renaming and µops retirement.
  • Schedulers
    • Int Scheduler (ALUs)
      • can issue 1 µop per pipe (2 ALU execution units I0 and I1), out-of-order
    • AGU Scheduler (Load/Store)
      • can issue 1 µop per pipe (2 AGU execution units LAGU and SAGU), out-of-order

 

µops examples

Instruction                   µops
add reg, reg                  1: add
add reg, [mem]                2: load, add
addpd xmm, xmm                1: addpd
addpd xmm, [mem]              2: load, addpd

Looking at the AMD Jaguar section of Agner’s excellent Instruction tables [10], we can deduce what the execution pipeline looks like for the following code:

Code example 
mov eax, [mem1]  ; 1 - load
imul eax, 5      ; 2 - mul
add eax, [mem2]  ; 3 - load, add
mov [mem3], eax  ; 4 - store

Execution pipe (Jaguar)
  I0   |  I1   |  LAGU  |  SAGU   |  FP0  |  FP1
       |       | 1-load |         |       |
 2-mul |       | 3-load |         |       |
       | 3-add |        |         |       |
       |       |        | 4-store |       |

In this example, we can see that breaking instructions into µops gives the CPU opportunities to take advantage of the parallel execution units, partially or totally ‘hiding’ the latency of an instruction (3-load and 2-mul are executed in parallel, on 2 different execution units).

It is not always possible though. The dependency chain between 2-mul, 3-add and 4-store prevents the CPU from reordering those µops (4-store needs the result of 3-add, and 3-add needs the result of 2-mul). Avoiding long dependency chains is key to using parallel execution units efficiently.

 

 


Visual Studio Options

I will use msvc++ 14.0 (VS2015) and Clang to illustrate compiler-generated assembly. I strongly encourage programmers to do the same and to get used to comparing different compilers. By doing so, they will get a better idea of how all the pieces connect together, and can form their own opinion on each compiler’s code quality.

A few things worth knowing:

  • The Show Symbol Names option displays local variable names and function names in disassembly views, instead of instruction addresses or stack-relative addresses.

Show Symbol Names

  • Make the assembly easier to read by removing some checks:  
    • Project settings > C/C++ > Code Generation > Basic Runtime Checks to Default. 
  • Output a .asm file:
    • Project settings > C/C++ > Output Files > Assembler Output to Assembly With Source Code.
  • Frame-Pointer omission: tells the compiler not to use ebp for stack management 
    • /Oy (x86 only, in Clang: -fomit-frame-pointer, works in x64)

 


Basic disassembly examples 

This section goes through very simple C++ code snippets and their corresponding disassembly. The assembly has been reorganized and fully commented to help beginners find their way, but I still encourage programmers to check [17] if they have doubts about what an instruction does.

Function prologues and epilogues have been removed for simplicity and won’t be discussed here. Part 3 of this series will come back to this and explain the details of stack management and calling conventions.

Note: local variables live on the stack. For example, mov dword ptr [rbp + 4], 0Ah ; int b = 10 means the local variable b is placed on the stack (pointed to by rbp) at offset 4, and initialized to 0Ah, or 10 in decimal.


Single-precision floating-point arithmetic

Floating-point arithmetic can be performed either with the x87 FPU (80 bits of precision, scalar) or with SSE (32 or 64 bits of precision, vectorized). On x64, the SSE2 instruction set is always supported and is the default choice for floating-point arithmetic [7].

Simple floating-point arithmetic, using SSE. msvc++
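The C++ source is only visible in the screenshot; a minimal reconstruction consistent with the assembly walked through below (the initial values are placeholders) would be:

int main()
{
  float x = 1.0f;  // placeholder values: the actual constants live at [adr]
  float y = 2.0f;
  float z = 3.0f;
  float result = x * x + y * y + z * z; // squared length
  return 0;
}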

Initializations

  • movss xmm0, dword ptr [adr]             ; loads the floating-point value located at adr into xmm0
  • movss dword ptr [rbp], xmm0            ; and stores it on the stack (float x)
  • …                                                                      ; same thing for y and z

Computes x*x

  • movss xmm0, dword ptr [rbp]            ; loads scalar x into xmm0
  • mulss xmm0, dword ptr [rbp]             ; multiplies xmm0 (=x) by x

Computes y*y and adds it to x*x

  • movss xmm1, dword ptr [rbp+4]        ; loads scalar y into xmm1
  • mulss xmm1, dword ptr [rbp+4]         ; multiplies xmm1 (=y) by y
  • addss xmm0, xmm1                                ; adds xmm1 (y*y) to xmm0 (x*x)

Computes z*z and adds it to x*x + y*y

  • movss xmm1, dword ptr [rbp+8]        ; loads scalar z into xmm1
  • mulss xmm1, dword ptr [rbp+8]         ; multiplies xmm1 (=z) by z
  • addss xmm0, xmm1                               ; adds xmm1 (z*z) to xmm0 (x*x + y*y)

Stores the final result

  • movss dword ptr [rbp+0Ch], xmm0    ; stores xmm0 into result
  • xor eax, eax                                                  ; eax = 0. eax contains the main()’s return value

In this example, xmm registers are used to hold a single floating-point value. With SSE, it is possible to work on single or multiple values, of different data types. For example, let’s consider the SSE add instruction:

  • addss xmm0, xmm1   ; each reg holds 1 scalar single-precision floating-point value
  • addps xmm0, xmm1  ; each reg holds 4 packed single-precision floating-point values
  • addsd xmm0, xmm1  ; each reg holds 1 scalar double-precision floating-point value
  • addpd xmm0, xmm1 ; each reg holds 2 packed double-precision floating-point values
  • paddd xmm0, xmm1 ; each reg holds 4 packed doubleword (32-bit integer) values

 


Branching

Example of branch. msvc++
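The C++ source is only visible in the screenshot; based on the constants and the jge below, it is essentially the following:

int main()
{
  int a = 5;
  int b = 10;
  int result = 0;
  if (a < b)      // compiled as cmp + jge to the 'else' block
    result = a;
  else
    result = b;
  return 0;
}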

Initializations

  • mov dword ptr [rbp], 5                                  ; stores 5 on the stack (int a)
  • mov dword ptr [rbp+4], 0Ah                       ; stores 10 on the stack (int b)
  • mov dword ptr [rbp+8], 0                             ; stores 0 on the stack (int result)

Condition

  • mov eax, dword ptr [rbp+4]                        ; loads b into eax
  • cmp dword ptr [rbp], eax                             ; compares a to eax (b)
  • jge @ECF81536                                                 ; jumps if a is greater than or equal to b

‘then’ result = a

  • mov eax, dword ptr [rbp]                             ; loads a into eax
  • mov dword ptr [rbp+8], eax                        ; stores eax on the stack (result)
  • jmp @ECF8153C                                               ; jumps to ECF8153C

‘else’ result = b

  • (ECF81536) mov eax, dword ptr [rbp+4]   ; loads b into eax
  • mov dword ptr [rbp+8], eax                                ; stores eax on the stack (result)
  • (ECF8153C) xor eax, eax                                  ; eax = 0. eax contains the main()’s return value

The cmp instruction compares the first source operand to the second and sets the status flags in the RFLAGS register according to the result [9]. The (R)FLAGS register is the status register of x86 CPUs and contains the current state of the processor. cmp is typically used in conjunction with a conditional jump (e.g. jge), and the condition codes used by conditional jumps are based on the result of the cmp (RFLAGS condition codes).

 


Integer arithmetic and ‘for’ loop

In assembly, loops are basically represented as a series of conditional jumps (i.e. if … goto).

Integer arithmetic and ‘for’ loop. msvc++
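The C++ source is only visible in the screenshot; a reconstruction consistent with the disassembly below would be:

int main()
{
  int sum = 0;
  int k = 10;
  for (int i = 0; i < k; ++i) // exit test compiled as cmp + jge
    sum += i;
  return 0;
}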

Initializations

  • mov dword ptr [rbp], 0                                 ; stores 0 on the stack (int sum)
  • mov dword ptr [k], 0Ah                                ; stores 10 on the stack (int k). The operand is shown as [k] rather than [rbp+4] because Show Symbol Names is enabled
  • mov dword ptr [rbp+8], 0                            ; stores 0 on the stack (int i) for iterating on the loop
  • jmp main+30h                                                 ; jumps to main+30h

Portion of code responsible for incrementing i

  • (main+28h) mov eax, dword ptr [rbp+8]  ; loads i into eax
  • inc eax                                                                   ; increments it
  • mov dword ptr [rbp+8], eax                           ; and stores it back into the stack

Portion of code responsible for testing the exit condition (i >= k)

  • (main+30h) mov eax, dword ptr [k]           ; loads k from the stack into eax
  • cmp dword ptr [rbp+8], eax                           ; compares i to eax (= k) 
  • jge main+47h                                                      ; jumps (ends the loop) if i is greater than or equal to k

The ‘actual work’: sum+=i

  • mov eax, dword ptr [rbp+8]                       ; loads i into eax
  • mov ecx, dword ptr [rbp]                            ; loads sum into ecx
  • add ecx, eax                                                     ; adds eax to ecx (ecx = sum + i)
  • mov eax, ecx                                                    ; moves ecx into eax
  • mov dword ptr [rbp], eax                            ; stores eax (sum) back into the stack
  • jmp main+28h                                                 ; jumps and processes the next loop iteration
  • (main+47h) xor eax, eax                             ; eax = 0. eax contains the main()’s return value

 


SSE Intrinsics

Here is a typical example of vertical processing, where SSE allows the programmer to perform the same operation on 4 values in parallel (in this case, a dot product). We can see how intrinsics map directly to their assembly equivalents:

  • _mm_mul_ps maps to mulps
  • _mm_load_ps maps to movaps
  • _mm_add_ps maps to addps
  • _mm_store_ps maps to movaps

SSE Intrinsics, msvc++ in Release
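The C++ source itself is only shown in the screenshot; based on the walkthrough below and on the AVX port further down, a hypothetical reconstruction (the function name and signature are assumptions) could look like this:

#include <xmmintrin.h> // SSE intrinsics

// Hypothetical reconstruction: dot product of every vertex (xs, ys, zs, ws)
// with a constant vector A, 4 vertices per iteration.
// Assumes 16-byte-aligned arrays and a vertexCount that is a multiple of 4.
void DotProducts(const float* xs, const float* ys, const float* zs, const float* ws,
                 const float* A, float* results, int vertexCount)
{
  const __m128 Ax = _mm_load_ps1(A + 0); // broadcast A.x into all 4 lanes
  const __m128 Ay = _mm_load_ps1(A + 1);
  const __m128 Az = _mm_load_ps1(A + 2);
  const __m128 Aw = _mm_load_ps1(A + 3);

  for (int i = 0; i < vertexCount; i += 4) // 4 floats (128-bit) per iteration
  {
    __m128 x4 = _mm_load_ps(xs + i);
    __m128 y4 = _mm_load_ps(ys + i);
    __m128 z4 = _mm_load_ps(zs + i);
    __m128 w4 = _mm_load_ps(ws + i);

    __m128 dx = _mm_mul_ps(Ax, x4);
    __m128 dy = _mm_mul_ps(Ay, y4);
    __m128 dz = _mm_mul_ps(Az, z4);
    __m128 dw = _mm_mul_ps(Aw, w4);

    __m128 a0   = _mm_add_ps(dx, dy);
    __m128 a1   = _mm_add_ps(dz, dw);
    __m128 dots = _mm_add_ps(a0, a1);

    _mm_store_ps(results + i, dots); // 4 dot products at once
  }
}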

Initializations (an xmmword is 128 bits wide, i.e. 4 dwords)

  • (main+340h) movaps xmm1, xmmword ptr [rdx+rax]   ; loads a 128-bit xmmword (4 floats) at xs+i into xmm1
  • movaps xmm3, xmmword ptr [rax]                    ; loads 4 floats at ys+i into xmm3
  • movaps xmm0, xmmword ptr [r8+rax]            ; loads 4 floats at zs+i into xmm0
  • movaps xmm2, xmmword ptr [r9+rax]            ; loads 4 floats at ws+i into xmm2

Computes dot(v[i], A) = xi * Ax + yi * Ay + zi * Az + wi * Aw, 4 vertices at a time

  • mulps xmm1, xmm4                     ; xmm1 *= xmm4    xn.Ax  , n [0..3]
  • mulps xmm3, xmm5                      ; xmm3 *= xmm5    yn.Ay  , n [0..3]
  • mulps xmm0, xmm6                     ; xmm0 *= xmm6    zn.Az  , n [0..3]
  • mulps xmm2, xmm7                     ; xmm2 *= xmm7    wn.Aw, n [0..3]
  • addps xmm3, xmm1                       ; xmm3 += xmm1   xn.Ax + yn.Ay
  • addps xmm2, xmm0                      ; xmm2 += xmm0   zn.Az + wn.Aw
  • addps xmm2, xmm3                      ; xmm2 += xmm3   xn.Ax + yn.Ay + zn.Az + wn.Aw

Stores the results at the memory address (results + offset), and loop

  • movaps xmmword ptr [r10 + rax], xmm2 ; stores a 128-bit xmmword (4 floats) to the address pointed by r10+rax
  • add rax, 10h                                      ; adds 16 to rax (current offset = size of 4 floats)
  • sub r11,1                                            ; r11-- (remaining loop iterations)
  • jne main+34h                                    ; jumps and processes the next loop iteration

Porting this code to AVX (256-bit, or 8 single-precision floats) is very straightforward. Here is what it looks like:

__m256 Ax = _mm256_broadcast_ss(A);
...
for (int i = 0; i < vertexCount; i += 8) // 8 floats (256-bit)
{
   __m256 x4 = _mm256_load_ps(xs + i);
   ...
   __m256 dx = _mm256_mul_ps(Ax, x4);
   ...
   __m256 a0   = _mm256_add_ps(dx, dy);
   ...
   _mm256_store_ps(results + i, dots);
}

 

 


Switch case

Switch case. msvc++
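The C++ source is only shown in the screenshot; based on the constants in the walkthrough below, it is essentially the following (the bodies of cases 2, 3 and default are not visible in the excerpt):

int main(int argc, char** argv)
{
  int val = 0;
  switch (argc)
  {
    case 0:  val = 1; break;
    case 1:  val = 3; break;
    case 2:  /* not shown in the walkthrough */ break;
    case 3:  /* not shown in the walkthrough */ break;
    default: /* not shown in the walkthrough */ break;
  }
  return 0;
}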

Initializations

  • mov dword ptr [rbp], 0                       ; stores 0 on the stack (int val)
  • mov eax, dword ptr [argc]                  ; loads argc into eax
  • mov dword ptr [rbp+44h], eax         ; stores it on the stack

Conditions

  • cmp dword ptr [rbp+44h], 0              ; compares argc to 0
  • je main+38h                                            ; if argc == 0, jumps to main+38h (case 0)
  • cmp dword ptr [rbp+44h], 1              ; compares argc to 1
  • je main+41h                                            ; if argc == 1, jumps to main+41h (case 1)
  • cmp dword ptr [rbp+44h], 2              ; compares argc to 2
  • je main+4Ah                                           ; if argc == 2, jumps to main+4Ah (case 2)
  • cmp dword ptr [rbp+44h], 3              ; compares argc to 3
  • je main+53h                                            ; if argc == 3, jumps to main+53h (case 3)
  • jmp main+5Ch                                       ; jumps to main+5Ch (default)

Case 0

  • (main+38h) mov dword ptr [rbp], 1 ; stores 1 on the stack (val)
  • jmp main+63h     ; jumps to main+63h, and exits the switch case

Case 1

  • (main+41h) mov dword ptr [rbp], 3  ; stores 3 on the stack (val)
  • jmp main+63h     ; jumps to main+63h, and exits the switch case

  • (main+63h) xor eax, eax           ; eax = 0. eax contains the main()’s return value

The assembly generated in this case corresponds to a series of branches. The result would be very similar if the switch were replaced by a series of if-else statements in the C++ code. In some scenarios, and depending on the compiler, the branches might be optimized into a lookup table of jump addresses.

 

 

Special Thanks

Mickael Gilabert (@mickaelgilabert) and Gabriel Lassonde for their inputs 🙂

 

References
