As rendering programmers, we live in a world where low-level considerations are mandatory in order to fit a GPU frame into 30 ms. Techniques and new rendering passes are designed from the ground up with bandwidth (geometry attributes, texture fetches, export, …), GPR pressure, texture cache behavior, latency hiding and ROP throughput in mind, to name a few.
It used to be quite the thing in the CPU world as well, and it is telling that we are now porting old CPU tricks to recent GPUs in order to accelerate ALU operations (Low Level Optimizations for AMD GCN, Quake's fast inverse square root, …).
Recently, and especially since the move to 64-bit, I have seen an increasing quantity of unoptimized code being produced, as if all the knowledge gathered over the years had suddenly been buried.
Old tricks such as the fast inverse square root may well be counterproductive on today's processors, yes, but programmers shouldn't forget about low-level considerations and hope for the compiler to solve all the problems. It won't.
This post is not an exhaustive dive into the hardcore technical details of the hardware. It only serves as an introduction, a reminder, to the basic principles of writing efficient code for the CPU, and to "show that low-level thinking is still relevant today", even when it comes to CPUs I might add.
This post is the first part of a series of 2 or 3 posts that will introduce programmers to memory caching, vector programming, reading and understanding assembly code, and writing compiler-friendly code.
Why bother?
Mind the gap
In the 80s, the memory bus ran at a frequency similar to the CPU's, with almost zero latency. Then improvements in CPU speeds followed Moore's law, increasing performance exponentially. The performance of RAM chips, on the other hand, didn't increase proportionally, and memory quickly became a bottleneck. This is not because faster memory cannot be built. It can, but it is not economical.
Processor-Memory speed evolution
In order to reduce this bottleneck, CPU designers added a very small amount of this very fast, very expensive memory between the CPU and main memory: the cache memory.
The idea is that over a short amount of time, there is a good chance the same code or data, or code and data nearby, gets reused:
- temporal locality: code loops, so the same instructions and data tend to be used over and over again within a short period of time
- spatial locality: data used close together in time tends to be laid out close together in memory (e.g. walking an array sequentially), so fetching one piece of data brings its neighbors into the cache as well
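To make spatial locality concrete, here is a minimal sketch (the grid name and size are illustrative, not from the article): the two functions compute the same sum, but one walks memory in layout order while the other strides across it, touching a new cache line on almost every access.

```cpp
#include <cstddef>

constexpr std::size_t N = 512;
static float grid[N][N]; // C arrays are row-major: grid[y][0..N-1] is contiguous

// Good spatial locality: accesses follow the memory layout, so every
// 64-byte cache line fetched is fully consumed before moving on.
float sum_row_major() {
    float sum = 0.0f;
    for (std::size_t y = 0; y < N; ++y)
        for (std::size_t x = 0; x < N; ++x)
            sum += grid[y][x];
    return sum;
}

// Poor spatial locality: consecutive accesses are N * sizeof(float) bytes
// apart, so almost every iteration pulls in a fresh cache line and uses
// only 4 bytes of it.
float sum_column_major() {
    float sum = 0.0f;
    for (std::size_t x = 0; x < N; ++x)
        for (std::size_t y = 0; y < N; ++y)
            sum += grid[y][x];
    return sum;
}
```

Both loops do the exact same arithmetic; only the access pattern differs, which is precisely what the cache rewards or punishes.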
The CPU cache is a sophisticated acceleration mechanism, but it cannot work optimally without some help from the programmer. Unfortunately, neither the cost of a memory access nor the structure of CPU caches is well understood by most programmers.
Data oriented design
In our case, we are mostly interested in game engines. Game engines have to handle an increasingly large amount of data, transform it, and ultimately render it on screen, in real time. In this context, and in order to solve problems efficiently, the programmer has to understand the data being processed and the hardware being targeted. Hence the necessity to adopt a data-oriented design (DoD).
Can't the compiler do it for me?
Simple addition. Left: C++ code. Right: generated assembly
Let's consider the trivial example above, on an AMD Jaguar CPU (close to what's found in consoles):
- a load operation (around 200 cycles, if not cached)
- the actual work: inc eax (1 cycle)
- a store operation (~3 cycles, same cache line)
We can see that even in such a simple example, most of the CPU time is spent waiting for data, and this doesn't get better with more complex routines, unless the programmer is aware of the underlying architecture being targeted.
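Since the figure itself isn't reproduced here, the following is a plausible reconstruction of such a trivial example (the exact snippet may have differed): an increment through a pointer, with typical x86-64 output annotated using the costs quoted above.

```cpp
// Hypothetical version of the "simple addition" example from the figure.
void increment(int* value) {
    *value += 1;
}

// Typical x86-64 assembly for this function (a compiler at -O2 may fold
// it into a single `add dword ptr [rdi], 1`, but the memory traffic and
// the cost structure are the same):
//
//   mov eax, dword ptr [rdi]   ; load  (~200 cycles if not cached)
//   inc eax                    ; the actual work (1 cycle)
//   mov dword ptr [rdi], eax   ; store (~3 cycles, same cache line)
//   ret
```

One cycle of useful work, potentially surrounded by hundreds of cycles of waiting: the instruction stream is almost never the bottleneck; the memory is.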
In short, compilers:
- don't have the big picture: it is very hard for them to predict how data will be organized and accessed at runtime
- can be quite good at systematically optimizing arithmetic operations, but that is often only the tip of the iceberg
The room a compiler has to maneuver is actually quite small when it comes to memory access optimization. Only the programmer knows the context and the piece of software being written. It is therefore critical to understand the data flow, and to adopt a data-oriented approach in order to get the most out of modern CPUs.
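A classic illustration of how little room the compiler has (a minimal sketch, with illustrative names): when two pointers may alias, the compiler must assume a store through one can change what the other points at, and is forced to generate conservative code.

```cpp
// The compiler cannot prove that dst and src don't overlap, so it must
// re-load *src on every iteration: dst[i] might have just overwritten it.
void scale_all(float* dst, const float* src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = *src * 2.0f;
}

// Only the programmer knows the buffers are distinct. Telling the compiler
// (here via the common `__restrict` extension) lets it hoist the load out
// of the loop and keep the value in a register.
void scale_all_restrict(float* __restrict dst,
                        const float* __restrict src, int n) {
    for (int i = 0; i < n; ++i)
        dst[i] = *src * 2.0f;
}
```

The two functions are identical as far as the language semantics the programmer intended, yet only the second one gives the compiler enough context to optimize the memory accesses.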
The ugly truth: OOP vs DoD
Impact of memory access patterns on performance (Mike Acton, GDC 2015)
Object Oriented Programming (OOP) is the dominant paradigm being taught in schools these days. It encourages the programmer to think in terms of real-world objects and their relationship in order to solve problems.
A class usually encapsulates code and data, so that an object carries all of its own data. By encouraging Array of Structures (AoS) layouts, and arrays of *pointers to* structures/objects, OOP violates the spatial locality principle that cache memory relies on to speed up RAM access. Remember the performance gap between CPUs and memory?
With modern hardware, excessive encapsulation is bad.
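The contrast can be sketched with a hypothetical particle system (the struct names and fields are illustrative, not from the article): the AoS update drags unused bytes through the cache, while the SoA layout streams exactly the data the loop needs.

```cpp
#include <vector>

// AoS: each particle packs fields the position update never touches,
// so every cache line loaded carries mostly wasted bytes.
struct ParticleAoS {
    float x, y, z;       // position
    float r, g, b, a;    // color
    float lifetime;
};

void update_aos(std::vector<ParticleAoS>& ps, float dy) {
    for (auto& p : ps)
        p.y += dy;       // strides 32 bytes to touch 4 useful ones
}

// SoA (the DoD-friendly layout): each field lives in its own contiguous
// array, so cache lines are filled entirely with data the loop uses.
struct ParticlesSoA {
    std::vector<float> x, y, z;
    std::vector<float> r, g, b, a;
    std::vector<float> lifetime;
};

void update_soa(ParticlesSoA& ps, float dy) {
    for (float& y : ps.y)
        y += dy;         // dense, sequential, trivially vectorizable
}
```

Same transformation, same results; only the memory layout changes, and with it the number of cache lines the CPU has to fetch per particle updated.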
The main goal of this post, by focusing on Data Oriented Design (DoD), is to shift the focus of software development from worrying about code to understanding data transformations, and to push back against a programming culture and status quo engendered by OOP advocates.
I will end this section by quoting Mike Acton and his "3 big lies":
Lie #1: Software is a platform
- You need to understand the hardware you work on
Lie #2: Code should be designed around a model of the world
- Code needs to be designed around the model of the data
Lie #3: Code is more important than data
- Memory being the bottleneck, data is clearly the most important thing
Part 2 will cover basics of x86 hardware, stay tuned!
References
- CppCon 2014: Mike Acton, "Data-Oriented Design and C++"
- Agner Fog, "Lists of instruction latencies, throughputs and micro-operation breakdowns" (agner.org)
- AMD's Jaguar Microarchitecture: Memory Hierarchy
- AMD Athlon 5350 APU and AM1 Platform Review – Performance – System Memory