Why your game is slow - Part 2

EverCursed

Newbie
May 2, 2019
Last time we discussed some basics of caches. Today let's move a little further with it.

Also prepare for a wall of text. Sorry.

L1 & L2 Cache

In the previous part, I showed a table of memory access latencies. Some of you may have wondered, what's this L1 L2 cache stuff? Isn't cache just cache?

Well it is time for us to fill out our understanding of CPU architecture.

The first thing to understand is how large the fastest cache is. If I look on my computer right now, it seems to be... 32kbytes... Pretty tiny, isn't it? The array we dealt with before was 4mbytes. And that's not the only problem. Another problem is that you may be sharing it with some other programs, like... the operating system.

Yes, even the operating system may be getting in the way by using your cache space. For example, one thing the operating system does is schedule which processes use the CPU and how much time they get to do their work. And to do this, it will likely reference some memory, getting that useless data into our precious cache space. So what do we do?

The obvious solution is to stuff more cache into our CPU. 32kbytes is too small. So here comes the L2 cache. It's bigger than the L1 cache, but unfortunately also slower. Still, it's much faster than main memory! We can work with this. Let's see how large it is... 256kbytes. Our array still can't fit in it.

It is now important to understand how memory operations with caches work. Reading is pretty simple: you read a memory address, and the data gets put into the cache. But which one? Does it go into the faster cache, since we are using that data right now, or does it go into the slower one and wait to see if we reference that memory again before being promoted up the cache hierarchy?

Well I'm happy to tell you that it gets put into both. Your data now exists in the L1 and L2 caches, as well as in main memory, taking up triple the space. But this does have a positive outcome.

Let's say we reference a memory location and pay the price of a full memory reference. We then do some other work. During this time, the L1 cache has likely been busy and decided to discard the data we used a while ago. This is exactly when we decide to use that data again, only to notice that it's no longer in the L1 cache. Do we need to pay the price of a full memory reference again?

But wait! We have the L2 cache! Since it's larger, it likely hasn't needed to flush out our data yet, so we can ask it... Aha! There it is, our data was in the L2 cache. So we just need to pay the price of an L2 cache reference, only 7ns. Compared to the 100ns we would waste on a main memory reference, that's much better.

L3 Cache?

Depending on the architecture of your computer, your CPU may also have an L3 cache. It follows a similar pattern: the cache is larger and slower. However there is a difference between it and the other caches, and that is what we will talk about next.

The L3 cache on my computer is 6mbytes.

L1 Cache... x2

As it turns out, those 32kbytes of L1 cache are still too small. And part of the problem is that we don't actually have 32kbytes of it. Not only can other applications use that space, we are also spending space on one more thing: the instructions our CPU will be executing. So far we have only talked about memory holding our program's data, but we cannot ignore that the program's instructions are also data, and must be stored in memory while the program runs. And if we had to reference main memory to fetch our instructions, we would run into the same problem as before: our program would be crawling at a disabled snail's pace.

So naturally our solution to this is to stick another L1 cache in there and segregate the two: one will be the instruction cache, or I-cache, and the other will be the data cache, or D-cache. This has a hidden benefit: when our D-cache is working overtime, your application doesn't get stalled trying to fetch the next instruction. This is actually even more important than it seems, and I may talk about it some other time. It has to do with how the CPU works internally.

So to sum up, this is what our caches look like:
  • 32kbytes I-cache (L1 cache)
  • 32kbytes D-cache (L1 cache)
  • 256kbytes L2 cache (stores both instructions and data)
  • 6mbytes L3 cache, if present (stores both instructions and data)
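
If you want to actually see these levels on your own machine, here's a rough sketch of one way to do it: chase pointers through buffers of various sizes, in a random order so the prefetcher can't guess where you're going next, and time the average access. This is my own little toy, not a rigorous benchmark, and the buffer sizes in main() are just guesses based on the cache sizes above, so adjust them for your CPU and compile with something like -O2.

C:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* chase pointers through a buffer of `bytes` bytes and return the
 * average time per access in nanoseconds */
static double chase_ns_per_hop(size_t bytes, size_t hops)
{
    size_t n = bytes / sizeof(size_t);
    size_t *next  = malloc(n * sizeof *next);
    size_t *order = malloc(n * sizeof *order);

    /* shuffle the indices, then link them into one big cycle so every load
     * depends on the previous one and the prefetcher can't help */
    for (size_t i = 0; i < n; ++i)
        order[i] = i;
    for (size_t i = n - 1; i > 0; --i) {
        size_t j = (size_t)rand() % (i + 1);
        size_t t = order[i]; order[i] = order[j]; order[j] = t;
    }
    for (size_t i = 0; i + 1 < n; ++i)
        next[order[i]] = order[i + 1];
    next[order[n - 1]] = order[0];

    struct timespec t0, t1;
    size_t pos = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < hops; ++i)
        pos = next[pos];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    volatile size_t sink = pos;   /* keep the loop from being optimized away */
    (void)sink;

    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    free(order);
    free(next);
    return ns / hops;
}

int main(void)
{
    /* guesses: 16kb fits in L1, 128kb in L2, 4mb in L3, 64mb only in RAM */
    size_t sizes[] = { 16u << 10, 128u << 10, 4u << 20, 64u << 20 };
    for (int i = 0; i < 4; ++i)
        printf("%8zu bytes: %5.1f ns per access\n",
               sizes[i], chase_ns_per_hop(sizes[i], 10 * 1000 * 1000));
    return 0;
}
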
So now that we have a slightly more complex, but much more efficient hierarchy of caches, it's time to bring in the last important piece of modern computing: multi-core CPUs and multi-threading.

Multiple Core CPUs

I hope you have noticed that for the past decade CPU frequencies have pretty much stopped increasing. They still go up, but by negligible amounts. The way CPUs get faster now is by packing in more cores.


For the rest of this post I will assume having a 4 core, 8 thread CPU.

So now we have a CPU that can do 4 simultaneous memory accesses. Suddenly our effective cache size becomes a quarter of what it was. What's the solution? You guessed it! Stuff more caches in there!

More specifically, the L1 and L2 caches will now be bundled together with a core. So each core will have an I-cache, a D-cache, and an L2 cache.

Our current hierarchy for our quad core machine looks like this:
  • 4x 32kbytes I-cache (L1 cache)
  • 4x 32kbytes D-cache (L1 cache)
  • 4x 256kbytes L2 cache (stores both instructions and data)
  • 1x 6mbytes L3 cache (stores both instructions and data and is shared between 4 cores)
  • Main memory (painfully slow and very large, but we don't actually care)
I will remind you that the previous structure still applies: the data in an L1 cache is also in the L2 cache of the same core, in the L3 cache, and in main memory.

I'm happy to say that by this point we have pretty much finished catching up to modern cache architecture. There are a few other caches that I haven't mentioned, like the TLB, but they are not too important for us. If you optimize your code for I-cache and D-cache usage, you will also by extension optimize the TLB usage.

Well that was boring... What now?

And now come the problems!

Since we now have 4 independent cores accessing and modifying memory, how do you ensure that the memory is consistent between all cores? If one core writes to a loaded cache line, but that cache line is also in another core's D-cache, what happens?

One thing we absolutely cannot do is update all the caches (and memory) every time a write is performed. That would mean paying for a trip to main memory on every single write, which would make the caches useless.

So what we do is delay. We hold off on updating main memory for as long as we can, and only update when another core needs to access that cache line, or when the cache is about to discard it.

But our cache structure has made updating very difficult. We now have to update not only main memory, but also the three other L2 caches, the three other D-caches (I'm ignoring things like self-modifying code, which would force us to update the I-caches too), and the L3 cache.

Basically, our new data has to propagate down the cache hierarchy (remember, whatever is in an L1 cache also has a copy in L2, L3, and main memory), updating all the caches below it, and then propagate back up the hierarchy to update all the remaining caches in the CPU. We don't need to update main memory yet, as the L3 cache is shared between all cores.

If you think this might be slow, you will be correct. And it causes the problem described below.

False Sharing

Let's finally look at an example of this in action. This example is partially stolen from Wikipedia.

First we need to establish our base case. This is the most straightforward way to do what we want, and it is what we will be testing our improvements against.
C:
#include <pthread.h>

// for those who don't know C:
//     this is the same as a class with all fields being public
struct data {
    int a;
    int b;
    int c;
    int d;
};

static struct data d;

// the functions use the signature pthread expects, so the exact same code
// can be reused in the threaded version below
static void *inc_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i)
        d.a++;
    return NULL;
}

static void *inc_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i)
        d.b++;
    return NULL;
}

int main(void)
{
    // call each function sequentially. each increments its variable 10,000,000 times
    inc_a(NULL); inc_b(NULL);

    return 0;
}
This is pretty simple. We call the two functions in order, and each loops through ten million increments of its variable. How long does this take to run? Well on my weak little laptop here, this is what perf shows:

Code:
Base case:

Performance counter stats for './false_sharing' (100 runs):

             3,797      cache-misses              #   22.942 % of all cache refs      ( +-  4.17% )  (74.08%)
            16,552      cache-references          #    0.300 M/sec                    ( +-  2.21% )  (74.67%)
       139,497,948      cycles                    #    2.529 GHz                      ( +-  0.80% )  (51.28%)
       110,339,971      instructions              #    0.79  insn per cycle           ( +-  0.79% )  (75.50%)
         55.164324      task-clock (msec)         #    0.840 CPUs utilized            ( +-  0.94% )

       0.065641932 seconds time elapsed                                         ( +-  0.81% )
Now let's try to make this multithreaded. The simplest way to do this is to run each function on its own thread. The code now looks like this:

C:
#include <pthread.h>

// for those who don't know C:
//     this is the same as a class with all fields being public
struct data {
    int a;
    int b;
    int c;
    int d;
};

static struct data d;

static void *inc_a(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i)
        d.a++;
    return NULL;
}

static void *inc_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i)
        d.b++;
    return NULL;
}

int main(void)
{
    pthread_t thread1, thread2;

    // create both threads so they run concurrently
    pthread_create(&thread1, NULL, inc_a, NULL);
    pthread_create(&thread2, NULL, inc_b, NULL);

    // wait for both threads to finish
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);

    return 0;
}
Okay, seems reasonable. What does perf show?

Code:
The bad case:

Performance counter stats for './false_sharing_bad' (100 runs):

         3,738,423      cache-misses              #   29.720 % of all cache refs      ( +-  0.27% )  (74.83%)
        12,578,922      cache-references          #   23.228 M/sec                    ( +-  0.43% )  (75.03%)
     1,380,577,494      cycles                    #    2.549 GHz                      ( +-  0.34% )  (50.15%)
       118,122,393      instructions              #    0.09  insn per cycle           ( +-  0.17% )  (75.07%)
        541.531791      task-clock (msec)         #    1.919 CPUs utilized            ( +-  0.37% )

       0.282131761 seconds time elapsed                                          ( +-  0.35% )
...Oh. It's only a bit more than 4 times slower. That's exactly what we want from a multi-threaded application.

Remember that we are not updating the same variable. This is why it's usually very difficult to see the problem.

The problem, of course, comes from the fact that we are writing to a single cache line. Our data struct is composed of 4 ints, a total of 16 bytes, so the entire struct fits comfortably in one 64-byte cache line. Once a thread writes to that cache line, the line becomes invalid for all other cores. The CPU is forced to bounce the line between the caches to restore consistency before another thread can perform its write.

How do we fix this? You could make each thread just return its accumulated value, and that would work, but it would also change the behavior of our program. If you want to keep everything the same, all you need to do is inflate each int so that it takes up 64 bytes, a whole cache line. So change the struct to this:

C:
// each counter gets 15 ints of padding after it, so no two counters
// ever share a cache line
struct data {
    int a;
    int _padding1[15];
    int b;
    int _padding2[15];
    int c;
    int _padding3[15];
    int d;
    int _padding4[15];
};
Now each integer is followed by 15 integers worth of space that is simply never used. Each int is 4 bytes, so one int plus 15 ints of padding makes 16 ints, or 64 bytes, per counter. In other words, each integer and its padding now take up an entire cache line. Let's see how it runs now.

Code:
The good case, integers padded:

Performance counter stats for './false_sharing_good' (100 runs):

             7,069      cache-misses              #   26.806 % of all cache refs      ( +-  3.57% )  (70.56%)
            26,370      cache-references          #    0.530 M/sec                    ( +-  1.97% )  (76.15%)
       117,618,574      cycles                    #    2.362 GHz                      ( +-  0.78% )  (53.42%)
        93,215,542      instructions              #    0.79  insn per cycle           ( +-  0.80% )  (75.16%)
         49.793880      task-clock (msec)         #    1.309 CPUs utilized            ( +-  1.07% )

       0.038026063 seconds time elapsed                                          ( +-  1.03% )
There we go! That looks better. Almost two times faster than our base case and 7-8 times faster than our bad multi-threaded version. Very nice.
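
By the way, if counting out padding ints by hand feels silly, C11 lets the compiler do it for you with an alignment specifier. Both gcc and clang accept this on struct members; it's just a different spelling of the same fix, still assuming 64-byte cache lines:

C:
#include <stdalign.h>

// force each field onto its own 64-byte boundary; the compiler inserts
// the padding for us (64 is an assumption, check your CPU's line size)
struct data {
    alignas(64) int a;
    alignas(64) int b;
    alignas(64) int c;
    alignas(64) int d;
};
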

It is also important to point out that no thread synchronization primitives will help us here. Most of them will only make the situation worse by making us wait on a lock, before we then wait again for the caches to be updated.

Edit: I just realized that there is another way to solve this. Instead of incrementing the struct directly, you can just use a local variable in each function to store the temporary value, increment that, and at the end write it into the struct. That will also avoid the problem.
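
Roughly, that version of inc_a would look something like this (inc_b gets the same treatment); the shared cache line is touched only once per thread, right at the end:

C:
static void *inc_a(void *arg)
{
    (void)arg;
    int local = 0;                  // thread-private, so no false sharing
    for (int i = 0; i < 10000000; ++i)
        local++;
    d.a += local;                   // touch the shared struct only once
    return NULL;
}
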

Part 2 End

If you would like to try this yourselves, the files are attached. This was done on Linux, but the C file should compile on Windows. You will need to figure out how to profile it yourselves though.
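
If you are on Linux, something along these lines should get you going. I'm not claiming these are the exact flags I used, and note that optimization is left off so the compiler doesn't just fold the loops away:

Code:
gcc -O0 -pthread false_sharing.c -o false_sharing
perf stat -r 100 -e cache-misses,cache-references,cycles,instructions,task-clock ./false_sharing
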

I ran this on a pretty weak laptop, and its CPU only runs one hardware thread per core, so for me it's almost guaranteed that each thread lands on a separate core. Many CPUs run two hyperthreads per core, and the scheduler may put both of your threads on the same core. You can make the example run 4 threads instead, and I've written the code to do that. You will just need to uncomment the relevant things in the C file.
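
If you don't feel like digging through the attachment, the 4-thread version of main looks roughly like this. This is just a sketch, with inc_c and inc_d assumed to be copies of inc_a and inc_b that bump d.c and d.d:

C:
int main(void)
{
    pthread_t t[4];

    // one thread per counter, so (hopefully) one thread per core
    pthread_create(&t[0], NULL, inc_a, NULL);
    pthread_create(&t[1], NULL, inc_b, NULL);
    pthread_create(&t[2], NULL, inc_c, NULL);
    pthread_create(&t[3], NULL, inc_d, NULL);

    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);

    return 0;
}
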

Attachment: false_sharing.zip

That's about all I wanted to talk about this time. There are still a couple more topics I will cover, but this part has finished up the more important aspects.

Next time I will probably talk about some "nice" language features that have unintended performance consequences. And after that will be another eye opening post, somewhat less relevant to games, but quite interesting anyway. That one might be a bit more dense, and I still need to do a bit more research, so I don't know how soon that one will be.

That's all for now! Code softly and keep your caches warm. I'll see you all next time!
 

nawbra

New Member
Jun 16, 2017
I appreciate your dedication, but this may not be your ideal audience. How many 'adult' games are going to be written in C directly? Beyond that how many of them are going to be resource intensive enough to require the optimizations you are showcasing? E.g. add bloat to your struct to memory align based on cache sizes. Also what is the impact to processing an array of your structs now that you can fit 1/15th the number in cache at a time? Is forced memory aligning even possible in python? java? c#? These types of examples just seem too simplistic to have meaningful impact on a game selected at random.

Personally I think the time would be better spent investing in teaching people to optimize for the frameworks/engines these games are largely being built in - RPGM, RenPy, Unity, etc.
 

EverCursed

Newbie
May 2, 2019
How many 'adult' games are going to be written in C directly?
Probably very few, if any at all. I don't really care about that. Writing in a different language does not excuse you from the reality of the hardware that will be doing your work. It will apply no matter which language you use.

Beyond that how many of them are going to be resource intensive enough to require the optimizations you are showcasing?
Probably very few, if any at all! But if someone does want to make a game like that, they are welcome to read this.

Also what is the impact to processing an array of your structs now that you can fit 1/15th the number in cache at a time?
You are not really understanding what I'm trying to say. I'm not saying that there's only one way to make your code fast and it's the one I described. My example in this part showed a specific problem: accessing a struct from multiple threads. If my job were to write a program that processes many of them, I would come up with a new solution, taking into consideration what I need to do and what the hardware will like the most.

As for the impact of processing an array of these structs, you can profile to find out. That is really the most important thing. If you haven't profiled, you don't know if you have a problem or not. "Common sense" is usually wrong.

Is forced memory aligning even possible in python? java? c#?
Yes, but it's not as easy. In managed languages, alignment is usually handled by the language runtime or whatever you want to call it. However, it's also not safe to assume that when you create an object, its fields will be scattered around memory like alphabet soup (here's an article I found on Google). If you want to know how this applies to a specific language, you should look up how memory allocation is done in that language and draw your own conclusions.

These types of examples just seem too simplistic to have meaningful impact on a game selected at random.
Well, there is no point showing a complete application to illustrate a point. Trust me, these issues happen in real applications too. The problem with these issues is that if you don't know about them, they are almost impossible to notice. A profiler won't point you at the offending piece of code in these cases.

As for whether this will help a game selected at random, I really don't expect that to happen (hence I didn't name the offending game). But just in case it does affect someone, I'm writing all of this out.

Personally I think the time would be better spent investing in teaching people to optimize for the frameworks/engines these games are largely being built in - RPGM, RenPy, Unity, etc.
I don't have enough experience with any of them to make any interesting posts, and I'm sure that there are already plenty of posts like that around. And, more specifically, I don't care about any specific engine or framework. New engines are made every day, and just as many die every day. Learning about one engine in depth may help you in that engine, but it's impossible to know whether that knowledge will transfer to another engine. Knowledge of hardware is far more likely to transfer.

And if you hate all the engines around, you can just use that hardware knowledge to write your own engine! As far as I'm concerned, I will think of you in a much better light if you choose this option. Even if you are completely mental.
 

FroKe88

Member
May 15, 2017
It runs very fast if you let the compiler optimize the code :giggle:

gcc-9.1 -O3:
Code:
inc_a:
        add     DWORD PTR d[rip], 10000000
        ret
inc_b:
        add     DWORD PTR d[rip+4], 10000000
        ret
main:
        xor     edi, edi
        xor     eax, eax
        call    inc_a
        xor     edi, edi
        xor     eax, eax
        call    inc_b
        xor     eax, eax
        ret
Don't take me seriously :D
 

EverCursed

Newbie
May 2, 2019
It runs very fast if you let the compiler optimize the code :giggle:
I mean, to be honest, that assembly makes me want to cry a bit. It figured out how to collapse each loop into a single addition but didn't bother inlining the functions?.. What???

Edit: Okay, so basically, it seems like it doesn't inline because the data struct is global, and so it can't predict access to it. If you just pass a pointer to the functions, then it properly inlines it into this fantastic program:
Code:
main:
        add     DWORD PTR d[rip], 10000000
        xor     eax, eax
        add     DWORD PTR d[rip+4], 10000000
        ret
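
For reference, the change I mean is roughly this (just a sketch of the idea, not the exact code I fed the compiler): make the functions take a pointer to the struct and pass &d in from main.

C:
struct data { int a, b, c, d; };

static struct data d;

static void inc_a(struct data *p)
{
    for (int i = 0; i < 10000000; ++i)
        p->a++;
}

static void inc_b(struct data *p)
{
    for (int i = 0; i < 10000000; ++i)
        p->b++;
}

int main(void)
{
    inc_a(&d);
    inc_b(&d);
    return 0;
}
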
 

lobotomist

Active Member
Sep 4, 2017
I realize this kinda gets out of the scope of this explanation, but maybe you can explain something to me. In this video: he mentions caching a few times. I realize this is a different type of caching, but maybe you would be so nice as to explain how it works. I'm especially interested in the whole base state caching for all the inheriting states.

Also, on another note, you recommend DOD. I'm really interested in Unity's new DOTS and I would like at some point to use it to make a cache-friendly version of these: and, you know, for responsive boobie physics. Now I realize that this uses a lot of vectors to operate properly, so my question would be: what would be your recommendation to make this as performant as possible??
 

lancelotdulak

Active Member
Nov 7, 2018
Heh, it's great you posted this, but you need to note in the title that it is for C programming. And...
I don't think you've thought about this, but C/C++ programmers already likely know hardware.