Last time we discussed some basics of caches. Today let's move a little further with it.
Also prepare for a wall of text. Sorry.
L1 & L2 Cache
In the previous part, I showed a table of memory access latencies. Some of you may have wondered, what's this L1 L2 cache stuff? Isn't cache just cache?
Well it is time for us to fill out our understanding of CPU architecture.
The first thing to understand is how large the fastest cache is. If I look on my computer right now, it seems to be... 32kbytes... Pretty tiny, isn't it? The array we dealt with before was 4mbytes. And that's not the only problem. Another thing that may be happening is that you may be sharing that cache with some other programs, like... the operating system.
Yes, even the operating system may be getting in the way by using your cache space. For example, one thing the operating system does is schedule which processes use the CPU and how much time they get to do their work. And to do this, it will likely reference some memory, getting that useless data into our precious cache space. So what do we do?
The obvious solution is to stuff more cache into our CPU. 32kbytes is too small. So here comes the L2 cache. It's bigger than the L1 cache, but also slower unfortunately. But it's still much faster than main memory! We can work with this. Let's see how large it is... 256kbytes. Our array still can't fit in it.
It is now important to understand how memory operations work with caches. Reading is pretty simple: you read a memory address, and the data gets put into the cache. But which one? Does it go into the faster cache, since we are using that data right now, or does it go into the slower one and wait to see if we reference that memory again to promote its importance in the cache hierarchy?
Well, I'm happy to tell you that it gets put into both. Your data from main memory now exists in the L1 and L2 caches, as well as main memory, taking up triple the space. But this does have a positive outcome.
Let's say we reference a memory location and pay the price of a full memory reference. We then do some other work. During this time, the L1 cache has likely been busy, and it decided to discard the data we used some time ago. This is exactly when we decide to use that data again, but notice that it's no longer in the L1 cache. Do we need to pay the price of a full memory reference again?
But wait! We have the L2 cache! Since it's larger, it's likely that it hasn't needed to flush out our data yet, so we can ask it... Aha! There it is, our data was in the L2 cache. So we only need to pay the price of an L2 cache reference, about 7ns. Compared to the 100ns we would waste on a main memory reference, that's much better.
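To get a feel for what this buys us, here is a tiny back-of-the-envelope calculation. The hit rates below are made-up numbers purely for illustration (real ones depend entirely on your access pattern), and I'm rounding the L1 hit to 1ns:
C:
#include <stdio.h>

int main()
{
    // rough latencies in ns; the L2 and main memory numbers are from the
    // table in part 1, the L1 number is rounded for simplicity
    double l1_ns = 1.0, l2_ns = 7.0, mem_ns = 100.0;

    // completely made-up hit rates, just to show the shape of the math
    double l1_hit = 0.90, l2_hit = 0.09, miss = 0.01;

    double avg = l1_hit * l1_ns + l2_hit * l2_ns + miss * mem_ns;
    printf("average access time: %.2f ns\n", avg); // ~2.53 ns
    return 0;
}
Even in this rosy scenario, the 1% of accesses that go all the way to main memory make up a big chunk of the average. That is why keeping data in the caches matters so much.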
L3 Cache?
Depending on the architecture of your computer, your CPU may also have an L3 cache. It follows a similar pattern: the cache is larger and slower. However, there is a difference between it and the other caches, which we will get to a little later.
The L3 cache on my computer is 6mbytes.
L1 Cache... x2
As it turns out, those 32kbytes of L1 cache are still too small. And part of the problem is that we don't actually have 32kbytes of it to ourselves. Not only can other applications use that space, we are also spending space on one more thing: the instructions our CPU will be executing. So far we have only talked about memory holding our program's data, but we cannot ignore that the program's instructions are also data, and must be stored in memory when the program runs. Worse, if we had to reference main memory to fetch our instructions, we would run into the same problem as before: our program would be crawling at a snail's pace.
So naturally our solution is to stick another L1 cache in there and split the two uses apart: one cache will be the instruction cache, or I-cache, and the other will be the data cache, or D-cache. This has a hidden benefit: when our D-cache is working overtime, your application doesn't get stalled trying to fetch the next instruction. This is actually even more important than it seems, and I may talk about it some other time. It has to do with how the CPU works internally.
So to sum up, this is what our caches look like:
- 32kbytes I-cache (L1 cache)
- 32kbytes D-cache (L1 cache)
- 256kbytes L2 cache (stores both instructions and data)
- 6mbytes L3 cache, if present (stores both instructions and data)
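If you want to check these sizes on your own machine, glibc on Linux exposes them through sysconf. This is just a sketch: the _SC_LEVEL* constants are a glibc extension, and some systems report 0 for levels they don't have (or don't know about).
C:
#include <stdio.h>
#include <unistd.h>

int main()
{
    // glibc-specific constants; the returned values are in bytes
    printf("L1 I-cache: %ld\n", sysconf(_SC_LEVEL1_ICACHE_SIZE));
    printf("L1 D-cache: %ld\n", sysconf(_SC_LEVEL1_DCACHE_SIZE));
    printf("L2 cache:   %ld\n", sysconf(_SC_LEVEL2_CACHE_SIZE));
    printf("L3 cache:   %ld\n", sysconf(_SC_LEVEL3_CACHE_SIZE));
    return 0;
}
You can also poke around in /sys/devices/system/cpu/cpu0/cache/ on Linux, which lists the same information per cache level.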
Multiple Core CPUs
I hope you have noticed that CPU clock frequencies have pretty much stopped increasing over the past decade. They still creep up, but by negligible amounts. The way CPUs get faster now is by packing in more cores.
For the rest of this post I will assume having a 4 core, 8 thread CPU.
So now we have a CPU that can do 4 simultaneous memory accesses. Suddenly our effective cache size becomes a quarter of what it was. What's the solution? You guessed it! Stuff more caches in there!
More specifically, the L1 and L2 caches will now be bundled together with a core. So each core will have an I-cache, a D-cache, and an L2 cache.
Our current hierarchy for our quad core machine looks like this:
- 4x 32kbytes I-cache (L1 cache)
- 4x 32kbytes D-cache (L1 cache)
- 4x 256kbytes L2 cache (stores both instructions and data)
- 1x 6mbytes L3 cache (stores both instructions and data and is shared between 4 cores)
- Main memory (painfully slow and very large, but we don't actually care)
I'm happy to say that by this point we have pretty much finished catching up to modern cache architecture. There are a few other caches that I haven't mentioned, like the TLB, but they are not too important for us. If you optimize your code for I-cache and D-cache usage, you will also by extension optimize the TLB usage.
Well that was boring... What now?
And now come the problems!
Since we now have 4 independent cores accessing and modifying memory, how do you ensure that the memory is consistent between all cores? If one core writes to a loaded cache line, but that cache line is also in another core's D-cache, what happens?
One thing we absolutely cannot do is update all the caches (and memory) every time a write is performed. We would be paying for a trip to main memory on every single write, which would make caches useless.
So what we do is delay. We hold off on updating main memory for as long as we can, and only update when another core needs to access that cache line, or when the cache is about to discard it.
But our cache structure has made updating very difficult. We now have to update not only main memory, but also the three other L2 caches, the three other D-caches (I'm ignoring things like self-modifying code, which would force us to update the I-caches too), and the L3 cache.
Basically, our new data has to propagate down the cache hierarchy (remember, whatever sits in our L1 caches also has a copy in L2, L3, and main memory), updating all the caches below it, and then propagate back up the hierarchy to update the remaining caches in the CPU. We don't actually need to update main memory yet, as the L3 cache is shared between all cores.
If you think this might be slow, you will be correct. And it causes the problem described below.
False Sharing
Let's finally look at an example of this in action. This example is partially stolen from Wikipedia.
First we need to establish our base case. This is the most straightforward way to do what we want, and it is what we will test our improvements against.
C:
#include <stddef.h>
#include <pthread.h>

// for those who don't know C:
// this is the same as a class with all fields being public
struct data {
    int a;
    int b;
    int c;
    int d;
};

static struct data d;

static void *inc_a(void *arg)
{
    (void)arg; // unused; the signature matches what pthreads expects later
    for (int i = 0; i < 10000000; ++i)
        d.a++;
    return NULL;
}

static void *inc_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i)
        d.b++;
    return NULL;
}

int main()
{
    // call each function sequentially; each increments its variable 10,000,000 times
    inc_a(NULL);
    inc_b(NULL);
    return 0;
}
This is pretty simple. We just call the two functions in order, and each one increments its own variable ten million times. How long does this take to run? Well, on my weak little laptop, this is what perf shows:
Code:
Base case:
Performance counter stats for './false_sharing' (100 runs):
3,797 cache-misses # 22.942 % of all cache refs ( +- 4.17% ) (74.08%)
16,552 cache-references # 0.300 M/sec ( +- 2.21% ) (74.67%)
139,497,948 cycles # 2.529 GHz ( +- 0.80% ) (51.28%)
110,339,971 instructions # 0.79 insn per cycle ( +- 0.79% ) (75.50%)
55.164324 task-clock (msec) # 0.840 CPUs utilized ( +- 0.94% )
0.065641932 seconds time elapsed ( +- 0.81% )
Now let's try to make this multithreaded. The simplest way to do that is to just run each function on its own thread. The code now looks like this:
C:
#include <stddef.h>
#include <pthread.h>

// for those who don't know C:
// this is the same as a class with all fields being public
struct data {
    int a;
    int b;
    int c;
    int d;
};

static struct data d;

static void *inc_a(void *arg)
{
    (void)arg; // unused, but pthread_create requires this signature
    for (int i = 0; i < 10000000; ++i)
        d.a++;
    return NULL;
}

static void *inc_b(void *arg)
{
    (void)arg;
    for (int i = 0; i < 10000000; ++i)
        d.b++;
    return NULL;
}

int main()
{
    pthread_t thread1, thread2;

    // create both threads so they run concurrently
    pthread_create(&thread1, NULL, inc_a, NULL);
    pthread_create(&thread2, NULL, inc_b, NULL);

    // wait for both threads to finish
    pthread_join(thread1, NULL);
    pthread_join(thread2, NULL);
    return 0;
}
Okay, seems reasonable. What does perf show?
Code:
The bad case:
Performance counter stats for './false_sharing_bad' (100 runs):
3,738,423 cache-misses # 29.720 % of all cache refs ( +- 0.27% ) (74.83%)
12,578,922 cache-references # 23.228 M/sec ( +- 0.43% ) (75.03%)
1,380,577,494 cycles # 2.549 GHz ( +- 0.34% ) (50.15%)
118,122,393 instructions # 0.09 insn per cycle ( +- 0.17% ) (75.07%)
541.531791 task-clock (msec) # 1.919 CPUs utilized ( +- 0.37% )
0.282131761 seconds time elapsed ( +- 0.35% )
...Oh. It's only a bit more than 4 times slower. That's exactly what we want from a multi-threaded application.
Remember that we are not even updating the same variable. This is why this problem is usually so difficult to spot.
The problem, of course, comes from the fact that we are writing to a single cache line. Our data struct is composed of 4 ints, which is a total of 16 bytes. That guarantees the entire struct fits in one 64-byte cache line. Once a thread writes to that cache line, the line becomes invalid for all other cores, and the CPU is forced to bounce between the caches to restore consistency before another thread can do its own write.
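If you want to double check those numbers on your machine, you can print the struct size next to the L1 cache line size (again a glibc-specific sketch; _SC_LEVEL1_DCACHE_LINESIZE may report 0 on some systems):
C:
#include <stdio.h>
#include <unistd.h>

struct data {
    int a;
    int b;
    int c;
    int d;
};

int main()
{
    // 4 ints of 4 bytes each = 16 bytes, comfortably inside one 64-byte line
    printf("sizeof(struct data): %zu bytes\n", sizeof(struct data));
    printf("L1 D-cache line:     %ld bytes\n",
           sysconf(_SC_LEVEL1_DCACHE_LINESIZE));
    return 0;
}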
How do we fix this? You could make each thread just return its accumulated value, and that would work. But it would also change the behavior of our program. If you want to keep everything the same, all you need to do is inflate each int to take up 64 bytes. So all you need to do is change the struct to this:
C:
struct data {
    int a;
    int _padding1[15];
    int b;
    int _padding2[15];
    int c;
    int _padding3[15];
    int d;
    int _padding4[15];
};
Now each integer is followed by 15 integers' worth of padding that will simply never be used. Each int is 4 bytes, so one int plus its 15 padding ints adds up to 16 ints, or 64 bytes: every integer and its padding now occupy an entire cache line of their own. Let's see how it runs now.
Code:
The good case, integers padded:
Performance counter stats for './false_sharing_good' (100 runs):
7,069 cache-misses # 26.806 % of all cache refs ( +- 3.57% ) (70.56%)
26,370 cache-references # 0.530 M/sec ( +- 1.97% ) (76.15%)
117,618,574 cycles # 2.362 GHz ( +- 0.78% ) (53.42%)
93,215,542 instructions # 0.79 insn per cycle ( +- 0.80% ) (75.16%)
49.793880 task-clock (msec) # 1.309 CPUs utilized ( +- 1.07% )
0.038026063 seconds time elapsed ( +- 1.03% )
There we go! That looks much better. Almost two times faster than our base case and 7-8 times faster than our bad multi-threaded version. Very nice.
It is also important to point out that no thread synchronization primitives will help us here. Most of them will only make the situation worse by making us wait, before we then have to wait again for the caches to be updated.
Edit: I just realized that there is another way to solve this. Instead of incrementing the struct directly, you can just use a local variable in each function to store the temporary value, increment that, and at the end write it into the struct. That will also avoid the problem.
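For completeness, here is my own sketch of that idea for one of the increment functions (it is not part of the attached files). Each thread accumulates into a local variable, which lives in a register or on that thread's private stack, so the shared cache line is only touched once at the very end:
C:
static void *inc_a(void *arg)
{
    (void)arg;
    int local = 0;                 // private to this thread, no sharing possible
    for (int i = 0; i < 10000000; ++i)
        local++;                   // never touches the shared cache line
    d.a += local;                  // a single write to the shared struct at the end
    return NULL;
}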
Part 2 End
If you would like to try this yourselves, the files are attached. This was done on Linux, but the C file should compile on Windows. You will need to figure out how to profile it yourselves though.
I ran this on a pretty weak laptop that only runs one hardware thread per core, so for me it's almost guaranteed that each thread ends up on a separate physical core. Many CPUs run two hardware threads (hyperthreads) per core, and the scheduler may put both of your threads on the same core. You can make the example run 4 threads instead, and I've written the code to do that; you will just need to uncomment the relevant things in the C file.
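If you want to be certain that your two threads land on separate physical cores, you can pin them on Linux. This is only a sketch using the GNU-specific pthread_setaffinity_np, not something from the attached files, and which logical CPU numbers belong to which physical core varies between machines (check with lscpu), so the 0 and 1 below are just an example:
C:
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

// Pin an already-created thread to one logical CPU.
static void pin_to_cpu(pthread_t thread, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(thread, sizeof(set), &set);
}

// Usage, right after the pthread_create calls:
//   pin_to_cpu(thread1, 0);
//   pin_to_cpu(thread2, 1);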
Attachment: false_sharing.zip
That's about all I wanted to talk about this time. There are still a couple more topics I will cover, but this part has finished up the more important aspects.
Next time I will probably talk about some "nice" language features that have unintended performance consequences. And after that will be another eye opening post, somewhat less relevant to games, but quite interesting anyway. That one might be a bit more dense, and I still need to do a bit more research, so I don't know how soon that one will be.
That's all for now! Code softly and keep your caches warm. I'll see you all next time!