Part 2 is now up: https://f95zone.to/threads/why-your-game-is-slow-part-2.30341/
This will be a fairly long technical post that requires some programming knowledge. For most of my examples I will be using C, and I may also show how my points apply in some other languages (Python, Java, etc.). However, no matter what language the examples are in, the theory applies to absolutely any language without exception. I use C because it gives more control over how your program behaves than languages like Python do.
The reason for writing this is as follows. I recently played a game from this site on my phone (I won't say which). Any time video ran, it played at around 0.5-2fps. So I thought I could write a little guide to some possible performance problems in games. Even if you don't care about games, this is still good to know no matter what you do.
Introduction
In my last semester of university, I had a group assignment to write a web crawler that also brute-forces passwords. I started the project, outlined the structure of the program, and shared it with my teammates.
The first thing they did was revise my code, with two major changes: 1. they made the code object-oriented, and 2. they changed the link storage from an array to a linked list.
At that point I realized that students are not actually taught computer architecture in any depth, so I thought I'd do a small series about programming, aimed at those who have already done some.
Computer Architecture
Before anything else, it is important to understand how computers work. I expect that pretty much everyone understands the basic structure of the computer. It can be summarized like so:
- The CPU is the computer's brain.
- RAM stores data and is very quick to access.
- Hard drives (HDDs) or solid state drives (SSDs) are for permanent storage. They are slower than RAM but hold data while the computer is turned off.
That model is outdated. The current one looks something like this:
- The CPU is the computer's brain.
- RAM stores data and is very slow to access.
- Hard drives (HDDs) or solid state drives (SSDs) are for permanent storage. They are lethally slow but hold data while the computer is turned off.
- CPUs have hierarchies of caches that store recently used data. They are faster than RAM and much smaller than it.
Caches
Let's start with an example:
C:
int array[1024][1024];
for(int outer = 0; outer < 1024; outer++)
{
    for(int inner = 0; inner < 1024; inner++)
    {
        array[inner][outer] = 0;
    }
}
Those who have done structured programming will likely see the issue with this code. If we visualize how the array is laid out in memory, it looks something like this:
Code:
address index
----------------------------
0x00000000 array[0][0]
0x00000004 array[0][1]
0x00000008 array[0][2]
0x0000000C array[0][3]
0x00000010 array[0][4]
0x00000014 array[0][5]
0x00000018 array[0][6]
0x0000001C array[0][7]
0x00000020 array[0][8]
.... ....
0x00000FF8 array[0][1022]
0x00000FFC array[0][1023]
0x00001000 array[1][0]
0x00001004 array[1][1]
0x00001008 array[1][2]
.... ....
0x003FFFF0 array[1023][1020]
0x003FFFF4 array[1023][1021]
0x003FFFF8 array[1023][1022]
0x003FFFFC array[1023][1023]
Our loop first runs through the outer dimension and then through the inner dimension. It starts at array[0][0], next goes to array[1][0], then array[2][0], and so on. Or, to put it another way, it visits addresses 0x00000000, 0x00001000, 0x00002000, 0x00003000...
So why is this slow? Wasn't the point of an array that you could randomly access different locations?
Well yes, but there are two major problems.
The first problem is that accessing memory is far slower than you might think. CPU speeds kept increasing, and memory failed to catch up. See the addendum at the end for the simple reason why. For now, all we care about is whether it is slow enough to matter. Here is a table of very rough latencies for different memory access operations on a modern computer.
| Operation | Latency |
| --- | --- |
| L1 cache reference | 0.5 ns |
| L2 cache reference | 7 ns |
| Main memory reference | 100 ns |
Numbers partially stolen from the "latency numbers every programmer should know" tables floating around the web.
To put this in perspective, think of your average AAA game. It runs at 60fps, which gives it 16.666ms per frame to do physics, rendering, asset loading, audio queuing, and input processing. At 100ns per uncached read, that budget is only about 166,000 main memory references. You cannot afford to waste 100ns reading a single value.
The second problem is that memory is retrieved in chunks of a specific size. The exact size depends on the architecture, but the most common one today is 64 bytes. You do not have a choice in this: if you reference a single byte in memory, you will fetch 64 bytes. What will also happen is that the entire chunk (also called a cache line) will be placed in the CPU cache. This is done so that you don't waste time fetching those 64 bytes again when you reference that same byte or a byte close to it. That being said, it isn't actually that simple. More on this later.
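As an aside, you don't have to guess the cache line size. On Linux with glibc you can query it at runtime; here is a minimal sketch (the sysconf constant is a Linux-specific extension, and some systems report 0):
C:
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* glibc exposes the L1 data cache line size via sysconf.
       On most current x86 and ARM chips this prints 64. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if(line > 0)
        printf("L1 data cache line: %ld bytes\n", line);
    else
        printf("Cache line size not reported on this system.\n");
    return 0;
}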
Going back to the example now, let's look at how it performs versus how it would perform if we flipped the order of our array accesses to array[outer][inner].

| array[inner][outer] | array[outer][inner] |
| --- | --- |
| (timing output not preserved) | (timing output not preserved) |
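If you want to measure this yourself, here is a minimal benchmark sketch. It is my own code, not the original post's measurements, and you should compile it with optimizations mostly off (e.g. gcc -O1), otherwise the compiler may rewrite the friendly loop into a single memset and skew the comparison:
C:
#include <stdio.h>
#include <time.h>

static int array[1024][1024];

/* Monotonic wall-clock time in seconds. */
static double now(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void)
{
    /* Cache-hostile order: consecutive writes are 4096 bytes apart. */
    double t0 = now();
    for(int outer = 0; outer < 1024; outer++)
        for(int inner = 0; inner < 1024; inner++)
            array[inner][outer] = 0;
    double t1 = now();

    /* Cache-friendly order: walks memory sequentially. */
    for(int outer = 0; outer < 1024; outer++)
        for(int inner = 0; inner < 1024; inner++)
            array[outer][inner] = 0;
    double t2 = now();

    printf("array[inner][outer]: %.4f s\n", t1 - t0);
    printf("array[outer][inner]: %.4f s\n", t2 - t1);
    return 0;
}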
Now there is one more piece of information we should know: the hardware is actually quite a bit smarter than we give it credit for. It will detect memory access patterns and speculatively prefetch cache lines before you ask for them. The important thing to note is that the memory accesses need to be regular: every 1 byte, every 2 bytes, every 4 bytes, every 8 bytes, and so on. If the pattern is regular, the hardware will detect it and start fetching cache lines you might need soon while the CPU is busy with other operations.
Because of this, our new array access pattern will likely not hit a cache miss after the first one. The hardware keeps fetching more cache lines while we work through 64 bytes of data at a time. Our old version, however, can't benefit from this. Its memory accesses are so far apart that, even though the access pattern is regular, we don't do enough work to cover the 100ns needed for the next cache line to arrive. Because of this, we hit a cache miss every time. To make things even worse, when we finish looping through one dimension and increment the other, we essentially return to the start of the array. While we fetched that cache line before, it may well have been evicted by then and will need to be fetched again.
How Does this Help?
Let's see how this applies to a few popular languages.
However, I am not in any way an expert in the following languages. Most of this is going to be from guesses, things I heard, random articles, etc. Please let me know if I make a mistake.
Python
One thing I will recommend is to avoid the standard Python data structures. For example, lists. They may seem similar to arrays: they can be accessed in order, and internally they are dynamic arrays. Sounds good so far.
What's important, though, is that they are dynamic arrays of pointers. This means you will likely get the pointer itself without a cache miss, but will then have to dereference it, leading to a... cache miss. The problem is that we don't know where each pointer leads, or whether those memory locations follow a regular pattern. In other words, prefetching will likely suffer.
As far as I'm aware, NumPy arrays seem to deal with these problems.
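To make that concrete in C terms, here is a rough model of the two layouts: a contiguous NumPy-style buffer versus a Python-list-style array of pointers to separately allocated values. This is a simplification of what CPython actually does, just enough to show the extra indirection:
C:
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 20)

int main(void)
{
    /* NumPy-style: one contiguous block; sequential, prefetcher-friendly reads. */
    double *flat = malloc(N * sizeof *flat);

    /* Python-list-style: an array of pointers to boxed values. In a
       long-running program these boxes end up scattered across the heap,
       so every read chases a pointer to an unpredictable address. */
    double **boxed = malloc(N * sizeof *boxed);
    for(size_t i = 0; i < N; i++)
    {
        flat[i] = (double)i;
        boxed[i] = malloc(sizeof **boxed);
        *boxed[i] = (double)i;
    }

    double sum_flat = 0, sum_boxed = 0;
    for(size_t i = 0; i < N; i++)
        sum_flat += flat[i];      /* contiguous: few cache misses */
    for(size_t i = 0; i < N; i++)
        sum_boxed += *boxed[i];   /* one extra dereference per element */

    printf("%f %f\n", sum_flat, sum_boxed);
    return 0;
}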
C++
Not much to say here. The language is absolutely disfigured by the number of unfortunate decisions the committee has made. A lot of useful-looking standard structures are actually linked lists internally (std::list openly, but std::map and std::unordered_map are also node-based). See the end for a rant about why this is literally worse than AIDS (though I think you can guess by now).
Java
I thought about adding an explanation of the issues Java has, but I realized that I have not yet covered everything needed to explain them. Alas, our journey into understanding caches has only just begun.
Why do we care?
For simple software, we don't. It's as simple as that. It is unlikely that your program will be complex enough for this to matter.
I can only say this. Have you ever wondered why your phone battery can't last more than a day? The problem is that we don't know how to make the CPU use less energy. We only know one way to do that: turn it off. If you have ever watched your CPU, you will have seen that its frequency varies depending on what it's doing. When there isn't much to do, the CPU simply stops working and continues some time later. However, when the CPU is waiting for memory, it does not sleep.
Addendum
Limitation of Physics
This is an interesting one. If you look at a motherboard, the distance between the RAM sticks and the CPU is, I would guess, around 10cm / 4in. Now go on Google and type this in: speed of light / 10cm. Look at the result: roughly 3GHz. If that is smaller than your CPU frequency, congratulations, it takes longer for a signal to travel from the CPU to memory than it takes for one CPU cycle to complete. And that is only one way. It also doesn't include the electrons dancing through the RAM stick gathering the cache line, nor the fact that electrical signals travel slower than light in a vacuum.
We don't know how to make faster memory. That's why we started sticking memory directly into the CPU, which is what we call cache.
Linked Lists
This is the number one cause of all performance problems. If you use a linked list, please remove it. While the asymptotic complexity of traversing a linked list is the same as traversing an array, big-O analysis does not account for the cache miss you take every time you move to the next node. Asymptotically it's the same; practically it's up to 200 times slower (100ns per node versus 0.5ns from L1, per the table above).
It behaves like the worse version of our array traversal, except it also has to allocate memory on every insert. That memory comes from the heap, likely causing internal, and potentially external, fragmentation.
The only time linked lists are acceptable is when your program inserts into the list many times and then reads the entire list once, or somewhere around that ratio. That's what they are designed for: insertion is easy compared to other data structures, and also fairly quick. Reading is painfully slow.
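Here is a sketch of the pointer chase in C, assuming the usual case where nodes are allocated one by one and end up scattered across the heap; the point is that each hop is a dependent load the prefetcher cannot run ahead of:
C:
#include <stdio.h>
#include <stdlib.h>

struct node
{
    int value;
    struct node *next;
};

int main(void)
{
    /* Build a list of one million nodes. In a real program,
       interleaved allocations scatter these across the heap. */
    struct node *head = NULL;
    for(int i = 0; i < 1000000; i++)
    {
        struct node *n = malloc(sizeof *n);
        n->value = i;
        n->next = head;
        head = n;
    }

    /* Traversal: the address of the next node is only known once the
       current node has arrived from memory, so each hop can cost a
       full ~100ns main memory reference. */
    long sum = 0;
    for(struct node *n = head; n != NULL; n = n->next)
        sum += n->value;
    printf("%ld\n", sum);
    return 0;
}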
What else is there?
To be honest, quite a lot. This was an insultingly small overview of caches. There is still a lot to cover, and that's just caches. Here's a small list:
- Instruction cache vs data cache
- Cache hierarchy (L1, L2, L3 caches, how many of each)
- Multi-threading and caches
- Object oriented design and caches
- Data structure alignment
- What can the CPU do in one cycle?
Part 1 End
For this part, I recommend you look through the documentation for the language you use. See how the various data structures are implemented, and whether the implementation fits your needs.