I’ve had a couple discussions lately discussing the difference in speed between accessing data from memory and disk. In most of these conversations people were surprised to know that there is something faster than Main Memory that the programmer can take advantage of .
First, let’s consider the memory hierarchy. If you do a google image search for memory hierarchy, you’ll see a few different (and very insightful) interpretations. In it’s simplest, it looks like this.

A version of the memory hierarchy. L0 = Level 0, L1 = Level 2, and so on. It should be noted that some high-end systems have a third SRAM Cache, which becomes L3. This causes L3 (Main Memory) to become L4 and L4 to become L5. Also, please also keep in mine that the higher you move up the memory hierarchy, the more monetarily expensive things become (5 GB of RAM is more expensive than 5 GB of hard disk space).
Main memory is DRAM, while Level 2 (L2) Cache is SRAM. Now, I hear you asking, “What about my SDRAM? Is that the best of both worlds?” No, it’s still DRAM. The S means something different.
- DRAM: Dynamic Random Access Memory
- SRAM: Static Random Access Memory
- SDRAM: Synchronous Dynamic Random Access Memory
- DDR SDRAM: Double Data Rate & one hell of an acronym.
If you want to access something on your hard disk, it takes roughly 100,000 times longer than DRAM, and 1,000,000 times longer than SRAM, Making SRAM ~ 10x faster than DRAM.
There are actually two, and in some machines even three, levels of SRAM Cache. L1 actually lives on-board the processor, while L2 (and the less common L3) live off-board. Although they’re typically all SRAM, in this post references to “SRAM” specifically mean L2 Cache.
Now I hear you say “Awesome! I’m building this killer machine, and it’s got all the Main Memory it can handle, but I’ve still got some money to burn. Can I increase it’s SRAM too?” No. SRAM (L2 Cache) lives in a piece of additional hardware on your motherboard, you can’t really do anything to it.
So what’s it all about then? In general, the majority of your accesses to memory will be very close, even adjacent, to your previous access. It won’t always be the case, but it probably will. Caches take advantage of this probability. If you access memory location [x], there is a high probability that you’re next memory access will be to memory location [x+1] or something similar. If you accessed an int in memory, which has a typical size of 4 bytes, then your next memory access will probably be [x+4]. If L2 Cache is nearly 10x faster, as we know what you’re probably access next, we can speed things up by automatically moving what you’ll probably move into L2 Cache. When your program makes it’s next memory request, we check to see if it’s already in L2. If it is (cache hit) great! If it’s not, (cache miss), too bad – we goto main memory and swap that and it’s adjacent bytes, into L2 and proceed. A typical Intel Pentium chip has 128 KB – 2 MB of L2 Cache, and typically swaps data in and out of cache from main memory in 32 byte chunks.
So how can a programmer take advantage of this? If you’re going to be accessing chunks of memory, try to access it as you know it appears spatially in memory.
Let us consider the loop below, and assume for sake of example,
- Length = 8
- Any given array[i], an integer, is 4 bytes
- L2 Cache swaps data in 16 byte chunks (as opposed to the actual 32, noted above. That’s 4 integers).
- At the start of the loop, the L2 Cache was empty (cold cache).
[cc lang="cpp"]
int sumArray(int array[Length])
{
int sum = 0;
for (int i = 0; i < Length; i++)
sum += array[i];
return sum;
}
[/cc]
The below table represents cache hits and misses. White are cache hits, and red are cache misses.
| array[i] | i=1 | i=2 | i=3 | i=4 | i=5 | i=6 | i=7 | i=8 |
| Access Order | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
This is the best that we can do. Now let us consider a 2 dimensional array, given the assumtions
- Rows = 8
- Columns = 4
- C stores arrays/matrices in row-major order.
- The rest is the same as the example above.
[cc lang="c++"]
int sumMatrix(int matrix[Columns][Rows])
{
int sum = 0;
for (int col = 0; col < Columns; col++)
for (int row = 0; row < Rows; row++)
sum += matrix[col][row];
return sum;
}
[/cc]
Let’s look at those cache hits and misses. The values in the table body are the order in which matrix[col][row] was accessed.
| matrix[col][row] | row=0 |
row=1 |
row=2 |
row=3 | row=4 | row=5 | row=6 | row=7 |
| col = 0 | 1 | 2 |
3 | 4 | 5 | 6 | 7 | 8 |
| col = 1 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| col = 2 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| col = 3 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
Now let us consider the case where we do nothing except swap the order in which we access the matrix elements, and inspect the cache hit/miss table.
[cc lang="c++"]
int sumMatrix(int matrix[Columns][Rows])
{
int sum = 0;
for (int row = 0; row < Rows; row++)
for (int col = 0; col < Columns; row++)
sum += matrix[col][row];
return sum;
}
[/cc]
| matrix[col][row] | row=0 | row=1 | row=2 | row=3 | row=4 | row=5 | row=6 | row=7 |
| col=0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
| col=1 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
| col=2 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 |
| col=3 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 |
Every access is a cache miss! Meaning that we have to goto main memory for every access of the matrix. This will be approximately 9 times slower than if we hadn’t swapped the row/column order. This is because data is not actually stored as a two dimensional matrix in main memory – everything is a linear array, with row and column offsets being calculated. Because some data resides right next to other data in main memory, it will also appear this way when 32 bytes are moved into the cache. If you access them in the same order in which they’re stored in main memory, you can take advantage of the cache. If you don’t, you end up forcing cache two swap stuff in and out from main memory.
John Carmack was once quoted as saying “Low level programming is good for the programmer’s soul”.
No related posts.
Related posts brought to you by Yet Another Related Posts Plugin.

