To compile: cc -g loop-order.c
.
To generate cachegrind.out.<pid>
:
$ valgrind --tool=cachegrind --cache-sim=yes ./a.out
=21339== Cachegrind, a cache and branch-prediction profiler
==21339== Copyright (C) 2002-2017, and GNU GPL'd, by Nicholas Nethercote et al.
==21339== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==21339== Command: ./a.out
==21339==
--21339-- warning: L3 cache found, using its data for the LL simulation.
==21339==
==21339== I refs: 480,189,794
==21339== I1 misses: 1,066
==21339== LLi misses: 1,058
==21339== I1 miss rate: 0.00%
==21339== LLi miss rate: 0.00%
==21339==
==21339== D refs: 224,073,258 (192,054,705 rd + 32,018,553 wr)
==21339== D1 misses: 17,001,492 ( 1,167 rd + 17,000,325 wr)
==21339== LLd misses: 2,001,358 ( 1,041 rd + 2,000,317 wr)
==21339== D1 miss rate: 7.6% ( 0.0% + 53.1% )
==21339== LLd miss rate: 0.9% ( 0.0% + 6.2% )
==21339==
==21339== LL refs: 17,002,558 ( 2,233 rd + 17,000,325 wr)
==21339== LL misses: 2,002,416 ( 2,099 rd + 2,000,317 wr)
==21339== LL miss rate: 0.3% ( 0.0% + 6.2% )
Then you can analyze it with:
$ cg_annotate cachegrind.out.<pid>
--------------------------------------------------------------------------------
-- Metadata
--------------------------------------------------------------------------------
Invocation: /usr/bin/cg_annotate cachegrind.out.21339
I1 cache: 32768 B, 64 B, 8-way associative
D1 cache: 49152 B, 64 B, 12-way associative
LL cache: 25165824 B, 64 B, 12-way associative
Command: ./a.out
Events recorded: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Events shown: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Event sort order: Ir I1mr ILmr Dr D1mr DLmr Dw D1mw DLmw
Threshold: 0.1%
Annotation: on
--------------------------------------------------------------------------------
-- Summary
--------------------------------------------------------------------------------
Ir__________________ I1mr__________ ILmr__________ Dr__________________ D1mr__________ DLmr__________ Dw_________________ D1mw_______________ DLmw______________
480,189,794 (100.0%) 1,066 (100.0%) 1,058 (100.0%) 192,054,705 (100.0%) 1,167 (100.0%) 1,041 (100.0%) 32,018,553 (100.0%) 17,000,325 (100.0%) 2,000,317 (100.0%) PROGRAM TOTALS
--------------------------------------------------------------------------------
-- File:function summary
--------------------------------------------------------------------------------
Ir__________________________ I1mr__________ ILmr__________ Dr__________________________ D1mr__________ DLmr__________ Dw_________________________ D1mw_______________________ DLmw______________________ file:function
< 480,056,029 (100.0%, 100.0%) 3 (0.3%, 0.3%) 3 (0.3%, 0.3%) 192,024,008 (100.0%, 100.0%) 2 (0.2%, 0.2%) 2 (0.2%, 0.2%) 32,008,007 (100.0%, 100.0%) 17,000,000 (100.0%, 100.0%) 2,000,000 (100.0%, 100.0%) /home/rei/csi/intro-systems/memory-hierarchy/loop-order.c:
240,028,010 (50.0%) 1 (0.1%) 1 (0.1%) 96,012,003 (50.0%) 1 (0.1%) 1 (0.1%) 16,004,002 (50.0%) 16,000,000 (94.1%) 1,000,000 (50.0%) option_two
240,028,010 (50.0%) 1 (0.1%) 1 (0.1%) 96,012,003 (50.0%) 1 (0.1%) 1 (0.1%) 16,004,002 (50.0%) 1,000,000 (5.9%) 1,000,000 (50.0%) option_one
--------------------------------------------------------------------------------
-- Function:file summary
--------------------------------------------------------------------------------
Ir_________________________ I1mr__________ ILmr__________ Dr________________________ D1mr__________ DLmr__________ Dw________________________ D1mw______________________ DLmw_____________________ function:file
> 240,028,010 (50.0%, 50.0%) 1 (0.1%, 0.1%) 1 (0.1%, 0.1%) 96,012,003 (50.0%, 50.0%) 1 (0.1%, 0.1%) 1 (0.1%, 0.1%) 16,004,002 (50.0%, 50.0%) 16,000,000 (94.1%, 94.1%) 1,000,000 (50.0%, 50.0%) option_two:/home/rei/csi/intro-systems/memory-hierarchy/loop-order.c
> 240,028,010 (50.0%, 100.0%) 1 (0.1%, 0.2%) 1 (0.1%, 0.2%) 96,012,003 (50.0%, 100.0%) 1 (0.1%, 0.2%) 1 (0.1%, 0.2%) 16,004,002 (50.0%, 100.0%) 1,000,000 (5.9%, 100.0%) 1,000,000 (50.0%, 100.0%) option_one:/home/rei/csi/intro-systems/memory-hierarchy/loop-order.c
--------------------------------------------------------------------------------
-- Annotated source file: /home/rei/csi/intro-systems/memory-hierarchy/loop-order.c
--------------------------------------------------------------------------------
Ir_________________ I1mr____ ILmr____ Dr________________ D1mr____ DLmr____ Dw________________ D1mw______________ DLmw_____________
-- line 2 ----------------------------------------
. . . . . . . . .
. . . . . . . . . Two different ways to loop over an array of arrays.
. . . . . . . . .
. . . . . . . . . Spotted at:
. . . . . . . . . http://stackoverflow.com/questions/9936132/why-does-the-order-of-the-loops-affect-performance-when-iterating-over-a-2d-arra
. . . . . . . . .
. . . . . . . . . */
. . . . . . . . .
2 (0.0%) 0 0 0 0 0 1 (0.0%) 0 0 void option_one() {
. . . . . . . . . int i, j;
. . . . . . . . . static int x[4000][4000];
12,004 (0.0%) 1 (0.1%) 1 (0.1%) 8,001 (0.0%) 0 0 1 (0.0%) 0 0 for (i = 0; i < 4000; i++) {
48,016,000 (10.0%) 0 0 32,004,000 (16.7%) 0 0 4,000 (0.0%) 0 0 for (j = 0; j < 4000; j++) {
192,000,000 (40.0%) 0 0 64,000,000 (33.3%) 0 0 16,000,000 (50.0%) 1,000,000 (5.9%) 1,000,000 (50.0%) x[i][j] = i + j;
. . . . . . . . . }
. . . . . . . . . }
4 (0.0%) 0 0 2 (0.0%) 1 (0.1%) 1 (0.1%) 0 0 0 }
. . . . . . . . .
2 (0.0%) 1 (0.1%) 1 (0.1%) 0 0 0 1 (0.0%) 0 0 void option_two() {
. . . . . . . . . int i, j;
. . . . . . . . . static int x[4000][4000];
12,004 (0.0%) 0 0 8,001 (0.0%) 0 0 1 (0.0%) 0 0 for (i = 0; i < 4000; i++) {
48,016,000 (10.0%) 0 0 32,004,000 (16.7%) 0 0 4,000 (0.0%) 0 0 for (j = 0; j < 4000; j++) {
192,000,000 (40.0%) 0 0 64,000,000 (33.3%) 0 0 16,000,000 (50.0%) 16,000,000 (94.1%) 1,000,000 (50.0%) x[j][i] = i + j;
. . . . . . . . . }
. . . . . . . . . }
4 (0.0%) 0 0 2 (0.0%) 1 (0.1%) 1 (0.1%) 0 0 0 }
. . . . . . . . .
2 (0.0%) 1 (0.1%) 1 (0.1%) 0 0 0 1 (0.0%) 0 0 int main() {
2 (0.0%) 0 0 0 0 0 1 (0.0%) 0 0 option_one();
2 (0.0%) 0 0 0 0 0 1 (0.0%) 0 0 option_two();
1 (0.0%) 0 0 0 0 0 0 0 0 return 0;
2 (0.0%) 0 0 2 (0.0%) 0 0 0 0 0 }
--------------------------------------------------------------------------------
-- Annotation summary
--------------------------------------------------------------------------------
Ir__________________ I1mr_________ ILmr_________ Dr__________________ D1mr_________ DLmr_________ Dw_________________ D1mw_______________ DLmw______________
480,056,029 (100.0%) 3 (0.3%) 3 (0.3%) 192,024,008 (100.0%) 2 (0.2%) 2 (0.2%) 32,008,007 (100.0%) 17,000,000 (100.0%) 2,000,000 (100.0%) annotated: files known & above threshold & readable, line numbers known
0 0 0 0 0 0 0 0 0 annotated: files known & above threshold & readable, line numbers unknown
0 0 0 0 0 0 0 0 0 unannotated: files known & above threshold & two or more non-identical
0 0 0 0 0 0 0 0 0 unannotated: files known & above threshold & unreadable
133,654 (0.0%) 1,047 (98.2%) 1,040 (98.3%) 30,670 (0.0%) 1,161 (99.5%) 1,035 (99.4%) 10,533 (0.0%) 324 (0.0%) 316 (0.0%) unannotated: files known & below threshold
111 (0.0%) 16 (1.5%) 15 (1.4%) 27 (0.0%) 4 (0.3%) 4 (0.4%) 13 (0.0%) 1 (0.0%) 1 (0.0%) unannotated: files unknown
Command to benchmark: cc -Wall matrix-multiply.c benchmark.c && ./a.out 1000
.
Swapping two internal loops (j
and k
):
Naive: 3.575s
Fast: 2.300s
1.55x speedup