-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME
428 lines (359 loc) · 16.8 KB
/
README
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
CSX library v0.2
1. Introduction
===============
This package is a proof-of-concept release of the Compressed Sparse
eXtended format for sparse matrices. This format seeks to minimize the
memory footprint of the column index array of the typical Compressed
Sparse Row (CSR) format by exploiting dense substructures inside the
sparse matrix. Instead of storing a single index for every nonzero
element of the sparse matrix, CSX stores a short description for each
substructure found in the matrix (and selected for encoding). This
technique can save significant amount of main memory storage and
minimize the bandwidth requirements of the Sparse Matrix-Vector
Multiplication (SpMV) kernel. Finally, the CSX format employs runtime
code generation (using the LLVM compiler infrastructure) for emitting
optimized SpMV routines for each encoded pattern.
More information about the CSX format can be found in
K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris, "CSX: An extended
compression format for SpMV on shared memory systems," 16th ACM
SIGPLAN Annual Symposium on Principles and Practice of Parallel
Programming (PPoPP'11) San Antonio, TX, USA, February 12-16, 2011.
V. Karakasis, G. Goumas, K. Nikas, N. Koziris, J. Ruokolainen, and
P. Råback, "Using State-of-the-Art Sparse Matrix Optimizations for
Accelerating the Performance of Multiphysics Simulations" PARA 2012:
Workshop on State-of-the-Art in Scientific and Parallel Computing
Helsinki, Finland, June 10-13, 2012.
2. Summary of changes with the previous version
===============================================
The CSX format and library has matured considerably since its first
release in February, 2011. A summary of the changes follows:
* Complete re-implementation of the runtime code generation module
using Clang and LLVM
* Efficient support of NUMA architectures
* Support for symmetric matrices
* Proof-of-concept non-preconditioned CG implementation using CSX
* Revised preprocessing (better sampling, multi-threading, careful
memory management)
* Code and Makefile restructuring
* Support for multi-threaded compilation
* Efficient implementations, incl. NUMA-aware versions, of other
sparse matrix storage formats, such as CSR, BCSR, VBL, SSS
* Self-contained library interface (libcsx.so) for use with sparse
linear algebra s/w, specifically tuned for the Elmer multiphysics
simulation s/w.
* Several bug fixes
3. Prerequisites and external dependencies
==========================================
* A fairly recent Linux OS
* LLVM >= 2.9 and Clang
* Boost Regex Library >= 1.48
* numactl library >= 2.0.7
* Cairo library (optionally)
4. Compiling the package
========================
This package is written in C and C++. We have compiled it successfully
with several GCC up to 4.6. Running `make' at the top-level source
directory will be enough to compile the whole package for a typical
installation of LLVM and Boost. If you have installed LLVM 2.9 in a
non-standard location that is not in your path, you can instruct the
compilation process to use your preferred LLVM 2.9 installation by
typing
$ make llvmconf=/path/to/llvm-2.9/bin/llvm-config
If, for any reason, you might want to alter the predefined
preprocessor, compiler flags and/or linker flags, you can use the
standard CPPFLAGS, CFLAGS, CXXFLAGS and LDFLAGS. You need not worry
for the dependencies within the CSX library, since we append the
required flags to the user supplied ones.
Finally, you can speed up the compilation by using multiple tasks with
the `-j' flag of make:
$ make -j8
5. Running CSX
==============
After you have successfully compiled CSX, the final executable, named
`spmv', will be placed inside the `csx' directory. The spmv executable
may be invoked as follows:
[ENV=<value>] ... ./spmv [-b | -s] <matrix_file> [...]
The execution of CSX is controlled by a set of environment variables
and takes as input one or more files describing sparse matrices. For
each matrix, `spmv' will perform an analysis of the patterns found and
print the results, the preprocessing time and the wall-clock time (and
performance in Mflop/s) for the execution of 128 consecutive SpMV
operations.
5.1 Input matrix format
-----------------------
The `spmv' executable requires the input matrices to be in a variation
of the Matrix Market Exchange format. Specifically, the first line of
the file must contain the number of rows, columns and non-zeros,
respectively, separated by whitespace. The rest of the lines contain
the nonzero elements of the matrix (one per line) in the form of
`<row> <column> <value>', sorted lexicographically (i.e.,
row-major). To facilitate the conversion to the desired format, the
utility script `scripts/sort-mtx.sh' is supplied.
See also http://math.nist.gov/MatrixMarket/formats.html
5.2 Environment
---------------
The execution of `spmv' is solely controlled by environment variables,
which are described in detail below.
** Variables controlling multithreaded execution
MT_CONF Set the CPU affinity for running `spmv'. This is the
only way to setup a multithreaded execution for
`spmv'. You should supply the CPU numbers (see
/sys/devices/) of the desired CPUs as a
comma-separated list. If MT_CONF is not specified, it
is set to `0', thus assuming single-threaded
execution.
** Variables controlling the construction of CSX
XFORM_CONF Set the pattern types to search for in the
matrix. This is a comma-separated list of the pattern
type IDs as specified in the enumeration
`SpmIterOrder' in `spm.h'. The mapping between type
IDs and types is the following:
0 -> NONE
1 -> HORIZONTAL
2 -> VERTICAL
3 -> DIAGONAL
4 -> REV_DIAGONAL
5 -> BLOCK_TYPE_START (not used)
6 -> BLOCK_R1 (not used)
7 -> BLOCK_R2
8 -> BLOCK_R3
9 -> BLOCK_R4
10 -> BLOCK_R5
11 -> BLOCK_R6
12 -> BLOCK_R7
13 -> BLOCK_R8
14 -> BLOCK_COL_START (not used)
15 -> BLOCK_C1 (not used)
16 -> BLOCK_C2
17 -> BLOCK_C3
18 -> BLOCK_C4
19 -> BLOCK_C5
20 -> BLOCK_C6
21 -> BLOCK_C7
22 -> BLOCK_C8
23 -> BLOCK_TYPE_END (not used)
24 -> XFORM_MAX
The default value of XFORM_CONF is `0', i.e.,
NONE. You can optionally enable the BLOCK_R1 and
BLOCK_C1 types, i.e., one-dimensional fixed size
blocks, programmatically through the DRLE_Manager
interface.
ENCODE_DELTAS Set specific deltas to encode for one-dimensional
patterns or block dimensions for two-dimensional
patterns. ENCODE_DELTAS is a comma-separated list of
"delta descriptors". A delta descriptor is a tuple in
the form of `{d1,d2,...,dn}', where `di' is a specific
delta to be encoded. For two-dimensional patterns
types, the `di' signifies the second dimension of the
block pattern. The ENCODE_DELTAS variable must be
specified along with XFORM_CONF, in which case there
is a one-to-one mapping between the pattern types and
delta descriptors. For example, using
XFORM_CONF=16,1,3 and ENCODE_DELTAS={4,7},{1},{11},
first `BLOCK_C2' patterns with a second dimension of 4
and 7, respectively (i.e., block patterns 4x2 and 7x2)
will be encoded into the matrix. Second, `HORIZONTAL'
patterns with delta 1 will be encoded and, finally,
`DIAGONAL' patterns with delta 11. The number of
pattern types specified in XFORM_CONF must equal the
number of delta descriptors specified in
ENCODE_DELTAS.
** Variables controlling preprocessing
The preprocessing phase of CSX can be optimized with the use of
statistical sampling. Preprocessing is done in multiple threads using
the thread configuration specified by the MT_CONF variable. The
following variables control the sampling process.
SAMPLES Specify the number of sampling windows to consider for
searching per thread. This variable actually enables
the optimized preprocessing. It can be used in
conjunction with one of the following variables, which
specify the sampling window size, either explicitly or
implicitly. If the specified number of samples exceeds
the maximum possible number of samples (depending on
the sampling policy), it will be set to this maximum
number. In this case, no statistical sampling takes
place, but the preprocessing is performed simply with
the use of preprocessing windows.
WINDOW_SIZE Set the preprocessing window size explicitly. The
window size is set in terms of non-zero elements by
default. The size of the window will be "rounded" up
to the next full row. You can sample the matrix
row-wise, in which case WINDOW_SIZE is the number or
rows of each window. However, to enable this policy,
you should alter the sampling policy during the
creation of the DRLE_Manager in `spmv.cc'. This kind
of sampling is not recommended, since it cannot
provide a good coverage for irregular matrices.
SAMPLING_PORTION Specify the portion of the matrix to sample as a real
number in [0,1]. If this variable is specified, the
window size in non-zero elements will be automatically
computed to meet the following requirement:
SAMPLING_PORTION*non_zeros = SAMPLES*computed_win_size
5.3 Other options controlling execution
---------------------------------------
You can specify the two following options when running CSX:
-b Disable the split-block optimization. The split-blocks
optimization leads to better compression, since it improves
the statistics (number of nonzero elements covered by a
pattern) of smaller blocks making them eligible for
encoding.
-s Use the version of CSX for symmetric matrices. In this
variation, the elements of the main diagonal are stored
separately and the lower triangular matrix is stored in the
CSX format.
5.4 Output
----------
The typical output of running the CSX is the following (lines starting
with `#' are not part of the output, but serve as explanatory comments):
=== BEGIN BENCHMARK ===
##
## CSX set-up information
##
MT_CONF=0
Encoding type: HORIZONTAL, VERTICAL, DIAGONAL, REV_DIAGONAL, __BLOCK_TYPE_START__, BLOCK_R1, BLOCK_R2, BLOCK_R3, BLOCK_R4, BLOCK_R5, BLOCK_R6, BLOCK_R7, BLOCK_R8, __BLOCK_COL_START__, BLOCK_C1, BLOCK_C2, BLOCK_C3, BLOCK_C4, BLOCK_C5, BLOCK_C6, BLOCK_C7, BLOCK_C8
Window size: Not set
Number of samples: Not set
Sampling portion: Not set
##
## Per-thread preprocessing results
##
## s: heuristic score
## <int>-> : the delta value or the second dimension of the block for
## blocked substructures
## np: number of patterns encoded by every substructure instantiation
## nnz (<percent>): number of non-zeros encoded by every substructure
## instantiation followed by the coverage percentage.
## This percentage is an estimation if sampling is
## used.
==> Thread: #0
HORIZONTAL s:12 1-> np:4 nnz: 16 (42.1053%)
VERTICAL s:15 1-> np:5 nnz: 20 (52.6316%)
DIAGONAL s:8 1-> np:1 nnz: 6 (15.7895%) 2-> np:1 nnz: 4 (10.5263%)
REV_DIAGONAL s:3 1-> np:1 nnz: 4 (10.5263%)
BLOCK_R2 s:16 2-> np:3 nnz: 12 (31.5789%) 4-> np:1 nnz: 8 (21.0526%)
BLOCK_R3 s:10 2-> np:2 nnz: 12 (31.5789%)
BLOCK_C2 s:19 3-> np:1 nnz: 6 (15.7895%) 4-> np:2 nnz: 16 (42.1053%)
BLOCK_C3 s:8 3-> np:1 nnz: 9 (23.6842%)
##
## Substructure type selected for encoding and the detection process
## starts over
##
Encode to BLOCK_C2
HORIZONTAL s:3 1-> np:1 nnz: 4 (10.5263%)
VERTICAL s:3 1-> np:1 nnz: 4 (10.5263%)
DIAGONAL s:3 2-> np:1 nnz: 4 (10.5263%)
REV_DIAGONAL s:3 1-> np:1 nnz: 4 (10.5263%)
Encode to HORIZONTAL
VERTICAL s:3 1-> np:1 nnz: 4 (10.5263%)
DIAGONAL s:3 2-> np:1 nnz: 4 (10.5263%)
REV_DIAGONAL s:3 1-> np:1 nnz: 4 (10.5263%)
Encode to VERTICAL
DIAGONAL s:3 2-> np:1 nnz: 4 (10.5263%)
REV_DIAGONAL s:3 1-> np:1 nnz: 4 (10.5263%)
Encode to DIAGONAL
REV_DIAGONAL s:3 1-> np:1 nnz: 4 (10.5263%)
Encode to REV_DIAGONAL
##
## The substructure types were selected for encoded in the selection
## order
##
Encoding sequence: BLOCK_C2, HORIZONTAL, VERTICAL, DIAGONAL, REV_DIAGONAL
##
## The exact instantiations that were encoded
##
## delta:<int> : the delta value of the instantiation
## dim:<int>x<int> : the block dimensions of the encoded blocks
## npatterns:<int> : the number of patterns encoded by the instantiation
## nnz:<int> : the number of non-zeros encoded by the instantiation
##
type:HORIZONTAL delta:1 npatterns:1 nnz:4
type:VERTICAL delta:1 npatterns:1 nnz:4
type:DIAGONAL delta:2 npatterns:1 nnz:4
type:REV_DIAGONAL delta:1 npatterns:1 nnz:4
type:BLOCK_C2 dim:3x2 npatterns:1 nnz:6
type:BLOCK_C2 dim:4x2 npatterns:2 nnz:16
Checking ... Check Passed
##
## Performance info
##
## m:<method> : method used (csx or csx-sym)
## f:<file> : the matrix file (basename)
## s:<size> : size in bytes of the CSX format
## pt:<time> : preprocessing time in seconds
## t:<time> : execution time of 128 SpMV operations in seconds
## r:<perf> : performance in Mflop/s
##
m:csx f:demopatt.mtx.sorted s:327 pt:0.063501 t:0.000086 r:113.363468
=== END BENCHMARK ===
6. Running CG
=============
After you have successfully compiled CG, the final executable, named
`cg', will be placed inside the `cg' directory. The cg executable may
be invoked as follows:
[ENV=<value>] ... ./cg [-s | -x | -l <n> | -L <n>] <matrix_file>
The execution of CG is controlled by a set of environment variables
and command line parameters and takes as input a file describing the
sparse matrix. The solution vector of equation Ax = b (s) is
randomized with double precision elements from -1000 to 1000, the
vector b is calculated from the multiplication b = As and the
estimated solution vector (x) is initialized with elements equal to
0.01. The result shows the square distance between the estimated
vector (x) and the real solution vector values (s), the time breakdown
into spmv multiplication phase, spmv reduction time and rest vector
operations, and the total cg time.
6.1 Input matrix format
-----------------------
See 5.1.
6.2 Environment
---------------
See 5.2.
6.3 Other options controlling execution
---------------------------------------
You can control the execution of the CG method with the following four
command line options:
-x Activate the CSX format. By default (if the options is not
set), CSR will be used.
-s Replace with the respective symmetric formats. Instead of CSR,
SSS will be deployed and instead of CSX, CSX-Sym would be
used.
-l <n> Specify the maximum number of iterations of the CG method
(default is 1024).
-L <n> Specify how many times to run the specified benchmark (default
is once).
6.4 Output
----------
The typical output of running CG is the following (lines starting with `#' are
not part of the output, but serve as explanatory comments):
##
## The CPU configuration.
##
MT_CONF: 0,1,2,3
##
## The specified format output (if CSX or CSX-Sym is activated the details about
## stats, encoding etc. is shown.
##
...
##
## Performance info
##
## m:<matrix> : the matrix file (basename)
## l:<loops> : number of iterations of CG
## rd:<distance> : square distance from actual solution
## st:<time> : SpMxV multiplication operations time
## rt:<time> : SpMxV reduction operations time
## ct:<time> : CG total time
##
## * the rest vector operations time of CG can be found by subtracting 'st' and
## 'rt' from 'ct' time.
##
m:26.nd12k.mtx.sorted l:1024 rd:0.004506 st:10.816884 rt:0.145372 ct:11.367220
7. Licence
==========
This package is distributed under the BSD Licence. See `LICENCE.txt' for more
information.
8. Authors
==========
This package is maintained by
Vasileios Karakasis <[email protected]>
Theodoros Gkountouvas <[email protected]>
Kornilios Kourtis <[email protected]>