NOTE: This proof-of-concept version of the Compressed Sparse eXtended (CSX) sparse matrix format is now deprecated. It has been superseded by the SparseX sparse kernel optimization library [https://github.com/cslab-ntua/sparsex].
CSX library v0.2

1. Introduction
===============

This package is a proof-of-concept release of the Compressed Sparse
eXtended (CSX) format for sparse matrices. This format seeks to minimize
the memory footprint of the column index array of the typical Compressed
Sparse Row (CSR) format by exploiting dense substructures inside the
sparse matrix. Instead of storing a single index for every nonzero
element of the sparse matrix, CSX stores a short description for each
substructure found in the matrix (and selected for encoding). This
technique can save a significant amount of main memory storage and
minimize the bandwidth requirements of the Sparse Matrix-Vector
Multiplication (SpMV) kernel. Finally, the CSX format employs runtime
code generation (using the LLVM compiler infrastructure) for emitting
optimized SpMV routines for each encoded pattern.

More information about the CSX format can be found in

K. Kourtis, V. Karakasis, G. Goumas, and N. Koziris, "CSX: An extended
compression format for SpMV on shared memory systems," 16th ACM SIGPLAN
Annual Symposium on Principles and Practice of Parallel Programming
(PPoPP'11), San Antonio, TX, USA, February 12-16, 2011.

V. Karakasis, G. Goumas, K. Nikas, N. Koziris, J. Ruokolainen, and
P. Råback, "Using State-of-the-Art Sparse Matrix Optimizations for
Accelerating the Performance of Multiphysics Simulations," PARA 2012:
Workshop on State-of-the-Art in Scientific and Parallel Computing,
Helsinki, Finland, June 10-13, 2012.

2. Summary of changes from the previous version
===============================================

The CSX format and library have matured considerably since the first
release in February 2011. A summary of the changes follows:

* Complete re-implementation of the runtime code generation module using
  Clang and LLVM
* Efficient support of NUMA architectures
* Support for symmetric matrices
* Proof-of-concept non-preconditioned CG implementation using CSX
* Revised preprocessing (better sampling, multi-threading, careful memory
  management)
* Code and Makefile restructuring
* Support for multi-threaded compilation
* Efficient implementations, incl. NUMA-aware versions, of other sparse
  matrix storage formats, such as CSR, BCSR, VBL and SSS
* Self-contained library interface (libcsx.so) for use with sparse linear
  algebra software, specifically tuned for the Elmer multiphysics
  simulation software
* Several bug fixes

3. Prerequisites and external dependencies
==========================================

* A fairly recent Linux OS
* LLVM >= 2.9 and Clang
* Boost Regex Library >= 1.48
* numactl library >= 2.0.7
* Cairo library (optional)

4. Compiling the package
========================

This package is written in C and C++. We have compiled it successfully
with several GCC versions up to 4.6. Running `make' at the top-level
source directory will be enough to compile the whole package for a
typical installation of LLVM and Boost. If you have installed LLVM 2.9 in
a non-standard location that is not in your path, you can instruct the
compilation process to use your preferred LLVM 2.9 installation by typing

$ make llvmconf=/path/to/llvm-2.9/bin/llvm-config

If, for any reason, you want to alter the predefined preprocessor,
compiler and/or linker flags, you can use the standard CPPFLAGS, CFLAGS,
CXXFLAGS and LDFLAGS variables. You need not worry about the dependencies
within the CSX library, since we append the required flags to the
user-supplied ones. Finally, you can speed up the compilation by using
multiple tasks with the `-j' flag of make:

$ make -j8

5. Running CSX
==============

After you have successfully compiled CSX, the final executable, named
`spmv', will be placed inside the `csx' directory. The spmv executable
may be invoked as follows:

[ENV=<value>] ... ./spmv [-b | -s] <matrix_file> [...]

The execution of CSX is controlled by a set of environment variables and
takes as input one or more files describing sparse matrices. For each
matrix, `spmv' will perform an analysis of the patterns found and print
the results, the preprocessing time and the wall-clock time (and
performance in Mflop/s) for the execution of 128 consecutive SpMV
operations.

5.1 Input matrix format
-----------------------

The `spmv' executable requires the input matrices to be in a variation of
the Matrix Market Exchange format. Specifically, the first line of the
file must contain the number of rows, columns and non-zeros,
respectively, separated by whitespace. The rest of the lines contain the
nonzero elements of the matrix (one per line) in the form
`<row> <column> <value>', sorted lexicographically (i.e., row-major). To
facilitate the conversion to the desired format, the utility script
`scripts/sort-mtx.sh' is supplied. See also
http://math.nist.gov/MatrixMarket/formats.html
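For illustration, a minimal hand-written input file describing a
hypothetical 4x4 matrix with 6 non-zeros (assuming one-based row and
column indices, as in the Matrix Market format) would look as follows;
this file is not part of the distribution and only shows the expected
layout:

4 4 6
1 1 2.5
1 3 -1.0
2 2 3.0
3 1 1.5
3 4 0.5
4 4 4.0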
5.2 Environment
---------------

The execution of `spmv' is solely controlled by environment variables,
which are described in detail below.

** Variables controlling multithreaded execution

MT_CONF
    Set the CPU affinity for running `spmv'. This is the only way to set
    up a multithreaded execution for `spmv'. You should supply the CPU
    numbers (see /sys/devices/) of the desired CPUs as a comma-separated
    list. If MT_CONF is not specified, it is set to `0', thus assuming
    single-threaded execution.

** Variables controlling the construction of CSX

XFORM_CONF
    Set the pattern types to search for in the matrix. This is a
    comma-separated list of the pattern type IDs as specified in the
    enumeration `SpmIterOrder' in `spm.h'. The mapping between type IDs
    and types is the following:

     0 -> NONE
     1 -> HORIZONTAL
     2 -> VERTICAL
     3 -> DIAGONAL
     4 -> REV_DIAGONAL
     5 -> BLOCK_TYPE_START (not used)
     6 -> BLOCK_R1 (not used)
     7 -> BLOCK_R2
     8 -> BLOCK_R3
     9 -> BLOCK_R4
    10 -> BLOCK_R5
    11 -> BLOCK_R6
    12 -> BLOCK_R7
    13 -> BLOCK_R8
    14 -> BLOCK_COL_START (not used)
    15 -> BLOCK_C1 (not used)
    16 -> BLOCK_C2
    17 -> BLOCK_C3
    18 -> BLOCK_C4
    19 -> BLOCK_C5
    20 -> BLOCK_C6
    21 -> BLOCK_C7
    22 -> BLOCK_C8
    23 -> BLOCK_TYPE_END (not used)
    24 -> XFORM_MAX

    The default value of XFORM_CONF is `0', i.e., NONE. You can
    optionally enable the BLOCK_R1 and BLOCK_C1 types, i.e.,
    one-dimensional fixed size blocks, programmatically through the
    DRLE_Manager interface.

ENCODE_DELTAS
    Set specific deltas to encode for one-dimensional patterns or block
    dimensions for two-dimensional patterns. ENCODE_DELTAS is a
    comma-separated list of "delta descriptors". A delta descriptor is a
    tuple of the form `{d1,d2,...,dn}', where `di' is a specific delta to
    be encoded. For two-dimensional pattern types, `di' signifies the
    second dimension of the block pattern. The ENCODE_DELTAS variable
    must be specified along with XFORM_CONF, in which case there is a
    one-to-one mapping between the pattern types and the delta
    descriptors. For example, using XFORM_CONF=16,1,3 and
    ENCODE_DELTAS={4,7},{1},{11}, first `BLOCK_C2' patterns with a second
    dimension of 4 and 7, respectively (i.e., block patterns 4x2 and
    7x2), will be encoded into the matrix. Second, `HORIZONTAL' patterns
    with delta 1 will be encoded and, finally, `DIAGONAL' patterns with
    delta 11. The number of pattern types specified in XFORM_CONF must
    equal the number of delta descriptors specified in ENCODE_DELTAS.
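Putting the above together, a hypothetical invocation that searches for
the pattern types of the ENCODE_DELTAS example on four threads could look
as follows (the matrix file name is made up, and the value of
ENCODE_DELTAS is quoted only to keep some shells from interpreting the
braces):

$ MT_CONF=0,1,2,3 XFORM_CONF=16,1,3 ENCODE_DELTAS='{4,7},{1},{11}' \
      ./spmv matrix.mtx.sorted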
** Variables controlling preprocessing

The preprocessing phase of CSX can be optimized with the use of
statistical sampling. Preprocessing is done in multiple threads using the
thread configuration specified by the MT_CONF variable. The following
variables control the sampling process.

SAMPLES
    Specify the number of sampling windows to consider per thread during
    the search. This variable actually enables the optimized
    preprocessing. It can be used in conjunction with one of the
    following variables, which specify the sampling window size, either
    explicitly or implicitly. If the specified number of samples exceeds
    the maximum possible number of samples (depending on the sampling
    policy), it will be set to this maximum number. In this case, no
    statistical sampling takes place, but the preprocessing is simply
    performed with the use of preprocessing windows.

WINDOW_SIZE
    Set the preprocessing window size explicitly. The window size is set
    in terms of non-zero elements by default. The size of the window will
    be "rounded" up to the next full row. You can also sample the matrix
    row-wise, in which case WINDOW_SIZE is the number of rows of each
    window. However, to enable this policy, you should alter the sampling
    policy during the creation of the DRLE_Manager in `spmv.cc'. This
    kind of sampling is not recommended, since it cannot provide good
    coverage for irregular matrices.

SAMPLING_PORTION
    Specify the portion of the matrix to sample as a real number in
    [0,1]. If this variable is specified, the window size in non-zero
    elements will be computed automatically to meet the following
    requirement:

    SAMPLING_PORTION*non_zeros = SAMPLES*computed_win_size
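As a hypothetical example of the sampling variables (the matrix file name
and the chosen values are made up), the following run samples roughly 1%
of the matrix with 48 windows per thread, searching for horizontal and
vertical patterns on four threads:

$ MT_CONF=0,1,2,3 SAMPLES=48 SAMPLING_PORTION=0.01 XFORM_CONF=1,2 \
      ./spmv matrix.mtx.sorted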
5.3 Other options controlling execution
---------------------------------------

You can specify the following two options when running CSX:

-b  Disable the split-blocks optimization. The split-blocks optimization
    leads to better compression, since it improves the statistics (number
    of nonzero elements covered by a pattern) of smaller blocks, making
    them eligible for encoding.

-s  Use the version of CSX for symmetric matrices. In this variation, the
    elements of the main diagonal are stored separately and the lower
    triangular part of the matrix is stored in the CSX format.

5.4 Output
----------

The typical output of running CSX is the following (lines starting with
`#' are not part of the output, but serve as explanatory comments):

=== BEGIN BENCHMARK ===
##
## CSX set-up information
##
MT_CONF=0
Encoding type: HORIZONTAL, VERTICAL, DIAGONAL, REV_DIAGONAL,
__BLOCK_TYPE_START__, BLOCK_R1, BLOCK_R2, BLOCK_R3, BLOCK_R4, BLOCK_R5,
BLOCK_R6, BLOCK_R7, BLOCK_R8, __BLOCK_COL_START__, BLOCK_C1, BLOCK_C2,
BLOCK_C3, BLOCK_C4, BLOCK_C5, BLOCK_C6, BLOCK_C7, BLOCK_C8
Window size: Not set
Number of samples: Not set
Sampling portion: Not set
##
## Per-thread preprocessing results
##
## s: heuristic score
## <int>-> : the delta value or the second dimension of the block for
##           blocked substructures
## np: number of patterns encoded by every substructure instantiation
## nnz (<percent>): number of non-zeros encoded by every substructure
##                  instantiation, followed by the coverage percentage.
##                  This percentage is an estimation if sampling is used.
##
==> Thread: #0
HORIZONTAL    s:12
    1-> np:4  nnz: 16 (42.1053%)
VERTICAL      s:15
    1-> np:5  nnz: 20 (52.6316%)
DIAGONAL      s:8
    1-> np:1  nnz: 6 (15.7895%)
    2-> np:1  nnz: 4 (10.5263%)
REV_DIAGONAL  s:3
    1-> np:1  nnz: 4 (10.5263%)
BLOCK_R2      s:16
    2-> np:3  nnz: 12 (31.5789%)
    4-> np:1  nnz: 8 (21.0526%)
BLOCK_R3      s:10
    2-> np:2  nnz: 12 (31.5789%)
BLOCK_C2      s:19
    3-> np:1  nnz: 6 (15.7895%)
    4-> np:2  nnz: 16 (42.1053%)
BLOCK_C3      s:8
    3-> np:1  nnz: 9 (23.6842%)
##
## Substructure type selected for encoding; the detection process then
## starts over
##
Encode to BLOCK_C2
HORIZONTAL    s:3
    1-> np:1  nnz: 4 (10.5263%)
VERTICAL      s:3
    1-> np:1  nnz: 4 (10.5263%)
DIAGONAL      s:3
    2-> np:1  nnz: 4 (10.5263%)
REV_DIAGONAL  s:3
    1-> np:1  nnz: 4 (10.5263%)
Encode to HORIZONTAL
VERTICAL      s:3
    1-> np:1  nnz: 4 (10.5263%)
DIAGONAL      s:3
    2-> np:1  nnz: 4 (10.5263%)
REV_DIAGONAL  s:3
    1-> np:1  nnz: 4 (10.5263%)
Encode to VERTICAL
DIAGONAL      s:3
    2-> np:1  nnz: 4 (10.5263%)
REV_DIAGONAL  s:3
    1-> np:1  nnz: 4 (10.5263%)
Encode to DIAGONAL
REV_DIAGONAL  s:3
    1-> np:1  nnz: 4 (10.5263%)
Encode to REV_DIAGONAL
##
## The substructure types that were selected for encoding, in selection
## order
##
Encoding sequence: BLOCK_C2, HORIZONTAL, VERTICAL, DIAGONAL, REV_DIAGONAL
##
## The exact instantiations that were encoded
##
## delta:<int>     : the delta value of the instantiation
## dim:<int>x<int> : the block dimensions of the encoded blocks
## npatterns:<int> : the number of patterns encoded by the instantiation
## nnz:<int>       : the number of non-zeros encoded by the instantiation
##
type:HORIZONTAL delta:1 npatterns:1 nnz:4
type:VERTICAL delta:1 npatterns:1 nnz:4
type:DIAGONAL delta:2 npatterns:1 nnz:4
type:REV_DIAGONAL delta:1 npatterns:1 nnz:4
type:BLOCK_C2 dim:3x2 npatterns:1 nnz:6
type:BLOCK_C2 dim:4x2 npatterns:2 nnz:16
Checking ... Check Passed
##
## Performance info
##
## m:<method> : method used (csx or csx-sym)
## f:<file>   : the matrix file (basename)
## s:<size>   : size in bytes of the CSX format
## pt:<time>  : preprocessing time in seconds
## t:<time>   : execution time of 128 SpMV operations in seconds
## r:<perf>   : performance in Mflop/s
##
m:csx f:demopatt.mtx.sorted s:327 pt:0.063501 t:0.000086 r:113.363468
=== END BENCHMARK ===

6. Running CG
=============

After you have successfully compiled CG, the final executable, named
`cg', will be placed inside the `cg' directory. The cg executable may be
invoked as follows:

[ENV=<value>] ... ./cg [-s | -x | -l <n> | -L <n>] <matrix_file>

The execution of CG is controlled by a set of environment variables and
command line parameters and takes as input a file describing the sparse
matrix. The solution vector (s) of the equation Ax = b is randomized with
double precision elements from -1000 to 1000, the vector b is computed
from the multiplication b = As and the estimated solution vector (x) is
initialized with elements equal to 0.01. The result shows the square
distance between the estimated vector (x) and the real solution vector
(s), the time breakdown into the SpMV multiplication phase, the SpMV
reduction phase and the remaining vector operations, and the total CG
time.

6.1 Input matrix format
-----------------------

See 5.1.

6.2 Environment
---------------

See 5.2.

6.3 Other options controlling execution
---------------------------------------

You can control the execution of the CG method with the following four
command line options:

-x      Activate the CSX format. By default (if this option is not set),
        CSR will be used.

-s      Switch to the respective symmetric formats: instead of CSR, SSS
        will be deployed and instead of CSX, CSX-Sym will be used.

-l <n>  Specify the maximum number of iterations of the CG method
        (default is 1024).

-L <n>  Specify how many times to run the specified benchmark (default
        is once).
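For example (the matrix file name is made up), a four-thread CG run that
uses CSX with horizontal patterns and at most 2048 iterations could be
launched as:

$ MT_CONF=0,1,2,3 XFORM_CONF=1 ./cg -x -l 2048 matrix.mtx.sorted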
6.4 Output
----------

The typical output of running CG is the following (lines starting with
`#' are not part of the output, but serve as explanatory comments):

##
## The CPU configuration
##
MT_CONF: 0,1,2,3
##
## The output of the specified format (if CSX or CSX-Sym is activated,
## the details about stats, encoding, etc. are shown)
##
...
##
## Performance info
##
## m:<matrix>    : the matrix file (basename)
## l:<loops>     : number of iterations of CG
## rd:<distance> : square distance from the actual solution
## st:<time>     : SpMxV multiplication operations time
## rt:<time>     : SpMxV reduction operations time
## ct:<time>     : CG total time
##
## * the time of the remaining vector operations of CG can be found by
##   subtracting 'st' and 'rt' from 'ct'.
##
m:26.nd12k.mtx.sorted l:1024 rd:0.004506 st:10.816884 rt:0.145372 ct:11.367220

7. Licence
==========

This package is distributed under the BSD Licence. See `LICENCE.txt' for
more information.

8. Authors
==========

This package is maintained by

Vasileios Karakasis   <[email protected]>
Theodoros Gkountouvas <[email protected]>
Kornilios Kourtis     <[email protected]>