
Experiments with sparse Cholesky using a sequential task-flow implementation

This work is supported by the NLAFET Project funded by the European Union's Horizon 2020 Research and Innovation Programme under Grant Agreement 671633.

Abstract
We describe the development of a prototype code for the solution of large sparse symmetric positive definite systems that is efficient on parallel architectures. We implement a DAG-based Cholesky factorization that offers good performance and scalability on multicore architectures. Our approach uses a runtime system to execute the DAG. The runtime system acts as a software layer between the application and the architecture, handling both the management of task dependencies and the task scheduling. In this model, the application is expressed using a high-level API that is independent of the hardware details, enabling portability across different architectures. Although widely used in dense linear algebra, this approach is challenging for sparse algorithms because of the irregularity and variable granularity of the DAGs arising in these problems. We assess the ability of two different Sequential Task Flow (STF) implementations to address this challenge: one using the OpenMP standard, and the other using the modern runtime system StarPU. We compare these implementations with the state-of-the-art solver HSL_MA87 and demonstrate comparable performance on a multicore architecture.

    Mathematics Subject Classification: 65F05, 65F50, 65Y05.

    Figure 2.1.  Simple assembly tree with three supernodes partitioned into square blocks of order $ \mathtt{nb} $

    Figure 2.2.  DAG corresponding to the factorization of the tree in Figure 2.1

    Figure 2.3.  Pseudo-code for the sequential version of the task-based sparse Cholesky factorization
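The pseudo-code itself is not reproduced on this page. As a rough guide, the sketch below gives the general shape of a sequential blocked Cholesky of one supernode; the tile accessor `A(i,j)` and the kernel names are assumptions for illustration (stand-ins for the usual potrf/trsm/syrk-gemm calls), and the updates performed between supernodes are omitted.

```c
/* Minimal sketch: sequential blocked Cholesky of one supernode whose
 * lower-triangular part is split into nblk x nblk tiles. A(i,j) and the
 * three kernels are hypothetical stand-ins for LAPACK/BLAS calls
 * (potrf, trsm, syrk/gemm); inter-supernode updates are omitted. */
extern double *A(int i, int j);                          /* tile accessor (assumed) */
extern void factorize_block(double *akk);                /* Cholesky of a diagonal tile */
extern void solve_block(const double *akk, double *aik); /* triangular solve of a tile */
extern void update_block(const double *aik, const double *ajk, double *aij);

void factorize_node(int nblk) {
    for (int k = 0; k < nblk; k++) {
        factorize_block(A(k, k));                        /* factorize diagonal tile */
        for (int i = k + 1; i < nblk; i++)
            solve_block(A(k, k), A(i, k));               /* solve sub-diagonal tiles */
        for (int j = k + 1; j < nblk; j++)               /* update trailing tiles */
            for (int i = j; i < nblk; i++)
                update_block(A(i, k), A(j, k), A(i, j));
    }
}
```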

    Figure 3.1.  Simple example of a sequential code

    Figure 3.2.  STF code corresponding to Figure 3.1 example

    Figure 3.3.  DAG corresponding to the sequential code presented in Figure 3.1
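The figures' example code is not reproduced here. The following stand-in has the same flavour: three kernel calls whose reads and writes induce a small DAG, with `g` and `h` both depending on `f` through `x` but not on each other. The kernel names and bodies are invented for illustration.

```c
/* Stand-in for a Figure 3.1-style sequential code: the data accesses of
 * three calls define a DAG with f as the root and g, h as independent
 * children (both read x; neither touches the other's output). */
static void f(double *x)                  { *x = 1.0; }      /* write x */
static void g(const double *x, double *y) { *y = *x + 2.0; } /* read x, write y */
static void h(const double *x, double *z) { *z = *x * 3.0; } /* read x, write z */

int main(void) {
    double x, y, z;
    f(&x);
    g(&x, &y);
    h(&x, &z);
    return 0;
}
```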

    Figure 4.1.  Simple example of a parallel version of the sequential code in Figure 3.1 using an STF model with OpenMP
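Using the stand-in above, an OpenMP STF parallelization follows the pattern sketched below: the sequential call order is preserved, each call becomes a task, and the `depend` clauses let the runtime derive the DAG. This illustrates the mechanism the figure describes; it is not the figure's exact code.

```c
#include <omp.h>

/* Kernels from the sequential stand-in above. */
void f(double *x);
void g(const double *x, double *y);
void h(const double *x, double *z);

int main(void) {
    double x, y, z;
    #pragma omp parallel
    #pragma omp single            /* one thread submits, all threads execute */
    {
        #pragma omp task shared(x) depend(out: x)
        f(&x);
        #pragma omp task shared(x, y) depend(in: x) depend(out: y)
        g(&x, &y);                /* runs after f */
        #pragma omp task shared(x, z) depend(in: x) depend(out: z)
        h(&x, &z);                /* runs after f, concurrently with g */
    }                             /* implicit barrier: all tasks complete here */
    return 0;
}
```

Note that local variables referenced inside a task are firstprivate by default, hence the explicit `shared` clauses; the `depend` clauses alone do not change data-sharing.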

    Figure 4.2.  Simple example of a parallel version of the sequential code in Figure 3.1 using an STF model with StarPU
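A StarPU rendering of the same stand-in is sketched below. The calls `starpu_init`, `starpu_variable_data_register`, `starpu_task_insert` with `STARPU_R`/`STARPU_W` access modes, and `starpu_task_wait_for_all` are the real StarPU API; the codelets `cl_f`, `cl_g`, `cl_h` wrapping the kernels are assumed to be defined elsewhere.

```c
#include <stdint.h>
#include <starpu.h>

/* Codelets wrapping the f, g, h kernels; assumed defined elsewhere. */
extern struct starpu_codelet cl_f, cl_g, cl_h;

int main(void) {
    double x, y, z;
    starpu_data_handle_t hx, hy, hz;

    starpu_init(NULL);
    /* Register each variable so the runtime can track accesses to it. */
    starpu_variable_data_register(&hx, STARPU_MAIN_RAM, (uintptr_t)&x, sizeof(x));
    starpu_variable_data_register(&hy, STARPU_MAIN_RAM, (uintptr_t)&y, sizeof(y));
    starpu_variable_data_register(&hz, STARPU_MAIN_RAM, (uintptr_t)&z, sizeof(z));

    /* Asynchronous submission; the access modes encode the DAG of Figure 3.3. */
    starpu_task_insert(&cl_f, STARPU_W, hx, 0);
    starpu_task_insert(&cl_g, STARPU_R, hx, STARPU_W, hy, 0);
    starpu_task_insert(&cl_h, STARPU_R, hx, STARPU_W, hz, 0);

    starpu_task_wait_for_all();            /* wait for the whole DAG */
    starpu_data_unregister(hx);
    starpu_data_unregister(hy);
    starpu_data_unregister(hz);
    starpu_shutdown();
    return 0;
}
```

Each submission returns immediately; the runtime recovers the DAG from the sequence of accesses to `hx`, `hy` and `hz`.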

    Figure 4.3.  Illustration of the dynamic scheduling strategy of tasks in the runtime system

    Figure 5.1.  Pseudo-code for the sparse Cholesky factorization using the STF model presented in Section 3

    Figure 5.2.  Submission routine used for the solve tasks in the StarPU code

    Figure 5.3.  Submission routine used for the solve tasks in the OpenMP code
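Neither submission routine is reproduced on this page. As an indication of the shape such a routine takes, here is a hypothetical OpenMP version: `dblk` points to the factorized diagonal block, `blk` to the sub-diagonal block being solved, and `solve_block_kernel` stands in for the underlying trsm call; all names are assumptions.

```c
/* Hypothetical submission routine for one solve task: it reads the
 * factorized diagonal block dblk and updates the sub-diagonal block blk.
 * The first entry of each block serves as its dependency sentinel. */
extern void solve_block_kernel(const double *dblk, double *blk,
                               int m, int nb);   /* trsm stand-in (assumed) */

void submit_solve(const double *dblk, double *blk, int m, int nb) {
    #pragma omp task depend(in: dblk[0]) depend(inout: blk[0]) \
                     firstprivate(dblk, blk, m, nb)
    solve_block_kernel(dblk, blk, m, nb);
}
```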

    Figure 7.1.  Performance results on 28 cores for the OpenMP and StarPU versions of SpLLT and for HSL_MA87 on the test matrices presented in Table A.1

    Figure 7.2.  Performance of the OpenMP and StarPU versions of SpLLT relative to the performance obtained with HSL_MA87

    Figure 7.3.  Efficiency analysis for a subset of the test matrices including the parallel efficiency (top-left), the task efficiency (top-right), the runtime efficiency (bottom-left) and the pipeline efficiency (bottom-right)
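These four metrics multiply together in the standard efficiency analysis for task-based codes. Assuming the usual definitions (the paper's precise ones are in the full text, not on this page), with $t(1)$ the sequential execution time and $t(p)$ the parallel time on $p$ cores, the decomposition reads

$$ e(p) \;=\; \frac{t(1)}{p\, t(p)} \;=\; e_{\mathrm{task}} \times e_{\mathrm{runtime}} \times e_{\mathrm{pipeline}}, $$

where, roughly, $e_{\mathrm{task}}$ captures the loss of kernel efficiency at task granularity, $e_{\mathrm{runtime}}$ the overhead of the runtime system, and $e_{\mathrm{pipeline}}$ the idle time due to insufficient concurrency.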

    Figure 7.4.  Main submission routine executed by the master thread. This performs both the initialisation and factorization tasks for the nodes in the assembly tree

    Figure 7.5.  Kernel for the $ \mathtt{insert\_factorize\_node} $ task in charge of submitting the factorization tasks for a given node
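Together with Figure 7.4, this describes a two-level submission scheme: the master thread submits one lightweight $ \mathtt{insert\_factorize\_node} $ task per node, and that task, run by any worker, submits the node's actual factorization tasks, so task insertion no longer serializes on the master thread. A minimal StarPU-flavoured sketch follows; the struct and helper names are hypothetical.

```c
#include <starpu.h>

struct node;                                   /* assembly-tree node (assumed) */
extern void submit_node_tasks(struct node *);  /* flat per-node submission (assumed) */

/* Task body: unpack the node pointer and submit its factorization tasks.
 * Dependencies are still inferred from the data handles used by those
 * tasks, so the resulting DAG matches flat submission by the master. */
static void insert_factorize_node(void *buffers[], void *cl_arg) {
    struct node *nd;
    starpu_codelet_unpack_args(cl_arg, &nd);
    submit_node_tasks(nd);
    (void)buffers;
}

static struct starpu_codelet cl_insert = {
    .cpu_funcs = { insert_factorize_node },
    .nbuffers  = 0,                            /* no data: a pure submission task */
};

void factorize_tree(struct node **nodes, int nnodes) {
    for (int i = 0; i < nnodes; i++)           /* master: one cheap task per node */
        starpu_task_insert(&cl_insert,
                           STARPU_VALUE, &nodes[i], sizeof(nodes[i]), 0);
    starpu_task_wait_for_all();
}
```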

    Figure 7.6.  Illustration of the tree pruning strategy used by SpLLT
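The pruning idea can be sketched as a recursive traversal: a subtree whose estimated work falls below a cut-off is factorized by a single sequential task, replacing the many tiny tasks that would otherwise be generated near the leaves. The `flops` field and the `small` threshold below are hypothetical names for whatever cost estimate and cut-off SpLLT actually uses.

```c
/* Tree-pruning sketch: small subtrees become one sequential task each;
 * large subtrees are recursed into and get fine-grained block tasks. */
struct node {
    double flops;                   /* estimated work of the whole subtree (assumed) */
    struct node *first_child, *next_sibling;
};

extern void submit_sequential_subtree_task(struct node *);  /* assumed helper */
extern void submit_node_tasks(struct node *);               /* assumed helper */

void submit_subtree(struct node *nd, double small) {
    if (nd->flops <= small) {
        submit_sequential_subtree_task(nd);      /* one task for the whole subtree */
    } else {
        for (struct node *c = nd->first_child; c; c = c->next_sibling)
            submit_subtree(c, small);            /* recurse into the children */
        submit_node_tasks(nd);                   /* fine-grained tasks for nd only */
    }
}
```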

    Figure 7.7.  Performance results on 28 cores for the StarPU version of SpLLT and for HSL_MA87 on the test matrices presented in Table A.1. We also show the impact of tree pruning on the performance of SpLLT

    Figure 7.8.  Performance results on 28 cores for the OpenMP version of SpLLT and for HSL_MA87 on the test matrices presented in Table A.1. We also show the impact of tree pruning on the performance of SpLLT

    Table 7.1.  Factorization times (seconds) obtained with HSL_MA87 and with the OpenMP and StarPU versions of SpLLT. The factorizations were run on 28 cores with block sizes $ \mathtt{nb} \in \{256, 384, 512, 768, 1024\} $ and $ \mathtt{nemin = 32} $.

| # | Name | MA87: nb | MA87: factor. (s) | SpLLT OpenMP (gnu): nb | SpLLT OpenMP (gnu): factor. (s) | SpLLT StarPU: nb | SpLLT StarPU: factor. (s) |
|---|------|---------:|------------------:|-----------------------:|--------------------------------:|-----------------:|--------------------------:|
| 1 | Schmid/thermal2 | 1024 | 0.391 | 1024 | 1.742 | 512 | 2.215 |
| 2 | Rothberg/gearbox | 256 | 0.253 | 384 | 0.236 | 512 | 0.339 |
| 3 | DNVS/m_t1 | 256 | 0.206 | 256 | 0.208 | 1024 | 0.298 |
| 4 | Boeing/pwtk | 768 | 0.249 | 768 | 0.262 | 768 | 0.396 |
| 5 | Chen/pkustk13 | 256 | 0.218 | 256 | 0.242 | 384 | 0.344 |
| 6 | GHS_psdef/crankseg_1 | 256 | 0.233 | 256 | 0.234 | 384 | 0.281 |
| 7 | Rothberg/cfd2 | 256 | 0.242 | 256 | 0.256 | 384 | 0.386 |
| 8 | DNVS/thread | 256 | 0.236 | 256 | 0.223 | 384 | 0.222 |
| 9 | DNVS/shipsec8 | 256 | 0.253 | 256 | 0.234 | 384 | 0.380 |
| 10 | DNVS/shipsec1 | 256 | 0.245 | 256 | 0.270 | 384 | 0.398 |
| 11 | GHS_psdef/crankseg_2 | 256 | 0.267 | 256 | 0.268 | 384 | 0.326 |
| 12 | DNVS/fcondp2 | 256 | 0.294 | 384 | 0.347 | 384 | 0.479 |
| 13 | Schenk_AFE/af_shell3 | 256 | 0.437 | 512 | 0.550 | 512 | 0.906 |
| 14 | DNVS/troll | 256 | 0.381 | 384 | 0.434 | 512 | 0.599 |
| 15 | AMD/G3_circuit | 256 | 0.607 | 768 | 2.534 | 768 | 3.561 |
| 16 | GHS_psdef/bmwcra_1 | 256 | 0.339 | 256 | 0.356 | 512 | 0.466 |
| 17 | DNVS/halfb | 256 | 0.370 | 256 | 0.442 | 384 | 0.631 |
| 18 | Um/2cubes_sphere | 256 | 0.321 | 384 | 0.451 | 512 | 0.634 |
| 19 | GHS_psdef/ldoor | 256 | 0.620 | 512 | 1.058 | 768 | 1.726 |
| 20 | DNVS/ship_003 | 256 | 0.361 | 384 | 0.457 | 512 | 0.554 |
| 21 | DNVS/fullb | 256 | 0.445 | 256 | 0.469 | 384 | 0.651 |
| 22 | GHS_psdef/inline_1 | 256 | 0.659 | 384 | 0.711 | 768 | 0.995 |
| 23 | Chen/pkustk14 | 256 | 0.557 | 256 | 0.576 | 512 | 0.768 |
| 24 | GHS_psdef/apache2 | 256 | 0.710 | 768 | 1.448 | 512 | 2.053 |
| 25 | Koutsovasilis/F1 | 384 | 0.776 | 384 | 0.829 | 768 | 0.981 |
| 26 | Oberwolfach/boneS10 | 256 | 1.102 | 256 | 1.185 | 768 | 1.702 |
| 27 | ND/nd12k | 384 | 1.452 | 384 | 1.479 | 768 | 1.611 |
| 28 | JGD_Trefethen/Trefethen_20000 | 768 | 4.233 | 512 | 3.700 | 768 | 3.843 |
| 29 | ND/nd24k | 384 | 5.350 | 512 | 5.570 | 768 | 5.485 |
| 30 | Janna/Flan_1565 | 384 | 7.722 | 512 | 7.921 | 768 | 8.419 |
| 31 | Oberwolfach/bone010 | 384 | 7.117 | 768 | 7.350 | 768 | 7.525 |
| 32 | Janna/StocF-1465 | 768 | 8.337 | 768 | 9.455 | 768 | 9.869 |
| 33 | GHS_psdef/audikw_1 | 384 | 10.419 | 768 | 10.630 | 768 | 10.680 |
| 34 | Janna/Fault_639 | 384 | 14.392 | 768 | 14.350 | 768 | 14.260 |
| 35 | Janna/Hook_1498 | 512 | 17.019 | 384 | 16.860 | 768 | 16.960 |
| 36 | Janna/Emilia_923 | 768 | 23.221 | 384 | 22.620 | 768 | 23.060 |
| 37 | Janna/Geo_1438 | 768 | 29.654 | 768 | 29.250 | 1024 | 29.380 |
| 38 | Janna/Serena | 768 | 52.830 | 384 | 51.170 | 768 | 51.900 |

    Table 7.2.  DAG information along with the times (seconds) spent by the master thread to submit all the tasks to the runtime system on a subset of our test matrices

| # | Name | DAG stats: # tasks | DAG stats: avg. time/task (ms) | SpLLT (StarPU): task insert (s) | SpLLT (StarPU): factor. time (s) |
|---|------|-------------------:|-------------------------------:|--------------------------------:|---------------------------------:|
| 1 | Schmid/thermal2 | 166695 | 0.071 | 2.213 | 2.215 |
| 2 | Rothberg/gearbox | 19392 | 0.417 | 0.326 | 0.339 |
| 8 | DNVS/thread | 7481 | 0.746 | 0.155 | 0.222 |
| 13 | Schenk_AFE/af_shell3 | 65195 | 0.250 | 0.906 | 0.906 |
| 15 | AMD/G3_circuit | 290235 | 0.154 | 3.558 | 3.561 |
| 19 | GHS_psdef/ldoor | 126786 | 0.230 | 1.724 | 1.726 |
| 24 | GHS_psdef/apache2 | 163804 | 0.264 | 2.052 | 2.053 |
| 30 | Janna/Flan_1565 | 226486 | 1.007 | 3.452 | 8.419 |
| 35 | Janna/Hook_1498 | 639060 | 0.867 | 8.441 | 16.960 |
| 38 | Janna/Serena | 795367 | 1.857 | 10.920 | 51.900 |

    Table A.1.  Test matrices and their characteristics without node amalgamation. $n$ is the matrix order, $nz(A)$ is the number of entries in the matrix $A$, $nz(L)$ is the number of entries in the factor $L$, and Flops is the operation count for the matrix factorization

| # | Name | $n$ ($\times 10^3$) | $nz(A)$ ($\times 10^6$) | $nz(L)$ ($\times 10^6$) | Flops ($\times 10^9$) | Application/Description |
|---|------|--------------------:|------------------------:|------------------------:|----------------------:|-------------------------|
| 1 | Schmid/thermal2 | 1228 | 4.9 | 51.6 | 14.6 | Unstructured thermal FEM |
| 2 | Rothberg/gearbox | 154 | 4.6 | 37.1 | 20.6 | Aircraft flap actuator |
| 3 | DNVS/m_t1 | 97.6 | 4.9 | 34.2 | 21.9 | Tubular joint |
| 4 | Boeing/pwtk | 218 | 5.9 | 48.6 | 22.4 | Pressurised wind tunnel |
| 5 | Chen/pkustk13 | 94.9 | 3.4 | 30.4 | 25.9 | Machine element |
| 6 | GHS_psdef/crankseg_1 | 52.8 | 5.3 | 33.4 | 32.3 | Linear static analysis |
| 7 | Rothberg/cfd2 | 123 | 1.6 | 38.3 | 32.7 | CFD pressure matrix |
| 8 | DNVS/thread | 29.7 | 2.2 | 24.1 | 34.9 | Threaded connector |
| 9 | DNVS/shipsec8 | 115 | 3.4 | 35.9 | 38.1 | Ship section |
| 10 | DNVS/shipsec1 | 141 | 4.0 | 39.4 | 38.1 | Ship section |
| 11 | GHS_psdef/crankseg_2 | 63.8 | 7.1 | 43.8 | 46.7 | Linear static analysis |
| 12 | DNVS/fcondp2 | 202 | 5.7 | 52.0 | 48.2 | Oil production platform |
| 13 | Schenk_AFE/af_shell3 | 505 | 9.0 | 93.6 | 52.2 | Sheet metal forming |
| 14 | DNVS/troll | 214 | 6.1 | 64.2 | 55.9 | Structural analysis |
| 15 | AMD/G3_circuit | 1586 | 4.6 | 97.8 | 57.0 | Circuit simulation |
| 16 | GHS_psdef/bmwcra_1 | 149 | 5.4 | 69.8 | 60.8 | Automotive crankshaft |
| 17 | DNVS/halfb | 225 | 6.3 | 65.9 | 70.4 | Half-breadth barge |
| 18 | Um/2cubes_sphere | 102 | 0.9 | 45.0 | 74.9 | Electromagnetics |
| 19 | GHS_psdef/ldoor | 952 | 23.7 | 144.6 | 78.3 | Large door |
| 20 | DNVS/ship_003 | 122 | 4.1 | 60.2 | 81.0 | Ship structure |
| 21 | DNVS/fullb | 199 | 6.0 | 74.5 | 100.2 | Full-breadth barge |
| 22 | GHS_psdef/inline_1 | 504 | 18.7 | 172.9 | 144.4 | Inline skater |
| 23 | Chen/pkustk14 | 152 | 7.5 | 106.8 | 146.4 | Tall building |
| 24 | GHS_psdef/apache2 | 715 | 2.8 | 134.7 | 174.3 | 3D structural problem |
| 25 | Koutsovasilis/F1 | 344 | 13.6 | 173.7 | 218.8 | AUDI engine crankshaft |
| 26 | Oberwolfach/boneS10 | 915 | 28.2 | 278.0 | 281.6 | Bone micro-FEM |
| 27 | ND/nd12k | 36.0 | 7.1 | 116.5 | 505.0 | 3D mesh problem |
| 28 | JGD_Trefethen/Trefethen_20000 | 20.0 | 0.3 | 90.7 | 652.6 | Integer matrix |
| 29 | ND/nd24k | 72.0 | 14.4 | 321.6 | 2054.4 | 3D mesh problem |
| 30 | Janna/Flan_1565 | 1565 | 59.5 | 1477.9 | 3859.8 | 3D mechanical problem |
| 31 | Oberwolfach/bone010 | 987 | 36.3 | 1076.4 | 3876.2 | Bone micro-FEM |
| 32 | Janna/StocF-1465 | 1465 | 11.2 | 1126.1 | 4386.6 | Underground aquifer |
| 33 | GHS_psdef/audikw_1 | 944 | 39.3 | 1242.3 | 5804.1 | Automotive crankshaft |
| 34 | Janna/Fault_639 | 639 | 14.6 | 1144.7 | 8283.9 | Gas reservoir |
| 35 | Janna/Hook_1498 | 1498 | 31.2 | 1532.9 | 8891.3 | Steel hook |
| 36 | Janna/Emilia_923 | 923 | 21.0 | 1729.9 | 13661.1 | Gas reservoir |
| 37 | Janna/Geo_1438 | 1438 | 32.3 | 2467.4 | 18058.1 | Underground deformation |
| 38 | Janna/Serena | 1391 | 33.0 | 2761.7 | 30048.9 | Gas reservoir |
