HPC (High Performance Computing) bookmarks

Learning and practice of high performance computing
CFD
- Algebraic Flux Correction I. Scalar Conservation Laws
- Algebraic Flux Correction II. Compressible Euler Equations
- CFD Notes by Hiroaki Nishikawa
  - CFD codes in f90
- Tim Warburton's github repositories
  - Nodal Discontinuous Galerkin
- Hybrid and Easy Discontinuous Galerkin Environment
- Element-based-Galerkin-Methods
- Extreme-scale Discontinuous Galerkin Environment (EDGE)
- The Development, Verification, and Validation of a Discontinuous Galerkin Method for the Navier-Stokes Equations
- What are possible methods to solve compressible Euler equations
- I do like CFD, VOL.1, Second Edition
- Free CFD Codes
- CFD Julia: A Learning Module Structuring an Introductory Course on Computational Fluid Dynamics
  - CFD_Julia
- Lattice Boltzmann codes
  - A lattice Boltzmann code for complex fluids
- Hydrodynamics in OpenCL
  - Roe, HLL, HLLC, Burgers Scheme; 1D, 2D, 3D; Euler Equation, (~Maxwell), (~MHD), ADM Solver in OpenCL
  - Differential Geometry Tensor Library
- Implementing the discontinuous Galerkin method in CUDA
  - Marty Fuhry's Homepage
  - The Development, Verification, and Validation of a Discontinuous Galerkin Method for the Navier-Stokes Equations
- SU2
  - github repo for SU2
- HiFiLES: High Fidelity Large Eddy Simulation
  - github repo for HiFiLES
- PyWENO + PyPFASST
- Clawpack Repositories
- Riemann Problems and Jupyter Solutions
- Shenfun is a high performance computing platform for solving partial differential equations (PDEs) by the spectral Galerkin method
- Multilayered Abstractions for Partial Differential Equations by Graham Robert Markall: good review on Nektar++ etc.
  - Making Faster FEM Solvers, Faster MPhil Transfer Report By Graham Markall
  - Graham Markall
- Nektar++
  - Nektar++: An efficient h to p finite element framework
  - Nektar++: a high-order finite element framework
- Simple (and not-so-simple) CFD solvers written in Fortran with Python plotting routines
- MAESTROeX solves the equations of low Mach number hydrodynamics for stratified atmospheres/full spherical stars with a general equation of state, and nuclear reaction networks in an adaptive-grid finite-volume framework. It includes reactions and thermal diffusion and can be used on anything from a single core to 100,000s of processor cores with MPI + OpenMP or 1,000s of GPUs
  - Model stars and atmospheres with MAESTROeX
- Is there a good tutorial or textbook-like source on implementing ENO/WENO with limiters in one (and more than one) dimension?
  - PyWENO
- blitzdg is an open-source library offering discontinuous Galerkin (dg) solvers for common partial differential equations systems using blitz++ for array and tensor manipulations in a C++ environment or NumPy as a Python 3 library
  - Derek Steinmoeller's blog
- DGSWEM V2
- adaptive multiresolution DG
- Exasim - Generating Discontinuous Galerkin Codes For Extreme Scalable Simulations
- Chebyshev Pseudo-Spectral Method (PSM)
  - Chebyshev Polynomials J.C. Mason D.C. Handscomb
  - A brief introduction to pseudo-spectral methods: application to diffusion problems
  - Spectral methods in python
  - An Introduction to Domain Decomposition Methods:algorithms, theory and parallel implementation
  - Chebyshev-Legendre Spectral Domain Decomposition Method for Two-Dimensional Vorticity Equations
  - Domain Decomposition Methods for Mortar Finite Elements
  - An efficient domain-decomposition pseudo-spectral method for solving elliptic differential equations
  - A Pseudospectral Multi-Domain Method for the Incompressible Navier-Stokes Equations
  - Deep Domain Decomposition Method: Elliptic Problems
  - How to Design an Efficient Pseudospectral Code
    - code for How to Design an Efficient Pseudospectral Code
  - Dedalus is a framework for solving a broad range of partial differential equations using spectral methods, including initial-value, boundary-value, and generalized eigenvalue problems
    - Dedalus is a flexible framework for solving partial differential equations using spectral methods
  - multiple-interval pseudospectral methods to solve optimal control problems
  - pizza is a high-performance numerical code for quasi-geostrophic and non-rotating convection in a 2-D annulus geometry
  - FDBB (Fluid Dynamics Building Blocks) is a C++ expression template library for fluid dynamics
    - FDBB - Fluid Dynamics Building Blocks
- Finite Element Methods (FEM) and Spectral Element Methods (SEM)
  - deal.II — an open source finite element library
    - Amandus: Simulations based on multilevel Schwarz methods Documentation
  - Feel++ finite element embedded library in C++
    - Feel++: Finite Element Embedded Library in C++
  - Veamy: an extensible object-oriented C++ library for the virtual element method
    - Veamy: an extensible object-oriented C++ library for the virtual element method
  - Two dimensional high-order spectral element method fluid dynamics solver
    - Two dimensional high-order spectral element method fluid dynamics solver
  - github: ITHACA-SEM - In real Time Highly Advanced Computational Applications for Spectral Element Methods
    - ITHACA-SEM - In real Time Highly Advanced Computational Applications for Spectral Element Methods
  - AxiSEM is a parallel spectral-element method to solve 3D wave propagation in a sphere with axisymmetric or spherically symmetric visco-elastic, acoustic, anisotropic structures
  - HDGlab: An open-source implementation of the hybridisable discontinuous Galerkin method in MATLAB
    - HDGlab - A Matlab implementation of the hybridisable discontinuous Galerkin (HDG) method
  - Euler Equations for Ideal Gases
    - Split form nodal discontinuous Galerkin schemes with summation-by-parts property for the compressible Euler equations
- Siemens
  - Embedded Multicore Building Blocks (EMB²)
- Maxeler
  - Maxeler Technologies - Maximum Performance Computing
- facilities to experiment with Discontinuous Petrov Galerkin (DPG) methods
  - Research papers of Jay Gopalakrishnan
- Free CFD codes
  - Code_Saturne
  - Large-Scale CFD Parallel Computing Dealing with Massive Mesh
- Open source tools in technical photorealistic large-scale visualisation
- An Open Source CFD-DEM Perspective
- 3D, block structured, explicit/implicit, Navier-Stokes solver
  - An evaluation of the Eigen linear algebra library for use in the aither CFD solver
    - A look at the performance of expression templates in C++: Eigen vs Blaze vs Fastor vs Armadillo vs XTensor
- CFD + GPU
  - Recent progress and challenges in exploiting graphics processors in computational fluid dynamics: slightly outdated but interesting
    - Laplace solver running on GPU using CUDA, with CPU version for comparison, slightly outdated
  - PyFR
- Camellia
  - Camellia Discontinuous Petrov-Galerkin github repository
Co-design at Lawrence Livermore National Lab
- Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH)
- DoE Exascale Co-Design Center for Materials in Extreme Environments : Extreme Materials at Extreme Scale
  - Programming Models - Languages and tools for developing multi-scale applications.
  - Terra is a new low-level system programming language that is designed to interoperate seamlessly with the Lua programming language
List of quantum chemistry and solid-state physics software
- CP2K
  - Mirror of official svn repository at sourceforge. Synced every 5 minutes.
  - Accelerated Sparse Matrix Multiplication for Quantum Chemistry with CP2K on Hybrid Supercomputers
Evaluation of C, Go, and Rust in the HPC environment
Modern Fortran
- NNSA, national labs team with Nvidia to develop open-source Fortran compiler technology
- Flang is a ground-up implementation of a Fortran front end written in modern C++. It started off as the f18 project
  - F18 is a front-end for Fortran intended to replace the existing front-end in the Flang compiler
    tl;dr: 301 moved. The code from this repository can now be found at flang.
  - Flang and F18
  - Installing LLVM Flang Fortran compiler
    tl;dr
```
git clone https://github.com/llvm/llvm-project
mkdir -p llvm-project/build
cd llvm-project/build
cmake ../llvm -DLLVM_ENABLE_PROJECTS=flang
```
    - [Unknown CMake command “tablegen”](https://stackoverflow.com/questions/59691069/unknown-cmake-command-tablegen)
libCEED: the CEED Library: Code for Efficient Extensible Discretization
- CEED Library: Code for Efficient Extensible Discretization
  - MFEM is a free, lightweight, scalable C++ library for finite element methods
Modern trends in programming of GPUs DAQFEET 2021
- Toward Performance-Portable PETSc for GPU-based Exascale Systems
AMG
- AMG intro
- AMGX
- amgcl
```
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 6ca3264..b63e326 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -161,9 +161,9 @@ if(CMAKE_CXX_COMPILER_ID MATCHES "GNU" OR CMAKE_CXX_COMPILER_ID MATCHES "MSVC")
 
         if (CMAKE_CXX_COMPILER_ID MATCHES "GNU")
            list(APPEND CUDA_NVCC_FLAGS
-                ${CUDA_ARCH_FLAGS} -std=c++11 -Wno-deprecated-gpu-targets)
+                ${CUDA_ARCH_FLAGS} -std=c++17 -Wno-deprecated-gpu-targets)
 
-            list(APPEND CUDA_NVCC_FLAGS -Xcompiler -std=c++11 -Xcompiler -fPIC -Xcompiler -Wno-vla)
+            list(APPEND CUDA_NVCC_FLAGS -Xcompiler -std=c++17 -Xcompiler -fPIC -Xcompiler -Wno-vla)
         endif()
 
         add_library(cuda_target INTERFACE)
```
- SPARSH-AMG
  - SPARSH-AMG: A LIBRARY FOR HYBRID CPU-GPU ALGEBRAIC MULTIGRID AND PRECONDITIONED ITERATIVE METHODS
- Ginkgo is a high-performance linear algebra library for manycore systems, with a focus on sparse solution of linear systems. It is implemented using modern C++ (you will need at least a C++14-compliant compiler to build it), with GPU kernels implemented in CUDA and HIP. Has support for AMG
  - GPGPU acceleration - a case study of algebraic multigrid preconditioned GMRES
- BootCMatchG
- multigrid solver for solving elliptic PDEs using finite differences on a rectangular grid
- Multigrid HowTo (Part I): A simple Multigrid solver in C++ in less than 200 lines of code
- Multigrid HowTo (Part II): An Open Source Algebraic Multigrid Solver in C++
  - Multigrid solver prototype (GMG) and simple Lid Cavity solver
- ExaStencils: Advanced Multigrid Solver Generation
  - EvoStencils - Constructing efficient multigrid solvers through evolutionary computation
Sparse Linear System Solvers on GPUs
- SPARSE LINEAR SYSTEM SOLVERS ON GPUS: PARALLEL PRECONDITIONING, WORKLOAD BALANCING, AND COMMUNICATION REDUCTION
- High performance sparse multifrontal solvers on modern GPUs
  - STRUMPACK -- STRUctured Matrix PACKage, Copyright (c) 2014-2021
How SpaceX uses GPUs to simulate rocket engines
- Rockets Shake And Rattle, So SpaceX Rolls Homegrown CFD
Modern C++ Parallel Task Programming
- docs for Modern C++ Parallel Task Programming
Freud, a tool to create Performance Annotations for C/C++ programs
Eyal Rozenberg, Ph.D.
- Eyal Rozenberg
  - Thin C++-flavored wrappers for the CUDA APIs: Runtime, Driver, NVRTC and NVTX
  - GPU Kernel Runner

RAFT: Reusable Accelerated Functions and Tools

cd cpp && mkdir -p build && cd build && cmake .. -DCMAKE_BUILD_TYPE=Release  -DOPENSSL_INCLUDE_DIR=/usr/include/openssl  -DOPENSSL_CRYPTO_LIBRARY=/usr/lib/libcrypto.so -DOPENSSL_SSL_LIBRARY=/usr/lib/libssl.so

cuSpatial - GPU-Accelerated Spatial and Trajectory Data Management and Analytics Library

CUDA rehab & NVidia docs
- Documentation of NVIDIA chip/hardware interfaces
- CS344 : CUDA Programming in C
- UD281 : High Performance Computing
  - Parallel Computer Architecture and Programming (CMU 15-418/618)
  - Parallel Computer Architecture and Programming (CMU 15-418/618)
- Course on CUDA Programming on NVIDIA GPUs, July 22-26, 2019
```
nvcc -arch=sm_35 ...
```
- Open-Arch-Group
  - Matrix Multiplication (MMul) Benchmarks
  - Performance engineer that's always happy to answer questions!
    - GPGPU Programming with CUDA
      - From Scratch: Histograms in CUDA using Atomics
    - Parallel Programming in Modern C++
      - This program shows off the basics of stop tokens in C++20
- Matrix multiplication in cuSparse (cusparseDcsrgemm) outputs wrong results
- how to cast thrust::device_vector to raw pointer
- Parallel computing using the MPI, OpenMP, and OpenACC standards
Memory Model
- C++11 introduced a standardized memory model. What does it mean? And how is it going to affect C++ programming?
- A Primer on Memory Consistency and Cache Coherence
LPC2018 - Open Source GPU compute stack - Not dancing the CUDA dance
OpenCL
- OpenCL 3.0 Specification Released With New Khronos Open-Source OpenCL SDK
- The State of OpenCL for Scientific Computing in 2018
- OpenCL: History & Future
- Tuned OpenCL BLAS
  - CLBlast: A Tuned BLAS Library for Faster Deep Learning
- OpenCL vloadn
- Could not find a package configuration file provided by "OpenCLHeaders"
- Using OpenCL on Adreno & Mali GPUs is slower than CPU
- Zero copy buffer allocation on arm mali midgard gpus?
- SYCL - C++ Single-source Heterogeneous Programming for OpenCL
OneAPI
- Run simple DPC++ application
- oneAPI Direct Programming
- Port a CUDA App to oneAPI and DPC++ in 5 Minutes
- How to run dpc++ code on Intel HD Graphic atop Nvidia GPU
Kompute
- The general purpose GPU compute framework for cross vendor graphics cards (AMD, Qualcomm, NVIDIA & friends)
  - Kompute github repo
HCC is an Open Source, Optimizing C++ Compiler for Heterogeneous Compute currently for the ROCm GPU Computing Platform
- Why did AMD open source ROCm’s OpenCL driver-stack?
- wiki for HCC
- github HCC repository
- Portable Computing Language
- A collection of Arch Linux PKGBUILDS for the ROCm platform tl;dr
```
yay -S rocm-opencl-runtime
```
- aur package rocm-opencl-runtime
- Arch GPGPU
  - Arch ROCm
    - ROCm for Arch Linux
- rocm OpenCL Programming Guide
```
dkms install --no-depmod -m amdgpu-4.0 -v 23 -k 5.11.16-arch1-1
Error! Bad return status for module build on kernel: 5.11.16-arch1-1 (x86_64)
Consult /var/lib/dkms/amdgpu-4.0/23/build/make.log for more information.
==> Warning, `dkms install --no-depmod -m amdgpu-4.0 -v 23 -k 5.11.16-arch1-1' returned 10

pacman -Qo /usr/src/amdgpu-4.0-23
/usr/src/amdgpu-4.0-23/ is owned by rock-dkms-bin 4.0-3
/usr/src/amdgpu-4.0-23/ is owned by rock-dkms-firmware-bin 4.0-3
```
- Radeon Instinct-like: Radeon VII
- hipSYCL - a SYCL implementation for CPUs and GPUs
  - hipSYCL performance
- OpenCL => Vulkan
  - a prototype implementation of OpenCL 1.2 on top of Vulkan using clspv as the compiler
  - clspv is a prototype compiler for a subset of OpenCL C to Vulkan compute shaders
How To Set The CPU Affinity Of A Running Process In Linux
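tl;dr (a minimal sketch, not taken from the linked article; it assumes `taskset` from util-linux is installed, and the variable names `cpu` and `pinned` are only illustrative). It pins the current shell to the first CPU it is already allowed to run on, then reads the affinity back:

```shell
# Hypothetical example: $$ is the PID of the current shell.
# Take the first CPU from the current affinity list (e.g. "0-7" -> "0").
cpu=$(taskset -cp $$ | sed -e 's/.*: *//' -e 's/[,-].*//')

# Restrict the shell (and anything it starts later) to that single CPU.
taskset -cp "$cpu" $$ > /dev/null

# Read the affinity list back to confirm the change.
pinned=$(taskset -cp $$ | sed 's/.*: *//')
echo "pinned to CPU $pinned"
```

For any other running process, replace `$$` with its PID; a list such as `0,2-3` in place of `"$cpu"` would allow several CPUs.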
OpenMP
- We waited and waited, and it's finally here! OpenMP 4.0
- Parallelization of a prefix sum (Openmp)
OpenACC
- IPMACC is a framework for translating/executing OpenACC for C API to/over CUDA or OpenCL runtime
- IPMACC – An Open Source OpenACC to CUDA/OpenCL Translator
MATOG - GPU Access Auto Tuning
- MATOG Auto-Tuning on GPUs is a tool to automatically optimize performance of NVIDIA CUDA code
- MATOG preprint
- MATOG: CUDA Array Access Auto-Tuner
OCCA (Open Concurrent Compute Abstraction)
- github repository for OCCA
LCSE - Linked Cluster Series Expansions - a framework for high-temperature series expansions
VLI is a library for high but fixed precision (128 to 512-bit) arithmetic and symbolic polynomial computations
- Series Expansion Methods for Quantum Lattice Models
Apache Arrow
- Apache Arrow is a development platform for in-memory analytics. It contains a set of technologies that enable big data systems to process and move data fast.

Sandia

Trilinos is a collection of open-source software libraries, called packages, intended to be used as building blocks for the development of scientific applications.

github repo for Trilinos
tl;dr

$ yay -s trilinos
3 aur/trilinos 12.14.1-2 (+0 0.00%) 
    algorithms for the solution of large-scale scientific problems
2 aur/mingw-w64-trilinos 12.12.1-1 (+0 0.00%) 
    Framework for the solution of large-scale, complex multi-physics engineering and scientific problems (mingw-w64)
1 aur/trilinos-git 12.12.0.gd3b096f4f1-1 (+1 0.00%) (Out-of-date 2019-06-21) 
    An effort to develop algorithms and enabling technologies within an object-oriented software framework for the solution of large-scale, complex multi-physics engineering and scientific problems.

Add option to turn off the install of gtest header and lib even if Gtest package is enabled

ARM
- The ARM Computer Vision and Machine Learning library
- HPCG for Arm
  - Parallelizing HPCG's main kernels
- ARM Neon
CPU, GPU & DRAM Architecture Simulators
- GPGPU-Sim
- Integrated gem5 + GPGPU-Sim Simulator
  - Getting gem5
- SimpleScalar LLC
  - SimpleScalar LLC Intro
    - Todd Austin : the author
- DRAMSim2
  - github repos for DRAMSim2 etc. from University of Maryland
    - Write-back vs Write-Through
    - Study of Different Cache Line Replacement Algorithms in Embedded Systems
- Chisel: Constructing Hardware in a Scala Embedded Language
  - UC Berkeley Architecture Research

CUDA and friends related surveys, papers

A Survey of CPU-GPU Heterogeneous Computing Techniques
Hybrid implementation of the MST algorithm using CPU and GPU
Understanding shared memory bank conflicts in NVIDIA CUDA
Vulkan: The next Khronos graphics API… that is not OpenGL
AMD supported project: HIP : Convert CUDA to Portable C++ Code
- Examples for HIP

DSLs targeting GPU

CARP: Correct and Efficient Accelerator Programming
- CARP dissemination
  - A taste of CARP: benchmark analysis, language design and kernel verification
- PENCIL: a C99-based intermediate language for compute & optimization
- see also PPCG (below)
Framework for performance-portable parallel computations on unstructured meshes
Copperhead Data Parallel Python
- github CU copperhead
Delite
Scalan
- Scalan Community Edition
Generating Performance Portable Code using Rewrite Rules: From High-level Functional Expressions to High-Performance OpenCL Code
Performance Comparison of GPU, DSP and FPGA implementations of image processing and computer vision algorithms in embedded systems, Fykse, Egil
ROSE compiler + Mint for C-to-CUDA code generation
- ROSE compiler github
- MINT
  - ROSE project MINT
  - MINT google project
    - Mint: Realizing CUDA performance in 3D Stencil Methods with Annotated C: claims 78% of handwritten CUDA performance
    - MINT PhD thesis
Nested Data Parallelism, Haskell, and friends
CUDA kernels generation using C++ expression templates technique
- CU++ -- an interesting approach
- VexCL is a C++ vector expression template library for OpenCL/CUDA
  - VexCL is a C++ vector expression template library for OpenCL/CUDA
  - Generating OpenCL/CUDA source code from C++ expressions in VexCL
AnyDSL - A Framework for Rapid Development of Domain-Specific Libraries; thorin (The Higher-ORder INtermediate representation) / impala (An imperative and functional programming language)

parallelforall

An Efficient Matrix Transpose in CUDA C/C++
BIDMach: Machine Learning at the Limit with GPUs
High-Performance Geometric Multi-Grid with GPU Acceleration
Inside Pascal: NVIDIA’s Newest Computing Platform
GPU Programming in Functional Languages
HIP : Convert CUDA to Portable C++ Code

Pencil computations

Speeding up stencil computations: building and running YASK on Intel processors
flexible package manager that supports multiple versions, configurations, platforms, and compilers. https://spack.io
Tutorial: Spack 101
NASA: High Performance Fast Computing Challenge
Why Rust fails hard at scientific computing
- Why Rust fails hard at scientific computing
  - technicalities: interactive scientific computing #2 of 2, goldilocks languages

Nim links

Laser - Primitives for high performance computing
NimTorch
A matrix library https://unicredit.github.io/neo/
A fast, ergonomic and portable tensor library with a deep learning focus
high performance tensor library in Nim
- Arraymancer - A n-dimensional tensor (ndarray) library
A curated list of awesome Nim frameworks, libraries and software
- Find the nim package
Meta Nim Are we scientists yet?
Quantum EXpressions lattice field theory framework
- QEX: a framework for lattice field theories
tl;dr

nimble refresh
nimble install neo
nimble install Arraymancer

Why is nim and nimble in official repo so outdated?
parallel-computing resources list
Portable Hardware Locality (hwloc)
Overview of the Efficient Programming Languages (v.3) 2018.4
Intel Level Zero
- oneAPI Level Zero
Code Generation for High Performance PDE Solvers on Modern Architectures
GPU roof model
YouTube videos on GPU embedded profiling/optimization
AMD Radeon and NVIDIA GeForce FP32/FP64 GFLOPS Table
RICOS Co. Ltd. Research Institute for Computational Science Co.Ltd.
Load-link/store-conditional
