2022-03-30 17:26:29 UTC
C++ 17 & 20

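New language features are a perennial topic. It has been a long time since C++11 introduced lvalue/rvalue semantics. On EPYC, the usual ranking of compilers by performance is AOCC > Intel > GCC >> LLVM. MKL still has some advantage over AMD's optimized libraries, though, so Intel with only baseline x86 optimizations can sometimes match AOCC. C++17 did a lot for parallelism and performance, for example pstl, for_each with threading, and pmr; Intel's sycl and NVIDIA's thrust/cub follow the same style. Often you can simply switch the namespace to get a painless performance gain. The most important C++20 features include ranges (filesystem already arrived in C++17). LLVM, however, has always been the slowest to support new standard versions. icc used to accept some ancient syntax such as VLAs, but newer versions dropped it, and this kind of unstable feature change is why large companies do not treat it as a standard.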
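As a small illustration of the "switch the namespace" idea, here is a minimal sketch that moves a container from std::vector to std::pmr::vector backed by a stack buffer. The helper name sum_with_pmr and the 4096-byte buffer are assumptions made for this example; whether the swap actually helps depends on the allocation pattern of the workload.

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <numeric>
#include <vector>

// Hypothetical helper: the code has the same shape as with std::vector,
// only the namespace and the allocator argument change.
int sum_with_pmr(int n) {
    std::array<std::byte, 4096> buffer;  // stack storage that is used first
    // Falls back to the default heap resource once the buffer is exhausted.
    std::pmr::monotonic_buffer_resource pool(buffer.data(), buffer.size());
    std::pmr::vector<int> v(&pool);      // was: std::vector<int> v;
    for (int i = 1; i <= n; ++i) v.push_back(i);
    return std::accumulate(v.begin(), v.end(), 0);
}
```

For small n every allocation is served from the stack buffer, so no call to the global allocator is needed.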

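This is a question the author set in a compiler-principles assignment: give a semantic rule that rejects the line marked ???. It seems that only icc accepts this code, though.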

template<class T> class array {
  int s;
  T* elements;
public:
  array(int n);       // allocate "n" elements and let "elements" refer to them
  array(T* p, int n); // make this array refer to p[0..n-1]
  operator T*() { return elements; }
  int size() const { return s; }
  // the usual container operations, such as = and [], much like vector
};

void h(array<double> a);       // C++
void g(int m, double vla[m]);  // C99
void f(int m, double vla1[m], array<double> a1) {
    array<double> a2(vla1, m); // a2 refers to vla1
    double* p = a1;            // p refers to a1's elements

    g(a1.size(), a1);          // a bit verbose
    g(a1);                     // ???
}


Any compiler can segfault. Often, switching to a different Intel minor version gets the build through.


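To see the actual commands being run, use make VERBOSE=1 for CMake-generated Makefiles and make -n for configure-based builds. If you need the code after macro expansion, add the compiler's -E option to the build to get the preprocessed output.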

Make extensive use of man and --help.



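To reduce the overhead of calls across libraries or across languages, this part basically relies on LLVM's libLTO and tblgen. It is enabled automatically. The idea is to turn all libraries into LLVM bitcode and link them together. A parallel version of LTO is actually not very hard to implement; it was once a parallel-computing project that our former team captain wrote in Rust. See the source code for details.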




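  • Phase 1: add -prof-gen=srcpos -prof-dir=/tmp/profdata to the compile flags. -prof-dir is the directory where the profiling files are stored.
  • Phase 2: run the compiled program, then run profmerge -prof_dir /tmp/profdata to generate the merged summary file.
  • Phase 3: recompile the program with the flags -prof-use=nomerge -prof-func-groups -prof-dir=/tmp/profdata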




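From the optimizations that are already present in the LLVM IR shown below, we can tell that icc actually has a high-level IR in front of LLVM IR. According to the documentation, it mainly performs: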

  • Loop Permutation or Interchange
  • Loop Distribution
  • Loop Fusion
  • Loop Unrolling
  • Data Prefetching
  • Scalar Replacement
  • Unroll and Jam
  • Loop Blocking or Tiling
  • Partial-Sum Optimization
  • Predicate Optimization
  • Loop Reversal
  • Profile-Guided Loop Unrolling
  • Loop Peeling
  • Data Transformation: Malloc Combining and Memset Combining, Memory Layout Change
  • Loop Rerolling
  • Memset and Memcpy Recognition
  • Statement Sinking for Creating Perfect Loopnests
  • Multiversioning: Checks include Dependency of Memory References and Trip Counts
  • Loop Collapsing
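To make one entry of the list above concrete, here is a minimal hand-written sketch of Loop Blocking or Tiling applied to a matrix transpose. The function names and the default tile size of 32 are assumptions for illustration; a compiler that performs this transformation picks its own tile sizes from its cost model.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Naive transpose: dst is written with stride n, which is cache-unfriendly.
void transpose_naive(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dst[j * n + i] = src[i * n + j];
}

// Tiled transpose: iterate over B x B tiles so that the rows touched in
// both arrays stay in cache while a tile is processed.
void transpose_tiled(const std::vector<float>& src, std::vector<float>& dst,
                     std::size_t n, std::size_t B = 32) {
    for (std::size_t ii = 0; ii < n; ii += B)
        for (std::size_t jj = 0; jj < n; jj += B)
            for (std::size_t i = ii; i < std::min(ii + B, n); ++i)
                for (std::size_t j = jj; j < std::min(jj + B, n); ++j)
                    dst[j * n + i] = src[i * n + j];
}
```

Both functions compute the same result; only the iteration order differs, which is exactly why the compiler is allowed to apply the transformation on its own.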


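If you have written code for a game engine such as UE, or cache-aligned structs in a kernel, or have followed the hyper-competitive DB optimizations of recent years, this kind of data structure will be familiar. Its key idea is to pack the data into AVX-aligned structs, so that every operation becomes add and multiply arithmetic on those structs. See the linked material for details.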
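The idea can be sketched as follows. Float8 and axpy are hypothetical names chosen for this example, not taken from any particular engine or database; the plain loops are written so that a compiler can map each one onto a single 256-bit AVX instruction.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical pack of 8 floats, aligned to 32 bytes (one AVX register).
struct alignas(32) Float8 {
    float lane[8];
};

// Lane-wise addition of two packs.
inline Float8 operator+(const Float8& a, const Float8& b) {
    Float8 r;
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}

// Lane-wise multiplication of two packs.
inline Float8 operator*(const Float8& a, const Float8& b) {
    Float8 r;
    for (std::size_t i = 0; i < 8; ++i) r.lane[i] = a.lane[i] * b.lane[i];
    return r;
}

// a*x + y on whole packs: all work is expressed as pack arithmetic.
inline Float8 axpy(const Float8& a, const Float8& x, const Float8& y) {
    return a * x + y;
}
```

Because the alignment is part of the type, arrays of Float8 are aligned as well, which lets the compiler use aligned loads and stores.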


DPC++ and AOCC have both started to use LLVM as the middle layer

What new features ICC adds, roughly

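We statically analyze the latest release, 2021.3.0, using saxpy as the reference program. The name stands for Single-Precision A·X Plus Y. It is a Level 1 BLAS routine and is often written as a kernel for tuning parameters, register usage, and the memory model. The C++ version is as follows: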

void saxpy(int n, float a, float *x, float *y)
{
  for (int i = 0; i < n; ++i)
      y[i] = a*x[i] + y[i];
}

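The generated code is shown below as x86-64 assembly (the full path is not reproduced here). It mainly hard-codes various pre-optimized assembly sequences, especially fast fetch patterns such as mov with high addresses. It feels as if Intel C/C++ Compiler Intrinsic Equivalents such as VADDSS __m128 _mm_mask_add_ss (__m128 s, __mmask8 k, __m128 a, __m128 b); were compiled into the IR together with the program as library functions.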

saxpy(int, float, float*, float*):
        mov       r9, rsi                                       #2.1
        test      edi, edi                                      #3.23
        jle       ..B1.36       # Prob 50%                      #3.23
        cmp       edi, 6                                        #3.3
        jle       ..B1.30       # Prob 50%                      #3.3
        movsxd    r8, edi                                       #1.6
        mov       rax, rdx                                      #4.16
        sub       rax, r9                                       #4.16
        lea       rcx, QWORD PTR [r8*4]                         #3.3
        cmp       rax, rcx                                      #3.3
        jge       ..B1.5        # Prob 50%                      #3.3
        neg       rax                                           #4.23
        cmp       rax, rcx                                      #3.3
        jl        ..B1.30       # Prob 50%                      #3.3
..B1.5:                         # Preds ..B1.4 ..B1.3
        cmp       edi, 8                                        #3.3
        jl        ..B1.38       # Prob 10%                      #3.3
        mov       r10, rdx                                      #3.3
        and       r10, 15                                       #3.3
        test      r10d, r10d                                    #3.3
        je        ..B1.9        # Prob 50%                      #3.3
        test      r10d, 3                                       #3.3
        jne       ..B1.38       # Prob 10%                      #3.3
        neg       r10d                                          #3.3
        add       r10d, 16                                      #3.3
        shr       r10d, 2                                       #3.3
..B1.9:                         # Preds ..B1.8 ..B1.6
        lea       eax, DWORD PTR [8+r10]                        #3.3
        cmp       edi, eax                                      #3.3
        jl        ..B1.38       # Prob 10%                      #3.3
        mov       esi, edi                                      #3.3
        xor       ecx, ecx                                      #3.3
        sub       esi, r10d                                     #3.3
        and       esi, 7                                        #3.3
        neg       esi                                           #3.3
        add       esi, edi                                      #3.3
        mov       eax, r10d                                     #3.3
        test      r10d, r10d                                    #3.3
        jbe       ..B1.14       # Prob 9%                       #3.3
..B1.12:                        # Preds ..B1.10 ..B1.12
        movss     xmm1, DWORD PTR [r9+rcx*4]                    #4.16
        mulss     xmm1, xmm0                                    #4.16
        addss     xmm1, DWORD PTR [rdx+rcx*4]                   #4.23
        movss     DWORD PTR [rdx+rcx*4], xmm1                   #4.7
        inc       rcx                                           #3.3
        cmp       rcx, rax                                      #3.3
        jb        ..B1.12       # Prob 82%                      #3.3
..B1.14:                        # Preds ..B1.12 ..B1.10
        lea       rcx, QWORD PTR [r9+rax*4]                     #4.16
        test      rcx, 15                                       #3.3
        je        ..B1.18       # Prob 60%                      #3.3
        movaps    xmm1, xmm0                                    #1.6
        shufps    xmm1, xmm1, 0                                 #1.6
        movsxd    rcx, esi                                      #3.3
..B1.16:                        # Preds ..B1.16 ..B1.15
        movups    xmm2, XMMWORD PTR [r9+rax*4]                  #4.16
        movups    xmm3, XMMWORD PTR [16+r9+rax*4]               #4.16
        mulps     xmm2, xmm1                                    #4.16
        mulps     xmm3, xmm1                                    #4.16
        addps     xmm2, XMMWORD PTR [rdx+rax*4]                 #4.23
        addps     xmm3, XMMWORD PTR [16+rdx+rax*4]              #4.23
        movups    XMMWORD PTR [rdx+rax*4], xmm2                 #4.7
        movups    XMMWORD PTR [16+rdx+rax*4], xmm3              #4.7
        add       rax, 8                                        #3.3
        cmp       rax, rcx                                      #3.3
        jb        ..B1.16       # Prob 82%                      #3.3
        jmp       ..B1.21       # Prob 100%                     #3.3
..B1.18:                        # Preds ..B1.14
        movaps    xmm1, xmm0                                    #1.6
        shufps    xmm1, xmm1, 0                                 #1.6
        movsxd    rcx, esi                                      #3.3
..B1.19:                        # Preds ..B1.19 ..B1.18
        movups    xmm2, XMMWORD PTR [r9+rax*4]                  #4.16
        movups    xmm3, XMMWORD PTR [16+r9+rax*4]               #4.16
        mulps     xmm2, xmm1                                    #4.16
        mulps     xmm3, xmm1                                    #4.16
        addps     xmm2, XMMWORD PTR [rdx+rax*4]                 #4.23
        addps     xmm3, XMMWORD PTR [16+rdx+rax*4]              #4.23
        movups    XMMWORD PTR [rdx+rax*4], xmm2                 #4.7
        movups    XMMWORD PTR [16+rdx+rax*4], xmm3              #4.7
        add       rax, 8                                        #3.3
        cmp       rax, rcx                                      #3.3
        jb        ..B1.19       # Prob 82%                      #3.3
..B1.21:                        # Preds ..B1.19 ..B1.16
        lea       eax, DWORD PTR [1+rsi]                        #3.3
        cmp       eax, edi                                      #3.3
        ja        ..B1.36       # Prob 50%                      #3.3
        sub       r8, rcx                                       #3.3
        cmp       r8, 4                                         #3.3
        jl        ..B1.39       # Prob 10%                      #3.3
        mov       eax, r8d                                      #3.3
        xor       r10d, r10d                                    #3.3
        and       eax, -4                                       #3.3
        lea       rdi, QWORD PTR [rdx+rcx*4]                    #4.23
        movsxd    rax, eax                                      #3.3
        lea       rcx, QWORD PTR [r9+rcx*4]                     #4.16
..B1.24:                        # Preds ..B1.24 ..B1.23
        movups    xmm2, XMMWORD PTR [rcx+r10*4]                 #4.16
        mulps     xmm2, xmm1                                    #4.16
        addps     xmm2, XMMWORD PTR [rdi+r10*4]                 #4.23
        movups    XMMWORD PTR [rdi+r10*4], xmm2                 #4.7
        add       r10, 4                                        #3.3
        cmp       r10, rax                                      #3.3
        jb        ..B1.24       # Prob 82%                      #3.3
..B1.26:                        # Preds ..B1.24 ..B1.39
        cmp       rax, r8                                       #3.3
        jae       ..B1.36       # Prob 9%                       #3.3
        movsxd    rsi, esi                                      #4.7
        lea       rcx, QWORD PTR [rdx+rsi*4]                    #4.23
        lea       rdx, QWORD PTR [r9+rsi*4]                     #4.16
..B1.28:                        # Preds ..B1.28 ..B1.27
        movss     xmm1, DWORD PTR [rdx+rax*4]                   #4.16
        mulss     xmm1, xmm0                                    #4.16
        addss     xmm1, DWORD PTR [rcx+rax*4]                   #4.23
        movss     DWORD PTR [rcx+rax*4], xmm1                   #4.7
        inc       rax                                           #3.3
        cmp       rax, r8                                       #3.3
        jb        ..B1.28       # Prob 82%                      #3.3
        jmp       ..B1.36       # Prob 100%                     #3.3
..B1.30:                        # Preds ..B1.4 ..B1.2
        mov       eax, edi                                      #3.3
        mov       esi, 1                                        #3.3
        xor       ecx, ecx                                      #3.3
        shr       eax, 1                                        #3.3
        je        ..B1.34       # Prob 9%                       #3.3
..B1.32:                        # Preds ..B1.30 ..B1.32
        movss     xmm1, DWORD PTR [r9+rcx*8]                    #4.16
        mulss     xmm1, xmm0                                    #4.16
        addss     xmm1, DWORD PTR [rdx+rcx*8]                   #4.23
        movss     DWORD PTR [rdx+rcx*8], xmm1                   #4.7
        movss     xmm2, DWORD PTR [4+r9+rcx*8]                  #4.16
        mulss     xmm2, xmm0                                    #4.16
        addss     xmm2, DWORD PTR [4+rdx+rcx*8]                 #4.23
        movss     DWORD PTR [4+rdx+rcx*8], xmm2                 #4.7
        inc       rcx                                           #3.3
        cmp       rcx, rax                                      #3.3
        jb        ..B1.32       # Prob 63%                      #3.3
        lea       esi, DWORD PTR [1+rcx+rcx]                    #4.7
..B1.34:                        # Preds ..B1.33 ..B1.30
        lea       eax, DWORD PTR [-1+rsi]                       #3.3
        cmp       eax, edi                                      #3.3
        jae       ..B1.36       # Prob 9%                       #3.3
        movsxd    rsi, esi                                      #3.3
        movss     xmm1, DWORD PTR [-4+r9+rsi*4]                 #4.16
        mulss     xmm0, xmm1                                    #4.16
        addss     xmm0, DWORD PTR [-4+rdx+rsi*4]                #4.23
        movss     DWORD PTR [-4+rdx+rsi*4], xmm0                #4.7
..B1.36:                        # Preds ..B1.28 ..B1.21 ..B1.34 ..B1.38 ..B1.1
        ret                                                     #5.1
..B1.38:                        # Preds ..B1.5 ..B1.7 ..B1.9
        xor       esi, esi                                      #3.3
        cmp       edi, 1                                        #3.3
        jb        ..B1.36       # Prob 50%                      #3.3
..B1.39:                        # Preds ..B1.22 ..B1.38
        xor       eax, eax                                      #3.3
        jmp       ..B1.26       # Prob 100%                     #3.3
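The control flow of the listing above can be read as multiversioning: a runtime overlap check between x and y, a scalar peel loop until y is 16-byte aligned, a main loop that handles 8 floats per iteration, and a scalar tail. The following is a minimal C++ sketch of that shape under my reading of the assembly; the function names are hypothetical, the inner 8-element loop stands in for the two 4-wide SSE operations, and the 4-wide remainder loop of the assembly is folded into the scalar tail.

```cpp
#include <cstddef>
#include <cstdint>

// Reference loop, correct for any aliasing between x and y.
void saxpy_scalar(int n, float a, const float* x, float* y) {
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

// Sketch of the multiversioned shape (hypothetical name).
void saxpy_multiversion(int n, float a, const float* x, float* y) {
    if (n <= 0) return;
    const std::ptrdiff_t bytes =
        static_cast<std::ptrdiff_t>(n) * static_cast<std::ptrdiff_t>(sizeof(float));
    const std::ptrdiff_t dist =
        reinterpret_cast<std::intptr_t>(y) - reinterpret_cast<std::intptr_t>(x);
    // The two arrays do not overlap if they are at least n*4 bytes apart.
    const bool disjoint = dist >= bytes || -dist >= bytes;
    if (n <= 6 || !disjoint) {  // short or possibly overlapping: safe version
        saxpy_scalar(n, a, x, y);
        return;
    }
    int i = 0;
    // Peel scalar iterations until y + i is 16-byte aligned.
    while (i < n && reinterpret_cast<std::uintptr_t>(y + i) % 16 != 0) {
        y[i] = a * x[i] + y[i];
        ++i;
    }
    // Main loop: 8 floats per iteration (two 4-wide vectors in the assembly).
    for (; i + 8 <= n; i += 8)
        for (int k = 0; k < 8; ++k) y[i + k] = a * x[i + k] + y[i + k];
    // Scalar tail for the remaining elements.
    for (; i < n; ++i) y[i] = a * x[i] + y[i];
}
```

Declaring the pointers as non-aliasing (for example with the compiler's restrict extension) would let the compiler drop the overlap check and the extra scalar version.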

Below is the LLVM IR emitted by AOCC. It does nothing special at the IR level and is basically identical to what clang emits.

; ModuleID = './a.c'
source_filename = "./a.c"
target datalayout = "e-m:e-p270:32:32-p271:32:32-p272:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-unknown-linux-gnu"

; Function Attrs: noinline nounwind optnone uwtable
define dso_local void @saxpy(i32 %n, float %a, float* %x, float* %y) #0 {
entry:
  %n.addr = alloca i32, align 4
  %a.addr = alloca float, align 4
  %x.addr = alloca float*, align 8
  %y.addr = alloca float*, align 8
  %i = alloca i32, align 4
  store i32 %n, i32* %n.addr, align 4
  store float %a, float* %a.addr, align 4
  store float* %x, float** %x.addr, align 8
  store float* %y, float** %y.addr, align 8
  store i32 0, i32* %i, align 4
  br label %for.cond

for.cond:                                         ; preds = %for.inc, %entry
  %0 = load i32, i32* %i, align 4
  %1 = load i32, i32* %n.addr, align 4
  %cmp = icmp slt i32 %0, %1
  br i1 %cmp, label %for.body, label %for.end

for.body:                                         ; preds = %for.cond
  %2 = load float, float* %a.addr, align 4
  %3 = load float*, float** %x.addr, align 8
  %4 = load i32, i32* %i, align 4
  %idxprom = sext i32 %4 to i64
  %arrayidx = getelementptr inbounds float, float* %3, i64 %idxprom
  %5 = load float, float* %arrayidx, align 4
  %mul = fmul float %2, %5
  %6 = load float*, float** %y.addr, align 8
  %7 = load i32, i32* %i, align 4
  %idxprom1 = sext i32 %7 to i64
  %arrayidx2 = getelementptr inbounds float, float* %6, i64 %idxprom1
  %8 = load float, float* %arrayidx2, align 4
  %add = fadd float %mul, %8
  %9 = load float*, float** %y.addr, align 8
  %10 = load i32, i32* %i, align 4
  %idxprom3 = sext i32 %10 to i64
  %arrayidx4 = getelementptr inbounds float, float* %9, i64 %idxprom3
  store float %add, float* %arrayidx4, align 4
  br label %for.inc

for.inc:                                          ; preds = %for.body
  %11 = load i32, i32* %i, align 4
  %inc = add nsw i32 %11, 1
  store i32 %inc, i32* %i, align 4
  br label %for.cond

for.end:                                          ; preds = %for.cond
  ret void
}

attributes #0 = { noinline nounwind optnone uwtable "disable-tail-calls"="false" "frame-pointer"="all" "less-precise-fpmad"="false" "min-legal-vector-width"="0" "no-infs-fp-math"="false" "no-jump-tables"="false" "no-nans-fp-math"="false" "no-signed-zeros-fp-math"="false" "no-trapping-math"="true" "stack-protector-buffer-size"="8" "target-cpu"="x86-64" "target-features"="+cx8,+fxsr,+mmx,+sse,+sse2,+x87" "tune-cpu"="generic" "unsafe-fp-math"="false" "use-soft-float"="false" }

!llvm.module.flags = !{!0}
!llvm.ident = !{!1}

!0 = !{i32 1, !"wchar_size", i32 4}
!1 = !{!"AMD clang version 12.0.0 (CLANG: AOCC_3.0.0-Build#78 2020_12_10) (based on LLVM Mirror.Version.12.0.0)"}

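The vectorized part is basically the same as what icc produces, but without the probability model. Unfortunately, the cost model behind the probability annotations above targets Intel processors, so in the end icc and AOCC are evenly matched.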

Another test uses NVHPC. You can actually hack the CPU backend part, using AOCC, with nvc -march=zen2 -Mvect=simd:256 -Mcache_align -fma -S a.c.

; ModuleID = 'a.c'
target datalayout = "e-p:64:64-i64:64-f80:128-n8:16:32:64-S128"
target triple = "x86_64-pc-linux-gnu"

define internal void @pgCplus_compiled.() noinline {
	ret void
}

define void @saxpy(i32 signext %n.arg, float %a.arg, float* %x.arg, float* %y.arg) #0 !dbg !17 {
	%n.addr = alloca i32, align 4
	%a.addr = alloca float, align 4
	%x.addr = alloca float*, align 8
	%y.addr = alloca float*, align 8
	%.ndi0002.addr = alloca i32, align 4
	%.ndi0003.addr = alloca i32, align 4
	%.vv0000.addr = alloca i8*, align 8
	%.vv0001.addr = alloca i8*, align 8
	%.vv0002.addr = alloca i8*, align 8
	%.r1.0148.addr = alloca <8 x float>, align 4
	%.lcr010001.addr = alloca i32, align 4

	store i32 %n.arg, i32* %n.addr, align 4, !tbaa !29
	store float %a.arg, float* %a.addr, align 4, !tbaa !29
	store float* %x.arg, float** %x.addr, align 8, !tbaa !30
	store float* %y.arg, float** %y.addr, align 8, !tbaa !30
	%0 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
	%1 = icmp sle i32  %0, 0, !dbg !23
	br i1  %1, label %L.B0005, label %L.B0014, !dbg !23
L.B0014:
	%2 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
	%3 = bitcast float*  %2 to i8*, !dbg !23
	%4 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
	%5 = bitcast float*  %4 to i8*, !dbg !23
	%6 = ptrtoint i8*  %5 to i64, !dbg !23
	%7 = sub i64 0,  %6, !dbg !23
	%8 = getelementptr i8, i8*  %3, i64  %7, !dbg !23
	%9 = icmp ule i8*  %8,  null, !dbg !23
	br i1  %9, label %L.B0008, label %L.B0015, !dbg !23
L.B0015:
	%10 = bitcast float*  %2 to i8*, !dbg !23
	%11 = bitcast float*  %4 to i8*, !dbg !23
	%12 = ptrtoint i8*  %11 to i64, !dbg !23
	%13 = sub i64 0,  %12, !dbg !23
	%14 = getelementptr i8, i8*  %10, i64  %13, !dbg !23
	%15 = inttoptr i64 32 to i8*, !dbg !23
	%16 = icmp ult i8*  %14,  %15, !dbg !23
	br i1  %16, label %L.B0007, label %L.B0008, !dbg !23
L.B0008:
	store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	%17 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
	store i32  %17, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%18 = icmp slt i32  %17, 8, !dbg !23
	br i1  %18, label %L.B0011, label %L.B0016, !dbg !23
L.B0016:
	store i8* null, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
	%19 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
	%20 = bitcast float*  %19 to i8*, !dbg !23
	store i8*  %20, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23
	%21 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
	%22 = bitcast float*  %21 to i8*, !dbg !23
	store i8*  %22, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23
	%23 = sub i32  %17, 7, !dbg !23
	store i32  %23, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%24 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
	%25 = insertelement <8 x float> undef, float  %24, i32 0, !dbg !23
	%26 = shufflevector <8 x float>  %25, <8 x float> undef, <8 x i32> <i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0, i32 0>, !dbg !23
	store <8 x float>  %26, <8 x float>* %.r1.0148.addr, align 1, !tbaa !29, !dbg !23
	br label %L.B0012
L.B0012:
	%27 = load <8 x float>, <8 x float>* %.r1.0148.addr, align 4, !tbaa !29, !dbg !23
	%28 = load i8*, i8** %.vv0002.addr, align 8, !tbaa !30, !dbg !23
	%29 = load i8*, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
	%30 = ptrtoint i8*  %29 to i64, !dbg !23
	%31 = getelementptr i8, i8*  %28, i64  %30, !dbg !23
	%32 = bitcast i8*  %31 to <8 x float>*, !dbg !23
	%33 = load <8 x float>, <8 x float>*  %32, align 4, !tbaa !29, !dbg !23
	%34 = load i8*, i8** %.vv0001.addr, align 8, !tbaa !30, !dbg !23
	%35 = getelementptr i8, i8*  %34, i64  %30, !dbg !23
	%36 = bitcast i8*  %35 to <8 x float>*, !dbg !23
	%37 = load <8 x float>, <8 x float>*  %36, align 4, !tbaa !29, !dbg !23
	%38 = call <8 x float> @llvm.fma.v8f32 (<8 x float>  %27, <8 x float>  %33, <8 x float>  %37), !dbg !23
	store <8 x float>  %38, <8 x float>*  %36, align 1, !tbaa !29, !dbg !23
	%39 = getelementptr i8, i8*  %29, i64 32, !dbg !23
	store i8*  %39, i8** %.vv0000.addr, align 8, !tbaa !30, !dbg !23
	%40 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%41 = sub i32  %40, 8, !dbg !23
	store i32  %41, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%42 = icmp sgt i32  %41, 0, !dbg !23
	br i1  %42, label %L.B0012, label %L.B0017, !llvm.loop !24, !dbg !23
L.B0017:
	%43 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%44 = add i32  %43, 7, !dbg !23
	store i32  %44, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%45 = icmp eq i32  %44, 0, !dbg !23
	br i1  %45, label %L.B0013, label %L.B0018, !dbg !23
L.B0018:
	%46 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
	%47 = and i32  %46, -8, !dbg !23
	store i32  %47, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	br label %L.B0011
L.B0011:
	%48 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	%49 = sext i32  %48 to i64, !dbg !23
	%50 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
	%51 = getelementptr float, float*  %50, i64  %49, !dbg !23
	%52 = load float, float*  %51, align 4, !tbaa !29, !dbg !23
	%53 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
	%54 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
	%55 = getelementptr float, float*  %54, i64  %49, !dbg !23
	%56 = load float, float*  %55, align 4, !tbaa !29, !dbg !23
	%57 = call float @llvm.fma.f32 (float  %53, float  %56, float  %52), !dbg !23
	store float  %57, float*  %51, align 4, !tbaa !29, !dbg !23
	%58 = add i32  %48, 1, !dbg !23
	store i32  %58, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	%59 = load i32, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%60 = sub i32  %59, 1, !dbg !23
	store i32  %60, i32* %.ndi0003.addr, align 4, !tbaa !32, !dbg !23
	%61 = icmp sgt i32  %60, 0, !dbg !23
	br i1  %61, label %L.B0011, label %L.B0013, !llvm.loop !24, !dbg !23
L.B0013:
	br label %L.B0009, !dbg !23
L.B0007:
	store i32 0, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	%62 = load i32, i32* %n.addr, align 4, !tbaa !32, !dbg !23
	store i32  %62, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23
	br label %L.B0010
L.B0010:
	%63 = load i32, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	%64 = sext i32  %63 to i64, !dbg !23
	%65 = load float*, float** %y.addr, align 8, !tbaa !30, !dbg !23
	%66 = getelementptr float, float*  %65, i64  %64, !dbg !23
	%67 = load float, float*  %66, align 4, !tbaa !29, !dbg !23
	%68 = load float, float* %a.addr, align 4, !tbaa !34, !dbg !23
	%69 = load float*, float** %x.addr, align 8, !tbaa !30, !dbg !23
	%70 = getelementptr float, float*  %69, i64  %64, !dbg !23
	%71 = load float, float*  %70, align 4, !tbaa !29, !dbg !23
	%72 = call float @llvm.fma.f32 (float  %68, float  %71, float  %67), !dbg !23
	store float  %72, float*  %66, align 4, !tbaa !29, !dbg !23
	%73 = add i32  %63, 1, !dbg !23
	store i32  %73, i32* %.ndi0002.addr, align 4, !tbaa !32, !dbg !23
	%74 = load i32, i32* %.lcr010001.addr, align 4, !tbaa !32, !dbg !23
	%75 = icmp slt i32  %73,  %74, !dbg !23
	br i1  %75, label %L.B0010, label %L.B0009, !dbg !23
L.B0009:
	br label %L.B0005
L.B0005:
	ret void, !dbg !26
}

declare float @llvm.fma.f32(float, float, float)
declare <8 x float> @llvm.fma.v8f32(<8 x float>, <8 x float>, <8 x float>)
declare i32 @__gxx_personality_v0(...)

; Named metadata
!llvm.module.flags = !{ !1, !2 }
!llvm.dbg.cu = !{ !10 }

; Metadata
!1 = !{ i32 2, !"Dwarf Version", i32 4 }
!2 = !{ i32 2, !"Debug Info Version", i32 3 }
!3 = !DIFile(filename: "a.c", directory: "/home/victoryang")
; !4 = !DIFile(tag: DW_TAG_file_type, pair: !3)
!4 = !{ i32 41, !3 }
!5 = !{  }
!6 = !{  }
!7 = !{ !17 }
!8 = !{  }
!9 = !{  }
!10 = distinct !DICompileUnit(file: !3, language: DW_LANG_C_plus_plus, producer: " NVC++ 21.5-0", enums: !5, retainedTypes: !6, globals: !8, emissionKind: FullDebug, imports: !9)
!11 = !DIBasicType(tag: DW_TAG_base_type, name: "int", size: 32, align: 32, encoding: DW_ATE_signed)
!12 = !DIBasicType(tag: DW_TAG_base_type, name: "float", size: 32, align: 32, encoding: DW_ATE_float)
!13 = !DIDerivedType(tag: DW_TAG_pointer_type, size: 64, align: 64, baseType: !12)
!14 = !{ null, !11, !12, !13, !13 }
!15 = !DISubroutineType(types: !14)
!16 = !{  }
!17 = distinct !DISubprogram(file: !3, scope: !10, name: "saxpy", line: 2, type: !15, spFlags: 8, unit: !10, scopeLine: 2)
!18 = !DILocation(line: 2, column: 1, scope: !17)
!19 = !DILexicalBlock(file: !3, scope: !17, line: 2, column: 1)
!20 = !DILocation(line: 2, column: 1, scope: !19)
!21 = !DILexicalBlock(file: !3, scope: !19, line: 2, column: 1)
!22 = !DILocation(line: 2, column: 1, scope: !21)
!23 = !DILocation(line: 3, column: 1, scope: !21)
!24 = !{ !24, !25 }
!25 = !{ !"llvm.loop.vectorize.enable", i1 0 }
!26 = !DILocation(line: 5, column: 1, scope: !19)
!27 = !{ !"PGI C[++] TBAA" }
!28 = !{ !"omnipotent char", !27, i64 0 }
!29 = !{ !28, !28, i64 0 }
!30 = !{ !"<T>*", !28, i64 0 }
!31 = !{ !"int", !28, i64 0 }
!32 = !{ !31, !31, i64 0 }
!33 = !{ !"float", !28, i64 0 }
!34 = !{ !33, !33, i64 0 }


	.section	.debug_line,"",@progbits


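For supercomputing, the Arm A64FX really ought to be introduced here. But the author thinks it is better to give a primer on SVE first; it may well show up at the next ISC.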

saxpy without SVE (scalar AArch64 code; the d registers make it the double-precision variant, daxpy)

// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
   ldrsw   x3, [x3]               // x3=*n
   mov     x4, #0                 // x4=i=0
   ldr     d0, [x2]               // d0=*a
   b       .latch
.loop:
   ldr     d1, [x0, x4, lsl #3]   // d1=x[i]
   ldr     d2, [x1, x4, lsl #3]   // d2=y[i]
   fmadd   d2, d1, d0, d2         // d2+=x[i]*a
   str     d2, [x1, x4, lsl #3]   // y[i]=d2
   add     x4, x4, #1             // i+=1
.latch:
   cmp     x4, x3                 // i < n
   b.lt    .loop                  // more to do?
   ret

saxpy with sve

// x0 = &x[0], x1 = &y[0], x2 = &a, x3 = &n
   ldrsw   x3, [x3]                       // x3=*n
   mov     x4, #0                         // x4=i=0
   whilelt p0.d, x4, x3                   // p0=while(i++<n)
   ld1rd   z0.d, p0/z, [x2]               // p0:z0=bcast(*a)
.loop:
   ld1d    z1.d, p0/z, [x0, x4, lsl #3]   // p0:z1=x[i]
   ld1d    z2.d, p0/z, [x1, x4, lsl #3]   // p0:z2=y[i]
   fmla    z2.d, p0/m, z1.d, z0.d         // p0?z2+=x[i]*a
   st1d    z2.d, p0, [x1, x4, lsl #3]     // p0?y[i]=z2
   incd    x4                             // i+=(VL/64)
.latch:
   whilelt p0.d, x4, x3                   // p0=while(i++<n)
   b.first .loop                          // more to do?
   ret

There is no overhead in instruction count for the SVE version when compared to the equivalent scalar code, which allows a compiler to opportunistically vectorize loops with an unknown trip count.
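To show why no scalar tail loop is needed, here is a minimal plain C++ emulation of the predicated loop. It is only a sketch: kLanes is an assumed fixed lane count (on real SVE hardware the vector length is chosen by the implementation), and whilelt, first_active, and daxpy_predicated are hypothetical helper names that mimic the whilelt, b.first, and loop body of the listing.

```cpp
#include <array>
#include <cstddef>

constexpr std::size_t kLanes = 4;  // assumed number of 64-bit lanes per vector
using Pred = std::array<bool, kLanes>;

// Emulates whilelt: lane k is active while i + k < n.
Pred whilelt(std::size_t i, std::size_t n) {
    Pred p{};
    for (std::size_t k = 0; k < kLanes; ++k) p[k] = (i + k < n);
    return p;
}

// Emulates the "first" condition: true if the first lane is active.
bool first_active(const Pred& p) { return p[0]; }

void daxpy_predicated(std::size_t n, double a, const double* x, double* y) {
    std::size_t i = 0;
    Pred p = whilelt(i, n);
    do {
        // One predicated vector operation: inactive lanes touch no memory.
        for (std::size_t k = 0; k < kLanes; ++k)
            if (p[k]) y[i + k] = a * x[i + k] + y[i + k];
        i += kLanes;             // incd x4
        p = whilelt(i, n);       // whilelt p0.d, x4, x3
    } while (first_active(p));   // b.first .loop
}
```

The last iteration simply runs with a partially active predicate, so the same loop body covers full vectors and the remainder.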

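  1. Sixteen scalable predicate registers (P0-P15): control of ordinary memory and arithmetic operations is restricted to P0-P7, but predicate-generating instructions (vector compares) and instructions that operate on predicates (logical operations) can use all of P0-P15. Compiler analysis and hand optimization showed this allocation to be effective, and it relieves the predicate-register pressure observed on other architectures.

  2. Mixed element size control: each predicate register allows the granularity to go down to the byte level, so each predicate bit corresponds to 8 bits of data width.

  3. Predicate conditions: predicate-generating instructions in SVE reuse the NZCV condition-code flags, which are interpreted differently in this context.

  4. Implicit order: a predicate is interpreted with an implicit order from the least significant to the most significant element, corresponding to an equivalent sequential order. The first and last predicate elements, and their associated conditions, are defined with respect to this order.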
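The points above can be illustrated with a small emulation of a predicate register. This is a sketch under stated assumptions: a 256-bit vector length (kVectorBytes), one predicate bit per vector byte with only the lowest bit of each element's group used, and the flag reading N = first element active, Z = no element active, C = last element not active. PredResult and whilelt_bits are hypothetical names.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kVectorBytes = 32;  // assumed 256-bit vector length

struct PredResult {
    std::uint32_t bits;  // one predicate bit per vector byte
    bool first;          // N: first element is active
    bool none;           // Z: no element is active
    bool not_last;       // C: last element is not active
};

// Emulates whilelt for elements of elem_bytes bytes (8 for .d, 4 for .s).
PredResult whilelt_bits(std::size_t i, std::size_t n, std::size_t elem_bytes) {
    const std::size_t lanes = kVectorBytes / elem_bytes;
    PredResult r{0, false, true, true};
    for (std::size_t k = 0; k < lanes; ++k) {
        if (i + k < n) {
            // Element k is governed by the lowest bit of its byte group.
            r.bits |= std::uint32_t{1} << (k * elem_bytes);
            r.none = false;
            if (k == 0) r.first = true;
            if (k == lanes - 1) r.not_last = false;
        }
    }
    return r;
}
```

With 64-bit elements only every eighth bit is used, while 32-bit elements use every fourth bit, which is how one register layout serves mixed element sizes.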

whilelt p0 computes the predicate up to the last full-width aligned block, and the final partial block may drain throughput. An OoO core may issue the last elements at low occupancy, which wastes that issue slot; alternatively, it could issue other, narrower (2^n-aligned) instructions.

For hazard handling and speculation, you can easily do a gather load into the z3 register, trap on a fault, and reload.


The compiler supports OpenMP by default. The OpenMP standard for specifying threading in programming languages such as C and Fortran is implemented in the compiler itself and is therefore an integral part of the compiler in question. The OpenMP and POSIX thread libraries underneath can vary, but this is normally hidden from the user. OpenMP makes use of POSIX threads, so both an OpenMP library and a POSIX thread library are needed. The POSIX thread library is normally supplied with the distribution (typically /usr/lib64/

\[
\begin{array}{|l|c|c|}
\hline
\text{Compiler} & \text{Flag to select OpenMP} & \text{OpenMP version supported} \\
\hline
\text{Intel compilers} & \text{-qopenmp} & \text{From 17.0 on: 4.5} \\
\hline
\text{GNU compilers} & \text{-fopenmp} & \text{From GCC 6.1 on: 4.5} \\
\hline
\text{PGI compilers} & \text{-mp} & \text{4.5} \\
\hline
\end{array}
\]
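As a minimal sketch of what these flags enable, here is the saxpy kernel from earlier with an OpenMP worksharing pragma. The function name saxpy_omp is an assumption for this example, and y is assumed to hold at least as many elements as x. When the OpenMP flag from the table is not passed, the pragma is ignored and the loop runs serially with the same result.

```cpp
#include <cstddef>
#include <vector>

// saxpy with the loop iterations divided among OpenMP threads.
// Assumes y.size() >= x.size().
void saxpy_omp(float a, const std::vector<float>& x, std::vector<float>& y) {
    const long n = static_cast<long>(x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

Each iteration writes a distinct y[i], so no synchronization is needed beyond the implicit barrier at the end of the parallel loop.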

You definitely need to watch Fanrui's PPT and understand the implementation of OpenMP in Clang.


