Remove virtual methods from ur_mem_handle_t_ #2620

Open
wants to merge 1 commit into
base: main

RossBrunton
Contributor

We want to transition to handle pointers whose pointee contains the ddi
table as the first element. For this to work, handle objects must not have
a vtable, since a vtable pointer would typically occupy the start of the
object instead.

Since ur_mem_handle_t_ is relatively simple, it's easy enough to roll
our own version of dynamic dispatch.
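
A minimal sketch of what such hand-rolled dispatch could look like, assuming a type tag stored in the base object and a `switch` over it. All names here (`ddi_table_t`, `mem_handle_t`, `buffer_t`, `image_t`, `get_size`) are hypothetical stand-ins chosen for illustration; they are not the types or methods touched by this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the DDI table; the real table would hold the
// adapter's entry points.
struct ddi_table_t {
    int version;
};

// Hypothetical handle base: no virtual functions, so there is no vtable
// pointer and the DDI table pointer can be the first member.
struct mem_handle_t {
    enum class kind_t { buffer, image };

    const ddi_table_t *ddi; // first member by design
    kind_t kind;            // type tag that replaces the vtable

    mem_handle_t(const ddi_table_t *ddi, kind_t kind) : ddi(ddi), kind(kind) {}

    // Non-virtual entry point; forwards to the concrete type based on the tag.
    std::size_t get_size() const;
};

struct buffer_t : mem_handle_t {
    std::size_t bytes;
    buffer_t(const ddi_table_t *ddi, std::size_t bytes)
        : mem_handle_t(ddi, kind_t::buffer), bytes(bytes) {}
    std::size_t get_size_impl() const { return bytes; }
};

struct image_t : mem_handle_t {
    std::size_t width, height;
    image_t(const ddi_table_t *ddi, std::size_t width, std::size_t height)
        : mem_handle_t(ddi, kind_t::image), width(width), height(height) {}
    std::size_t get_size_impl() const { return width * height; }
};

// Hand-rolled dynamic dispatch: a switch on the tag plus a static downcast.
inline std::size_t mem_handle_t::get_size() const {
    switch (kind) {
    case kind_t::buffer:
        return static_cast<const buffer_t *>(this)->get_size_impl();
    case kind_t::image:
        return static_cast<const image_t *>(this)->get_size_impl();
    }
    assert(false && "unknown handle kind");
    return 0;
}

// Compile-time checks for the layout property the description relies on.
static_assert(!std::is_polymorphic<mem_handle_t>::value,
              "handle base must not have a vtable");
static_assert(std::is_standard_layout<mem_handle_t>::value,
              "handle base must be standard-layout for offsetof");
static_assert(offsetof(mem_handle_t, ddi) == 0,
              "ddi table pointer must be the first element");
```

With this layout, code that only knows it holds a handle pointer can read the DDI table pointer from the start of the object, while `get_size()` still reaches the concrete implementation through the tag. The trade-off is that every new concrete type needs a new `case` in each dispatching function.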

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12989137257

github-actions bot added the level-zero label (L0 adapter specific issues) on Jan 27, 2025

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12989137257
Job status: success. Test status: success.

Summary

Total 138 benchmarks in mean.
Geomean 99.919%.
Improved 16 Regressed 15 (threshold 2.00%)

(result is better)
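
The summary figures can be approximated with the sketch below. It assumes, based only on the values in the tables that follow and not on the benchmark scripts themselves, that relative performance is `baseline / this PR` for lower-is-better metrics such as time, `this PR / baseline` for higher-is-better metrics such as throughput, that group figures are geometric means of the per-benchmark ratios, and that a benchmark counts as improved or regressed when it moves by more than the 2.00% threshold. The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Ratio > 1.0 means this PR is better than the baseline.
inline double relative_perf_lower_is_better(double pr, double baseline) {
    return baseline / pr; // e.g. times in μs
}

inline double relative_perf_higher_is_better(double pr, double baseline) {
    return pr / baseline; // e.g. GB/s or token/s
}

// Geometric mean computed in log space to avoid overflow on long inputs.
inline double geomean(const std::vector<double> &ratios) {
    assert(!ratios.empty());
    double log_sum = 0.0;
    for (double r : ratios) {
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}

enum class verdict_t { improved, regressed, unchanged };

// Assumed classification rule: outside +/- threshold counts as a change.
inline verdict_t classify(double ratio, double threshold = 0.02) {
    if (ratio > 1.0 + threshold) return verdict_t::improved;
    if (ratio < 1.0 - threshold) return verdict_t::regressed;
    return verdict_t::unchanged;
}

// The four rows of the "memory" group, taken from the table in this comment.
inline double memory_group_geomean() {
    return geomean({
        relative_perf_lower_is_better(253.243, 256.472),  // QueueInOrderMemcpy D2D
        relative_perf_higher_is_better(3.084, 3.074),     // StreamMemory Triad
        relative_perf_lower_is_better(5.930, 5.932),      // QueueMemcpy D2D
        relative_perf_lower_is_better(220.111, 219.201),  // QueueInOrderMemcpy H2D
    });
}
```

Under these assumptions `memory_group_geomean()` evaluates to roughly 1.00303, which agrees with the 100.303% reported for the memory group.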

Performance change in benchmark groups

Relative perf in group api (12): 100.144%
Benchmark This PR baseline Relative perf Change -
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 2.116000 μs 2.175 μs 102.79% 2.79% +
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 1.663000 μs 1.706 μs 102.59% 2.59% +
api_overhead_benchmark_ur SubmitKernel out of order 15.966000 μs 16.073 μs 100.67% 0.67% .
api_overhead_benchmark_sycl SubmitKernel in order 24.579000 μs 24.664 μs 100.35% 0.35% .
api_overhead_benchmark_ur SubmitKernel in order 16.648000 μs 16.703 μs 100.33% 0.33% .
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count 124071.000 instr 123991.000000 instr 99.94% -0.06% .
api_overhead_benchmark_ur SubmitKernel in order CPU count 110895.000 instr 110815.000000 instr 99.93% -0.07% .
api_overhead_benchmark_ur SubmitKernel out of order CPU count 105543.000 instr 105463.000000 instr 99.92% -0.08% .
api_overhead_benchmark_l0 SubmitKernel in order 11.852 μs 11.800000 μs 99.56% -0.44% .
api_overhead_benchmark_sycl SubmitKernel out of order 23.425 μs 23.287000 μs 99.41% -0.59% .
api_overhead_benchmark_l0 SubmitKernel out of order 11.785 μs 11.629000 μs 98.68% -1.32% .
api_overhead_benchmark_ur SubmitKernel in order with measure completion 21.982 μs 21.473000 μs 97.68% -2.32% -
Relative perf in group memory (4): 100.303%
Benchmark This PR baseline Relative perf Change -
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 253.243000 μs 256.472 μs 101.28% 1.28% .
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 3.084000 GB/s 3.074 GB/s 100.33% 0.33% .
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 5.930000 μs 5.932 μs 100.03% 0.03% .
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 220.111 μs 219.201000 μs 99.59% -0.41% .
Relative perf in group miscellaneous (1): 106.845%
Benchmark This PR baseline Relative perf Change -
miscellaneous_benchmark_sycl VectorSum 806.080000 bw GB/s 861.253 bw GB/s 106.84% 6.84% +++
Relative perf in group multithread (10): 100.053%
Benchmark This PR baseline Relative perf Change -
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 16968.308000 μs 17230.283 μs 101.54% 1.54% .
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 8936.953000 μs 9073.725 μs 101.53% 1.53% .
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 1197.227000 μs 1210.999 μs 101.15% 1.15% .
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events 113293.620000 μs 114139.645 μs 100.75% 0.75% .
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 47170.440000 μs 47306.654 μs 100.29% 0.29% .
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 6951.458 μs 6943.025000 μs 99.88% -0.12% .
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events 43161.273 μs 43064.999000 μs 99.78% -0.22% .
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 7868.920 μs 7821.718000 μs 99.40% -0.60% .
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 2097.041 μs 2083.870000 μs 99.37% -0.63% .
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 27555.344 μs 26707.698000 μs 96.92% -3.08% -
Relative perf in group graph (10): 98.886%
Benchmark This PR baseline Relative perf Change -
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 675.407000 μs 679.477 μs 100.60% 0.60% .
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 54.046000 μs 54.135 μs 100.16% 0.16% .
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 353371.241000 μs 353404.211 μs 100.01% 0.01% .
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 71859.921 μs 71856.495000 μs 100.00% -0.00% .
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 61.902 μs 61.889000 μs 99.98% -0.02% .
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 72654.469 μs 72543.241000 μs 99.85% -0.15% .
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 57611.310 μs 57263.652000 μs 99.40% -0.60% .
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 5656.640 μs 5611.771000 μs 99.21% -0.79% .
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 5697.200 μs 5615.778000 μs 98.57% -1.43% .
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 386332.337 μs 353223.514000 μs 91.43% -8.57% ----
Relative perf in group Velocity-Bench (9): 99.011%
Benchmark This PR baseline Relative perf Change -
Velocity-Bench svm 0.139000 s 0.141 s 101.22% 1.22% .
Velocity-Bench Sobel Filter 613.620000 ms 615.149 ms 100.25% 0.25% .
Velocity-Bench Easywave 235.000000 ms 235.000 ms 100.00% 0.00% .
Velocity-Bench dl-mnist 2.390000 s 2.390 s 100.00% 0.00% .
Velocity-Bench Hashtable 353.480 M keys/sec 353.884706 M keys/sec 99.89% -0.11% .
Velocity-Bench dl-cifar 23.937 s 23.892100 s 99.81% -0.19% .
Velocity-Bench CudaSift 205.715 ms 204.632000 ms 99.47% -0.53% .
Velocity-Bench QuickSilver 116.260 MMS/CTT 118.320000 MMS/CTT 98.26% -1.74% .
Velocity-Bench Bitcracker 38.640 s 35.731600 s 92.47% -7.53% ---
Relative perf in group Runtime (8): 98.330%
Benchmark This PR baseline Relative perf Change -
Runtime_IndependentDAGTaskThroughput_SingleTask 259.514000 ms 265.060 ms 102.14% 2.14% +
Runtime_DAGTaskThroughput_BasicParallelFor 1737.321000 ms 1747.525 ms 100.59% 0.59% .
Runtime_DAGTaskThroughput_SingleTask 1670.866000 ms 1678.531 ms 100.46% 0.46% .
Runtime_DAGTaskThroughput_NDRangeParallelFor 1677.683000 ms 1682.917 ms 100.31% 0.31% .
Runtime_DAGTaskThroughput_HierarchicalParallelFor 1714.752000 ms 1718.971 ms 100.25% 0.25% .
Runtime_IndependentDAGTaskThroughput_BasicParallelFor 302.978 ms 287.518000 ms 94.90% -5.10% --
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor 293.792 ms 277.037000 ms 94.30% -5.70% ---
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor 292.481 ms 275.224000 ms 94.10% -5.90% ---
Relative perf in group MicroBench (14): 99.797%
Benchmark This PR baseline Relative perf Change -
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous 4.785000 ms 4.832 ms 100.98% 0.98% .
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous 4.689000 ms 4.730 ms 100.87% 0.87% .
MicroBench_HostDeviceBandwidth_1D_D2H_Strided 4.818000 ms 4.854 ms 100.75% 0.75% .
MicroBench_HostDeviceBandwidth_2D_H2D_Strided 5.119000 ms 5.130 ms 100.21% 0.21% .
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous 4.684000 ms 4.690 ms 100.13% 0.13% .
MicroBench_HostDeviceBandwidth_3D_D2H_Strided 617.485 ms 617.480000 ms 100.00% -0.00% .
MicroBench_HostDeviceBandwidth_2D_D2H_Strided 617.543 ms 617.529000 ms 100.00% -0.00% .
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous 618.156 ms 618.122000 ms 99.99% -0.01% .
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous 618.180 ms 618.120000 ms 99.99% -0.01% .
MicroBench_LocalMem_fp32_4096 29.931 ms 29.884000 ms 99.84% -0.16% .
MicroBench_LocalMem_int32_4096 29.945 ms 29.887000 ms 99.81% -0.19% .
MicroBench_HostDeviceBandwidth_3D_H2D_Strided 5.066 ms 5.024000 ms 99.17% -0.83% .
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous 4.833 ms 4.764000 ms 98.57% -1.43% .
MicroBench_HostDeviceBandwidth_1D_H2D_Strided 4.850 ms 4.700000 ms 96.91% -3.09% -
Relative perf in group Pattern (10): 99.964%
Benchmark This PR baseline Relative perf Change -
Pattern_Reduction_NDRange_int32 16.654000 ms 16.720 ms 100.40% 0.40% .
Pattern_SegmentedReduction_Hierarchical_int64 11.767000 ms 11.784 ms 100.14% 0.14% .
Pattern_SegmentedReduction_NDRange_int32 2.162000 ms 2.164 ms 100.09% 0.09% .
Pattern_SegmentedReduction_NDRange_fp32 2.163000 ms 2.165 ms 100.09% 0.09% .
Pattern_SegmentedReduction_NDRange_int16 2.264000 ms 2.266 ms 100.09% 0.09% .
Pattern_SegmentedReduction_NDRange_int64 2.336000 ms 2.338 ms 100.09% 0.09% .
Pattern_SegmentedReduction_Hierarchical_int32 11.587000 ms 11.588 ms 100.01% 0.01% .
Pattern_SegmentedReduction_Hierarchical_fp32 11.587 ms 11.585000 ms 99.98% -0.02% .
Pattern_SegmentedReduction_Hierarchical_int16 11.803 ms 11.799000 ms 99.97% -0.03% .
Pattern_Reduction_Hierarchical_int32 16.921 ms 16.716000 ms 98.79% -1.21% .
Relative perf in group ScalarProduct (6): 100.096%
Benchmark This PR baseline Relative perf Change -
ScalarProduct_NDRange_int32 3.757000 ms 3.769 ms 100.32% 0.32% .
ScalarProduct_NDRange_int64 5.447000 ms 5.461 ms 100.26% 0.26% .
ScalarProduct_Hierarchical_fp32 10.145000 ms 10.158 ms 100.13% 0.13% .
ScalarProduct_Hierarchical_int32 10.531000 ms 10.533 ms 100.02% 0.02% .
ScalarProduct_Hierarchical_int64 11.504 ms 11.502000 ms 99.98% -0.02% .
ScalarProduct_NDRange_fp32 3.778 ms 3.773000 ms 99.87% -0.13% .
Relative perf in group USM (7): 101.680%
Benchmark This PR baseline Relative perf Change -
USM_Allocation_latency_fp32_device 0.055000 ms 0.067 ms 121.82% 21.82% ++++++++++
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch 1.244000 ms 1.256 ms 100.96% 0.96% .
USM_Allocation_latency_fp32_host 37.661 ms 37.342000 ms 99.15% -0.85% .
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch 1.871 ms 1.850000 ms 98.88% -1.12% .
USM_Allocation_latency_fp32_shared 0.058 ms 0.057000 ms 98.28% -1.72% .
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch 1.100 ms 1.074000 ms 97.64% -2.36% -
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch 1.734 ms 1.684000 ms 97.12% -2.88% -
Relative perf in group VectorAddition (3): 100.503%
Benchmark This PR baseline Relative perf Change -
VectorAddition_fp32 1.440000 ms 1.468 ms 101.94% 1.94% .
VectorAddition_int32 1.474000 ms 1.475 ms 100.07% 0.07% .
VectorAddition_int64 3.076 ms 3.061000 ms 99.51% -0.49% .
Relative perf in group Polybench (3): 100.403%
Benchmark This PR baseline Relative perf Change -
Polybench_Atax 6.824000 ms 6.885 ms 100.89% 0.89% .
Polybench_2mm 1.221000 ms 1.227 ms 100.49% 0.49% .
Polybench_3mm 1.732 ms 1.729000 ms 99.83% -0.17% .
Relative perf in group Kmeans (1): 100.044%
Benchmark This PR baseline Relative perf Change -
Kmeans_fp32 16.073000 ms 16.080 ms 100.04% 0.04% .
Relative perf in group LinearRegressionCoeff (1): 100.098%
Benchmark This PR baseline Relative perf Change -
LinearRegressionCoeff_fp32 934.867000 ms 935.779 ms 100.10% 0.10% .
Relative perf in group MolecularDynamics (1): 100.000%
Benchmark This PR baseline Relative perf Change -
MolecularDynamics 0.029000 ms 0.029 ms 100.00% 0.00% .
Relative perf in group llama.cpp (6): 100.040%
Benchmark This PR baseline Relative perf Change -
llama.cpp Prompt Processing Batched 256 871.653691 token/s 867.896 token/s 100.43% 0.43% .
llama.cpp Prompt Processing Batched 128 832.668220 token/s 829.273 token/s 100.41% 0.41% .
llama.cpp Text Generation Batched 256 62.485068 token/s 62.452 token/s 100.05% 0.05% .
llama.cpp Text Generation Batched 128 62.471752 token/s 62.469 token/s 100.00% 0.00% .
llama.cpp Text Generation Batched 512 62.487 token/s 62.506870 token/s 99.97% -0.03% .
llama.cpp Prompt Processing Batched 512 425.918 token/s 428.586901 token/s 99.38% -0.62% .
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 106.148%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider 1870.030000 ns 2119.200 ns 113.32% 13.32% ++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc 2413.570000 ns 2723.560 ns 112.84% 12.84% ++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3092.590000 ns 3124.490 ns 101.03% 1.03% .
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> 300.033 ns 294.824000 ns 98.26% -1.74% .
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 100.357%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> 265.012000 ns 269.830 ns 101.82% 1.82% .
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider 193.658000 ns 195.800 ns 101.11% 1.11% .
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> 213.735 ns 213.357000 ns 99.82% -0.18% .
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc 709.114 ns 699.961000 ns 98.71% -1.29% .
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 96.625%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc 1385.030000 ns 1399.010 ns 101.01% 1.01% .
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> 262.966 ns 260.987000 ns 99.25% -0.75% .
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider 2017.050 ns 1896.370000 ns 94.02% -5.98% ---
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> 3441.860 ns 3183.170000 ns 92.48% -7.52% ---
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 89.777%
Benchmark This PR baseline Relative perf Change -
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> 301.211000 ns 310.425 ns 103.06% 3.06% +
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider 189.844000 ns 192.753 ns 101.53% 1.53% .
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc 918.205 ns 737.865000 ns 80.36% -19.64% ---------
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> 264.591 ns 204.412000 ns 77.26% -22.74% ----------
Relative perf in group alloc/min (4): 100.146%
Benchmark This PR baseline Relative perf Change -
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> 1046.230000 ns 1083.760 ns 103.59% 3.59% ++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> 949.123000 ns 960.784 ns 101.23% 1.23% .
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc 176.547 ns 174.373000 ns 98.77% -1.23% .
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc 825.542 ns 801.763000 ns 97.12% -2.88% -
Relative perf in group multiple (12): 102.635%
Benchmark This PR baseline Relative perf Change -
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc 32389.400000 ns 34482.100 ns 106.46% 6.46% +++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> 25845.900000 ns 27465.300 ns 106.27% 6.27% +++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> 41352.600000 ns 43475.800 ns 105.13% 5.13% ++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> 14377.000000 ns 15099.900 ns 105.03% 5.03% ++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc 29913.200000 ns 31243.600 ns 104.45% 4.45% ++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc 137438.000000 ns 141214.000 ns 102.75% 2.75% +
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc 4108.970000 ns 4207.320 ns 102.39% 2.39% +
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider 144922.000000 ns 147271.000 ns 101.62% 1.62% .
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> 75401.700000 ns 75587.500 ns 100.25% 0.25% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> 1184380.000000 ns 1185020.000 ns 100.05% 0.05% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider 1210790.000 ns 1201570.000000 ns 99.24% -0.76% .
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> 162895.000 ns 160292.000000 ns 98.40% -1.60% .

Details

Benchmark details - environment, command...
api_overhead_benchmark_l0 SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_l0 SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros --multiplier=1

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

miscellaneous_benchmark_sycl VectorSum

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=0 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=1 --NumOpsPerThread=4096 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=0 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=4 --NumOpsPerThread=4096 --iterations=10 --SrcUSM=0 --DstUSM=1

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=10 --withGraphs=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=10 --withGraphs=1

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=100 --withGraphs=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=100 --withGraphs=1

graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=0 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=1 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=1 --numKernels=100

graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=0 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=1 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=1 --numKernels=100

api_overhead_benchmark_ur SubmitKernel out of order CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=1 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order with measure completion

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=1 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Velocity-Bench Hashtable

Environment Variables:

Command:

/home/pmdk/bench_workdir/hashtable/hashtable_sycl --no-verify

Velocity-Bench Bitcracker

Environment Variables:

Command:

/home/pmdk/bench_workdir/bitcracker/bitcracker -f /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Runtime_IndependentDAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_DAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_LocalMem_int32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

MicroBench_LocalMem_fp32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Pattern_Reduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Pattern_Reduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

ScalarProduct_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

USM_Allocation_latency_fp32_device

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_host

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_shared

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

VectorAddition_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Polybench_2mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/2mm.csv --size=512

Polybench_3mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/3mm.csv --size=512

Polybench_Atax

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Atax.csv --size=8192

Kmeans_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Kmeans.csv --size=700000000

LinearRegressionCoeff_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LinearRegressionCoeff.csv --size=1638400000

MolecularDynamics

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/MolecularDynamics.csv --size=8196

llama.cpp Prompt Processing Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

@RossBrunton RossBrunton changed the title RFC: Remove virtual methods from ur_mem_handle_t_ Remove virtual methods from ur_mem_handle_t_ Jan 27, 2025
@RossBrunton RossBrunton marked this pull request as ready for review January 27, 2025 13:42
@RossBrunton RossBrunton requested a review from a team as a code owner January 27, 2025 13:42
We want to transition to handle pointers containing the ddi table as the
first element. For this to work, the handle object must not have a vtable.

Since ur_mem_handle_t_ is relatively simple, it's easy enough to roll
out our own version of dynamic dispatch.
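
As a rough illustration of the idea, and not the code in this PR, hand-rolled dispatch can be a kind tag plus a `switch`, which leaves the first word of the object free for the ddi table pointer. Every name below (`ddi_table_t`, `mem_kind_t`, `mem_base_t`, `mem_buffer_t`, `mem_image_t`, `mem_element_count`) is hypothetical and chosen only for this sketch; the real `ur_mem_handle_t_` layout may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the real DDI table type.
struct ddi_table_t {
    int version;
};

// Tag that replaces the vtable pointer: it records the concrete kind.
enum class mem_kind_t { buffer, image };

// Common header. With no virtual functions, the compiler adds no hidden
// vtable pointer, so `ddi` really is the first thing in the object.
struct mem_base_t {
    const ddi_table_t *ddi;
    mem_kind_t kind;
};

// Concrete kinds embed the header as their first member (composition keeps
// the types standard-layout, so the pointer casts below are well defined).
struct mem_buffer_t {
    mem_base_t base;
    std::size_t size;
};

struct mem_image_t {
    mem_base_t base;
    std::size_t width, height;
};

static_assert(std::is_standard_layout_v<mem_buffer_t>, "no vtable, C layout");
static_assert(std::is_standard_layout_v<mem_image_t>, "no vtable, C layout");
static_assert(offsetof(mem_base_t, ddi) == 0, "ddi table must come first");

// Manual "virtual call": branch on the tag instead of going through a vtable.
inline std::size_t mem_element_count(const mem_base_t *mem) {
    switch (mem->kind) {
    case mem_kind_t::buffer:
        return reinterpret_cast<const mem_buffer_t *>(mem)->size;
    case mem_kind_t::image: {
        const auto *img = reinterpret_cast<const mem_image_t *>(mem);
        return img->width * img->height;
    }
    }
    assert(false && "unknown mem kind");
    return 0;
}
```

A tag and `switch` is only one option; a table of plain function pointers stored after the ddi pointer would satisfy the same layout constraint, at the cost of an extra field per object.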