Remove virtual methods from ur_mem_handle_t_ #2620

RossBrunton · 2025-01-27T12:36:34Z

We want to transition to handles that point to objects whose first
element is the ddi table. For this to work, handle objects must not have
a vtable, because compilers typically store the vtable pointer at the
start of the object.

Since ur_mem_handle_t_ is relatively simple, it's easy enough to roll
our own version of dynamic dispatch.
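
As a rough illustration of the idea (not the code in this PR), a handle without virtual methods can carry a tag and branch on it by hand. Every name below (example_ddi_table_t, example_mem_handle_t, kind_t, sizeInBytes) is hypothetical, and storing the ddi table as a pointer in the first member is an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for the ddi table; the real table is not shown here.
struct example_ddi_table_t {
    int version;
};

// Hypothetical handle object. It has no virtual methods, so there is no
// hidden vtable pointer and the ddi pointer can sit at offset 0.
struct example_mem_handle_t {
    enum class kind_t { buffer, image };

    const example_ddi_table_t *ddi; // assumed layout: must stay the first member
    kind_t kind;                    // tag that takes the place of the vtable
    std::size_t elements;

    // Hand-rolled dispatch: branch on the tag instead of a virtual call.
    std::size_t sizeInBytes() const {
        switch (kind) {
        case kind_t::buffer:
            return elements;     // assumed: one byte per buffer element
        case kind_t::image:
            return elements * 4; // assumed: four bytes per image pixel
        }
        assert(false && "unknown kind");
        return 0;
    }
};

// Compile-time checks for the layout property the description relies on.
static_assert(!std::is_polymorphic<example_mem_handle_t>::value,
              "handle must not have a vtable");
static_assert(std::is_standard_layout<example_mem_handle_t>::value,
              "offsetof is only guaranteed for standard-layout types");
static_assert(offsetof(example_mem_handle_t, ddi) == 0,
              "ddi pointer must be the first element");
```

With this layout, code that only knows the handle as an opaque pointer can read the ddi pointer from the start of the object. Adding any virtual method to the struct would make the first static_assert fail.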

github-actions · 2025-01-27T12:37:21Z

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12989137257

github-actions · 2025-01-27T13:35:37Z

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12989137257
Job status: success. Test status: success.

Summary

A total of 138 benchmarks are included in the mean.
Geomean 99.919%.
Improved 16 Regressed 15 (threshold 2.00%)

(In each table row, the better of the two results is the one printed with six decimal places.)
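
The relative-perf and improved/regressed figures can be reproduced roughly as follows. This is a sketch inferred from the numbers in the tables, not the benchmark scripts themselves: the names Direction, Verdict, relativePerf and classify are hypothetical, and treating the threshold as a strict inequality is an assumption.

```cpp
#include <cassert>

// Whether a smaller or larger measurement is the better one (assumed helper).
enum class Direction { LowerIsBetter, HigherIsBetter };

enum class Verdict { Improved, Unchanged, Regressed };

// Relative performance as a ratio; values above 1.0 mean this PR is better.
// Times use baseline / pr, throughputs use pr / baseline.
inline double relativePerf(double pr, double baseline, Direction d) {
    assert(pr > 0.0 && baseline > 0.0);
    return d == Direction::LowerIsBetter ? baseline / pr : pr / baseline;
}

// Classify a ratio against a threshold given as a fraction (0.02 for 2.00%).
// Assumption: a change exactly at the threshold counts as unchanged.
inline Verdict classify(double rel, double threshold = 0.02) {
    const double change = rel - 1.0;
    if (change > threshold) {
        return Verdict::Improved;
    }
    if (change < -threshold) {
        return Verdict::Regressed;
    }
    return Verdict::Unchanged;
}
```

For instance, a time of 2.116 μs against a baseline of 2.175 μs gives a ratio of about 1.0279, which is above the 2% threshold and so counts as improved.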

Performance change in benchmark groups

Relative perf in group api (12): 100.144%

Benchmark	This PR	baseline	Relative perf	Change	-
api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024	2.116000 μs	2.175 μs	102.79%	2.79%	+
api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024	1.663000 μs	1.706 μs	102.59%	2.59%	+
api_overhead_benchmark_ur SubmitKernel out of order	15.966000 μs	16.073 μs	100.67%	0.67%	.
api_overhead_benchmark_sycl SubmitKernel in order	24.579000 μs	24.664 μs	100.35%	0.35%	.
api_overhead_benchmark_ur SubmitKernel in order	16.648000 μs	16.703 μs	100.33%	0.33%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count	124071.000 instr	123991.000000 instr	99.94%	-0.06%	.
api_overhead_benchmark_ur SubmitKernel in order CPU count	110895.000 instr	110815.000000 instr	99.93%	-0.07%	.
api_overhead_benchmark_ur SubmitKernel out of order CPU count	105543.000 instr	105463.000000 instr	99.92%	-0.08%	.
api_overhead_benchmark_l0 SubmitKernel in order	11.852 μs	11.800000 μs	99.56%	-0.44%	.
api_overhead_benchmark_sycl SubmitKernel out of order	23.425 μs	23.287000 μs	99.41%	-0.59%	.
api_overhead_benchmark_l0 SubmitKernel out of order	11.785 μs	11.629000 μs	98.68%	-1.32%	.
api_overhead_benchmark_ur SubmitKernel in order with measure completion	21.982 μs	21.473000 μs	97.68%	-2.32%	-

Relative perf in group memory (4): 100.303%

Benchmark	This PR	baseline	Relative perf	Change	-
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024	253.243000 μs	256.472 μs	101.28%	1.28%	.
memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240	3.084000 GB/s	3.074 GB/s	100.33%	0.33%	.
memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024	5.930000 μs	5.932 μs	100.03%	0.03%	.
memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024	220.111 μs	219.201000 μs	99.59%	-0.41%	.
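
The group figure appears to be the geometric mean of the per-benchmark ratios. A minimal check using the four memory rows is sketched below; geomean and memoryGroupRatios are hypothetical names, and the ratios are recomputed from the rounded values shown in the table.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive ratios, computed through logarithms so that
// long lists do not overflow or underflow the running product.
inline double geomean(const std::vector<double> &ratios) {
    assert(!ratios.empty());
    double logSum = 0.0;
    for (double r : ratios) {
        logSum += std::log(r);
    }
    return std::exp(logSum / static_cast<double>(ratios.size()));
}

// Ratios for the memory group, oriented so that above 1.0 favours this PR.
inline std::vector<double> memoryGroupRatios() {
    return {
        256.472 / 253.243, // QueueInOrderMemcpy Device to Device (time)
        3.084 / 3.074,     // StreamMemory Triad (throughput)
        5.932 / 5.930,     // QueueMemcpy Device to Device (time)
        219.201 / 220.111, // QueueInOrderMemcpy Host to Device (time)
    };
}
```

Calling geomean(memoryGroupRatios()) gives roughly 1.0030, in line with the 100.303% reported for the group.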

Relative perf in group miscellaneous (1): 106.845%

Benchmark	This PR	baseline	Relative perf	Change	-
miscellaneous_benchmark_sycl VectorSum	806.080000 bw GB/s	861.253 bw GB/s	106.84%	6.84%	+++

Relative perf in group multithread (10): 100.053%

Benchmark	This PR	baseline	Relative perf	Change	-
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1	16968.308000 μs	17230.283 μs	101.54%	1.54%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1	8936.953000 μs	9073.725 μs	101.53%	1.53%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1	1197.227000 μs	1210.999 μs	101.15%	1.15%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events	113293.620000 μs	114139.645 μs	100.75%	0.75%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1	47170.440000 μs	47306.654 μs	100.29%	0.29%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1	6951.458 μs	6943.025000 μs	99.88%	-0.12%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events	43161.273 μs	43064.999000 μs	99.78%	-0.22%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1	7868.920 μs	7821.718000 μs	99.40%	-0.60%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1	2097.041 μs	2083.870000 μs	99.37%	-0.63%	.
multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1	27555.344 μs	26707.698000 μs	96.92%	-3.08%	-

Relative perf in group graph (10): 98.886%

Benchmark	This PR	baseline	Relative perf	Change	-
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100	675.407000 μs	679.477 μs	100.60%	0.60%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10	54.046000 μs	54.135 μs	100.16%	0.16%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100	353371.241000 μs	353404.211 μs	100.01%	0.01%	.
graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10	71859.921 μs	71856.495000 μs	100.00%	-0.00%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10	61.902 μs	61.889000 μs	99.98%	-0.02%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10	72654.469 μs	72543.241000 μs	99.85%	-0.15%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100	57611.310 μs	57263.652000 μs	99.40%	-0.60%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10	5656.640 μs	5611.771000 μs	99.21%	-0.79%	.
graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10	5697.200 μs	5615.778000 μs	98.57%	-1.43%	.
graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100	386332.337 μs	353223.514000 μs	91.43%	-8.57%	----

Relative perf in group Velocity-Bench (9): 99.011%

Benchmark	This PR	baseline	Relative perf	Change	-
Velocity-Bench svm	0.139000 s	0.141 s	101.22%	1.22%	.
Velocity-Bench Sobel Filter	613.620000 ms	615.149 ms	100.25%	0.25%	.
Velocity-Bench Easywave	235.000000 ms	235.000 ms	100.00%	0.00%	.
Velocity-Bench dl-mnist	2.390000 s	2.390 s	100.00%	0.00%	.
Velocity-Bench Hashtable	353.480 M keys/sec	353.884706 M keys/sec	99.89%	-0.11%	.
Velocity-Bench dl-cifar	23.937 s	23.892100 s	99.81%	-0.19%	.
Velocity-Bench CudaSift	205.715 ms	204.632000 ms	99.47%	-0.53%	.
Velocity-Bench QuickSilver	116.260 MMS/CTT	118.320000 MMS/CTT	98.26%	-1.74%	.
Velocity-Bench Bitcracker	38.640 s	35.731600 s	92.47%	-7.53%	---

Relative perf in group Runtime (8): 98.330%

Benchmark	This PR	baseline	Relative perf	Change	-
Runtime_IndependentDAGTaskThroughput_SingleTask	259.514000 ms	265.060 ms	102.14%	2.14%	+
Runtime_DAGTaskThroughput_BasicParallelFor	1737.321000 ms	1747.525 ms	100.59%	0.59%	.
Runtime_DAGTaskThroughput_SingleTask	1670.866000 ms	1678.531 ms	100.46%	0.46%	.
Runtime_DAGTaskThroughput_NDRangeParallelFor	1677.683000 ms	1682.917 ms	100.31%	0.31%	.
Runtime_DAGTaskThroughput_HierarchicalParallelFor	1714.752000 ms	1718.971 ms	100.25%	0.25%	.
Runtime_IndependentDAGTaskThroughput_BasicParallelFor	302.978 ms	287.518000 ms	94.90%	-5.10%	--
Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor	293.792 ms	277.037000 ms	94.30%	-5.70%	---
Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor	292.481 ms	275.224000 ms	94.10%	-5.90%	---

Relative perf in group MicroBench (14): 99.797%

Benchmark	This PR	baseline	Relative perf	Change	-
MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous	4.785000 ms	4.832 ms	100.98%	0.98%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous	4.689000 ms	4.730 ms	100.87%	0.87%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Strided	4.818000 ms	4.854 ms	100.75%	0.75%	.
MicroBench_HostDeviceBandwidth_2D_H2D_Strided	5.119000 ms	5.130 ms	100.21%	0.21%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous	4.684000 ms	4.690 ms	100.13%	0.13%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Strided	617.485 ms	617.480000 ms	100.00%	-0.00%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Strided	617.543 ms	617.529000 ms	100.00%	-0.00%	.
MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous	618.156 ms	618.122000 ms	99.99%	-0.01%	.
MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous	618.180 ms	618.120000 ms	99.99%	-0.01%	.
MicroBench_LocalMem_fp32_4096	29.931 ms	29.884000 ms	99.84%	-0.16%	.
MicroBench_LocalMem_int32_4096	29.945 ms	29.887000 ms	99.81%	-0.19%	.
MicroBench_HostDeviceBandwidth_3D_H2D_Strided	5.066 ms	5.024000 ms	99.17%	-0.83%	.
MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous	4.833 ms	4.764000 ms	98.57%	-1.43%	.
MicroBench_HostDeviceBandwidth_1D_H2D_Strided	4.850 ms	4.700000 ms	96.91%	-3.09%	-

Relative perf in group Pattern (10): 99.964%

Benchmark	This PR	baseline	Relative perf	Change	-
Pattern_Reduction_NDRange_int32	16.654000 ms	16.720 ms	100.40%	0.40%	.
Pattern_SegmentedReduction_Hierarchical_int64	11.767000 ms	11.784 ms	100.14%	0.14%	.
Pattern_SegmentedReduction_NDRange_int32	2.162000 ms	2.164 ms	100.09%	0.09%	.
Pattern_SegmentedReduction_NDRange_fp32	2.163000 ms	2.165 ms	100.09%	0.09%	.
Pattern_SegmentedReduction_NDRange_int16	2.264000 ms	2.266 ms	100.09%	0.09%	.
Pattern_SegmentedReduction_NDRange_int64	2.336000 ms	2.338 ms	100.09%	0.09%	.
Pattern_SegmentedReduction_Hierarchical_int32	11.587000 ms	11.588 ms	100.01%	0.01%	.
Pattern_SegmentedReduction_Hierarchical_fp32	11.587 ms	11.585000 ms	99.98%	-0.02%	.
Pattern_SegmentedReduction_Hierarchical_int16	11.803 ms	11.799000 ms	99.97%	-0.03%	.
Pattern_Reduction_Hierarchical_int32	16.921 ms	16.716000 ms	98.79%	-1.21%	.

Relative perf in group ScalarProduct (6): 100.096%

Benchmark	This PR	baseline	Relative perf	Change	-
ScalarProduct_NDRange_int32	3.757000 ms	3.769 ms	100.32%	0.32%	.
ScalarProduct_NDRange_int64	5.447000 ms	5.461 ms	100.26%	0.26%	.
ScalarProduct_Hierarchical_fp32	10.145000 ms	10.158 ms	100.13%	0.13%	.
ScalarProduct_Hierarchical_int32	10.531000 ms	10.533 ms	100.02%	0.02%	.
ScalarProduct_Hierarchical_int64	11.504 ms	11.502000 ms	99.98%	-0.02%	.
ScalarProduct_NDRange_fp32	3.778 ms	3.773000 ms	99.87%	-0.13%	.

Relative perf in group USM (7): 101.680%

Benchmark	This PR	baseline	Relative perf	Change	-
USM_Allocation_latency_fp32_device	0.055000 ms	0.067 ms	121.82%	21.82%	++++++++++
USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch	1.244000 ms	1.256 ms	100.96%	0.96%	.
USM_Allocation_latency_fp32_host	37.661 ms	37.342000 ms	99.15%	-0.85%	.
USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch	1.871 ms	1.850000 ms	98.88%	-1.12%	.
USM_Allocation_latency_fp32_shared	0.058 ms	0.057000 ms	98.28%	-1.72%	.
USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch	1.100 ms	1.074000 ms	97.64%	-2.36%	-
USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch	1.734 ms	1.684000 ms	97.12%	-2.88%	-

Relative perf in group VectorAddition (3): 100.503%

Benchmark	This PR	baseline	Relative perf	Change	-
VectorAddition_fp32	1.440000 ms	1.468 ms	101.94%	1.94%	.
VectorAddition_int32	1.474000 ms	1.475 ms	100.07%	0.07%	.
VectorAddition_int64	3.076 ms	3.061000 ms	99.51%	-0.49%	.

Relative perf in group Polybench (3): 100.403%

Benchmark	This PR	baseline	Relative perf	Change	-
Polybench_Atax	6.824000 ms	6.885 ms	100.89%	0.89%	.
Polybench_2mm	1.221000 ms	1.227 ms	100.49%	0.49%	.
Polybench_3mm	1.732 ms	1.729000 ms	99.83%	-0.17%	.

Relative perf in group Kmeans (1): 100.044%

Benchmark	This PR	baseline	Relative perf	Change	-
Kmeans_fp32	16.073000 ms	16.080 ms	100.04%	0.04%	.

Relative perf in group LinearRegressionCoeff (1): 100.098%

Benchmark	This PR	baseline	Relative perf	Change	-
LinearRegressionCoeff_fp32	934.867000 ms	935.779 ms	100.10%	0.10%	.

Relative perf in group MolecularDynamics (1): 100.000%

Benchmark	This PR	baseline	Relative perf	Change	-
MolecularDynamics	0.029000 ms	0.029 ms	100.00%	0.00%	.

Relative perf in group llama.cpp (6): 100.040%

Benchmark	This PR	baseline	Relative perf	Change	-
llama.cpp Prompt Processing Batched 256	871.653691 token/s	867.896 token/s	100.43%	0.43%	.
llama.cpp Prompt Processing Batched 128	832.668220 token/s	829.273 token/s	100.41%	0.41%	.
llama.cpp Text Generation Batched 256	62.485068 token/s	62.452 token/s	100.05%	0.05%	.
llama.cpp Text Generation Batched 128	62.471752 token/s	62.469 token/s	100.00%	0.00%	.
llama.cpp Text Generation Batched 512	62.487 token/s	62.506870 token/s	99.97%	-0.03%	.
llama.cpp Prompt Processing Batched 512	425.918 token/s	428.586901 token/s	99.38%	-0.62%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 106.148%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider	1870.030000 ns	2119.200 ns	113.32%	13.32%	++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 glibc	2413.570000 ns	2723.560 ns	112.84%	12.84%	++++++
alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3092.590000 ns	3124.490 ns	101.03%	1.03%	.
alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider>	300.033 ns	294.824000 ns	98.26%	-1.74%	.

Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 100.357%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider>	265.012000 ns	269.830 ns	101.82%	1.82%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider	193.658000 ns	195.800 ns	101.11%	1.11%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider>	213.735 ns	213.357000 ns	99.82%	-0.18%	.
alloc/size:10000/0/4096/iterations:200000/threads:1 glibc	709.114 ns	699.961000 ns	98.71%	-1.29%	.

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 96.625%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc	1385.030000 ns	1399.010 ns	101.01%	1.01%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider>	262.966 ns	260.987000 ns	99.25%	-0.75%	.
alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider	2017.050 ns	1896.370000 ns	94.02%	-5.98%	---
alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider>	3441.860 ns	3183.170000 ns	92.48%	-7.52%	---

Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 89.777%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider>	301.211000 ns	310.425 ns	103.06%	3.06%	+
alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider	189.844000 ns	192.753 ns	101.53%	1.53%	.
alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc	918.205 ns	737.865000 ns	80.36%	-19.64%	---------
alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider>	264.591 ns	204.412000 ns	77.26%	-22.74%	----------

Relative perf in group alloc/min (4): 100.146%

Benchmark	This PR	baseline	Relative perf	Change	-
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider>	1046.230000 ns	1083.760 ns	103.59%	3.59%	++
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider>	949.123000 ns	960.784 ns	101.23%	1.23%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc	176.547 ns	174.373000 ns	98.77%	-1.23%	.
alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc	825.542 ns	801.763000 ns	97.12%	-2.88%	-

Relative perf in group multiple (12): 102.635%

Benchmark	This PR	baseline	Relative perf	Change	-
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc	32389.400000 ns	34482.100 ns	106.46%	6.46%	+++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider>	25845.900000 ns	27465.300 ns	106.27%	6.27%	+++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider>	41352.600000 ns	43475.800 ns	105.13%	5.13%	++
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider>	14377.000000 ns	15099.900 ns	105.03%	5.03%	++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc	29913.200000 ns	31243.600 ns	104.45%	4.45%	++
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc	137438.000000 ns	141214.000 ns	102.75%	2.75%	+
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc	4108.970000 ns	4207.320 ns	102.39%	2.39%	+
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider	144922.000000 ns	147271.000 ns	101.62%	1.62%	.
multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider>	75401.700000 ns	75587.500 ns	100.25%	0.25%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider>	1184380.000000 ns	1185020.000 ns	100.05%	0.05%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider	1210790.000 ns	1201570.000000 ns	99.24%	-0.76%	.
multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider>	162895.000 ns	160292.000000 ns	98.40%	-1.60%	.

Details

Benchmark details - environment, command...

api_overhead_benchmark_l0 SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_l0 SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_l0 --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_sycl SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=StreamMemory --csv --noHeaders --iterations=10000 --type=Triad --size=10240 --memoryPlacement=Device --useEvents=0 --contents=Zeros --multiplier=1

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

miscellaneous_benchmark_sycl VectorSum

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/miscellaneous_benchmark_sycl --test=VectorSum --csv --noHeaders --iterations=1000 --numberOfElementsX=512 --numberOfElementsY=256 --numberOfElementsZ=256

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=1 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=1 --NumOpsPerThread=400 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=102400 --NumThreads=8 --NumOpsPerThread=100 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=8 --NumOpsPerThread=400 --iterations=1000 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=1 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=16 --NumOpsPerThread=10 --iterations=10000 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=0 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=1 --NumOpsPerThread=4096 --iterations=10 --SrcUSM=0 --DstUSM=1

multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/multithread_benchmark_ur --test=MemcpyExecute --csv --noHeaders --Ioq=1 --UseEvents=0 --MeasureCompletion=1 --UseQueuePerThread=1 --AllocSize=1024 --NumThreads=4 --NumOpsPerThread=4096 --iterations=10 --SrcUSM=0 --DstUSM=1

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=10 --withGraphs=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=10 --withGraphs=1

graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=100 --withGraphs=0

graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SinKernelGraph --csv --noHeaders --iterations=100 --numKernels=100 --withGraphs=1

graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=0 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=1 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=1 --ioq=1 --numKernels=100

graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=0 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=1 --numKernels=10

graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/graph_api_benchmark_sycl --test=SubmitExecGraph --csv --noHeaders --iterations=100 --measureSubmit=0 --ioq=1 --numKernels=100

api_overhead_benchmark_ur SubmitKernel out of order CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel out of order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=0 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=0 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=1 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

api_overhead_benchmark_ur SubmitKernel in order with measure completion

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_ur --test=SubmitKernel --csv --noHeaders --Ioq=1 --DiscardEvents=0 --MeasureCompletion=1 --iterations=100000 --Profiling=0 --NumKernels=10 --KernelExecTime=1

Velocity-Bench Hashtable

Environment Variables:

Command:

/home/pmdk/bench_workdir/hashtable/hashtable_sycl --no-verify

Velocity-Bench Bitcracker

Environment Variables:

Command:

/home/pmdk/bench_workdir/bitcracker/bitcracker -f /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/img_win8_user_hash.txt -d /home/pmdk/bench_workdir/velocity-bench-repo/bitcracker/hash_pass/user_passwords_60000.txt -b 60000

Velocity-Bench CudaSift

Environment Variables:

Command:

/home/pmdk/bench_workdir/cudaSift/cudaSift

Velocity-Bench Easywave

Environment Variables:

Command:

/home/pmdk/bench_workdir/easywave/easyWave_sycl -grid /home/pmdk/bench_workdir/data/easywave/examples/e2Asean.grd -source /home/pmdk/bench_workdir/data/easywave/examples/BengkuluSept2007.flt -time 120

Velocity-Bench QuickSilver

Environment Variables:

QS_DEVICE=GPU

Command:

/home/pmdk/bench_workdir/QuickSilver/qs -i /home/pmdk/bench_workdir/velocity-bench-repo/QuickSilver/Examples/AllScattering/scatteringOnly.inp

Velocity-Bench Sobel Filter

Environment Variables:

OPENCV_IO_MAX_IMAGE_PIXELS=1677721600

Command:

/home/pmdk/bench_workdir/sobel_filter/sobel_filter -i /home/pmdk/bench_workdir/data/sobel_filter/sobel_filter_data/silverfalls_32Kx32K.png -n 5

Velocity-Bench dl-cifar

Environment Variables:

Command:

/home/pmdk/bench_workdir/dl-cifar/dl-cifar_sycl

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

Velocity-Bench svm

Environment Variables:

Command:

/home/pmdk/bench_workdir/svm/svm_sycl /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a9a /home/pmdk/bench_workdir/velocity-bench-repo/svm/SYCL/a.m

Runtime_IndependentDAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_independent --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/IndependentDAGTaskThroughput_multi.csv --size=32768

Runtime_DAGTaskThroughput_SingleTask

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_BasicParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_HierarchicalParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

Runtime_DAGTaskThroughput_NDRangeParallelFor

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/dag_task_throughput_sequential --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/DAGTaskThroughput_multi.csv --size=327680

MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_H2D_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_1D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_2D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_HostDeviceBandwidth_3D_D2H_Strided

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/host_device_bandwidth --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/HostDeviceBandwidth_multi.csv --size=512

MicroBench_LocalMem_int32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

MicroBench_LocalMem_fp32_4096

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/local_mem --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LocalMem_multi.csv --size=10240000

Pattern_Reduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

Pattern_Reduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/reduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_Reduction_multi.csv --size=10240000

ScalarProduct_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

ScalarProduct_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/scalar_prod --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/ScalarProduct_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_NDRange_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int16

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

Pattern_SegmentedReduction_Hierarchical_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/segmentedreduction --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Pattern_SegmentedReduction_multi.csv --size=102400000

USM_Allocation_latency_fp32_device

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_host

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Allocation_latency_fp32_shared

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_allocation_latency --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Allocation_latency_multi.csv --size=1024000000

USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/usm_instr_mix --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/USM_Instr_Mix_multi.csv --size=8192

VectorAddition_int32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_int64

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

VectorAddition_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/vec_add --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/VectorAddition_multi.csv --size=102400000

Polybench_2mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/2mm --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/2mm.csv --size=512

Polybench_3mm

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/3mm --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/3mm.csv --size=512

Polybench_Atax

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/atax --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Atax.csv --size=8192

Kmeans_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/kmeans --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/Kmeans.csv --size=700000000

LinearRegressionCoeff_fp32

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/lin_reg_coeff --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/LinearRegressionCoeff.csv --size=1638400000

MolecularDynamics

Environment Variables:

Command:

/home/pmdk/bench_workdir/sycl-bench-build/mol_dyn --warmup-run --num-runs=3 --output=/home/pmdk/bench_workdir/MolecularDynamics.csv --size=8196

llama.cpp Prompt Processing Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 128

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 256

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Prompt Processing Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

llama.cpp Text Generation Batched 512

Environment Variables:

Command:

/home/pmdk/bench_workdir/llamacpp-build/bin/llama-bench --output csv -n 128 -p 512 -b 128,256,512 --numa isolate -t 56 --model /home/pmdk/bench_workdir/models/Phi-3-mini-4k-instruct-q4.gguf

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc