Replace loader handles with field at start of handle data #2622

Draft · RossBrunton wants to merge 3 commits into main from ross/nohandle

Conversation

RossBrunton (Contributor) commented:

Currently this only works for L0 (v1) and HIP.
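For readers unfamiliar with the approach, here is a minimal sketch of the layout this PR moves toward. All names below are illustrative stand-ins, not the actual unified-runtime definitions; the point is only that the loader can recover the adapter's dispatch (DDI) table directly from the first bytes of any handle.

```cpp
#include <cstddef>

struct ur_dditable_t; // stand-in for the adapter's function-pointer table

// Sketch: every adapter handle starts with a pointer to its DDI table.
// The struct must be standard-layout (no vtable), so this field is
// guaranteed to live at offset 0.
struct ur_mem_handle_t_ {
    ur_dditable_t *ddi; // always the first member
    void *allocation;   // adapter-specific state follows
    size_t size;
};

// The loader can then find the right adapter from the handle alone:
inline ur_dditable_t *ddiTableOf(void *handle) {
    // Valid because the DDI pointer sits at offset 0 of every handle.
    return *static_cast<ur_dditable_t **>(handle);
}
```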

@github-actions bot added labels on Jan 27, 2025: loader (Loader related feature/bug), level-zero (L0 adapter specific issues), hip (HIP adapter specific issues), command-buffer (Command Buffer feature addition/changes/specification)
@github-actions bot added labels on Jan 27, 2025: cuda (CUDA adapter specific issues), native-cpu (Native CPU adapter specific issues)

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12995136466

@github-actions bot added the common (Changes or additions to common utilities) label on Jan 27, 2025

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12995136466
Job status: success. Test status: success.

Summary

Total: 38 benchmarks included in the mean.
Geomean: 100.206%.
Improved: 6, regressed: 7 (threshold: 2.00%).
(Overall result is better than baseline.)
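For reference, the summary numbers are geometric means of the per-benchmark relative-perf values. Below is a minimal sketch of the arithmetic as I read it (not the bot's actual code); it assumes relative perf = baseline / this-PR for lower-is-better metrics such as time:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Example (thisPr, baseline) pairs taken from the tables below.
    struct Result { double thisPr, baseline; };
    std::vector<Result> results = {{5.825, 5.932}, {2.131, 2.175}};

    // Geometric mean computed in log space for numerical stability.
    double logSum = 0.0;
    for (const auto &r : results)
        logSum += std::log(r.baseline / r.thisPr); // >1 means this PR is faster

    double geomean = std::exp(logSum / results.size()) * 100.0;
    std::printf("Geomean %.3f%%\n", geomean);
    // "Improved"/"Regressed" counts only include changes past the 2.00% threshold.
    return 0;
}
```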

Performance change in benchmark groups

Relative perf in group memory (4): 100.808%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.825000 μs | 5.932 μs | 101.84% | 1.84% | . |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 255.579000 μs | 256.472 μs | 100.35% | 0.35% | . |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 218.665000 μs | 219.201 μs | 100.25% | 0.25% | . |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | - | 3.074000 GB/s | | | |
Relative perf in group api (12): 101.685%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.131000 μs | 2.175 μs | 102.06% | 2.06% | ++ |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.684000 μs | 1.706 μs | 101.31% | 1.31% | . |
| api_overhead_benchmark_l0 SubmitKernel out of order | - | 11.629000 μs | | | |
| api_overhead_benchmark_l0 SubmitKernel in order | - | 11.800000 μs | | | |
| api_overhead_benchmark_sycl SubmitKernel out of order | - | 23.287000 μs | | | |
| api_overhead_benchmark_sycl SubmitKernel in order | - | 24.664000 μs | | | |
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | - | 105463.000000 instr | | | |
| api_overhead_benchmark_ur SubmitKernel out of order | - | 16.073000 μs | | | |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | - | 110815.000000 instr | | | |
| api_overhead_benchmark_ur SubmitKernel in order | - | 16.703000 μs | | | |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | - | 123991.000000 instr | | | |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | - | 21.473000 μs | | | |
Relative perf in group Velocity-Bench (9): 99.170%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Velocity-Bench dl-mnist | 2.410 s | 2.390000 s | 99.17% | -0.83% | . |
| Velocity-Bench Hashtable | - | 353.884706 M keys/sec | | | |
| Velocity-Bench Bitcracker | - | 35.731600 s | | | |
| Velocity-Bench CudaSift | - | 204.632000 ms | | | |
| Velocity-Bench Easywave | - | 235.000000 ms | | | |
| Velocity-Bench QuickSilver | - | 118.320000 MMS/CTT | | | |
| Velocity-Bench Sobel Filter | - | 615.149000 ms | | | |
| Velocity-Bench dl-cifar | - | 23.892100 s | | | |
| Velocity-Bench svm | - | 0.140700 s | | | |
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 98.331%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2090.550000 ns | 2119.200 ns | 101.37% | 1.37% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2695.080000 ns | 2723.560 ns | 101.06% | 1.06% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 305.752 ns | 294.824000 ns | 96.43% | -3.57% | --- |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3301.340 ns | 3124.490000 ns | 94.64% | -5.36% | ---- |
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.292%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 192.902000 ns | 195.800 ns | 101.50% | 1.50% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 213.998 ns | 213.357000 ns | 99.70% | -0.30% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 708.029 ns | 699.961000 ns | 98.86% | -1.14% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 277.730 ns | 269.830000 ns | 97.16% | -2.84% | -- |
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 102.184%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1234.090000 ns | 1399.010 ns | 113.36% | 13.36% | ++++++++++ |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1872.920000 ns | 1896.370 ns | 101.25% | 1.25% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 261.410 ns | 260.987000 ns | 99.84% | -0.16% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3345.870 ns | 3183.170000 ns | 95.14% | -4.86% | ---- |
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 98.976%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 190.629000 ns | 192.753 ns | 101.11% | 1.11% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 204.159000 ns | 204.412 ns | 100.12% | 0.12% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 746.724 ns | 737.865000 ns | 98.81% | -1.19% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 323.599 ns | 310.425000 ns | 95.93% | -4.07% | --- |
Relative perf in group alloc/min (4): 100.564%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1031.030000 ns | 1083.760 ns | 105.11% | 5.11% | ++++ |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 963.666 ns | 960.784000 ns | 99.70% | -0.30% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.796 ns | 174.373000 ns | 99.19% | -0.81% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 814.890 ns | 801.763000 ns | 98.39% | -1.61% | . |
Relative perf in group multiple (12): 100.475%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25725.600000 ns | 27465.300 ns | 106.76% | 6.76% | +++++ |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1168990.000000 ns | 1201570.000 ns | 102.79% | 2.79% | ++ |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 30426.700000 ns | 31243.600 ns | 102.68% | 2.68% | ++ |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33868.700000 ns | 34482.100 ns | 101.81% | 1.81% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 14839.400000 ns | 15099.900 ns | 101.76% | 1.76% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 42798.400000 ns | 43475.800 ns | 101.58% | 1.58% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 148110.000 ns | 147271.000000 ns | 99.43% | -0.57% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 76090.100 ns | 75587.500000 ns | 99.34% | -0.66% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 142794.000 ns | 141214.000000 ns | 98.89% | -1.11% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4271.520 ns | 4207.320000 ns | 98.50% | -1.50% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1218490.000 ns | 1185020.000000 ns | 97.25% | -2.75% | -- |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 168045.000 ns | 160292.000000 ns | 95.39% | -4.61% | --- |
Relative perf in group miscellaneous (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| miscellaneous_benchmark_sycl VectorSum | - | 861.253000 bw GB/s |
Relative perf in group multithread (10): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | - | 6943.025000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | - | 17230.283000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | - | 47306.654000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | - | 2083.870000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | - | 7821.718000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | - | 9073.725000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | - | 26707.698000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | - | 1210.999000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | - | 43064.999000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | - | 114139.645000 μs |
Relative perf in group graph (10): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | - | 71856.495000 μs |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | - | 72543.241000 μs |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | - | 353404.211000 μs |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | - | 353223.514000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | - | 54.135000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | - | 61.889000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | - | 679.477000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | - | 5611.771000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | - | 5615.778000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | - | 57263.652000 μs |
Relative perf in group Runtime (8): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Runtime_IndependentDAGTaskThroughput_SingleTask | - | 265.060000 ms |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | - | 287.518000 ms |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | - | 277.037000 ms |
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | - | 275.224000 ms |
| Runtime_DAGTaskThroughput_SingleTask | - | 1678.531000 ms |
| Runtime_DAGTaskThroughput_BasicParallelFor | - | 1747.525000 ms |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | - | 1718.971000 ms |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | - | 1682.917000 ms |
Relative perf in group MicroBench (14): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | - | 4.832000 ms |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | - | 4.730000 ms |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | - | 4.690000 ms |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | - | 4.764000 ms |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | - | 618.120000 ms |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | - | 618.122000 ms |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | - | 4.700000 ms |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | - | 5.130000 ms |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | - | 5.024000 ms |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | - | 4.854000 ms |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | - | 617.529000 ms |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | - | 617.480000 ms |
| MicroBench_LocalMem_int32_4096 | - | 29.887000 ms |
| MicroBench_LocalMem_fp32_4096 | - | 29.884000 ms |
Relative perf in group Pattern (10): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Pattern_Reduction_NDRange_int32 | - | 16.720000 ms |
| Pattern_Reduction_Hierarchical_int32 | - | 16.716000 ms |
| Pattern_SegmentedReduction_NDRange_int16 | - | 2.266000 ms |
| Pattern_SegmentedReduction_NDRange_int32 | - | 2.164000 ms |
| Pattern_SegmentedReduction_NDRange_int64 | - | 2.338000 ms |
| Pattern_SegmentedReduction_NDRange_fp32 | - | 2.165000 ms |
| Pattern_SegmentedReduction_Hierarchical_int16 | - | 11.799000 ms |
| Pattern_SegmentedReduction_Hierarchical_int32 | - | 11.588000 ms |
| Pattern_SegmentedReduction_Hierarchical_int64 | - | 11.784000 ms |
| Pattern_SegmentedReduction_Hierarchical_fp32 | - | 11.585000 ms |
Relative perf in group ScalarProduct (6): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| ScalarProduct_NDRange_int32 | - | 3.769000 ms |
| ScalarProduct_NDRange_int64 | - | 5.461000 ms |
| ScalarProduct_NDRange_fp32 | - | 3.773000 ms |
| ScalarProduct_Hierarchical_int32 | - | 10.533000 ms |
| ScalarProduct_Hierarchical_int64 | - | 11.502000 ms |
| ScalarProduct_Hierarchical_fp32 | - | 10.158000 ms |
Relative perf in group USM (7): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| USM_Allocation_latency_fp32_device | - | 0.067000 ms |
| USM_Allocation_latency_fp32_host | - | 37.342000 ms |
| USM_Allocation_latency_fp32_shared | - | 0.057000 ms |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | - | 1.684000 ms |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | - | 1.074000 ms |
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | - | 1.850000 ms |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | - | 1.256000 ms |
Relative perf in group VectorAddition (3): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| VectorAddition_int32 | - | 1.475000 ms |
| VectorAddition_int64 | - | 3.061000 ms |
| VectorAddition_fp32 | - | 1.468000 ms |
Relative perf in group Polybench (3): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Polybench_2mm | - | 1.227000 ms |
| Polybench_3mm | - | 1.729000 ms |
| Polybench_Atax | - | 6.885000 ms |
Relative perf in group Kmeans (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Kmeans_fp32 | - | 16.080000 ms |
Relative perf in group LinearRegressionCoeff (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| LinearRegressionCoeff_fp32 | - | 935.779000 ms |
Relative perf in group MolecularDynamics (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| MolecularDynamics | - | 0.029000 ms |
Relative perf in group llama.cpp (6): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| llama.cpp Prompt Processing Batched 128 | - | 829.272674 token/s |
| llama.cpp Text Generation Batched 128 | - | 62.469368 token/s |
| llama.cpp Prompt Processing Batched 256 | - | 867.896489 token/s |
| llama.cpp Text Generation Batched 256 | - | 62.451865 token/s |
| llama.cpp Prompt Processing Batched 512 | - | 428.586901 token/s |
| llama.cpp Text Generation Batched 512 | - | 62.506870 token/s |

Details

Benchmark details - environment, command...
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

@RossBrunton force-pushed the ross/nohandle branch 6 times, most recently from a5e38c1 to 3d54672 on January 30, 2025 14:20
@github-actions bot added the opencl (OpenCL adapter specific issues) label on Jan 30, 2025
@RossBrunton force-pushed the ross/nohandle branch 4 times, most recently from 42ac088 to 953f359 on January 31, 2025 11:50
omarahmed1111 and others added 2 commits on January 31, 2025 12:10
We want to transition to handle pointers containing the DDI table as their first element. For this to work, the handle object must not have a vtable.

Since ur_mem_handle_t_ is relatively simple, it's easy enough to roll our own version of dynamic dispatch.
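As a hedged illustration of "rolling our own dynamic dispatch" without a vtable (all names below are invented for the sketch, not the actual implementation): virtual functions would insert a vptr ahead of the DDI field, so the same effect can be had with plain function pointers stored as ordinary members:

```cpp
#include <cstddef>

struct ur_dditable_t; // stand-in for the adapter dispatch table

// No virtual functions: the type stays standard-layout, so the
// DDI pointer is still guaranteed to be first in memory.
struct ur_mem_handle_t_ {
    ur_dditable_t *ddi; // must remain the first member

    // Hand-rolled "virtual" operations, filled in at creation time:
    void (*retain)(ur_mem_handle_t_ *);
    void (*release)(ur_mem_handle_t_ *);

    void *allocation;
    size_t size;
};

// One concrete flavour of the handle supplies its own behaviour:
static void bufferRetain(ur_mem_handle_t_ *) { /* e.g. bump a refcount */ }
static void bufferRelease(ur_mem_handle_t_ *) { /* e.g. drop it and free */ }

ur_mem_handle_t_ makeBufferHandle(ur_dditable_t *ddi, void *alloc, size_t size) {
    return {ddi, &bufferRetain, &bufferRelease, alloc, size};
}
```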
This replaces the handle logic in the loader: instead of wrapped pointers, the loader now relies on a DDI table at the start of the handle struct itself.
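Roughly, the loader-side change looks like the following before/after sketch (hypothetical names and a pared-down dispatch table; the real tables and entry points differ):

```cpp
#include <cstdint>

using ur_result_t = int32_t;
struct ur_mem_handle_t_;
using ur_mem_handle_t = ur_mem_handle_t_ *;

// Pared-down stand-in for an adapter's dispatch table.
struct ur_dditable_t {
    ur_result_t (*pfnMemRetain)(ur_mem_handle_t);
};

// Before (sketch): the loader wrapped every adapter handle in its own
// object so it could remember which adapter the handle belongs to.
struct ur_loader_object_t {
    ur_dditable_t *ddi;  // owning adapter's dispatch table
    void *adapterHandle; // the real handle, hidden behind the wrapper
};
// Every call then had to unwrap arguments and re-wrap returned handles.

// After (sketch): no wrapper. The DDI table pointer sits at the start
// of the adapter's own handle struct, so dispatch is a single load:
ur_result_t urMemRetain(ur_mem_handle_t hMem) {
    ur_dditable_t *ddi = *reinterpret_cast<ur_dditable_t **>(hMem);
    return ddi->pfnMemRetain(hMem);
}
```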

Just testing something...
Labels
command-buffer (Command Buffer feature addition/changes/specification), common (Changes or additions to common utilities), cuda (CUDA adapter specific issues), hip (HIP adapter specific issues), level-zero (L0 adapter specific issues), loader (Loader related feature/bug), native-cpu (Native CPU adapter specific issues), opencl (OpenCL adapter specific issues)