Replace loader handles with field at start of handle data #2622

Draft · RossBrunton wants to merge 3 commits into main from ross/nohandle

Conversation

RossBrunton (Contributor) commented:

Currently this only works for L0 (v1) and HIP.
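For readers unfamiliar with the approach, here is a minimal sketch of the layout this PR moves toward. All names below are illustrative stand-ins, not the actual unified-runtime definitions; the point is only that the loader can recover the adapter's dispatch (DDI) table directly from the first bytes of any handle.

```cpp
#include <cstddef>

struct ur_dditable_t; // stand-in for the adapter's function-pointer table

// Sketch: every adapter handle starts with a pointer to its DDI table.
// The struct must be standard-layout (no vtable), so this field is
// guaranteed to live at offset 0.
struct ur_mem_handle_t_ {
    ur_dditable_t *ddi; // always the first member
    void *allocation;   // adapter-specific state follows
    size_t size;
};

// The loader can then find the right adapter from the handle alone:
inline ur_dditable_t *ddiTableOf(void *handle) {
    // Valid because the DDI pointer sits at offset 0 of every handle.
    return *static_cast<ur_dditable_t **>(handle);
}
```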

@github-actions bot added labels on Jan 27, 2025: loader (Loader related feature/bug), level-zero (L0 adapter specific issues), hip (HIP adapter specific issues), command-buffer (Command Buffer feature addition/changes/specification)
@github-actions bot added labels on Jan 27, 2025: cuda (CUDA adapter specific issues), native-cpu (Native CPU adapter specific issues)

Compute Benchmarks level_zero run (with params: ):
https://github.com/oneapi-src/unified-runtime/actions/runs/12995136466

@github-actions bot added the common (Changes or additions to common utilities) label on Jan 27, 2025

Compute Benchmarks level_zero run ():
https://github.com/oneapi-src/unified-runtime/actions/runs/12995136466
Job status: success. Test status: success.

Summary

Total: 38 benchmarks included in the mean.
Geomean: 100.206%.
Improved: 6, regressed: 7 (threshold: 2.00%).
(Overall result is better than baseline.)
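For reference, the summary numbers are geometric means of the per-benchmark relative-perf values. Below is a minimal sketch of the arithmetic as I read it (not the bot's actual code); it assumes relative perf = baseline / this-PR for lower-is-better metrics such as time:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    // Example (thisPr, baseline) pairs taken from the tables below.
    struct Result { double thisPr, baseline; };
    std::vector<Result> results = {{5.825, 5.932}, {2.131, 2.175}};

    // Geometric mean computed in log space for numerical stability.
    double logSum = 0.0;
    for (const auto &r : results)
        logSum += std::log(r.baseline / r.thisPr); // >1 means this PR is faster

    double geomean = std::exp(logSum / results.size()) * 100.0;
    std::printf("Geomean %.3f%%\n", geomean);
    // "Improved"/"Regressed" counts only include changes past the 2.00% threshold.
    return 0;
}
```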

Performance change in benchmark groups

Relative perf in group memory (4): 100.808%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024 | 5.825000 μs | 5.932 μs | 101.84% | 1.84% | . |
| memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024 | 255.579000 μs | 256.472 μs | 100.35% | 0.35% | . |
| memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024 | 218.665000 μs | 219.201 μs | 100.25% | 0.25% | . |
| memory_benchmark_sycl StreamMemory, placement Device, type Triad, size 10240 | - | 3.074000 GB/s | | | |
Relative perf in group api (12): 101.685%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024 | 2.131000 μs | 2.175 μs | 102.06% | 2.06% | ++ |
| api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024 | 1.684000 μs | 1.706 μs | 101.31% | 1.31% | . |
| api_overhead_benchmark_l0 SubmitKernel out of order | - | 11.629000 μs | | | |
| api_overhead_benchmark_l0 SubmitKernel in order | - | 11.800000 μs | | | |
| api_overhead_benchmark_sycl SubmitKernel out of order | - | 23.287000 μs | | | |
| api_overhead_benchmark_sycl SubmitKernel in order | - | 24.664000 μs | | | |
| api_overhead_benchmark_ur SubmitKernel out of order CPU count | - | 105463.000000 instr | | | |
| api_overhead_benchmark_ur SubmitKernel out of order | - | 16.073000 μs | | | |
| api_overhead_benchmark_ur SubmitKernel in order CPU count | - | 110815.000000 instr | | | |
| api_overhead_benchmark_ur SubmitKernel in order | - | 16.703000 μs | | | |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion CPU count | - | 123991.000000 instr | | | |
| api_overhead_benchmark_ur SubmitKernel in order with measure completion | - | 21.473000 μs | | | |
Relative perf in group Velocity-Bench (9): 99.170%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| Velocity-Bench dl-mnist | 2.410 s | 2.390000 s | 99.17% | -0.83% | . |
| Velocity-Bench Hashtable | - | 353.884706 M keys/sec | | | |
| Velocity-Bench Bitcracker | - | 35.731600 s | | | |
| Velocity-Bench CudaSift | - | 204.632000 ms | | | |
| Velocity-Bench Easywave | - | 235.000000 ms | | | |
| Velocity-Bench QuickSilver | - | 118.320000 MMS/CTT | | | |
| Velocity-Bench Sobel Filter | - | 615.149000 ms | | | |
| Velocity-Bench dl-cifar | - | 23.892100 s | | | |
| Velocity-Bench svm | - | 0.140700 s | | | |
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:4 (4): 98.331%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider | 2090.550000 ns | 2119.200 ns | 101.37% | 1.37% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 glibc | 2695.080000 ns | 2723.560 ns | 101.06% | 1.06% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 305.752 ns | 294.824000 ns | 96.43% | -3.57% | --- |
| alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3301.340 ns | 3124.490000 ns | 94.64% | -5.36% | ---- |
Relative perf in group alloc/size:10000/0/4096/iterations:200000/threads:1 (4): 99.292%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider | 192.902000 ns | 195.800 ns | 101.50% | 1.50% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 213.998 ns | 213.357000 ns | 99.70% | -0.30% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 glibc | 708.029 ns | 699.961000 ns | 98.86% | -1.14% | . |
| alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 277.730 ns | 269.830000 ns | 97.16% | -2.84% | -- |
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:4 (4): 102.184%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc | 1234.090000 ns | 1399.010 ns | 113.36% | 13.36% | ++++++++++ |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider | 1872.920000 ns | 1896.370 ns | 101.25% | 1.25% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool<os_provider> | 261.410 ns | 260.987000 ns | 99.84% | -0.16% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool<os_provider> | 3345.870 ns | 3183.170000 ns | 95.14% | -4.86% | ---- |
Relative perf in group alloc/size:10000/100000/4096/iterations:200000/threads:1 (4): 98.976%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider | 190.629000 ns | 192.753 ns | 101.11% | 1.11% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool<os_provider> | 204.159000 ns | 204.412 ns | 100.12% | 0.12% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc | 746.724 ns | 737.865000 ns | 98.81% | -1.19% | . |
| alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool<os_provider> | 323.599 ns | 310.425000 ns | 95.93% | -4.07% | --- |
Relative perf in group alloc/min (4): 100.564%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool<os_provider> | 1031.030000 ns | 1083.760 ns | 105.11% | 5.11% | ++++ |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool<os_provider> | 963.666 ns | 960.784000 ns | 99.70% | -0.30% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc | 175.796 ns | 174.373000 ns | 99.19% | -0.81% | . |
| alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc | 814.890 ns | 801.763000 ns | 98.39% | -1.61% | . |
Relative perf in group multiple (12): 100.475%

| Benchmark | This PR | baseline | Relative perf | Change | - |
|---|---|---|---|---|---|
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool<os_provider> | 25725.600000 ns | 27465.300 ns | 106.76% | 6.76% | +++++ |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider | 1168990.000000 ns | 1201570.000 ns | 102.79% | 2.79% | ++ |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc | 30426.700000 ns | 31243.600 ns | 102.68% | 2.68% | ++ |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc | 33868.700000 ns | 34482.100 ns | 101.81% | 1.81% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool<os_provider> | 14839.400000 ns | 15099.900 ns | 101.76% | 1.76% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool<os_provider> | 42798.400000 ns | 43475.800 ns | 101.58% | 1.58% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider | 148110.000 ns | 147271.000000 ns | 99.43% | -0.57% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool<os_provider> | 76090.100 ns | 75587.500000 ns | 99.34% | -0.66% | . |
| multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc | 142794.000 ns | 141214.000000 ns | 98.89% | -1.11% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc | 4271.520 ns | 4207.320000 ns | 98.50% | -1.50% | . |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool<os_provider> | 1218490.000 ns | 1185020.000000 ns | 97.25% | -2.75% | -- |
| multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool<os_provider> | 168045.000 ns | 160292.000000 ns | 95.39% | -4.61% | --- |
Relative perf in group miscellaneous (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| miscellaneous_benchmark_sycl VectorSum | - | 861.253000 bw GB/s |
Relative perf in group multithread (10): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:1 dstUSM:1 | - | 6943.025000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:1 dstUSM:1 | - | 17230.283000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:1 dstUSM:1 | - | 47306.654000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:1 dstUSM:1 | - | 2083.870000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:1, allocSize:102400 srcUSM:0 dstUSM:1 | - | 7821.718000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:100, numThreads:8, allocSize:102400 srcUSM:0 dstUSM:1 | - | 9073.725000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:400, numThreads:8, allocSize:1024 srcUSM:0 dstUSM:1 | - | 26707.698000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:10, numThreads:16, allocSize:1024 srcUSM:0 dstUSM:1 | - | 1210.999000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:1, allocSize:1024 srcUSM:0 dstUSM:1 without events | - | 43064.999000 μs |
| multithread_benchmark_ur MemcpyExecute opsPerThread:4096, numThreads:4, allocSize:1024 srcUSM:0 dstUSM:1 without events | - | 114139.645000 μs |
Relative perf in group graph (10): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:10 | - | 71856.495000 μs |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:10 | - | 72543.241000 μs |
| graph_api_benchmark_sycl SinKernelGraph graphs:0, numKernels:100 | - | 353404.211000 μs |
| graph_api_benchmark_sycl SinKernelGraph graphs:1, numKernels:100 | - | 353223.514000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:1, numKernels:10 | - | 54.135000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:10 | - | 61.889000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:1, numKernels:100 | - | 679.477000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:0, submit:0, numKernels:10 | - | 5611.771000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:10 | - | 5615.778000 μs |
| graph_api_benchmark_sycl SubmitExecGraph ioq:1, submit:0, numKernels:100 | - | 57263.652000 μs |
Relative perf in group Runtime (8): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Runtime_IndependentDAGTaskThroughput_SingleTask | - | 265.060000 ms |
| Runtime_IndependentDAGTaskThroughput_BasicParallelFor | - | 287.518000 ms |
| Runtime_IndependentDAGTaskThroughput_HierarchicalParallelFor | - | 277.037000 ms |
| Runtime_IndependentDAGTaskThroughput_NDRangeParallelFor | - | 275.224000 ms |
| Runtime_DAGTaskThroughput_SingleTask | - | 1678.531000 ms |
| Runtime_DAGTaskThroughput_BasicParallelFor | - | 1747.525000 ms |
| Runtime_DAGTaskThroughput_HierarchicalParallelFor | - | 1718.971000 ms |
| Runtime_DAGTaskThroughput_NDRangeParallelFor | - | 1682.917000 ms |
Relative perf in group MicroBench (14): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| MicroBench_HostDeviceBandwidth_1D_H2D_Contiguous | - | 4.832000 ms |
| MicroBench_HostDeviceBandwidth_2D_H2D_Contiguous | - | 4.730000 ms |
| MicroBench_HostDeviceBandwidth_3D_H2D_Contiguous | - | 4.690000 ms |
| MicroBench_HostDeviceBandwidth_1D_D2H_Contiguous | - | 4.764000 ms |
| MicroBench_HostDeviceBandwidth_2D_D2H_Contiguous | - | 618.120000 ms |
| MicroBench_HostDeviceBandwidth_3D_D2H_Contiguous | - | 618.122000 ms |
| MicroBench_HostDeviceBandwidth_1D_H2D_Strided | - | 4.700000 ms |
| MicroBench_HostDeviceBandwidth_2D_H2D_Strided | - | 5.130000 ms |
| MicroBench_HostDeviceBandwidth_3D_H2D_Strided | - | 5.024000 ms |
| MicroBench_HostDeviceBandwidth_1D_D2H_Strided | - | 4.854000 ms |
| MicroBench_HostDeviceBandwidth_2D_D2H_Strided | - | 617.529000 ms |
| MicroBench_HostDeviceBandwidth_3D_D2H_Strided | - | 617.480000 ms |
| MicroBench_LocalMem_int32_4096 | - | 29.887000 ms |
| MicroBench_LocalMem_fp32_4096 | - | 29.884000 ms |
Relative perf in group Pattern (10): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Pattern_Reduction_NDRange_int32 | - | 16.720000 ms |
| Pattern_Reduction_Hierarchical_int32 | - | 16.716000 ms |
| Pattern_SegmentedReduction_NDRange_int16 | - | 2.266000 ms |
| Pattern_SegmentedReduction_NDRange_int32 | - | 2.164000 ms |
| Pattern_SegmentedReduction_NDRange_int64 | - | 2.338000 ms |
| Pattern_SegmentedReduction_NDRange_fp32 | - | 2.165000 ms |
| Pattern_SegmentedReduction_Hierarchical_int16 | - | 11.799000 ms |
| Pattern_SegmentedReduction_Hierarchical_int32 | - | 11.588000 ms |
| Pattern_SegmentedReduction_Hierarchical_int64 | - | 11.784000 ms |
| Pattern_SegmentedReduction_Hierarchical_fp32 | - | 11.585000 ms |
Relative perf in group ScalarProduct (6): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| ScalarProduct_NDRange_int32 | - | 3.769000 ms |
| ScalarProduct_NDRange_int64 | - | 5.461000 ms |
| ScalarProduct_NDRange_fp32 | - | 3.773000 ms |
| ScalarProduct_Hierarchical_int32 | - | 10.533000 ms |
| ScalarProduct_Hierarchical_int64 | - | 11.502000 ms |
| ScalarProduct_Hierarchical_fp32 | - | 10.158000 ms |
Relative perf in group USM (7): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| USM_Allocation_latency_fp32_device | - | 0.067000 ms |
| USM_Allocation_latency_fp32_host | - | 37.342000 ms |
| USM_Allocation_latency_fp32_shared | - | 0.057000 ms |
| USM_Instr_Mix_fp32_device_1:1mix_with_init_no_prefetch | - | 1.684000 ms |
| USM_Instr_Mix_fp32_host_1:1mix_with_init_no_prefetch | - | 1.074000 ms |
| USM_Instr_Mix_fp32_device_1:1mix_no_init_no_prefetch | - | 1.850000 ms |
| USM_Instr_Mix_fp32_host_1:1mix_no_init_no_prefetch | - | 1.256000 ms |
Relative perf in group VectorAddition (3): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| VectorAddition_int32 | - | 1.475000 ms |
| VectorAddition_int64 | - | 3.061000 ms |
| VectorAddition_fp32 | - | 1.468000 ms |
Relative perf in group Polybench (3): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Polybench_2mm | - | 1.227000 ms |
| Polybench_3mm | - | 1.729000 ms |
| Polybench_Atax | - | 6.885000 ms |
Relative perf in group Kmeans (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| Kmeans_fp32 | - | 16.080000 ms |
Relative perf in group LinearRegressionCoeff (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| LinearRegressionCoeff_fp32 | - | 935.779000 ms |
Relative perf in group MolecularDynamics (1): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| MolecularDynamics | - | 0.029000 ms |
Relative perf in group llama.cpp (6): cannot calculate

| Benchmark | This PR | baseline |
|---|---|---|
| llama.cpp Prompt Processing Batched 128 | - | 829.272674 token/s |
| llama.cpp Text Generation Batched 128 | - | 62.469368 token/s |
| llama.cpp Prompt Processing Batched 256 | - | 867.896489 token/s |
| llama.cpp Text Generation Batched 256 | - | 62.451865 token/s |
| llama.cpp Prompt Processing Batched 512 | - | 428.586901 token/s |
| llama.cpp Text Generation Batched 512 | - | 62.506870 token/s |

Details

Benchmark details - environment, command...
memory_benchmark_sycl QueueInOrderMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Device --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueInOrderMemcpy from Host to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueInOrderMemcpy --csv --noHeaders --iterations=10000 --IsCopyOnly=0 --sourcePlacement=Host --destinationPlacement=Device --size=1024 --count=100

memory_benchmark_sycl QueueMemcpy from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/memory_benchmark_sycl --test=QueueMemcpy --csv --noHeaders --iterations=10000 --sourcePlacement=Device --destinationPlacement=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue out of order from Device to Device, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=0 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Device --dst=Device --size=1024

api_overhead_benchmark_sycl ExecImmediateCopyQueue in order from Device to Host, size 1024

Environment Variables:

Command:

/home/pmdk/bench_workdir/compute-benchmarks-build/bin/api_overhead_benchmark_sycl --test=ExecImmediateCopyQueue --csv --noHeaders --iterations=100000 --ioq=1 --IsCopyOnly=1 --MeasureCompletionTime=0 --src=Host --dst=Host --size=1024

Velocity-Bench dl-mnist

Environment Variables:

NEOReadDebugKeys=1
DisableScratchPages=0

Command:

/home/pmdk/bench_workdir/dl-mnist/dl-mnist-sycl -conv_algo ONEDNN_AUTO

alloc/size:10000/0/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/0/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/size:10000/100000/4096/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

alloc/min size:10000/max size:0/granularity:8/65536/8/iterations:200000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 glibc

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 proxy_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 os_provider

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/size:10000/4096/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:4 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

multiple_malloc_free/min size:10000/max size:8/granularity:65536/8/iterations:2000/threads:1 scalable_pool

Environment Variables:

Command:

/home/pmdk/ur-actions-runner/_work/unified-runtime/unified-runtime/umf_build/benchmark/umf-benchmark --benchmark_format=csv

@RossBrunton force-pushed the ross/nohandle branch 6 times, most recently from a5e38c1 to 3d54672 on January 30, 2025 14:20
@github-actions bot added the opencl (OpenCL adapter specific issues) label on Jan 30, 2025
@RossBrunton force-pushed the ross/nohandle branch 4 times, most recently from 42ac088 to 953f359 on January 31, 2025 11:50
omarahmed1111 and others added 2 commits on January 31, 2025 12:10
We want to transition to handle pointers containing the DDI table as their first element. For this to work, the handle object must not have a vtable.

Since ur_mem_handle_t_ is relatively simple, it's easy enough to roll our own version of dynamic dispatch.
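As a hedged illustration of "rolling our own dynamic dispatch" without a vtable (all names below are invented for the sketch, not the actual implementation): virtual functions would insert a vptr ahead of the DDI field, so the same effect can be had with plain function pointers stored as ordinary members:

```cpp
#include <cstddef>

struct ur_dditable_t; // stand-in for the adapter dispatch table

// No virtual functions: the type stays standard-layout, so the
// DDI pointer is still guaranteed to be first in memory.
struct ur_mem_handle_t_ {
    ur_dditable_t *ddi; // must remain the first member

    // Hand-rolled "virtual" operations, filled in at creation time:
    void (*retain)(ur_mem_handle_t_ *);
    void (*release)(ur_mem_handle_t_ *);

    void *allocation;
    size_t size;
};

// One concrete flavour of the handle supplies its own behaviour:
static void bufferRetain(ur_mem_handle_t_ *) { /* e.g. bump a refcount */ }
static void bufferRelease(ur_mem_handle_t_ *) { /* e.g. drop it and free */ }

ur_mem_handle_t_ makeBufferHandle(ur_dditable_t *ddi, void *alloc, size_t size) {
    return {ddi, &bufferRetain, &bufferRelease, alloc, size};
}
```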
This replaces the handle logic in the loader: instead of wrapped pointers, the loader now relies on a DDI table at the start of the handle struct itself.
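Roughly, the loader-side change looks like the following before/after sketch (hypothetical names and a pared-down dispatch table; the real tables and entry points differ):

```cpp
#include <cstdint>

using ur_result_t = int32_t;
struct ur_mem_handle_t_;
using ur_mem_handle_t = ur_mem_handle_t_ *;

// Pared-down stand-in for an adapter's dispatch table.
struct ur_dditable_t {
    ur_result_t (*pfnMemRetain)(ur_mem_handle_t);
};

// Before (sketch): the loader wrapped every adapter handle in its own
// object so it could remember which adapter the handle belongs to.
struct ur_loader_object_t {
    ur_dditable_t *ddi;  // owning adapter's dispatch table
    void *adapterHandle; // the real handle, hidden behind the wrapper
};
// Every call then had to unwrap arguments and re-wrap returned handles.

// After (sketch): no wrapper. The DDI table pointer sits at the start
// of the adapter's own handle struct, so dispatch is a single load:
ur_result_t urMemRetain(ur_mem_handle_t hMem) {
    ur_dditable_t *ddi = *reinterpret_cast<ur_dditable_t **>(hMem);
    return ddi->pfnMemRetain(hMem);
}
```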

Just testing something...
Labels
command-buffer (Command Buffer feature addition/changes/specification), common (Changes or additions to common utilities), cuda (CUDA adapter specific issues), hip (HIP adapter specific issues), level-zero (L0 adapter specific issues), loader (Loader related feature/bug), native-cpu (Native CPU adapter specific issues), opencl (OpenCL adapter specific issues)