Commit
benchmark of fbgemm op - regroup_kts (pytorch#2159)
Summary:
Pull Request resolved: pytorch#2159

# context
* added **fn-level** benchmark for the `regroup_keyed_tensor`
* `keyed_tensor_regroup` further reduces the CPU runtime from 2.0 ms to 1.3 ms (35% improvement) without hurting the GPU runtime/memory usage

# conclusion
* CPU runtime **reduces 40%** from 1.8 ms to 1.1 ms
* GPU runtime **reduces 60%** from 4.9 ms to 2.0 ms
* GPU memory **reduces 33%** from 1.5 K to 1.0 K
* **we should migrate to the new op** unless any unknown concern/blocker

# traces
* [files](https://drive.google.com/drive/folders/1iiEf30LeG_i0xobMZVhmMneOQ5slmX3U?usp=drive_link)
```
[[email protected] /data/sandcastle/boxes/fbsource (04ad34da3)]$ ll *.json
-rw-r--r-- 1 hhy hhy  552501 Jul 10 16:01 'trace-[1 Op] KT_regroup_dup.json'
-rw-r--r-- 1 hhy hhy  548847 Jul 10 16:01 'trace-[1 Op] KT_regroup.json'
-rw-r--r-- 1 hhy hhy  559006 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs_dup.json'
-rw-r--r-- 1 hhy hhy  553199 Jul 10 16:01 'trace-[2 Ops] permute_multi_embs.json'
-rw-r--r-- 1 hhy hhy 5104239 Jul 10 16:01 'trace-[Module] KTRegroupAsDict_dup.json'
-rw-r--r-- 1 hhy hhy  346643 Jul 10 16:01 'trace-[Module] KTRegroupAsDict.json'
-rw-r--r-- 1 hhy hhy  895096 Jul 10 16:01 'trace-[Old Prod] permute_pooled_embs.json'
-rw-r--r-- 1 hhy hhy  561685 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup_dup.json'
-rw-r--r-- 1 hhy hhy  559147 Jul 10 16:01 'trace-[Prod] KeyedTensor.regroup.json'
-rw-r--r-- 1 hhy hhy 7958676 Jul 10 16:01 'trace-[pytorch generic] fallback_dup.json'
-rw-r--r-- 1 hhy hhy 7978141 Jul 10 16:01 'trace-[pytorch generic] fallback.json'
```
* pytorch generic
{F1755208341}
* current prod
{F1755209251}
* permute_multi_embedding (2 Ops)
{F1755210682}
* KT.regroup (1 Op)
{F1755210008}
* regroupAsDict (Module)
{F1755210990}
* metrics

|Operator|CPU runtime|GPU runtime|GPU memory|notes|
|---|---|---|---|---|
|**[fallback] pytorch generic**|3.9 ms|3.2 ms|1.0 K|CPU-bounded, allow duplicates|
|**[prod] _fbgemm_permute_pooled_embs**|1.9 ms|4.9 ms|1.5 K|GPU-bounded, does **NOT** allow duplicates, PT2 non-compatible `pin_and_move`|
|**[hybrid python/cu] keyed_tensor_regroup**|1.5 ms|2.0 ms|1.0 K|both GPU runtime and memory improved, **ALLOW** duplicates, PT2 friendly|
|**[pure c++/cu] permute_multi_embedding**|1.0 ms|2.0 ms|1.0 K|both CPU and GPU runtime/memory improved, **ALLOW** duplicates, PT2 friendly|

Reviewed By: dstaay-fb

Differential Revision: D58907223

fbshipit-source-id: 108ce355b9191cba6fe6a79e54dc7291b8463f7b
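
The percentages quoted in the summary and conclusion can be rechecked with a few lines of Python. This is only a sketch of the arithmetic, not part of the benchmark itself: `percent_reduction` is a hypothetical helper name, and the before/after pairs are copied from the text above.

```python
def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    if before <= 0:
        raise ValueError("before must be positive")
    return (before - after) / before * 100.0


# (before, after) pairs as quoted in the commit message above.
quoted = {
    "CPU runtime, ms (conclusion)": (1.8, 1.1),
    "GPU runtime, ms (conclusion)": (4.9, 2.0),
    "GPU memory, K (conclusion)": (1.5, 1.0),
    "CPU runtime, ms (keyed_tensor_regroup, context)": (2.0, 1.3),
}

for name, (before, after) in quoted.items():
    print(f"{name}: {percent_reduction(before, after):.1f}% reduction")
```

The computed values land close to the rounded figures in the conclusion (roughly 40%, 60%, and 33%), and the 2.0 ms to 1.3 ms change works out to the stated 35%.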