v0.2.3
We're excited to announce the release of KTransformers v0.2.3! You can now compile it from the GitHub source code. Release packages and Docker images are being built and uploaded - stay tuned!
Key Updates:
- Low-Precision Inference Optimization #754
  - Added IQ1_S/IQ2_XXS quantized matmul support, now compatible with Unsloth's DeepSeek-R1 1.58-bit/2.51-bit dynamic quantized weights
  - Released a DeepSeek-R1 mixed-precision model (IQ1 + FP8) achieving enhanced performance (see the sketch after this list):
    - 19 GB VRAM usage and 140 GB system memory consumption
    - MMLU score of 83.6, slightly outperforming full-precision DeepSeek-V3
    - Ongoing benchmarks: View Details (special thanks to @moonshadow-25 and @godrosev for their huge contributions to v0.2.3)
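The intuition behind the IQ1 + FP8 mix is easiest to see as a per-tensor placement rule. Below is a minimal conceptual sketch (not KTransformers' actual code): the routed-expert weights, which dominate DeepSeek-R1's size, take the ~1.58-bit IQ1_S format, while the smaller, accuracy-sensitive tensors stay in FP8. The `choose_quant_type` helper and the tensor names are hypothetical, for illustration only.

```python
# Conceptual sketch of an IQ1 + FP8 mixed-precision placement rule
# (illustrative only; not KTransformers' internal code).

def choose_quant_type(tensor_name: str) -> str:
    """Pick a storage format for one weight tensor by name (hypothetical rule)."""
    if ".mlp.experts." in tensor_name:
        return "IQ1_S"  # routed-expert weights dominate model size -> ~1.58-bit
    return "FP8"        # attention, shared experts, etc. stay higher precision

for name in [
    "model.layers.3.mlp.experts.17.down_proj.weight",
    "model.layers.3.self_attn.q_proj.weight",
]:
    print(f"{name} -> {choose_quant_type(name)}")
```

Keeping only the bulk expert weights at ~1.58 bits is what drives the footprint down to roughly 19 GB VRAM + 140 GB system memory while the FP8 portion preserves accuracy.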
- Long Context Handling Enhancement #750
  - Implemented a chunked prefill mechanism that supports processing 139K-token contexts with DeepSeek-R1 on 24 GB of VRAM (see the sketch after this list)
  - Note: as DeepSeek's native context window only supports 128K tokens, we will pause further optimizations for extended context handling
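For context, here is a minimal sketch of the chunked-prefill idea (not KTransformers' internal implementation): the prompt is split into fixed-size chunks that run through the model one at a time, each appending to a shared KV cache, so peak activation memory scales with the chunk size rather than the full 139K-token context. `model.forward`, `past_kv`, and the 8192-token chunk size are assumptions for illustration.

```python
# Conceptual sketch of chunked prefill (illustrative only).

CHUNK_SIZE = 8192  # tokens processed per forward pass (assumed, tunable)

def chunked_prefill(model, token_ids, kv_cache):
    """Prefill a long prompt chunk by chunk instead of in one giant pass."""
    logits = None
    for start in range(0, len(token_ids), CHUNK_SIZE):
        chunk = token_ids[start:start + CHUNK_SIZE]
        # Each pass reads the KV entries written by earlier chunks and
        # appends this chunk's keys/values, so attention still spans the
        # whole prefix while only CHUNK_SIZE tokens' activations are live.
        logits = model.forward(chunk, past_kv=kv_cache)
    return logits  # logits for the final chunk seed token generation
```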
Coming Next - v0.2.4 Preview

The upcoming v0.2.4 will be the final minor release in the 0.2 series, delivering the most crucial update to transform KTransformers from "a toy project" into "a practical solution": multi-concurrency support.
Scheduled for release within two weeks, this update will be followed by development of version 0.3, featuring:
- AMX-powered optimizations for enhanced performance
- Expanded hardware support, including AMD, XPU, MetaX (沐曦), Moore Threads (摩尔线程), and Ascend (昇腾) GPUs