Looks interesting! Is there any intuition for why this should be the case? Did you discover it via that intuition, or just random experimentation?
A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include that as a git submodule rather than make it a "git clone and apply patch" step.
A further note: everyone and their dog has a different local Python setup, so it might be nice to let people separate the llama.cpp stuff from the Python stuff rather than bake in a dependence on Homebrew Python.
dipampaul17 12 hours ago [-]
Great question about the intuition! The difference comes from the core roles these components play in attention.
Keys determine which tokens to attend to - they create the actual attention pattern through similarity calculations. Values only store what information gets passed forward once attention is decided.
When a key vector is quantized too aggressively, it distorts the similarity calculations for every token interaction. A small error in keys can completely redirect attention to the wrong tokens.
Values, however, are much more forgiving. When a value vector is quantized, any error only affects the specific information content of that single token after the attention pattern is already established.
It's like a library catalog system vs. the books themselves. If catalog numbers (keys) are corrupted, you'll look in completely wrong sections. If some words in books (values) are smudged, you're still reading the right book - just with occasional noise.
Mathematically, keys participate in softmax calculations where small errors get exponentially amplified through the normalization process. Values just undergo linear weighted averaging, where errors tend to cancel out.
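Here's a tiny numpy sketch of that asymmetry (purely illustrative; not code from the repo or its benchmarks): the same Gaussian "quantization noise" is injected once into the keys and once into the values. Only the key noise can change the attention weights; value noise leaves the pattern untouched and just adds averaged noise to the output.

```python
# Toy illustration of key vs value sensitivity. Dimensions are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 16                      # head dim, number of cached tokens
q = rng.standard_normal(d)         # query for the current token
K = rng.standard_normal((n, d))    # cached keys
V = rng.standard_normal((n, d))    # cached values

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(q, K, V):
    w = softmax(K @ q / np.sqrt(d))   # attention pattern over cached tokens
    return w, w @ V                   # pattern and attended output

noise = 0.1 * rng.standard_normal((n, d))   # identical perturbation for both cases

w_ref, out_ref = attend(q, K, V)
w_k,   out_k   = attend(q, K + noise, V)    # error on keys
w_v,   out_v   = attend(q, K, V + noise)    # error on values

print("weight shift, keys perturbed:  ", np.abs(w_k - w_ref).sum())
print("weight shift, values perturbed:", np.abs(w_v - w_ref).sum())  # exactly 0.0
print("top-attended token changed by key noise:", w_k.argmax() != w_ref.argmax())
print("output error, keys perturbed:  ", np.linalg.norm(out_k - out_ref))
print("output error, values perturbed:", np.linalg.norm(out_v - out_ref))
```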
I first encountered this asymmetry in papers like "More for Keys, Less for Values" and "KV-AdaQuant," but wanted to quantify exactly how it impacts Apple Silicon inference. The 7× quality difference between K8V4 and K4V8 using identical memory was striking.
Thanks for the installation feedback too! I'll fix the placeholder and make the Python dependencies more flexible.
vlovich123 10 hours ago [-]
My understanding is that the roles of the K/V/Q tensors aren't actually well understood, and that while they're called key/value/query tensors it's not quite straightforward to tease out what they mean or the role they play.
gervwyk 12 hours ago [-]
Great explanation, thanks for this!
Aurornis 9 hours ago [-]
> A note: your install script appears to still have a placeholder at the "apply patch" step. A suggestion: it might be more user-friendly to fork llama.cpp and include that as a git submodule rather than make it a "git clone and apply patch" step.
The patch doesn't actually apply to llama.cpp because argument parsing was moved to arg.cpp 8 months ago.
That doesn't matter, though, because the options to set K and V quantization were added to llama.cpp in 2023.
I don't understand why the patch exists at all, other than as an attempt to make this look novel by changing the settings through a different command line argument?
I would strongly recommend that nobody run an install.sh file from a new repo like this, especially when it's not necessary for something as simple as applying a patch file.
Aurornis 9 hours ago [-]
I finally had time to read the code. The patch is unnecessary because this functionality has been in llama.cpp since 2023 if I understand this PR correctly: https://github.com/ggml-org/llama.cpp/pull/4312
Instead of offering a forked llama.cpp with the changes applied as commits, the repo wants you to run an `install.sh` script which checks out the master branch of llama.cpp without specifying a revision, then applies a short patch to it. This alone should be a warning flag that something is amiss.
There are 4 different patch files in the repo and 1 extra version of the patch as a Heredoc embedded in the install script for some reason. The script has two different versions of code to clone the repo and attempt the patch, too.
The install.sh script overwrites one of the patch files with another patch file with this line:
> cp patch/split_kv_quant.diff patch/fixed_kv_patch.diff
So the `fixed_kv_patch.diff` that is checked into the repo gets overwritten before being applied.
As far as I can tell, this is therefore the patch it's supposed to use: https://github.com/dipampaul17/KVSplit/blob/main/patch/split... (EDIT: I think it's actually this one, see comment at the end: https://github.com/dipampaul17/KVSplit/blob/main/patch/fixed... )
The only thing it adds is a "--kvq" argument which is supposed to let you set K and V quantization at the same time, but immediately above it are the already built-in arguments for setting the K and V quantization separately. Surely the author must have noticed the functionality already existed at some point while shuffling these patches around?
I strongly recommend that people do not run shell scripts from new repos like this, especially when the shell script is so convoluted.
The HN post has 200+ upvotes and the GitHub repo has collected 200+ stars and climbing at this point, but I think the content is misleading. The flagged-to-death comment in this thread calling out the problem was actually correct. It's also concerning that the author continues to respond to this thread but is avoiding any questions about the functionality already existing.
EDIT: I misread the shell script. I think it actually applies this patch: https://github.com/dipampaul17/KVSplit/blob/main/patch/fixed... After applying the patch it mysteriously overwrites the fixed_kv_patch.diff patch with the split_kv_quant.diff file but then does nothing with it. I don't know if this is the result of vibecoding or just someone carelessly editing code, but I'll reiterate that nobody should run shell scripts like this from unknown repos.
EDIT 2: I'm even more confused now. The install.sh script references the old URL for the llama.cpp repo ( https://github.com/ggerganov/llama.cpp ) which now redirects because it was changed some time ago. The patches attempt to modify arg parsing in common.cpp, but that code was moved to arg.cpp 8 months ago ( https://github.com/ggml-org/llama.cpp/commit/bfe76d4a17228bf... ). So this install script and repo appear to be based on code from ~2024 using options added to llama.cpp in ~2023. What is going on here?
imiric 5 hours ago [-]
Finally someone making sense. The fact that this project works by applying patches instead of forking the original project and committing changes should alone be cause for concern.
But OP's entire GitHub presence is suspicious. On May 12th they fired off LLM slop PRs to a bunch of popular projects, and only the JAX ones were rejected. Nevertheless, this allowed them to pin these popular projects to their profile, as if they were a contributor.
I can't put into words how despicable this all is. Anyone working in the AI field is complicit in the corruption of information, the ramifications of which we can't even predict yet. Dead internet and the flood of AI slop is just the beginning.
therealsmith 2 hours ago [-]
Yes, I didn't even say anything about how suspicious the rest of it is because maybe I really was missing something and the author would point it out here.
There are numerous red flags. At best it is someone trying to game their Github profile with LLM-generated code, just look at the May 12 activity from that profile.
Aurornis 19 minutes ago [-]
After looking more, this is definitely an AI driven attempt to game the system.
It’s too bad the comments calling it out earlier were downvoted away. The first one was downvoted until it was flagged.
It’s amazing that this person was able to collect 250 GitHub stars and make bold claims about enhancing llama.cpp when it wasn’t anything new and it didn’t work anyway.
behnamoh 18 hours ago [-]
Would it be possible to do this patch on MLX? I'm getting better speeds on MLX. That, combined with your approach, would finally let Mac users have long conversations at usable speeds.
landl0rd 10 hours ago [-]
Probably, but I'm currently deep in the MLX weeds and finding that, though it's a well-designed framework, it's much less mature in terms of example code you can steal where someone has already benchmarked the "best way" to do something.
My biggest hope for it is actually Haskell bindings, believe it or not. Someone pointed out the other day that its laziness makes it fit really well with that paradigm, and the more or less pure-function approach to the compile graph helps too. ML in Haskell would be fun.
ondra 17 hours ago [-]
Is this any different from using --cache-type-k and --cache-type-v?
Aurornis 11 minutes ago [-]
No, it appears to be an LLM-generated attempt to gain GitHub stars.
See my other comment for a sampling of the other oddities in the repo.
landl0rd 10 hours ago [-]
I'm guessing it's a bit different, since MLX/MPS doesn't have native 4-bit support (or even 8-bit, if I remember correctly?). It didn't even launch with bf16 support. So I think the lowest you could go with the old type_k/v solution on Apple GPUs was 16-bit f16/bf16, but I'm not a llama.cpp internals expert, so maybe I'm wrong.
azinman2 16 hours ago [-]
That’s what I want to know!
badmonster 18 hours ago [-]
I'm curious: is it possible to apply differentiated KV quantization (like K8V4) to models after they're already converted to .gguf format, or does this require rebuilding the model with special support? If it's compatible with any .gguf file, are there any limitations on model types (e.g. Mistral, Phi-3, etc.) or tokenizer configs?
dipampaul17 18 hours ago [-]
Yes, that's one of the key benefits - KVSplit works with any existing .gguf model without requiring reconstruction or special conversion. The quantization happens at runtime on the KV cache, not during model loading or conversion.
This works because the KV cache is created during inference as tokens are processed, completely separate from the model weights themselves. The --kvq-key and --kvq-val flags simply tell llama.cpp how to store these intermediate tensors in memory.
I've tested it successfully with:
- Llama-3 models
- Mistral models
- Phi-2/Phi-3
- TinyLlama
- Qwen variants
The only limitation is that it requires llama.cpp's Metal backend, and you need to disable Flash Attention with -fa 0 since the current FA implementation in llama.cpp bypasses the custom KV cache format. The technique itself should work with any transformer architecture that uses a standard attention mechanism.
fennecbutt 13 hours ago [-]
I thought flash attention was required for quantised KV?
entrepy123 18 hours ago [-]
Are these significantly faster/better on 64GB or 128GB Apple silicon (over 36GB or 48GB)?
I've been reading that large contexts and large models are just painfully slow, even on the fastest and largest Apple silicon that money can buy.
So I wonder if this helps make more use of greater memory, or if really smallish models are still where it's at for Apple silicon, practically speaking.
dipampaul17 12 hours ago [-]
The memory savings from KVSplit scale proportionally with context length, so higher-RAM Macs (64GB/128GB) benefit even more in absolute terms. On a 128GB Mac Studio, you could potentially handle context windows in the hundreds of thousands of tokens.
However, KVSplit doesn't fundamentally change computation speed - just memory efficiency. Our benchmarks show a 14.5% throughput improvement with K8V4, but this comes from better memory locality, not reduced computation.
The "painfully slow" issue with large models on Apple Silicon stems primarily from the compute limitations, not memory constraints. A 70B parameter model will still run at similar token generation speeds regardless of available RAM or KV cache optimizations.
What KVSplit does is make better use of whatever memory you have available. It's particularly valuable when your bottleneck is context length rather than model size.
For practical Apple Silicon usage, the sweet spot remains smaller models (7B-13B) with now-expanded context windows. This lets you process significantly more text while maintaining reasonable generation speeds.
If your workflow needs both massive contexts AND large models, you'd still want to consider server-grade GPUs, but KVSplit helps push the boundary of what's feasible on Apple hardware.
hiatus 12 hours ago [-]
Is this any different from using --cache-type-k and --cache-type-v?
andrewmcwatters 12 hours ago [-]
Thank you for these insights!
nico 18 hours ago [-]
Great work. This seems very interesting, but I need something slightly more high level to relate to it
Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window? Or a 128k model (like gemma3) with a 256k+ context window?
What’s the ideal use case for local models?
Thank you
dipampaul17 18 hours ago [-]
With the K8V4 configuration providing 59% memory savings, you can effectively run contexts 2.4× longer on the same hardware. A model with a 2048 token context can now handle about 5000 tokens, while an 8K context model can reach approximately 19.5K tokens.
In practical terms, this means processing entire books at once on a MacBook, analyzing large codebases without splitting files, or maintaining comprehensive conversation history in chat applications.
The memory savings scale linearly with context length - the longer your context window, the more absolute memory you save. On my M4 MacBook with 8K context, I reduced KV cache from 176MB to 72MB. At 128K context, that same percentage saving would free up gigabytes.
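As a rough sanity check on those numbers, here's a back-of-the-envelope sketch (assuming a TinyLlama-1.1B-like attention shape of 22 layers, 4 KV heads, head dim 64, and llama.cpp's q8_0/q4_0 block sizes with their per-block scales; the exact benchmark model isn't stated here, so treat the config as an assumption):

```python
# KV cache size estimate: K and V each store n_layers * n_kv_heads * head_dim
# elements per token, at the given bits per element.
n_layers, n_kv_heads, head_dim = 22, 4, 64   # assumed TinyLlama-1.1B-like shape
n_ctx = 8192

def kv_cache_mib(bits_k, bits_v):
    return n_ctx * n_layers * n_kv_heads * head_dim * (bits_k + bits_v) / 8 / 2**20

fp16 = kv_cache_mib(16.0, 16.0)    # plain FP16 cache
k8v4 = kv_cache_mib(8.5, 4.5)      # q8_0 keys + q4_0 values, incl. block-scale overhead
print(f"FP16 cache: {fp16:.1f} MiB")                                   # ~176 MiB
print(f"K8V4 cache: {k8v4:.1f} MiB")                                   # ~72 MiB
print(f"saved: {1 - k8v4 / fp16:.0%}, context multiplier: {fp16 / k8v4:.2f}x")
```

Under those assumptions this lands at roughly 176 MiB vs 72 MiB, about 59% saved and a ~2.46x context multiplier, consistent with the figures above.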
This optimization is most valuable when you're context-window limited rather than model-parameter limited. If you're hitting OOM errors due to long inputs rather than large model weights, KVSplit directly addresses your bottleneck.
kmacdough 18 hours ago [-]
> Will it just allow me to run let’s say a model with a 2048 token context window with a 4-6k context window
It reduces the memory footprint of a particular model. You can do what you like with that. Extending the context window post-training isn't trivial, so unless you know what you're doing, you'd be better off finding a model trained on a larger context window.
There are many uses for local models, like working offline or privacy/security. Most folks, though, are using them to experiment with tweaking models.
nico 17 hours ago [-]
Will that make the model run/feel faster?
I can run models with 30-40b parameters on my computer, but they feel a lot slower than the 1-7b ones
So would this make the 30-40b parameter models run faster? Or at least “feel” faster?
fennecbutt 13 hours ago [-]
No, only more compute or fancy model architecture tweaks will get you more t/s.
However, if you're using a discrete GPU, reducing KV memory lets you load more layers onto the GPU and therefore get more performance, but only if you're already struggling to fit your model into VRAM.
dipampaul17 12 hours ago [-]
For 30-40B parameter models, you'll see two types of performance impacts:
First, there's a direct throughput improvement – our benchmarks show a 14.5% speed increase with K8V4 versus FP16. This comes from better memory bandwidth utilization when processing the KV cache.
However, this won't make a 30B model suddenly feel as responsive as a 7B model. The fundamental computation bottleneck remains – larger models need more matrix multiplications regardless of how efficiently you store the KV cache.
Where you might notice a bigger difference is in handling longer inputs. With 59% less memory used for KV cache, your system can dedicate more resources to computation rather than memory management, which can reduce stuttering during processing long documents.
The most noticeable improvement would be if you're currently hitting memory limits that force you to segment long inputs. Being able to process everything in one pass eliminates those artificial breaks.
@fennecbutt is spot-on that the core token generation speed is primarily determined by compute capability and model architecture. KVSplit complements those factors by optimizing memory usage, not by fundamentally changing the computation path.
3abiton 15 hours ago [-]
This is a brilliant idea and initiative. Does this also apply to GPUs? And I assume it should be compatible with other quantization techniques, albeit they'd probably require their own patches?
dipampaul17 12 hours ago [-]
Yup, this approach would likely work on NVIDIA/AMD GPUs as well - the underlying principle that keys require higher precision than values is hardware-independent.
The CUDA backend in llama.cpp already supports separate cache type settings with the `--cache-type-k` and `--cache-type-v` flags. Our particular patch is focused on Metal-specific optimizations, but the core technique transfers directly.
Regarding compatibility with other quantization methods - absolutely. This KV cache optimization is complementary to model weight quantization (Q4_K_M, GPTQ, AWQ, etc.). You can combine asymmetric KV cache precision with any model weight format.
Since KV cache quantization happens at runtime while processing tokens (separate from model weights), it doesn't conflict with how the model itself is quantized. They operate on different parts of the inference pipeline.
What would require additional work is integrating with specialized inference engines that have custom KV cache handling, like vLLM or TensorRT-LLM. Each would need its own implementation of asymmetric KV precision.
The most immediate GPU benefit would likely come from integrating these insights into the FlashAttention implementation directly, where the memory bandwidth savings could translate to even greater speedups on CUDA hardware.
zmmmmm 14 hours ago [-]
Amazing!
Curious, what happens to performance? I assume you still pay the same performance price for longer context, even if you can now fit it in memory.
fennecbutt 13 hours ago [-]
I think this is true; I've found I get roughly the same iteration speed for prompt processing whether the cache is fp16, q8 or q4.
It doesn't make sense to me, though. I haven't looked into how it works inside, but I would've thought it would pack vectors and then do 4-8 bit SIMD on all of them at once; it really seems like it's not packing them.
smcleod 17 hours ago [-]
+0.86% perplexity is quite a bit at such a small context size though, isn't it? How is it at more reasonable context sizes like 64-128k?
nomel 16 hours ago [-]
> This means you can run LLMs with 2-3× longer context on the same Mac. Memory usage scales with sequence length, so savings compound as context grows.
The point seems to be that this reduces memory footprint. This makes it possible to run longer context, for the same limited memory, if you couldn't before. Or, you can use that free memory to do something else, like an IDE.
smcleod 14 hours ago [-]
Yeah, I get that; that's what we use K/V cache quantisation for now, which has a lower impact on PPL than this, unless I'm missing something?
dipampaul17 12 hours ago [-]
You're right to question the perplexity impact - 0.86% isn't negligible. Our extended testing shows this impact remains fairly consistent across context lengths up to 16K, which was our test limit.
We haven't benchmarked at 64-128K contexts yet, but theoretically the relative perplexity impact should remain stable. The absolute impact could potentially compound with very long contexts, though.
The key difference from standard KV quantization is the asymmetric approach. Most implementations use K8V8 (8-bit for both) which has a 0.03% perplexity impact but only 47% memory savings. K8V4 pushes this to 59% savings with the 0.86% quality tradeoff.
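Those percentages line up with llama.cpp's block formats once you count the per-block scale overhead - a quick sketch, assuming q8_0 costs ~8.5 bits/element and q4_0 ~4.5 bits/element (each 32-element block carries a 16-bit scale) against an FP16 baseline for both K and V:

```python
# Memory saved vs an FP16 KV cache for a given (key bits, value bits) pair.
def kv_savings(k_bits, v_bits, baseline_bits=16.0):
    return 1.0 - (k_bits + v_bits) / (2 * baseline_bits)

print(f"K8V8: {kv_savings(8.5, 8.5):.0%} saved")   # ~47%
print(f"K8V4: {kv_savings(8.5, 4.5):.0%} saved")   # ~59%
print(f"K4V8: {kv_savings(4.5, 8.5):.0%} saved")   # same footprint as K8V4
```

K4V8 is included because it uses the same memory as K8V4 - that's the pair behind the 7× quality comparison mentioned earlier.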
For reference, the quality impact is still well below the typical 5% threshold where differences become noticeable in generated text. It's a reasonable tradeoff for the additional memory savings, especially at long contexts.
@smcleod - We're using the same underlying quantization methods, just applying them asymmetrically between keys and values. If your existing approach already uses lower precision for values than keys, you're likely getting similar benefits.
segmondy 17 hours ago [-]
you can do this already with -ctk and -ctv, why would anyone need this?
-ctk, --cache-type-k TYPE KV cache data type for K
-ctv, --cache-type-v TYPE KV cache data type for V
Am I missing something? As far as I can see this patch does nothing except add new options that replicate the functionality of the existing --cache-type-k and --cache-type-v options.
Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0` is a very well known optimization to save VRAM.
And it's also very well known that the keys are more sensitive to quantization than values. E.g. https://arxiv.org/abs/2502.15075
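For anyone driving llama.cpp from Python rather than the CLI, here's a rough equivalent of that kind of setup (using q4_0 for the V cache, per the correction below) via the llama-cpp-python bindings. This is a hedged sketch, assuming a recent build whose Llama constructor exposes type_k, type_v and flash_attn; the model path is a placeholder and the integers are ggml's type-enum values.

```python
from llama_cpp import Llama

llm = Llama(
    model_path="models/your-model.Q4_K_M.gguf",  # placeholder path, any .gguf
    n_ctx=16384,
    flash_attn=True,  # stock llama.cpp expects flash attention when the V cache is quantized
    type_k=8,         # ggml type 8 = q8_0: keys at 8-bit
    type_v=2,         # ggml type 2 = q4_0: values at 4-bit
)

out = llm("Summarize the KV cache discussion:", max_tokens=32)
print(out["choices"][0]["text"])
```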
Aurornis 14 hours ago [-]
> Using `--flash-attn --cache-type-k q8_0 --cache-type-v q8_0`
I think you meant ‘--cache-type-v q4_0’
I would also like an explanation for what’s different in this patch compared to the standard command line arguments.