Hello,
https://github.com/AlpinDale/sparsegpt-for-LLaMA
https://arxiv.org/abs/2301.00774
"We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models."
It looks like someone has implemented SparseGPT for the LLaMA model. If I understand correctly, that means we can cut the size of the LLaMA models in half without a significant loss in accuracy.
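For anyone curious what 50% unstructured sparsity looks like in practice, here is a minimal PyTorch sketch. It uses plain one-shot magnitude pruning on a single made-up 4096x4096 layer, not SparseGPT's actual weight-reconstruction method, and the memory saving only materializes if the zeroed weights are then kept in a sparse format or exploited by sparsity-aware kernels:

```python
# Rough illustration only: one-shot magnitude pruning to 50% unstructured
# sparsity (NOT SparseGPT's Hessian-based reconstruction). The layer shape
# below is a hypothetical LLaMA-like projection, not taken from the repo.
import torch

def prune_to_sparsity(weight: torch.Tensor, sparsity: float = 0.5) -> torch.Tensor:
    """Zero out the smallest-magnitude entries so `sparsity` of them become zero."""
    k = int(weight.numel() * sparsity)
    # k-th smallest absolute value becomes the pruning threshold
    threshold = weight.abs().flatten().kthvalue(k).values
    mask = weight.abs() > threshold
    return weight * mask

w = torch.randn(4096, 4096)              # stand-in for one weight matrix
w_pruned = prune_to_sparsity(w, 0.5)
print(f"sparsity: {(w_pruned == 0).float().mean().item():.2%}")

# The zeros only translate into smaller storage if the tensor is kept in a
# sparse layout, e.g. compressed sparse row:
w_csr = w_pruned.to_sparse_csr()
```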
I want to know what you think about it and whether you're planning on testing it to see if you can get the same results with half the VRAM. I would finally be able to run the 13B model on my shitty 8 GB 2080 Super 🤣
PS: In less than a month, 65B LLaMA will work on the Super Nintendo 👀