Epistemic Status: Quick tests; most of this was done in less than 48 hours.

TL;DR: Can you skimp on GPU VRAM? 8-bit quantized seems fine; for 4-bit it depends.

I was asked by @Teun van der Weij to what degree one can run evaluations on quantized models, and I was unsure. I have run some evaluations with Llama 3 and have some quick comparisons now.

Main Quantization Schemes

Here is a list of some different quantization schemes discussed:

GGUF - Special file format used in llama.cpp. Not supported in transformers (at the time of writing).
BNB - BitsAndBytes, the original default in Hugging Face transformers.
BNB NF4 - Alternative mode for BitsAndBytes, "4-bit NormalFloat".
HQQ - Half-Quadratic Quantization, supports 1-8 bits.
GPTQ - One of the older quantization methods. Requires calibration data.
AWQ - "Activation-aware Weight Quantization". Requires calibration data.
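To make the bit widths concrete, here is a minimal sketch of plain round-to-nearest quantization with one scale per group of weights. It is not the algorithm used by any of the methods listed above (GPTQ, AWQ and HQQ all do something smarter), the function names are made up for illustration, and I am assuming that suffixes like _g64 and _g128 in Table 1 refer to this kind of group size.

```python
import random

def quantize_group(values, n_bits):
    """Symmetric round-to-nearest quantization of one group of floats."""
    q_max = 2 ** (n_bits - 1) - 1  # 127 for 8 bits, 7 for 4 bits
    largest = max(abs(v) for v in values)
    scale = largest / q_max if largest > 0 else 1.0  # avoid dividing by zero
    codes = [round(v / scale) for v in values]  # integers in [-q_max, q_max]
    return codes, scale

def dequantize_group(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]

def roundtrip(weights, n_bits, group_size):
    """Quantize and dequantize a flat list of weights, group by group."""
    restored = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale = quantize_group(group, n_bits)
        restored.extend(dequantize_group(codes, scale))
    return restored

def max_abs_error(original, restored):
    """Largest absolute difference between original and restored weights."""
    return max(abs(a - b) for a, b in zip(original, restored))

# Toy "weights": random values, not taken from any real model.
rng = random.Random(0)
weights = [rng.gauss(0.0, 0.02) for _ in range(256)]
for n_bits in (8, 4, 3):
    err = max_abs_error(weights, roundtrip(weights, n_bits, group_size=64))
    print(f"{n_bits}-bit, group size 64: max abs error = {err:.6f}")
```

Smaller groups mean more scales to store (more memory) but a tighter fit per group, which is the trade-off the group size controls.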

Llama 2

Some previous papers have compared the perplexity of different methods. Here is an example from the recent research paper on HQQ quantization:

Method	nBits	Llama-2-7B		Llama-2-13B		Llama-2-70B
Method	nBits	PPL ↓	MEM ↓	PPL ↓	MEM ↓	PPL ↓	MEM ↓
FP	16	5.18	13.5	4.63	25.6	OOM	OOM
BNB	8	5.22	7.9	4.67	14.4	3.17	68.15
GPTQ_g128	8	5.19	7.8	4.63	14.8	3.12	74.87
HQQ_g128	8	5.19	7.6	4.63	14	3.12	69.32
BNB_g64	4	5.43	4.7	4.79	8.2	3.29	39.11
GPTQ_g64	4	5.38	5	4.73	9.1	3.23	41.13
AWQ_g64	4	5.28	4.6	4.7	8.5	3.2	37.08
HQQ_g64	4	5.3	4.6	4.7	8.2	3.19	37.52

Table 1: Comparison of perplexities at different levels of quantization with different methods on WikiText2 dataset for Llama 2 7b, 13b and 70b.

However, it is quite unclear what effect this has on real world performance.
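For reference, perplexity as reported in tables like the one above is usually defined as the exponential of the average negative log-likelihood per token. A minimal sketch, with a made-up function name and hand-written log-probabilities rather than real model outputs:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp(mean negative log-likelihood), using natural logs."""
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)

# Sanity check: a model that assigns probability 1/4 to every correct token
# is "as confused as" a uniform choice between 4 options.
uniform_quarter = [math.log(0.25)] * 10
print(perplexity(uniform_quarter))
```

Because of the exponential, a small shift in perplexity corresponds to a small shift in the average per-token log-probability, which says little by itself about how often the top answer on a benchmark flips.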

Llama 3 8B

Here are some examples of what perplexity looks like with different levels of quantization on Llama 3 8B, taken from the llama.cpp repository^[1]. As far as I can tell, this only includes GGUF, and not any other quantization methods.

Type	Size (GB)	PPL	Mean Δp	RMS Δp
f16	14.97	6.2331	-	-
q8_0	7.96	6.2342	-0.019 ± 0.003 %	1.198 %
q6_K	6.14	6.2533	-0.007 ± 0.006 %	2.295 %
q5_K_M	5.33	6.2886	-0.114 ± 0.008 %	3.160 %
q5_0	5.21	6.3632	-0.416 ± 0.012 %	4.634 %
q4_K_M	4.58	6.3830	-0.389 ± 0.014 %	5.251 %
q4_0	4.34	6.7001	-1.588 ± 0.022 %	8.434 %

Table 2: Comparison of perplexities at different levels of GGUF quantization on the WikiText2 dataset for Llama 3 8b.
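A rough way to sanity-check sizes like those in Table 2 is to multiply the parameter count by the bits per weight. The sketch below assumes roughly 8.03 billion parameters for Llama 3 8B and reports binary gigabytes (GiB); both are my assumptions, and the post's tables do not say which kind of GB they use.

```python
def weight_memory_gib(n_params, bits_per_weight):
    """Naive weight-only memory estimate in GiB (2**30 bytes)."""
    total_bytes = n_params * bits_per_weight / 8
    return total_bytes / 2**30

# Assumed parameter count for Llama 3 8B (approximate, not from the post).
N_PARAMS = 8.03e9

for bits in (16, 8, 4):
    print(f"{bits:>2} bits/weight: {weight_memory_gib(N_PARAMS, bits):.2f} GiB")
```

This ignores activations, the KV cache, per-group scales and zero points, and any layers kept at higher precision, so real quantized files and real VRAM usage will generally come out larger than the naive estimate.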

This is all well and good, but a lot of us are trying to do interpretability and whatnot, and I personally have found this easiest when using the HuggingFace transformers library. While for some it may make sense to retool, what about those of us who do not want to? While we can wait for potential compatibility with GGUF, there are other quantization methods we can use in the meantime, and I have tried to run some tests.
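As an illustration of staying inside transformers, here is a sketch of loading a model with BitsAndBytes quantization. It is not the exact code used for the results in this post: the helper names are hypothetical, the model ID in the usage comment is an assumption, and it needs the transformers, bitsandbytes and accelerate packages plus a CUDA GPU to actually load anything.

```python
def bnb_settings(scheme):
    """Keyword arguments for transformers.BitsAndBytesConfig (hypothetical helper)."""
    if scheme == "int8":
        return {"load_in_8bit": True}
    if scheme == "nf4":
        return {"load_in_4bit": True, "bnb_4bit_quant_type": "nf4"}
    raise ValueError(f"unknown scheme: {scheme!r}")

def load_quantized(model_id, scheme):
    """Load a causal LM with on-the-fly BitsAndBytes quantization."""
    # Imported lazily so bnb_settings() works without the heavy dependencies.
    from transformers import (
        AutoModelForCausalLM,
        AutoTokenizer,
        BitsAndBytesConfig,
    )

    config = BitsAndBytesConfig(**bnb_settings(scheme))
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        quantization_config=config,
        device_map="auto",  # let accelerate place the layers on available GPUs
    )
    return model, tokenizer

# Example usage (assumed model ID; requires access to the gated checkpoint):
# model, tokenizer = load_quantized("meta-llama/Meta-Llama-3-8B-Instruct", "nf4")
```

The returned model is a regular transformers model object, which is the main appeal for interpretability-style work compared with switching to a separate runtime.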

Llama 3 8B Instruct

Here are some benchmarks with Llama 3 8B, run with different quantization levels and schemes on MMLU, WMDP (% accuracy), and The Pile (perplexity):

Method	nBits	Llama-3-8B-Instruct
		MMLU	WMDP	The Pile (100k tokens)
		↑(0-shot)	↑(0-shot)	PPL ↓	Acc↑	Time↓
BFloat16	16	63.87 %	54.99%	8.283	56.52%	53s^[2]
Float16	16	63.84 %	54.93%	8.279	56.50%	55s^[2]
Hqq Int8	8	63.87 %	54.66%	8.298	56.49%	122s
BnB Int8	8	63.05 %	54.96%	8.305	56.25%	74s
Hqq Int4	4	62.29 %	54.23%	8.482	55.85%	130s
BnB NF4	4	61.44 %	54.42%	8.499	55.80%	95s
BnB Int4	4	60.80 %	52.73%	8.633	55.19%	277s
GPTQ Int4	4	61.58 %	53.30%	8.575	55.21%	233s
AWQ Int4	4	61.84 %	54.55%	8.483	55.55%	270s
Hqq Int3	3	62.26 %	51.23%	8.872	54.49%	201s

We can see that the scores are relatively unaffected going from 16-bit to 8-bit, so it is likely fine to use 8-bit. Going to 4-bit has a more noticeable effect, but it is not massively different on a qualitative level.

Chain of Thought

OK, sure, maybe single-token prediction tasks are relatively unaffected, but Chain of Thought (CoT) requires many "correct" tokens at a time. Maybe these are affected more? I ran some experiments on the Minerva MATH Algebra dataset, zero-shot with Chain-of-Thought. (I only use the "Algebra" subset because CoT takes a long time to generate.) Here are the results:

Method	nBits	Minerva MATH Algebra
Method	nBits	↑(0-shot CoT)	Time↓
BFloat16	16	37.2%	2.3h^[2]
Float16	16	37.5%	2.3h^[2]
Hqq Int8	8	37.9%	15.0h**
BnB Int8	8	36.3%	5.5h**
Hqq Int4	4	33.7 %	2.5h
BnB NF4	4	31.3 %	3.2h
BnB Int4	4	29.3 %	3.1h
GPTQ Int4	4		DNF**
AWQ Int4	4		DNF**
Hqq Int3	3		DNF**

Overall, we can see again that for 8-bit quantization, the effect on performance doesn't seem that large, though there is some noticeable degradation when going to 4 bits.
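One way to compare how much the 4-bit methods hurt on single-token versus CoT tasks is to look at the accuracy drop relative to the BFloat16 baseline. A small sketch using the numbers from the two results tables above (the function name is made up):

```python
def relative_drop(baseline, score):
    """Relative accuracy loss versus the baseline, in percent."""
    return (baseline - score) / baseline * 100.0

# Accuracies (%) copied from the tables above; BFloat16 is the baseline.
mmlu = {"BFloat16": 63.87, "Hqq Int4": 62.29, "BnB NF4": 61.44, "BnB Int4": 60.80}
algebra_cot = {"BFloat16": 37.2, "Hqq Int4": 33.7, "BnB NF4": 31.3, "BnB Int4": 29.3}

for method in ("Hqq Int4", "BnB NF4", "BnB Int4"):
    mmlu_drop = relative_drop(mmlu["BFloat16"], mmlu[method])
    cot_drop = relative_drop(algebra_cot["BFloat16"], algebra_cot[method])
    print(f"{method}: MMLU -{mmlu_drop:.1f}%, Algebra CoT -{cot_drop:.1f}%")
```

On these numbers the 4-bit methods lose roughly 2 to 5% of the baseline accuracy on MMLU and roughly 9 to 21% on the algebra CoT task, though these are quick tests without error bars, so small differences should not be over-read.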

** One thing to note is that the 8-bit implementations, for some reason, were rather slow. I think this is likely some problem I have with torch.compile / torch._dynamo (since the effects were not as noticeable for 4-bit or for the perplexity results), but I did not have time to test this. It may also be because in some runtimes the weights need to be de-quantized at every step. Also, these results were run on an (architecturally) older A4000, which does not support FP8/FP4 compute, so results may vary.

Conclusion

2024-07-15T16:01:24.960Z

It seems like it is basically fine to use fp16 models or int8 quantized models relatively interchangeably.
You may be able to get away with using 4-bit quantized models depending on the task, but there is some noticeable degradation.

^[1] Note that some earlier implementations of GGUF with Llama 3 8B had an error when loading the tokeniser, and performed much worse because of this.
^[2] Note that the float16 and bfloat16 experiments were run on a dual-GPU setup, so the times may not be directly comparable.

Comment by Jacob Pfau:

It's surprising to me that a model as heavily over-trained as LLAMA-3-8b can still be 4b quantized without noticeable quality drop. Intuitively (and I thought I saw this somewhere in a paper or tweet) I'd have expected over-training to significantly increase quantization sensitivity. Thanks for doing this!


Comparing Quantized Performance in Llama Models
