LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Category

Network Quantization

Year/Month

2022-08

Status

Done

Publications

Code

https://github.com/TimDettmers/bitsandbytes

TL;DR

Problem: LLM inference requires a large amount of GPU memory.

LLM.int8(): proposes Int8 matrix multiplication for the feed-forward and attention projection layers of the Transformer.

Uses a separate normalization constant for each inner product in the matrix multiplication, so that most features (99.9%) are quantized vector-wise.

The remaining "emergent outliers" are split off into a 16-bit matrix multiplication (a new mixed-precision decomposition scheme).

Result: halves the memory needed for inference with 175B-parameter LLMs (e.g. OPT-175B/BLOOM) without performance degradation.

Motivation

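For Transformer-based LMs at 6.7B parameters and beyond, the feed-forward and attention projection layers account for 95% of the parameters and 65-85% of the computation. → So quantize these.

However, no quantization method so far had avoided performance degradation beyond the 350M scale.

Beyond 6.7B, outlier features emerge → existing 8-bit quantization methods fail, while LLM.int8() maintains 16-bit accuracy.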

1) Vector-wise Quantization

We show that with the first part of our method, vector-wise quantization, it is possible to retain performance at scales up to 2.7B parameters.

That is, vector-wise quantization alone holds performance only up to the 2.7B scale.

For vector-wise quantization, matrix multiplication can be seen as a sequence of independent inner products of row and column vectors. → key idea?

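Matrix multiplication = row vector (dot) column vector, repeated for every row/column pair.
So a separate quantization normalization constant can be used for each inner product.
Before the next operation, the matrix multiplication result can be recovered by denormalizing with the outer product of the column and row normalization constants.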
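
The row/column normalization idea above can be sketched in NumPy. This is a minimal illustration, not the bitsandbytes kernel: the function names, the absmax choice of constant, and the random test matrices are all assumptions made for the example.

```python
import numpy as np

def absmax_scale(x, axis):
    # One normalization constant per vector along `axis` (127 = largest int8 magnitude).
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    return np.where(max_abs == 0, 1.0, max_abs) / 127.0

def vectorwise_int8_matmul(X, W):
    """Approximate X @ W with int8 operands and per-row / per-column constants."""
    cx = absmax_scale(X, axis=1)   # shape (n, 1): one constant per row of X
    cw = absmax_scale(W, axis=0)   # shape (1, m): one constant per column of W
    Xq = np.round(X / cx).astype(np.int8)
    Wq = np.round(W / cw).astype(np.int8)
    # Integer matmul accumulated in int32 to avoid overflow.
    acc = Xq.astype(np.int32) @ Wq.astype(np.int32)
    # Denormalize: cx * cw broadcasts to the outer product of the constants.
    return acc.astype(np.float32) * (cx * cw)

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 64)).astype(np.float32)   # hypothetical hidden states
W = rng.normal(size=(64, 8)).astype(np.float32)   # hypothetical weight matrix
approx = vectorwise_int8_matmul(X, W)
exact = X @ W
rel_err = np.abs(approx - exact).max() / np.abs(exact).max()
```

Because each output element (i, j) is one inner product, scaling it by cx[i] * cw[j] undoes both normalizations at once, which is why a single broadcasted multiply is enough.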

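2) Emergence of outlier features (with supporting evidence)

From 6.7B onward, extreme outliers appear in the feature dimensions of the hidden states during inference.

** As models scale up to 6B, features up to 20x larger than other features first appear in about 25% of the transformer layers, then spread to other layers. → analysis provided.

** Around 6.7B a phase shift occurs: all transformer layers and 75% of the sequence dimensions are affected by extreme features (outliers).

At the 6.7B scale there are about 150k outliers per sequence, but they are concentrated in only 6 feature dimensions across the entire transformer.

Setting these outlier feature dimensions to zero decreases the top-1 attention softmax probability mass by more than 20% and degrades validation perplexity by 600-1000%,

even though they make up only about 0.1% of the input features.
In contrast, removing the same number of features at random decreases the probability by only 0.3% and degrades perplexity by only 0.1%.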

3) The second method: mixed-precision decomposition

Outliers (0.1%) → 16-bit matrix multiplication

The rest (99.9%) → 8-bit matrix multiplication
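
A rough NumPy sketch of this split, under stated assumptions: `threshold=6.0` is an assumed magnitude cutoff not given in these notes, the helper names are hypothetical, and the outlier in column 7 is injected by hand purely for the demo.

```python
import numpy as np

def int8_matmul(X, W):
    # Vector-wise absmax quantization: row constants for X, column constants for W.
    cx = np.maximum(np.abs(X).max(axis=1, keepdims=True, initial=0.0), 1e-8) / 127.0
    cw = np.maximum(np.abs(W).max(axis=0, keepdims=True, initial=0.0), 1e-8) / 127.0
    Xq = np.round(X / cx).astype(np.int32)   # values lie in [-127, 127]
    Wq = np.round(W / cw).astype(np.int32)
    return (Xq @ Wq) * (cx * cw)

def mixed_precision_matmul(X, W, threshold=6.0):
    # Feature dimensions (columns of X) holding at least one large-magnitude value.
    outlier_dims = np.any(np.abs(X) >= threshold, axis=0)
    # Outlier part: kept in 16-bit floating point.
    hi = X[:, outlier_dims].astype(np.float16) @ W[outlier_dims, :].astype(np.float16)
    # Everything else: 8-bit path.
    lo = int8_matmul(X[:, ~outlier_dims], W[~outlier_dims, :])
    return hi.astype(np.float32) + lo, outlier_dims

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 256)).astype(np.float32)
W = rng.normal(size=(256, 16)).astype(np.float32)
X[:, 7] = 20.0 + X[:, 7]   # hand-made outlier feature dimension (demo only)

exact = X @ W
mixed, dims = mixed_precision_matmul(X, W)
plain = int8_matmul(X, W)
err_mixed = np.abs(mixed - exact).mean()
err_plain = np.abs(plain - exact).mean()
```

In the plain int8 path the single large column inflates every row's normalization constant, so the ordinary features lose resolution; pulling that column out is what lets the 8-bit part keep a tight scale.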

Background

zeropoint quantization (high-precision asymmetric quantization)

Provides high precision by using the full bit range of the datatype, but is rarely used in practice because of practical constraints.
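
For reference, a minimal sketch of generic asymmetric (zeropoint) int8 quantization. This is the textbook affine form and may differ in detail from the formula in the paper; the function names and sample values are assumptions.

```python
import numpy as np

def zeropoint_quantize(x):
    # Map the observed range [min, max] onto the 255 steps of [-128, 127].
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 if x_max > x_min else 1.0
    # Integer offset so that x_min lands on -128.
    zero_point = int(round(-128 - x_min / scale))
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def zeropoint_dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Skewed, all-positive input (e.g. post-ReLU activations), hypothetical values.
x = np.array([0.1, 0.5, 1.2, 2.0, 3.9], dtype=np.float32)
q, scale, zp = zeropoint_quantize(x)
x_hat = zeropoint_dequantize(q, scale, zp)
```

With an all-positive input like this one, the offset lets the codes span the whole int8 range, whereas a symmetric scheme would leave the negative half unused.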

absolute maximum quantization (symmetric quantization)

The commonly used technique.
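
A minimal sketch of absmax quantization for a single tensor, assuming int8 with the symmetric range [-127, 127]; the function name and sample values are made up for illustration.

```python
import numpy as np

def absmax_quantize(x):
    # Single constant: the largest magnitude in x maps to 127.
    max_abs = float(np.abs(x).max())
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.round(x / scale).astype(np.int8)
    return q, scale

x = np.array([-1.4, -0.2, 0.0, 0.7, 3.0], dtype=np.float32)
q, scale = absmax_quantize(x)
x_hat = q.astype(np.float32) * scale   # dequantize
```

Zero always maps to code 0 and no offset is needed, which keeps integer matmuls simple; the cost is that one large value stretches the scale for everything else.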

Method

Experiments

Questions

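Outlier features appear in specific dimensions → is this reasonable?

Why do they appear only from the 6.7B scale onward?
If the evidence is solid, couldn't everything except those dimensions be pruned?

Beyond memory savings, is there any speed improvement?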

training with int8 directly?