2023-NeurIPS QLoRA: Efficient Finetuning of Quantized LLMs

Motivation

QLoRA is an efficient finetuning method built on quantile quantization: gradients are backpropagated through a frozen, 4-bit-quantized pretrained model into LoRA adapters. This makes it possible to finetune a 65B-parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance.

Method

Core concepts:

  • 4-bit NormalFloat Quantization (NF4)
  • Double Quantization
  • Paged optimizers

4-bit NormalFloat Quantization (NF4)

The final goal

FP32 -> INT4, where INT4 has a range of [-8, 7] (the first bit is the sign bit; in a sign-magnitude reading both “1000” and “0000” would represent zero, so the two's-complement convention is used: “1000” maps to -8 and “0000” maps to 0).
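As a quick illustration (illustration only, not from the paper), the 16 INT4 bit patterns and their two's-complement values can be enumerated as follows:

```python
# Illustration: the 16 INT4 bit patterns and their two's-complement values,
# showing the asymmetric range [-8, 7] with "1000" -> -8 and "0000" -> 0.
for code in range(16):
    value = code - 16 if code >= 8 else code
    print(f"{code:04b} -> {value:+d}")
```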

The overall pipeline

(1) estimate the $2^k + 1$ quantiles of a theoretical N(0, 1) distribution to obtain a k-bit quantile quantization data type for normal distributions;

(2) take this data type and normalize its values into the [−1, 1] range;

(3) quantize an input weight tensor by normalizing it into the [−1, 1] range through absolute maximum rescaling.
(4) dequantize at inference time by mapping each 4-bit index back to its codebook value and rescaling with the stored quantization constant.

For a detailed explanation, refer to: link1, link2

Here, we have the quantile-averaging estimator from the paper:
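$$ q_i = \frac{1}{2}\left( Q_X\!\left(\frac{i}{2^k+1}\right) + Q_X\!\left(\frac{i+1}{2^k+1}\right) \right) $$

where $Q_X(\cdot)$ is the quantile function (inverse CDF) of the standard normal distribution $N(0, 1)$, and $q_i$ is the $i$-th value of the $2^k$-entry codebook.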

The process of getting the 16 quantile values is sketched in the code below:
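A minimal sketch of that process, assuming NumPy and SciPy (this is the simple symmetric construction; the actual NF4 data type in bitsandbytes uses an asymmetric split of the negative and positive halves so that zero is represented exactly, and the `offset` value here is an assumption borrowed from common implementations):

```python
# Minimal sketch: build a k-bit NormalFloat codebook by
# (1) estimating 2^k + 1 quantiles of N(0, 1),
# (2) averaging neighbouring quantiles to get 2^k values,
# (3) normalizing the values into [-1, 1].
import numpy as np
from scipy.stats import norm

def build_nf_codebook(k=4, offset=0.9677083):
    # evaluate the standard-normal quantile function on an evenly spaced grid;
    # the offset keeps us away from the infinite quantiles at p = 0 and p = 1
    p = np.linspace(1 - offset, offset, 2**k + 1)
    q = norm.ppf(p)                        # 2^k + 1 quantiles of N(0, 1)
    codebook = (q[1:] + q[:-1]) / 2        # average adjacent quantiles -> 2^k values
    codebook /= np.abs(codebook).max()     # normalize into [-1, 1]
    return codebook

print(build_nf_codebook())                 # 16 NF4-style values in [-1, 1]
```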

The process of quantization and inference-time dequantization:
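Below is a minimal sketch of blockwise quantization and dequantization (an assumed simplification, not the real bitsandbytes CUDA kernels; the block size of 64 matches the paper's first-level quantization):

```python
# Minimal sketch: blockwise NF4-style quantization via absolute-maximum
# rescaling, plus the dequantization used at inference time.
import numpy as np
from scipy.stats import norm

# simplified 16-value codebook, built as in the previous sketch
p = np.linspace(1 - 0.9677083, 0.9677083, 17)
q = norm.ppf(p)
codebook = (q[1:] + q[:-1]) / 2
codebook /= np.abs(codebook).max()

def quantize(weights, block_size=64):
    w = weights.reshape(-1, block_size)
    absmax = np.abs(w).max(axis=1, keepdims=True)     # per-block quantization constant
    normalized = w / absmax                           # absmax rescaling into [-1, 1]
    # nearest codebook entry -> 4-bit index (stored as uint8 here for simplicity)
    idx = np.abs(normalized[..., None] - codebook).argmin(axis=-1).astype(np.uint8)
    return idx, absmax

def dequantize(idx, absmax):
    # inference: look up the NF4 value and rescale with the stored constant
    return (codebook[idx] * absmax).reshape(-1)

w = np.random.randn(4096).astype(np.float32)
idx, absmax = quantize(w)
w_hat = dequantize(idx, absmax)
print("mean abs error:", np.abs(w - w_hat).mean())
```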

Double Quantization

The core idea is to quantize the quantization constants themselves: the per-block FP32 absmax constants $c_2$ produced by the first quantization are quantized a second time (to 8 bits, with a second-level block size of 256). This reduces the memory cost of the constants from $32/64 = 0.5$ bits per parameter to $8/64 + 32/(64 \cdot 256) \approx 0.127$ bits per parameter.
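A minimal sketch of this second-level quantization (assumption: plain symmetric int8 absmax quantization stands in for the FP8 format used in the paper):

```python
# Minimal sketch: double quantization of the first-level constants.
# The constants are positive, so (as in the paper) the mean is subtracted
# first to centre them around zero before symmetric 8-bit quantization.
import numpy as np

def double_quantize_constants(absmax, block_size=256):
    c = absmax.reshape(-1, block_size)
    mean = c.mean(axis=1, keepdims=True)
    centered = c - mean
    scale = np.abs(centered).max(axis=1, keepdims=True)
    q = np.round(centered / scale * 127).astype(np.int8)   # second-level 8-bit constants
    return q, scale, mean                                   # only scale & mean stay in FP32

def dequantize_constants(q, scale, mean):
    return (q.astype(np.float32) / 127 * scale + mean).reshape(-1)
```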

Paged optimizers

Paged optimizers rely on the NVIDIA unified memory feature. Quoted from the paper: “We use this feature to allocate paged memory for the optimizer states which are then automatically evicted to CPU RAM when the GPU runs out-of-memory and paged back into GPU memory when the memory is needed in the optimizer update step.”
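In practice, this is enabled by choosing a paged optimizer. A short usage sketch, assuming a recent bitsandbytes release that ships the paged optimizers (the class name `PagedAdamW8bit` is an assumption about the current bitsandbytes API, adjust to your version):

```python
# Sketch: use a paged optimizer so its states are allocated in paged
# (unified) memory and can be evicted to CPU RAM under GPU memory pressure.
import torch
import bitsandbytes as bnb

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=1e-4)

loss = model(torch.randn(8, 1024, device="cuda")).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```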

Results

For detailed results, please refer to the original paper.

