- Quantization offers the highest ROI among model compression techniques — converting FP16 weights to INT4 immediately reduces memory by 75%, with accuracy loss below 1% in most cases
- AWQ (MLSys 2024 Best Paper) discovered that protecting just 1% of salient weights enables nearly lossless 4-bit quantization, shrinking LLaMA-70B's weights from 140GB to roughly 35GB — within reach of a single 48GB workstation GPU
- QLoRA makes quantization more than just an inference technique — the combination of 4-bit quantization + LoRA fine-tuning enables fine-tuning a 65B model on a single 48GB GPU, with quality comparable to full-precision fine-tuning
- BitNet b1.58 uses ternary {-1, 0, +1} weights to match FP16 LLaMA at the 3B scale, achieving 2.71x speed and 3.55x less memory — challenging the fundamental assumption that "models must use floating-point numbers"
I. The "Precision Prisoner" Dilemma in AI: You Are Paying for Unnecessary Precision
The AI industry has an expensive habit: storing and computing every model parameter with 32-bit or 16-bit floating-point numbers. LLaMA-70B requires 140GB of memory in FP16 — exceeding the capacity of any single consumer-grade GPU. Even enterprise-grade A100 80GB cards require at least two just to load the model. As Harvard Business Review points out[1], the energy consumption of global AI infrastructure is expanding at an alarming rate, with a significant portion of compute power spent maintaining "unnecessary precision."
Research from MIT Sloan Management Review[2] further demonstrates that smaller, more efficient AI deployments often deliver higher business returns than pursuing the largest models. The core question is: does a model really need 16-bit precision?
In most cases, the answer is "no." The core insight of quantization is that neural network weights and activations contain substantial redundant precision. Converting FP16 (16-bit floating-point) to INT4 (4-bit integer) immediately reduces memory by 75%, with accuracy loss typically under 1%. More aggressive research (such as BitNet b1.58) has even demonstrated that LLMs trained with just three values {-1, 0, +1} can achieve performance comparable to full-precision models.
Unlike pruning (removing parameters) and distillation (training new models), the core operation of quantization is reducing the numerical precision of each parameter — the model structure remains unchanged, the number of parameters stays the same, but the bits occupied by each parameter are dramatically reduced. This makes quantization the easiest to adopt and least retraining-dependent method among the three major compression techniques.
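The basic mechanics fit in a few lines. Below is a minimal sketch of symmetric round-to-nearest INT4 quantization with a single per-tensor scale (real kernels pack two 4-bit values per byte and typically use per-group scales, but the memory arithmetic is the same):

```python
import numpy as np

def quantize_int4(w):
    """Symmetric round-to-nearest INT4: map floats onto 16 integer levels."""
    scale = np.abs(w).max() / 7.0                    # INT4 range is [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=(4096, 4096)).astype(np.float32)  # LLM-like weight scale

q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

fp16_bytes = w.size * 2       # 2 bytes per parameter in FP16
int4_bytes = w.size // 2      # 4 bits per parameter, packed two per byte
rel_err = np.abs(w - w_hat).mean() / np.abs(w).mean()
print(f"{fp16_bytes // 1024**2} MB -> {int4_bytes // 1024**2} MB, mean relative error {rel_err:.2f}")
```

The 75% memory figure falls directly out of the byte counts; methods like GPTQ and AWQ exist to shrink the error term, not the storage.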
II. Technical Evolution: From INT8's Cautious Beginnings to 1.58-bit's Extreme Breakthrough
2.1 Fundamental Concepts: PTQ vs. QAT
Quantization techniques fall into two major approaches, with fundamental differences in cost, accuracy, and applicable scenarios:
Post-Training Quantization (PTQ) is performed directly after model training is complete, requiring only a small amount of calibration data (typically 128-512 samples) and no retraining. PTQ is currently the mainstream method for LLM quantization because the cost of retraining a 70B+ model is prohibitive. GPTQ, AWQ, and GGUF quantization all belong to the PTQ category.
Quantization-Aware Training (QAT) simulates quantization effects during the training process, allowing the model to learn to maintain accuracy at lower precision. QAT typically achieves better accuracy than PTQ but requires full training infrastructure. Google's pioneering work in 2018[3] laid the engineering foundation for QAT, and it remains the core method for mobile device deployment (TensorFlow Lite) to this day.
| Characteristic | PTQ (Post-Training Quantization) | QAT (Quantization-Aware Training) |
|---|---|---|
| Requires retraining? | No (calibration only) | Yes (full or partial training) |
| Time cost | Minutes to hours | Days to weeks |
| Accuracy (8-bit) | Nearly lossless | Nearly lossless |
| Accuracy (4-bit) | Slight degradation (controllable with GPTQ/AWQ) | Better (but higher cost) |
| Accuracy (2-bit) | Noticeable degradation | Acceptable (requires specialized design) |
| Applicable scenarios | LLM inference deployment (mainstream) | Edge devices, ultra-low-bit requirements |
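To make the QAT row concrete: training through a rounding step requires the straight-through estimator (STE), because round() has zero gradient almost everywhere. A minimal fake-quantization sketch in PyTorch (illustrative only, not the torch.ao QAT API):

```python
import torch

class FakeQuantSTE(torch.autograd.Function):
    """Quantize-dequantize in forward; straight-through gradient in backward."""
    @staticmethod
    def forward(ctx, w, n_bits=4):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max() / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None   # STE: pretend round() is the identity

# During QAT, each layer applies fake quantization to its weights in forward:
w = torch.randn(64, 64, requires_grad=True)
w_q = FakeQuantSTE.apply(w, 4)
loss = (w_q ** 2).sum()
loss.backward()
print(w.grad.shape)   # gradients flow back to the full-precision "shadow" weights
```

The model thus trains on quantized values but updates full-precision copies, learning weights that survive the rounding.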
2.2 LLM.int8(): Breaking Through the Memory Wall for Large Models
In 2022, Tim Dettmers et al. published LLM.int8() at NeurIPS[4], enabling INT8 inference for models with 175B parameters (such as OPT-175B) on GPUs for the first time — halving memory with zero accuracy degradation.
The core discovery of LLM.int8() is that large Transformer architectures contain a small number of "outlier features" — certain activation dimensions are over 100x larger than others. Direct quantization would "clip" these outliers, causing severe accuracy collapse. The solution is mixed-precision decomposition: maintaining FP16 for outlier dimensions and using INT8 for the rest — the matrix multiplication results from both are then merged.
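The decomposition can be sketched in a few lines. This toy version uses the paper's default outlier threshold of 6.0 and vector-wise (per-row/per-column) absmax INT8 scaling; bitsandbytes implements the same idea with fused kernels:

```python
import torch

def mixed_precision_matmul(x, w, outlier_threshold=6.0):
    """Sketch of LLM.int8(): FP16 for outlier feature dims, INT8 for the rest."""
    # Feature dimensions whose activation magnitude exceeds the threshold
    outlier_dims = (x.abs().max(dim=0).values > outlier_threshold)

    # FP16 path: the few outlier columns of x times the matching rows of w
    y_fp16 = x[:, outlier_dims] @ w[outlier_dims, :]

    # INT8 path: quantize the remaining columns/rows with absmax scaling
    x_r, w_r = x[:, ~outlier_dims], w[~outlier_dims, :]
    sx = x_r.abs().amax(dim=1, keepdim=True) / 127   # per row of x
    sw = w_r.abs().amax(dim=0, keepdim=True) / 127   # per column of w
    x_q = torch.clamp((x_r / sx).round(), -127, 127)
    w_q = torch.clamp((w_r / sw).round(), -127, 127)
    y_int8 = (x_q @ w_q) * sx * sw                   # dequantize the result

    return y_fp16 + y_int8

torch.manual_seed(0)
x = torch.randn(8, 512); x[:, 3] *= 20               # inject one outlier feature
w = torch.randn(512, 256) * 0.02
y = mixed_precision_matmul(x, w)
print((y - x @ w).abs().mean())                      # small quantization error
```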
This paper not only solved a technical challenge but also gave rise to the bitsandbytes library — now the most widely used quantization backend in the HuggingFace ecosystem.
2.3 GPTQ: One-Shot Compression to 3-4 Bit
If INT8 is the "safe zone" of quantization, then GPTQ[5] (ICLR 2023) opened the "aggressive zone" — compressing LLMs to 3-4 bits per parameter.
GPTQ is based on a clever approximation: using second-order information (an approximation of the Hessian matrix) to determine the optimal compensation strategy for quantizing each weight. By quantizing layer by layer and "propagating" quantization error to subsequent weights, GPTQ compressed OPT-175B and BLOOM-176B to 3-4 bit within hours with almost no accuracy loss. This means a model that originally required 350GB of memory now needs only ~66-88GB — fitting on one or two high-end GPUs.
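The per-column loop can be rendered as a toy, under simplifying assumptions (one small dense layer, a single per-tensor scale; the real implementation works in blocks with lazy batched updates):

```python
import numpy as np

def quantize_rtn(w, scale, qmax):
    """Plain round-to-nearest on a fixed grid."""
    return np.clip(np.round(w / scale), -qmax - 1, qmax) * scale

def gptq_toy(W, X, bits=4, damp=0.01):
    """Toy GPTQ: quantize columns in order, compensating later columns via
    the upper Cholesky factor of the inverse Hessian (H ~ X^T X)."""
    W = W.copy()
    d = W.shape[1]
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(W).max() / qmax

    H = X.T @ X
    H += damp * np.mean(np.diag(H)) * np.eye(d)      # dampening for stability
    U = np.linalg.cholesky(np.linalg.inv(H)).T       # upper factor: H^-1 = U^T U

    Q = np.zeros_like(W)
    for i in range(d):
        Q[:, i] = quantize_rtn(W[:, i], scale, qmax)
        err = (W[:, i] - Q[:, i]) / U[i, i]
        if i + 1 < d:                                # push error onto later columns
            W[:, i + 1:] -= np.outer(err, U[i, i + 1:])
    return Q

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 8)) @ rng.normal(size=(8, 64))   # correlated calibration inputs
W0 = rng.normal(size=(16, 64)) * 0.1

Q_gptq = gptq_toy(W0, X)
Q_rtn = quantize_rtn(W0, np.abs(W0).max() / 7, 7)
err = lambda Q: np.mean((X @ W0.T - X @ Q.T) ** 2)
print(f"layer output MSE  RTN: {err(Q_rtn):.5f}   GPTQ: {err(Q_gptq):.5f}")
```

On correlated inputs like these, the compensation step shifts quantization error into directions the calibration data barely excites, which is where GPTQ's advantage over naive rounding comes from.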
Another important contribution of GPTQ was establishing the engineering benchmark for LLM quantization — subsequent methods like AWQ and SqueezeLLM all use GPTQ as their comparison baseline.
2.4 AWQ: Protecting Just 1% of Critical Weights
MIT's Song Han team's AWQ[6] (MLSys 2024 Best Paper) made a surprising discovery: not all weights are equally important. By identifying the 1% of "critical weights" based on activation magnitudes and applying special protection (multiplying by a scaling factor before quantization), 4-bit quantization becomes nearly lossless.
Unlike GPTQ's layer-by-layer second-order optimization, AWQ's approach is more intuitive, faster, and hardware-friendly — the quantized format it produces can be efficiently executed directly on GPUs, achieving over 3x speedup on edge GPUs.
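The scaling trick can be sketched as follows (a toy version with a fixed scale factor and per-output-channel rounding; real AWQ searches the scale per group and folds the inverse scale into the preceding operator):

```python
import torch

def awq_toy(w, x_calib, bits=4, protect_frac=0.01, s=2.0):
    """Scale the ~1% most activation-salient input channels up before
    quantization; at inference the inverse scale is applied to activations."""
    qmax = 2 ** (bits - 1) - 1

    importance = x_calib.abs().mean(dim=0)            # saliency per input channel
    k = max(1, int(protect_frac * w.shape[0]))
    salient = importance.topk(k).indices

    scales = torch.ones(w.shape[0])
    scales[salient] = s                               # protect salient channels
    w_scaled = w * scales[:, None]                    # w: (in_features, out_features)

    step = w_scaled.abs().amax(dim=0, keepdim=True) / qmax   # per output column
    w_q = torch.clamp((w_scaled / step).round(), -qmax - 1, qmax) * step
    return w_q, scales                                # y ~= (x / scales) @ w_q

torch.manual_seed(0)
x = torch.randn(128, 512)
x[:, 7] *= 30                                         # one dominant activation channel
w = torch.randn(512, 256) * 0.02

w_q, scales = awq_toy(w, x)
err_awq = ((x / scales) @ w_q - x @ w).abs().mean()

step = w.abs().amax(dim=0, keepdim=True) / 7          # plain RTN baseline
w_rtn = torch.clamp((w / step).round(), -8, 7) * step
err_rtn = (x @ w_rtn - x @ w).abs().mean()
print(f"RTN error {err_rtn:.4f} vs AWQ-style error {err_awq:.4f}")
```

Because the output error is dominated by the few channels with huge activations, halving their relative quantization error matters more than slightly coarsening everyone else's grid.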
2.5 QLoRA: The Golden Combination of Quantization + Fine-Tuning
Quantization has traditionally been viewed as a pure inference technique — train the model first, then quantize for deployment. QLoRA[7] (NeurIPS 2023 Oral) broke this boundary: it lets you fine-tune a model in its quantized state.
QLoRA's three major innovations:
- 4-bit NormalFloat (NF4): A quantization format specifically designed for normally distributed weights, offering better information retention than standard INT4
- Double Quantization: Even the quantization parameters themselves are quantized, further saving memory
- Paged Optimizers: Leveraging CPU memory to handle GPU memory overflow
The result: a single 48GB GPU can fine-tune a 65B model, with quality comparable to full 16-bit fine-tuning. QLoRA's Guanaco model (LLaMA fine-tuned on the OASST1 dataset) completed training in 24 hours with quality reaching 99.3% of ChatGPT.
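The NF4 idea in miniature: normalize each weight block by its absolute maximum, then snap to 16 fixed levels placed at quantiles of a standard normal rather than on a uniform grid. The levels below are derived from evenly spaced quantiles for illustration — they approximate, but do not exactly match, the bitsandbytes NF4 table:

```python
import numpy as np
from statistics import NormalDist

# 16 levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]
nd = NormalDist()
levels = np.array([nd.inv_cdf(p) for p in np.linspace(0.02, 0.98, 16)])
levels /= np.abs(levels).max()

def nf4_quantize(block):
    absmax = np.abs(block).max()                     # per-block scale factor
    idx = np.abs(block[:, None] / absmax - levels[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), absmax

def nf4_dequantize(idx, absmax):
    return levels[idx] * absmax

rng = np.random.default_rng(0)
block = rng.normal(0, 0.02, 64)                      # QLoRA quantizes in 64-weight blocks
idx, absmax = nf4_quantize(block)
err = np.abs(nf4_dequantize(idx, absmax) - block).mean()
print(f"mean abs error: {err:.5f} on weights of scale 0.02")
```

Since pretrained weights are approximately normal, concentrating levels near zero wastes fewer codes than uniform INT4; double quantization then compresses the per-block `absmax` values themselves.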
2.6 Extreme Compression: SqueezeLLM, AQLM, and QuIP#
When bit-widths drop below 3-bit, traditional uniform quantization (mapping each value to equally spaced discrete points) begins to falter. In 2024, three ICML papers simultaneously tackled this challenge:
SqueezeLLM[8] employs a "divide and conquer" strategy: extreme outliers are isolated into a sparse matrix (maintaining high precision), while the remaining weights undergo K-means non-uniform quantization — discrete points are not equally spaced but concentrated in regions where the weight distribution is dense.
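A toy rendering of the Dense-and-Sparse split, under simplifying assumptions (plain 1-D Lloyd/K-means and 0.5% outliers kept exact; SqueezeLLM additionally weights K-means by parameter sensitivity):

```python
import numpy as np

def squeezellm_toy(w, bits=3, outlier_pct=0.5):
    """Toy Dense-and-Sparse: keep the largest |w| entries exact in a sparse
    part; quantize the rest with a K-means (non-uniform) codebook."""
    k = 2 ** bits
    flat = w.ravel().copy()

    # Sparse part: top outlier_pct% magnitudes stay full precision
    n_out = max(1, int(len(flat) * outlier_pct / 100))
    out_idx = np.argsort(np.abs(flat))[-n_out:]
    sparse_vals = flat[out_idx].copy()
    dense = np.delete(flat, out_idx)

    # Dense part: 1-D Lloyd iterations place centroids where weights are dense
    centroids = np.quantile(dense, np.linspace(0.05, 0.95, k))
    for _ in range(25):
        assign = np.abs(dense[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centroids[j] = dense[assign == j].mean()
    assign = np.abs(dense[:, None] - centroids[None, :]).argmin(axis=1)

    # Reconstruct: codebook values for dense entries, exact values for outliers
    recon = np.empty_like(flat)
    mask = np.ones(len(flat), bool); mask[out_idx] = False
    recon[mask] = centroids[assign]
    recon[out_idx] = sparse_vals
    return recon.reshape(w.shape), centroids

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64))
recon, centroids = squeezellm_toy(w)
print(f"mean abs reconstruction error: {np.abs(recon - w).mean():.3f}")
```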
AQLM[9] borrows multi-codebook (additive) quantization techniques from information retrieval: each group of weights is represented by the "sum" of codewords from multiple learned codebooks. This made AQLM the first method to be Pareto-optimal when compressing below 3 bits per parameter.
QuIP#[10] takes a mathematical approach: first applying a random Hadamard transform to "scatter" the weights (eliminating inter-dimensional correlations), then using E8 lattice codebooks (the mathematically densest 8-dimensional sphere packing) for vector quantization.
2.7 BitNet b1.58: Challenging the Assumption That "Models Must Be Floating-Point"
While the methods above "compress" existing floating-point models, Microsoft's BitNet[11] poses a more fundamental question: models don't need to be floating-point from the very beginning.
BitNet replaces the standard nn.Linear with BitLinear, using 1-bit weights from the very first training step. In 2024, the team published BitNet b1.58[12], quantizing weights to three values {-1, 0, +1} (1.58 bits = log₂3). At the 3B parameter scale, BitNet b1.58 matched FP16 LLaMA's performance while achieving:
- 2.71x faster inference
- 3.55x less memory usage
- 71.4% lower energy consumption (matrix multiplication)
More remarkably, BitNet's efficiency advantage grows as model scale increases — suggesting that at larger scales, 1.58-bit models may actually surpass full-precision models. BitNet's inference engine bitnet.cpp (published at ACL 2025) runs ternary models efficiently on CPUs, making LLMs possible even on devices without GPUs.
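The b1.58 weight transform itself is tiny — scale by the mean absolute value, then round and clip to {-1, 0, +1}, as described in the BitNet b1.58 report (the surrounding BitLinear also quantizes activations to INT8):

```python
import torch

def absmean_ternary(w: torch.Tensor):
    """BitNet b1.58 weight quantization: W ~= gamma * W_ternary."""
    gamma = w.abs().mean().clamp(min=1e-8)           # per-tensor absmean scale
    w_t = torch.clamp((w / gamma).round(), -1, 1)    # values in {-1, 0, +1}
    return w_t, gamma

torch.manual_seed(0)
w = torch.randn(256, 256) * 0.05
w_t, gamma = absmean_ternary(w)
frac_zero = (w_t == 0).float().mean()
print(f"ternary values: {sorted(w_t.unique().tolist())}, zero fraction: {frac_zero:.0%}")
```

With ternary weights, matrix multiplication reduces to additions and subtractions — no multiplications at all — which is where the energy and latency savings come from.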
III. Empirical Data: Comprehensive Overview of Quantization Compression Results
| Model | Technique | Bit-width | Memory Savings | Accuracy Impact | Source |
|---|---|---|---|---|---|
| OPT-175B | LLM.int8() | INT8 | ~50% | Zero degradation | Dettmers et al., 2022 |
| OPT-175B / BLOOM-176B | GPTQ | 3-4 bit | 75-81% | PPL nearly unchanged | Frantar et al., 2022 |
| LLaMA family | AWQ | 4 bit | ~75% | Nearly lossless; 3x+ speedup | Lin et al., 2024 |
| LLaMA-65B | QLoRA (NF4) | 4 bit + LoRA | ~75% (with fine-tuning) | Comparable to FP16 fine-tuning | Dettmers et al., 2023 |
| LLaMA family | SqueezeLLM | 3 bit | ~81% | Lossless (Dense+Sparse) | Kim et al., 2024 |
| LLaMA-2-70B | AQLM | 2 bit | ~87% | 2-bit Pareto optimal | Egiazarian et al., 2024 |
| 3B scale | BitNet b1.58 | 1.58 bit | 3.55x reduction | Matches FP16 LLaMA | Ma et al., 2024 |
| FLUX.1-dev (12B) | SVDQuant | W4A4 | 3.5x reduction | Nearly lossless | Li et al., 2025 |
IV. Decision Framework: Benefits, Costs, and Applicability of Quantization
Quantization has the lowest adoption barrier among the three major model compression techniques (quantization, pruning, distillation). Understanding its positioning helps in choosing the right compression strategy:
| Dimension | Original Model (FP16/BF16) | Quantized Model (INT4/INT8) |
|---|---|---|
| Memory Usage | 2 bytes per parameter (e.g., LLaMA-70B: 140GB) | 0.5 bytes per parameter (INT4: 35GB) to 1 byte (INT8: 70GB) |
| Inference Speed | Baseline speed | INT8: 1.5-2x speedup; INT4-AWQ: 3x+ speedup |
| Accuracy | Full precision | INT8 nearly lossless; INT4 loss <1%; 2-bit requires specialized methods |
| Adoption Cost | — | Extremely low (PTQ takes minutes to hours, no retraining required) |
| Fine-Tuning Capability | Full fine-tuning | QLoRA supports fine-tuning in quantized state |
| Hardware Requirements | Multiple high-end GPUs | Single consumer GPU for inference; CPU via GGUF format |
Strategic Advantages
- Lowest adoption barrier: PTQ doesn't require retraining — a single line of code is all it takes. bitsandbytes is integrated into HuggingFace, and `load_in_4bit=True` gets you started
- Immediate impact: INT4 quantization directly reduces memory by 75%, shrinking LLaMA-70B's weights from 140GB to roughly 35GB — from "requires 4 A100s" to "fits on a single 48GB GPU"
- Stacks cleanly with pruning and distillation: NVIDIA Minitron first prunes + distills, then quantizes; Pruna AI's `smash()` automatically combines multiple compression techniques
- Mature ecosystem: The three major formats — GPTQ, AWQ, and GGUF — each have complete toolchains and community support, with thousands of pre-quantized models on HuggingFace
Risks That Must Be Managed
- Outlier sensitivity: Outlier features in large Transformers cause naive quantization (direct clipping to INT8) to collapse. Methods like LLM.int8() and AWQ that specifically handle outliers must be used
- Low-bit precision cliff: Above 4-bit is generally safe, but precision degradation below 3-bit is nonlinear. Each model's "sweet spot" differs and requires experimental validation
- Calibration data impact: PTQ quality depends on the representativeness of calibration data. If calibration data significantly differs from the actual use case, post-quantization accuracy may fall short of expectations
- Format fragmentation: GPTQ, AWQ, GGUF, and bitsandbytes each have different formats and toolchains. Choosing the wrong format may lead to inference engine incompatibility
- Quantization is not automatically smaller storage: bitsandbytes' load_in_4bit quantizes at load time, so the checkpoint on disk stays FP16-sized; only formats that store weights already quantized (GPTQ, AWQ, GGUF) shrink the model file as well
Quantization vs. Pruning vs. Distillation: When to Use Which?
| Scenario | Recommended Technique | Reason |
|---|---|---|
| Quickly reduce inference memory | Quantization (AWQ / GPTQ / GGUF) | No retraining needed, completed in minutes |
| Need to change model architecture | Distillation | Student can use a different architecture |
| Remove redundant structures | Pruning | Structured pruning truly shrinks the model |
| Pursue extreme compression | Pruning + Distillation + Quantization | Combining all three achieves 35-49x compression |
| Edge device / CPU deployment | Quantization (GGUF) + Pruning | GGUF natively supports CPU inference |
| Low-budget large model fine-tuning | QLoRA (Quantization + LoRA) | Fine-tune 65B models on a single GPU |
V. Hands-on Lab: Google Colab Online Workshop (CV Model Quantization)
Let's start with the most fundamental scenario: using PyTorch's built-in quantization tools to perform INT8 quantization on ResNet-18, comparing accuracy, speed, and model size before and after quantization. All code can be run directly on Google Colab; quantized inference runs on the CPU, since PyTorch's quantized kernels target CPU backends.
Open Google Colab, create a new Notebook, and paste the following code blocks in sequence:
5.1 Step 1 — Load Pre-trained Model and Data
import torch
import torch.nn as nn
import torch.quantization as quant
import torchvision
import torchvision.transforms as transforms
import torchvision.models as models
import time, os, copy
# This experiment uses CPU (PyTorch quantization inference is optimized for CPU)
device = torch.device("cpu")
print(f"Device: {device}")
# ---- Dataset ----
transform_test = transforms.Compose([
transforms.ToTensor(),
transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
])
testset = torchvision.datasets.CIFAR10(root='./data', train=False,
download=True, transform=transform_test)
testloader = torch.utils.data.DataLoader(testset, batch_size=256,
shuffle=False, num_workers=2)
# Calibration data
calibration_set = torchvision.datasets.CIFAR10(root='./data', train=True,
download=True, transform=transform_test)
calibloader = torch.utils.data.DataLoader(calibration_set, batch_size=64,
shuffle=True, num_workers=2)
# ---- Model: ResNet-18 (adapted for CIFAR-10) ----
model = models.resnet18(weights=None, num_classes=10)
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()
# Train a baseline model from scratch (CIFAR-10 has no official pretrained
# ResNet-18). Use a GPU if available — 10 epochs on CPU would take hours —
# then move the model back to CPU for quantization.
import torch.optim as optim
train_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(train_device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)
trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465), (0.2470, 0.2435, 0.2616)),
]))
trainloader = torch.utils.data.DataLoader(trainset, batch_size=128, shuffle=True, num_workers=2)
print("Training baseline model (10 epochs)...")
model.train()
for epoch in range(10):
    for inputs, targets in trainloader:
        inputs, targets = inputs.to(train_device), targets.to(train_device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
    scheduler.step()
    if (epoch + 1) % 5 == 0:
        print(f"  Epoch {epoch+1}/10 complete")
model.to("cpu")
model.eval()
print("Baseline model training complete")
5.2 Step 2 — Evaluation Utility Functions
def evaluate(model, dataloader):
"""Compute test set accuracy (CPU)"""
model.eval()
correct, total = 0, 0
with torch.no_grad():
for inputs, targets in dataloader:
outputs = model(inputs)
_, predicted = outputs.max(1)
total += targets.size(0)
correct += predicted.eq(targets).sum().item()
return 100. * correct / total
def measure_speed(model, input_size=(1, 3, 32, 32), n_runs=300):
"""Measure CPU inference latency (ms)"""
model.eval()
dummy = torch.randn(*input_size)
for _ in range(50):
with torch.no_grad():
model(dummy)
start = time.perf_counter()
for _ in range(n_runs):
with torch.no_grad():
model(dummy)
return (time.perf_counter() - start) / n_runs * 1000
def get_model_size_mb(model):
"""Compute model size (MB)"""
torch.save(model.state_dict(), "/tmp/_tmp_q.pth")
size = os.path.getsize("/tmp/_tmp_q.pth") / 1024 / 1024
os.remove("/tmp/_tmp_q.pth")
return size
# ---- Baseline data ----
base_acc = evaluate(model, testloader)
base_speed = measure_speed(model)
base_size = get_model_size_mb(model)
print(f"{'='*55}")
print(f" Baseline Model (FP32 ResNet-18)")
print(f"{'='*55}")
print(f" Accuracy: {base_acc:.2f}%")
print(f" Latency: {base_speed:.2f} ms")
print(f" Model Size: {base_size:.2f} MB")
print(f"{'='*55}")
5.3 Step 3 — Dynamic Quantization (Simplest: One Line of Code)
# Dynamic quantization: one line of code, done instantly
# Only quantizes Linear layer weights to INT8; activations are dynamically quantized at inference
model_dynamic = torch.quantization.quantize_dynamic(
copy.deepcopy(model),
{nn.Linear}, # Which layers to quantize
dtype=torch.qint8
)
dyn_acc = evaluate(model_dynamic, testloader)
dyn_speed = measure_speed(model_dynamic)
dyn_size = get_model_size_mb(model_dynamic)
print(f"\nDynamic Quantization (INT8)")
print(f" Accuracy: {dyn_acc:.2f}% (delta {dyn_acc - base_acc:+.2f}%)")
print(f" Latency: {dyn_speed:.2f} ms (speedup {base_speed/dyn_speed:.2f}x)")
print(f" Model Size: {dyn_size:.2f} MB (compression {base_size/dyn_size:.2f}x)")
5.4 Step 4 — Static Quantization (Better: Both Weights and Activations Quantized)
# Static quantization: both weights and activations in INT8, calibrated first.
# Eager-mode static quantization of ResNet would need manual QuantStub/DeQuantStub
# insertion and FloatFunctional for the residual adds, so we use FX graph mode,
# which handles fusion (Conv + BN + ReLU) and stub insertion automatically.
from torch.ao.quantization import get_default_qconfig_mapping
from torch.ao.quantization.quantize_fx import prepare_fx, convert_fx

model_static = copy.deepcopy(model)
model_static.eval()
qconfig_mapping = get_default_qconfig_mapping("x86")
example_inputs = (torch.randn(1, 3, 32, 32),)

# Insert observers (fusion happens automatically during prepare)
model_prepared = prepare_fx(model_static, qconfig_mapping, example_inputs)

# Calibration: run a few batches of training data
print("Static quantization calibrating...")
with torch.no_grad():
    for i, (inputs, _) in enumerate(calibloader):
        model_prepared(inputs)
        if i >= 15:  # ~1000 images
            break

# Convert to quantized model
model_quantized = convert_fx(model_prepared)
stat_acc = evaluate(model_quantized, testloader)
stat_speed = measure_speed(model_quantized)
stat_size = get_model_size_mb(model_quantized)
print(f"\nStatic Quantization (INT8)")
print(f" Accuracy: {stat_acc:.2f}% (delta {stat_acc - base_acc:+.2f}%)")
print(f" Latency: {stat_speed:.2f} ms (speedup {base_speed/stat_speed:.2f}x)")
print(f" Model Size: {stat_size:.2f} MB (compression {base_size/stat_size:.2f}x)")
5.5 Step 5 — Complete Comparison
print(f"\n{'='*70}")
print(f" Complete Quantization Comparison (ResNet-18 / CIFAR-10 / CPU)")
print(f"{'='*70}")
print(f"{'Method':<14} {'Accuracy':>8} {'Delta':>8} {'Latency(ms)':>11} "
f"{'Speedup':>7} {'Size(MB)':>9} {'Compress':>8}")
print(f"{'-'*70}")
results = [
('FP32 Orig', base_acc, 0, base_speed, 1.0, base_size, 1.0),
('Dynamic INT8', dyn_acc, dyn_acc-base_acc, dyn_speed,
base_speed/dyn_speed, dyn_size, base_size/dyn_size),
('Static INT8', stat_acc, stat_acc-base_acc, stat_speed,
base_speed/stat_speed, stat_size, base_size/stat_size),
]
for name, acc, delta, speed, speedup, size, compress in results:
print(f"{name:<14} {acc:>7.2f}% {delta:>+7.2f}% {speed:>10.2f} "
f"{speedup:>6.2f}x {size:>8.2f} {compress:>7.2f}x")
print(f"{'='*70}")
print(f"\nKey Observations:")
print(f" - Dynamic quantization is simplest (one line of code) but only compresses Linear layers")
print(f" - Static quantization is more comprehensive, with better model size and speed improvements")
print(f" - Both have minimal accuracy loss — INT8 is the safest starting point for quantization")
print(f" - CPU inference acceleration is the primary beneficiary of quantization")
VI. Hands-on Lab: LLM 4-bit Quantized Inference (Language Models)
CV model quantization demonstrated the fundamental principles. Now for the main event — loading and running large language models with 4-bit quantization on free Google Colab. We will use HuggingFace's bitsandbytes integration[13] and the GPTQ format.
Open Google Colab (select T4 GPU), create a new Notebook, and paste the following code blocks in sequence:
6.1 Method 1: bitsandbytes 4-bit (Simplest)
!pip install transformers accelerate bitsandbytes -q
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import time
# One-line configuration to enable 4-bit quantization
bnb_config = BitsAndBytesConfig(
load_in_4bit=True, # Enable 4-bit quantization
bnb_4bit_quant_type="nf4", # NF4 format (recommended by QLoRA)
bnb_4bit_compute_dtype=torch.float16, # Use FP16 for computation
bnb_4bit_use_double_quant=True, # Double quantization (further memory savings)
)
model_name = "microsoft/phi-2" # 2.7B parameters, runs on free Colab T4
print("Loading 4-bit quantized model...")
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="auto",
trust_remote_code=True,
)
# Memory statistics
mem_allocated = torch.cuda.memory_allocated() / 1024**3
print(f"4-bit model loaded successfully")
print(f" GPU memory usage: {mem_allocated:.2f} GB")
print(f" (FP16 requires ~{2.7*2:.1f} GB, 4-bit requires only ~{2.7*0.5:.1f} GB)")
# Generation test
prompts = [
"The key advantage of model quantization is",
"In machine learning, reducing model size while maintaining accuracy",
"Knowledge distillation and quantization are complementary because",
]
model_4bit.eval()
for prompt in prompts:
inputs = tokenizer(prompt, return_tensors="pt").to(model_4bit.device)
with torch.no_grad():
outputs = model_4bit.generate(
**inputs, max_new_tokens=60,
do_sample=True, temperature=0.7, top_p=0.9,
)
text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"\n Prompt: {prompt}")
print(f" Output: {text[:200]}...")
6.2 Method 2: Loading a Pre-quantized GPTQ Model
!pip install auto-gptq optimum -q
from transformers import AutoModelForCausalLM, AutoTokenizer
# HuggingFace hosts numerous community pre-quantized GPTQ models
# For example, GPTQ quantized versions provided by TheBloke
model_name = "TheBloke/Llama-2-7B-Chat-GPTQ"
print("Loading GPTQ 4-bit model...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model_gptq = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype=torch.float16,
)
mem = torch.cuda.memory_allocated() / 1024**3
print(f"GPTQ model loaded successfully")
print(f" GPU memory: {mem:.2f} GB (original FP16 requires ~14 GB)")
# Conversation test
messages = "What is model quantization and why is it important for AI deployment?"
inputs = tokenizer(messages, return_tensors="pt").to(model_gptq.device)
with torch.no_grad():
outputs = model_gptq.generate(**inputs, max_new_tokens=100, temperature=0.7)
print(f"\n Q: {messages}")
print(f" A: {tokenizer.decode(outputs[0], skip_special_tokens=True)[:300]}...")
6.3 Method 3: Quantize a Model Yourself (AWQ)
!pip install autoawq -q
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
# Choose the model to quantize
model_path = "facebook/opt-1.3b" # 1.3B, quantizable on free Colab
print("Starting AWQ quantization...")
print(" Loading original model...")
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
# AWQ quantization configuration
quant_config = {
"zero_point": True,      # Asymmetric quantization (uses a zero point)
"q_group_size": 128, # Quantization group size
"w_bit": 4, # 4-bit quantization
"version": "GEMM", # GPU-accelerated version
}
# Execute quantization (a few minutes)
model.quantize(tokenizer, quant_config=quant_config)
# Save quantized model
save_path = "./opt-1.3b-awq-4bit"
model.save_quantized(save_path)
tokenizer.save_pretrained(save_path)
print(f"AWQ quantization complete, saved to {save_path}")
# Compare file sizes
import os
quant_size = sum(os.path.getsize(os.path.join(save_path, f))
                 for f in os.listdir(save_path) if f.endswith('.safetensors'))
print(f"  Quantized model size: {quant_size / 1024**2:.0f} MB")
print(f"  (Original FP16 ~{1.3*2*1024:.0f} MB)")
6.4 Advanced: Running Quantized Models on CPU with llama.cpp
If your goal is to run LLMs on a computer without a GPU, the GGUF format from llama.cpp[14] is the most practical solution:
# Run in local terminal (not Colab):
# 1. Install llama.cpp (recent versions build with CMake; the old Makefile path is deprecated)
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp && cmake -B build && cmake --build build --config Release -j
# 2. Download community-quantized GGUF model (plenty available on HuggingFace)
# Q4_K_M is the best balance between quality and size
huggingface-cli download TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf --local-dir ./models
# 3. Run inference directly on CPU!
./build/bin/llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
    -p "Explain model quantization in simple terms:" \
    -n 200 -t 8
# GGUF quantization level comparison:
# Q2_K: 2-bit (smallest, lower quality)
# Q4_K_M: 4-bit (recommended, best quality/size balance)
# Q5_K_M: 5-bit (quality close to FP16)
# Q8_0: 8-bit (nearly lossless, but larger)
VII. Diffusion Model Quantization: Fitting the 12B FLUX on a 16GB GPU
Quantization is equally significant for diffusion models. FLUX.1-dev has 12B parameters — about 24GB of weights in BF16, leaving no headroom on a 24GB RTX 4090 once activations and the text encoders are counted. Quantization is the key technology enabling these models to run on consumer hardware.
7.1 Q-Diffusion: Pioneer of Diffusion Model-Specific Quantization
Quantizing diffusion models is trickier than LLMs because the same model operates across different denoising timesteps — each timestep has a different activation value distribution. Li et al.'s Q-Diffusion[15], published at ICCV 2023, first addressed this problem: using timestep-aware calibration (rather than a single global calibration) to collect quantization statistics, with special handling for shortcut connections, achieving 4-bit weight quantization with virtually no FID degradation.
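The core issue can be shown with a toy denoiser whose activation scale drifts with the timestep (a hypothetical stand-in; Q-Diffusion collects real statistics from the UNet at sampled timesteps):

```python
import torch

def toy_denoiser(x, t):
    # Activation magnitude grows with t, mimicking timestep-dependent ranges
    return x * (1.0 + 5.0 * t / 1000)

timesteps = [0, 250, 500, 750, 999]
calib = torch.randn(512, 16)

# Timestep-aware calibration: record one activation range per timestep
ranges = {t: toy_denoiser(calib, t).abs().max().item() for t in timesteps}
global_range = max(ranges.values())      # what a single global calibration would use

def int8_error(x, r):
    step = r / 127
    return (x - (x / step).round().clamp(-127, 127) * step).abs().mean()

x0 = toy_denoiser(calib, 0)              # early-step activations are small
err_global = int8_error(x0, global_range)
err_aware = int8_error(x0, ranges[0])
print(f"t=0 INT8 error  global scale: {err_global:.4f}  per-step scale: {err_aware:.4f}")
```

A single global range wastes most of the INT8 grid on the large late-step activations, so early steps quantize far more coarsely than they need to.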
7.2 SVDQuant: Low-Rank Branches Absorb Outliers
MIT's Song Han team's SVDQuant[16] (ICLR 2025 Spotlight) further pushed diffusion model quantization to W4A4 (both weights and activations at 4-bit). The core innovation: first using SVD to split the weights into a low-rank branch (absorbing outliers, kept at high precision) and a residual branch (4-bit quantized). Combined with their custom Nunchaku inference engine, FLUX.1-dev's 12B model runs smoothly even on a 16GB laptop RTX 4090:
- 3.5x memory reduction
- 3.0x speed improvement
- Virtually lossless visual quality
# Run 4-bit FLUX.1 with Nunchaku (SVDQuant's inference engine)
# Note: this follows the nunchaku project README; the API may change between versions
!pip install nunchaku diffusers transformers -q
import torch
from diffusers import FluxPipeline
from nunchaku import NunchakuFluxTransformer2dModel

# Load the SVDQuant 4-bit transformer, then build the pipeline around it
transformer = NunchakuFluxTransformer2dModel.from_pretrained(
    "mit-han-lab/svdq-int4-flux.1-dev"
)
pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",
    transformer=transformer,
    torch_dtype=torch.bfloat16,
).to("cuda")
image = pipe(
prompt="A photorealistic mountain landscape at golden hour",
num_inference_steps=28,
guidance_scale=3.5,
).images[0]
image.save("flux_svdquant_result.png")
print("4-bit FLUX.1 generation complete")
7.3 GGUF Quantization: Running Stable Diffusion on CPU
Just as llama.cpp enables LLMs to run on CPU, stable-diffusion.cpp enables diffusion models to generate images in pure CPU environments. The community has already provided GGUF quantized versions for SD 1.x/2.x, SDXL, SD3.5, FLUX, and other models. Combined with the ComfyUI-GGUF plugin, even non-engineers can use quantized diffusion models on their local machines.
| Method | Venue | Applicable Models | Bit-width | Results |
|---|---|---|---|---|
| Q-Diffusion | ICCV 2023 | SD family | W4 | First diffusion model PTQ |
| SVDQuant + Nunchaku | ICLR 2025 | FLUX / SD3 | W4A4 | 3.5x memory reduction, 3x speedup |
| GGUF (sd.cpp) | Community | SD / SDXL / FLUX | Q4-Q8 | CPU inference, ComfyUI integration |
| TensorRT FP8 | NVIDIA | SD / FLUX | FP8 | 2-2.3x speedup, 40% VRAM reduction |
VIII. Ecosystem Tools Overview
The quantization tool ecosystem is the most mature among the three major compression techniques, with solutions ranging from single-line code to enterprise-grade deployment:
HuggingFace Native Integration
- bitsandbytes[13] (GitHub): `load_in_4bit=True` activates it with one line. Supports INT8 / NF4, the foundation of QLoRA. Officially recommended by HuggingFace
- HuggingFace Quantization Guide[17] (Docs): Unified interface supporting bitsandbytes, GPTQ, AWQ, Quanto, and other backends
LLM Quantization Formats
- GPTQ — GPTQModel (GitHub): Modern implementation of the GPTQ format (successor to AutoGPTQ), supporting CUDA / ROCm / XPU / CPU, integrated with vLLM and SGLang
- AWQ — AutoAWQ (GitHub): AWQ format quantization tool, 2x inference speedup
- GGUF — llama.cpp[14] (GitHub): Pure C/C++ LLM inference, GGUF format supporting 1.5-bit to 8-bit, 70k+ stars
Enterprise Platforms
- NVIDIA TensorRT-LLM (GitHub): FP8/FP4/INT4-AWQ/INT8-SmoothQuant, KV cache quantization, Hopper + Blackwell GPU support
- Intel Neural Compressor (GitHub): Unified quantization + pruning + distillation pipeline, includes AutoRound algorithm
- TorchAO[18] (GitHub): PyTorch's official quantization / sparsity / optimization library, integrating SpinQuant, INT4/INT8, FP8
Diffusion Model-Specific
- Nunchaku (GitHub): SVDQuant's inference engine, running 4-bit FLUX on consumer GPUs
- stable-diffusion.cpp (GitHub): GGUF format diffusion model inference, supporting SD / SDXL / FLUX / Wan2.x
- ComfyUI-GGUF (GitHub): GGUF quantization plugin for ComfyUI, enabling non-engineers to use quantized models
IX. From Technical Metrics to Business Impact
The impact of quantization on enterprise AI deployment is direct and quantifiable (pun intended):
- 75% reduction in GPU costs: INT4 quantization shrinks LLaMA-70B from ~140GB of weights (roughly 4 A100 80GB cards, monthly rental ~$10,000) to ~35GB, which fits on a single 48GB workstation GPU or two consumer RTX 4090s (~$1,600 each). For inference-intensive applications, this represents an order-of-magnitude cost difference
- Halved latency: Memory bandwidth is the bottleneck for LLM inference. Quantization reduces the amount of data that needs to be read from memory, directly translating to inference speedup
- LLMs can run on CPU: GGUF format enables 7B models to run at acceptable speeds on laptops without GPUs. This makes AI deployable on virtually any device
- Dramatically reduced fine-tuning costs: QLoRA makes single-GPU fine-tuning of 65B models possible, lowering the hardware barrier for enterprise-customized LLMs from "requires an AI cluster" to "one graphics card"
- Democratization of image generation: SVDQuant enables FLUX.1 to run on RTX 4090, and stable-diffusion.cpp enables SD on CPU. Professional-grade image generation no longer requires enterprise hardware
- Sustainable AI: Lower precision = less computation = lower energy consumption. Harvard Business Review[1] notes that model optimization is the most direct means of controlling AI's carbon footprint
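The latency claim follows from simple arithmetic: single-stream decoding must stream every weight from memory once per generated token, so bandwidth divided by model size gives an upper bound on tokens per second. A back-of-envelope sketch (ignoring KV cache and compute; the ~2 TB/s figure approximates an A100 80GB):

```python
bandwidth_gb_s = 2000            # approximate A100 80GB HBM bandwidth
params_billion = 70              # LLaMA-70B

for name, bytes_per_param in [("FP16", 2.0), ("INT8", 1.0), ("INT4", 0.5)]:
    weights_gb = params_billion * bytes_per_param
    tok_s = bandwidth_gb_s / weights_gb   # every weight read once per token
    print(f"{name}: {weights_gb:5.0f} GB of weights -> <= {tok_s:4.1f} tokens/s")
```

Quartering the bytes per parameter quarters the traffic, which is why INT4 speedups appear even without faster arithmetic units.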
X. Implementation Path: A Three-Phase Deployment Strategy
- Immediate wins — use existing quantized models: HuggingFace already hosts thousands of pre-quantized models (TheBloke's GPTQ/GGUF versions, community AWQ versions). Download and use directly — no quantization expertise required. We recommend starting with GGUF Q4_K_M format — the best balance between quality and size
- Incremental validation — quantize your own models with bitsandbytes: Add `BitsAndBytesConfig(load_in_4bit=True)` to your existing HuggingFace inference code and observe accuracy changes. If fine-tuning is needed, apply QLoRA directly to the quantized model
- Production optimization — choose the deployment format: For GPU servers, choose AWQ + vLLM (fastest inference speed); for CPU / edge deployment, choose GGUF + llama.cpp; for NVIDIA GPU clusters, choose TensorRT-LLM (FP8/FP4). For image generation, choose SVDQuant (GPU) or ComfyUI-GGUF (universal)
Quantization is the most "plug-and-play" component of the model compression trilogy (pruning, distillation, quantization). It doesn't require modifying the model architecture (as pruning does), doesn't require retraining (as distillation does) — it only requires reducing numerical precision, and this simple operation alone can reduce the hardware barrier for AI deployment by an order of magnitude.
If your team is evaluating model compression strategies or needs to find the optimal balance between cost, latency, and accuracy, we welcome you to engage in an in-depth technical conversation with us. The research team at Meta Intelligence can accompany you through the complete journey from model diagnostics to production deployment.



