Quantization reduces model memory usage by representing weights with lower precision. Learn how to use quantization techniques with Jamba models for efficient inference and training.
Jamba models support several quantization techniques: pre-quantized FP8 checkpoints (served with vLLM on Hopper GPUs), ExpertsInt8 (served with vLLM), and 8-bit quantization with bitsandbytes (via Transformers).
Pre-quantized FP8 Jamba checkpoints significantly reduce storage requirements and memory footprint without compromising output quality.
FP8 quantization requires Hopper architecture GPUs such as NVIDIA H100 and NVIDIA H200.
Load Pre-quantized FP8 Model
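A minimal vLLM sketch for this step, assuming a pre-quantized FP8 Jamba checkpoint; the repository id and parallelism settings below are illustrative placeholders, not confirmed names:

```python
from vllm import LLM

# Hypothetical repo id for a pre-quantized FP8 Jamba checkpoint -- replace with
# the checkpoint you actually use. No quantization argument is needed: vLLM
# detects the FP8 quantization from the checkpoint's own config.
llm = LLM(
    model="ai21labs/AI21-Jamba-Large-1.6-FP8",
    max_model_len=100_000,
    tensor_parallel_size=8,  # adjust to the number of GPUs available
)
```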
Generate Text
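Generation then follows the standard vLLM flow; the prompt and sampling settings here are only examples:

```python
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=200)

# llm is the FP8 model loaded above.
outputs = llm.generate(
    ["Summarize the benefits of hybrid SSM-Transformer architectures."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```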
Pre-quantized FP8 models require no additional quantization parameters since the weights are already quantized.
ExpertsInt8 is an efficient quantization technique developed specifically for Mixture of Experts (MoE) models deployed in vLLM, including Jamba models. It quantizes the MoE expert weights, which make up the bulk of the model's parameters, to INT8 at load time, so no separate calibration step or pre-quantized checkpoint is required.
Load Model with ExpertsInt8
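A sketch of loading Jamba Mini in vLLM with ExpertsInt8; the repository id and context length are assumptions to adapt to your setup:

```python
from vllm import LLM

# Hypothetical repo id -- replace with the Jamba checkpoint you are serving.
# quantization="experts_int8" quantizes the expert weights when the model loads.
llm = LLM(
    model="ai21labs/AI21-Jamba-Mini-1.6",
    quantization="experts_int8",
    max_model_len=100_000,
)
```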
Generate Text
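Generation is unchanged from regular vLLM usage; this example assumes the ExpertsInt8 model created above:

```python
from vllm import SamplingParams

sampling_params = SamplingParams(temperature=0.4, max_tokens=100)

outputs = llm.generate(["Write a haiku about long context windows."], sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```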
With ExpertsInt8 quantization, you can fit prompts up to 100K tokens on a single 80GB A100 GPU with Jamba Mini.
With 8-bit quantization using BitsAndBytesConfig, you can fit sequences of up to 140K tokens on a single 80GB GPU.
Configure 8-bit Quantization
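A minimal BitsAndBytesConfig sketch for 8-bit loading; the skipped module name follows the recommendation at the end of this section:

```python
from transformers import BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_8bit=True,
    # Keep the Mamba blocks in full precision to preserve model quality
    # (see the note at the end of this section).
    llm_int8_skip_modules=["mamba"],
)
```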
Load Model with Quantization
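Loading then goes through the usual Transformers API; the repository id below is an assumption, substitute your own checkpoint:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ai21labs/AI21-Jamba-Mini-1.6"  # hypothetical repo id -- use your checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,  # the BitsAndBytesConfig defined above
)
```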
Run Inference
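Inference then works as with any Transformers causal language model; the prompt and generation settings are illustrative:

```python
prompt = "Long-context language models are useful because"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=128, do_sample=True, temperature=0.7)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```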
To maintain model quality, we recommend excluding Mamba blocks from quantization using llm_int8_skip_modules=["mamba"].