Deploy AI21’s Jamba models using vLLM in your own environment
This guide walks you through self-deploying AI21’s Jamba models in your own infrastructure. Choose the deployment method that best fits your needs.
We recommend using vLLM versions v0.6.5 to v0.8.5.post1 for optimal performance and compatibility.
Create a Python virtual environment and install the vLLM package (version ≥0.6.5, ≤0.8.5.post1) to ensure maximum compatibility with all Jamba models.
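A minimal sketch of this step, assuming a Unix-like shell; the environment name jamba-vllm is arbitrary:

```bash
# Create and activate an isolated Python environment
python -m venv jamba-vllm
source jamba-vllm/bin/activate

# Install vLLM within the recommended version range
pip install "vllm>=0.6.5,<=0.8.5.post1"
```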
Authenticate on the HuggingFace Hub using your access token $HF_TOKEN:
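For example, using the Hugging Face CLI (it ships with the huggingface_hub package that vLLM pulls in; otherwise install it with pip first):

```bash
# Log in to the HuggingFace Hub with your access token
huggingface-cli login --token $HF_TOKEN
```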
Launch the vLLM server for API-based inference. The steps below are identical for every Jamba variant; only the HuggingFace model identifier changes.
Start the vLLM server:
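A minimal sketch, using Jamba Mini 1.6 as an example model identifier and vLLM's default port 8000:

```bash
# Serve a Jamba model through vLLM's OpenAI-compatible API server
vllm serve ai21labs/AI21-Jamba-Mini-1.6 \
    --host 0.0.0.0 \
    --port 8000
```

Larger Jamba variants generally need more than one GPU; add --tensor-parallel-size to match your hardware.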
Test the API:
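For instance, with curl against the OpenAI-compatible chat completions endpoint (the model field must match the identifier you served):

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "ai21labs/AI21-Jamba-Mini-1.6",
    "messages": [{"role": "user", "content": "What are the benefits of hybrid SSM-Transformer models?"}],
    "max_tokens": 100
  }'
```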
In offline mode, vLLM loads the model directly in your Python process to run one-off, standalone batch inference without an API server.
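A minimal Python sketch of offline batch inference; the model identifier and prompts are illustrative:

```python
# Offline (batch) inference with vLLM -- no API server involved
from vllm import LLM, SamplingParams

# Example Jamba identifier; substitute the variant you want to run
llm = LLM(model="ai21labs/AI21-Jamba-Mini-1.6")
sampling_params = SamplingParams(temperature=0.7, max_tokens=100)

prompts = [
    "Summarize the advantages of long-context language models.",
    "Write a short product description for a note-taking app.",
]

# Generate completions for the whole batch in one call
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.outputs[0].text)
```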
For containerized deployment, use vLLM’s official Docker image to run an inference server (refer to the vLLM Docker documentation for comprehensive details).
Pull the Docker image
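For example:

```bash
# Pull the official vLLM OpenAI-compatible server image
docker pull vllm/vllm-openai:latest
```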
Run the container
Launch vLLM in server mode with your chosen model:
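A sketch of the run command, assuming an NVIDIA GPU host and Jamba Mini 1.6 as the example model; mounting the HuggingFace cache avoids re-downloading weights on restart:

```bash
docker run --runtime nvidia --gpus all \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HUGGING_FACE_HUB_TOKEN=$HF_TOKEN" \
    -p 8000:8000 \
    --ipc=host \
    vllm/vllm-openai:latest \
    --model ai21labs/AI21-Jamba-Mini-1.6
```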
Once the container is up and in a healthy state, you can test inference using the same code samples as in the "Online Inference (Server Mode)" section.
If you prefer to use your own storage for model weights, you can download them from your self-hosted storage (e.g., AWS S3, Google Cloud Storage), mount the local path into the container using -v /path/to/model:/mnt/model/, and pass --model="/mnt/model/" instead of the HuggingFace model identifier.
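A sketch of the same run command with locally stored weights mounted into the container:

```bash
docker run --runtime nvidia --gpus all \
    -p 8000:8000 \
    --ipc=host \
    -v /path/to/model:/mnt/model/ \
    vllm/vllm-openai:latest \
    --model="/mnt/model/"
```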
Deploy on AWS, Google Cloud, or Azure for production workloads
Optimize performance and resolve common deployment issues
Learn about the complete API interface and parameters